Stage 4: build bundle

uv run python -m examples.medlit.scripts.build_bundle \
    --merged-dir merged --bundles-dir extracted \
    --output-dir bundle --pmc-xmls-dir pmc_xmls

Stages 1 and 2 are embarrassingly parallel at the paper level and can be run with multiple workers. Stage 3 parallelizes authority lookups (MeSH, UniProt, HGNC) within a single run using `asyncio.gather` with a semaphore, so HTTP calls to authority APIs happen concurrently rather than sequentially. Stage 4 is fast and single-threaded -- its main network cost is the batched NCBI `esummary` call for cited-paper titles, which takes only a handful of requests regardless of corpus size.
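
The bounded-concurrency pattern described for Stage 3 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `lookup_authority`, `lookup_all`, and the `stats` counter are hypothetical names, and the HTTP call is simulated with `asyncio.sleep` so the example runs offline.

```python
import asyncio

# Tracks how many simulated lookups are running at once (illustration only).
stats = {"in_flight": 0, "peak": 0}

async def lookup_authority(term: str) -> dict:
    """Hypothetical stand-in for one HTTP call to an authority API."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    stats["in_flight"] -= 1
    return {"term": term, "id": f"FAKE:{term.upper()}"}

async def lookup_all(terms: list, max_concurrent: int = 5) -> list:
    """Look up every term concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(term: str) -> dict:
        async with semaphore:  # blocks here once the limit is reached
            return await lookup_authority(term)

    # gather returns results in the same order as the input terms
    return await asyncio.gather(*(bounded(t) for t in terms))

terms = ["cortisol", "acth", "tp53", "brca1", "insulin"]
results = asyncio.run(lookup_all(terms, max_concurrent=2))
```

The semaphore caps the number of simultaneous requests, which keeps a large batch from overwhelming an authority API, while `gather` still lets every lookup be scheduled up front.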

### MCP Tool

The MCP tool is a convenience for a user who wants to pull in one or a few papers during a query session and have them available immediately. It is not intended for bulk operations.

The tool calls the same underlying pipeline stages as the batch CLI -- fetch, extract, ingest, build bundle -- but runs them synchronously in a background thread so the MCP server stays responsive. If the paper has already been ingested, it returns immediately with a status message. For large lists of papers, use the batch CLI instead; the MCP tool is optimized for single-paper interactive use.

```python
@mcp.tool()
async def ingest_paper(pmcid: str) -> dict:
    """
    Fetch, extract, and ingest a single PMC paper into the knowledge graph.
    Runs the full pipeline (fetch → extract → ingest → build_bundle) and
    makes the paper available for querying on return.
    """
    # Check if already ingested
    existing = check_ingest_status(pmcid)
    if existing == "done":
        return {"status": "already_ingested", "pmcid": pmcid}

    # Run all pipeline stages via the same function used by the batch worker.
    # Passing None selects the event loop's default thread pool executor.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(
        None, _run_pass2_pass3_load, workspace, bundles_dir, merged_dir, output_dir
    )
    return {"status": "ingested", "pmcid": pmcid}
```

The `_run_pass2_pass3_load` function is the shared implementation used by both the MCP tool and the background ingest worker. It runs `ingest`, `build_bundle`, and `load_bundle_incremental` in sequence, so that after it returns the paper is live in the graph without a server restart.

### Extraction Output Format

The artifact file captures what the LLM decided, before the identity server assigns any entity IDs. Mentions and evidence strings include their location in the source document so that any claim can be verified against the original text.

`raw_text` is stored as the original PMC XML rather than stripped plain text. PMC XML has explicit section labels (`<sec>`, `<title>`, `<p>`) that make section and paragraph extraction reliable, and preserving the structure means location references remain valid if the artifact is re-ingested later.
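
As a rough sketch of why the preserved structure helps, a location reference can be resolved back to its paragraph with the standard library XML parser. The function name `paragraph_at`, the sample XML, and the choice of 1-based paragraph numbering are assumptions made for this illustration, not details confirmed by the pipeline.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Tiny hand-written sample in the shape of PMC XML (illustration only).
SAMPLE_XML = """<article><body>
<sec><title>Results</title>
<p>Forty patients were enrolled.</p>
<p>Serum cortisol levels were <italic>elevated</italic> in all patients.</p>
</sec>
</body></article>"""

def paragraph_at(raw_xml: str, section: str, paragraph: int) -> Optional[str]:
    """Return the text of the given paragraph (1-based, assumed) in a titled section."""
    root = ET.fromstring(raw_xml)
    for sec in root.iter("sec"):
        title = sec.find("title")
        if title is None:
            continue
        if "".join(title.itertext()).strip() != section:
            continue
        paras = sec.findall("p")  # direct child paragraphs only
        if 1 <= paragraph <= len(paras):
            # itertext() flattens inline markup such as <italic>
            return "".join(paras[paragraph - 1].itertext()).strip()
    return None

text = paragraph_at(SAMPLE_XML, "Results", 2)
```

With stripped plain text, the same lookup would depend on heuristics for where sections and paragraphs begin.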

Most paper metadata and cited references are available as structured fields in the PMC XML and are parsed directly by the fetch stage rather than extracted by the LLM. This makes them reliable and cheap, with no prompt engineering required. Each cited PMC ID is also a candidate for further ingestion, making the reference list a natural source for corpus expansion.
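
One way to turn a reference list into an expansion queue is sketched here. It assumes only the `references` array with an optional `pmcid` field, as in the artifact example in this section; the function name `expansion_candidates` and the `already_ingested` set are hypothetical.

```python
def expansion_candidates(artifact: dict, already_ingested: set) -> list:
    """Return cited PMC IDs not yet ingested, in citation order, without duplicates."""
    seen = set(already_ingested)
    candidates = []
    for ref in artifact.get("references", []):
        pmcid = ref.get("pmcid")  # references without a PMC ID are skipped
        if pmcid and pmcid not in seen:
            seen.add(pmcid)
            candidates.append(pmcid)
    return candidates

# Illustrative artifact fragment; IDs are placeholders.
artifact = {
    "pmcid": "PMC12345",
    "references": [
        {"pmcid": "PMC98765", "year": "2023"},
        {"doi": "10.1210/clinem/dgad001"},  # no PMC ID available
        {"pmcid": "PMC98765"},              # cited twice
        {"pmcid": "PMC11111"},
    ],
}
frontier = expansion_candidates(artifact, already_ingested={"PMC11111"})
```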

{
  "pmcid": "PMC12345",
  "extracted_at": "2026-03-17T14:23:00Z",
  "model": "claude-sonnet-xxx",
  "metadata": {
    "title": "Serum cortisol as a diagnostic marker for hypercortisolism",
    "authors": [
      { "name": "Jane A. Smith",  "institution": "Massachusetts General Hospital" },
      { "name": "Robert T. Chen", "institution": "Harvard Medical School" }
    ],
    "publication_date": "2024-09-15",
    "journal": {
      "name": "Journal of Clinical Endocrinology & Metabolism",
      "issn": "0021-972X",
      "volume": "109",
      "issue": "4",
      "pages": "1123-1131"
    }
  },
  "references": [
    {
      "pmcid": "PMC98765",
      "doi": "10.1210/clinem/dgad001",
      "authors": ["Johnson B", "Lee K"],
      "title": "Urinary free cortisol in Cushing syndrome diagnosis",
      "journal": "Endocrine Reviews",
      "year": "2023"
    }
  ],
  "entities": [
    {
      "mention": "cortisol",
      "type": "Biomarker",
      "locations": [
        { "section": "Abstract", "paragraph": 1, "sentence": 2 },
        { "section": "Results",  "paragraph": 2, "sentence": 1 },
        { "section": "Results",  "paragraph": 4, "sentence": 3 }
      ],
      "attributes": {
        "value": "elevated",
        "specimen": "serum"
      }
    }
  ],
  "relationships": [
    {
      "subject_mention": "cortisol",
      "predicate": "indicates",
      "object_mention": "hypercortisolism",
      "evidence": "Serum cortisol levels were elevated in all patients diagnosed with hypercortisolism.",
      "evidence_locations": [
        { "section": "Results", "paragraph": 2, "sentence": 1 }
      ]
    }
  ]
}

Key properties:

  • Mentions, not IDs. The artifact records what the LLM said. Entity IDs are assigned by `identity_server.resolve()` during ingest and are not present here.
  • All locations for each entity. The same entity may be mentioned many times across a paper. Recording all locations supports `usage_count` computation and provides the full evidentiary basis for the entity.
  • Evidence strings with location. The verbatim text supporting each relationship, with its position in the document. `evidence_locations` is a list to handle cases where the supporting evidence spans multiple sentences or where the antecedent of a pronoun appears in a prior sentence.
  • Model and timestamp. Records exactly what produced this output, which is essential when comparing extractions before and after a prompt or model change.
  • No graph state. Nothing about merges, status, or canonical IDs. That is all rebuilt fresh by the identity server on each ingest.
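
To illustrate how recorded locations can feed a usage count, here is a small sketch over the artifact structure shown above. It assumes that `usage_count` is simply the number of recorded locations per mention, which this section does not confirm; `usage_counts` and `mentions_by_section` are hypothetical helper names.

```python
from collections import Counter

def usage_counts(artifact: dict) -> Counter:
    """Count recorded locations per mention (assumed definition of usage_count)."""
    counts = Counter()
    for entity in artifact.get("entities", []):
        counts[entity["mention"]] += len(entity.get("locations", []))
    return counts

def mentions_by_section(entity: dict) -> Counter:
    """Tally how often one entity is mentioned in each section."""
    return Counter(loc["section"] for loc in entity.get("locations", []))

# Fragment of the artifact example from this section.
artifact = {
    "entities": [
        {
            "mention": "cortisol",
            "type": "Biomarker",
            "locations": [
                {"section": "Abstract", "paragraph": 1, "sentence": 2},
                {"section": "Results", "paragraph": 2, "sentence": 1},
                {"section": "Results", "paragraph": 4, "sentence": 3},
            ],
        }
    ]
}
counts = usage_counts(artifact)
sections = mentions_by_section(artifact["entities"][0])
```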

### Shared Pipeline Code

The four pipeline stages share their core logic across both the batch CLI and the MCP/server path. The batch CLI calls stage scripts directly; the MCP tool and background ingest worker call `_run_pass2_pass3_load`, which sequences the `ingest`, `build_bundle`, and `load_bundle_incremental` steps using the same underlying functions:

def _run_pass2_pass3_load(
    workspace_root: Path,
    bundles_dir: Path,
    merged_dir: Path,
    output_dir: Path,
) -> None:
    """Run ingest, build_bundle, and load_bundle_incremental. Raises on failure."""
    run_ingest(bundle_dir=bundles_dir, output_dir=merged_dir, ...)
    run_build_bundle(merged_dir, bundles_dir, output_dir)
    load_storage.load_bundle_incremental(manifest, str(output_dir))

`run_ingest` is the identity-server deduplication stage. `run_build_bundle` assembles the kgbundle, including NCBI title fetching. `load_bundle_incremental` pushes the new bundle into the live graph storage without a restart.

The vocabulary and extraction stages are not in this shared path -- they are CLI-only for batch runs, since interactive single-paper ingestion via the MCP tool skips the vocabulary pass (the vocabulary built from the existing corpus is already embedded in the seeded synonym cache). For the MCP use case, the paper is extracted with the current vocabulary as context, then ingested, and the bundle is rebuilt and reloaded.

### Extraction Prompt Template

The extraction prompt is a Jinja2 template with three list-style injection points, a free-form `domain_instructions` block, and a small number of structural rules that apply regardless of domain. What follows is an abstracted version that shows the structure; the full medlit domain instructions serve as a worked example.

You are a knowledge extraction expert. Extract entities
and relationships from the given text and return a single
JSON object with this structure (use exact keys):

- "entities": array of {
    "id"        (string, unique within this response),
    "class"     (entity type from the list below),
    "name"      (canonical surface form),
    "synonyms"  (array of alternate names)
  }

- "evidence_entities": array of {
    "id"     (format: paper_id:section:para_idx:method),
    "class": "Evidence",
    "text"   (verbatim passage from the source)
  }

- "relationships": array of {
    "subject"          (id from entities array),
    "predicate"        (from predicate list below),
    "object"           (id from entities array),
    "evidence_ids"     (array of evidence entity ids),
    "confidence"       (0.0–1.0),
    "linguistic_trust" ("asserted"|"suggested"|"speculative")
  }

CRITICAL: "subject" and "object" must be the "id" of an
entry in the "entities" array. If an entity appears in a
relationship but is not yet in the entities array, add it
first. Never use a free-form name as subject or object.

Return ONLY valid JSON. No markdown, no commentary.

{{ domain_instructions }}

Entity types: {{ entity_types }}
Predicates:   {{ predicates }}
{{ vocab_section }}
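
The structural rules in the template lend themselves to a mechanical check on the model's response before ingest. The sketch below is one possible validator, not the pipeline's own: the function name `validate_extraction` is hypothetical, and requiring every `evidence_ids` entry to match an `evidence_entities` id is this example's reading of the template.

```python
ALLOWED_TRUST = {"asserted", "suggested", "speculative"}

def validate_extraction(response: dict) -> list:
    """Return a list of human-readable rule violations (empty if none)."""
    errors = []
    entity_ids = {e["id"] for e in response.get("entities", [])}
    evidence_ids = {e["id"] for e in response.get("evidence_entities", [])}
    for i, rel in enumerate(response.get("relationships", [])):
        # subject/object must be ids from the entities array, never free-form names
        for role in ("subject", "object"):
            if rel.get(role) not in entity_ids:
                errors.append(f"relationships[{i}].{role}: unknown entity id {rel.get(role)!r}")
        for ev in rel.get("evidence_ids", []):
            if ev not in evidence_ids:
                errors.append(f"relationships[{i}].evidence_ids: unknown evidence id {ev!r}")
        conf = rel.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            errors.append(f"relationships[{i}].confidence: expected 0.0-1.0, got {conf!r}")
        if rel.get("linguistic_trust") not in ALLOWED_TRUST:
            errors.append(f"relationships[{i}].linguistic_trust: got {rel.get('linguistic_trust')!r}")
    return errors

good = {
    "entities": [
        {"id": "e1", "class": "Hormone", "name": "ACTH", "synonyms": []},
        {"id": "e2", "class": "Symptom", "name": "hyperplasia", "synonyms": []},
    ],
    "evidence_entities": [
        {"id": "PMC12345:Results:2:llm", "class": "Evidence",
         "text": "ACTH determines hyperplasia of the adrenal cortex"},
    ],
    "relationships": [
        {"subject": "e1", "predicate": "CAUSES", "object": "e2",
         "evidence_ids": ["PMC12345:Results:2:llm"],
         "confidence": 0.9, "linguistic_trust": "asserted"},
    ],
}
# Same response, but with a free-form name used as the object.
bad = dict(good, relationships=[dict(good["relationships"][0], object="hyperplasia")])

good_errors = validate_extraction(good)
bad_errors = validate_extraction(bad)
```

Collecting all violations rather than raising on the first one makes it easier to log a complete picture when comparing prompt or model versions.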

The three injection points:

  • {{ entity_types }} -- rendered from the domain spec; one line per type with label, description, and any classification guidance. In medlit this is the full list (Disease, Gene, Drug, Protein, ...) with concise definitions and edge-case rules (e.g. "if both Hormone and Protein, classify as Hormone").

  • {{ predicates }} -- rendered from the domain spec; one line per predicate with description and domain/range guidance. Listing domain and range steers the model toward specific predicates rather than generic fallbacks like ASSOCIATED_WITH.

  • {{ vocab_section }} -- injected when a vocabulary pass has been run; lists preferred names for entities seen across the current batch. When present, the model uses consistent surface forms, reducing deduplication noise downstream. When absent (single-paper MCP use), the section is empty and the model names entities as it sees fit.
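
The rendering of these injection points might look roughly like the sketch below. The shape of the domain spec (dicts with `label`, `name`, `description`, `domain`, `range`), the function name `build_prompt_context`, and the exact line formats are all assumptions for illustration. The resulting dict could be passed as keyword arguments to a Jinja2 template's `render` method.

```python
from typing import Optional

def build_prompt_context(
    domain_instructions: str,
    entity_types: list,
    predicates: list,
    vocab: Optional[list] = None,
) -> dict:
    """Build the template variables; vocab_section is empty when no vocabulary pass ran."""
    entity_lines = "\n".join(
        f"- {t['label']}: {t['description']}" for t in entity_types
    )
    predicate_lines = "\n".join(
        f"- {p['name']} ({p['domain']} -> {p['range']}): {p['description']}"
        for p in predicates
    )
    vocab_section = ""
    if vocab:
        names = "\n".join(f"- {name}" for name in sorted(vocab))
        vocab_section = "Preferred entity names:\n" + names
    return {
        "domain_instructions": domain_instructions,
        "entity_types": entity_lines,
        "predicates": predicate_lines,
        "vocab_section": vocab_section,
    }

# Illustrative spec entries; descriptions and domain/range values are placeholders.
types = [{"label": "Hormone", "description": "Signalling molecule."}]
preds = [{"name": "CAUSES", "domain": "Hormone", "range": "Symptom",
          "description": "Subject brings about the object."}]

single_paper = build_prompt_context("This domain covers medical literature.", types, preds)
batch = build_prompt_context("This domain covers medical literature.", types, preds,
                             vocab=["hypercortisolism", "cortisol"])
```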

The domain_instructions block is where domain-specific classification rules and output conventions live. In medlit:

This domain covers peer-reviewed medical literature.
Prefer established terminology over colloquial.
When in doubt about entity type, prefer the more specific.
Connect Author and Institution to the graph via
relationships; do not leave them as standalone entities.

#### Entity type classification
Classify at the most specific functional role. If an
entity is both a hormone and a protein, classify as
Hormone. Enzymes should be Enzyme, not Protein.
Extract pathological processes (hyperplasia, hypertrophy,
atrophy, etc.) as Symptom entities.

#### Predicates
Use the predicate list from the config. For SAME_AS,
use "resolution": null and "note" in the output.
When text describes a hormone "causing" a pathological
change "of" an anatomical structure (e.g. "ACTH
determines hyperplasia of the adrenal cortex"), extract:
(1) AGENT CAUSES SYMPTOM
(2) SYMPTOM LOCATED_IN ANATOMICAL_STRUCTURE

#### Linguistic trust
For each relationship, classify linguistic trust:
asserted (direct statement), suggested (soft language),
speculative (hedged).

#### Evidence format
Evidence id format:
  {paper_id}:{section}:{paragraph_idx}:llm
Use ==CURRENT_PAPER== as paper_id when PMC ID is unknown.

The domain instructions block is the place for rules that are domain-specific: what to do about entities that span multiple types, which predicates to prefer for common patterns in the literature, how to handle missing identifiers. Keep it short and declarative. Rules the model must actually follow during extraction belong here; background on why those rules exist does not.
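
The evidence id format given in the medlit instructions, `{paper_id}:{section}:{paragraph_idx}:llm`, is simple enough to build and parse with a pair of helpers. The sketch below is illustrative only: `make_evidence_id`, `parse_evidence_id`, and the `EvidenceId` tuple are hypothetical names, and it assumes the paper id itself contains no colon.

```python
from typing import NamedTuple, Optional

PLACEHOLDER_PAPER_ID = "==CURRENT_PAPER=="  # used when the PMC ID is unknown

class EvidenceId(NamedTuple):
    paper_id: str
    section: str
    paragraph_idx: int
    method: str

def make_evidence_id(section: str, paragraph_idx: int,
                     paper_id: Optional[str] = None, method: str = "llm") -> str:
    """Format an evidence id, falling back to the placeholder paper id."""
    return f"{paper_id or PLACEHOLDER_PAPER_ID}:{section}:{paragraph_idx}:{method}"

def parse_evidence_id(evidence_id: str) -> EvidenceId:
    """Split an evidence id into its parts; raises ValueError if malformed."""
    # Split from the right first so a colon inside the section name survives.
    head, paragraph_idx, method = evidence_id.rsplit(":", 2)
    paper_id, section = head.split(":", 1)
    return EvidenceId(paper_id, section, int(paragraph_idx), method)

known = make_evidence_id("Results", 2, paper_id="PMC12345")
unknown = make_evidence_id("Abstract", 0)
parsed = parse_evidence_id(known)
```

A placeholder id can later be rewritten once the real PMC ID is known, by parsing it and formatting it again with the resolved `paper_id`.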