Chapter 12: The Ingestion Pipeline

The Framework's Five Abstractions

The kgraph framework defines five pluggable abstractions: four interfaces implemented per domain, plus a bundle export step. Every ingestion pipeline -- regardless of domain -- is built from these five pieces:

DocumentParserInterface -- converts raw bytes (XML, PDF, plain text) into a structured document with section boundaries and metadata.
EntityExtractorInterface -- takes a document and returns entity mentions: text spans with type classifications, before any identity resolution.
EntityResolverInterface -- maps mentions to canonical or provisional entities, calling the identity server and updating the synonym cache.
RelationshipExtractorInterface -- takes a document and the resolved entity set from Pass 1, and returns typed relationships between them.
Bundle export -- merges per-document results, aggregates evidence across sources, and writes the kgbundle format that the query layer loads.

Domain code implements these interfaces; the framework orchestrates them. The medlit and sherlock examples each provide their own implementations. Medlit uses an LLM for both entity and relationship extraction, a JATS/PMC XML parser, and authority lookup against UMLS, HGNC, and RxNorm. Sherlock uses a simpler text parser and a lightweight LLM extractor with no external authority lookup. Same interfaces, different implementations, same output format.
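As a rough illustration of what "same interfaces, different implementations" means in code, the sketch below models two of the interfaces as Python protocols and plugs in toy implementations. Only the interface names come from the text; the method names (parse, extract), the Document and Mention shapes, and the toy classes are assumptions made for this example, not the framework's actual signatures.

```python
import re
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Document:
    doc_id: str
    sections: dict[str, str]  # section name -> text (hypothetical shape)
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Mention:
    text: str
    entity_type: str
    start: int
    end: int


class DocumentParserInterface(Protocol):
    # Hypothetical method name; the real interface may differ.
    def parse(self, raw: bytes) -> Document: ...


class EntityExtractorInterface(Protocol):
    # Hypothetical method name; the real interface may differ.
    def extract(self, doc: Document) -> list[Mention]: ...


class PlainTextParser:
    """Toy parser: treats the whole input as one 'body' section."""

    def parse(self, raw: bytes) -> Document:
        return Document(doc_id="doc-1", sections={"body": raw.decode("utf-8")})


class CapitalizedWordExtractor:
    """Toy extractor standing in for an LLM: every capitalized word is a mention."""

    def extract(self, doc: Document) -> list[Mention]:
        mentions = []
        for text in doc.sections.values():
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text):
                mentions.append(Mention(m.group(), "UNKNOWN", m.start(), m.end()))
        return mentions


def extract_mentions(
    parser: DocumentParserInterface,
    extractor: EntityExtractorInterface,
    raw: bytes,
) -> list[Mention]:
    # The orchestration code depends only on the protocols, not on a domain.
    return extractor.extract(parser.parse(raw))
```

Swapping in a JATS parser or an LLM-backed extractor would leave extract_mentions unchanged, which is the point of the abstraction.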

Why Two Passes

The temptation is to do extraction end-to-end in one shot: send the document to the model, get back entities and relationships, done. That approach fails at scale for reasons that are worth stating explicitly. A single monolithic pass has a single point of failure -- if anything goes wrong, you restart from scratch. It produces output that is hard to debug, because you can't inspect intermediate states. And it conflates concerns that are better handled separately: entity extraction, identity resolution, relationship extraction, and assembly are different problems with different failure modes and different recovery strategies.

The two-pass architecture addresses this. Pass 1 (entity extraction and resolution) produces a stable, deduplicated entity vocabulary before Pass 2 runs. Pass 2 (relationship extraction) refers to canonical entity IDs rather than raw text spans, which improves consistency and enables cross-document linking. Each pass has a well-defined input and output. Failures are recoverable: if Pass 1 fails on document 47, you fix the issue and rerun from document 47. Intermediate artifacts are inspectable. The per-document bundle becomes the natural unit of work: each document produces a bundle that can be validated, cached, and merged independently. None of this is medlit-specific. It's good pipeline design for any extraction problem at non-trivial scale.

The medlit batch pipeline exposes these passes as four concrete stages, each a Python script with a well-defined artifact at its output:

Vocabulary (fetch_vocab): optional LLM pass over all papers to build a shared vocabulary of canonical entity names and their aliases. Output: vocab.json and a seeded synonym cache.
Extract (extract): LLM entity and relationship extraction for each paper, using the vocabulary as context. Output: per-paper paper_*.json artifact files in the extracted/ directory.
Ingest (ingest): identity-server-based deduplication and canonical ID assignment across all extracted bundles. Output: entities.json, relationships.json, and an ID map in merged/.
Build bundle (build_bundle): assembles the kgbundle -- the loadable artifact for the query layer. Fetches titles for cited papers from NCBI esummary. Output: entities.jsonl, relationships.jsonl, and supporting files in bundle/.

That ordering matters. You need a consistent entity vocabulary before extraction can use it. You need resolved entity IDs before you can aggregate relationships across documents. And you need the merged output before you can build the final bundle. Other orderings are possible, but the principle holds: separate concerns, make each pass debuggable, design for partial failure and restart.
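One way to make that ordering restartable is to key each stage on its output artifact. The sketch below is a hypothetical runner, not the medlit scripts themselves: the run_stages function and the idea of passing each stage as a (name, artifact path, callable) triple are assumptions for illustration. A stage is skipped when its artifact exists, unless an earlier stage was just rerun, in which case everything downstream is rebuilt.

```python
from pathlib import Path
from typing import Callable

Stage = tuple[str, Path, Callable[[], None]]


def run_stages(stages: list[Stage]) -> list[str]:
    """Run stages in order and return the names of the stages that executed."""
    executed = []
    upstream_changed = False
    for name, artifact, fn in stages:
        # Reuse an existing artifact only if nothing upstream was regenerated.
        if artifact.exists() and not upstream_changed:
            continue
        fn()
        if not artifact.exists():
            raise RuntimeError(f"stage {name!r} did not produce {artifact}")
        executed.append(name)
        upstream_changed = True
    return executed
```

Deleting one stage's artifact and rerunning then repeats only that stage and the stages after it.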

Parsing: Getting to Text

Whatever your source format, you need to get to structured text before you can extract anything. The model reads text; it doesn't read PDF layout or XML tags. JATS XML -- the format used by PubMed Central -- is medlit's case: a structured representation of journal articles with metadata, abstract, and body sections. Yours might be PDFs, HTML, EPUB, proprietary formats, or plain text that's already clean. The parser's job is to produce a document representation that preserves structure the extractor can use: section boundaries, paragraph boundaries, and the actual text content.

Two decisions matter regardless of format. First, how do you identify section boundaries? In scientific papers, the distinction between Methods, Results, and Discussion carries semantic weight -- a claim in Results is different from a claim in Discussion. In legal documents, sections and subsections matter for citation. The parser should expose this structure so downstream passes can use it. Second, how do you chunk for extraction? Documents are often too long to send to the model in one call. Chunk too small and you lose context -- the referent of "it" or "the compound" may be in the previous chunk. Chunk too large and you exceed model context limits, dilute the signal, or hit token budgets that make the run expensive. Overlapping chunks can help: each sentence appears in at least one chunk together with its neighboring sentences, so no sentence is orphaned at a boundary. Sentence boundaries are a practical constraint worth respecting -- splitting mid-sentence produces fragments that are harder for the model to interpret correctly.
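A minimal sketch of sentence-aligned, overlapping chunking follows. It assumes a naive regex sentence splitter and measures chunk size in sentences; a real pipeline would more likely budget by tokens and use a proper sentence segmenter. The function names and default sizes are illustrative only.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(
    sentences: list[str], max_sentences: int = 4, overlap: int = 1
) -> list[list[str]]:
    """Group whole sentences into chunks that share `overlap` sentences."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + max_sentences])
        # Stop once a chunk has reached the end of the document.
        if start + max_sentences >= len(sentences):
            break
    return chunks
```

With an overlap of one, the last sentence of each chunk is repeated as the first sentence of the next, so a pronoun at the start of a chunk still has its preceding sentence available.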

Extraction: The LLM Pass

This is where your schema meets the text. The extraction prompt is not a generic "extract entities and relationships" request. It is a binding of your entity types and relationship types to natural language, written so the model understands exactly what to look for.

What every extraction prompt needs. At minimum: the closed entity-type list (definitions, prompt guidance, and classification rules for edge cases); the predicate list with domain and range guidance, explicitly steering the model toward specific predicates over generic ones like ASSOCIATED_WITH; corpus vocabulary as preferred names for the entities seen in the batch (injected to suppress surface variation before deduplication); and domain-specific instructions for classification edge cases and output format. The vocabulary section is optional but materially reduces deduplication noise in any field where the same concept has many names.

The closed-world constraint. Every relationship subject and object must be the ID of an entity extracted in the same response. The model cannot assert a relationship involving a participant it didn't also classify and type. This is enforced in the prompt and validated downstream. It prevents the model from emitting relationships that reference entities not in the extracted set -- a form of hallucination that is otherwise difficult to detect and expensive to repair.
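The downstream validation of the closed-world constraint can be very small. The sketch below assumes the response is a dict with "entities" and "relationships" lists and that relationships carry "subject" and "object" keys; those field names are hypothetical and should be adapted to the actual output format.

```python
def find_closed_world_violations(response: dict) -> list[str]:
    """Return one message per relationship endpoint that is not an extracted entity."""
    entity_ids = {e["id"] for e in response.get("entities", [])}
    violations = []
    for i, rel in enumerate(response.get("relationships", [])):
        for role in ("subject", "object"):
            if rel.get(role) not in entity_ids:
                violations.append(
                    f"relationship {i}: {role} {rel.get(role)!r} is not an extracted entity"
                )
    return violations
```

A non-empty result is a reason to reject or repair the response before anything reaches the graph.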

Staging tradeoffs. One combined prompt -- entities and relationships together -- is simpler, often sufficient, and lower latency. Splitting into two sequential calls (entities first, then relationships over the resolved entity set) reduces hallucination surface for complex documents: the model sees a clean entity list before constructing relationships, rather than reasoning about both simultaneously. The cost is latency and complexity. Ancillary metadata (study design, population characteristics, author affiliations) can be pulled in separate lightweight calls without affecting the main extraction. Most pipelines start with a single combined call and add staged passes only when inspection reveals that splitting would help.

Required output contract. Per entity: a local ID (stable within the response), entity type, surface name, synonyms, and any authority ID hints the model can infer. Per relationship: subject ID, predicate, object ID, evidence span ID, confidence, and linguistic trust (asserted / suggested / speculative). Per evidence span: passage text, section name, and paragraph index. The linguistic trust field is what allows downstream consumers to weight hedged claims differently from direct assertions. It is worth requiring it from the start rather than retrofitting -- provenance is painful to add after a pipeline is in production.
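The relationship and evidence parts of this contract could be pinned down as typed records, as in the sketch below. The field names mirror the list above, but the class names, the assumption that confidence lies in [0, 1], and the trust weights are illustrative choices rather than values from the medlit implementation.

```python
from dataclasses import dataclass

TRUST_LEVELS = ("asserted", "suggested", "speculative")

# Illustrative weights only; a real consumer would choose its own.
TRUST_WEIGHTS = {"asserted": 1.0, "suggested": 0.6, "speculative": 0.3}


@dataclass(frozen=True)
class EvidenceSpan:
    span_id: str
    passage: str
    section: str
    paragraph_index: int


@dataclass(frozen=True)
class Relationship:
    subject_id: str
    predicate: str
    object_id: str
    evidence_span_id: str
    confidence: float
    linguistic_trust: str

    def __post_init__(self) -> None:
        # Reject records that break the contract as early as possible.
        if self.linguistic_trust not in TRUST_LEVELS:
            raise ValueError(f"unknown linguistic trust: {self.linguistic_trust!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence is assumed to lie in [0, 1]")


def weighted_confidence(rel: Relationship) -> float:
    """Discount hedged claims relative to direct assertions."""
    return rel.confidence * TRUST_WEIGHTS[rel.linguistic_trust]
```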

The prompt is the place where domain expertise gets translated into extraction behavior. A clinician reviewing a well-written prompt should be able to assess whether it captures the domain correctly. Schema changes -- adding an entity type, splitting a predicate into two more specific ones -- require editing the prompt, not retraining a model. Iteration over the prompt is the design method: run extraction on a sample, inspect the output, adjust, repeat. Appendix A shows an abstracted version of the medlit extraction prompt, illustrating how entity types, predicates, linguistic trust, and the closed-world constraint slot into a template.
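To make the idea of a prompt template concrete, here is a deliberately small sketch of how the ingredients described in this section might be assembled into a prompt string. It is not the medlit prompt; the function name, argument shapes, and wording are assumptions for illustration.

```python
from typing import Optional


def build_extraction_prompt(
    entity_types: dict[str, str],
    predicates: dict[str, tuple[str, str]],
    vocabulary: Optional[list[str]] = None,
) -> str:
    """Bind a schema (types, predicates, optional vocabulary) to prompt text."""
    lines = [
        "Extract entities and relationships from the passage.",
        "",
        "Entity types (closed list):",
    ]
    for name, definition in entity_types.items():
        lines.append(f"- {name}: {definition}")
    lines += ["", "Predicates (subject type -> object type):"]
    for name, (domain, range_) in predicates.items():
        lines.append(f"- {name}: {domain} -> {range_}")
    if vocabulary:
        lines += ["", "Preferred names seen across the corpus; use them:"]
        lines += [f"- {term}" for term in vocabulary]
    lines += [
        "",
        "Every relationship subject and object must be the ID of an entity in this response.",
    ]
    return "\n".join(lines)
```

Because the schema arrives as data, adding an entity type or splitting a predicate changes the arguments, not the code.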

Vocabulary: Building a Shared Terminology

Before extraction, medlit runs a dedicated vocabulary pass over all the papers in the batch. The idea: before you try to resolve "BRCA1," "breast cancer gene 1," and "BRCA1 protein" to the same entity, you establish a shared vocabulary of entity names and their variants. A vocabulary pass asks the LLM a narrower, cheaper question than full extraction -- "given the text of this paper, list the distinct named entities you see and their common aliases" -- and aggregates the answers across all papers into a canonical name list.

The output of the vocabulary pass feeds directly into extraction. When the extraction prompt runs for each paper, the shared vocabulary is injected as context: "these are the preferred names for entities seen across the corpus; use them." This keeps extraction consistent across workers and across time. Without it, three papers that all mention GPX4 might extract it as "GPX4," "glutathione peroxidase 4," and "phospholipid hydroperoxide glutathione peroxidase" in three different bundles, and identity resolution must sort them out later. With the vocabulary priming the extraction prompt, the model tends to use a consistent preferred form, reducing the deduplication burden downstream.
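The aggregation step of a vocabulary pass can be sketched as merging per-paper alias lists into groups. The version below assumes each paper yields a mapping from a primary name to its aliases, merges groups that share any name (case-insensitively), and prefers the form most often used as a primary name. The function name and tie-breaking rule are assumptions, and merging on shared aliases can over-merge ambiguous abbreviations, so a real pass would need a review step.

```python
from collections import Counter, defaultdict


def merge_vocabularies(per_paper: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Merge per-paper {name: aliases} maps into {preferred name: other names}."""
    parent: dict[str, str] = {}
    display: dict[str, str] = {}  # case-folded key -> first surface form seen
    primary_counts: Counter = Counter()

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for vocab in per_paper:
        for name, aliases in vocab.items():
            keys = []
            for surface in [name, *aliases]:
                key = surface.casefold()
                parent.setdefault(key, key)
                display.setdefault(key, surface)
                keys.append(key)
            primary_counts[keys[0]] += 1
            for other in keys[1:]:
                root_a, root_b = find(keys[0]), find(other)
                if root_a != root_b:
                    parent[root_b] = root_a

    groups: dict[str, list[str]] = defaultdict(list)
    for key in parent:
        groups[find(key)].append(key)

    merged = {}
    for members in groups.values():
        # Most frequent primary name wins; ties break alphabetically.
        preferred = min(members, key=lambda k: (-primary_counts[k], k))
        merged[display[preferred]] = sorted(display[m] for m in members if m != preferred)
    return merged
```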

Not every domain needs a vocabulary pass. If your corpus uses consistent terminology, extraction may produce sufficiently normalized output without it. But medicine, law, and chemistry -- any field where the same concept goes by many names -- will see a measurable reduction in deduplication noise. Think of it as schema binding at the lexical level: you're telling the model what things are called before asking it to extract relationships among them.

Deduplication

The same entity extracted from many documents will appear under slightly different names. "Aspirin," "acetylsalicylic acid," "ASA," and "2-acetoxybenzoic acid" are one drug. "Type 2 diabetes," "T2DM," "diabetes mellitus type 2," and "adult-onset diabetes" are one disease. The deduplication stage groups mentions, resolves them to canonical forms, and handles the ambiguous cases. This is where the gap between "a list of extracted facts" and "a coherent graph" starts to close.

The details vary by domain. In medicine, authority lookup -- MeSH, RxNorm, HGNC, and the rest -- does much of the work: many apparent synonyms resolve to the same canonical ID automatically. What remains after authority lookup is the residue: novel entities, institution-specific abbreviations, terms that aren't in any vocabulary yet. For those, you need other signals. Semantic similarity can help: mentions whose meanings are close -- as measured by comparing their numeric representations (embeddings) -- may be the same entity. So can co-occurrence: if "compound X" and "imatinib" appear in the same document and the context suggests they're the same, that's evidence. The hard cases are the ambiguous ones -- "ACE" could be angiotensin-converting enzyme or the gene that encodes it, "CRF" could be corticotropin-releasing factor or chronic renal failure. Resolving those may require context, domain heuristics, or human review. The universal part: you need a deduplication strategy, and it should run before or alongside relationship extraction so that relationships reference resolved entities, not raw strings.
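The "authority first, residue second" flow can be sketched as follows. The authority table here is a plain dict with placeholder IDs standing in for a real lookup service, and the residue is grouped only by normalized name; the similarity and co-occurrence signals described above are left out. All names and ID formats are assumptions for illustration.

```python
def normalize(mention: str) -> str:
    # Case-fold and collapse runs of whitespace.
    return " ".join(mention.casefold().split())


def resolve_mentions(mentions: list[str], authority: dict[str, str]) -> dict[str, str]:
    """Map each mention to a canonical ID, or to a shared provisional ID."""
    resolved: dict[str, str] = {}
    provisional: dict[str, str] = {}
    for mention in mentions:
        key = normalize(mention)
        if key in authority:
            # Known to the authority: synonyms collapse to one canonical ID.
            resolved[mention] = authority[key]
        else:
            # Residue: assign a provisional ID, reused for identical normalized names.
            if key not in provisional:
                provisional[key] = f"prov:{len(provisional) + 1}"
            resolved[mention] = provisional[key]
    return resolved
```

Provisional IDs keep unresolved entities usable in relationships while leaving room to promote them to canonical IDs later.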

Assembly

Once you have per-document extractions -- entities resolved, relationships extracted -- you need to merge them into a coherent whole. Assembly is not just concatenation. When multiple documents assert the same relationship, that's meaningful signal. "Drug A treats Disease B" from one paper is weaker than "Drug A treats Disease B" from five independent papers. The assembly stage should aggregate evidence across sources: one relationship record with a provenance list, not five duplicate edges. That aggregation is what makes the graph useful for reasoning -- you can weight relationships by how many sources support them, filter by evidence type, and detect when sources conflict.
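Evidence aggregation can be sketched in a few lines. The example assumes each document contributes (subject, predicate, object) triples over already-resolved entity IDs; the function name and record layout are hypothetical, and a fuller version would carry evidence spans and trust levels alongside the source list.

```python
from collections import defaultdict

Triple = tuple[str, str, str]


def aggregate_relationships(per_document: dict[str, list[Triple]]) -> list[dict]:
    """Collapse duplicate triples into one record with a provenance list."""
    sources: dict[Triple, list[str]] = defaultdict(list)
    for doc_id, triples in per_document.items():
        for triple in triples:
            # Count each document at most once per relationship.
            if doc_id not in sources[triple]:
                sources[triple].append(doc_id)
    return [
        {"subject": s, "predicate": p, "object": o, "sources": docs, "support": len(docs)}
        for (s, p, o), docs in sources.items()
    ]
```

The support count is the simplest possible weight; filtering by evidence type or detecting conflicting predicates would build on the same grouped structure.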

The structure of the final bundle is worth thinking about carefully before you start. What does a "document" in your graph look like? Is it a node with metadata and outgoing edges to the relationships it supports? Are relationships first-class with document references, or are documents first-class with relationship references? The choice affects query patterns, provenance traversal, and how you handle updates when you re-ingest a document with corrections. Changing the bundle structure later, once you have data and downstream consumers, is expensive. Get it right early.

Progress Tracking and Resumability

Large ingestion runs fail partway through. A run over 100,000 documents will hit rate limits, network timeouts, model outages, or your own mistakes. If the pipeline has no notion of progress, you restart from zero every time. That's acceptable for a research prototype. It's not acceptable for something you run regularly.

Design for restartability from the beginning. Each document should have a processing status: not started, in progress, completed, failed. The pipeline should record which documents have been fully processed and which haven't. On restart, it should skip completed documents and resume from the first incomplete one. Checkpointing within a document -- if a single document requires multiple LLM calls, record which chunks have been processed -- can help for very long documents, though the document is usually the right granularity. The progress store should be persistent and survive process restarts. This isn't glamorous work. It's the difference between a pipeline you can run once as a demo and a pipeline you can run every week as part of your workflow.
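A persistent progress store does not need much machinery. The sketch below uses SQLite from the standard library purely for illustration (the medlit implementation described later uses Postgres); the table layout, class name, and run_pending helper are assumptions. Statuses follow the four named above.

```python
import sqlite3
from typing import Callable, Iterable

STATUSES = ("not_started", "in_progress", "completed", "failed")


class ProgressStore:
    """Per-document status that survives process restarts."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS progress ("
            "doc_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )
        self.conn.commit()

    def register(self, doc_ids: Iterable[str]) -> None:
        # INSERT OR IGNORE keeps the status of documents seen in earlier runs.
        self.conn.executemany(
            "INSERT OR IGNORE INTO progress (doc_id, status) VALUES (?, 'not_started')",
            [(d,) for d in doc_ids],
        )
        self.conn.commit()

    def set_status(self, doc_id: str, status: str) -> None:
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        self.conn.execute(
            "UPDATE progress SET status = ? WHERE doc_id = ?", (status, doc_id)
        )
        self.conn.commit()

    def pending(self) -> list[str]:
        rows = self.conn.execute(
            "SELECT doc_id FROM progress WHERE status != 'completed' ORDER BY doc_id"
        ).fetchall()
        return [row[0] for row in rows]


def run_pending(store: ProgressStore, doc_ids: list[str], process: Callable[[str], None]) -> None:
    """Process every document that is not yet completed, recording each outcome."""
    store.register(doc_ids)
    for doc_id in store.pending():
        store.set_status(doc_id, "in_progress")
        try:
            process(doc_id)
        except Exception:
            # One document's failure must not stop the rest of the run.
            store.set_status(doc_id, "failed")
            continue
        store.set_status(doc_id, "completed")
```

Rerunning run_pending against the same database file skips completed documents and retries anything left in the in-progress or failed state.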

Design Principles

The concepts above translate directly into four implementation commitments.

Dedup-on-write. Identity resolution and synonym detection happen incrementally as each entity is written; there is no global deduplication pass over the corpus. Papers can be ingested concurrently.

Per-paper atomicity. Each paper moves through stages independently. A failure at any stage leaves the paper at its last committed status; the next available worker picks it up and retries. No paper's failure affects any other.

Durable checkpoints. Raw fetched text and raw LLM extraction output are stored durably before any graph writes. A schema change, extraction bug, or infrastructure failure can be recovered from without re-fetching or re-paying LLM costs.

Shared pipeline code. The MCP tool and the batch runner call the same stage functions. There is no separate implementation for interactive versus batch use.
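Of these four commitments, durable checkpoints are the easiest to show in isolation. The sketch below writes raw extraction output to a per-paper file using a write-then-rename pattern, so a crash mid-write never leaves a truncated artifact behind. The paper_*.json naming follows the extract stage described earlier; the function names and directory handling are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_checkpoint(directory: Path, paper_id: str, raw_output: dict) -> Path:
    """Persist raw LLM output before any graph writes happen."""
    directory.mkdir(parents=True, exist_ok=True)
    final_path = directory / f"paper_{paper_id}.json"
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp_name = tempfile.mkstemp(dir=str(directory), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(raw_output, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_name, final_path)
    return final_path


def load_checkpoint(directory: Path, paper_id: str) -> Optional[dict]:
    """Return stored output, or None if this paper has not been extracted yet."""
    path = directory / f"paper_{paper_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

A stage that calls load_checkpoint before invoking the model can replay a schema change or bug fix over stored output without paying for extraction again.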

Work Queue, Artifact Files, and Reference Implementation

The medlit implementation uses Postgres as a work queue (via SELECT ... FOR UPDATE SKIP LOCKED for distributed job claiming), per-paper artifact files for durability and retraction support, and a shared set of pipeline functions used by both the batch CLI and the MCP tool. The full SQL schema, shell invocations, Python snippets, and extraction output JSON format are in Appendix A.