Part III: Building It

Chapter 9: Diagnostic Tools

The single most valuable diagnostic tool is a good visualization of the graph. Force-directed layout -- where nodes and edges are treated like masses and springs floating in a 2-D space -- makes structural problems legible before they surface in logs. A well-designed graph explorer does more than render nodes and edges; it gives you the controls to interrogate what you've built and diagnose what's in it.

Force-Directed Visualization

The medlit reference implementation uses the D3.js force layout. Nodes are colored by entity type, edges carry relationship labels, and clicking a node opens a detail panel. At a glance you can see whether the graph has the expected topology: well-connected disease and drug nodes, distinct clusters by domain, bridges between subfields. Zoom, pan, and reset complete the interface.

Entity type legend and color coding. Nodes colored by entity type -- disease, drug, gene, protein, symptom, procedure -- make the graph interpretable at a glance. You can see whether a subgraph is dominated by one type, whether the expected mix is present, and whether any node is misclassified. A gene colored as a disease, or a drug colored as a symptom, is an extraction error that becomes visible immediately.

Relationship labels on edges. The edges should show their predicate types: "treats," "inhibits," "causes," "associated with." That visibility is diagnostic. You can spot relationships that are too generic ("associated with" everywhere suggests the extraction prompt is not distinguishing types) or predicates that are wrong for the domain.

Detail panel. Clicking a node opens its full metadata: canonical ID, entity type, synonyms, source papers, confidence scores. The canonical ID should link directly to the authority source -- MeSH:D000072716 to the MeSH Browser, HGNC:4556 to the HGNC gene page -- turning the detail panel into a live bridge between your graph and the sources it draws from. Provisional entities with prov: IDs display as plain text.

Live statistics. Node count and edge count give an immediate sense of scale and density. Unexpectedly low counts can indicate extraction gaps or over-merging; unexpectedly high counts can indicate spurious extraction or resolution failures that leave duplicates unmerged.

What to Look For

Force-directed layout reliably surfaces several classes of problem:

Duplicate nodes appear as pairs or clusters with near-identical labels that should have resolved to one entity. A disease called "T2DM" and another called "Type 2 diabetes" sitting as separate nodes means your synonym cache or authority lookup did not merge them. The visual makes this obvious; scanning entities.jsonl for it would not.

High-degree hubs that shouldn't exist often indicate an under-specified entity type. If one "process" node connects to fifty drugs and thirty genes, it may be absorbing mentions that should have resolved to several distinct entities.

Disconnected components -- islands of nodes with no edges to the rest of the graph -- suggest extraction gaps. Either the relevant relationships were not extracted, or the entities in that cluster failed identity resolution and are sitting as orphaned provisionals.

Predicate distribution anomalies. If the great majority of edges carry "associated with" rather than more specific predicates, the extraction prompt is probably not steering the model toward the predicate vocabulary you defined.
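Several of these checks can also be computed directly, as a complement to looking at the layout. The following is a minimal sketch, assuming edges are available as hypothetical (subject, predicate, object) tuples and nodes as a list of IDs; this is an illustrative shape, not the kgbundle format, and the function names are invented for the example.

```python
from collections import Counter, defaultdict

def predicate_share(edges):
    """Return each predicate's share of all edges, most common first."""
    counts = Counter(pred for _, pred, _ in edges)
    total = sum(counts.values())
    return [(pred, n / total) for pred, n in counts.most_common()]

def connected_components(nodes, edges):
    """Group node IDs into components, treating edges as undirected."""
    neighbors = defaultdict(set)
    for subj, _, obj in edges:
        neighbors[subj].add(obj)
        neighbors[obj].add(subj)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first walk from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

def high_degree_hubs(edges, threshold):
    """Return (node, degree) pairs whose degree meets the threshold."""
    degree = Counter()
    for subj, _, obj in edges:
        degree[subj] += 1
        degree[obj] += 1
    return [(n, d) for n, d in degree.most_common() if d >= threshold]

# Toy data with made-up IDs, only to exercise the functions.
nodes = ["drug:1", "disease:1", "gene:1", "gene:2", "prov:9"]
edges = [
    ("drug:1", "treats", "disease:1"),
    ("gene:1", "associated with", "disease:1"),
    ("gene:2", "associated with", "disease:1"),
]
```

A dominant "associated with" share, a small component made up only of provisional IDs, or a hub far above the typical degree are the same signals described above, expressed as numbers you can track from one extraction batch to the next.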

Visualization as Pipeline Signal

The visualization is not just a display layer -- it is a diagnostic instrument that surfaces pipeline failures before they appear in logs. Unexpectedly long evidence spans, malformed relationship metadata, and identity resolution failures all produce visible anomalies: a sprawling hub where a tight cluster should be, a node with no edges that should be central, or a pair of nodes that should be one.

The feedback loop is tight during development: run a small extraction batch, load it into the visualization, identify what looks wrong, adjust the prompt or schema, repeat. This is faster than unit-testing every failure mode in advance. The visualization makes "is this working?" a question you can answer by looking.

Chapter 10: Design Priorities

This chapter examines the design decisions that matter most before you start building -- specifically around provenance, which is the decision you are most likely to underinvest in and the most painful to retrofit. The examples are drawn from my own project, a knowledge graph that captures the contents of medical papers. I call the project "kgraph" in its domain-agnostic form; "medlit" is its extension into the domain of medical literature.

These notes should be regarded not as a recipe or specification that you must follow, but as a checklist of things worth considering. The approach suggested here might meet your needs, or it might not. Either way, thinking through these points will help you reason about the details of your own design.

Provenance

Provenance is not equally important in every domain. Be honest with yourself about how much provenance infrastructure your domain actually requires before you build it.

In medicine, the difference between "this drug inhibits this enzyme, from a single case report" and "this drug inhibits this enzyme, replicated across forty randomized controlled trials" is the difference between a hypothesis and a clinical fact. A clinician making a treatment decision needs to know which. A researcher generating hypotheses might care less about the evidence grade and more about the structural pattern. But for any application where the quality of the claim matters -- clinical decision support, regulatory submission, litigation -- provenance is not optional. You need to know where each relationship came from, what kind of study supported it, and how confident the extraction was.

In other domains, the bar is lower. A knowledge graph of theatrical productions -- which plays were performed where, when, by whom -- might be perfectly useful without tracing every edge to a specific source. If the graph says "Hamlet was performed at the Globe in 1601," you might care that it's correct, but you probably don't need a provenance record that distinguishes among "extracted from a primary source," "extracted from a secondary source," and "inferred from context." The graph is useful even if you can't trace every edge. Building elaborate provenance infrastructure for a domain that doesn't need it is over-engineering.

The honest question: what will your users do with the graph? If they will make high-stakes decisions based on it, provenance matters. If they will use it for exploration, discovery, or casual reference, a lighter touch may suffice. Get that right before you invest in schema design and pipeline complexity.

When provenance does matter, it matters at an architectural level, not as an add-on. You can't retrofit provenance into a schema that didn't anticipate it without re-ingesting your corpus. The relationship model, the bundle structure, and the extraction prompts all need to capture provenance from the start. Think about it carefully before you build.

When it matters, it matters a lot. A relationship with provenance -- specific source document, section or passage, extraction method, confidence score, study design if applicable -- is evidence. Without provenance, a relationship in your graph is an assertion of unknown quality. You can't verify it. You can't weight it against conflicting claims. You can't explain to a user why the graph says what it says. You're left with "the graph says drug A treats disease B" and no way to assess whether that's a well-replicated finding or a single speculative mention.

Provenance is also what lets you audit extraction quality. When a relationship looks wrong, you need to trace it back: which document, which passage, what did the extractor see? That trace is how you find prompt bugs, schema ambiguities, and source documents that are genuinely ambiguous. Without it, you're debugging in the dark.

It's also how you debug pipeline regressions. If extraction quality drops after a prompt change, provenance lets you compare: what did the old pipeline produce for this document versus the new one? Which relationships disappeared, which appeared, which changed? Provenance turns "something got worse" into "here's exactly what changed and where."

And it's how you answer the question every serious user will eventually ask: "Why does the graph say this?" A graph that can't answer that question is a black box. A graph that can produce, for any relationship, the list of sources that support it, with enough detail to verify, is a tool you can trust.

Confidence as a Signal, Not a Guarantee

LLMs can be prompted to produce confidence scores alongside their extractions. "How confident are you that this relationship holds, on a scale of 1 to 5?" The model will give you a number. That number is useful. It correlates with extraction quality: relationships the model is confident about tend to be more reliable than ones it's uncertain about. You can use it to filter, rank, or weight results. It's worth capturing.

It is not a calibrated probability. When a model reports "90% confident" (or 5 on a 1-to-5 scale), it does not follow that 90% of such extractions are correct. The model has no access to ground truth; it's producing a subjective assessment of its own certainty, which is a different thing. The scores are ordinal -- higher usually means more reliable -- but the mapping from score to actual accuracy is unknown and varies by domain, relationship type, and prompt.
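One way to see how far the scores sit from calibrated probabilities is to hand-review a sample and tabulate observed accuracy per score. The sketch below assumes a hypothetical list of (score, is_correct) pairs produced by such a review; the sample values are invented purely to exercise the function and are not measurements of any model.

```python
from collections import defaultdict

def accuracy_by_confidence(labeled):
    """Observed accuracy per confidence score.

    labeled: iterable of (confidence_score, is_correct) pairs taken from
    a hand-reviewed sample of extractions.
    """
    buckets = defaultdict(lambda: [0, 0])  # score -> [correct, total]
    for score, is_correct in labeled:
        buckets[score][1] += 1
        if is_correct:
            buckets[score][0] += 1
    return {
        score: correct / total
        for score, (correct, total) in sorted(buckets.items())
    }

# Invented review results on the 1-to-5 scale, for illustration only.
sample = [
    (5, True), (5, True), (5, False), (5, True),
    (3, True), (3, False),
    (1, False), (1, False),
]
observed = accuracy_by_confidence(sample)
```

If accuracy rises with the score, the score is doing its job as an ordinal signal. Whether a 5 corresponds to any particular accuracy is something only a table like this, built on your own corpus and prompt, can tell you.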

The honest framing: confidence is one input to trust, not trust itself. A high-confidence extraction from a single case report is still weak evidence. A low-confidence extraction replicated across twenty randomized trials might be strong evidence. Confidence tells you something about the extraction process. Provenance tells you something about the underlying evidence. You need both, and you need to combine them with domain judgment rather than treating either as a sufficient statistic.

Multi-Source Relationships

When multiple independent sources assert the same relationship, that's meaningful signal. "Drug A inhibits enzyme B" from one paper is a datum. "Drug A inhibits enzyme B" from five papers, each with different authors, methods, and populations, is a finding. The replication across sources is evidence that the relationship is robust, not an artifact of one study's design or one extraction's error.

Designing your data model to aggregate evidence across sources -- rather than storing one relationship record per source -- is worth the extra complexity if your corpus is large enough that the same relationship will appear many times. Instead of five edges, one each from documents A, B, C, D, and E, you have one edge with a provenance list of five sources. That aggregation enables queries like "how many sources support this relationship?" and "what's the evidence grade for this claim?" It also keeps the graph from exploding in size: the number of unique relationships in a domain is much smaller than the number of relationship mentions across documents.

The aggregation logic needs to handle nuance. Two papers from the same author group might not be independent; you might want to weight them differently than two papers from unrelated labs. A paper that retracts a finding should reduce the evidence count, not leave a stale relationship in place. A paper that asserts the opposite -- "drug A does not inhibit enzyme B" -- creates a conflict that your model should represent rather than silently merging. These are design choices that depend on your domain and how you plan to use the graph.
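A minimal sketch of such an aggregated model follows. The class names, field names, and the "supports"/"contradicts" polarity values are assumptions made for illustration, not the kgraph schema. It covers only two of the design choices above: one edge per unique relationship, and conflicts represented rather than silently merged. Author-group independence and retraction handling are left out.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Evidence:
    source_id: str                # hypothetical paper identifier
    passage: str                  # text span the claim was extracted from
    confidence: float             # extractor's self-reported score
    polarity: str = "supports"    # or "contradicts" (assumed vocabulary)

@dataclass
class AggregatedRelationship:
    subject: str
    predicate: str
    obj: str
    evidence: list = field(default_factory=list)

    def supporting_sources(self):
        """Distinct sources that assert the relationship."""
        return {e.source_id for e in self.evidence if e.polarity == "supports"}

    def has_conflict(self):
        """True when sources both support and contradict the relationship."""
        return len({e.polarity for e in self.evidence}) > 1

def aggregate(mentions):
    """Merge per-document mentions into one record per unique relationship.

    mentions: iterable of (subject, predicate, obj, Evidence) tuples.
    """
    merged = {}
    for subject, predicate, obj, ev in mentions:
        key = (subject, predicate, obj)
        rel = merged.setdefault(key, AggregatedRelationship(subject, predicate, obj))
        rel.evidence.append(ev)
    return merged

# Made-up mentions of one relationship from three hypothetical papers.
mentions = [
    ("drug:A", "INHIBITS", "enzyme:B", Evidence("paper-1", "A inhibited B.", 0.9)),
    ("drug:A", "INHIBITS", "enzyme:B", Evidence("paper-2", "A blocks B activity.", 0.7)),
    ("drug:A", "INHIBITS", "enzyme:B",
     Evidence("paper-3", "A did not inhibit B.", 0.8, polarity="contradicts")),
]
graph = aggregate(mentions)
```

Keeping the contradicting evidence on the same record means "how many sources support this?" and "do sources disagree?" are both answered from one place.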

Provenance at Query Time

The point of capturing provenance is being able to use it. A graph server that can answer "what's the evidence for this relationship?" rather than just "does this relationship exist?" is a qualitatively different tool. The first supports verification, weighting, and explanation. The second supports only retrieval.

Whether you need that capability depends on your domain and your users. A researcher exploring a graph for hypothesis generation might be satisfied with "drug A is connected to disease B" and not need to drill into sources. A clinician considering a treatment decision needs the evidence. A regulatory submission requires traceability. Design for the most demanding use case you anticipate.

If you do need provenance at query time, design for it from the start. The query interface should support "give me this relationship and its provenance" as a first-class operation. The API response should include source documents, passages, confidence, and whatever else your schema captures. Retrofitting this into a schema that stored relationships without provenance, or into a server that never exposed it, is painful. You'd need to re-ingest to capture what wasn't captured, and you'd need to extend the API to return what wasn't designed to be returned. Get it right early.

Chapter 11: The Identity Server

The identity server is the component responsible for entity identity across the knowledge graph: resolving a mention to a canonical ID, tracking provisional entities until they can be confirmed, detecting synonyms, and merging duplicates. Its full architecture -- domain plugin contract, authority lookup chain, synonym cache, promotion policy, Docker deployment -- is covered in the companion volume The Identity Server: Canonical Identity for Knowledge Graphs. This chapter covers only what the ingestion pipeline needs to know about calling it.

Identity Is Load-Bearing

Canonical entities with canonical IDs are the design decision that most separates a useful knowledge graph from a sophisticated extraction exercise. Without identity resolution, you have a collection of mentions that look like a graph but don't support cross-document reasoning. With it, "Drug A treats Disease B" means the same thing whether it came from a 2010 review article or a 2024 clinical trial, because both assertions resolve to the same nodes.

The Pipeline's View

The ingestion pipeline calls the identity server as a black box. After the LLM extraction pass produces raw mentions, the ingest stage calls resolve(mention, entity_type) for each one and receives back a stable ID -- canonical if an authority matched, provisional otherwise. Provisional IDs are valid graph nodes: relationships referencing them are valid edges and evidence accumulates against them through any later promotion or merge. The pipeline does not handle provisional entities specially.
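The contract the pipeline relies on can be illustrated with a toy stand-in. The class below is not the identity server's implementation. It is an in-memory sketch with an assumed dictionary-based authority table and made-up IDs, showing only the behavior described above: the same mention always returns the same ID, canonical when an authority matches and provisional otherwise.

```python
import itertools

class InMemoryIdentityResolver:
    """Toy stand-in for the identity server, for illustration only."""

    def __init__(self, authority):
        # authority: {(normalized_name, entity_type): canonical_id}
        # A real lookup chain would query external authorities instead.
        self._authority = authority
        self._provisional = {}
        self._counter = itertools.count(1)

    def resolve(self, mention, entity_type):
        """Return a stable ID: canonical if known, provisional otherwise."""
        key = (mention.strip().lower(), entity_type)
        if key in self._authority:
            return self._authority[key]
        if key not in self._provisional:
            # First sighting: mint a provisional ID and remember it.
            self._provisional[key] = f"prov:{next(self._counter)}"
        return self._provisional[key]

# "EX:0001" is an invented ID, not a real authority identifier.
resolver = InMemoryIdentityResolver({("aspirin", "drug"): "EX:0001"})
canonical = resolver.resolve("Aspirin", "drug")
first = resolver.resolve("compound X", "drug")
again = resolver.resolve("compound X", "drug")
```

Because the caller receives an ID either way, the ingest stage can write edges immediately and leave promotion and merging to the server.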

Papers, authors, and citations from document metadata enter with their canonical ID already known (a PMC ID, an ORCID) and bypass the lookup chain entirely. The citation graph this produces -- CITES(Paper, Paper) edges derived directly from reference lists, confidence 1.0 -- is a built-in corpus expansion mechanism: frequently cited papers not yet ingested become natural candidates for the next ingest run.
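As a sketch of that expansion mechanism, the function below ranks cited-but-not-yet-ingested papers by how often the ingested corpus cites them. It assumes CITES edges are available as hypothetical (citing_id, cited_id) pairs; the function name and paper IDs are invented for the example.

```python
from collections import Counter

def next_ingest_candidates(cites_edges, ingested, top_n=10):
    """Rank papers that are cited by the corpus but not yet ingested.

    cites_edges: iterable of (citing_id, cited_id) pairs.
    ingested: set of paper IDs already in the graph.
    """
    counts = Counter(cited for _, cited in cites_edges if cited not in ingested)
    return counts.most_common(top_n)

# Made-up paper IDs, only to exercise the function.
cites_edges = [("P1", "P3"), ("P2", "P3"), ("P1", "P2"), ("P2", "P4")]
candidates = next_ingest_candidates(cites_edges, ingested={"P1", "P2"})
```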

Chapter 12: The Ingestion Pipeline

The Framework's Five Abstractions

The kgraph framework defines five pluggable interfaces, each implemented per domain. Every ingestion pipeline -- regardless of domain -- is built from these five pieces:

DocumentParserInterface -- converts raw bytes (XML, PDF, plain text) into a structured document with section boundaries and metadata.
EntityExtractorInterface -- takes a document and returns entity mentions: text spans with type classifications, before any identity resolution.
EntityResolverInterface -- maps mentions to canonical or provisional entities, calling the identity server and updating the synonym cache.
RelationshipExtractorInterface -- takes a document and the resolved entity set from Pass 1, and returns typed relationships between them.
Bundle export -- merges per-document results, aggregates evidence across sources, and writes the kgbundle format that the query layer loads.

Domain code implements these interfaces; the framework orchestrates them. The medlit and sherlock examples each provide their own implementations. Medlit uses an LLM for both entity and relationship extraction, a JATS/PMC XML parser, and authority lookup against UMLS, HGNC, and RxNorm. Sherlock uses a simpler text parser and a lightweight LLM extractor with no external authority lookup. Same interfaces, different implementations, same output format.
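The division of labor between domain code and framework can be sketched with Python protocols. The interface names come from the list above, but the method names, signatures, and the `process_document` orchestrator are assumptions made for illustration; the actual kgraph definitions may differ. Bundle export is omitted for brevity, and the stub implementations are deliberately trivial.

```python
from typing import Any, Protocol, Sequence

# Method names and signatures below are assumed, not taken from kgraph.
class DocumentParserInterface(Protocol):
    def parse(self, raw: bytes) -> Any: ...

class EntityExtractorInterface(Protocol):
    def extract(self, document: Any) -> Sequence[Any]: ...

class EntityResolverInterface(Protocol):
    def resolve(self, mentions: Sequence[Any]) -> Sequence[Any]: ...

class RelationshipExtractorInterface(Protocol):
    def extract(self, document: Any, entities: Sequence[Any]) -> Sequence[Any]: ...

def process_document(raw, parser, entity_extractor, resolver, relationship_extractor):
    """Framework-side orchestration: the same four calls for every domain."""
    document = parser.parse(raw)
    mentions = entity_extractor.extract(document)
    entities = resolver.resolve(mentions)
    relationships = relationship_extractor.extract(document, entities)
    return {"entities": list(entities), "relationships": list(relationships)}

# Trivial stubs standing in for a domain's implementations.
class Utf8Parser:
    def parse(self, raw):
        return raw.decode("utf-8")

class CapitalizedWordExtractor:
    def extract(self, document):
        return [word for word in document.split() if word.istitle()]

class LowercaseResolver:
    def resolve(self, mentions):
        return sorted({m.lower() for m in mentions})

class AdjacentPairExtractor:
    def extract(self, document, entities):
        return [(a, "CO_OCCURS_WITH", b) for a, b in zip(entities, entities[1:])]

result = process_document(
    b"Aspirin inhibits Cox",
    Utf8Parser(), CapitalizedWordExtractor(), LowercaseResolver(), AdjacentPairExtractor(),
)
```

Swapping any stub for a different class with the same method changes the domain behavior without touching the orchestrator.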

Why Two Passes

The temptation is to do extraction end-to-end in one shot: send the document to the model, get back entities and relationships, done. That approach fails at scale for reasons that are worth stating explicitly. A single monolithic pass has a single point of failure -- if anything goes wrong, you restart from scratch. It produces output that is hard to debug, because you can't inspect intermediate states. And it conflates concerns that are better handled separately: entity extraction, identity resolution, relationship extraction, and assembly are different problems with different failure modes and different recovery strategies.

The two-pass architecture addresses this. Pass 1 (entity extraction and resolution) produces a stable, deduplicated entity vocabulary before Pass 2 runs. Pass 2 (relationship extraction) refers to canonical entity IDs rather than raw text spans, which improves consistency and enables cross-document linking. Each pass has a well-defined input and output. Failures are recoverable: if Pass 1 fails on document 47, you fix the issue and rerun from document 47. Intermediate artifacts are inspectable. The per-document bundle becomes the natural unit of work: each document produces a bundle that can be validated, cached, and merged independently. None of this is medlit-specific. It's good pipeline design for any extraction problem at non-trivial scale.

The medlit batch pipeline exposes these passes as four concrete stages, each a Python script with a well-defined artifact at its output:

Vocabulary (fetch_vocab): optional LLM pass over all papers to build a shared vocabulary of canonical entity names and their aliases. Output: vocab.json and a seeded synonym cache.
Extract (extract): LLM entity and relationship extraction for each paper, using the vocabulary as context. Output: per-paper paper_*.json artifact files in the extracted/ directory.
Ingest (ingest): identity-server-based deduplication and canonical ID assignment across all extracted bundles. Output: entities.json, relationships.json, and an ID map in merged/.
Build bundle (build_bundle): assembles the kgbundle -- the loadable artifact for the query layer. Fetches titles for cited papers from NCBI esummary. Output: entities.jsonl, relationships.jsonl, and supporting files in bundle/.

That ordering matters. You need a consistent entity vocabulary before extraction can use it. You need resolved entity IDs before you can aggregate relationships across documents. And you need the aggregated merged output before you can build the final bundle. Other orderings are possible, but the principle holds: separate concerns, make each pass debuggable, design for partial failure and restart.

Parsing: Getting to Text

Whatever your source format, you need to get to structured text before you can extract anything. The model reads text; it doesn't read PDF layout or XML tags. JATS XML -- the format used by PubMed Central -- is medlit's case: a structured representation of journal articles with metadata, abstract, and body sections. Yours might be PDFs, HTML, EPUB, proprietary formats, or plain text that's already clean. The parser's job is to produce a document representation that preserves structure the extractor can use: section boundaries, paragraph boundaries, and the actual text content.

Two decisions matter regardless of format. First, how do you identify section boundaries? In scientific papers, the distinction between Methods, Results, and Discussion carries semantic weight -- a claim in Results is different from a claim in Discussion. In legal documents, sections and subsections matter for citation. The parser should expose this structure so downstream passes can use it. Second, how do you chunk for extraction? Documents are often too long to send to the model in one call. Chunk too small and you lose context -- the referent of "it" or "the compound" may be in the previous chunk. Chunk too large and you exceed model context limits, dilute the signal, or hit token budgets that make the run expensive. Overlapping chunks can help: when consecutive chunks share a sentence or two, every sentence appears in at least one chunk together with the context that precedes it, so a referent is less likely to be stranded on the far side of a boundary. Sentence boundaries are a practical constraint worth respecting -- splitting mid-sentence produces fragments that are harder for the model to interpret correctly.
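A minimal sketch of sentence-aligned, overlapping chunking is shown below. It makes two simplifying assumptions: sentences are split with a naive regular expression (abbreviations such as "et al." will trip it), and chunk size is counted in sentences rather than tokens. A production chunker would likely budget by tokens and use a better sentence splitter.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_sentences(sentences, max_sentences=5, overlap=1):
    """Build chunks of whole sentences; consecutive chunks share `overlap` sentences."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # this chunk already reached the end of the document
    return chunks

text = "A one. B two. C three. D four. E five."
chunks = chunk_sentences(split_sentences(text), max_sentences=3, overlap=1)
```

With three sentences per chunk and an overlap of one, the sentence "C three." closes the first chunk and opens the second, so a pronoun in "D four." still has its preceding sentence available.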

Extraction: The LLM Pass

This is where your schema meets the text. The extraction prompt is not a generic "extract entities and relationships" request. It is a binding of your entity types and relationship types to natural language, written so the model understands exactly what to look for.

What every extraction prompt needs. At minimum: the closed entity-type list (definitions, prompt guidance, and classification rules for edge cases); the predicate list with domain and range guidance, explicitly steering the model toward specific predicates over generic ones like ASSOCIATED_WITH; corpus vocabulary as preferred names for the entities seen in the batch (injected to suppress surface variation before deduplication); and domain-specific instructions for classification edge cases and output format. The vocabulary section is optional but materially reduces deduplication noise in any field where the same concept has many names.

The closed-world constraint. Every relationship subject and object must be the ID of an entity extracted in the same response. The model cannot assert a relationship involving a participant it didn't also classify and type. This is enforced in the prompt and validated downstream. It prevents the model from emitting relationships that reference entities not in the extracted set -- a form of hallucination that is otherwise difficult to detect and expensive to repair.
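The downstream validation can be as simple as the check sketched here. The JSON field names ("entities", "relationships", "id", "subject", "object", "predicate") are assumptions made for illustration; the real extraction output format is the one given in Appendix A.

```python
def closed_world_violations(extraction):
    """List relationship endpoints that do not name an extracted entity.

    Returns (predicate, role, offending_id) tuples; an empty list means
    the response satisfies the closed-world constraint.
    """
    entity_ids = {e["id"] for e in extraction.get("entities", [])}
    violations = []
    for rel in extraction.get("relationships", []):
        for role in ("subject", "object"):
            if rel.get(role) not in entity_ids:
                violations.append((rel.get("predicate"), role, rel.get(role)))
    return violations

# Invented response: the second relationship points at "e9", never extracted.
response = {
    "entities": [
        {"id": "e1", "type": "drug", "name": "drug A"},
        {"id": "e2", "type": "disease", "name": "disease B"},
    ],
    "relationships": [
        {"subject": "e1", "predicate": "TREATS", "object": "e2"},
        {"subject": "e1", "predicate": "INHIBITS", "object": "e9"},
    ],
}
violations = closed_world_violations(response)
```

Whether a violation should drop the single relationship or reject the whole response is a policy choice; the check only makes the violation visible.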

Staging tradeoffs. One combined prompt -- entities and relationships together -- is simpler, often sufficient, and lower latency. Splitting into two sequential calls (entities first, then relationships over the resolved entity set) reduces hallucination surface for complex documents: the model sees a clean entity list before constructing relationships, rather than reasoning about both simultaneously. The cost is latency and complexity. Ancillary metadata (study design, population characteristics, author affiliations) can be pulled in separate lightweight calls without affecting the main extraction. Most pipelines start with a single combined call and add staged passes only when inspection reveals that splitting would help.

Required output contract. Per entity: a local ID (stable within the response), entity type, surface name, synonyms, and any authority ID hints the model can infer. Per relationship: subject ID, predicate, object ID, evidence span ID, confidence, and linguistic trust (asserted / suggested / speculative). Per evidence span: passage text, section name, and paragraph index. The linguistic trust field is what allows downstream consumers to weight hedged claims differently from direct assertions. It is worth requiring it from the start rather than retrofitting -- provenance is painful to add after a pipeline is in production.
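One way to hold the pipeline to this contract is to parse each response into typed records that reject unknown trust levels. The dataclasses below mirror the fields listed above, but the class names, attribute names, and numeric confidence type are assumptions for illustration rather than medlit's actual models.

```python
from dataclasses import dataclass, field

TRUST_LEVELS = ("asserted", "suggested", "speculative")

@dataclass
class ExtractedEntity:
    local_id: str                 # stable only within one response
    entity_type: str
    name: str
    synonyms: list = field(default_factory=list)
    authority_hints: list = field(default_factory=list)

@dataclass
class EvidenceSpan:
    span_id: str
    text: str
    section: str
    paragraph_index: int

@dataclass
class ExtractedRelationship:
    subject_id: str
    predicate: str
    object_id: str
    evidence_span_id: str
    confidence: float
    linguistic_trust: str

    def __post_init__(self):
        # Fail fast on values outside the agreed vocabulary.
        if self.linguistic_trust not in TRUST_LEVELS:
            raise ValueError(f"unknown linguistic trust: {self.linguistic_trust!r}")

ok = ExtractedRelationship("e1", "TREATS", "e2", "s1", 0.8, "suggested")
```

Rejecting a malformed record at parse time keeps hedged and direct claims distinguishable all the way to the bundle.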

The prompt is the place where domain expertise gets translated into extraction behavior. A clinician reviewing a well-written prompt should be able to assess whether it captures the domain correctly. Schema changes -- adding an entity type, splitting a predicate into two more specific ones -- require editing the prompt, not retraining a model. Iteration over the prompt is the design method: run extraction on a sample, inspect the output, adjust, repeat. Appendix A shows an abstracted version of the medlit extraction prompt, illustrating how entity types, predicates, linguistic trust, and the closed-world constraint slot into a template.

Vocabulary: Building a Shared Terminology

Before extraction, medlit runs a dedicated vocabulary pass over all the papers in the batch. The idea: before you try to resolve "BRCA1," "breast cancer gene 1," and "BRCA1 protein" to the same entity, you establish a shared vocabulary of entity names and their variants. A vocabulary pass asks the LLM a narrower, cheaper question than full extraction -- "given the text of this paper, list the distinct named entities you see and their common aliases" -- and aggregates the answers across all papers into a canonical name list.

The output of the vocabulary pass feeds directly into extraction. When the extraction prompt runs for each paper, the shared vocabulary is injected as context: "these are the preferred names for entities seen across the corpus; use them." This keeps extraction consistent across workers and across time. Without it, three papers that all mention GPX4 might extract it as "GPX4," "glutathione peroxidase 4," and "phospholipid hydroperoxide glutathione peroxidase" in three different bundles, and identity resolution must sort them out later. With the vocabulary priming the extraction prompt, the model tends to use a consistent preferred form, reducing the deduplication burden downstream.
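The aggregation and injection steps might look roughly like the sketch below. It assumes each paper's vocabulary answer arrives as a list of hypothetical {"name", "aliases"} records, groups names that are linked through aliases, and picks the most frequently used primary name as the preferred form. The grouping is case-sensitive and the prompt wording is invented; medlit's actual fetch_vocab logic may differ.

```python
from collections import Counter, defaultdict

def build_vocabulary(per_paper):
    """Merge per-paper entity lists into {preferred_name: [aliases]}."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    name_counts = Counter()
    for entities in per_paper:
        for ent in entities:
            name_counts[ent["name"]] += 1
            root = find(ent["name"])
            for alias in ent.get("aliases", []):
                parent[find(alias)] = root  # link alias group to name group
                root = find(ent["name"])

    groups = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)

    vocab = {}
    for members in groups.values():
        # Most frequent primary name wins; ties break alphabetically.
        preferred = max(sorted(members), key=lambda n: name_counts[n])
        vocab[preferred] = sorted(members - {preferred})
    return vocab

def vocabulary_prompt_section(vocab):
    """Render the vocabulary as a block of text for the extraction prompt."""
    lines = ["Preferred entity names (use these forms when they apply):"]
    for preferred, aliases in sorted(vocab.items()):
        suffix = f" (also: {', '.join(aliases)})" if aliases else ""
        lines.append(f"- {preferred}{suffix}")
    return "\n".join(lines)

papers = [
    [{"name": "GPX4", "aliases": ["glutathione peroxidase 4"]}],
    [{"name": "glutathione peroxidase 4", "aliases": ["GPX4"]}],
    [{"name": "GPX4", "aliases": []}],
]
vocab = build_vocabulary(papers)
section = vocabulary_prompt_section(vocab)
```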

Not every domain needs a vocabulary pass. If your corpus uses consistent terminology, extraction may produce sufficiently normalized output without it. But medicine, law, and chemistry -- any field where a single concept routinely goes by many names -- will see a measurable reduction in deduplication noise. Think of it as schema binding at the lexical level: you're telling the model what things are called before asking it to extract relationships among them.

Deduplication

The same entity extracted from many documents will appear under slightly different names. "Aspirin," "acetylsalicylic acid," "ASA," and "2-acetoxybenzoic acid" are one drug. "Type 2 diabetes," "T2DM," "diabetes mellitus type 2," and "adult-onset diabetes" are one disease. The deduplication stage groups mentions, resolves them to canonical forms, and handles the ambiguous cases. This is where the gap between "a list of extracted facts" and "a coherent graph" starts to close.

The details vary by domain. In medicine, authority lookup -- MeSH, RxNorm, HGNC, and the rest -- does much of the work: many apparent synonyms resolve to the same canonical ID automatically. What remains after authority lookup is the residue: novel entities, institution-specific abbreviations, terms that aren't in any vocabulary yet. For those, you need other signals. Semantic similarity can help: mentions whose meanings are close -- as measured by comparing their numeric representations (embeddings) -- may be the same entity. So can co-occurrence: if "compound X" and "imatinib" appear in the same document and the context suggests they're the same, that's evidence. The hard cases are the ambiguous ones -- "ACE" could be angiotensin-converting enzyme or the gene that encodes it, "CRF" could be corticotropin-releasing factor or chronic renal failure. Resolving those may require context, domain heuristics, or human review. The universal part: you need a deduplication strategy, and it should run before or alongside relationship extraction so that relationships reference resolved entities, not raw strings.
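For the residue, the semantic-similarity signal can be sketched as a cosine comparison over embedding vectors. The sketch assumes the vectors have already been produced by some embedding model and are passed in as plain lists; the two-dimensional toy vectors and the 0.9 threshold are invented for illustration. Pairs above the threshold are candidates for review, not automatic merges.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_candidates(embeddings, threshold=0.9):
    """Return (mention_a, mention_b, score) for pairs at or above the threshold.

    embeddings: {mention_text: vector}
    """
    pairs = []
    for (m1, v1), (m2, v2) in combinations(sorted(embeddings.items()), 2):
        score = cosine(v1, v2)
        if score >= threshold:
            pairs.append((m1, m2, round(score, 3)))
    return pairs

# Toy vectors chosen by hand; real embeddings have hundreds of dimensions.
toy = {
    "compound X": [1.0, 0.0],
    "imatinib": [0.9, 0.1],
    "aspirin": [0.0, 1.0],
}
pairs = merge_candidates(toy)
```

Similarity alone cannot separate the ambiguous abbreviations described above, which is why the output feeds a review step or a context-aware check instead of a direct merge.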

Assembly

Once you have per-document extractions -- entities resolved, relationships extracted -- you need to merge them into a coherent whole. Assembly is not just concatenation. When multiple documents assert the same relationship, that's meaningful signal. "Drug A treats Disease B" from one paper is weaker than "Drug A treats Disease B" from five independent papers. The assembly stage should aggregate evidence across sources: one relationship record with a provenance list, not five duplicate edges. That aggregation is what makes the graph useful for reasoning -- you can weight relationships by how many sources support them, filter by evidence type, and detect when sources conflict.

The structure of the final bundle is worth thinking about carefully before you start. What does a "document" in your graph look like? Is it a node with metadata and outgoing edges to the relationships it supports? Are relationships first-class with document references, or are documents first-class with relationship references? The choice affects query patterns, provenance traversal, and how you handle updates when you re-ingest a document with corrections. Changing the bundle structure later, once you have data and downstream consumers, is expensive. Get it right early.

Progress Tracking and Resumability

Large ingestion runs fail partway through. A run over 100,000 documents will hit rate limits, network timeouts, model outages, or your own mistakes. If the pipeline has no notion of progress, you restart from zero every time. That's acceptable for a research prototype. It's not acceptable for something you run regularly.

Design for restartability from the beginning. Each document should have a processing status: not started, in progress, completed, failed. The pipeline should record which documents have been fully processed and which haven't. On restart, it should skip completed documents and resume from the first incomplete one. Checkpointing within a document -- if a single document requires multiple LLM calls, record which chunks have been processed -- can help for very long documents, though the document is usually the right granularity. The progress store should be persistent and survive process restarts. This isn't glamorous work. It's the difference between a pipeline you can run once as a demo and a pipeline you can run every week as part of your workflow.
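A persistent progress store does not need much machinery. The sketch below uses SQLite from the Python standard library as a stand-in; it is not the medlit implementation, and the table layout, status strings, and `run` driver are assumptions for illustration. It works at document granularity and retries anything not marked completed.

```python
import sqlite3

class ProgressStore:
    """Per-document status kept in SQLite so it survives process restarts."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS progress ("
            "doc_id TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
        )
        self._db.commit()

    def set_status(self, doc_id, status, error=None):
        # INSERT OR REPLACE overwrites any earlier row for the same doc_id.
        self._db.execute(
            "INSERT OR REPLACE INTO progress (doc_id, status, error) VALUES (?, ?, ?)",
            (doc_id, status, error),
        )
        self._db.commit()

    def completed(self):
        rows = self._db.execute(
            "SELECT doc_id FROM progress WHERE status = 'completed'"
        )
        return {row[0] for row in rows}

def run(doc_ids, process, store):
    """Process every document not yet completed; record failures and move on."""
    done = store.completed()
    for doc_id in doc_ids:
        if doc_id in done:
            continue
        store.set_status(doc_id, "in_progress")
        try:
            process(doc_id)
        except Exception as exc:
            store.set_status(doc_id, "failed", str(exc))
        else:
            store.set_status(doc_id, "completed")
```

Calling `run` a second time with the same document list touches only the documents that failed or were interrupted, which is the property that makes a weekly run practical. Pass a file path rather than ":memory:" so the state outlives the process.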

Design Principles

The concepts above translate directly into four implementation commitments.

Dedup-on-write. Identity resolution and synonym detection happen incrementally as each entity is written; there is no global deduplication pass over the corpus. Papers can be ingested concurrently.

Per-paper atomicity. Each paper moves through stages independently. A failure at any stage leaves the paper at its last committed status; the next available worker picks it up and retries. No paper's failure affects any other.

Durable checkpoints. Raw fetched text and raw LLM extraction output are stored durably before any graph writes. A schema change, extraction bug, or infrastructure failure can be recovered from without re-fetching or re-paying LLM costs.

Shared pipeline code. The MCP tool and the batch runner call the same stage functions. There is no separate implementation for interactive versus batch use.

Work Queue, Artifact Files, and Reference Implementation

The medlit implementation uses Postgres as a work queue (claiming jobs with SELECT ... FOR UPDATE SKIP LOCKED so that concurrent workers never pick up the same paper), per-paper artifact files for durability and retraction support, and a shared set of pipeline functions used by both the batch CLI and the MCP tool. The full SQL schema, shell invocations, Python snippets, and extraction output JSON format are in Appendix A.