Part II: LLMs Change the Equation¶
Chapter 6: LLMs Make This Practical Now¶
Chapter 5 closed with the economics argument: the marginal cost of a new extraction task is a prompt, not a research project. That shift is assumed here. This chapter is about what LLMs actually are for extraction purposes, why prompts work as schema binding, what they handle well, and where the remaining limitations are.
What LLMs Actually Are, For This Purpose¶
Chapter 1 established what LLMs are not: not databases, not reasoning systems, not reliable reporters of ground truth. They are pattern-completion engines trained on large text corpora, and their outputs are statistically plausible continuations of their inputs, not verified facts. This is the source of hallucination, and hallucination does not go away in the extraction context -- it takes a specific form there that we'll address directly.
What LLMs are, for the purposes of extraction, is something more specific and more useful than "a neural network trained on text."
A large language model has, in its training distribution, an enormous amount of human text that includes natural language descriptions of relationships between things: what drugs treat, what genes encode, what historical events caused other historical events, what legal provisions imply, what symptoms suggest. The model has learned, in a diffuse statistical sense, what it means for these relationships to hold -- not as explicit logical rules, but as patterns of co-occurrence, context, and usage that are deeply embedded in the model's weights.
When you write a prompt that says "extract all 'treats' relationships between drugs and diseases from the following text, where 'treats' means a drug is used therapeutically to address a disease," you are not teaching the model what "treats" means. You are binding the model's existing, diffuse understanding of that concept to your schema. The model already has a representation of the difference between "ibuprofen treats headache" and "ibuprofen does not treat bacterial infection." Your prompt tells it that this particular distinction, expressed in terms of your schema's relationship type, is what you want surfaced.
This is the idea of "the prompt as schema binding," and it's the conceptual key to why LLM-based extraction works differently from classical systems. You're not training a model to recognize patterns. You're directing a model that already has deep, broad pattern knowledge to apply that knowledge to your specific representational task. The schema description in your prompt is a set of instructions to a very knowledgeable reader, not a specification for a statistical classifier.
The implications of this are large, and we'll work through them in the next section. For now, the key point is that "LLMs understand language" -- a claim that deserves skepticism in many contexts -- is, for the specific task of extraction, a practically useful approximation. The model doesn't need to understand language in the philosophical sense. It needs to be able to identify, in a passage of text, whether a relationship of a given semantic type holds between two given entities. It can do this because it has learned, from an enormous amount of human language use, what those relationships look like when they're present and when they're absent. That's enough.
The Prompt as Schema Binding¶
The central practical difference between classical NLP extraction and LLM-based extraction is this: in classical systems, the schema is baked into the model architecture and the training data. In LLM-based systems, the schema is in the prompt.
This sounds like an implementation detail. It isn't.
When the schema is baked into the model, changing the schema means changing the model. New entity types require new training data and new training runs. Renamed relationship types confuse a model trained with the old names. Distinctions that turned out to matter -- the difference between "directly inhibits" and "allosterically inhibits," say, if that distinction turns out to be clinically significant for your application -- require new annotations and new training. The schema is frozen at training time, and thawing it is expensive.
When the schema is in the prompt, changing the schema means changing the prompt. You add a new entity type by describing it. You clarify a relationship type by adding a sentence that explains the distinction. You can make these changes and run an extraction batch within the hour to see how the change affects output quality. Schema evolution, which any serious knowledge graph project will go through, stops being a recurring research project and becomes a routine iteration loop.
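As a rough illustration of what "the schema is in the prompt" can look like in practice, the sketch below keeps the schema as plain data and renders it into prompt text. The names `SCHEMA` and `build_prompt`, the `increases_risk_of` relationship, and the prompt wording are all assumptions made for this example, not a prescribed format. No model is called; the point is only that adding a relationship type is a one-entry change.

```python
# Hypothetical schema: relationship name -> natural-language definition.
SCHEMA = {
    "treats": "a drug is used therapeutically to address a disease",
    "inhibits": "a drug or compound reduces the activity of a target",
}


def build_prompt(schema: dict[str, str], passage: str) -> str:
    """Render the schema definitions and the passage into one extraction prompt."""
    definitions = "\n".join(
        f'- "{name}": {definition}' for name, definition in schema.items()
    )
    return (
        "Extract relationships of the following types from the text.\n"
        f"{definitions}\n"
        "Exclude negated statements.\n\n"
        f"Text:\n{passage}"
    )


# Evolving the schema is an edit to the data, not a retraining run.
extended = {
    **SCHEMA,
    "increases_risk_of": "an exposure raises the likelihood of a disease "
                         "without being asserted as its cause",
}
prompt = build_prompt(extended, "Drug X did not inhibit pathway Y in this model.")
```

Because the definitions are ordinary sentences, a domain expert can review `SCHEMA` directly, which is the legibility argument made in the next paragraph.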
The other consequence is that schema design becomes a collaborative, legible activity. A classical NLP pipeline encodes schema decisions in training data and model weights that domain experts can't directly inspect or critique. A prompt encodes schema decisions in natural language that anyone who can read can evaluate. A cardiologist reviewing a proposed schema for a cardiovascular knowledge graph can read a well-written extraction prompt, notice that the distinction between "increases risk of" and "causes" is not being drawn, and say so. She doesn't need to understand machine learning. She needs to understand her domain, and she does.
This changes the knowledge engineering relationship in a way that the classical systems never managed. The expert systems of the 1980s aspired to this: the vision was always that domain experts would be able to inspect and correct the knowledge base. The execution required knowledge engineers as intermediaries, because the representation languages were not natural language. LLM-based extraction achieves, in the extraction domain, what the knowledge engineers were supposed to achieve in the representation domain: it removes the translation layer between what domain experts know and what the system can use.
There are limits. Prompts that ask for overly subtle distinctions -- semantic differences that would challenge even a careful human reader -- produce inconsistent output. Prompts that are ambiguous produce ambiguous extractions. The quality of the schema description in the prompt directly determines the quality of the extraction output, and "write a good schema description" is harder than it sounds; the whole of Chapter 10 is devoted to it. The point is not that prompts are easy to write, but that the feedback loop between "what I described" and "what I got" is short enough to be useful.
Handling What Classical Systems Couldn't¶
Let's be concrete about the specific failure modes of classical NLP extraction that LLMs handle well, because "LLMs are better at language" is too vague to be useful.
Negation. The sentence "drug X did not inhibit pathway Y in this model" contains an "inhibits" relationship that is negated. A classical relation extraction system trained to recognize inhibition relationships would frequently fire on this sentence anyway, because "did not inhibit" and "inhibits" look statistically similar. The word "not" is short and common enough to be underweighted in most feature representations. A large language model, prompted to extract inhibition relationships and to exclude negated ones, handles this correctly in the vast majority of cases. The model has learned from an enormous amount of text that negation changes meaning, and it applies that knowledge. This is not a small improvement: negated relationships are common in scientific literature, and a graph that includes them as positive edges is systematically wrong in ways that are hard to detect.
Hedged claims. Related but distinct. "Drug X may inhibit pathway Y" and "drug X inhibits pathway Y" are different claims with different epistemic weights. Classical systems often collapsed these into the same extraction, losing the hedge. A prompted LLM can be instructed to track hedging as part of the provenance record -- to note whether a claim is stated as fact, hypothesis, or speculation -- and it will do so with reasonable consistency. This is directly relevant to provenance design, which Chapter 8 covers.
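One way to keep negation and hedging from being lost is to give them explicit fields in the record the extractor is asked to fill in. The sketch below is a minimal version of that idea. The class and field names are assumptions for illustration, and the two records are written by hand to stand in for model output.

```python
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    FACT = "fact"
    HYPOTHESIS = "hypothesis"
    SPECULATION = "speculation"


@dataclass(frozen=True)
class ExtractedRelation:
    subject: str
    predicate: str
    obj: str
    negated: bool
    certainty: Certainty
    evidence: str  # the sentence the claim was read from


def positive_edges(relations):
    """Keep only relations that should become positive edges in the graph."""
    return [r for r in relations if not r.negated]


relations = [
    ExtractedRelation("drug X", "inhibits", "pathway Y", True, Certainty.FACT,
                      "drug X did not inhibit pathway Y in this model"),
    ExtractedRelation("drug X", "inhibits", "pathway Y", False, Certainty.HYPOTHESIS,
                      "Drug X may inhibit pathway Y"),
]
edges = positive_edges(relations)
```

The negated record is filtered out, and the hedged one survives with its certainty level attached, so a downstream query can still tell a hypothesis from an established finding.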
Implicit relationships. Some relationships are never stated directly in the text but are clear to a knowledgeable reader. "Patients receiving drug X showed a 40% reduction in tumor burden compared to controls" does not contain the word "treats." It asserts, in the language of clinical reporting, that drug X has therapeutic activity against the relevant tumor type. A classical system without an explicit "treats" pattern for this construction would miss it. A large language model, prompted to extract treatment relationships and given a description of what counts as evidence for one, will recognize this as an instance of the pattern. This matters a great deal in biomedical literature, where direct assertions are often replaced by results-focused constructions that any domain expert reads as making a relational claim.
Cross-sentence dependencies. "The compound was tested against a panel of cancer cell lines. It showed selective activity against BRCA1-mutant cells, with IC50 values in the nanomolar range." These two sentences together assert a relationship -- the compound has activity against a specific molecular subtype of cancer -- that doesn't fully exist in either sentence alone. Classical architectures with limited context windows would process each sentence independently and might miss the connection. A large language model operating over both sentences simultaneously -- or, in the chunking strategies we'll cover in Chapter 10, over a passage that includes both -- will typically resolve "It" to the compound named in the first sentence and extract the relationship correctly.
Domain jargon. A drug referred to by a trial identifier, a gene referred to by a lab-specific shorthand, a syndrome referred to by the name of its first describer -- these appear constantly in specialist literature and were frequently invisible to classical systems trained on corpora where the standard terminology dominated. LLMs trained on broad scientific text have seen a much wider range of how concepts are referred to, and they tolerate terminological variation better. This isn't complete -- genuinely novel terminology, or highly specialized jargon outside the training distribution, can still cause failures -- but the robustness is substantially better.
None of these improvements mean that LLM extraction is reliable without engineering. They mean that the specific failure modes that made classical extraction brittle in complex domains are substantially mitigated. The failure modes that remain are different in character, and the engineering response to them is different.
The Remaining Limitations, Honestly¶
Chapter 1 established hallucination as a structural feature of LLMs, not a bug. Chapter 5 was honest about what classical NLP couldn't do. This chapter should be equally honest about what LLMs can't do, because the engineering in Part III is largely a response to these limitations.
Hallucination in extraction. The model can invent entity names that don't appear in the source text. It can assert relationships that sound plausible given the text but that the text doesn't actually support. It can misattribute provenance, assigning a claim to a passage that doesn't contain it. These are not rare edge cases -- they occur with meaningful frequency even in well-prompted models, and the frequency increases as the task gets harder (more complex sentences, more ambiguous relationships, longer passages). Validation against the source is not optional. Chapter 12 covers the validation pipeline.
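The cheapest form of validation against the source is a literal grounding check: do the extracted entity names and the quoted evidence actually occur in the passage? The sketch below assumes the extractor returns a subject, an object, and an evidence quote. The function names are hypothetical, and this is not the validation pipeline the book describes later. It can catch invented names and misattributed quotes, but it cannot tell whether a relationship is genuinely supported.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not defeat matching."""
    return " ".join(text.lower().split())


def grounding_problems(source: str, subject: str, obj: str, evidence: str) -> list[str]:
    """Return a list of reasons the extraction is not grounded in the source."""
    haystack = normalize(source)
    problems = []
    for label, value in (("subject", subject), ("object", obj), ("evidence", evidence)):
        if normalize(value) not in haystack:
            problems.append(f"{label} not found in source: {value!r}")
    return problems


source = (
    "Patients receiving drug X showed a 40% reduction\n"
    "in tumor burden compared to controls."
)
ok = grounding_problems(
    source, "drug X", "tumor burden", "showed a 40% reduction in tumor burden"
)
bad = grounding_problems(source, "drug Z", "tumor burden", "drug Z cured all patients")
```

Here `ok` is empty, while `bad` reports both the invented subject and the evidence quote that does not appear in the passage.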
Context window limits. Scientific papers are often tens of thousands of words. The relationships that matter may span sections written pages apart -- an introduction that defines a hypothesis, a methods section that describes a test, a results section that reports an outcome. The model's context window is finite, and even the largest current context windows don't fully solve this problem; performance tends to degrade on information that appears far from the relevant extraction target within a long context. The chunking strategies in Chapter 10 are a pragmatic response: break documents into manageable passages, extract relationships within each, and handle cross-chunk dependencies with a separate pass. This works, but it introduces its own complications, including relationships that span chunk boundaries and may be missed.
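A minimal sketch of the chunking step, assuming the document has already been split into sentences. Note that the text above describes a separate pass for cross-chunk dependencies. The overlapping windows shown here are a different and simpler mitigation, swapped in for illustration, and the function name and parameters are hypothetical.

```python
def chunk_sentences(sentences: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split sentences into windows of `size` that share `overlap` sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last window already reaches the end
    return chunks


sentences = [f"s{i}" for i in range(1, 8)]  # s1 .. s7
chunks = chunk_sentences(sentences, size=3, overlap=1)
```

With one sentence of overlap, a relationship stated across `s3` and `s4` is visible within a single chunk. A relationship spanning sections pages apart is not, which is why overlap alone does not remove the complication described above.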
Cost at scale. A single extraction call against a single paper costs a small amount. Across a corpus of hundreds of thousands of papers, the cost accumulates. The economics are dramatically better than classical NLP at prototype scale, but they require more careful management at production scale. Caching, batching, and tiered extraction (do cheap passes first, expensive passes only where needed) are all part of managing this.
Non-determinism. The same prompt, run twice against the same text, may produce different output. This matters for reproducibility: if your pipeline produces different graphs on different runs, it's difficult to debug, compare, and maintain. Caching extraction results addresses this directly and is both an efficiency measure and a reproducibility measure.
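Caching can be sketched as follows, assuming results are keyed by a hash of everything that influences the output (model name, prompt, and input text). `CachedExtractor` and `fake_model` are hypothetical names, the model call is a stand-in that contacts no API, and the cache is an in-memory dictionary where a real pipeline would presumably persist it.

```python
import hashlib
import json


def cache_key(model: str, prompt: str, text: str) -> str:
    """Hash every input that affects the output; changing any part yields a new key."""
    payload = json.dumps({"model": model, "prompt": prompt, "text": text}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedExtractor:
    def __init__(self, extract_fn, model: str):
        self._extract_fn = extract_fn  # placeholder for the real model call
        self._model = model
        self._cache: dict[str, object] = {}
        self.calls = 0  # number of times the underlying model was invoked

    def extract(self, prompt: str, text: str):
        key = cache_key(self._model, prompt, text)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._extract_fn(prompt, text)
        return self._cache[key]


def fake_model(prompt, text):  # hypothetical stand-in; no API is called
    return [("drug X", "inhibits", "pathway Y")]


extractor = CachedExtractor(fake_model, model="example-model")
first = extractor.extract("extract inhibits relations", "Drug X inhibits pathway Y.")
second = extractor.extract("extract inhibits relations", "Drug X inhibits pathway Y.")
```

The second call returns the stored result without invoking the model, so repeated runs over unchanged inputs produce the same graph. Editing the prompt changes the key and triggers a fresh extraction.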
What the model knows and doesn't know. LLM extraction reflects what the model learned from its training data. Very recent developments, highly specialized terminology that appears rarely in broad scientific text, and concepts that are standard in one community but not in the general scientific literature can all cause degraded performance. In practice, this means that extraction quality should be evaluated on your specific domain and corpus, not assumed from general benchmarks.
The point of this honest accounting is not to undercut the argument that LLMs make knowledge graph construction newly practical. They do. The point is that "newly practical" means "practical if you engineer it carefully," not "practical if you just call the API." Part III is the careful engineering.
Why This Moment¶
One more question deserves an answer before we get into the engineering: why now? The transformer architecture was introduced in 2017. GPT-2 was released in 2019. Why is this moment -- roughly 2023 through the present -- the right time to build?
The answer is the convergence of three things that had to arrive together.
The first is model capability. The capability of general-purpose language models for complex semantic understanding crossed a threshold somewhere around the GPT-4 generation. Before that threshold, prompted extraction was possible but brittle on complex constructions; after it, the failures became manageable with good engineering. Earlier models were impressive but required more hand-holding. Current models handle the hard cases -- hedging, negation, implicit relationships, cross-sentence dependencies -- with the consistency needed for production pipelines.
The second is API accessibility. Using a capable language model for extraction before the current generation of public APIs meant running your own infrastructure, which meant GPU clusters, model serving, and the full operational stack. This was possible for large organizations; it wasn't feasible for a researcher, a small company, or an individual practitioner. The existence of stable, affordable, well-documented APIs for capable models changes who can build this. You don't need infrastructure. You need an API key and a credit card to get started.
The third is the tooling ecosystem. The infrastructure for building knowledge graph pipelines -- graph databases with native support for the relevant query patterns, vector databases for similarity-based lookup, orchestration frameworks for multi-step LLM pipelines -- arrived, matured, and became accessible at roughly the same time as the models. A knowledge graph pipeline in 2020 would have required assembling a stack of immature tools and writing substantial infrastructure code. Today the stack exists and the components are well-documented.
These three things had to be true simultaneously, and they are now. That matters for the argument this book is making. This is not a forecast that something will soon be possible. This is a description of what is possible today, and the rest of the book is how to do it.
There is also a fourth consideration that is more speculative but worth naming. We are at an early moment in the adoption of this capability. The knowledge graphs that will have the largest impact on medicine, law, materials science, and other complex domains don't exist yet. The tooling is new enough that best practices are still being established. The organizations that will benefit most from these systems are still figuring out that this is possible. The researchers who will do the most interesting work with these systems haven't started yet.
Early maps of large territories are valuable precisely because they're early. What follows in this book is one such map, drawn from working code and real corpora. It's not complete. It's not the last word. But the territory is real, the tools are here, and the problems are interesting.
Chapter 7: The Free KG Cases¶
Not every knowledge graph requires extraction. Some are built from sources that are already structured when they arrive -- from lab instruments, databases, ontologies, or human-curated encyclopedias. Understanding these "free" cases sharpens the argument for extraction by making clear what the hard problem actually is. It also sets a quality benchmark: the graphs built without extraction tend to be high-precision over their coverage, precisely because the structure was already there. The extraction problem is the problem of accessing the knowledge that wasn't structured -- and in most interesting domains, that's the majority of what's known.
When You Don't Need Extraction¶
The central claim of this book is that extraction from unstructured text is the bottleneck that has limited knowledge graphs for decades, and that LLMs have finally made it tractable. But that claim only applies when the knowledge lives in unstructured form. In many situations, the knowledge is already structured when it reaches you, and converting it to a knowledge graph is simply mechanical reformatting, perhaps a short shell or Python script. Structured web sources (schema.org markup, government open data) and well-documented APIs fall into the same category.
Perhaps the data comes from lab equipment that outputs well-defined records -- a sequencer, a spectrometer, a sensor network. Perhaps it comes from a database table with columns and foreign keys. Perhaps it comes from an ontology or formal specification that was designed to be machine-readable from the start. In these cases, the mapping from source to graph is often straightforward: a short script, an ETL pipeline, or a direct translation of the source schema into graph form. No LLM is required. No extraction prompt. No ambiguity about what the text meant. The structure is given.
This chapter examines several such cases. The goal is not to dismiss them -- they are important and widely used -- but to clarify the boundary. When does a knowledge graph require extraction, and when does it not? The answer determines whether the techniques in the rest of this book apply to a reader's problem, and it also illuminates what extraction is for: bridging the gap between the knowledge that is already structured and the knowledge that is not.
The practical question is: which case are you in? A few signals help.
If your domain has established, maintained, machine-readable authorities -- ontologies, registries, curated databases -- and your knowledge questions can be answered from those sources alone, you may not need extraction at all. The authority is the graph, or close enough that a mapping script is all that stands between them.
If your domain's knowledge lives primarily in structured form but you need to connect across sources -- linking a genomics database to a drug database to a clinical outcomes registry -- you need identity resolution and schema alignment, but not LLM-based extraction. The challenge is integration, not interpretation.
If the knowledge you need is in the literature -- in the prose of papers, reports, case notes, specifications -- then extraction is not optional. The structured sources don't have it. The only way to get it into the graph is to read the text and interpret it, and that is what the rest of this book is about.
Misidentifying your case is expensive in both directions. Building an extraction pipeline for a domain where structured authorities already cover your needs wastes time and introduces noise you didn't have to create. Assuming a structured source covers your domain when the real knowledge lives in the prose means building a graph with systematic blind spots you may not notice until a user asks the wrong question.
Lab Instruments and Measured Data¶
Genomics provides the canonical example. DNA sequencers, mass spectrometers, and high-throughput assay platforms produce structured data almost by definition. A sequencer outputs base calls with quality scores; a protein-protein interaction assay outputs pairs of identifiers with confidence metrics. The data has a schema. It has identifiers that map to established ontologies. It can be loaded into a graph with minimal interpretation.
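To show how little interpretation such a load involves, here is a sketch that turns assay-style records into graph edges. The column names, the `interacts_with` predicate, and the identifiers `P1` to `P3` are made up for this example and do not reproduce any real dataset.

```python
import csv
import io

# Hypothetical export from an interaction assay: two identifiers and a confidence score.
RAW = """protein_a,protein_b,confidence
P1,P2,0.93
P1,P3,0.41
"""


def rows_to_edges(handle, min_confidence: float = 0.0):
    """Translate each record into a (subject, predicate, object, properties) edge."""
    edges = []
    for row in csv.DictReader(handle):
        score = float(row["confidence"])
        if score >= min_confidence:
            edges.append(
                (row["protein_a"], "interacts_with", row["protein_b"], {"confidence": score})
            )
    return edges


edges = rows_to_edges(io.StringIO(RAW), min_confidence=0.5)
```

Every decision in the script is a mapping decision (which column becomes which part of the edge). None of it requires reading prose.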
Graphs like STRING (protein-protein interactions), BioGRID (genetic and physical interactions), and IntAct (molecular interactions) are populated largely from such experimental measurements, although STRING also integrates text-mined and predicted associations. They aggregate data from thousands of published studies, but the aggregation is over structured datasets that the authors deposited in repositories -- not over the free text of the papers. The extraction problem, in this context, is largely solved by the experimental design: the scientist produces structured output, and the graph ingests it.
What these graphs are good for is clear. They support queries like "what proteins interact with BRCA1?" or "what pathways involve this gene?" with high precision, because the edges correspond to actual experimental observations. What they miss is equally important: the experiment that wasn't done, or wasn't published as structured data, or was described only in the discussion section of a paper, is not in the graph. The knowledge that lives in the prose -- the mechanistic interpretation, the caveats, the connections to other domains that the authors didn't formalize -- remains inaccessible. For many biomedical questions, that's the majority of what's known. Lab instruments and measured data give you a high-quality graph over a subset of the domain.
Generated and Synthetic Graphs¶
A related class of knowledge graphs is constructed from databases, ontologies, or formal specifications rather than from text or instruments. The Gene Ontology is a curated hierarchy of molecular function, biological process, and cellular component terms -- it is a graph by design, with typed relationships (is_a, part_of, regulates) between terms. Drug interaction databases like DrugBank or STITCH map drugs to targets, indications, and interactions using structured records. Legal code can be represented as a graph of sections, references, and amendments. In each case, the source was already formal enough that conversion to graph form is a matter of schema mapping, not interpretation.
The precision of these graphs is high. When a relationship is asserted in the Gene Ontology, it has been reviewed by curators and validated against evidence. When a drug-target interaction appears in DrugBank, it has been extracted and verified by the database maintainers. The boundary between a knowledge graph and a very well-structured relational database starts to blur here -- and that blurriness is instructive. A graph is often just a different view of the same data, with traversal and path-finding as the primary operations instead of joins and aggregations.
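The "traversal instead of joins" view can be illustrated with a small ancestor query over typed edges. The term identifiers below are placeholders rather than real ontology IDs, and `ancestors` is a hypothetical helper. Only the relationship names `is_a` and `part_of` come from the text above.

```python
# Hypothetical term identifiers; real ontology IDs and edges are not reproduced here.
EDGES = [
    ("term:C", "is_a", "term:B"),
    ("term:B", "is_a", "term:A"),
    ("term:D", "part_of", "term:B"),
]


def ancestors(term: str, edges, predicates=("is_a",)) -> set[str]:
    """Follow the chosen edge types upward from `term` and collect every term reached."""
    parents: dict[str, set[str]] = {}
    for subject, predicate, obj in edges:
        if predicate in predicates:
            parents.setdefault(subject, set()).add(obj)
    seen: set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


via_is_a = ancestors("term:C", EDGES)
via_both = ancestors("term:D", EDGES, predicates=("is_a", "part_of"))
```

The same rows could sit in a relational table. What the graph view adds is that "everything above this term, to any depth" is a single traversal rather than a chain of self-joins.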
The limitation is coverage. These graphs contain what was explicitly encoded. They do not contain what was left implicit, or what was stated in natural language and never formalized, or what was discovered after the last curation pass. For domains where the authoritative knowledge is already in structured form, that may be sufficient. For domains where the knowledge lives in the literature, it is not.
Curated Graphs at Scale¶
Wikidata, Freebase before it, and DBpedia represent a different model: human curation at scale. Millions of entities, millions of relationships, maintained by a community of contributors who add facts, correct errors, and resolve disputes through discussion and consensus. The result is a graph that spans many domains, with reasonable quality where the community has focused effort, and gaps where it has not.
A single query can retrieve structured information about a person, a place, a chemical, a historical event -- with identifiers that are stable, with relationships that are typed, with provenance that points to sources. For many applications, that is enough. The cost is the cost of human labor: curation does not scale to the full breadth of human knowledge, and it scales least well to domains where the knowledge is technical, specialized, or rapidly evolving. Encyclopedia articles can be curated. The full text of the medical literature cannot.
DBpedia illustrates the boundary clearly. It is extracted from Wikipedia infoboxes and structured elements -- but the extraction is from semi-structured templates, not from free prose. The infobox for a drug might have a "mechanism" field; the body of the article might have three paragraphs of nuanced explanation that never made it into the template. DBpedia has the former. It does not have the latter. Curated graphs at scale work where the community can structure the knowledge. They do not work where the knowledge lives in papers, reports, and documents that no one has the bandwidth to formalize.
What These Cases Teach Us¶
The common thread across lab instruments, generated graphs, and curated encyclopedias is this: structured sources give you high-precision graphs over the knowledge that was already structured. The extraction problem is precisely the problem of accessing the knowledge that wasn't -- which, in most interesting domains, is the majority of what's known.
The free KG cases set a quality benchmark worth aiming for. When a graph is built from structured sources, the edges tend to be correct, the entities tend to be well-identified, and the schema tends to be coherent. An extraction-based graph should aspire to that level of precision where it can. The gap between that benchmark and what extraction typically achieves is the gap that schema design, prompt engineering, and pipeline architecture are trying to close.
The free cases also illustrate the shape of the gap. It is not that extraction is impossible or that extracted graphs are inherently low quality. It is that extraction is a different kind of problem -- one that requires interpreting natural language, resolving ambiguity, and making judgments about what the text implies. The tools for that problem have improved dramatically. The problem itself has not gone away.
The gap is worth being specific about. A structured-source graph rarely has a wrong relationship direction -- the schema defines which way an edge runs and the data conforms. An extracted graph will have some inverted edges, particularly for asymmetric relationships stated in passive voice. A structured-source graph has stable entity identity -- the authority assigns the ID and it doesn't drift. An extracted graph has provisional entities that may be duplicated, merged incorrectly, or resolved differently across pipeline runs. A structured-source graph has relationships with known epistemic type -- a DrugBank interaction is always a curated assertion, not a model inference. An extracted graph mixes extraction confidence levels, and the metadata that tracks them requires deliberate schema design to preserve. None of these gaps are fatal. They are all addressable by the architecture in Part III. But they are real, and knowing they exist is what motivates the engineering choices ahead.
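Inverted edges are one of these gaps that a schema can help catch mechanically. The sketch below assumes each predicate declares the entity types it expects for subject and object. `SIGNATURES` and `check_direction` are hypothetical names, and the check only works when the two expected types differ: an inverted edge between two entities of the same type would pass unnoticed.

```python
# Hypothetical schema: predicate -> (expected subject type, expected object type).
SIGNATURES = {"treats": ("drug", "disease")}


def check_direction(edge, entity_types, signatures=SIGNATURES) -> str:
    """Classify an edge as 'ok', 'inverted', or 'invalid' against its predicate's signature."""
    subject, predicate, obj = edge
    actual = (entity_types[subject], entity_types[obj])
    expected = signatures[predicate]
    if actual == expected:
        return "ok"
    if actual == expected[::-1]:
        return "inverted"  # types match the signature read backwards
    return "invalid"


types = {"ibuprofen": "drug", "headache": "disease"}
forward = check_direction(("ibuprofen", "treats", "headache"), types)
flipped = check_direction(("headache", "treats", "ibuprofen"), types)
```

An edge flagged as inverted can be flipped or sent for review. Either way, the error is surfaced instead of silently entering the graph.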
Hybrid Approaches¶
Most real knowledge graphs combine extraction with structured sources. The typical pattern: extract entities and relationships from text, then link the extracted entities to authoritative identifiers from a curated database or ontology. A drug mention in a paper is resolved to an RxNorm code. A disease mention is resolved to a MeSH term. A gene mention is resolved to an HGNC identifier. The extraction provides the coverage -- the connections that appear in the literature. The authority lookup provides the identity -- the canonical form that makes those connections comparable across sources.
This hybrid approach does not make the extraction problem go away. It constrains it. Instead of inventing identifiers from scratch, the extractor (or a downstream resolution pass) maps to an existing vocabulary. That mapping is itself a form of extraction -- the model must recognize that "ketoconazole" in the text refers to the same thing as RxNorm's concept for ketoconazole -- but it is a narrower problem than inventing a full schema and populating it from scratch. The schema, or at least the entity vocabulary, is given by the authority. The extraction fills in the relationships and the provenance.
The medical literature example in Part III follows this pattern. Entities are extracted from papers and resolved to MeSH, HGNC, RxNorm, or provisional IDs when no authority match exists. The Gene Ontology and other ontologies provide a backbone for relationship types. The result is a graph that combines the coverage of extraction with the identity resolution of curated sources. It is not a free KG -- extraction is central -- but it is not a purely extracted KG either. The hybrid is the norm, and understanding both the extraction side and the structured side is necessary to build one well.
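The resolve-or-provision step can be sketched as a lookup with a fallback. The authority table and its `AUTH:` identifiers below are placeholders, not real RxNorm, MeSH, or HGNC codes, and `resolve` is a hypothetical helper. Exact-name matching is also an assumption made for brevity, since real resolution has to handle synonyms and spelling variants.

```python
import hashlib

# Hypothetical authority table: normalized name -> identifier.
# The IDs are placeholders, not real RxNorm, MeSH, or HGNC codes.
AUTHORITY = {"ketoconazole": "AUTH:0001", "breast cancer": "AUTH:0002"}


def resolve(mention: str, authority=AUTHORITY) -> tuple[str, bool]:
    """Return (identifier, is_provisional) for a text mention."""
    key = " ".join(mention.lower().split())
    if key in authority:
        return authority[key], False
    # Deterministic provisional ID: the same unmatched mention gets the same ID on every run.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return f"PROV:{digest}", True


known = resolve("Ketoconazole")
unknown = resolve("compound 17b")
```

Deriving the provisional ID from the normalized mention keeps unmatched entities stable across pipeline runs, and the boolean flag makes it easy to revisit them once an authority match becomes available.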
The Goal¶
The free KG cases are not a detour. They are the target.
The chapters that follow describe how to build a knowledge graph from unstructured text -- how to design a schema, run an extraction pipeline, resolve entity identity, track provenance, and serve the result. That engineering is substantial. But the reason to do it carefully is precisely that you want to arrive somewhere close to where the free cases already are: high-precision edges, stable entity identities, queryable provenance, confident relationship types. The structured sources set the standard. Extraction is the attempt to meet that standard in the domains where structured sources don't reach.
That gap -- between what the structured sources cover and what the literature knows -- is where the most interesting knowledge lives. The well-curated ontologies cover the established, the agreed-upon, the formalized. The literature covers the recent, the contested, the preliminary, the cross-disciplinary connection that nobody has formalized yet. A graph built from extraction can in principle reach all of it. Whether it reaches it at the quality level required for reliable reasoning depends on the decisions in Parts II and III.
This is the bridge the chapter is standing on. Behind it: decades of attempts to build knowledge representations, the bottleneck that stopped them, and the LLM-based tools that finally make extraction tractable. Ahead: the engineering of a system that tries to capture the knowledge in the text at a quality level worth reasoning over. The free cases show what that looks like when you get there.
Chapter 8: Designing Your Schema¶
This is the chapter that might surprise readers who came for the engineering. Schema design is not primarily a technical exercise. It is where decisions are made about what the domain is -- what counts as an entity, what kinds of relationships matter, what level of granularity is useful. These are epistemological decisions, and they determine everything downstream. A poorly designed schema will produce a graph that cannot support the reasoning it was built for, no matter how good the extraction pipeline. A well-designed schema makes extraction easier, validation straightforward, and evolution manageable.
Schema Design as Intellectual Work¶
The temptation is to treat schema design as a matter of picking entity types and relationship names from a list -- a checklist exercise that can be delegated or done quickly. That approach produces schemas that look reasonable on paper and fail in practice. The real work is understanding the domain well enough to know what distinctions matter for the questions the graph will answer.
Consider a medical literature knowledge graph. Should "dosage" be an entity type or a property of a relationship? If it's an entity, the graph can support queries like "what dosages of this drug have been studied?" and "how do recommended dosages vary across indications?" If it's a property, those queries become harder or impossible. The decision depends on what the graph is for. A graph built to support drug repurposing might not need dosage as an entity; a graph built to support clinical decision support might need it badly. There is no universal right answer. The schema must reflect the use case.
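The two modeling options can be put side by side in a small sketch. Everything here is illustrative: the class names, the `studied_at` and `studied_for` predicates, and the identifiers are assumptions made for the example, not part of any particular schema.

```python
from dataclasses import dataclass
from typing import Optional


# Option A: dosage as a property on the relationship.
@dataclass(frozen=True)
class TreatsEdge:
    drug: str
    disease: str
    dosage: Optional[str] = None  # free-text property, e.g. "400 mg"


# Option B: dosage as an entity, connected by ordinary edges.
@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    obj: str


def dosages_studied_a(drug, edges):
    """Option A: scan every edge and pull the property out."""
    return sorted({e.dosage for e in edges if e.drug == drug and e.dosage})


def dosages_studied_b(drug, edges):
    """Option B: follow 'studied_at' edges to dosage nodes."""
    return sorted(e.obj for e in edges
                  if e.subject == drug and e.predicate == "studied_at")


# Illustrative data only, not drawn from any source.
option_a = [
    TreatsEdge("drug:X", "disease:Y", dosage="200 mg"),
    TreatsEdge("drug:X", "disease:Z", dosage="400 mg"),
]
option_b = [
    Edge("drug:X", "studied_at", "dosage:200mg"),
    Edge("drug:X", "studied_at", "dosage:400mg"),
    Edge("dosage:400mg", "studied_for", "disease:Z"),
]
```

In option B the dosage node can carry its own edges, such as the link to an indication, which is what makes cross-indication queries a traversal rather than a scan over edge properties.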
The same applies to relationship types. "Inhibits" and "is associated with" are not interchangeable. One supports reasoning about mechanism; the other supports retrieval but not much else. The schema designer must decide how fine-grained the relationship vocabulary should be -- fine enough to support the intended reasoning, coarse enough that extraction can reliably distinguish the types. That balance is domain-specific and use-case-specific. It cannot be looked up in a reference. It has to be thought through.
Entities: What Gets to Be a Thing¶
The question of what deserves a node in the graph is foundational. Genes, drugs, diseases, symptoms, procedures, proteins, pathways, patient populations, study designs -- which of these are entities and which are properties of entities? The answer is neither obvious nor universal. It depends on the questions the graph is meant to answer.
A useful heuristic: an entity is something that can be referred to across multiple sources and that can participate in relationships as a subject or object. "Breast cancer" is an entity because it can be the subject of "is treated by" and the object of "targets." "50 mg" is typically a property -- it describes a dosage, but it does not itself have relationships to other things in the same way. The boundary is not always sharp. "Cohort of postmenopausal women" might be an entity if the graph needs to reason about study populations; it might be a property of a study if the graph only needs to retrieve studies by population criteria.
The medical literature example makes specific choices: Disease, Drug, Gene, Protein, Symptom, Procedure, Publication, and a few others. Evidence is an entity -- not just a property of a relationship, but a node with its own identity, because provenance needs to be queryable and traversable. The reasoning behind each choice is worth examining. Genes and proteins are both entities because the literature distinguishes them and the relationships differ -- a drug might target a protein but regulate a gene. Symptoms are entities because they connect to diseases in ways that matter for differential diagnosis. The schema reflects the structure of the domain as it appears in the sources.
Relationships: Meaning and Direction¶
The difference between "co-occurs with," "inhibits," "causes," and "is associated with" is enormous. Collapsing them into a single "related to" type produces a graph that can support retrieval but not reasoning. The graph can answer "what is connected to this drug?" but not "what does this drug inhibit?" or "what causes this symptom?" The relationship type carries the meaning. Lose the type, and you lose the ability to reason from it.
The tradeoff is between semantic precision and extraction recall. Relationship types that are too generic -- "associated with" -- are easy to extract and rarely wrong, but they are almost never useful. Relationship types that are too narrow -- "allosterically inhibits" versus "competitively inhibits" -- may be impossible for an extraction model to distinguish reliably. The right level is where the types are meaningfully distinct, expressible in natural language to the model, and actually present in the sources. That level varies by domain. It is discovered through iteration, not specified in advance.
Direction matters as well. "Drug A treats Disease B" and "Disease B is treated by Drug A" are the same fact, but in a directed graph they are different edges. Consistency across extraction runs and across sources requires that direction be part of the type definition -- not a convention but a constraint. The schema should specify, for each relationship type, what the subject and object types are and which way the relationship points. That discipline prevents the subtle errors that accumulate when direction is left implicit.
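One way to make direction a constraint rather than a convention is to attach subject and object types to each relationship type and check every extracted edge against them. The following is a minimal sketch under that assumption; `EntityType`, `RelationshipType`, and `orient` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    DRUG = "Drug"
    DISEASE = "Disease"
    PROTEIN = "Protein"


@dataclass(frozen=True)
class RelationshipType:
    name: str
    subject_type: EntityType
    object_type: EntityType


TREATS = RelationshipType("treats", EntityType.DRUG, EntityType.DISEASE)


def orient(rel, a, b):
    """Return (subject, object) in schema direction.

    a and b are (EntityType, identifier) pairs in whatever order the
    extractor produced them. Raises ValueError if neither order fits.
    """
    if (a[0], b[0]) == (rel.subject_type, rel.object_type):
        return a, b
    if (b[0], a[0]) == (rel.subject_type, rel.object_type):
        return b, a
    raise ValueError(
        f"{rel.name}: no valid direction for {a[0].value} and {b[0].value}"
    )


drug = (EntityType.DRUG, "drug:A")
disease = (EntityType.DISEASE, "disease:B")
```

Calling `orient(TREATS, disease, drug)` returns the pair with the drug first, so "Disease B is treated by Drug A" and "Drug A treats Disease B" land on the same edge. Type signatures can only repair direction when the subject and object types differ; for a relationship between two entities of the same type, direction has to come from the extraction itself.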
Hierarchy and Inheritance¶
When should entity types form a hierarchy, and when is a flat list better? A hierarchy allows "Protein" to be a subtype of "Gene product," which allows queries over all gene products to include proteins. It also adds complexity: the extractor must decide whether a mention is a protein or a gene product, and the schema must define the inheritance rules. For some domains, that complexity pays off. For others, a flat list of types is simpler and sufficient.
The temptation to over-ontologize is real. Elaborate taxonomies look impressive and suggest rigor. They also make extraction harder -- the model must choose among many similar types -- and they can make the graph harder to query if the hierarchy is deep and the inheritance semantics are unclear. The principle: add hierarchy only when it supports reasoning that a flat schema cannot. If the main use case is "find all entities of type X," and X is a leaf in the hierarchy, the hierarchy may not be earning its keep.
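What a hierarchy buys can be seen in a few lines. This sketch assumes a simple single-inheritance map from child type to parent type; the `PARENT` table, the helper names, and the identifiers are illustrative.

```python
# Child type -> parent type. None marks a root. Type names are illustrative.
PARENT = {
    "GeneProduct": None,
    "Protein": "GeneProduct",
    "Drug": None,
}


def is_a(entity_type, ancestor):
    """True if entity_type equals ancestor or inherits from it."""
    current = entity_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def find_by_type(entities, wanted):
    """entities: iterable of (identifier, entity_type) pairs."""
    return [ident for ident, etype in entities if is_a(etype, wanted)]


entities = [
    ("protein:P1", "Protein"),
    ("geneproduct:G1", "GeneProduct"),
    ("drug:D1", "Drug"),
]
```

A query for `"GeneProduct"` picks up the protein as well, while a query for the leaf type `"Protein"` behaves exactly as it would in a flat schema. If every query in practice targets leaf types, the `PARENT` table is adding complexity without adding capability.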
Provenance as a First-Class Schema Concern¶
Chapter 3 argued that provenance is essential -- that every claim in a knowledge graph needs to be traceable to its source, with confidence and evidence type. In schema design, that argument becomes concrete. How is provenance represented? Is it a property on each edge? A separate entity type? A linked structure of evidence nodes?
The choice affects everything downstream. If provenance is a property, it is easy to add but hard to query across sources. If it is an entity type, it becomes traversable: "show me all evidence for this relationship" becomes a graph query. The medical literature example uses evidence as an entity -- each piece of evidence is a node with its own identity, linked to the relationship it supports. That design makes it possible to aggregate evidence across sources, to filter by confidence, and to detect when two sources conflict.
Some fields are regretted later when they are not included from the start: source document, section, passage offset, extraction method, confidence score, and evidence type (e.g., RCT vs. case report). Adding them after the fact requires backfilling or accepting that older extractions lack them. Designing them in from the beginning avoids that debt.
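As a rough sketch of evidence modeled as an entity, the record below carries the fields listed above plus a link to the relationship it supports. The field names, the `supports` link, and the helper functions are assumptions for illustration, not the layout of any particular implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    evidence_id: str
    supports: str            # identifier of the relationship this backs
    source_document: str
    section: str
    passage_offset: int      # character offset of the passage in the source
    extraction_method: str
    confidence: float        # 0.0 to 1.0
    evidence_type: str       # e.g. "RCT" or "case_report"


def evidence_for(relationship_id, evidence, min_confidence=0.0):
    """All evidence nodes attached to one relationship, filtered by confidence."""
    return [e for e in evidence
            if e.supports == relationship_id and e.confidence >= min_confidence]


def distinct_sources(relationship_id, evidence):
    """Aggregate across sources: which documents back this relationship?"""
    return sorted({e.source_document
                   for e in evidence_for(relationship_id, evidence)})


# Illustrative records only.
store = [
    Evidence("ev1", "rel:1", "doc:A", "Results", 1200, "llm-prompt-v1", 0.9, "RCT"),
    Evidence("ev2", "rel:1", "doc:B", "Discussion", 340, "llm-prompt-v1", 0.4, "case_report"),
    Evidence("ev3", "rel:2", "doc:A", "Abstract", 15, "llm-prompt-v1", 0.7, "RCT"),
]
```

Because each evidence record has its own identity, "show me all evidence for this relationship" is a lookup by `supports`, and filtering by confidence or evidence type needs no change to the relationship itself.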
Designing for Extraction¶
Schema choices affect how easy or hard extraction is. Some relationship types are natural to express in language -- "treats," "causes," "inhibits" -- and extraction models handle them well. Others are awkward -- "has mechanism" or "exerts effect via" -- and the model may struggle to recognize them consistently. Entity types that are genuinely ambiguous even to a human reader -- is "oxidative stress" a process or a condition? -- will produce inconsistent extraction no matter how good the prompt.
The feedback loop between schema design and extraction quality is tight. A schema that is easy to extract against will produce better results. A schema that is hard to extract against will produce noise, and the natural response is to simplify the schema -- which may sacrifice the reasoning capability the graph was built for. The alternative is to iterate on both together: adjust the schema when extraction consistently fails on a type, and adjust the extraction prompt when the schema is right but the model is not. Finalizing the schema before extraction begins is often a mistake. The schema and the extraction prompt should co-evolve.
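One way to notice that extraction "consistently fails on a type" is to hand-review a sample of extracted relationships and compute precision per type. The sketch below assumes such a reviewed sample exists; the 0.8 threshold and the sample data are arbitrary choices for illustration.

```python
from collections import defaultdict


def per_type_precision(samples):
    """samples: iterable of (relationship_type, is_correct) pairs
    taken from a hand-reviewed sample of extractions."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rel_type, ok in samples:
        totals[rel_type] += 1
        correct[rel_type] += int(ok)
    return {t: correct[t] / totals[t] for t in totals}


def flag_types(precision, threshold=0.8):
    """Types whose precision falls below the (assumed) threshold."""
    return sorted(t for t, p in precision.items() if p < threshold)


# Illustrative review results only.
reviewed = [
    ("treats", True), ("treats", True), ("treats", True),
    ("has_mechanism", True), ("has_mechanism", False), ("has_mechanism", False),
]
```

A flagged type is a prompt to look closer, not a verdict: the fix may be a clearer definition in the prompt, or it may be that the type is not reliably distinguishable in the sources and the schema should change.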
Designing for Evolution¶
The schema will change. New entity types will be needed that were not anticipated. Relationship types will turn out to be too coarse or too fine. Distinctions that seemed unimportant will matter more than expected. The question is how to design so that evolution is manageable rather than catastrophic.
Versioning helps. When the schema changes, the extraction output format changes, and downstream consumers need to know which version they are looking at. A manifest or metadata field that records the schema version with each bundle or graph snapshot makes it possible to migrate incrementally and to reason about compatibility.
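A manifest of this kind can be very small. The sketch below assumes a "major.minor" version string and a compatibility policy in which a reader accepts bundles with the same major version and an equal or older minor version; both the format and the policy are assumptions made for the example.

```python
import json

SCHEMA_VERSION = "2.1"  # hypothetical "major.minor" scheme


def write_manifest(schema_version, entity_count, relationship_count):
    """Serialize the manifest that travels with a bundle or snapshot."""
    return json.dumps({
        "schema_version": schema_version,
        "entity_count": entity_count,
        "relationship_count": relationship_count,
    }, sort_keys=True)


def is_compatible(manifest_json, reader_version=SCHEMA_VERSION):
    """Assumed policy: same major version, and bundle minor <= reader minor."""
    bundle = json.loads(manifest_json)["schema_version"]
    bundle_major, bundle_minor = map(int, bundle.split("."))
    reader_major, reader_minor = map(int, reader_version.split("."))
    return bundle_major == reader_major and bundle_minor <= reader_minor
```

With the version recorded per bundle, a consumer can accept older compatible bundles as they are and route incompatible ones to a migration step instead of failing on unfamiliar fields.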
Migration is the hard part. Adding a new entity type is usually straightforward. Splitting an existing type, or changing the semantics of a relationship, can require backfilling or re-extraction. The argument for keeping the schema as simple as possible for as long as possible is practical: every entity type and relationship type is a commitment. Add only what is clearly needed. Defer the rest until the need is demonstrated.
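The difference between a cheap change and an expensive one shows up when planning a migration. In this sketch a renamed predicate is rewritten mechanically, while a split predicate is set aside for re-extraction because the old edge does not say which of the new types applies. The `RENAMED` and `SPLIT` tables and the predicate names are hypothetical.

```python
# Hypothetical schema diff. A rename maps one old predicate to one new one;
# a split maps one old predicate to several candidates.
RENAMED = {"assoc_with": "is_associated_with"}
SPLIT = {"inhibits": ["competitively_inhibits", "allosterically_inhibits"]}


def plan_migration(edges):
    """edges: list of (subject, predicate, object) triples.

    Returns (migrated, needs_reextraction). Renames are applied in place;
    split predicates are set aside because the old edge does not record
    which of the new types applies.
    """
    migrated, needs_reextraction = [], []
    for subj, pred, obj in edges:
        if pred in SPLIT:
            needs_reextraction.append((subj, pred, obj))
        else:
            migrated.append((subj, RENAMED.get(pred, pred), obj))
    return migrated, needs_reextraction


# Illustrative edges only.
old_edges = [
    ("drug:A", "treats", "disease:B"),
    ("gene:G", "assoc_with", "disease:B"),
    ("drug:A", "inhibits", "protein:P"),
]
```

The size of the `needs_reextraction` list is a direct measure of what a proposed schema change will cost, which is one practical reason to defer splits until the need is demonstrated.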
A single source of truth for domain specifics pays off here. In the medlit reference implementation, this is domain_spec.py: entity types, predicates, prompt instructions, and vocabulary guidance are all defined in one place, and the extraction prompt, validation logic, and deduplication rules all import from it. A change in one place propagates everywhere; there is no second copy to forget. Splitting these across YAML configs, Python code, and prompt templates invites drift: an entity type added to the schema is forgotten in the extraction prompt and produces silent gaps, or a predicate constraint tightened in the schema is not reflected in the dedup stage. One module, one edit, no drift. Readers building their own systems will find that maintaining a single domain specification -- whether in code, config, or a schema language -- removes the class of errors that come from inconsistent copies of the same information.
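The idea can be sketched in a few lines. The contents below are hypothetical and are not taken from the actual domain_spec.py in the medlit reference implementation, which may be organized differently; the point is only that the prompt builder and the validator read from the same definitions.

```python
# Hypothetical contents; the actual domain_spec.py in the medlit reference
# implementation may be organized differently.
ENTITY_TYPES = ("Disease", "Drug", "Gene", "Protein")

# predicate -> (subject type, object type)
PREDICATES = {
    "treats": ("Drug", "Disease"),
    "targets": ("Drug", "Protein"),
}


def build_prompt(text):
    """Derive the extraction prompt from the spec instead of hard-coding it."""
    lines = ["Extract entities of these types: " + ", ".join(ENTITY_TYPES) + "."]
    lines.append("Extract only these relationships:")
    for name, (subj, obj) in PREDICATES.items():
        lines.append(f"- {name}: {subj} -> {obj}")
    lines.append("Text: " + text)
    return "\n".join(lines)


def validate(subject_type, predicate, object_type):
    """Validate an extracted triple against the same spec the prompt used."""
    return PREDICATES.get(predicate) == (subject_type, object_type)
```

Adding a predicate to `PREDICATES` changes both the prompt text and what the validator accepts in the same edit. In a real system the prompt builder and validator would live in separate modules that import these definitions rather than sharing a file with them.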