Chapter 8: Designing Your Schema
This is the chapter that might surprise readers who came for the engineering. Schema design is not primarily a technical exercise. It is where decisions are made about what the domain is -- what counts as an entity, what kinds of relationships matter, what level of granularity is useful. These are epistemological decisions, and they determine everything downstream. A poorly designed schema will produce a graph that cannot support the reasoning it was built for, no matter how good the extraction pipeline. A well-designed schema makes extraction easier, validation straightforward, and evolution manageable.
Schema Design as Intellectual Work
The temptation is to treat schema design as a matter of picking entity types and relationship names from a list -- a checklist exercise that can be delegated or done quickly. That approach produces schemas that look reasonable on paper and fail in practice. The real work is understanding the domain well enough to know what distinctions matter for the questions the graph will answer.
Consider a medical literature knowledge graph. Should "dosage" be an entity type or a property of a relationship? If it's an entity, the graph can support queries like "what dosages of this drug have been studied?" and "how do recommended dosages vary across indications?" If it's a property, those queries become harder or impossible. The decision depends on what the graph is for. A graph built for drug repurposing might not need dosage as an entity; a graph built for clinical decision support might need it badly. There is no universal right answer. The schema must reflect the use case.
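The two modeling options can be put side by side in a minimal sketch. Everything here is hypothetical: the class names, the `dosage_of` and `studied_for` predicates, and the placeholder identifiers are invented for illustration and are not taken from any reference implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TreatsEdge:
    """Option A: dosage is a property carried on the relationship."""
    drug: str
    disease: str
    dosage_mg: Optional[float] = None


@dataclass(frozen=True)
class Edge:
    """Option B: dosage is its own node, linked by ordinary typed edges."""
    subject: str
    predicate: str
    object: str


# Placeholder data; identifiers and values are made up for illustration.
property_edges = [
    TreatsEdge("drug_x", "disease_y", 20.0),
    TreatsEdge("drug_x", "disease_z", 40.0),
]
entity_edges = [
    Edge("dosage:drug_x:20mg", "dosage_of", "drug_x"),
    Edge("dosage:drug_x:20mg", "studied_for", "disease_y"),
    Edge("dosage:drug_x:40mg", "dosage_of", "drug_x"),
    Edge("dosage:drug_x:40mg", "studied_for", "disease_z"),
]


def dosages_by_indication(drug: str) -> dict:
    """Option B query: walk from the drug to its dosage nodes to indications."""
    dosage_nodes = {
        e.subject
        for e in entity_edges
        if e.predicate == "dosage_of" and e.object == drug
    }
    result = {}
    for e in entity_edges:
        if e.predicate == "studied_for" and e.subject in dosage_nodes:
            result.setdefault(e.object, []).append(e.subject)
    return result
```

In option A the dosage is reachable only by scanning edge attributes, and the dosage itself cannot carry further relationships. In option B it is a node like any other, at the cost of more nodes and edges to extract.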
The same applies to relationship types. "Inhibits" and "is associated with" are not interchangeable. One supports reasoning about mechanism; the other supports retrieval but not much else. The schema designer must decide how fine-grained the relationship vocabulary should be -- fine enough to support the intended reasoning, coarse enough that extraction can reliably distinguish the types. That balance is domain-specific and use-case-specific. It cannot be looked up in a reference. It has to be thought through.
Entities: What Gets to Be a Thing
The question of what deserves a node in the graph is foundational. Genes, drugs, diseases, symptoms, procedures, proteins, pathways, patient populations, study designs -- which of these are entities and which are properties of entities? The answer is neither obvious nor universal. It depends on the questions the graph is meant to answer.
A useful heuristic: an entity is something that can be referred to across multiple sources and that can participate in relationships as a subject or object. "Breast cancer" is an entity because it can be the subject of "is treated by" and the object of "targets." "50 mg" is typically a property -- it describes a dosage, but it does not itself have relationships to other things in the same way. The boundary is not always sharp. "Cohort of postmenopausal women" might be an entity if the graph needs to reason about study populations; it might be a property of a study if the graph only needs to retrieve studies by population criteria.
The medical literature example makes specific choices: Disease, Drug, Gene, Protein, Symptom, Procedure, Publication, and a few others. Evidence is an entity -- not just a property of a relationship, but a node with its own identity, because provenance needs to be queryable and traversable. The reasoning behind each choice is worth examining. Genes and proteins are both entities because the literature distinguishes them and the relationships differ -- a drug might target a protein but regulate a gene. Symptoms are entities because they connect to diseases in ways that matter for differential diagnosis. The schema reflects the structure of the domain as it appears in the sources.
Relationships: Meaning and Direction
The difference between "co-occurs with," "inhibits," "causes," and "is associated with" is enormous. Collapsing them into a single "related to" type produces a graph that can support retrieval but not reasoning. The graph can answer "what is connected to this drug?" but not "what does this drug inhibit?" or "what causes this symptom?" The relationship type carries the meaning. Lose the type, and you lose the ability to reason from it.
The tradeoff is between semantic precision and extraction recall. Relationship types that are too generic -- "associated with" -- are easy to extract and rarely wrong, but they are also rarely useful. Relationship types that are too narrow -- "allosterically inhibits" versus "competitively inhibits" -- may be impossible for an extraction model to distinguish reliably. The right level is where the types are meaningfully distinct, expressible in natural language to the model, and actually present in the sources. That level varies by domain. It is discovered through iteration, not specified in advance.
Direction matters as well. "Drug A treats Disease B" and "Disease B is treated by Drug A" are the same fact, but in a directed graph they are different edges. Consistency across extraction runs and across sources requires that direction be part of the type definition -- not a convention but a constraint. The schema should specify, for each relationship type, what the subject and object types are and which way the relationship points. That discipline prevents the subtle errors that accumulate when direction is left implicit.
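One way to make direction a constraint rather than a convention is to declare subject and object types alongside each predicate and check every extracted triple against that declaration. The sketch below assumes a simple in-memory table; `PredicateSpec`, `PREDICATES`, `check_triple`, and the specific type pairs are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredicateSpec:
    name: str
    subject_type: str  # the edge always points from subject to object
    object_type: str


# Hypothetical predicate table; the type pairs are illustrative only.
PREDICATES = {
    spec.name: spec
    for spec in [
        PredicateSpec("treats", "Drug", "Disease"),
        PredicateSpec("inhibits", "Drug", "Protein"),
        PredicateSpec("causes", "Disease", "Symptom"),
    ]
}


def check_triple(subject_type: str, predicate: str, object_type: str) -> list:
    """Return a list of problems; an empty list means the triple conforms."""
    spec = PREDICATES.get(predicate)
    if spec is None:
        return [f"unknown predicate: {predicate}"]
    problems = []
    if subject_type != spec.subject_type:
        problems.append(
            f"{predicate}: subject must be {spec.subject_type}, got {subject_type}"
        )
    if object_type != spec.object_type:
        problems.append(
            f"{predicate}: object must be {spec.object_type}, got {object_type}"
        )
    return problems
```

A reversed triple such as `check_triple("Disease", "treats", "Drug")` is reported as two problems instead of silently becoming a second, inconsistent edge.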
Hierarchy and Inheritance
When should entity types form a hierarchy, and when is a flat list better? A hierarchy allows "Protein" to be a subtype of "Gene product," which allows queries over all gene products to include proteins. It also adds complexity: the extractor must decide whether a mention is a protein or a gene product, and the schema must define the inheritance rules. For some domains, that complexity pays off. For others, a flat list of types is simpler and sufficient.
The temptation to over-ontologize is real. Elaborate taxonomies look impressive and suggest rigor. They also make extraction harder -- the model must choose among many similar types -- and they can make the graph harder to query if the hierarchy is deep and the inheritance semantics are unclear. The principle: add hierarchy only when it supports reasoning that a flat schema cannot. If the main use case is "find all entities of type X," and X is a leaf in the hierarchy, the hierarchy may not be earning its keep.
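What a hierarchy buys can be seen in a small sketch: a parent map plus an ancestor check lets a query over a supertype include its subtypes. The `PARENT` map, the type names, and the sample entities are assumptions made for illustration, and the map is assumed to be acyclic.

```python
# Hypothetical type hierarchy: each type maps to its parent, or None for a root.
PARENT = {
    "Protein": "GeneProduct",
    "GeneProduct": None,
    "Drug": None,
}


def is_a(entity_type: str, ancestor: str) -> bool:
    """True if entity_type equals ancestor or inherits from it (acyclic map assumed)."""
    current = entity_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


# Placeholder entities as (id, type) pairs.
entities = [("p1", "Protein"), ("g1", "GeneProduct"), ("d1", "Drug")]


def of_type(ancestor: str) -> list:
    """Return ids of all entities whose type is ancestor or a subtype of it."""
    return [eid for eid, etype in entities if is_a(etype, ancestor)]
```

If every query the graph serves looks like `of_type("Protein")`, where the type is a leaf, the parent map contributes nothing and a flat list of types would do.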
Provenance as a First-Class Schema Concern
Chapter 3 argued that provenance is essential -- that every claim in a knowledge graph needs to be traceable to its source, with confidence and evidence type. In schema design, that argument becomes concrete. How is provenance represented? Is it a property on each edge? A separate entity type? A linked structure of evidence nodes?
The choice affects everything downstream. If provenance is a property, it is easy to add but hard to query across sources. If it is an entity type, it becomes traversable: "show me all evidence for this relationship" becomes a graph query. The medical literature example uses evidence as an entity -- each piece of evidence is a node with its own identity, linked to the relationship it supports. That design makes it possible to aggregate evidence across sources, to filter by confidence, and to detect when two sources conflict.
The fields whose absence is most often regretted later are source document, section, passage offset, extraction method, confidence score, and evidence type (e.g., RCT vs. case report). Adding them after the fact requires backfilling or accepting that older extractions lack them. Designing them in from the beginning avoids that debt.
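A minimal sketch of evidence as an entity, carrying the fields listed above, might look like the following. The `Evidence` class, its field names, the helper functions, and the sample records are all hypothetical; a real schema would choose its own identifiers and storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    evidence_id: str        # the evidence node's own identity
    relationship_id: str    # the relationship this evidence supports
    source_document: str
    section: str
    passage_offset: int     # character offset of the supporting passage
    extraction_method: str
    confidence: float       # assumed here to lie in [0, 1]
    evidence_type: str      # e.g. "RCT" or "case_report"


# Placeholder records, invented for illustration.
evidence = [
    Evidence("ev1", "rel1", "doc_a", "Results", 1200, "llm_v1", 0.9, "RCT"),
    Evidence("ev2", "rel1", "doc_b", "Discussion", 340, "llm_v1", 0.4, "case_report"),
    Evidence("ev3", "rel2", "doc_a", "Abstract", 15, "llm_v1", 0.7, "RCT"),
]


def evidence_for(relationship_id: str, records: list, min_confidence: float = 0.0) -> list:
    """All evidence nodes supporting one relationship, filtered by confidence."""
    return [
        e for e in records
        if e.relationship_id == relationship_id and e.confidence >= min_confidence
    ]


def sources_for(relationship_id: str, records: list) -> list:
    """Distinct source documents behind a relationship, for cross-source checks."""
    return sorted({e.source_document for e in evidence_for(relationship_id, records)})
```

Because each record has its own identity, "show me all evidence for this relationship" is a lookup over nodes rather than a scan of edge properties.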
Designing for Extraction
Schema choices affect how easy or hard extraction is. Some relationship types are natural to express in language -- "treats," "causes," "inhibits" -- and extraction models handle them well. Others are awkward -- "has mechanism" or "exerts effect via" -- and the model may struggle to recognize them consistently. Entity types that are genuinely ambiguous even to a human reader -- is "oxidative stress" a process or a condition? -- will produce inconsistent extraction no matter how good the prompt.
The feedback loop between schema design and extraction quality is tight. A schema that is easy to extract from will produce better results. A schema that is hard to extract from will produce noise, and the natural response is to simplify the schema -- which may sacrifice the reasoning capability the graph was built for. The alternative is to iterate on both together: adjust the schema when extraction consistently fails on a type, and adjust the extraction prompt when the schema is right but the model is not. Finalizing the schema before extraction begins is often a mistake. The schema and the extraction prompt should co-evolve.
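Knowing when extraction "consistently fails on a type" requires some per-type measurement. One possible approach, sketched here, is to hand-review a sample of extracted relationships and compute a failure rate per predicate. The function names, the 0.3 threshold, and the sample data are assumptions for illustration, not a recommended setting.

```python
from collections import Counter


def failure_rates(reviewed) -> dict:
    """reviewed: iterable of (predicate, is_correct) pairs from a hand-checked sample."""
    totals, failures = Counter(), Counter()
    for predicate, is_correct in reviewed:
        totals[predicate] += 1
        if not is_correct:
            failures[predicate] += 1
    return {p: failures[p] / totals[p] for p in totals}


def flag_for_review(rates: dict, threshold: float = 0.3) -> list:
    """Predicates whose failure rate exceeds the (arbitrary) threshold."""
    return sorted(p for p, rate in rates.items() if rate > threshold)


# Placeholder review results, invented for illustration.
sample = (
    [("treats", True)] * 4 + [("treats", False)]
    + [("has_mechanism", True)] + [("has_mechanism", False)] * 3
)
rates = failure_rates(sample)
flagged = flag_for_review(rates)
```

A flagged predicate is a prompt for a decision, not an automatic change: the fix may belong in the schema or in the extraction prompt.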
Designing for Evolution
The schema will change. New entity types will be needed that were not anticipated. Relationship types will turn out to be too coarse or too fine. Distinctions that seemed unimportant will matter more than expected. The question is how to design so that evolution is manageable rather than catastrophic.
Versioning helps. When the schema changes, the extraction output format changes, and downstream consumers need to know which version they are looking at. A manifest or metadata field that records the schema version with each bundle or each graph snapshot makes it possible to migrate incrementally and to reason about compatibility.
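A manifest of this kind can be very small. The sketch below assumes a three-part version string and treats a matching major version as compatible; that convention, the field names, and the version number are assumptions, not a description of any particular system.

```python
import json

SCHEMA_VERSION = "2.1.0"  # hypothetical current schema version


def make_manifest(bundle_id: str, schema_version: str = SCHEMA_VERSION) -> dict:
    """Metadata stored alongside each extraction bundle or graph snapshot."""
    return {"bundle_id": bundle_id, "schema_version": schema_version}


def is_compatible(manifest: dict, reader_version: str = SCHEMA_VERSION) -> bool:
    """Assumed convention: bundles are readable when major versions match."""
    bundle_major = manifest["schema_version"].split(".")[0]
    reader_major = reader_version.split(".")[0]
    return bundle_major == reader_major


# The manifest is plain data, so it survives a JSON round trip unchanged.
manifest = make_manifest("bundle_001")
restored = json.loads(json.dumps(manifest))
```

A consumer can then check `is_compatible(restored)` before loading a bundle and route older bundles to a migration step instead of failing midway.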
Migration is the hard part. Adding a new entity type is usually straightforward. Splitting an existing type, or changing the semantics of a relationship, can require backfilling or re-extraction. The argument for keeping the schema as simple as possible for as long as possible is practical: every entity type and relationship type is a commitment. Add only what is clearly needed. Defer the rest until the need is demonstrated.
A single source of truth for domain specifics pays off here. In the medlit reference implementation, this is domain_spec.py: entity types, predicates, prompt instructions, and vocabulary guidance are all defined in one place. The extraction prompt, validation logic, and deduplication rules all import from it. A change in one place propagates everywhere; there is no second copy to forget. Splitting these across YAML configs, Python code, and prompt templates invites drift: an entity type added to the schema but forgotten in the extraction prompt produces silent gaps, or a predicate constraint tightened in the schema is not reflected in the dedup stage. One module, one edit, no drift. Readers building their own systems will find that maintaining a single domain specification -- whether in code, config, or a schema language -- reduces the class of errors that come from inconsistent copies of the same information.
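The idea can be illustrated with a small sketch in which both the prompt text and the validation check are derived from the same definitions. This is not the contents of the medlit domain_spec.py; the names `ENTITY_TYPES`, `PREDICATES`, `build_prompt_section`, and `is_valid_triple` are hypothetical, and the predicate list is illustrative.

```python
# Hypothetical single domain specification: every consumer reads from here.
ENTITY_TYPES = (
    "Disease", "Drug", "Gene", "Protein",
    "Symptom", "Procedure", "Publication", "Evidence",
)

# predicate name -> (subject type, object type); illustrative entries only
PREDICATES = {
    "treats": ("Drug", "Disease"),
    "inhibits": ("Drug", "Protein"),
}


def build_prompt_section() -> str:
    """Render the schema portion of an extraction prompt from the spec."""
    lines = ["Entity types: " + ", ".join(ENTITY_TYPES), "Relationship types:"]
    for name, (subject_type, object_type) in PREDICATES.items():
        lines.append(f"- {name}: {subject_type} -> {object_type}")
    return "\n".join(lines)


def is_valid_triple(subject_type: str, predicate: str, object_type: str) -> bool:
    """Validate an extracted triple against the same spec the prompt was built from."""
    return PREDICATES.get(predicate) == (subject_type, object_type)
```

Adding an entry to `PREDICATES` changes the prompt text and the validator together, which is the property the single-module design is after.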