Chapter 7: The Free KG Cases
Not every knowledge graph requires extraction. Some are built from sources that are already structured when they arrive -- from lab instruments, databases, ontologies, or human-curated encyclopedias. Understanding these "free" cases sharpens the argument for extraction by making clear what the hard problem actually is. It also sets a quality benchmark: graphs built without extraction tend to be high-precision within the territory they cover, precisely because the structure was already there. The extraction problem is the problem of accessing the knowledge that wasn't structured -- and in most interesting domains, that's the majority of what's known.
When You Don't Need Extraction
The central claim of this book is that extraction from unstructured text is the bottleneck that has limited knowledge graphs for decades, and that LLMs have finally made it tractable. But that claim only applies when the knowledge lives in unstructured form. In many situations, the knowledge is already structured when it reaches you, and getting it into a knowledge graph is mechanical reformatting, often no more than a short shell or Python script. Structured web sources (schema.org markup, government open data) and well-documented APIs fall into this category.
Perhaps the data comes from lab equipment that outputs well-defined records -- a sequencer, a spectrometer, a sensor network. Perhaps it comes from a database table with columns and foreign keys. Perhaps it comes from an ontology or formal specification that was designed to be machine-readable from the start. In these cases, the mapping from source to graph is often straightforward: a short script, an ETL pipeline, or a direct translation of the source schema into graph form. No LLM is required. No extraction prompt. No ambiguity about what the text meant. The structure is given.
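As a rough illustration of how mechanical that mapping can be, the sketch below turns two relational tables into graph triples. The table names, columns, identifiers, and predicate names are all invented for the example; a real source would bring its own schema.

```python
import sqlite3

# Hypothetical two-table schema: genes, and proteins that reference them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (id TEXT PRIMARY KEY, symbol TEXT);
    CREATE TABLE protein (
        id TEXT PRIMARY KEY,
        name TEXT,
        gene_id TEXT REFERENCES gene(id)
    );
    INSERT INTO gene VALUES ('gene:1', 'BRCA1');
    INSERT INTO protein VALUES ('prot:1', 'BRCA1 protein', 'gene:1');
""")

def rows_to_triples(conn):
    """Yield (subject, predicate, object) triples from the two tables."""
    for gene_id, symbol in conn.execute("SELECT id, symbol FROM gene"):
        # A plain column becomes an attribute triple.
        yield (gene_id, "has_symbol", symbol)
    for prot_id, name, gene_id in conn.execute(
        "SELECT id, name, gene_id FROM protein"
    ):
        yield (prot_id, "has_name", name)
        # The foreign key becomes an edge between two nodes.
        yield (prot_id, "encoded_by", gene_id)

triples = list(rows_to_triples(conn))
for triple in triples:
    print(triple)
```

Nothing here interprets anything: primary keys become node identifiers, columns become attributes, and foreign keys become edges.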
This chapter examines several such cases. The goal is not to dismiss them -- they are important and widely used -- but to clarify the boundary. When does a knowledge graph require extraction, and when does it not? The answer determines whether the techniques in the rest of this book apply to a reader's problem, and it also illuminates what extraction is for: bridging the gap between the knowledge that is already structured and the knowledge that is not.
The practical question is: which case are you in? A few signals help.
If your domain has established, maintained, machine-readable authorities -- ontologies, registries, curated databases -- and your knowledge questions can be answered from those sources alone, you may not need extraction at all. The authority is the graph, or close enough that a mapping script is all that stands between them.
If your domain's knowledge lives primarily in structured form but you need to connect across sources -- linking a genomics database to a drug database to a clinical outcomes registry -- you need identity resolution and schema alignment, but not LLM-based extraction. The challenge is integration, not interpretation.
If the knowledge you need is in the literature -- in the prose of papers, reports, case notes, specifications -- then extraction is not optional. The structured sources don't have it. The only way to get it into the graph is to read the text and interpret it, and that is what the rest of this book is about.
Misidentifying your case is expensive in both directions. Building an extraction pipeline for a domain where structured authorities already cover your needs wastes time and introduces noise you didn't have to create. Assuming a structured source covers your domain when the real knowledge lives in the prose means building a graph with systematic blind spots you may not notice until a user asks the wrong question.
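The integration case can be made concrete with a small sketch. It assumes two structured sources that name the same gene slightly differently and links them by a normalized symbol; the records, field names, and the `targets` predicate are hypothetical, and real identity resolution usually needs a curated cross-reference table rather than string normalization alone.

```python
# Hypothetical records from two structured sources with different conventions.
genomics_rows = [
    {"gene_id": "G-001", "symbol": "GENE_A"},
    {"gene_id": "G-002", "symbol": "GENE_B"},
]
drug_rows = [
    {"drug": "drug_x", "target_symbol": " gene_a "},
    {"drug": "drug_y", "target_symbol": "GENE_C"},  # no counterpart in source 1
]

def normalize(symbol):
    """Reduce superficial differences (case, padding) before matching."""
    return symbol.strip().upper()

def link_sources(genomics_rows, drug_rows):
    """Join the two sources on normalized symbol; report what fails to match."""
    by_symbol = {normalize(r["symbol"]): r["gene_id"] for r in genomics_rows}
    edges, unmatched = [], []
    for row in drug_rows:
        gene_id = by_symbol.get(normalize(row["target_symbol"]))
        if gene_id is None:
            unmatched.append(row)
        else:
            edges.append((row["drug"], "targets", gene_id))
    return edges, unmatched

edges, unmatched = link_sources(genomics_rows, drug_rows)
print(edges)
print(unmatched)
```

Keeping the unmatched rows visible matters: they mark where the two sources disagree about what exists, which is an integration problem and not an interpretation problem.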
Lab Instruments and Measured Data
Genomics provides the canonical example. DNA sequencers, mass spectrometers, and high-throughput assay platforms produce structured data almost by definition. A sequencer outputs base calls with quality scores; a protein-protein interaction assay outputs pairs of identifiers with confidence metrics. The data has a schema. It has identifiers that map to established ontologies. It can be loaded into a graph with minimal interpretation.
Graphs like STRING (protein-protein interactions), BioGRID (genetic and physical interactions), and IntAct (molecular interactions) draw heavily on such experimental measurements. They aggregate results from thousands of published studies, but the aggregation is largely over structured records, deposited by authors or entered by curators, rather than over machine interpretation of the papers' free text. (STRING also carries text-mined and predicted associations in separate evidence channels; the experimental channel is the relevant one here.) The extraction problem, in this context, is largely solved by the experimental design: the scientist produces structured output, and the graph ingests it.
What these graphs are good for is clear. They support queries like "what proteins interact with BRCA1?" or "what pathways involve this gene?" with high precision, because the edges correspond to actual experimental observations. What they miss is equally important: the experiment that wasn't done, or wasn't published as structured data, or was described only in the discussion section of a paper, is not in the graph. The knowledge that lives in the prose -- the mechanistic interpretation, the caveats, the connections to other domains that the authors didn't formalize -- remains inaccessible. For many biomedical questions, that's the majority of what's known. Lab instruments and measured data give you a high-quality graph over a subset of the domain.
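A query like "what proteins interact with BRCA1?" might look like the following once deposited records are loaded. This is a minimal sketch: the tab-separated layout, the placeholder protein names, and the confidence values are invented for illustration and do not reproduce any real database export.

```python
import csv
import io
from collections import defaultdict

# Invented tab-separated records standing in for a deposited interaction dataset.
RECORDS = """protein_a\tprotein_b\tconfidence
BRCA1\tBARD1\t0.99
BRCA1\tPROT_X\t0.40
PROT_Y\tPROT_Z\t0.85
"""

def load_interactions(text, min_confidence=0.0):
    """Build an undirected adjacency map, dropping low-confidence pairs."""
    graph = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        score = float(row["confidence"])
        if score < min_confidence:
            continue
        a, b = row["protein_a"], row["protein_b"]
        graph[a][b] = score
        graph[b][a] = score
    return graph

graph = load_interactions(RECORDS, min_confidence=0.5)
partners = sorted(graph["BRCA1"])
print(partners)
```

The loader makes one decision, the confidence threshold, and that decision is over a number the source already supplied. Anything the study only discussed in prose never appears in `RECORDS` and so never reaches the graph.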
Generated and Synthetic Graphs
A related class of knowledge graphs is constructed from databases, ontologies, or formal specifications rather than from text or instruments. The Gene Ontology is a curated hierarchy of molecular function, biological process, and cellular component terms -- it is a graph by design, with typed relationships (is_a, part_of, regulates) between terms. Drug interaction databases like DrugBank or STITCH map drugs to targets, indications, and interactions using structured records. Legal code can be represented as a graph of sections, references, and amendments. In each case, the source was already formal enough that conversion to graph form is a matter of schema mapping, not interpretation.
The precision of these graphs is high. When a relationship is asserted in the Gene Ontology, it has been reviewed by curators and validated against evidence. When a drug-target interaction appears in DrugBank, it has been curated and verified by the database maintainers. The boundary between a knowledge graph and a very well-structured relational database starts to blur here -- and that blurriness is instructive. A graph is often just a different view of the same data, with traversal and path-finding as the primary operations instead of joins and aggregations.
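To show what "traversal as the primary operation" means in practice, here is a small sketch that walks typed edges upward from a term. The term identifiers and edges are hypothetical, written in the style of an ontology, and are not copied from the Gene Ontology.

```python
from collections import deque

# Hypothetical typed edges: (child, relation, parent).
EDGES = [
    ("term:kinase_activity", "is_a", "term:catalytic_activity"),
    ("term:catalytic_activity", "is_a", "term:molecular_function"),
    ("term:nucleolus", "part_of", "term:nucleus"),
    ("term:nucleus", "is_a", "term:organelle"),
]

def ancestors(term, edges, relations=("is_a", "part_of")):
    """Return every term reachable from `term` by following the given relations."""
    parents = {}
    for child, rel, parent in edges:
        if rel in relations:
            parents.setdefault(child, []).append(parent)
    seen, queue = set(), deque([term])
    while queue:  # breadth-first walk up the hierarchy
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(ancestors("term:nucleolus", EDGES))
print(ancestors("term:nucleolus", EDGES, relations=("is_a",)))
```

The same data could sit in a relational table of three columns; the graph view earns its keep when the question is "everything above this term, following only these relation types", which is awkward to express as a fixed number of joins.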
The limitation is coverage. These graphs contain what was explicitly encoded. They do not contain what was left implicit, or what was stated in natural language and never formalized, or what was discovered after the last curation pass. For domains where the authoritative knowledge is already in structured form, that may be sufficient. For domains where the knowledge lives in the literature, it is not.
Curated Graphs at Scale
Wikidata, Freebase before it, and DBpedia represent a different model: human curation at scale. Millions of entities, millions of relationships, maintained by a community of contributors who add facts, correct errors, and resolve disputes through discussion and consensus. The result is a graph that spans many domains, with reasonable quality where the community has focused effort, and gaps where it has not.
A single query can retrieve structured information about a person, a place, a chemical, a historical event -- with identifiers that are stable, with relationships that are typed, with provenance that points to sources. For many applications, that is enough. The cost is the cost of human labor: curation does not scale to the full breadth of human knowledge, and it scales least well to domains where the knowledge is technical, specialized, or rapidly evolving. Encyclopedia articles can be curated. The full text of the medical literature cannot.
DBpedia illustrates the boundary clearly. It is extracted from Wikipedia infoboxes and structured elements -- but the extraction is from semi-structured templates, not from free prose. The infobox for a drug might have a "mechanism" field; the body of the article might have three paragraphs of nuanced explanation that never made it into the template. DBpedia has the former. It does not have the latter. Curated graphs at scale work where the community can structure the knowledge. They do not work where the knowledge lives in papers, reports, and documents that no one has the bandwidth to formalize.
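The difference between template extraction and prose extraction can be sketched in a few lines. The article text below is invented, the template syntax is simplified, and the parser is a toy written for this example, not DBpedia's actual extraction framework.

```python
# Simplified, invented article: a template block followed by free prose.
ARTICLE = """{{Infobox drug
| name = ExampleDrug
| mechanism = enzyme inhibitor
}}
ExampleDrug is thought to act through several pathways, although the
evidence for the second pathway is contested.
"""

def infobox_triples(article):
    """Turn `| key = value` lines inside the infobox into triples."""
    pairs, subject, inside = [], None, False
    for line in article.splitlines():
        line = line.strip()
        if line.startswith("{{Infobox"):
            inside = True
        elif line.startswith("}}"):
            inside = False
        elif inside and line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            key, value = key.strip(), value.strip()
            if key == "name":
                subject = value  # the name field identifies the entity
            else:
                pairs.append((key, value))
    return [(subject, key, value) for key, value in pairs]

triples = infobox_triples(ARTICLE)
print(triples)
```

The parser recovers the `mechanism` field and silently skips the two prose lines, including the caveat that the evidence is contested. That skipped text is exactly what the rest of this book is about.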
What These Cases Teach Us
The common thread across lab instruments, generated graphs, and curated encyclopedias is this: structured sources give you high-precision graphs over the knowledge that was already structured. The extraction problem is precisely the problem of accessing the knowledge that wasn't -- which, in most interesting domains, is the majority of what's known.
The free KG cases set a quality benchmark worth aiming for. When a graph is built from structured sources, the edges tend to be correct, the entities tend to be well-identified, and the schema tends to be coherent. An extraction-based graph should aspire to that level of precision where it can. The gap between that benchmark and what extraction typically achieves is the gap that schema design, prompt engineering, and pipeline architecture are trying to close.
The free cases also illustrate the shape of the gap. It is not that extraction is impossible or that extracted graphs are inherently low quality. It is that extraction is a different kind of problem -- one that requires interpreting natural language, resolving ambiguity, and making judgments about what the text implies. The tools for that problem have improved dramatically. The problem itself has not gone away.
The gap is worth being specific about. A structured-source graph rarely has a wrong relationship direction -- the schema defines which way an edge runs and the data conforms. An extracted graph will have some inverted edges, particularly for asymmetric relationships stated in passive voice. A structured-source graph has stable entity identity -- the authority assigns the ID and it doesn't drift. An extracted graph has provisional entities that may be duplicated, merged incorrectly, or resolved differently across pipeline runs. A structured-source graph has relationships with known epistemic type -- a DrugBank interaction is always a curated assertion, not a model inference. An extracted graph mixes extraction confidence levels, and the metadata that tracks them requires deliberate schema design to preserve. None of these gaps are fatal. They are all addressable by the architecture in Part III. But they are real, and knowing they exist is what motivates the engineering choices ahead.
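Two of these gaps, edge direction and epistemic type, can be illustrated with a small sketch. It assumes a hypothetical schema in which each predicate declares the entity types it runs from and to; the field names and the `check_direction` helper are invented for this example and are not the Part III design.

```python
from dataclasses import dataclass

# Hypothetical schema: each predicate declares (subject type, object type).
SCHEMA = {"inhibits": ("drug", "enzyme")}

@dataclass(frozen=True)
class Edge:
    subject: str
    subject_type: str
    predicate: str
    obj: str
    obj_type: str
    epistemic_type: str  # e.g. "curated" or "extracted"
    confidence: float

def check_direction(edge):
    """Return 'ok', 'inverted', or 'invalid' by comparing types to SCHEMA."""
    expected = SCHEMA.get(edge.predicate)
    actual = (edge.subject_type, edge.obj_type)
    if expected is None:
        return "invalid"
    if actual == expected:
        return "ok"
    if actual == expected[::-1]:
        return "inverted"
    return "invalid"

# "CYP3A4 is inhibited by ketoconazole" read naively, subject first.
passive_voice_edge = Edge(
    "CYP3A4", "enzyme", "inhibits", "ketoconazole", "drug", "extracted", 0.7
)
status = check_direction(passive_voice_edge)
print(status)
```

A type check like this only catches inversions between differently typed entities; a relationship between two drugs would need another signal. The `epistemic_type` and `confidence` fields travel with the edge so that a curated assertion and a model's reading of a sentence are never confused downstream.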
Hybrid Approaches
Most real knowledge graphs combine extraction with structured sources. The typical pattern: extract entities and relationships from text, then link the extracted entities to authoritative identifiers from a curated database or ontology. A drug mention in a paper is resolved to an RxNorm code. A disease mention is resolved to a MeSH term. A gene mention is resolved to an HGNC identifier. The extraction provides the coverage -- the connections that appear in the literature. The authority lookup provides the identity -- the canonical form that makes those connections comparable across sources.
This hybrid approach does not make the extraction problem go away. It constrains it. Instead of inventing identifiers from scratch, the extractor (or a downstream resolution pass) maps to an existing vocabulary. That mapping is itself a form of extraction -- the model must recognize that "ketoconazole" in the text refers to the same thing as RxNorm's concept for ketoconazole -- but it is a narrower problem than inventing a full schema and populating it from scratch. The schema, or at least the entity vocabulary, is given by the authority. The extraction fills in the relationships and the provenance.
The medical literature example in Part III follows this pattern. Entities are extracted from papers and resolved to MeSH, HGNC, RxNorm, or provisional IDs when no authority match exists. The Gene Ontology and other ontologies provide a backbone for relationship types. The result is a graph that combines the coverage of extraction with the identity resolution of curated sources. It is not a free KG -- extraction is central -- but it is not a purely extracted KG either. The hybrid is the norm, and understanding both the extraction side and the structured side is necessary to build one well.
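The resolution step with a provisional fallback might be sketched as follows. The vocabulary is a hypothetical two-entry slice, and the identifiers are placeholders, not real RxNorm or MeSH codes; a real resolver would query the authority and handle far messier variation than case and whitespace.

```python
import hashlib

# Hypothetical slice of an authority vocabulary with placeholder identifiers.
AUTHORITY = {
    "ketoconazole": "AUTH:0001",
    "nizoral": "AUTH:0001",  # brand-name synonym mapped to the same concept
}

def resolve(mention):
    """Map a text mention to an authority ID, or mint a provisional one."""
    key = " ".join(mention.lower().split())  # normalize case and whitespace
    if key in AUTHORITY:
        return AUTHORITY[key], "authority"
    # Hash the normalized mention so the provisional ID is stable across runs.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]
    return f"PROV:{digest}", "provisional"

print(resolve("Ketoconazole"))
print(resolve("Nizoral"))
print(resolve("novel compound"))
```

Deriving the provisional ID from the normalized mention is one simple way to keep unresolved entities from being numbered differently on every pipeline run. It does nothing to merge two different spellings of the same unknown entity, which remains a separate problem.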
The Goal
The free KG cases are not a detour. They are the target.
The chapters that follow describe how to build a knowledge graph from unstructured text -- how to design a schema, run an extraction pipeline, resolve entity identity, track provenance, and serve the result. That engineering is substantial. But the reason to do it carefully is precisely that you want to arrive somewhere close to where the free cases already are: high-precision edges, stable entity identities, queryable provenance, confident relationship types. The structured sources set the standard. Extraction is the attempt to meet that standard in the domains where structured sources don't reach.
That gap -- between what the structured sources cover and what the literature knows -- is where the most interesting knowledge lives. The well-curated ontologies cover the established, the agreed-upon, the formalized. The literature covers the recent, the contested, the preliminary, the cross-disciplinary connection that nobody has formalized yet. A graph built from extraction can in principle reach all of it. Whether it reaches it at the quality level required for reliable reasoning depends on the decisions in Parts II and III.
This is the bridge the chapter is standing on. Behind it: decades of attempts to build knowledge representations, the bottleneck that stopped them, and the LLM-based tools that finally make extraction tractable. Ahead: the engineering of a system that tries to capture the knowledge in the text at a quality level worth reasoning over. The free cases show what that looks like when you get there.