Chapter 14: What This Makes Possible

The Three-Book Arc

Knowledge Graphs from Unstructured Text solves the extraction problem: how to get structured claims out of unstructured text at scale. This book solves the trustworthiness problem: how to ensure those claims are anchored to stable identities, sourced to their evidence, and aggregated correctly across sources. BFS-QL solves the interface problem: how to get those claims to a language model in a form it can actually reason over.

The three books are independent in the sense that each addresses a distinct problem. They are interdependent in the sense that each one's solution depends on the others being solved. An extraction pipeline without an identity server produces an unusable graph. An identity server without an extraction pipeline has nothing to process. A query protocol without a trustworthy graph is an interface to noise.

The identity server is the connective tissue. It is called by the extraction pipeline and queried by the query layer. It is the service that makes the graph trustworthy, and trustworthiness is what makes the system useful.

Cross-Domain Reasoning

The shared canonical ID infrastructure makes cross-domain reasoning possible in a way that was not previously practical. A graph built from biomedical literature can compose with a graph built from clinical trial data, a drug adverse event database, and a genomics resource -- not because any of these sources were designed to interoperate, but because they all anchor to the same authorities.

The biomedical community built MeSH, HGNC, RxNorm, and UniProt over decades for their own purposes: to organize their literature, to name their discoveries, to communicate across research groups. The identity server treats these authorities as the interoperability layer they accidentally became. The cross-domain reasoning capability is an emergent property of the decision to anchor to shared authorities, not a designed feature of any single system.
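
A minimal sketch of what "composing" means here, under stated assumptions: the `Claim` type, the `compose` function, and the identifiers are all hypothetical illustrations, not part of any system described in this book. The identifiers are authority-prefixed placeholders rather than real MeSH or RxNorm entries. Two graphs built independently share nothing except the canonical IDs they anchor to, and that turns out to be enough to merge them.

```python
from collections import defaultdict
from typing import NamedTuple


class Claim(NamedTuple):
    subject: str    # canonical ID (placeholder, not a real authority entry)
    predicate: str
    obj: str        # canonical ID (placeholder)
    source: str     # which graph asserted the claim


def compose(*graphs: list[Claim]) -> dict[tuple[str, str, str], set[str]]:
    """Merge claims from several graphs, keyed on canonical (s, p, o) triples.

    Claims that agree on all three canonical IDs collapse into one entry
    whose value is the set of sources that asserted it.
    """
    merged: dict[tuple[str, str, str], set[str]] = defaultdict(set)
    for graph in graphs:
        for claim in graph:
            merged[(claim.subject, claim.predicate, claim.obj)].add(claim.source)
    return dict(merged)


# Two independently built graphs (hypothetical data).
literature = [
    Claim("RxNorm:drug-x", "treats", "MeSH:disease-y", "literature"),
]
trials = [
    Claim("RxNorm:drug-x", "treats", "MeSH:disease-y", "trials"),
    Claim("RxNorm:drug-x", "causes", "MeSH:event-z", "trials"),
]

merged = compose(literature, trials)
```

In this sketch the `treats` claim ends up with two sources attached, because both graphs named the drug and the disease with the same canonical IDs. Had each graph used its own local names, the two claims would have stayed separate and the corroboration would have been invisible.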

Democratization and Its Limits

Building and maintaining a serious knowledge graph still requires significant resources. You need a corpus, which may be behind paywalls. You need compute for extraction, which costs money. You need domain expertise to design the schema and validate the output. The result is that the first generation of domain-spanning knowledge graphs will likely be built by those who can afford to build them -- pharmaceutical companies, large universities, government agencies, well-funded startups. The question of who gets access then becomes a question of licensing, openness, and governance.

The promise the technology holds out is real nonetheless. A researcher at a small institution, or in a developing country, with access to a comprehensive KG over their domain would have the same structural view of the literature as a researcher at a well-funded lab. The graph doesn't care who queries it. The capability to expose connections that citation networks hide, to ground an LLM in curated knowledge -- that capability could be democratized. The technology enables it; policy and incentive will decide whether it happens.

Grounding LLM Inference

The pattern that changes what a language model can do is this: instead of asking a model to reason from its training data, give it structured, typed, provenance-tracked claims from your graph and ask it to reason from those. The difference in reliability can be substantial, because each statement in the answer can be checked against a specific claim and its source. A model recalling from its training data and a model reasoning over a curated graph with explicit provenance are doing qualitatively different things, even if they look similar from the outside. This is the integration that makes a knowledge graph more than a database.

The mechanics are straightforward. A user asks a question. Your system retrieves relevant subgraphs -- entities and relationships that match the question's scope -- and injects them into the model's context. The model reasons over that context and produces an answer. The answer is grounded in the retrieved graph, not in the model's training. You can cite the sources. You can trace the reasoning path. When the graph is wrong, you fix the graph; you don't retrain the model.
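
The retrieve-and-inject steps above can be sketched as follows. This is an illustration under assumptions, not the BFS-QL protocol or any real model API: `Claim`, `retrieve_subgraph`, and `build_prompt` are hypothetical names, the graph is a small in-memory list, retrieval is a plain one-hop filter, and the call to the model itself is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: where the claim was asserted


def retrieve_subgraph(graph: list[Claim], entities: set[str]) -> list[Claim]:
    """Return claims that touch any in-scope entity (a one-hop neighborhood)."""
    return [c for c in graph if c.subject in entities or c.obj in entities]


def build_prompt(question: str, claims: list[Claim]) -> str:
    """Render claims as numbered, citable lines placed ahead of the question."""
    lines = [
        f"[{i}] {c.subject} {c.predicate} {c.obj} (source: {c.source})"
        for i, c in enumerate(claims, start=1)
    ]
    return (
        "Answer using only the numbered claims below, citing them by number.\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


# Hypothetical graph; entity and source names are placeholders.
graph = [
    Claim("drug-a", "treats", "disease-d", "paper-1"),
    Claim("gene-g", "associated_with", "disease-d", "paper-2"),
    Claim("drug-c", "treats", "disease-e", "paper-3"),
]

context = retrieve_subgraph(graph, {"disease-d"})
prompt = build_prompt("What is known about disease-d?", context)
```

Numbering the claims is the design choice that matters here: it gives the model something to cite, and it gives you a way to map each part of the answer back to a claim and its source. The claim about `disease-e` never reaches the model, so it cannot leak into the answer.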

Hypothesis Generation

Graph traversal can also serve as a discovery tool. The question is no longer "what do we know about X" but "what's adjacent to X that hasn't been studied," "what entities are structurally similar to X in the graph," or "what relationships exist between X and Y that no single paper asserts but that follow from combining multiple sources." These queries are impractical over raw text and natural over a well-constructed graph.

Consider a concrete example. Drug A treats disease D. Gene G is associated with disease D. Drug B modulates gene G. No single paper may state that drug B is worth testing for disease D. The inference follows from combining three relationships that exist in the graph. A researcher who had read all the relevant papers might make that connection; the graph makes it queryable. The results are candidate hypotheses -- drug-disease pairs that the graph implies but that may not have been studied together. The graph doesn't decide which are worth pursuing. It surfaces candidates that a human can filter and prioritize.
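
The example above can be written as a two-hop traversal. The sketch below assumes the graph is available as plain `(subject, predicate, object)` triples; the predicate names, the entity names, and the `candidate_hypotheses` function are hypothetical, chosen only to mirror the drug A, drug B, gene G, disease D example.

```python
from collections import defaultdict

# Hypothetical triples; entity names are placeholders, not real findings.
triples = [
    ("drug-a", "treats", "disease-d"),
    ("gene-g", "associated_with", "disease-d"),
    ("drug-b", "modulates", "gene-g"),
    ("drug-a", "modulates", "gene-g"),
]


def candidate_hypotheses(triples):
    """Return sorted (drug, gene, disease) paths where the drug modulates the
    gene, the gene is associated with the disease, and the graph holds no
    direct 'treats' edge between that drug and that disease."""
    treats = {(s, o) for s, p, o in triples if p == "treats"}

    # Index gene -> diseases so the second hop is a dictionary lookup.
    diseases_by_gene = defaultdict(set)
    for s, p, o in triples:
        if p == "associated_with":
            diseases_by_gene[s].add(o)

    found = set()
    for s, p, o in triples:
        if p == "modulates":
            for disease in diseases_by_gene.get(o, ()):
                if (s, disease) not in treats:
                    found.add((s, o, disease))
    return sorted(found)


print(candidate_hypotheses(triples))
# [('drug-b', 'gene-g', 'disease-d')]
```

Drug A also modulates gene G in this toy data, but it is filtered out because the graph already records that it treats disease D. What remains is a candidate, not a conclusion. Each result carries the intermediate gene so that a human reviewer can see why the pair was surfaced.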

What Has Changed

The extraction bottleneck that held back knowledge representation for fifty years is now broken. The epistemic commons -- the shared identifier infrastructure built by the biomedical, chemical, legal, and geographic communities -- has existed for decades. The identity server is the bridge between the two: the service that takes extracted mentions, anchors them to shared authorities, aggregates their evidence, and makes the resulting graph trustworthy.

The vision of machine reasoning over explicit, traceable, cross-domain knowledge -- a vision that animated researchers from McCarthy to Lenat to Berners-Lee -- is now achievable with tools that exist today, at a cost that is no longer prohibitive, for domains that matter.