Part IV: What It Makes Possible
Chapter 13: What Your Graph Can Do
The value of the graph lies in the grounded reasoning it makes possible, not in the serving layer. The graph doesn't tell the reasoning system what to conclude -- it offers evidence and lets the reasoning happen.
The Server Is Not the Point
This chapter is about what becomes possible once your graph exists, not about how to build a particular serving layer. You might expose your graph through a REST API, an MCP server, a force-directed visualization, or no server at all -- just load the bundle into memory and run Python scripts against it. The infrastructure choices are yours. What this chapter is really about is the capability space: what can a well-constructed knowledge graph actually do for someone?
That distinction matters because it's easy to conflate "I have a graph" with "I have a graph server." The graph is the data structure and the relationships it encodes. The serving layer is one way to expose it. The capabilities we're about to describe -- direct querying, visualization, grounding LLMs, hypothesis generation -- are capabilities of the graph. The serving layer is a delivery mechanism. Choose one that fits your users and your deployment constraints; don't let the choice obscure what the graph itself enables.
Direct Querying
The basics: entity lookup, relationship queries, graph neighborhood traversal. These are useful, and for many applications they are sufficient. The interesting design question isn't which API style (REST, GraphQL, something else) but what the right query primitives are for your domain. What questions will your users actually ask?
A biomedical researcher might ask: "What drugs are known to treat this condition?" "What genes are associated with this disease?" "What's the evidence for this drug-gene interaction?" A legal researcher might ask: "What cases cite this statute?" "What statutes does this case interpret?" The primitives that support these questions -- get entity by ID, get relationships of type X from entity Y, get N-hop neighborhood, filter by provenance -- are similar across domains. The semantics of what counts as a good answer differ. A drug-disease "treats" relationship in medicine has different evidentiary standards than a case-statute "cites" relationship in law. Your query interface should expose primitives that map cleanly to your domain's question types, not force users to translate their questions into a generic graph query language.
At minimum, you need: entity lookup (by name or canonical ID), relationship enumeration (what connects to this entity, and how), and some form of traversal (neighbors, N-hop expansion, path finding). Provenance-aware queries -- "give me this relationship and its sources" -- belong in the primitive set if your domain cares about evidence, which most serious domains do. Everything else is optimization.
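The minimum primitive set can be sketched in a few dozen lines. This is an illustration only, assuming an in-memory graph; the `Graph` and `Edge` classes, their field names, and the toy IDs are hypothetical and are not taken from any particular implementation.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    obj: str
    sources: tuple = ()  # provenance: IDs of documents asserting this edge


@dataclass
class Graph:
    names: dict = field(default_factory=dict)  # canonical ID -> display name
    out_edges: dict = field(default_factory=lambda: defaultdict(list))
    in_edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_entity(self, entity_id, name):
        self.names[entity_id] = name

    def add_edge(self, edge):
        self.out_edges[edge.subject].append(edge)
        self.in_edges[edge.obj].append(edge)

    def lookup(self, name_or_id):
        """Entity lookup by canonical ID or case-insensitive name."""
        if name_or_id in self.names:
            return name_or_id
        wanted = name_or_id.lower()
        for entity_id, name in self.names.items():
            if name.lower() == wanted:
                return entity_id
        return None

    def relationships(self, entity_id, predicate=None):
        """Edges touching the entity, optionally filtered by type.

        Each returned edge carries its sources, so provenance comes for free.
        """
        edges = self.out_edges.get(entity_id, []) + self.in_edges.get(entity_id, [])
        return [e for e in edges if predicate is None or e.predicate == predicate]

    def neighborhood(self, entity_id, hops):
        """IDs reachable within `hops` steps, ignoring edge direction."""
        seen = {entity_id}
        frontier = deque([(entity_id, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == hops:
                continue
            for e in self.relationships(node):
                other = e.obj if e.subject == node else e.subject
                if other not in seen:
                    seen.add(other)
                    frontier.append((other, depth + 1))
        return seen


# Toy data with placeholder names; nothing here is a real finding.
g = Graph()
g.add_entity("drug:A", "Drug A")
g.add_entity("disease:X", "Condition X")
g.add_entity("gene:G", "Gene G")
g.add_edge(Edge("drug:A", "treats", "disease:X", sources=("doc:42",)))
g.add_edge(Edge("gene:G", "associated_with", "disease:X", sources=("doc:7", "doc:9")))

drug = g.lookup("drug a")
treats = g.relationships("disease:X", predicate="treats")
two_hop = g.neighborhood(drug, hops=2)
```

Domain-specific question types ("what treats this condition?") then become thin wrappers over these three methods rather than queries the user has to compose.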
Graph Visualization
Chapter 9 recommended graph visualization as a diagnostic tool for pipeline development. What might visualization do for an end user exploring the graph?
A browsable, zoomable view of entities and relationships would let users navigate structure that would be tedious to reconstruct from query output. "Show me everything connected to this drug" produces a list; a force-directed layout produces a picture where clusters, bridges, and outliers are visible at a glance. For exploration and discovery -- "what's in this neighborhood?" "what connects these two things?" -- visualization often beats tabular output. The implementation cost is modest if you already have the query primitives; for users who think spatially about their domain, it may be the most natural interface you offer.
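One low-cost route, offered as a sketch rather than a recommendation, is to emit Graphviz DOT text for a neighborhood and let an existing layout engine draw it. The `to_dot` helper below is hypothetical and assumes edges arrive as `(subject, predicate, object)` triples of display names.

```python
def to_dot(edges, focus=None):
    """Render (subject, predicate, object) triples as Graphviz DOT text."""
    def quote(text):
        # Escape backslashes and double quotes so names are valid DOT strings.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph neighborhood {"]
    if focus is not None:
        # Draw the entity the user started from as a box so it stands out.
        lines.append(f"  {quote(focus)} [shape=box];")
    for subject, predicate, obj in edges:
        lines.append(f"  {quote(subject)} -> {quote(obj)} [label={quote(predicate)}];")
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(
    [("Drug A", "treats", "Condition X"), ("Gene G", "associated_with", "Condition X")],
    focus="Drug A",
)
```

Saved to a file, the output can be rendered with one of Graphviz's force-directed engines such as `fdp` or `neato`. An interactive, zoomable browser would need a real front end, but the data it consumes has the same shape.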
MCP as the Integration Point
The Model Context Protocol is worth understanding as an architectural pattern, not just as a specific technology. The idea is that a knowledge graph should be a first-class context source for LLM-based systems -- something that agents, assistants, and reasoning pipelines can query as naturally as a human researcher would reach for a reference database. Whether you use MCP specifically or some other integration approach, the principle is sound: your graph is most powerful when it's actively grounding inference, not sitting passively waiting to be queried by humans.
MCP defines a standard way for AI systems to discover and call tools. A knowledge graph exposed as an MCP server offers tools like "find entities matching this query," "get the neighborhood of this entity," "retrieve relationships of type X." An LLM-powered assistant with access to that server can answer domain questions by querying the graph, synthesizing the results, and citing the sources. The user gets an answer grounded in your curated knowledge rather than in the model's training distribution. The integration is loose: the model doesn't need to know your schema in advance; it discovers the available tools and uses them. That looseness is a feature. Because the contract is machine-readable, the reasoning layer can adapt to schema changes without code changes on the consumer side. Your graph can evolve without breaking every consumer.
MCP represents a qualitative shift in how the graph participates in inference. What MCP provides that a typical REST API doesn't, and that GraphQL offers only through schema introspection, is discoverability designed for agents: tools are self-describing, typed, and enumerable at runtime. An agent can query the server to learn what tools exist and what they do, without being pre-programmed for your specific schema. That makes the graph a first-class active participant in agentic reasoning rather than a passive endpoint that some human wired up in advance.
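To make discoverability concrete, here is a sketch of the pattern in plain Python. It does not use an MCP SDK and is not the protocol's wire format. It only mimics the shape of MCP's list-tools and call-tool requests, and the registry, decorator, field names, and toy data are all hypothetical.

```python
TOOLS = {}


def tool(name, description, input_schema):
    """Register a function together with a machine-readable description."""
    def register(fn):
        TOOLS[name] = {
            "description": description,
            "input_schema": input_schema,
            "handler": fn,
        }
        return fn
    return register


# Toy edges with placeholder names.
EDGES = [
    ("drug:A", "treats", "disease:X"),
    ("gene:G", "associated_with", "disease:X"),
]


@tool(
    name="get_relationships",
    description="Return edges touching an entity, optionally filtered by type.",
    input_schema={
        "type": "object",
        "properties": {
            "entity_id": {"type": "string"},
            "predicate": {"type": "string"},
        },
        "required": ["entity_id"],
    },
)
def get_relationships(entity_id, predicate=None):
    return [
        {"subject": s, "predicate": p, "object": o}
        for s, p, o in EDGES
        if entity_id in (s, o) and (predicate is None or p == predicate)
    ]


def list_tools():
    """What an agent sees at runtime: names, descriptions, input schemas."""
    return [
        {"name": name, "description": spec["description"], "input_schema": spec["input_schema"]}
        for name, spec in TOOLS.items()
    ]


def call_tool(name, arguments):
    """Dispatch a call by tool name, as an agent would after discovery."""
    return TOOLS[name]["handler"](**arguments)


catalog = list_tools()
answer = call_tool("get_relationships", {"entity_id": "drug:A"})
```

The consumer never imports `get_relationships` directly. It reads the catalog, picks a tool by description, and calls it by name, which is why adding or renaming a tool on the server does not require a code change on the consumer side.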
If you decide not to use MCP, the graph still needs to be queryable by whatever system is doing the reasoning. REST, GraphQL, or a custom API all work, though they are not discoverable in the same way. The architectural point is that the graph should be available to the reasoning layer, not a separate system that humans query manually. Passive retrieval -- human runs query, copies result, pastes into chat -- is a fallback. Active grounding -- the reasoning system queries the graph as part of generating its answer -- is the target.
BFS Queries
The medlit implementation uses a JSON-based breadth-first search query language designed for LLM friendliness and context-window efficiency. The key design insight is that topology and presentation are orthogonal: BFS from seed nodes determines which nodes and edges are in the subgraph, while node and edge filters control only how much metadata each item carries in the response. Non-matching items appear as stubs rather than being omitted, so the LLM always sees an accurate picture of the graph's shape.
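The separation of topology from presentation can be illustrated with a short sketch. The function below is not the medlit query language. Its signature, the `stub` marker, and the dictionary layout are assumptions made for illustration only.

```python
from collections import deque


def bfs_subgraph(nodes, edges, seeds, max_hops, node_filter, edge_filter):
    """nodes: id -> metadata dict; edges: list of (src, dst, metadata dict)."""
    adjacency = {}
    for src, dst, _ in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    # Topology: a plain BFS from the seeds. The filters play no part here.
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)

    # Presentation: matching items carry metadata, the rest become stubs.
    out_nodes = []
    for node_id in depth:
        if node_filter(nodes[node_id]):
            out_nodes.append({"id": node_id, **nodes[node_id]})
        else:
            out_nodes.append({"id": node_id, "stub": True})

    out_edges = []
    for src, dst, meta in edges:
        if src in depth and dst in depth:
            if edge_filter(meta):
                out_edges.append({"src": src, "dst": dst, **meta})
            else:
                out_edges.append({"src": src, "dst": dst, "stub": True})
    return {"nodes": out_nodes, "edges": out_edges}


# Toy graph with placeholder IDs.
toy_nodes = {
    "a": {"type": "drug"},
    "b": {"type": "disease"},
    "c": {"type": "gene"},
    "d": {"type": "gene"},
}
toy_edges = [
    ("a", "b", {"predicate": "treats"}),
    ("c", "b", {"predicate": "associated_with"}),
    ("c", "d", {"predicate": "interacts_with"}),
]
result = bfs_subgraph(
    toy_nodes, toy_edges, seeds=["a"], max_hops=2,
    node_filter=lambda n: n["type"] == "drug",
    edge_filter=lambda e: e["predicate"] == "treats",
)
```

Changing either filter changes how much each item says about itself, never which items appear, so the consumer can trust the shape of the response while spending context only on the metadata it asked for.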
The full query format, response format, field reference, worked examples, and LLM prompt template are in the companion volume BFS-QL: Graph Queries for Language Models.
Chapter 14: The Augmented Researcher
What Machines Would See That We Can't
Consider confirmation bias. A researcher with a hypothesis tends to notice evidence that supports it and to underweight evidence that doesn't. This isn't a character flaw; it's how attention works. When you're reading papers one at a time, your prior beliefs shape what you notice, what you remember, and what you connect. A graph doesn't have prior beliefs. It encodes what the literature asserts, and a traversal query doesn't care whether the result confirms or contradicts your favorite theory. The graph surfaces connections that a human reader, biased toward coherence with existing beliefs, might have skimmed past. That doesn't make the graph right and the human wrong. It makes them different. The graph offers a view that isn't filtered through a single researcher's expectations.
Prestige bias works similarly. A finding from a famous lab or a high-impact journal gets more attention than the same finding from an unknown group or a niche venue. Citation networks amplify this: papers that are already well-cited get cited more, a feedback loop known as the Matthew effect. A knowledge graph built from a broad corpus can include relationships from papers that nobody cites. The graph doesn't know which papers are prestigious. It knows which relationships were extracted. A query over the graph can expose a connection that appeared in an obscure regional journal twenty years ago and was never picked up by the mainstream literature. Again, that doesn't make the obscure paper right. It makes it visible in a way that citation-based discovery systematically hides it.
Recency bias is the flip side. Newer work gets more attention than older work, partly because it's easier to find and partly because the field has collectively decided that recent results matter more. But important findings sometimes sit in the literature for decades before someone connects them to a new context. A graph that spans the full temporal range of a corpus can surface those connections. "What did we know about X in 1990?" is a query that citation networks handle poorly -- they tend to show you what's cited now, which skews recent -- but a graph can answer it directly.
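A question like "what did we know about X in 1990?" reduces to a provenance filter, provided each edge records the publication year of its sources. The sketch below assumes that it does; the field names and the toy records are hypothetical.

```python
def known_as_of(edges, entity, year):
    """Edges touching `entity` that at least one source had asserted by `year`."""
    results = []
    for edge in edges:
        if entity not in (edge["subject"], edge["object"]):
            continue
        early = [s for s in edge["sources"] if s["year"] <= year]
        if early:
            # Keep only the sources that existed at the cutoff.
            results.append({**edge, "sources": early})
    return results


# Placeholder records; the years and IDs are made up for illustration.
records = [
    {"subject": "X", "predicate": "inhibits", "object": "Y",
     "sources": [{"doc": "doc:1", "year": 1987}, {"doc": "doc:2", "year": 2011}]},
    {"subject": "X", "predicate": "associated_with", "object": "Z",
     "sources": [{"doc": "doc:3", "year": 2004}]},
]
snapshot = known_as_of(records, "X", 1990)
```

Here the snapshot contains the first relationship with only its 1987 source, and omits the second, which no source had asserted yet.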
The point is not that machines are unbiased. Extraction has its own biases: it favors what the model was trained on, what the schema captures, what the prompts elicit. The point is that the biases are different. A human reading the literature and a graph traversing the same literature will expose different patterns. The augmented researcher has access to both views.
The Combinatorial Argument
A graph with N entities and R relationship types has on the order of N² × R possible typed pairwise connections. Most of those don't exist; the graph is sparse. But the space of potential connections -- pairs of entities that could be related, that might be worth investigating -- is enormous. A human researcher can survey a tiny fraction of it. A graph can enumerate it.
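To get a feel for the scale, here is the arithmetic for a made-up graph. The entity count, relation-type count, and observed edge count are illustrative assumptions, not measurements from any real corpus.

```python
def candidate_space(num_entities, num_relation_types):
    """Ordered pairs of distinct entities, times the number of relation types."""
    return num_entities * (num_entities - 1) * num_relation_types


# Illustrative numbers only.
n, r, observed_edges = 100_000, 20, 2_000_000

possible = candidate_space(n, r)
density = observed_edges / possible

print(f"{possible:,}")  # 199,998,000,000
```

Even with two million extracted edges, this hypothetical graph fills roughly one in a hundred thousand of its possible typed connections. The rest is the space in which candidate relationships live.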
The combinatorial argument is that important discoveries often live at the intersection of things that were known separately but never connected. Drug A was studied for condition X. Pathway B was studied in context Y. Nobody looked at A and B together because the relevant papers were in different subfields, published in different decades, or written in different languages. The connection was always possible in principle; it just required someone to look. A graph that spans both subfields can expose "A modulates B" as a candidate relationship -- either one that exists in the literature but wasn't connected, or one that the graph implies from combining multiple sources. The researcher's job becomes evaluating candidates rather than generating them from scratch. The graph does the combinatorial explosion; the human does the judgment.
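The simplest form of this candidate generation looks for pairs of entities that share a neighbor but have no direct edge. It is often called the ABC pattern in literature-based discovery. The sketch below is a naive version under stated assumptions: edges are treated as undirected, relation types are ignored, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def bridge_candidates(edges):
    """Pairs (a, c) linked through some b but never linked directly."""
    neighbors = defaultdict(set)
    for subject, _, obj in edges:
        neighbors[subject].add(obj)
        neighbors[obj].add(subject)

    candidates = defaultdict(set)  # (a, c) -> set of bridging entities
    for b, linked in list(neighbors.items()):
        ordered = sorted(linked)
        for i, a in enumerate(ordered):
            for c in ordered[i + 1:]:
                if c not in neighbors[a]:
                    candidates[(a, c)].add(b)

    # Rank by number of distinct bridges, a crude proxy for interest.
    return sorted(candidates.items(), key=lambda item: (-len(item[1]), item[0]))


# Placeholder entities; none of these relationships is a real finding.
toy_edges = [
    ("drug:A", "modulates", "pathway:B"),
    ("pathway:B", "implicated_in", "disease:X"),
    ("drug:A", "binds", "protein:P"),
    ("protein:P", "implicated_in", "disease:X"),
]
ranked = bridge_candidates(toy_edges)
```

In the toy data, the drug and the disease are never connected directly, yet two independent bridges link them, so that pair is among the top-ranked candidates. A real system would weight bridges by relation type and evidence quality, but the division of labor is the same: the traversal proposes, the researcher evaluates.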
Structural analogies across disciplines extend this. A relationship pattern that holds in one domain might hold in another. "Compound X inhibits enzyme Y" in biochemistry suggests "inhibitor of Y" as a search strategy in drug discovery. "Gene G is associated with disease D" in genetics suggests "genes in the same pathway as G" as candidates for D. The graph encodes structure; structural similarity queries exploit it. A researcher who knows one domain well can use the graph to find analogous patterns in domains they know less well. The graph doesn't replace domain expertise. It extends the reach of that expertise across a larger structure than any one person could hold in their head.
Linguistic and Geographic Blind Spots
The scientific literature is not evenly distributed. A disproportionate share of what gets read, cited, and built upon is published in English, from institutions in North America and Europe, in journals that Western researchers routinely check. That's not a conspiracy; it's the cumulative effect of where funding flows, where training happens, and how citation networks form. The result is that a researcher following the standard literature is systematically missing work from other languages, other regions, and other publication venues.
Citation networks encode and amplify this. If you discover papers by following citations, you stay within the citation graph. Papers that nobody in your network cites are invisible to you. They might as well not exist. A knowledge graph built from a genuinely broad corpus -- including non-English sources, regional journals, preprints, and gray literature -- can expose relationships that the citation network never connects. The graph doesn't care that a paper was published in Portuguese or in a journal with an impact factor of 0.5. It cares that the extraction found a relationship. A query over that graph can return results that would never appear in a citation-based search.
This isn't a panacea. Extraction quality varies by language and by how well the source matches the model's training distribution. Building a graph that truly spans the global literature requires deliberate effort: multilingual extraction, diverse source selection, and care that the pipeline doesn't silently drop or degrade non-standard inputs. But the capability is there. A well-constructed KG with broad sourcing can surface what citation networks systematically miss. For domains where important work happens outside the mainstream -- rare diseases, regional health issues, indigenous knowledge, applied research in developing countries -- that capability matters.
The Robot Scientist
In 2009, a team at Aberystwyth University published results from a system they called Adam. Adam was a robotic scientist that reasoned from a knowledge graph of yeast biology, formulated hypotheses about the function of specific genes, designed experiments to test those hypotheses, ran the experiments using a robotic lab, and updated its beliefs from the results. The loop was fully autonomous. Adam identified, from the graph, genes with unknown function; inferred, from structural and pathway relationships, what those functions might be; and confirmed several of its predictions experimentally. It was the scientific method, formalized and automated. No human was in the loop between hypothesis formation and experimental confirmation.
Eve extended the pattern to drug discovery. The same loop -- reason from the graph, form hypotheses about drug-target interactions, test them -- was applied to the problem of identifying compounds that might be effective against specific pathogens. Eve was not looking for candidates in the way a drug discovery pipeline looks for candidates. It was reasoning over a structured knowledge representation, traversing relationships between compounds, targets, and biological processes, and identifying implications of those relationships that hadn't been tested.
What Adam and Eve demonstrated was that autonomous scientific reasoning is achievable, given a rich enough knowledge representation. The bottleneck wasn't the reasoning -- the inference, the experimental loop, the belief updating. The bottleneck was getting the knowledge in. Adam's knowledge graph was narrow: yeast biology, curated by domain experts, sufficient for inference within that domain. Eve's graph was broader but still hand-constructed. Building the knowledge representation required a team of domain experts working by hand for months. That meant the approach was confined to domains where someone had already done that work. Everywhere else, the graph didn't exist, and neither did the possibility of automating the reasoning over it.
That bottleneck is gone. The machinery in Part III -- extraction from literature, identity resolution, provenance tracking, hypothesis generation as graph traversal -- is the machinery that lets an Adam-like system scale beyond a single hand-curated domain. A graph spanning drug discovery, disease biology, and chemical space, built from the literature rather than manually encoded, could generate hypotheses connecting compounds, targets, and indications across literatures that no single human could synthesize. The representation was always the limiting factor. The tools to build the representation now exist.
The honest answer about where we are: close enough to see the path, not close enough to declare victory. We have extraction that works at scale. We have identity resolution. We have provenance that supports evidence-weighted reasoning. What we don't yet have is the full autonomous loop -- automated experiment design, robotic execution, belief updating from results -- deployed across arbitrary domains. The wet-lab part remains a different kind of engineering problem. But the representation was always the bottleneck. Once the graph exists, the rest is engineering. And what that engineering might enable -- systems that expose what the literature already implies but hasn't yet connected, in domains where the literature is too scattered for any individual researcher to see the full picture -- is reason enough to take this seriously.
Chapter 15: Consequences
Compressed Discovery Timelines
In drug discovery, the bottleneck is often synthesis -- not of molecules, but of knowledge. A promising target emerges from basic research. The relevant literature spans decades, multiple disciplines, and hundreds of papers. Someone has to read it, extract the key relationships, and figure out what's known, what's contested, and what's missing. That synthesis can take months. A functioning extraction pipeline and a well-constructed graph can compress it to days. The same is true in rare disease research, where the literature is scattered across case reports, small studies, and patient advocacy publications. And in materials science, where the space of possible compounds is vast and the literature connecting structure to properties is fragmented. In each of these domains, the bottleneck is not the underlying science; it's the human capacity to hold and connect what's already been published. A KG that does that synthesis automatically changes the pace of work. The researcher's time shifts from "what do we know?" to "what should we do next?" That shift is consequential.
The Rare Disease Problem
Rare diseases are underserved not because nobody cares but because no single community is large enough to see the full picture. A disease that affects one in fifty thousand people might have a few hundred papers published about it, scattered across decades and subdisciplines. No single clinician sees enough cases to develop deep expertise. No single researcher has the bandwidth to synthesize the full literature. The patient community is small and often fragmented. The result is that knowledge about rare diseases exists -- it's in the literature -- but it's never assembled in a form that any one person or group can use. Patients and their doctors are left to piece it together from whatever they can find.
A knowledge graph built from the full rare-disease literature could serve as a coordination mechanism. It wouldn't replace clinical expertise or patient advocacy. It would give both something to work with: a structured view of what's known, what's been tried, what's connected to what. A clinician facing an unfamiliar rare diagnosis could query the graph for similar cases, related genes, and treatment attempts. A patient group could use it to identify research gaps and prioritize what to fund. The graph doesn't solve the problem of small communities. It gives small communities access to the same structural synthesis that large communities can achieve through sheer numbers. That's a different kind of equalizer.
Open Problems
The approach in this book works. It also has limits. An honest assessment of what it doesn't yet solve well:
Very long document contexts. Scientific papers can be tens of thousands of words. The relationships that matter may span sections written pages apart. Chunking helps but doesn't fully solve the problem: a relationship that spans a chunk boundary may be missed, and the model's context within any chunk is always less than the full document. Longer context windows in future models will help. So will multi-pass strategies that explicitly handle cross-chunk dependencies. The problem is tractable; it's not solved.
Multi-hop reasoning during extraction. Some relationships require integrating information across multiple sentences, paragraphs, or documents. "Drug A was tested in combination with B; the combination showed activity against C" implies a relationship between the combination and C that depends on understanding both clauses. Current extraction is largely single-pass over local context. Richer reasoning during extraction -- the ability to hold intermediate conclusions and combine them -- would improve recall on complex relationships. This is an active research direction.
Real-time updating. The pipeline in this book is batch-oriented: you ingest a corpus, build a graph, serve it. When new papers appear, you re-run the pipeline. That works for many use cases. It doesn't work for domains where freshness matters -- breaking news, emerging outbreaks, rapidly evolving fields. Incremental update, where new documents are processed and merged without full re-ingestion, is a different design. It's buildable; it adds complexity.
Schema evolution without re-extraction. When you add a new entity type or relationship type, the natural approach is to update the schema and re-extract. That's expensive at scale. Schema evolution that can incorporate new types without re-processing the entire corpus -- perhaps by running a targeted extraction pass over documents likely to contain the new type -- is an open problem. Most projects today bite the bullet and re-extract when the schema changes significantly.
None of these are fundamental blockers. They're places where the current approach is good but not great, and where progress would expand the range of problems the technology can address.
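Of the four, incremental updating is the easiest to picture in code. The sketch below shows only the merge step, and it assumes the hard part is already done: every extracted triple has had its entities resolved to the same canonical IDs the graph uses. The function name and the graph layout are hypothetical.

```python
def merge_document(graph, doc_id, extracted_triples):
    """Fold one document's extractions into an existing graph, in place.

    graph maps (subject, predicate, object) -> set of supporting doc IDs.
    Returns the triples that were new to the graph.
    """
    new_triples = []
    for triple in extracted_triples:
        if triple not in graph:
            graph[triple] = set()
            new_triples.append(triple)
        # Known or new, the document becomes part of the edge's provenance.
        graph[triple].add(doc_id)
    return new_triples


# Placeholder triples and document IDs.
graph = {("drug:A", "treats", "disease:X"): {"doc:1"}}
added = merge_document(
    graph,
    "doc:2",
    [("drug:A", "treats", "disease:X"), ("drug:A", "inhibits", "protein:P")],
)
```

The complexity that the batch pipeline avoids sits outside this function: resolving new mentions against existing entities, revisiting earlier merge decisions when new evidence arrives, and keeping any derived indexes consistent.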
Where the Field Is Going
The specific reasoning substrate will change. LLMs today, something else in ten years -- perhaps more efficient models, perhaps hybrid systems that combine neural and symbolic reasoning, perhaps something we haven't imagined. The need for this grounding layer will not change. Whatever comes after LLMs will still need explicit, domain-specific, human-curated knowledge structure to reason reliably in specialized domains. The book is not documenting a technology moment; it is identifying a permanent architectural requirement that the current moment has finally made practical to address.
Retrieval-augmented generation is a point of convergence. The idea that language models should be grounded in retrieved context rather than relying solely on training is now mainstream. Knowledge graphs are one form that retrieved context can take -- structured, typed, provenance-tracked. The RAG paradigm and the KG approach are complementary. As RAG matures, the value of structured retrieval -- graphs over document chunks -- becomes clearer. The convergence is already happening.
Structured world models in foundation models are another direction. Some researchers are exploring whether large models can learn internal representations that are more graph-like, with explicit entities and relationships. If that succeeds, the boundary between "retrieve from external graph" and "reason from internal structure" may blur. Even then, the argument for explicit, inspectable, provenance-tracked graphs remains: internal representations are opaque; external graphs are auditable. For domains where you need to trace a claim to a source, an external graph is the right architecture. The substrate may evolve. The need for that layer will not.
What the field needs that isn't purely technical is harder to forecast but worth naming. Shared schemas for common domains would reduce duplicated effort and make graphs interoperable across research groups. Open corpora with permissive licensing would let extraction pipelines be benchmarked and compared. Community norms around provenance -- what it means to assert a relationship, how confidence should be calibrated, how retractions should propagate through downstream graphs -- are still being established. The engineering described in this book is relatively mature compared to the social infrastructure around it. Both are necessary.