Part I: The Interface Problem

Chapter 1: Graphs Are Hard for Language Models

In April 2024, Microsoft Research published a paper called "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." [@edge2024graphrag] The timing was perfect. The field had spent two years watching retrieval-augmented generation -- RAG -- mature from an interesting idea into production infrastructure, and the obvious next question was already in the air: if retrieving text passages helps, what about retrieving structured knowledge? A graph, after all, knows things that a pile of documents doesn't. It knows what is connected to what. It knows the type of every connection. It knows that two entities mentioned in separate papers are the same entity, and it knows how they relate. Graph RAG promised to bring all of that to bear on LLM reasoning.

The paper was well-executed and the results were real. Community indexes built from document corpora outperformed naive RAG on certain kinds of global, thematic questions. Developers read it and started building.

What happened next was instructive. The demos worked. The production deployments were harder. Teams connecting LLMs to real graphs -- not community indexes built for the purpose, but existing SPARQL endpoints, corporate Neo4j instances, Wikidata, domain-specific triple stores -- ran into a consistent set of problems that the paper hadn't addressed and that no amount of prompt engineering reliably fixed. The models wrote queries that were syntactically plausible but semantically broken. They hallucinated predicate names. They got URI prefixes wrong. They produced SPARQL that parsed but returned nothing, or returned the wrong thing, or timed out against endpoints that weren't designed for the query shapes an LLM tends to generate. The knowledge was in the graph. The LLM still couldn't reliably get to it.

This chapter is about why. The problems teams encountered were not random, and they were not going to be fixed by better prompting or a smarter model. They followed from something more fundamental: the mismatch between how graph query languages are structured and how language models actually work. Understanding that mismatch is the first step toward designing around it.

The Transformer and the Context Window

To understand why the interface problem is hard, it helps to understand one architectural fact about the models at the center of it.

In 2017, a team of researchers, most of them at Google, published "Attention Is All You Need," [@vaswani2017attention] introducing the transformer architecture that underlies nearly every large language model in use today. The paper was not received as a landmark at the time -- it was one of several strong results at that year's NeurIPS (then still called NIPS), and its title, chosen with deliberate provocation, was partly a bet. Within a few years the transformer had displaced essentially every competing architecture for sequence modeling. The bet paid off.

The core mechanism is self-attention: every token in the input sequence attends to every other token (in the decoder-only models behind most LLMs, to every token before it), producing a weighted representation of the full context. This is what gives transformers their remarkable ability to reason over long-range dependencies -- the token at position 3 can directly influence the interpretation of the token at position 500, with no information loss from distance. Previous architectures had struggled with exactly this; the transformer solved it cleanly.

The cost is quadratic. Self-attention over a sequence of n tokens requires computing n² attention weights. Double the sequence length and you quadruple the compute. This was understood from the beginning and accepted as a reasonable tradeoff -- in 2017, nobody was thinking about context windows of 100,000 tokens. The sequences being modeled were sentences and short paragraphs. The quadratic cost was manageable.
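The scaling is easy to check with a few lines of arithmetic. The sketch below counts only the pairwise attention weights for a single head in a single layer, which is a deliberate simplification; the function name is illustrative.

```python
def attention_weights(n_tokens: int) -> int:
    """Pairwise attention weights for one head in one layer (n squared)."""
    return n_tokens * n_tokens

# Each doubling of the sequence length multiplies the count by four.
for n in (512, 1024, 2048):
    print(n, attention_weights(n))
```

Real models multiply this count by the number of heads and layers, but the quadratic shape is the same.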

By 2023, context windows had grown from hundreds of tokens to tens of thousands, and the quadratic cost had become a central engineering concern. Researchers developed linear attention approximations, sparse attention patterns, and sliding window schemes to push the boundary further. Context windows continued to grow. But the fundamental constraint didn't go away -- it got managed, not eliminated. Every token in the context window still imposes a cost on every other token. Longer contexts are not just more expensive in proportion; they are more expensive per token. This is not a hardware limitation that will eventually be engineered away. It is structural to how self-attention works.

The practical consequence for graph querying is direct. A knowledge graph neighborhood -- the entities and relationships within two or three hops of a seed node -- can easily contain hundreds of nodes and thousands of edges. Serializing that neighborhood naively and stuffing it into the context window is expensive, and the expense compounds: a larger context costs more to process and, it turns out, the model reasons less reliably over its contents.

Lost in the Middle

In 2023, a team at Stanford published an empirical study with a title that became a shorthand for a problem the field had been observing anecdotally: "Lost in the Middle: How Language Models Use Long Contexts." [@liu2023lost] The finding was stark. LLM performance on tasks that required retrieving specific information from a long context degraded sharply when that information was positioned in the middle of the context window. Models were good at using information near the beginning and near the end. The middle was a dead zone.

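This was not a minor effect. In the paper's multi-document question-answering experiments, accuracy with the relevant passage in the middle of a long context sometimes fell below what the same model achieved with no documents at all, while accuracy at the boundaries remained strong. The effect appeared across the model families and context lengths the authors tested.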

The implication for graph querying is that a large, unfiltered graph dump in the context does not just waste tokens -- it actively degrades reasoning. An LLM handed a serialized subgraph of three hundred nodes will not reliably find the relevant dozen. The relevant nodes, wherever they happen to fall in the serialization, are just as likely to land in the dead zone as not. Giving the model more context is not the answer. Giving it the right context is.

The Memory Hierarchy Analogy

Computer architects confronted a version of this problem sixty years ago.

In the 1960s, RAM was expensive and scarce. The gap between the speed of the processor and the speed of available memory was already large and growing. The naive approach -- treat all memory equally, fetch whatever you need when you need it -- didn't work at scale. Programs needed more memory than could be kept fast, and fetching from slow storage on every access made the processor sit idle.

The solution was the cache hierarchy: a small amount of very fast memory close to the processor, a larger amount of slower memory behind it, and backing storage behind that. The key insight was that programs don't access memory randomly -- they have locality. The data a program needs right now is probably near the data it needed a moment ago. Keep the working set in fast memory, page everything else out, and performance improves dramatically.

Peter Denning formalized this in 1968 with working set theory. [@denning1968working] The working set of a process at any moment is the set of memory pages it has accessed recently -- the minimum it needs in fast memory to run efficiently. The question cache architects asked was not "how much memory can we provide?" but "what does this process actually need right now?"
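Denning's definition is concrete enough to compute. The following is a minimal sketch, assuming a reference trace given as a list of page names and a window measured in references rather than wall-clock time; the function name and the trace are illustrative.

```python
def working_set(references: list[str], t: int, tau: int) -> set[str]:
    """Pages touched in the last `tau` references, ending at index `t` inclusive."""
    start = max(0, t - tau + 1)
    return set(references[start : t + 1])

# Invented reference trace: the process drifts from pages A/B/C to pages D/E.
trace = ["A", "B", "A", "C", "A", "B", "D", "D", "E", "D"]

print(working_set(trace, t=5, tau=4))  # window covers references 2..5
print(working_set(trace, t=9, tau=4))  # window covers references 6..9
```

The two windows share no pages: what the process needs in fast memory changes as it runs, which is the property the rest of this chapter leans on.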

The analogy to LLM context is exact. The context window is fast memory -- expensive per token, finite, and the place where reasoning actually happens. Backing storage is the graph: vast, slow to query, and mostly irrelevant to any given question. The design question is not "how much of the graph can we fit in context?" but "what does the model actually need right now to answer this question?"

BFS-QL's answer, developed in detail in Chapter 3, is a working-set-aware data structure: topology always present so the model can navigate, full metadata only where the cost is justified. The context window stays manageable. The reasoning stays accurate. The graph stays accessible.

Why the Interface Is the Problem

Returning to the teams that ran into trouble with production Graph RAG deployments: the failure mode wasn't that their graphs were bad, or that their models were too weak, or that knowledge graphs are fundamentally unsuitable for LLM reasoning. The failure mode was the interface. They were asking language models to use tools designed for human authors, under constraints those tools were never designed to respect.

SPARQL is a powerful and well-designed query language. It was built to let human experts express precise, complex queries against RDF graphs. It rewards deep familiarity with the schema, careful attention to prefix namespaces, and an understanding of how the underlying store evaluates queries. These are things human experts acquire over time. They are not things a language model can reliably produce on demand, cold, against an unfamiliar graph, in the middle of a conversation.

Cypher has similar properties for property graphs. Expressive, powerful, designed for human authors.

The failure modes -- hallucinated predicates, wrong prefixes, syntactically valid but semantically empty queries -- are not bugs that better prompting fixes. They are predictable consequences of asking a model to generate a precise formal language it has seen only in training, against a schema it doesn't know, without feedback. The interface is not designed for this use case.

The rest of Part I examines the natural alternatives and why they fall short. Chapter 2 takes SPARQL and RAG in turn. Chapter 3 proposes a different starting point -- one that fits how language models actually reason, respects the context window as a constrained resource, and makes the graph accessible without asking the model to be something it isn't.

Chapter 2: Why Not SPARQL?

In a 2001 article in Scientific American, Tim Berners-Lee described a vision he called the Semantic Web. The web he had built was for humans -- pages of text, navigable by people who could read and interpret them. The Semantic Web would be different: structured, machine-readable, traversable by software agents that could understand what they were reading. An agent looking for a drug interaction wouldn't fetch a page and hope the answer was in the prose. It would issue a query, receive a structured response, follow links to related data, and assemble an answer from explicit, typed facts. The knowledge would be in the graph. The agent would get to it.

The vision failed, for reasons the companion volume examines at length. But the part worth noting here is what the agents were supposed to do: write queries. SPARQL -- the query language that emerged from the Semantic Web effort -- was designed with exactly this use case in mind. Expressive, precise, composable. Everything an intelligent agent would need.

Twenty years later, intelligent agents arrived. They were language models, not the rule-based reasoners Berners-Lee had imagined, and they turned out to be very bad at writing SPARQL. The vision was right about the destination. It was wrong about what the agent would look like and what interface it would need.

How LLMs Generate Text

To understand why SPARQL generation fails systematically, it helps to understand what an LLM is actually doing when it writes a query.

A language model generates text token by token, each token sampled from a probability distribution conditioned on everything that came before. The model has no symbolic reasoner, no query planner, no schema validator. It has statistical patterns absorbed from its training corpus -- including patterns from SPARQL queries, Cypher queries, and documentation about both. When asked to write a query, it produces text that looks like a query, following the patterns it has seen. Most of the time, the surface form is correct. The query parses.

What the model cannot do is verify. It cannot check that a predicate name it has generated actually exists in the target schema. It cannot confirm that a URI prefix is valid for this particular endpoint. It cannot know whether the query it has written will return results, return the wrong results, or time out. It generates plausible text and stops. Verification is not part of the architecture.

This produces a characteristic failure pattern. The queries look right. They often parse. They frequently don't work.

The Failure Modes

Hallucinated predicates. The model generates a predicate name that sounds semantically appropriate but does not exist in the schema. Against a biomedical graph, it might write dbo:treatedBy when the actual predicate is mesh:treats, or invent schema:hasSymptom for a graph that uses snomed:hasPresentation. The query is syntactically valid. It returns nothing. The model, receiving an empty result, may conclude that no such relationship exists in the graph -- a false negative with real consequences.

Wrong URI prefixes. SPARQL queries over RDF graphs require correct namespace prefixes. dbo: and dbr: are different namespaces in DBpedia (ontology terms and resources, respectively); conflating them produces broken queries. Wikidata uses wd: for entities and wdt: for direct property statements; the distinction is non-obvious and frequently confused. A model that has seen many SPARQL examples in training will have absorbed prefix patterns, but those patterns don't transfer cleanly to every endpoint, and the model has no mechanism to verify which prefixes are valid for the graph it is currently querying.

Syntactically valid, semantically empty. Some of the most insidious failures produce queries that parse, execute, and return results -- just not the right ones. A query that asks for all entities of type owl:Thing will return everything in many triple stores. A query with a subtly wrong join condition will return a Cartesian product or an empty set depending on the data. The model has no way to distinguish a correct result from a plausible-looking wrong one.

Query shape mismatch. LLMs tend to generate queries that match the patterns most common in their training data. Those patterns are not necessarily the patterns a given endpoint handles efficiently. A query that works against a local Fuseki instance may time out against a public Wikidata endpoint with rate limits and query complexity restrictions. The model doesn't know the difference.

None of these failure modes are fixable by making the model smarter or the prompt more detailed. They are structural. The model is generating text in a precise formal language against a schema it cannot inspect, with no feedback loop between generation and verification. Better prompting reduces the error rate at the margins. It does not change the underlying mechanism.
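To make the missing verification step concrete, here is the kind of check that has to happen outside the model. It is a minimal sketch, assuming the schema's terms are available as a set of prefixed names; the regular expression is a rough approximation of SPARQL prefixed names, not a parser, and the vocabulary is the hypothetical one from the hallucinated-predicate example above.

```python
import re

# Rough pattern for prefixed names such as mesh:treats; not a full SPARQL parser.
PREFIXED_NAME = re.compile(r"\b([A-Za-z][\w-]*):([A-Za-z_][\w-]*)")

def unknown_terms(query: str, known_terms: set[str]) -> set[str]:
    """Return prefixed names used in the query that the schema does not define."""
    used = {f"{prefix}:{local}" for prefix, local in PREFIXED_NAME.findall(query)}
    return used - known_terms

# Hypothetical vocabulary for the biomedical graph described earlier.
schema = {"mesh:treats", "snomed:hasPresentation"}

generated = "SELECT ?drug WHERE { ?disease dbo:treatedBy ?drug . }"
print(unknown_terms(generated, schema))
```

A check like this catches only the first failure mode, and only when the schema is on hand. It says nothing about whether a query that uses valid terms asks the right question.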

The Same Argument Applies to Cypher

Cypher, Neo4j's query language for property graphs, is a different language with the same problem. It is expressive and well-designed, built for human authors who know their schema and can iterate against a live database. An LLM generating Cypher cold, against an unfamiliar graph, runs into the same failure modes: invented relationship types, wrong property names, match patterns that produce unexpected results. The surface syntax is different. The mechanism of failure is identical.

This is not a criticism of either language. SPARQL and Cypher are excellent at what they were designed to do. The problem is that "designed for human authors with schema familiarity and an interactive development environment" and "suitable for LLM generation in real time against an unknown graph" describe different requirements. No amount of language design work makes both true simultaneously.

Why RAG Doesn't Close the Gap

The natural response to query generation failures is to bypass query generation entirely. Instead of asking the model to write SPARQL, retrieve relevant content from the graph and give it to the model as text. This is the RAG approach applied to graphs, and it has genuine appeal: it avoids the formal language generation problem, it fits neatly into existing RAG infrastructure, and it sidesteps the schema familiarity requirement.

The insight behind RAG is sound. [@lewis2020rag] Giving the model something to reason from rather than asking it to reason from memory reduces hallucination and improves accuracy on knowledge-intensive tasks. For document retrieval, where the question is "find the passage most semantically similar to this query," vector similarity search is a good fit. The model generates a query embedding, the retriever finds similar document embeddings, and the relevant passages land in the context.

Graphs break this in a specific and important way. Relevance in a graph is structural, not semantic. The most important node for answering a question might be two hops away from any node that looks semantically similar to the query. Consider a question about drug interactions for a specific patient profile. The relevant nodes are the drug, its metabolic targets, the enzymes those targets share with other drugs the patient is taking, and the clinical outcomes associated with those shared pathways. None of those intermediate nodes -- the enzymes, the pathways -- are semantically similar to "drug interactions for this patient." They are structurally connected to the answer. Vector similarity retrieval will not find them. It will find nodes that mention drug interactions in their text representation, which is a different set.

The distinction is fundamental. Vector similarity retrieval asks: what is near this query in embedding space? Graph traversal asks: what is connected to what I already know? For the kinds of questions that make knowledge graphs valuable -- multi-hop relational reasoning, pathway analysis, provenance tracing -- the second question is the right one. A retrieval system built for the first question answers the second question poorly, not because the implementation is bad but because the operation is wrong.
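A toy example shows the two operations returning different sets. The graph, the node text, and the word-overlap scoring below are all invented for illustration; word overlap is a crude stand-in for embedding similarity, not how a real retriever scores.

```python
# Toy data: node descriptions and undirected edges, invented for illustration.
node_text = {
    "warfarin": "anticoagulant drug with many drug interactions",
    "fluconazole": "antifungal drug",
    "CYP2C9": "cytochrome P450 enzyme",
    "interaction_review": "review article on drug interactions",
}
edges = {
    "warfarin": ["CYP2C9"],
    "fluconazole": ["CYP2C9"],
    "CYP2C9": ["warfarin", "fluconazole"],
    "interaction_review": [],
}

def similar_nodes(query: str, min_overlap: int = 2) -> set[str]:
    """Stand-in for vector similarity: nodes sharing enough words with the query."""
    q = set(query.lower().split())
    return {n for n, text in node_text.items()
            if len(q & set(text.lower().split())) >= min_overlap}

def reachable(seed: str, hops: int) -> set[str]:
    """Nodes within `hops` edges of the seed, found by expanding ring by ring."""
    seen, frontier = {seed}, {seed}
    for _ in range(hops):
        frontier = {nbr for node in frontier for nbr in edges[node]} - seen
        seen |= frontier
    return seen

print(similar_nodes("drug interactions for warfarin"))
print(reachable("warfarin", hops=2))
```

Similarity surfaces the drug and a review article that mentions interactions. Traversal surfaces the shared enzyme and the second drug, which is where the interaction actually lives.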

This is where Graph RAG systems built on vector retrieval tend to fail in practice. They find semantically similar nodes efficiently. They miss structurally important nodes reliably. The result is a context that looks relevant but is missing the connections that would make the reasoning meaningful.

The fix is not better embeddings or a larger retrieval set. The fix is a different operation: traversal rather than retrieval, structure rather than similarity. That is what Chapter 3 proposes.

Chapter 3: The Right Abstraction

In 1980, David Patterson at UC Berkeley and John Hennessy at Stanford were separately arriving at the same uncomfortable conclusion about the direction computer architecture had taken. The prevailing wisdom of the era was that more was better: more instructions, more addressing modes, more hardware support for complex operations. The VAX-11/780, released by Digital Equipment Corporation in 1977, was the apotheosis of this philosophy -- a machine with hundreds of instructions, some of them extraordinarily powerful, capable of expressing complex operations in a single opcode. Assembly-language programmers loved it. It was, by the standards of the time, a masterpiece.

Patterson and Hennessy thought it was a mistake.

Their argument was not that complex instructions were useless. It was that they were expensive in ways that weren't obvious and beneficial in ways that were overstated. A complex instruction that took ten cycles to execute was not better than ten simple instructions that each took one cycle -- the simple version was equally fast and the compiler could see each step, reason about it, and optimize across them. The hardware complexity required to implement the full instruction set also made it harder to pipeline, harder to verify, and harder to push to higher clock speeds. Simplicity wasn't a limitation. It was an advantage.

The resulting architecture -- RISC, Reduced Instruction Set Computing -- was controversial. It contradicted decades of conventional wisdom and threatened the business models of companies that had invested heavily in CISC implementations. Patterson's Berkeley RISC and Hennessy's MIPS processors were academic projects. Industry was skeptical.

The market settled the argument. RISC architectures -- MIPS, SPARC, ARM -- came to dominate embedded computing, then mobile computing, and then, with the Apple M-series chips, pushed into high-performance desktop computing. The complex instruction sets that had seemed so powerful turned out to be expensive overhead that compilers didn't need and processors couldn't efficiently execute. Fewer, simpler operations, composable by the compiler, outperformed the rich surface area that had seemed like a gift to programmers.

The lesson generalizes. A large interface surface area is not a feature. It is a burden -- on the implementor who must make everything work, on the user who must learn what to use when, and on any automated system that must generate calls into it reliably. The question is not "how much can we express?" but "how little do we need to express everything that matters?"

That is the design question BFS-QL answers.

Traversal, Not Querying

Chapter 2 established that query generation fails because it asks a language model to produce precise formal language against an unknown schema without verification. The failure is structural. But there is a deeper point worth making: even if query generation worked reliably, it would be the wrong operation.

Consider how an LLM actually reasons about a domain it is exploring. It starts with something it knows -- a named entity, a concept, a fact from context. It wants to know what that thing connects to. It expands outward, following relationships, building a picture of the neighborhood. It asks follow-up questions based on what it finds. This is not querying -- it is traversal. The operation is not "express a precise constraint and retrieve the matching set" but "start here, look around, go deeper where it's interesting."

Breadth-first search is the natural formalization of this. Start from one or more seed entities. Expand to their immediate neighbors. Expand again to the next ring. Collect the subgraph. Decide how far to go based on what you find. BFS over a knowledge graph is exactly the operation that matches how an LLM explores a domain: incremental, local, driven by what is already known.
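The operation itself is short. The following is a minimal sketch of hop-limited BFS over an adjacency dictionary; it illustrates the technique in general and is not BFS-QL's implementation, and the function name and toy graph are assumptions.

```python
from collections import deque

def bfs_subgraph(adjacency: dict[str, list[str]], seeds: list[str], max_hops: int):
    """Collect nodes (with hop distance) and edges within max_hops of any seed."""
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    edges = set()
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # nodes on the outer ring are kept but not expanded
        for nbr in adjacency.get(node, []):
            edges.add((node, nbr))
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return depth, edges

# Toy graph: E is three hops from A and falls outside a 2-hop traversal.
toy = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
nodes, edges = bfs_subgraph(toy, seeds=["A"], max_hops=2)
print(sorted(nodes.items()))
print(sorted(edges))
```

Raising max_hops by one, or reseeding at D, is the iterative step: the caller decides how far to go after seeing what the previous ring contained.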

This reframing has an important consequence. A query language like SPARQL is designed to express the full answer in a single declaration -- here is the constraint, find everything that matches. BFS is designed to be issued iteratively -- here is where I am, show me what is nearby. The iterative model fits the LLM's conversational, multi-turn reasoning style. The declarative model requires the LLM to know, upfront, what it is looking for. For graph exploration -- which is often precisely the case where the LLM doesn't know what it's looking for yet -- the iterative model is the right one.

Topology and Presentation

BFS over a knowledge graph produces a subgraph. The question is what that subgraph should contain.

The naive answer is: everything. Return all nodes and all edges within the traversal depth, with all their metadata. This is correct in the sense that no information is lost. It is impractical for the reasons Chapter 1 established: a dense graph at two hops can contain hundreds of nodes and thousands of edges, and dumping all of that into the context window is expensive and degrades reasoning.

The tempting alternative is filtering: return only the nodes and edges that match the query's constraints, discard the rest. If the user asks about drugs and diseases, return only Drug and Disease nodes; drop everything else. This keeps the context small. It also produces a misleading picture of the graph.

Consider a Disease node connected to ten Drug nodes, two Gene nodes, and fifteen Publication nodes. If the query filters for Drugs only and discards everything else, the model sees a Disease connected to ten drugs and nothing else. It does not know that the Disease is also connected to genes and publications. It cannot ask follow-up questions about those connections because it doesn't know they exist. The filtered subgraph is not a smaller version of the truth -- it is a different, inaccurate picture of the graph's structure.

The right answer separates two orthogonal concerns: topology -- what nodes and edges exist -- and presentation -- how much data each one carries. BFS-QL's response to this is the stub. A stub is a node or edge that is present in the result but carries only identity information: its ID and type, nothing more. Stubs are not filtered out. They are present. The model knows they exist, knows what kind of thing they are, and can choose to follow up on them -- by calling describe_entity for a node stub, or issuing a new bfs_query seeded at that node. The stub is a navigational handle, not a dead end.

This means the BFS-QL response to "show me drugs and diseases" is not "here are the drugs and diseases, nothing else." It is "here are the drugs and diseases with full metadata, and here are the other things they connect to as lightweight stubs so you know the topology." The context cost is controlled. The picture of the graph is accurate.
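The difference between filtering and stubbing fits in a few lines. This is a sketch of the idea rather than BFS-QL's code: the `id` and `entity_type` fields follow the identity record described in this chapter, while the function name, the `name` field, and the sample IDs are invented.

```python
def present(nodes: list[dict], full_types: set[str]) -> list[dict]:
    """Keep every node; carry metadata only for the requested entity types."""
    result = []
    for node in nodes:
        if node["entity_type"] in full_types:
            result.append(node)  # full record
        else:
            # Stub: identity only, so the topology stays visible.
            result.append({"id": node["id"], "entity_type": node["entity_type"]})
    return result

# Invented sample records.
nodes = [
    {"id": "disease:1", "entity_type": "disease", "name": "example disease"},
    {"id": "drug:1", "entity_type": "drug", "name": "example drug"},
    {"id": "gene:1", "entity_type": "gene", "name": "example gene"},
    {"id": "paper:1", "entity_type": "paper", "name": "example paper"},
]

for record in present(nodes, full_types={"disease", "drug"}):
    print(record)
```

A filter would have returned two records. This returns four, two of them as stubs the model can choose to expand later.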

The Working Set Applied to Graph Data

Denning's working set theory, described in Chapter 1, asked what the minimum is that a process needs in memory to run efficiently. The answer was not "nothing" -- you need the pages that are currently active. It was not "everything" -- you can't afford to keep it all in fast memory. It was the working set: the pages recently accessed, likely to be accessed again, sufficient for the computation at hand.

The BFS-QL query model asks the same question about context. The node_types and predicates parameters are the mechanism for declaring the working set: these are the types of nodes I need in full, these are the predicates I need with provenance, everything else I need only as topology. The model pays the context cost where it matters and defers cost where it doesn't.

This is a principled design choice, not a workaround. The working set concept is a solution to a fundamental resource allocation problem. Applying it to context management produces a query model that is context-efficient by construction -- not by truncating results or hoping the model will ignore irrelevant content, but by giving the model precise control over where the cost is paid.

In practice, the recommended first move on an unfamiliar graph is to request topology only: call bfs_query with topology_only=True, which returns every node and edge in the traversal as bare identity records -- ID and type, nothing more. A 2-hop neighborhood of 84 nodes and 99 edges fits in roughly 14,000 characters this way, compared to 110,000 characters for the same traversal with full metadata. The model can survey the complete shape of the neighborhood, identify which nodes are worth expanding, and then call describe_entity selectively on the ones that matter. The result is the working set in the strict sense: topology in fast memory, metadata paged in on demand.
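The arithmetic behind those two figures is worth spelling out. The numbers below are the ones quoted in the paragraph above; the variable names are illustrative.

```python
topology_chars = 14_000   # topology-only response, as quoted above
full_chars = 110_000      # same traversal with full metadata, as quoted above

ratio = full_chars / topology_chars
deferred = full_chars - topology_chars

print(f"full metadata is about {ratio:.1f}x the size of topology only")
print(f"{deferred:,} characters of metadata deferred until requested")
```

Most of the full response is metadata the model may never need. Topology-only mode defers that cost until a describe_entity call asks for it.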

The Minimal Surface

Returning to Patterson and Hennessy: the RISC insight was that the right number of instructions is the minimum needed to express everything that matters. Not fewer -- the architecture must be complete. Not more -- every additional instruction is overhead.

BFS-QL has six tools. The choice of six is not arbitrary. It is the minimum complete set for graph exploration:

describe_schema orients the model to an unfamiliar graph.
search_entities resolves names to canonical IDs.
bfs_query traverses the graph from known seeds.
describe_entity retrieves full detail for a single stub.
describe_entities retrieves full detail for a batch of stubs.
intersect_subgraphs returns nodes within k hops of every seed simultaneously.

These six operations cover the full space of what an LLM needs to do with a knowledge graph. Orient, resolve, traverse, expand, batch-expand, intersect. Each operation earned its place by covering something none of the others do. The protocol has grown by one each time a real gap appeared -- not by speculation. Further additions are possible, but the bar is the same: a genuine capability that cannot be composed from existing tools without material cost to the model.
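Of the six, intersect_subgraphs is the least self-explanatory, so a sketch of its semantics may help. The version below is an assumption about what "within k hops of every seed" means operationally: edges are treated as undirected, each seed counts as zero hops from itself, and the signature is invented. The actual tool may differ on any of these points.

```python
def within_hops(adjacency: dict[str, list[str]], seed: str, k: int) -> set[str]:
    """Nodes reachable from the seed in at most k hops (seed included)."""
    seen, frontier = {seed}, {seed}
    for _ in range(k):
        frontier = {n for node in frontier for n in adjacency.get(node, [])} - seen
        seen |= frontier
    return seen

def intersect_subgraphs(adjacency: dict[str, list[str]], seeds: list[str], k: int) -> set[str]:
    """Nodes that lie within k hops of every seed."""
    if not seeds:
        return set()
    return set.intersection(*(within_hops(adjacency, s, k) for s in seeds))

# Toy undirected graph: two drugs meet at a shared enzyme.
toy = {
    "drugA": ["enzyme"],
    "drugB": ["enzyme"],
    "enzyme": ["drugA", "drugB", "pathway"],
    "pathway": ["enzyme"],
}
print(intersect_subgraphs(toy, ["drugA", "drugB"], k=1))
```

The same result could be assembled from separate bfs_query calls with the intersection done in context, which is the kind of material cost to the model that the bar for a new tool refers to.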

A larger surface area would not be more powerful. It would be harder to use reliably -- more choices about which tool to call when, more schema to internalize, more opportunity for the model to pick the wrong tool for the situation. The RISC lesson applies directly: fewer, simpler tools that compose well outperform a rich surface area that requires expertise to navigate.

Canonical IDs and the Epistemic Commons

BFS-QL uses canonical IDs as the fundamental unit of navigation. A seed is a canonical ID. A stub carries a canonical ID. search_entities resolves a name to a canonical ID. The entire interface is built around them.

A canonical ID is not merely a unique key. When a graph assigns a MeSH term to a disease entity, it is connecting that entity to the accumulated judgment of the biomedical community — its definition, its place in the taxonomy, its known synonyms — built and maintained over decades. Each identifier is a pointer into that structure: a located fact rather than a merely named one. A graph node labeled "diabetes" is a string. A graph node identified as MeSH:D003924 is placed in the edifice of human knowledge as the biomedical community understands it.

This is why BFS-QL is built around canonical IDs: that epistemic infrastructure is what makes the interface worth building, and what makes graphs composable across sources (developed in Part IV). The full argument — authorities, the epistemic commons, identity resolution, and what you inherit when you anchor — is in the companion volume The Identity Server: Canonical Identity for Knowledge Graphs.

A Worked Example: Desmopressin in the Medlit Graph

Abstract principles are clearest when grounded in a concrete case. The following is a real session against a knowledge graph built from 36 PubMed Central papers on Cushing disease and related endocrinology -- the medlit demo dataset used throughout Part III.

The session uses four of BFS-QL's six tools, exactly as described. No SPARQL. No schema memorization. No pre-specified query structure. The model orients itself, finds a seed, surveys the topology, and assembles a picture of how desmopressin fits into the Cushing disease literature.

Step 1: Orient.

describe_schema()
→ graph_description:
    "medlit: 36 PubMed papers on Cushing disease"
→ entity_types: [
    anatomicalstructure, author, biologicalprocess,
    biomarker, disease, drug, enzyme, gene, hormone,
    institution, paper, pathway, procedure, protein,
    symptom, ...]
→ predicates: [
    AFFILIATED_WITH, ASSOCIATED_WITH, AUTHORED, CAUSES,
    CITES, DESCRIBED, INHIBITS, REGULATES, TREATS, ...
]

The model now knows the vocabulary. No schema memorization required -- the schema was fetched from the graph that defines it.

Step 2: Resolve.

search_entities("desmopressin")
→ RxNorm:3251  (drug)       ← the canonical drug entry
→ PMC11128938  (paper)
    ← a paper whose name contains "desmopressin"
→ PMC10436086  (paper)

Three matches. The model inspects the entity types and selects RxNorm:3251 as the drug node. The paper matches are a useful signal -- PMC11128938 is the primary paper about desmopressin in this graph.

Step 3: Survey the topology.

bfs_query(
  seeds=["RxNorm:3251"], max_hops=2, topology_only=True
)
→ 84 nodes, 99 edges
→ Each node: {id, entity_type}
→ Each edge: {subject, predicate, object}
→ Response size: ~14,000 characters

The full neighborhood at 14K characters fits comfortably in context. The model can now read the complete topology -- all 84 nodes and 99 edges -- and identify the structure without having paid for metadata it hasn't looked at yet.

From this topology survey, three main traversal axes are visible:

Via DBPedia:Cushing's_disease: connected to 15 associated diseases (including dyslipidemia, hypertension, and osteoporosis), competing drugs (osilodrostat, metyrapone, cabergoline, pasireotide), and causal factors (pituitary adenoma, glucocorticoids).
Via RxNorm:5492 (cortisol): connected to HPA axis regulation, adrenal anatomy, two proteins (UniProt:A3QQ76, UniProt:D3K902), neuroplasticity, and downstream symptoms.
Via RxNorm:376 (ACTH): connected to dopamine agonist inhibition and hypercortisolism causality.

Step 4: Drill down selectively.

describe_entity("DBPedia:Cushing's_disease")
→ name: "Cushing's disease"
→ canonical_url:
    "https://dbpedia.org/page/Cushing's_disease"
→ supporting_documents: [
    PMC11128938, PMC11779774, PMC4374115, ...
]
→ properties: {synonyms: ["CD"], ...}

The model retrieves full metadata only for the node it wants to understand in depth. The 83 other nodes remain as topology stubs -- present, navigable, not consuming context budget.

What the model learns.

Desmopressin is primarily a diagnostic agent in this graph, not merely a therapeutic one. It appears in stimulation tests for differential diagnosis of Cushing disease -- distinguishing pituitary from ectopic ACTH sources -- which is why it connects to ACTH, cortisol, and the procedure cluster (bilateral inferior petrosal sinus sampling, transsphenoidal surgery) rather than to the treatment drugs directly.

This is the kind of inference that requires structural knowledge, not semantic similarity. The connection between desmopressin and bilateral inferior petrosal sinus sampling (BIPSS) is not in the text of any one paper in a way that vector retrieval would surface. It is in the graph. BFS-QL made it accessible.

The Landmark Ahead

BFS-QL solves the single-graph problem. But there is a more interesting consequence that Part IV takes up at length.

The interface contract that makes one graph accessible -- six tools, a flat query format, canonical IDs as seeds -- turns out to make many graphs composable. When two graphs both use MeSH terms for diseases and HGNC symbols for genes, an LLM holding connections to both can traverse the boundary between them using only the canonical IDs it already has. No special protocol support. No federation layer. No query rewriting. The shared canonical ID is the bridge.
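The bridging step can be sketched with two toy graphs. MeSH:D003924 is the identifier used earlier in this chapter; every other ID, both graphs, and both function names are placeholders invented for illustration.

```python
# Two independent graphs that both anchor a disease to the same MeSH term.
literature_graph = {
    "MeSH:D003924": ["paper:1", "paper:2"],
    "paper:1": ["MeSH:D003924"],
    "paper:2": ["MeSH:D003924"],
}
genetics_graph = {
    "MeSH:D003924": ["gene:X"],
    "gene:X": ["MeSH:D003924"],
}

def shared_ids(*graphs: dict[str, list[str]]) -> set[str]:
    """Canonical IDs present in every graph: the available bridges."""
    return set.intersection(*(set(g) for g in graphs))

def neighbors_across(node_id: str, *graphs: dict[str, list[str]]) -> set[str]:
    """Union of a node's neighbors over all graphs that contain it."""
    found = set()
    for g in graphs:
        found.update(g.get(node_id, []))
    return found

print(shared_ids(literature_graph, genetics_graph))
print(neighbors_across("MeSH:D003924", literature_graph, genetics_graph))
```

Nothing in either graph knows about the other. The join happens in the caller, on an identifier both sides already agreed to use.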

This is not a property of BFS-QL. It is a property of canonical identity -- a decision the biomedical, legal, and chemistry communities made decades ago for entirely different reasons. What BFS-QL does is expose it: by building the interface around canonical IDs as the fundamental unit of navigation, it makes the composability that was always latent in those shared authorities directly accessible to LLM reasoning.

The landmark is visible from here. Part IV is where we reach it.