Skip to content

Chapter03

Chapter 3: The Right Abstraction

In 1980, David Patterson at UC Berkeley and John Hennessy at Stanford were separately arriving at the same uncomfortable conclusion about the direction computer architecture had taken. The prevailing wisdom of the era was that more was better: more instructions, more addressing modes, more hardware support for complex operations. The VAX-11/780, released by Digital Equipment Corporation in 1977, was the apotheosis of this philosophy -- a machine with hundreds of instructions, some of them extraordinarily powerful, capable of expressing complex operations in a single opcode. Compiler writers loved it. It was, by the standards of the time, a masterpiece.

Patterson and Hennessy thought it was a mistake.

Their argument was not that complex instructions were useless. It was that they were expensive in ways that weren't obvious and beneficial in ways that were overstated. A complex instruction that took ten cycles to execute was not better than ten simple instructions that each took one cycle -- the simple version was equally fast and the compiler could see each step, reason about it, and optimize across them. The hardware complexity required to implement the full instruction set also made it harder to pipeline, harder to verify, and harder to push to higher clock speeds. Simplicity wasn't a limitation. It was an advantage.

The resulting architecture -- RISC, Reduced Instruction Set Computing -- was controversial. It contradicted decades of conventional wisdom and threatened the business models of companies that had invested heavily in CISC implementations. Patterson's Berkeley RISC and Hennessy's MIPS processors were academic projects. Industry was skeptical.

The market settled the argument. RISC architectures -- MIPS, SPARC, ARM -- came to dominate embedded computing, then mobile computing, then, with the Apple M-series chips, high-performance desktop computing. The complex instruction sets that had seemed so powerful turned out to be expensive overhead that compilers didn't need and processors couldn't efficiently execute. Fewer, simpler operations, composable by the compiler, outperformed the rich surface area that had seemed like a gift to programmers.

The lesson generalizes. A large interface surface area is not a feature. It is a burden -- on the implementor who must make everything work, on the user who must learn what to use when, and on any automated system that must generate calls into it reliably. The question is not "how much can we express?" but "how little do we need to express everything that matters?"

That is the design question BFS-QL answers.

Traversal, Not Querying

Chapter 2 established that query generation fails because it asks a language model to produce precise formal language against an unknown schema without verification. The failure is structural. But there is a deeper point worth making: even if query generation worked reliably, it would be the wrong operation.

Consider how an LLM actually reasons about a domain it is exploring. It starts with something it knows -- a named entity, a concept, a fact from context. It wants to know what that thing connects to. It expands outward, following relationships, building a picture of the neighborhood. It asks follow-up questions based on what it finds. This is not querying -- it is traversal. The operation is not "express a precise constraint and retrieve the matching set" but "start here, look around, go deeper where it's interesting."

Breadth-first search is the natural formalization of this. Start from one or more seed entities. Expand to their immediate neighbors. Expand again to the next ring. Collect the subgraph. Decide how far to go based on what you find. BFS over a knowledge graph is exactly the operation that matches how an LLM explores a domain: incremental, local, driven by what is already known.

This reframing has an important consequence. A query language like SPARQL is designed to express the full answer in a single declaration -- here is the constraint, find everything that matches. BFS is designed to be issued iteratively -- here is where I am, show me what is nearby. The iterative model fits the LLM's conversational, multi-turn reasoning style. The declarative model requires the LLM to know, upfront, what it is looking for. For graph exploration -- which is often precisely the case where the LLM doesn't know what it's looking for yet -- the iterative model is the right one.

Topology and Presentation

BFS over a knowledge graph produces a subgraph. The question is what that subgraph should contain.

The naive answer is: everything. Return all nodes and all edges within the traversal depth, with all their metadata. This is correct in the sense that no information is lost. It is impractical for the reasons Chapter 1 established: a dense graph at two hops can contain hundreds of nodes and thousands of edges, and dumping all of that into the context window is expensive and degrades reasoning.

The tempting alternative is filtering: return only the nodes and edges that match the query's constraints, discard the rest. If the user asks about drugs and diseases, return only Drug and Disease nodes; drop everything else. This keeps the context small. It also produces a misleading picture of the graph.

Consider a Disease node connected to ten Drug nodes, two Gene nodes, and fifteen Publication nodes. If the query filters for Drugs only and discards everything else, the model sees a Disease connected to ten drugs and nothing else. It does not know that the Disease is also connected to genes and publications. It cannot ask follow-up questions about those connections because it doesn't know they exist. The filtered subgraph is not a smaller version of the truth -- it is a different, inaccurate picture of the graph's structure.

The right answer separates two orthogonal concerns: topology -- what nodes and edges exist -- and presentation -- how much data each one carries. BFS-QL's response to this is the stub. A stub is a node or edge that is present in the result but carries only identity information: its ID and type, nothing more. Stubs are not filtered out. They are present. The model knows they exist, knows what kind of thing they are, and can choose to follow up on them -- by calling describe_entity for a node stub, or issuing a new bfs_query seeded at that node. The stub is a navigational handle, not a dead end.

This means the BFS-QL response to "show me drugs and diseases" is not "here are the drugs and diseases, nothing else." It is "here are the drugs and diseases with full metadata, and here are the other things they connect to as lightweight stubs so you know the topology." The context cost is controlled. The picture of the graph is accurate.

The Working Set Applied to Graph Data

Denning's working set theory, described in Chapter 1, asked what the minimum is that a process needs in memory to run efficiently. The answer was not "nothing" -- you need the pages that are currently active. It was not "everything" -- you can't afford to keep it all in fast memory. It was the working set: the pages recently accessed, likely to be accessed again, sufficient for the computation at hand.

The BFS-QL query model asks the same question about context. The node_types and predicates parameters are the mechanism for declaring the working set: these are the types of nodes I need in full, these are the predicates I need with provenance, everything else I need only as topology. The model pays the context cost where it matters and defers cost where it doesn't.

This is a principled design choice, not a workaround. The working set concept is a solution to a fundamental resource allocation problem. Applying it to context management produces a query model that is context-efficient by construction -- not by truncating results or hoping the model will ignore irrelevant content, but by giving the model precise control over where the cost is paid.

In practice, the recommended first move on an unfamiliar graph is to request topology only: call bfs_query with topology_only=True, which returns every node and edge in the traversal as bare identity records -- ID and type, nothing more. A 2-hop neighborhood of 84 nodes and 99 edges fits in roughly 14,000 characters this way, compared to 110,000 characters for the same traversal with full metadata. The model can survey the complete shape of the neighborhood, identify which nodes are worth expanding, and then call describe_entity selectively on the ones that matter. The result is the working set in the strict sense: topology in fast memory, metadata paged in on demand.

The Minimal Surface

Returning to Patterson and Hennessy: the RISC insight was that the right number of instructions is the minimum needed to express everything that matters. Not fewer -- the architecture must be complete. Not more -- every additional instruction is overhead.

BFS-QL has six tools. The choice of six is not arbitrary. It is the minimum complete set for graph exploration:

  • describe_schema orients the model to an unfamiliar graph.
  • search_entities resolves names to canonical IDs.
  • bfs_query traverses the graph from known seeds.
  • describe_entity retrieves full detail for a single stub.
  • describe_entities retrieves full detail for a batch of stubs.
  • intersect_subgraphs returns nodes within k hops of every seed simultaneously.

These six operations cover the full space of what an LLM needs to do with a knowledge graph. Orient, resolve, traverse, expand, batch-expand, intersect. Each operation earned its place by covering something none of the others do. The protocol has grown by one each time a real gap appeared -- not by speculation. Further additions are possible, but the bar is the same: a genuine capability that cannot be composed from existing tools without material cost to the model.

A larger surface area would not be more powerful. It would be harder to use reliably -- more choices about which tool to call when, more schema to internalize, more opportunity for the model to pick the wrong tool for the situation. The RISC lesson applies directly: fewer, simpler tools that compose well outperform a rich surface area that requires expertise to navigate.

Canonical IDs and the Epistemic Commons

BFS-QL uses canonical IDs as the fundamental unit of navigation. A seed is a canonical ID. A stub carries a canonical ID. search_entities resolves a name to a canonical ID. The entire interface is built around them.

A canonical ID is not merely a unique key. When a graph assigns a MeSH term to a disease entity, it is connecting that entity to the accumulated judgment of the biomedical community — its definition, its place in the taxonomy, its known synonyms — built and maintained over decades. Each identifier is a pointer into that structure: a located fact rather than a merely named one. A graph node labeled "diabetes" is a string. A graph node identified as MeSH:D003924 is placed in the edifice of human knowledge as the biomedical community understands it.

This is why BFS-QL is built around canonical IDs: that epistemic infrastructure is what makes the interface worth building, and what makes graphs composable across sources (developed in Part IV). The full argument — authorities, the epistemic commons, identity resolution, and what you inherit when you anchor — is in the companion volume The Identity Server: Canonical Identity for Knowledge Graphs.

A Worked Example: Desmopressin in the Medlit Graph

Abstract principles are clearest when grounded in a concrete case. The following is a real session against a knowledge graph built from 36 PubMed Central papers on Cushing disease and related endocrinology -- the medlit demo dataset used throughout Part III.

The session uses BFS-QL's six tools exactly as described. No SPARQL. No schema memorization. No pre-specified query structure. The model orients itself, finds a seed, surveys the topology, and assembles a picture of how desmopressin fits into the Cushing disease literature.

Step 1: Orient.

describe_schema()
→ graph_description:
    "medlit: 36 PubMed papers on Cushing disease"
→ entity_types: [
    anatomicalstructure, author, biologicalprocess,
    biomarker, disease, drug, enzyme, gene, hormone,
    institution, paper, pathway, procedure, protein,
    symptom, ...]
→ predicates: [
    AFFILIATED_WITH, ASSOCIATED_WITH, AUTHORED, CAUSES,
    CITES, DESCRIBED, INHIBITS, REGULATES, TREATS, ...
]

The model now knows the vocabulary. No schema memorization required -- the schema was fetched from the graph that defines it.

Step 2: Resolve.

search_entities("desmopressin")
→ RxNorm:3251  (drug)       ← the canonical drug entry
→ PMC11128938  (paper)
    ← a paper whose name contains "desmopressin"
→ PMC10436086  (paper)

Three matches. The model inspects the entity types and selects RxNorm:3251 as the drug node. The paper matches are a useful signal -- PMC11128938 is the primary paper about desmopressin in this graph.

Step 3: Survey the topology.

bfs_query(
  seeds=["RxNorm:3251"], max_hops=2, topology_only=True
)
→ 84 nodes, 99 edges
→ Each node: {id, entity_type}
→ Each edge: {subject, predicate, object}
→ Response size: ~14,000 characters

The full neighborhood at 14K characters fits comfortably in context. The model can now read the complete topology -- all 84 nodes and 99 edges -- and identify the structure without having paid for metadata it hasn't looked at yet.

From this topology survey, three main traversal axes are visible:

  • Via DBPedia:Cushing's_disease: connected to 15 associated diseases (dyslipidemia, hypertension, osteoporosis), competing drugs (osilodrostat, metyrapone, cabergoline, pasireotide), and causal factors (pituitary adenoma, glucocorticoids).
  • Via RxNorm:5492 (cortisol): connected to HPA axis regulation, adrenal anatomy, two proteins (UniProt:A3QQ76, UniProt:D3K902), neuroplasticity, and downstream symptoms.
  • Via RxNorm:376 (ACTH): connected to dopamine agonist inhibition and hypercortisolism causality.

Step 4: Drill down selectively.

describe_entity("DBPedia:Cushing's_disease")
→ name: "Cushing's disease"
→ canonical_url:
    "https://dbpedia.org/page/Cushing's_disease"
→ supporting_documents: [
    PMC11128938, PMC11779774, PMC4374115, ...
]
→ properties: {synonyms: ["CD"], ...}

The model retrieves full metadata only for the node it wants to understand in depth. The 83 other nodes remain as topology stubs -- present, navigable, not consuming context budget.

What the model learns.

Desmopressin is primarily a diagnostic agent in this graph, not merely a therapeutic one. It appears in stimulation tests for differential diagnosis of Cushing disease -- distinguishing pituitary from ectopic ACTH sources -- which is why it connects to ACTH, cortisol, and the procedure cluster (bilateral inferior petrosal sinus sampling, transsphenoidal surgery) rather than to the treatment drugs directly.

This is the kind of inference that requires structural knowledge, not semantic similarity. The connection between desmopressin and BIPSS is not in the text of any one paper in a way that vector retrieval would surface. It is in the graph. BFS-QL made it accessible.

The Landmark Ahead

BFS-QL solves the single-graph problem. But there is a more interesting consequence that Part IV takes up at length.

The interface contract that makes one graph accessible -- six tools, a flat query format, canonical IDs as seeds -- turns out to make many graphs composable. When two graphs both use MeSH terms for diseases and HGNC symbols for genes, an LLM holding connections to both can traverse the boundary between them using only the canonical IDs it already has. No special protocol support. No federation layer. No query rewriting. The shared canonical ID is the bridge.

This is not a property of BFS-QL. It is a property of canonical identity -- a decision the biomedical, legal, and chemistry communities made decades ago for entirely different reasons. What BFS-QL does is expose it: by building the interface around canonical IDs as the fundamental unit of navigation, it makes the composability that was always latent in those shared authorities directly accessible to LLM reasoning.

The landmark is visible from here. Part IV is where we reach it.