
Chapter 1: Graphs Are Hard for Language Models

In the spring of 2024, Microsoft Research published a paper called "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." [@edge2024graphrag] The timing was perfect. The field had spent two years watching retrieval-augmented generation -- RAG -- mature from an interesting idea into production infrastructure, and the obvious next question was already in the air: if retrieving text passages helps, what about retrieving structured knowledge? A graph, after all, knows things that a pile of documents doesn't. It knows what is connected to what. It knows the type of every connection. It knows that two entities mentioned in separate papers are the same entity, and it knows how they relate. Graph RAG promised to bring all of that to bear on LLM reasoning.

The paper was well-executed and the results were real. Community indexes built from document corpora outperformed naive RAG on certain kinds of global, thematic questions. Developers read it and started building.

What happened next was instructive. The demos worked. The production deployments were harder. Teams connecting LLMs to real graphs -- not community indexes built for the purpose, but existing SPARQL endpoints, corporate Neo4j instances, Wikidata, domain-specific triple stores -- ran into a consistent set of problems that the paper hadn't addressed and that no amount of prompt engineering reliably fixed. The models wrote queries that were syntactically plausible but semantically broken. They hallucinated predicate names. They got URI prefixes wrong. They produced SPARQL that parsed but returned nothing, or returned the wrong thing, or timed out against endpoints that weren't designed for the query shapes an LLM tends to generate. The knowledge was in the graph. The LLM still couldn't reliably get to it.

This chapter is about why. The problems teams encountered were not random, and they were not going to be fixed by better prompting or a smarter model. They followed from something more fundamental: the mismatch between how graph query languages are structured and how language models actually work. Understanding that mismatch is the first step toward designing around it.

The Transformer and the Context Window

To understand why the interface problem is hard, it helps to understand one architectural fact about the models at the center of it.

In 2017, a team of mostly Google researchers published "Attention Is All You Need," [@vaswani2017attention] introducing the transformer architecture that underlies nearly every large language model in use today. The paper was not received as a landmark at the time -- it was one of several strong results at that year's NeurIPS, and its title, chosen with deliberate provocation, was a bet. Within a few years the transformer had displaced essentially every competing architecture for sequence modeling. The bet paid off.

The core mechanism is self-attention: every token in the input sequence attends to every other token (in the decoder-style models used for generation, to every token before it), producing a weighted representation of the full context. This is what gives transformers their remarkable ability to reason over long-range dependencies -- the token at position 3 can directly influence the interpretation of the token at position 500, with no information loss from distance. Previous architectures had struggled with exactly this; the transformer solved it cleanly.

The cost is quadratic. Self-attention over a sequence of n tokens requires computing n² attention weights. Double the sequence length and you quadruple the compute. This was understood from the beginning and accepted as a reasonable tradeoff -- in 2017, nobody was thinking about context windows of 100,000 tokens. The sequences being modeled were sentences and short paragraphs. The quadratic cost was manageable.
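To make the scaling concrete, here is a minimal sketch that counts pairwise attention weights for a single attention head in a single layer. The function name is illustrative, and the count ignores everything else a real model does (multiple heads, multiple layers, the projections around attention); it is only meant to show the n² growth.

```python
def attention_weight_count(n_tokens: int) -> int:
    """Pairwise attention weights for one head in one layer of full self-attention."""
    return n_tokens * n_tokens


# Compare a few sequence lengths, from sentence-sized to long-context.
for n in (512, 1024, 2048, 100_000):
    print(f"{n:>7} tokens -> {attention_weight_count(n):>15,} weights")

# Doubling the sequence length multiplies the count by four.
doubling_ratio = attention_weight_count(2048) // attention_weight_count(1024)
print("ratio after doubling:", doubling_ratio)
```

The ratio stays at four no matter which length you start from, which is the whole point: the cost of adding tokens depends on how many tokens are already there.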

By 2023, context windows had grown from hundreds of tokens to tens of thousands, and the quadratic cost had become a central engineering concern. Researchers developed linear attention approximations, sparse attention patterns, and sliding window schemes to push the boundary further. Context windows continued to grow. But the fundamental constraint didn't go away -- it got managed, not eliminated. Every token in the context window still imposes a cost on every other token. Longer contexts are not just more expensive in proportion; they are more expensive per token. This is not a hardware limitation that will eventually be engineered away. It is structural to how self-attention works.

The practical consequence for graph querying is direct. A knowledge graph neighborhood -- the entities and relationships within two or three hops of a seed node -- can easily contain hundreds of nodes and thousands of edges. Serializing that neighborhood naively and stuffing it into the context window is expensive, and the expense compounds: a larger context costs more to process and, it turns out, reasons less reliably over its contents.
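A rough sketch of how quickly a neighborhood grows, under loudly stated assumptions: the graph is a synthetic tree built by a hypothetical `toy_graph` helper, every edge uses a made-up `ex:relatedTo` predicate, and serialized size is measured in characters rather than real tokenizer tokens. Real graphs are messier and usually denser.

```python
from collections import deque

Triple = tuple[str, str, str]


def toy_graph(branching: int, levels: int) -> list[Triple]:
    """Build a synthetic tree: each node gets `branching` children, `levels` deep."""
    edges: list[Triple] = []
    frontier = ["n0"]
    next_id = 1
    for _ in range(levels):
        new_frontier = []
        for parent in frontier:
            for _ in range(branching):
                child = f"n{next_id}"
                next_id += 1
                edges.append((parent, "ex:relatedTo", child))
                new_frontier.append(child)
        frontier = new_frontier
    return edges


def k_hop_nodes(edges: list[Triple], seed: str, max_hops: int) -> set[str]:
    """Breadth-first search: every node within max_hops of seed, ignoring edge direction."""
    neighbors: dict[str, set[str]] = {}
    for subj, _pred, obj in edges:
        neighbors.setdefault(subj, set()).add(obj)
        neighbors.setdefault(obj, set()).add(subj)
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)


def serialize(edges: list[Triple], nodes: set[str]) -> str:
    """Naive dump: one triple per line for every edge inside the neighborhood."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in edges if s in nodes and o in nodes)


edges = toy_graph(branching=8, levels=4)
for hops in (1, 2, 3):
    nodes = k_hop_nodes(edges, "n0", hops)
    dump = serialize(edges, nodes)
    print(f"{hops} hops: {len(nodes)} nodes, {len(dump.splitlines())} triples, {len(dump)} chars")
```

Even in this tidy tree, each extra hop multiplies the dump by roughly the branching factor. A real knowledge graph, with hub nodes and cross-links, tends to grow faster and less predictably.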

Lost in the Middle

In 2023, a Stanford-led team published an empirical study with a title that became a shorthand for a problem the field had been observing anecdotally: "Lost in the Middle: How Language Models Use Long Contexts." [@liu2023lost] The finding was stark. LLM performance on tasks that required retrieving specific information from a long context degraded sharply when that information was positioned in the middle of the context window. Models were good at using information near the beginning and near the end. The middle was a dead zone.

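This was not a minor effect. In the paper's multi-document question-answering experiments, at least one model answered less accurately with the relevant document placed mid-context than it did with no documents at all, while performance at the boundaries remained strong. The U-shaped pattern appeared across the model families and context lengths the authors tested.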

The implication for graph querying is that a large, unfiltered graph dump in the context does not just waste tokens -- it actively degrades reasoning. An LLM handed a serialized subgraph of three hundred nodes will not reliably find the relevant dozen. The relevant nodes, wherever they happen to fall in the serialization, are just as likely to land in the dead zone as not. Giving the model more context is not the answer. Giving it the right context is.

The Memory Hierarchy Analogy

Computer architects confronted a version of this problem sixty years ago.

In the 1960s, RAM was expensive and scarce. The gap between the speed of the processor and the speed of available memory was already large and growing. The naive approach -- treat all memory equally, fetch whatever you need when you need it -- didn't work at scale. Programs needed more memory than could be kept fast, and fetching from slow storage on every access made the processor sit idle.

The solution was the cache hierarchy: a small amount of very fast memory close to the processor, a larger amount of slower memory behind it, and backing storage behind that. The key insight was that programs don't access memory randomly -- they have locality. The data a program needs right now is probably near the data it needed a moment ago. Keep the working set in fast memory, page everything else out, and performance improves dramatically.

Peter Denning formalized this in 1968 with working set theory. [@denning1968working] The working set of a process at any moment is the set of memory pages it has accessed recently -- the minimum it needs in fast memory to run efficiently. The question cache architects asked was not "how much memory can we provide?" but "what does this process actually need right now?"
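The definition is easy to state in code. The sketch below is a simplified, discrete-time reading of the idea: given a trace of page references, the working set at time t with window tau is the set of distinct pages touched in the last tau references. The function name, the 0-indexed time convention, and the sample trace are all illustrative assumptions, not taken from Denning's paper.

```python
def working_set(trace: list[int], t: int, tau: int) -> set[int]:
    """Distinct pages referenced in the last `tau` references, up to and including time t."""
    start = max(0, t - tau + 1)
    return set(trace[start : t + 1])


# A made-up reference trace: the process works on pages 1-3, then shifts to 7-8.
trace = [1, 2, 1, 3, 1, 2, 7, 7, 8, 7]

early = working_set(trace, t=5, tau=4)  # window covers references [1, 3, 1, 2]
late = working_set(trace, t=9, tau=4)   # window covers references [7, 7, 8, 7]
print(sorted(early), sorted(late))
```

The two windows are the same length but hold different pages. What the process needs in fast memory follows what it is doing now, not how much memory happens to exist.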

The analogy to LLM context is close. The context window is fast memory -- expensive per token, finite, and the place where reasoning actually happens. Backing storage is the graph: vast, slow to query, and mostly irrelevant to any given question. The design question is not "how much of the graph can we fit in context?" but "what does the model actually need right now to answer this question?"

BFS-QL's answer, developed in detail in Chapter 3, is a working-set-aware data structure: topology always present so the model can navigate, full metadata only where the cost is justified. The context window stays manageable. The reasoning stays accurate. The graph stays accessible.

Why the Interface Is the Problem

Returning to the teams that ran into trouble with production Graph RAG deployments: the failure mode wasn't that their graphs were bad, or that their models were too weak, or that knowledge graphs are fundamentally unsuitable for LLM reasoning. The failure mode was the interface. They were asking language models to use tools designed for human authors, under constraints those tools were never designed to respect.

SPARQL is a powerful and well-designed query language. It was built to let human experts express precise, complex queries against RDF graphs. It rewards deep familiarity with the schema, careful attention to prefix namespaces, and an understanding of how the underlying store evaluates queries. These are things human experts acquire over time. They are not things a language model can reliably produce on demand, cold, against an unfamiliar graph, in the middle of a conversation.

Cypher has similar properties for property graphs. Expressive, powerful, designed for human authors.

The failure modes -- hallucinated predicates, wrong prefixes, syntactically valid but semantically empty queries -- are not bugs that better prompting fixes. They are predictable consequences of asking a model to generate a precise formal language it has seen only in training, against a schema it doesn't know, without feedback. The interface is not designed for this use case.
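A small illustration of the "parses but returns nothing" failure, offered as a sketch rather than a remedy. Everything here is hypothetical: the `ex:` namespace, the vocabulary, and the query are invented for the example, and the regular expression is a crude stand-in for a real SPARQL parser. The query is syntactically well formed, yet two of its predicates do not exist in the graph's vocabulary, so it would match no triples.

```python
import re

# Crude matcher for prefixed names such as ex:hasAuthor (not a full SPARQL grammar).
PREFIXED_NAME = re.compile(r"\b([A-Za-z][\w-]*):([A-Za-z_][\w-]*)")


def unknown_terms(query: str, vocabulary: set[str]) -> set[str]:
    """Prefixed names used in the query that the graph's vocabulary does not contain."""
    used = {f"{prefix}:{local}" for prefix, local in PREFIXED_NAME.findall(query)}
    return used - vocabulary


# Hypothetical schema: what the graph actually calls things.
VOCABULARY = {"ex:hasAuthor", "ex:affiliation", "ex:AcmeLabs"}

# A plausible-looking generated query: reasonable guesses, wrong names.
GENERATED_QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?paper WHERE {
  ?paper ex:authoredBy ?person .
  ?person ex:worksAt ex:AcmeLabs .
}
"""

missing = unknown_terms(GENERATED_QUERY, VOCABULARY)
print(sorted(missing))
```

Nothing about the query's syntax signals a problem; only knowledge of the schema does. That knowledge is exactly what the model lacks when it generates the query cold.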

The rest of Part I examines the natural alternatives and why they fall short. Chapter 2 takes SPARQL and RAG in turn. Chapter 3 proposes a different starting point -- one that fits how language models actually reason, respects the context window as a constrained resource, and makes the graph accessible without asking the model to be something it isn't.