Chapter 2: Why Not SPARQL?

In a 2001 article in Scientific American, Tim Berners-Lee described a vision he called the Semantic Web. The web he had built was for humans -- pages of text, navigable by people who could read and interpret them. The Semantic Web would be different: structured, machine-readable, traversable by software agents that could understand what they were reading. An agent looking for a drug interaction wouldn't fetch a page and hope the answer was in the prose. It would issue a query, receive a structured response, follow links to related data, and assemble an answer from explicit, typed facts. The knowledge would be in the graph. The agent would get to it.

The vision failed, for reasons the companion volume examines at length. But the part worth noting here is what the agents were supposed to do: write queries. SPARQL -- the query language that emerged from the Semantic Web effort -- was designed with exactly this use case in mind. Expressive, precise, composable. Everything an intelligent agent would need.

Twenty years later, intelligent agents arrived. They were language models, not the rule-based reasoners Berners-Lee had imagined, and they turned out to be very bad at writing SPARQL. The vision was right about the destination. It was wrong about what the agent would look like and what interface it would need.

How LLMs Generate Text

To understand why SPARQL generation fails systematically, it helps to understand what an LLM is actually doing when it writes a query.

A language model generates text token by token, each token sampled from a probability distribution conditioned on everything that came before. The model has no symbolic reasoner, no query planner, no schema validator. It has statistical patterns absorbed from its training corpus -- including patterns from SPARQL queries, Cypher queries, and documentation about both. When asked to write a query, it produces text that looks like a query, following the patterns it has seen. Most of the time, the surface form is correct. The query parses.
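A minimal sketch of that generation loop may make the point concrete. The "model" here is a hypothetical lookup table of made-up probabilities keyed on the previous token only, whereas a real language model conditions on the whole prefix. The shape of the loop is the same either way: sample a token, append it, repeat. Nothing in it consults a schema.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed on the previous token.
# A real LLM conditions on the whole prefix; one token of context keeps this small.
NEXT_TOKEN_PROBS = {
    "<start>": {"SELECT": 1.0},
    "SELECT": {"*": 1.0},
    "*": {"WHERE": 1.0},
    "WHERE": {"{": 1.0},
    "{": {"?drug": 1.0},
    # All three predicates look plausible; the probabilities are made up.
    "?drug": {"mesh:treats": 0.4, "dbo:treatedBy": 0.35, "schema:treats": 0.25},
    "mesh:treats": {"?disease": 1.0},
    "dbo:treatedBy": {"?disease": 1.0},
    "schema:treats": {"?disease": 1.0},
    "?disease": {"}": 1.0},
}


def generate(rng, max_tokens=20):
    """Sample one token at a time until no continuation is defined."""
    tokens = []
    current = "<start>"
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS.get(current)
        if dist is None:  # "}" has no entry, so generation stops there
            break
        choices, weights = zip(*dist.items())
        current = rng.choices(choices, weights=weights, k=1)[0]
        tokens.append(current)
    return " ".join(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(generate(rng))
```

Every output of this sketch is a well-formed query. Which predicate appears in it depends only on the sampled probabilities, not on what any particular graph contains.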

What the model cannot do is verify. It cannot check that a predicate name it has generated actually exists in the target schema. It cannot confirm that a URI prefix is valid for this particular endpoint. It cannot know whether the query it has written will return results, return the wrong results, or time out. It generates plausible text and stops. Verification is not part of the architecture.

This produces a characteristic failure pattern. The queries look right. They often parse. They frequently don't work.

The Failure Modes

Hallucinated predicates. The model generates a predicate name that sounds semantically appropriate but does not exist in the schema. Against a biomedical graph, it might write dbo:treatedBy when the actual predicate is mesh:treats, or invent schema:hasSymptom for a graph that uses snomed:hasPresentation. The query is syntactically valid. It returns nothing. The model, receiving an empty result, may conclude that no such relationship exists in the graph -- a false negative with real consequences.
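The false negative can be illustrated with a small sketch. It assumes a toy in-memory triple set and a hypothetical `match` helper standing in for a real endpoint, and the drug and disease identifiers are invented for illustration. A query on a predicate that does not exist and a query on a real predicate with no matching data both come back empty. Only a separate lookup against the schema tells them apart.

```python
# Toy graph; identifiers are illustrative, not taken from a real dataset.
TRIPLES = {
    ("drug:aspirin", "mesh:treats", "disease:headache"),
    ("drug:ibuprofen", "mesh:treats", "disease:fever"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def known_predicates(triples):
    """The set of predicates that actually occur in the graph."""
    return {p for _, p, _ in triples}


# Predicate invented by the generator: valid pattern, no results.
hallucinated = match(TRIPLES, p="dbo:treatedBy")

# Real predicate, but the graph holds no such fact: also no results.
true_negative = match(TRIPLES, p="mesh:treats", o="disease:gout")

# The two empty results are indistinguishable to the caller...
same_answer = hallucinated == true_negative == []

# ...unless something outside the generation step inspects the schema.
predicate_exists = "dbo:treatedBy" in known_predicates(TRIPLES)

if __name__ == "__main__":
    print(same_answer, predicate_exists)
```

The schema lookup in the last step is exactly the inspection a model generating text cannot perform on its own.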

Wrong URI prefixes. SPARQL queries over RDF graphs require correct namespace prefixes. dbo: and dbr: are different namespaces in DBpedia; conflating them produces broken queries. Wikidata uses wd: for entities and wdt: for the direct ("truthy") property predicates that connect them; the distinction is non-obvious and frequently confused. A model that has seen many SPARQL examples in training will have absorbed prefix patterns, but those patterns don't transfer cleanly to every endpoint, and the model has no mechanism to verify which prefixes are valid for the graph it is currently querying.

Syntactically valid, semantically empty. Some of the most insidious failures produce queries that parse, execute, and return results -- just not the right ones. A query that asks for all entities of type owl:Thing can return nearly everything in a triple store that applies OWL inference. A query with a subtly wrong join condition will return a Cartesian product or an empty set depending on the data. The model has no way to distinguish a correct result from a plausible-looking wrong one.
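The join failure is easy to reproduce. The sketch below is a hypothetical, deliberately naive evaluator for lists of triple patterns, not how a real SPARQL engine works, and it runs over four toy triples under an invented ex: vocabulary. The two queries differ by a single variable name.

```python
# Toy data under an invented ex: vocabulary.
TRIPLES = [
    ("drug:warfarin", "ex:metabolizedBy", "enzyme:CYP2C9"),
    ("drug:fluconazole", "ex:inhibits", "enzyme:CYP2C9"),
    ("drug:codeine", "ex:metabolizedBy", "enzyme:CYP2D6"),
    ("drug:paroxetine", "ex:inhibits", "enzyme:CYP2D6"),
]


def evaluate(patterns, triples):
    """Naive evaluation of triple patterns; terms starting with '?' are variables."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                ok = True
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # Bind the variable, or check it against an earlier binding.
                        if candidate.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        solutions = extended
    return solutions


# Intended query: both patterns share ?enzyme, so they join on it.
joined = evaluate(
    [("?a", "ex:metabolizedBy", "?enzyme"), ("?b", "ex:inhibits", "?enzyme")],
    TRIPLES,
)

# One variable renamed: the patterns no longer share anything to join on.
unjoined = evaluate(
    [("?a", "ex:metabolizedBy", "?enzyme"), ("?b", "ex:inhibits", "?enzyme2")],
    TRIPLES,
)

pairs_joined = sorted((row["?a"], row["?b"]) for row in joined)
pairs_unjoined = sorted((row["?a"], row["?b"]) for row in unjoined)

if __name__ == "__main__":
    print(len(pairs_joined), len(pairs_unjoined))
```

The second query returns every metabolized drug paired with every inhibitor, including pairs with no shared enzyme. Each row still looks like a drug pair, so nothing in the output signals that the join was lost.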

Query shape mismatch. LLMs tend to generate queries that match the patterns most common in their training data. Those patterns are not necessarily the patterns a given endpoint handles efficiently. A query that works against a local Fuseki instance may time out against a public Wikidata endpoint with rate limits and query complexity restrictions. The model doesn't know the difference.

None of these failure modes are fixable by making the model smarter or the prompt more detailed. They are structural. The model is generating text in a precise formal language against a schema it cannot inspect, with no feedback loop between generation and verification. Better prompting reduces the error rate at the margins. It does not change the underlying mechanism.

The Same Argument Applies to Cypher

Cypher, Neo4j's query language for property graphs, is a different language with the same problem. It is expressive and well-designed, built for human authors who know their schema and can iterate against a live database. An LLM generating Cypher cold, against an unfamiliar graph, runs into the same failure modes: invented relationship types, wrong property names, match patterns that produce unexpected results. The surface syntax is different. The mechanism of failure is identical.

This is not a criticism of either language. SPARQL and Cypher are excellent at what they were designed to do. The problem is that "designed for human authors with schema familiarity and an interactive development environment" and "suitable for LLM generation in real time against an unknown graph" describe different requirements. No amount of language design work makes both true simultaneously.

Why RAG Doesn't Close the Gap

The natural response to query generation failures is to bypass query generation entirely. Instead of asking the model to write SPARQL, retrieve relevant content from the graph and give it to the model as text. This is the RAG approach applied to graphs, and it has genuine appeal: it avoids the formal language generation problem, it fits neatly into existing RAG infrastructure, and it sidesteps the schema familiarity requirement.

The insight behind RAG is sound [@lewis2020rag]. Giving the model something to reason from rather than asking it to reason from memory reduces hallucination and improves accuracy on knowledge-intensive tasks. For document retrieval, where the question is "find the passage most semantically similar to this query," vector similarity search is a good fit. An encoder embeds the query, the retriever finds the nearest document embeddings, and the relevant passages land in the context.

Graphs break this in a specific and important way. Relevance in a graph is structural, not semantic. The most important node for answering a question might be two hops away from any node that looks semantically similar to the query. Consider a question about drug interactions for a specific patient profile. The relevant nodes are the drug, the enzymes that metabolize it, the other drugs the patient is taking that depend on the same enzymes, and the clinical outcomes associated with those shared pathways. None of those intermediate nodes -- the enzymes, the pathways -- are semantically similar to "drug interactions for this patient." They are structurally connected to the answer. Vector similarity retrieval will not find them. It will find nodes that mention drug interactions in their text representation, which is a different set.

The distinction is fundamental. Vector similarity retrieval asks: what is near this query in embedding space? Graph traversal asks: what is connected to what I already know? For the kinds of questions that make knowledge graphs valuable -- multi-hop relational reasoning, pathway analysis, provenance tracing -- the second question is the right one. A retrieval system built for the first question answers the second question poorly, not because the implementation is bad but because the operation is wrong.
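The two operations can be put side by side in a short sketch. Everything in it is an assumption made for illustration: the node names, the edges, and especially the two-dimensional "embeddings", which are hand-assigned so that nodes whose text mentions interactions sit close to the query. Real embeddings come from an encoder and have hundreds of dimensions, but the contrast between the operations does not depend on that.

```python
import math

# Hand-assigned toy vectors (hypothetical); axis 0 loosely means
# "text talks about interactions", axis 1 "text talks about biochemistry".
EMBEDDINGS = {
    "note:interaction_warning": (0.9, 0.1),
    "article:interactions_overview": (0.8, 0.2),
    "drug:warfarin": (0.5, 0.5),
    "drug:fluconazole": (0.4, 0.6),
    "enzyme:CYP2C9": (0.1, 0.9),
}
QUERY_VECTOR = (1.0, 0.0)  # stands in for "drug interactions for this patient"

# Undirected toy edges: both drugs are linked through the same enzyme.
EDGES = {
    "drug:warfarin": ["enzyme:CYP2C9"],
    "enzyme:CYP2C9": ["drug:warfarin", "drug:fluconazole"],
    "drug:fluconazole": ["enzyme:CYP2C9"],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k_similar(query, k):
    """Retrieval: which nodes are nearest the query in embedding space?"""
    ranked = sorted(EMBEDDINGS, key=lambda n: cosine(EMBEDDINGS[n], query), reverse=True)
    return ranked[:k]


def within_hops(start, max_hops):
    """Traversal: which nodes are connected to what is already known?"""
    depth = {start: 0}
    frontier = [start]
    while frontier:
        next_frontier = []
        for node in frontier:
            if depth[node] == max_hops:
                continue
            for neighbor in EDGES.get(node, []):
                if neighbor not in depth:
                    depth[neighbor] = depth[node] + 1
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return list(depth)


retrieved = top_k_similar(QUERY_VECTOR, k=2)
traversed = within_hops("drug:warfarin", max_hops=2)

if __name__ == "__main__":
    print(retrieved)
    print(traversed)
```

With these toy values, retrieval returns the two nodes that talk about interactions, and traversal from the patient's drug returns the enzyme and the second drug that shares it. The enzyme ranks last by similarity, so enlarging the retrieval set would pull in every other node before reaching it.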

This is where Graph RAG systems built on vector retrieval tend to fail in practice. They find semantically similar nodes efficiently. They miss structurally important nodes reliably. The result is a context that looks relevant but is missing the connections that would make the reasoning meaningful.

The fix is not better embeddings or a larger retrieval set. The fix is a different operation: traversal rather than retrieval, structure rather than similarity. That is what Chapter 3 proposes.