Foreword: A Manifesto for Machine Knowledge

This foreword appears in all three volumes of the Graphwright series.

We are now in an age of machine reasoning, and some of this reasoning is done in high-stakes domains: medicine, law, engineering, spaceflight. Lives and livelihoods can be affected by incorrect conclusions or decisions. The cost of error is real and significant. LLMs are here, they are staying, and there is no turning back the clock.

As we all know, LLMs have weaknesses. Their mastery of language syntax is astonishing, but they don't understand "this refers to that," or "these two things are the same." They have no persistent notion of identity. They do not inhabit a world of things connected by relationships. They do not track logical consequence from one step to the next.

They cannot reliably reason across multiple causal steps because they cannot reliably reason across a single one. They do not know what things are or how they behave, only how they are talked about.

And so we build RAG (retrieval-augmented generation) systems, hoping to improve the situation. Retrieval steers the LLM toward material that is more relevant to the question and better connected to its sources, and it helps.

But we are still dealing with strings, not things.

We still cannot say "this refers to that," or "these two mentions refer to the same entity." We still cannot follow a chain of causality or enforce a sequence of logical steps. We retrieve passages, but we do not operate on meaning.

If RAG doesn't close the gap, what would?

Identity -- what are we talking about?
  - Canonical IDs -- identifiers anchored in curated human knowledge (think Wikipedia)
  - Authoritative ontologies -- shared bodies of reference (think dictionaries, taxonomies)
  - Deduplication across sources -- recognizing that the same thing may be named in different ways ("tumor" vs "neoplasm")
  - A fixed set of entity types
Type -- which relationships are meaningful?
  - A fixed set of predicates
  - Domain and range for each predicate -- constraints on which kinds of things can be related, so we do not assert things like "aspirin inhibits New York"
  - Structural validity -- a claim is valid if it is well-formed with respect to the graph's type system, independent of whether it is true or false
Provenance -- where did this claim come from?
  - Source traceability
  - Evidence aggregation
  - Confidence grounded in origin
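
The identity requirement can be illustrated with a small sketch. It assumes a hand-built alias table rather than a real ontology service; the identifier `EX:0001` and the helpers `resolve` and `deduplicate` are hypothetical names invented for this example, not part of any library or authority.

```python
from typing import Optional

# Hypothetical alias table: surface strings -> canonical ID.
# "EX:0001" is a made-up placeholder, not an identifier from a real authority.
ALIASES = {
    "tumor": "EX:0001",
    "tumour": "EX:0001",
    "neoplasm": "EX:0001",
}


def resolve(mention: str) -> Optional[str]:
    """Return the canonical ID for a mention, or None if it cannot be resolved."""
    return ALIASES.get(mention.strip().lower())


def deduplicate(mentions: list[str]) -> dict[str, list[str]]:
    """Group mentions by canonical ID; unresolved mentions stay visible as such."""
    groups: dict[str, list[str]] = {}
    for mention in mentions:
        key = resolve(mention) or "UNRESOLVED"
        groups.setdefault(key, []).append(mention)
    return groups


groups = deduplicate(["Tumor", "neoplasm", "growth"])
```

Here "Tumor" and "neoplasm" collapse onto one identity, while "growth" is neither guessed at nor silently dropped: it is kept under an explicit unresolved bucket, so the ambiguity remains inspectable.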

A system cannot reason reliably about the world unless it represents that world with stable identities, constrained relationships, and explicit evidence.

Machine reasoning requires a data model, not just a model.

The Typed Graph

When we build a knowledge graph where we

  - fix the set of entity types and the set of predicates
  - establish domain and range constraints for each predicate
  - require that entities be assigned canonical IDs whenever possible
  - preserve provenance information for all relationships

we are no longer dealing with strings, but with a structured representation of the world. This is what we call a typed graph.
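
A minimal sketch of such a typed graph follows. Everything in it is an assumption made for illustration: the entity types, the single `inhibits` predicate, the `EX:` placeholder IDs, the `doc-42` source label, and the `TypedGraph` class itself. It checks only that a claim is well-formed and sourced, not that it is true.

```python
from dataclasses import dataclass

# Fixed vocabulary for this sketch; every name here is illustrative.
ENTITY_TYPES = {"Drug", "Enzyme", "City"}
# predicate -> (domain: allowed subject type, range: allowed object type)
PREDICATES = {"inhibits": ("Drug", "Enzyme")}


@dataclass(frozen=True)
class Entity:
    canonical_id: str  # placeholder IDs, not real authority identifiers
    entity_type: str


@dataclass(frozen=True)
class Claim:
    subject: Entity
    predicate: str
    obj: Entity
    source: str  # provenance: where the claim came from


class TypedGraph:
    def __init__(self) -> None:
        self.claims: list[Claim] = []

    def add(self, claim: Claim) -> None:
        """Admit a claim only if it is well-formed; truth is not checked."""
        if claim.predicate not in PREDICATES:
            raise ValueError(f"unknown predicate: {claim.predicate}")
        for entity in (claim.subject, claim.obj):
            if entity.entity_type not in ENTITY_TYPES:
                raise ValueError(f"unknown entity type: {entity.entity_type}")
        domain, range_ = PREDICATES[claim.predicate]
        if claim.subject.entity_type != domain or claim.obj.entity_type != range_:
            raise ValueError(f"{claim.predicate} requires {domain} -> {range_}")
        if not claim.source:
            raise ValueError("claim has no provenance")
        self.claims.append(claim)


aspirin = Entity("EX:drug-1", "Drug")
cox1 = Entity("EX:enzyme-1", "Enzyme")
new_york = Entity("EX:city-1", "City")

graph = TypedGraph()
graph.add(Claim(aspirin, "inhibits", cox1, source="doc-42"))

try:
    graph.add(Claim(aspirin, "inhibits", new_york, source="doc-42"))
    rejected = False
except ValueError:
    rejected = True  # the category error never enters the graph
```

The design choice worth noticing is that validation happens at admission time. The malformed claim is not stored and later flagged; it is refused before it can take part in any traversal.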

A typed graph does not guarantee that its conclusions are true. It guarantees something more fundamental: that its claims are well-formed, grounded in identifiable entities, and traceable to their sources.

Large classes of nonsense and hallucination are not corrected -- they are never admitted into the system at all. Category errors are rejected. Ambiguous references are resolved or made explicit. Unsupported claims are visible as such.

The result is a system whose outputs may still be wrong, but are always inspectable, reproducible, and subject to correction.

That is the minimum standard for reasoning in high-stakes domains.

Unstructured Text
       |
       v
  Extraction (LLM)
       |
       v
  Mentions (strings)
       |
       v
  Identity Resolution
  -- canonical IDs
  -- deduplication
       |
       v
  Typed Graph
  -- entity types
  -- predicates
  -- domain/range
  -- provenance
       |
       v
  Queries / Traversals
       |
       v
  Machine Reasoning
  -- multi-step
  -- composable
  -- inspectable

Preface

The knowledge is in the graph. The LLM can't get to it.

That is the problem this book solves. Structured knowledge graphs -- DBpedia, Wikidata, UniProt, domain-specific SPARQL endpoints, internal Neo4j instances -- contain enormous amounts of curated, queryable knowledge. Almost none of it is accessible to a language model in practice, because the interfaces that exist were built for human authors, not machine reasoners. SPARQL and Cypher are expressive and precise. They are also, for an LLM trying to answer a question in real time, practically unusable on anything non-trivial. The hallucinated predicates, wrong URI prefixes, and syntactically valid but semantically broken queries are not bugs to be fixed. They follow directly from how language models work and how those query languages are structured. The interface is the problem.

This book is about the missing interface.

Knowledge Graphs from Unstructured Text is about getting knowledge in -- extracting entities, relationships, and provenance from raw documents and assembling them into a queryable graph. This book is about getting knowledge out, and specifically out in a form a language model can actually use. Readers who have an existing graph -- a Wikidata endpoint, a corporate triple store, a Neo4j instance, a kgraph-derived Postgres database -- can start here. Readers building from scratch should read that book first.

The coupling point between the two books is a single Python class: KGraphPostgresBackend. kgraph writes; BFS-QL reads. Together they cover the full pipeline from raw text to a language model that can reason over what that text contained.

There is a larger argument in this book that deserves to be stated up front, because it is easy to miss while working through the protocol details.

The canonical identifier authorities -- MeSH for diseases, RxNorm for drugs, UniProt for proteins, HGNC for genes -- have existed for decades. They were designed as identity resolution tools: a way for different databases, research groups, and institutions to refer to the same entity without ambiguity. They do that job well.

What nobody designed them for, and what this book argues they have quietly become, is the interoperability layer for LLM reasoning across knowledge sources. When two graphs both anchor their disease entities to MeSH terms, an LLM holding connections to both graphs can traverse the boundary between them without any special protocol support. The shared canonical ID is the bridge. It was always the bridge. It just didn't matter until language models needed to cross it.

This means every knowledge graph that uses canonical IDs correctly is automatically composable with every other graph that anchors to the same authorities. The companion volume argues for canonical identity as a quality and provenance concern -- get it right or your graph will be inconsistent and hard to maintain. That argument is correct as far as it goes. But it understates the stakes. Canonical identity is also a composition concern. Graphs that anchor to established ontological authorities compose naturally with each other and with the open linked-data ecosystem. Graphs that mint their own IDs are islands.
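
The composition argument can be sketched in a few lines. This is not BFS-QL and not any real endpoint: the graphs are plain lists of triples, every `EX:` and `local:` identifier is a hypothetical placeholder, and `neighbors` and `cross_graph_hop` are helper names invented for the example.

```python
# Two independently built graphs as (subject, predicate, object) triples.
# "EX:disease-1" stands in for an ID issued by a shared authority.
DRUG_GRAPH = [
    ("EX:drug-1", "treats", "EX:disease-1"),
]
GENE_GRAPH = [
    ("EX:gene-1", "associated_with", "EX:disease-1"),
    ("EX:gene-2", "associated_with", "EX:disease-2"),
]
# A graph that mints its own ID for the same disease.
ISLAND_GRAPH = [
    ("EX:gene-1", "associated_with", "local:17"),
]


def neighbors(graph, node):
    """Return (predicate, other_node) pairs for edges touching node, in either direction."""
    out = []
    for s, p, o in graph:
        if s == node:
            out.append((p, o))
        elif o == node:
            out.append((p, s))
    return out


def cross_graph_hop(graph_a, graph_b, start):
    """Step from start to a neighbor in graph_a, then continue from that same ID in graph_b."""
    paths = []
    for p1, bridge in neighbors(graph_a, start):
        for p2, end in neighbors(graph_b, bridge):
            paths.append((start, p1, bridge, p2, end))
    return paths


paths = cross_graph_hop(DRUG_GRAPH, GENE_GRAPH, "EX:drug-1")
island_paths = cross_graph_hop(DRUG_GRAPH, ISLAND_GRAPH, "EX:drug-1")
```

The first call finds a drug-to-gene path because both graphs name the disease with the same ID. The second finds nothing: the island graph holds the same knowledge, but no identifier connects it to the drug graph.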

The LLM is the reasoner. BFS-QL is the interface. Shared canonical IDs are the bridges between graphs. All three pieces are available right now.

This book is organized in four parts. Part I makes the case that the interface problem is real and that the natural first answers -- let the LLM write SPARQL, wrap the graph in a document retriever -- do not solve it. Part II specifies the BFS-QL protocol: six Model Context Protocol (MCP) tools, a flat query format, and the design decisions behind them. Part III shows how to build a backend, with worked implementations for SPARQL endpoints, Postgres/pgvector, and Neo4j. Part IV zooms out to graph composition, the SaaS layer, and what comes next.

The appendix contains the complete BFS-QL reference -- query format, response format, and LLM prompt templates -- suitable for copying directly into implementations.