Foreword: A Manifesto for Machine Knowledge¶
This foreword appears in all three volumes of the Graphwright series.
We are now in an age of machine reasoning, and some of this reasoning is done in high-stakes domains: medicine, law, engineering, spaceflight. Lives and livelihoods can be affected by incorrect conclusions or decisions. The cost of error is real and significant. LLMs are here, they are staying, and there is no turning back the clock.
As we all know, LLMs have weaknesses. Their mastery of language syntax is astonishing, but they don't understand "this refers to that," or "these two things are the same." They have no persistent notion of identity. They do not inhabit a world of things connected by relationships. They do not track logical consequence from one step to the next.
They cannot reason across multiple causal steps because they cannot reliably reason across a single causal step. They do not know what things are or how they behave, only how they are talked about.
And so we build RAG (retrieval-augmented generation) systems, hoping to improve the situation. We improve the LLM's focus on material that is more relevant, more similar, better connected to sources of information, and it helps.
But we are still dealing with strings, not things.
We still cannot say "this refers to that," or "these two mentions refer to the same entity." We still cannot follow a chain of causality or enforce a sequence of logical steps. We retrieve passages, but we do not operate on meaning.
If RAG doesn't close the gap, what would?
- Identity -- what are we talking about?
- Canonical IDs -- identifiers anchored in curated human knowledge (think Wikipedia)
- Authoritative ontologies -- shared bodies of reference (think dictionaries, taxonomies)
- Deduplication across sources -- recognizing that the same thing may be named in different ways ("tumor" vs "neoplasm")
-
A fixed set of entity types
-
Type -- which relationships are meaningful?
- A fixed set of predicates
- Domain and range for each predicate -- constraints on which kinds of things can be related, so we do not assert things like "aspirin inhibits New York"
-
Structural validity -- a claim is valid if it is well-formed with respect to the graph's type system, independent of whether it is true or false
-
Provenance -- where did this claim come from?
- Source traceability
- Evidence aggregation
- Confidence grounded in origin
A system cannot reason reliably about the world unless it represents that world with stable identities, constrained relationships, and explicit evidence.
Machine reasoning requires a data model, not just a model.
The Typed Graph¶
When we build a knowledge graph where we
- fix the set of entity types and the set of predicates
- establish domain and range constraints for each predicate
- require that entities be assigned canonical IDs whenever possible
- preserve provenance information for all relationships
we are no longer dealing with strings, but with a structured representation of the world. This is what we call a typed graph.
A typed graph does not guarantee that its conclusions are true. It guarantees something more fundamental: that its claims are well-formed, grounded in identifiable entities, and traceable to their sources.
Large classes of nonsense and hallucination are not corrected -- they are never admitted into the system at all. Category errors are rejected. Ambiguous references are resolved or made explicit. Unsupported claims are visible as such.
The result is a system whose outputs may still be wrong, but are always inspectable, reproducible, and subject to correction.
That is the minimum standard for reasoning in high-stakes domains.
Unstructured Text
|
v
Extraction (LLM)
|
v
Mentions (strings)
|
v
Identity Resolution
-- canonical IDs
-- deduplication
|
v
Typed Graph
-- entity types
-- predicates
-- domain/range
-- provenance
|
v
Queries / Traversals
|
v
Machine Reasoning
-- multi-step
-- composable
-- inspectable
Preface¶
The two companion volumes -- Knowledge Graphs from Unstructured Text and BFS-QL: A Graph Query Protocol for Language Models -- both depend on canonical identity but neither has room to fully explain it. The first book builds a knowledge graph from unstructured text. The second book serves that graph to a language model. Both books gesture at the identity server and say "this is important" and then move on.
This book is what they are pointing at. The same ideas apply whenever structured data must be deduplicated or joined across sources -- not only when that data happens to be stored as a graph.
Canonical identity is the difference between a collection of extracted facts and a knowledge graph that can be trusted. It is the difference between a node labeled "desmopressin" and a node anchored to RxNorm:3251, connected to the accumulated judgment of the biomedical community about what that compound is, what it does, and how it relates to everything else they have named. It is the difference between a fact and a located fact -- one that can be reasoned about, connected across sources, and trusted because the community that defined it is known.
The extraction bottleneck that held back knowledge representation for fifty years is now broken. Large language models can read a scientific paper and produce structured claims at a cost that was unimaginable a decade ago. The remaining question is whether those claims can be trusted -- traced to their sources, deduplicated across papers, and anchored to shared authorities that give them stable meaning. That is the question this book answers.
The identity server is a microservice. It has a clean interface. It can be deployed as a Docker image and pointed at any knowledge graph project. But it is also an argument about epistemology: that trustworthiness in a knowledge system is not a property you add after the fact, it is a constraint you either satisfy from the beginning or violate permanently.
This book is for engineers building knowledge graphs who want that constraint to be architectural, not aspirational.