Foreword: A Manifesto for Machine Knowledge

This foreword appears in all three volumes of the Graphwright series.

We are now in an age of machine reasoning, and some of this reasoning is done in high-stakes domains: medicine, law, engineering, spaceflight. Lives and livelihoods can be affected by incorrect conclusions or decisions. The cost of error is real and significant. LLMs are here, they are staying, and there is no turning back the clock.

As we all know, LLMs have weaknesses. Their mastery of language syntax is astonishing, but they don't understand "this refers to that," or "these two things are the same." They have no persistent notion of identity. They do not inhabit a world of things connected by relationships. They do not track logical consequence from one step to the next.

They cannot reason across multiple causal steps because they cannot reliably reason across a single causal step. They do not know what things are or how they behave, only how they are talked about.

And so we build RAG (retrieval-augmented generation) systems, hoping to improve the situation. We focus the LLM on material that is more relevant, more similar to the query, and better connected to sources of information, and it helps.

But we are still dealing with strings, not things.

We still cannot say "this refers to that," or "these two mentions refer to the same entity." We still cannot follow a chain of causality or enforce a sequence of logical steps. We retrieve passages, but we do not operate on meaning.

If RAG doesn't close the gap, what would?

Identity -- what are we talking about?
  Canonical IDs -- identifiers anchored in curated human knowledge (think Wikipedia)
  Authoritative ontologies -- shared bodies of reference (think dictionaries, taxonomies)
  Deduplication across sources -- recognizing that the same thing may be named in different ways ("tumor" vs "neoplasm")
  A fixed set of entity types
Type -- which relationships are meaningful?
  A fixed set of predicates
  Domain and range for each predicate -- constraints on which kinds of things can be related, so we do not assert things like "aspirin inhibits New York"
  Structural validity -- a claim is valid if it is well-formed with respect to the graph's type system, independent of whether it is true or false
Provenance -- where did this claim come from?
  Source traceability
  Evidence aggregation
  Confidence grounded in origin
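
The type constraints listed above can be made concrete in a few lines. What follows is a minimal Python sketch, not code from any particular project: the entity types, the predicate table, the `EX:` identifiers, and the `check_claim` helper are all hypothetical and chosen only for illustration. It shows a structural check that depends on the schema alone and says nothing about whether a claim is true.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical schema: a fixed set of entity types and a fixed set of
# predicates, each with a domain (subject type) and a range (object type).
ENTITY_TYPES = {"Drug", "Enzyme", "Disease", "Place"}
PREDICATES = {
    "inhibits": ("Drug", "Enzyme"),
    "treats": ("Drug", "Disease"),
}

@dataclass(frozen=True)
class Entity:
    canonical_id: str   # placeholder IDs, not real authority identifiers
    entity_type: str
    name: str

@dataclass(frozen=True)
class Claim:
    subject: Entity
    predicate: str
    obj: Entity
    source: str         # provenance: where the claim came from

def check_claim(claim: Claim) -> list[str]:
    """Return a list of structural problems; an empty list means well-formed."""
    problems = []
    for entity in (claim.subject, claim.obj):
        if entity.entity_type not in ENTITY_TYPES:
            problems.append(f"unknown entity type: {entity.entity_type}")
    if claim.predicate not in PREDICATES:
        problems.append(f"unknown predicate: {claim.predicate}")
        return problems
    domain, range_ = PREDICATES[claim.predicate]
    if claim.subject.entity_type != domain:
        problems.append(f"subject must be {domain}, got {claim.subject.entity_type}")
    if claim.obj.entity_type != range_:
        problems.append(f"object must be {range_}, got {claim.obj.entity_type}")
    return problems

aspirin = Entity("EX:aspirin", "Drug", "aspirin")
cox1 = Entity("EX:cox1", "Enzyme", "cyclooxygenase-1")
new_york = Entity("EX:new-york", "Place", "New York")

good = Claim(aspirin, "inhibits", cox1, source="doc-001")
bad = Claim(aspirin, "inhibits", new_york, source="doc-001")

print(check_claim(good))
print(check_claim(bad))
```

In this sketch the second claim is rejected because of its shape, before anyone asks whether it is true. A real schema would likely need subtype hierarchies and predicates with more than one permitted domain or range, which this version leaves out.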

A system cannot reason reliably about the world unless it represents that world with stable identities, constrained relationships, and explicit evidence.

Machine reasoning requires a data model, not just a model.

The Typed Graph

When we build a knowledge graph where we

fix the set of entity types and the set of predicates
establish domain and range constraints for each predicate
require that entities be assigned canonical IDs whenever possible
preserve provenance information for all relationships

we are no longer dealing with strings, but with a structured representation of the world. This is what we call a typed graph.
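
The requirement that entities receive canonical IDs "whenever possible" implies three outcomes for any mention: resolved, ambiguous, or unresolved. The sketch below illustrates that idea under stated assumptions: the `SYNONYMS` table stands in for a curated authority, the `EX:` identifiers are placeholders rather than real vocabulary IDs, and `resolve` is a hypothetical helper, not an API from any library.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical synonym table standing in for a curated authority.
# Keys are normalized mention strings; values are candidate canonical IDs.
SYNONYMS: dict[str, set[str]] = {
    "tumor": {"EX:0001"},
    "neoplasm": {"EX:0001"},
    "cold": {"EX:0002", "EX:0003"},  # e.g. an illness vs. a low temperature
}

@dataclass(frozen=True)
class Resolution:
    mention: str
    canonical_id: str | None      # set only when exactly one candidate exists
    candidates: tuple[str, ...]   # every candidate, kept for inspection
    status: str                   # "resolved", "ambiguous", or "unresolved"

def resolve(mention: str) -> Resolution:
    """Map a mention string to a canonical ID, or say explicitly why not."""
    key = mention.strip().lower()
    candidates = tuple(sorted(SYNONYMS.get(key, ())))
    if len(candidates) == 1:
        return Resolution(mention, candidates[0], candidates, "resolved")
    if candidates:
        return Resolution(mention, None, candidates, "ambiguous")
    return Resolution(mention, None, (), "unresolved")

for text in ("Tumor", "neoplasm", "cold", "flurb"):
    print(resolve(text))
```

Here "Tumor" and "neoplasm" land on the same identifier, so two sources that use different words contribute to one node. The ambiguous and unresolved cases are recorded as such instead of being guessed at.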

A typed graph does not guarantee that its conclusions are true. It guarantees something more fundamental: that its claims are well-formed, grounded in identifiable entities, and traceable to their sources.

Large classes of nonsense and hallucination are not corrected -- they are never admitted into the system at all. Category errors are rejected. Ambiguous references are resolved or made explicit. Unsupported claims are visible as such.
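
One way to make unsupported claims "visible as such" is to store, for every claim, the set of sources that assert it, so that a claim with no sources is identifiable by its empty set. The following is a rough sketch of that bookkeeping. The `EvidenceStore` class, its method names, and the document IDs are hypothetical, and it deliberately stops short of computing a confidence score, since any weighting scheme would be a further assumption.

```python
from __future__ import annotations
from collections import defaultdict

class EvidenceStore:
    """Hypothetical store that aggregates sources per claim."""

    def __init__(self) -> None:
        # Maps (subject, predicate, object) to the set of sources asserting it.
        self._sources: dict[tuple[str, str, str], set[str]] = defaultdict(set)

    def assert_claim(self, subject: str, predicate: str, obj: str,
                     source: str | None = None) -> None:
        # Indexing the defaultdict registers the claim even with no source.
        supporting = self._sources[(subject, predicate, obj)]
        if source is not None:
            supporting.add(source)

    def sources(self, subject: str, predicate: str, obj: str) -> list[str]:
        """All distinct sources for one claim, in sorted order."""
        return sorted(self._sources.get((subject, predicate, obj), ()))

    def unsupported(self) -> list[tuple[str, str, str]]:
        """Claims that were asserted without any source."""
        return sorted(t for t, s in self._sources.items() if not s)

store = EvidenceStore()
store.assert_claim("EX:aspirin", "inhibits", "EX:cox1", source="doc-001")
store.assert_claim("EX:aspirin", "inhibits", "EX:cox1", source="doc-007")
store.assert_claim("EX:aspirin", "treats", "EX:headache")  # no source given

print(store.sources("EX:aspirin", "inhibits", "EX:cox1"))
print(store.unsupported())
```

Repeated assertions of the same claim add to its evidence instead of creating duplicate edges, and the sourceless claim stays in the store where a reviewer can find it.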

The result is a system whose outputs may still be wrong, but are always inspectable, reproducible, and subject to correction.

That is the minimum standard for reasoning in high-stakes domains.

Unstructured Text
       |
       v
  Extraction (LLM)
       |
       v
  Mentions (strings)
       |
       v
  Identity Resolution
  -- canonical IDs, deduplication
       |
       v
  Typed Graph
  -- entity types, predicates, domain/range, provenance
       |
       v
  Queries / Traversals
       |
       v
  Machine Reasoning
  -- multi-step, composable, inspectable
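
The last two stages of the diagram can be illustrated with a small traversal. The sketch below assumes edges are stored as plain tuples with a `source` field; the `Edge` type, the `find_chain` function, the edge data, and the document IDs are all hypothetical, and the biology is simplified for illustration. The point is that a multi-step answer comes back as a chain of individual edges, each carrying its own provenance.

```python
from __future__ import annotations
from collections import deque
from typing import NamedTuple

class Edge(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # provenance for this single step

# Toy edges with placeholder sources; simplified for illustration.
EDGES = [
    Edge("aspirin", "inhibits", "COX-1", "doc-001"),
    Edge("COX-1", "produces", "thromboxane A2", "doc-002"),
    Edge("thromboxane A2", "promotes", "platelet aggregation", "doc-003"),
]

def find_chain(start: str, goal: str, edges: list[Edge]) -> list[Edge] | None:
    """Breadth-first search; returns the shortest chain of edges, or None."""
    outgoing: dict[str, list[Edge]] = {}
    for edge in edges:
        outgoing.setdefault(edge.subject, []).append(edge)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for edge in outgoing.get(node, []):
            if edge.obj not in visited:
                visited.add(edge.obj)
                queue.append((edge.obj, path + [edge]))
    return None

chain = find_chain("aspirin", "platelet aggregation", EDGES)
for step in chain or []:
    print(f"{step.subject} --{step.predicate}--> {step.obj}  [{step.source}]")
```

Because the result is a list of edges and not a generated sentence, every step can be checked against its source, and the same chain is returned each time the query is run against the same graph.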

Preface

My brother told an LLM:

I live near a carwash and the weather is warm and sunny. I want to get my car washed. Should I walk or drive there?

and of course he was told that on a nice day like this, he could use the exercise, so he should walk to the carwash. The model didn't know he would need his car in order to get it washed. The wrong answer was delivered with the same tone and confidence as a right one. That's the problem this book is about.

Large language models are fluent, capable, and unreliable in ways that are hard to predict in advance. They fail not randomly but systematically: at the boundary of what their training covered, at questions that require grounded reasoning about specific domains, at any task where being wrong matters. The fix is not to distrust them entirely. It is to give them something reliable to reason from: a structured, inspectable, domain-specific representation of what is actually known. That is a knowledge graph.

This is a book written in the age of large language models, but the central thesis is about machine reasoning in general, now and in the future. Knowledge graphs predate LLMs and will outlast them, because they capture, in an explicit and structured form that can be shared and curated, something essential to how humans understand and reason about complex fields. A machine cannot reason reliably about such fields without knowledge encoded in some form of graph. The software projects described here are demonstrations of this thesis, not the subject of it. LLMs are enablers for the creation of knowledge graphs, which were much discussed in the past but have only now become practical at scale.

This book intentionally addresses two kinds of reader. Some readers will come for the argument -- the history of knowledge representation, the case for explicit structure, the implications of what becomes possible when extraction is tractable. Others will come for the engineering -- the schema design, the pipeline architecture, the identity resolution, the serving layer. The book tries not to exclude either. Readers who want the argument can follow Part I and Part IV without getting lost in Part III. Readers who want the engineering will find it in Parts II and III, grounded in the argument of Part I.

Figure: A simple knowledge graph. Nodes represent entities; edges represent typed relationships between them. The labels carry meaning; without them, the structure is just topology.

This book builds its argument around a concrete project: a knowledge graph for medical literature, available at https://github.com/wware/kgraph. The project is the worked example throughout the book -- when the engineering chapters describe extraction pipelines, identity servers, and graph serving, that is the code they are describing.

The kgraph repository contains two example domains. examples/medlit is the medical literature implementation: parsing PubMed Central articles, extracting entities and relationships with an LLM (large language model), and resolving them to canonical IDs from biomedical authorities (UMLS, HGNC, RxNorm). examples/sherlock is a simpler literary example -- Sherlock Holmes stories from Project Gutenberg, characters and locations extracted, no external authority lookup needed -- that shows the same framework applied to a domain without canonical ID infrastructure. The medlit example is the primary worked case throughout the book; sherlock appears where a simpler illustration is useful. Both produce the same output format: a bundle that a query layer can load and serve.

The gist of this book (beyond providing a lot of how-to information) is that a knowledge graph in some form is

a necessity, not just a convenience, for reliable machine reasoning
an explicit representation of how human experts understand difficult topics