Chapter 1: What Is Canonical Identity and Why Does It Matter?

The Same Thing, Many Names

Pick any well-studied drug and search for it across a corpus of biomedical papers. You will find it referred to by its generic name, its brand names, its chemical name, its abbreviation, and occasionally a misspelling that has propagated through citations. Desmopressin appears as "desmopressin", "DDAVP", "dDAVP", "1-deamino-8-D-arginine vasopressin", "desmopressin acetate", and in older papers simply as "the synthetic vasopressin analogue." In a graph built from extracted mentions without identity resolution, these are six unconnected nodes. Every relationship involving desmopressin is split across them. Queries return partial results. Confidence aggregation is meaningless. The graph is sophisticated extraction masquerading as structured knowledge.

This is not a corner case. It is the default. Every entity in every technical domain accumulates surface form variation over time. Genes have official symbols and common names and names that were superseded when two research groups discovered the same gene independently. Diseases have clinical names, eponyms, and ICD codes. Chemicals have IUPAC names, trade names, and CAS registry numbers. The variation is not noise to be cleaned up -- it is a faithful record of how human knowledge actually develops, in parallel, across communities that do not always talk to each other.

Canonical identity resolution is the process of deciding that all these surface forms refer to the same thing and assigning them a single stable identifier. The identity server is the service that does this.
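
As a minimal sketch of the idea, assuming a hand-built synonym table rather than the identity server's actual lookup chain, resolution amounts to mapping every normalized surface form to one identifier. The table and the `normalize` and `canonical_id` helpers are hypothetical names for illustration. The identifier `RxNorm:3251` is the one this chapter uses for desmopressin, and treating the acetate salt as the same entity follows the chapter's example rather than any authority's rules.

```python
from typing import Optional

# Hypothetical synonym table: normalized surface form -> canonical ID.
SYNONYMS = {
    "desmopressin": "RxNorm:3251",
    "ddavp": "RxNorm:3251",
    "1-deamino-8-d-arginine vasopressin": "RxNorm:3251",
    "desmopressin acetate": "RxNorm:3251",
}


def normalize(mention: str) -> str:
    """Lowercase the mention and collapse runs of whitespace."""
    return " ".join(mention.lower().split())


def canonical_id(mention: str) -> Optional[str]:
    """Return the canonical ID for a mention, or None if it is unknown."""
    return SYNONYMS.get(normalize(mention))


mentions = ["Desmopressin", "DDAVP", "dDAVP", "desmopressin  acetate"]
ids = {canonical_id(m) for m in mentions}
print(ids)  # {'RxNorm:3251'}: four surface forms, one entity
```

A static table like this only covers forms someone has already seen. Handling unseen forms is what the rest of the lookup machinery is for.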

Nothing about that idea is specific to a knowledge graph as a storage shape. The same problem appears in a relational warehouse, a document database, a lakehouse table, or a folder of CSV exports: if the same real-world entity can appear under more than one string, you either resolve those strings to one stable identifier or accept broken joins, wrong aggregates, and inconsistent merges across systems. This book speaks the language of graphs because the companion volumes build and query a graph, and because multi-hop structure makes the failure modes vivid -- but canonical identity is a requirement of faithful data representation, not of any particular physical schema.

Identity Is Load-Bearing

A knowledge graph without canonical identity is not a degraded version of a knowledge graph with canonical identity. It is a different kind of artifact entirely -- one that cannot support multi-hop reasoning across sources, cannot aggregate evidence across papers, cannot compose with other graphs, and cannot be trusted in high-stakes applications. Identity is not a quality improvement. It is load-bearing structure.

Consider what becomes possible when every entity has a canonical ID:

Multi-hop reasoning works correctly. A query asking "what drugs have been used to treat conditions caused by the gene this mutation affects" requires traversing three relationship types. If the gene appears under two different names in two different papers, the traversal breaks at the second hop. Canonical identity closes the gap.
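
The broken traversal can be reproduced with a toy edge list. Everything in this sketch is made up for illustration: the mutation, gene, condition, and drug names, the `gene:001` identifier, and the `drugs_for_mutation` helper. The only point is that the same three-hop query returns nothing on raw mentions and the expected answer once both gene names resolve to one ID.

```python
from collections import defaultdict

# Toy edges extracted from two papers; every name here is made up.
# Paper 1 calls the gene "GENE-A"; paper 2 calls it "gene alpha".
raw_edges = [
    ("mut-1", "affects", "GENE-A"),           # paper 1
    ("gene alpha", "causes", "condition-X"),  # paper 2
    ("drug-Y", "treats", "condition-X"),      # paper 2
]

# Hypothetical resolution table mapping both gene names to one ID.
resolve = {"GENE-A": "gene:001", "gene alpha": "gene:001"}


def drugs_for_mutation(edges, mutation):
    """Follow mutation -affects-> gene -causes-> condition <-treats- drug."""
    out = defaultdict(set)  # (subject, predicate) -> objects
    inc = defaultdict(set)  # (predicate, object) -> subjects
    for s, p, o in edges:
        out[(s, p)].add(o)
        inc[(p, o)].add(s)
    drugs = set()
    for gene in out[(mutation, "affects")]:
        for condition in out[(gene, "causes")]:
            drugs |= inc[("treats", condition)]
    return drugs


resolved_edges = [(resolve.get(s, s), p, resolve.get(o, o)) for s, p, o in raw_edges]

print(drugs_for_mutation(raw_edges, "mut-1"))       # set()
print(drugs_for_mutation(resolved_edges, "mut-1"))  # {'drug-Y'}
```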

Evidence aggregation is meaningful. The claim "desmopressin inhibits cortisol secretion" appearing in twelve papers is stronger than the same claim appearing in one. But this aggregation is only possible if all twelve instances resolve to the same entity. Without canonical identity, you have twelve separate claims scattered across as many as six different nodes, and no way to count them as one.
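
A small counting example makes the difference concrete. The claim tuples and the `subject_ids` table below are hypothetical stand-ins for extractor and resolver output; only the subject is resolved, to keep the sketch short.

```python
from collections import Counter

# Hypothetical extracted claims: (subject mention, predicate, object mention).
claims = [
    ("desmopressin", "inhibits", "cortisol secretion"),
    ("DDAVP", "inhibits", "cortisol secretion"),
    ("dDAVP", "inhibits", "cortisol secretion"),
    ("desmopressin acetate", "inhibits", "cortisol secretion"),
    ("desmopressin", "inhibits", "cortisol secretion"),
]

# Hypothetical resolver output for the subject mentions.
subject_ids = {
    "desmopressin": "RxNorm:3251",
    "DDAVP": "RxNorm:3251",
    "dDAVP": "RxNorm:3251",
    "desmopressin acetate": "RxNorm:3251",
}

unresolved = Counter(claims)
resolved = Counter((subject_ids[s], p, o) for s, p, o in claims)

print(len(unresolved), max(unresolved.values()))  # 4 2
print(len(resolved), max(resolved.values()))      # 1 5
```

On raw mentions the best-supported claim appears to have two supporting instances. After resolution the same five instances count as evidence for a single claim.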

Composition across graphs is automatic. When a graph built from PubMed papers and a graph built from clinical trial data both anchor their drug entities to RxNorm, a query can traverse from a research finding to a clinical trial outcome without any special bridging logic. The shared authority is the bridge.
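
In sketch form, composing two graphs that share an authority is nothing more than a set union. The edges, the `finding:` and `trial:` labels, and the `compose` and `two_hop` helpers are invented for illustration and do not reflect any real PubMed or clinical trial record.

```python
# Two independently built toy graphs; all edges and labels are made up.
literature_graph = [("finding:paper-17", "reports_effect_of", "RxNorm:3251")]
trials_graph = [("RxNorm:3251", "evaluated_in", "trial:example-001")]


def compose(*graphs):
    """Union of edge sets; shared IDs connect the graphs with no bridging step."""
    edges = set()
    for graph in graphs:
        edges.update(graph)
    return edges


def two_hop(edges, start):
    """Return the nodes reachable from start in exactly two hops."""
    first = {o for s, _, o in edges if s == start}
    return {o for s, _, o in edges if s in first}


combined = compose(literature_graph, trials_graph)
print(two_hop(combined, "finding:paper-17"))  # {'trial:example-001'}
```

Had the two graphs used different strings for the drug, the union would still succeed but the two-hop query would return an empty set.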

The Epistemic Commons

The authorities the identity server consults -- MeSH, RxNorm, HGNC, UniProt, ChEMBL -- are not bureaucratic naming systems. They are the accumulated judgment of expert communities about how to organize their domain of knowledge. When you anchor an entity to a MeSH term, you are not just assigning a unique key. You are connecting that entity to its place in a taxonomy built by the National Library of Medicine over decades: its definition, its hierarchical position among related concepts, its known synonyms, its cross-references to related terms in adjacent domains.

This is what it means to place a fact. An unanchored claim that "desmopressin inhibits cortisol" is a string in a database. An anchored claim that RxNorm:3251 inhibits MeSH:D006854 is a fact located in the edifice of human biomedical knowledge, connected to everything the biomedical community knows about desmopressin and cortisol, traceable to the source that made the claim, and composable with every other graph that uses the same authorities.

The epistemic commons -- the shared identifier infrastructure built by the biomedical, chemical, legal, and geographic communities -- was built for human use. The identity server makes it available to machines. That is not a small thing.

What the Identity Server Does

The identity server is responsible for five operations:

Resolve: Given a mention string and an entity type, return a canonical ID. This is the primary operation. It consults the lookup chain -- exact match, fuzzy match, embedding similarity -- and falls back to creating a provisional entity if no match is found.

Promote: Given a provisional entity that has accumulated sufficient evidence, upgrade it to canonical status. The promotion threshold is domain-configurable.

Find synonyms: Given a canonical ID, return all known surface forms. Used for query-time synonym expansion and graph inspection.

Merge: Given two entities determined to be the same, produce one canonical record. Survivor selection is domain-configurable. Provenance from both entities is preserved.

On entity added: A hook called after any entity is added or updated. Used for downstream notifications, cache invalidation, and logging.
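
The operations described above can be sketched over an in-memory store. This is an illustration under stated assumptions, not the identity server's implementation: the `Entity` and `InMemoryIdentityServer` names, the optional `source` argument, the use of distinct sources as "evidence", the default threshold and cutoff, and the survivor policy are all hypothetical. The fuzzy stage uses the standard library's `difflib`, and the embedding stage is left out.

```python
import difflib
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Entity:
    entity_id: str
    entity_type: str
    status: str  # "canonical" or "provisional"
    surface_forms: Set[str] = field(default_factory=set)
    provenance: List[str] = field(default_factory=list)  # source document IDs


class InMemoryIdentityServer:
    """Toy store; every name and default here is an illustrative assumption."""

    def __init__(self, promotion_threshold: int = 3, fuzzy_cutoff: float = 0.85,
                 on_entity_added: Optional[Callable[[Entity], None]] = None):
        self.entities: Dict[str, Entity] = {}
        self.index: Dict[Tuple[str, str], str] = {}  # (type, lowercased form) -> ID
        self.promotion_threshold = promotion_threshold
        self.fuzzy_cutoff = fuzzy_cutoff
        self.on_entity_added = on_entity_added or (lambda entity: None)

    def add(self, entity: Entity) -> None:
        """Store and index an entity, then fire the hook (on add or update)."""
        self.entities[entity.entity_id] = entity
        for form in entity.surface_forms:
            self.index[(entity.entity_type, form.lower())] = entity.entity_id
        self.on_entity_added(entity)

    def resolve(self, mention: str, entity_type: str,
                source: Optional[str] = None) -> str:
        needle = mention.lower()
        entity_id = self.index.get((entity_type, needle))  # stage 1: exact match
        if entity_id is None:                               # stage 2: fuzzy match
            names = [n for (t, n) in self.index if t == entity_type]
            close = difflib.get_close_matches(needle, names, n=1,
                                              cutoff=self.fuzzy_cutoff)
            entity_id = self.index[(entity_type, close[0])] if close else None
        # Stage 3 (embedding similarity) is omitted from this sketch.
        if entity_id is None:                               # fallback: provisional
            entity_id = f"provisional:{uuid.uuid4().hex[:8]}"
            self.entities[entity_id] = Entity(entity_id, entity_type, "provisional")
        entity = self.entities[entity_id]
        entity.surface_forms.add(mention)
        if source is not None:
            entity.provenance.append(source)
        self.add(entity)
        return entity_id

    def promote(self, entity_id: str) -> bool:
        """Promote if the number of distinct sources meets the threshold."""
        entity = self.entities[entity_id]
        enough = len(set(entity.provenance)) >= self.promotion_threshold
        if entity.status == "provisional" and enough:
            entity.status = "canonical"
            self.add(entity)
            return True
        return False

    def find_synonyms(self, canonical_id: str) -> List[str]:
        return sorted(self.entities[canonical_id].surface_forms)

    def merge(self, id_a: str, id_b: str) -> str:
        a, b = self.entities[id_a], self.entities[id_b]
        # Example survivor policy: prefer a canonical record over a provisional one.
        if a.status == "provisional" and b.status == "canonical":
            survivor, loser = b, a
        else:
            survivor, loser = a, b
        survivor.surface_forms |= loser.surface_forms
        # dict.fromkeys de-duplicates while keeping first-seen order.
        survivor.provenance = list(dict.fromkeys(survivor.provenance + loser.provenance))
        del self.entities[loser.entity_id]
        self.add(survivor)  # re-points the loser's surface forms at the survivor
        return survivor.entity_id
```

A short usage example, again with made-up document IDs:

```python
server = InMemoryIdentityServer(promotion_threshold=2)
server.add(Entity("RxNorm:3251", "drug", "canonical", {"desmopressin", "DDAVP"}))

server.resolve("dDAVP", "drug", "doc-1")        # exact match after lowercasing
server.resolve("desmopresin", "drug", "doc-2")  # fuzzy match on a misspelling
new_id = server.resolve("synthetic vasopressin analogue", "drug", "doc-3")
server.merge(new_id, "RxNorm:3251")             # a curator decides they are the same
```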

These five operations are the complete interface. Everything else -- the lookup chain, the caching strategy, the Postgres schema, the domain service HTTP calls -- is implementation detail in service of these five operations.
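
Written down as an abstract interface, the five operations might look like the sketch below. The class name, method names, and signatures are assumptions made for illustration; the text above fixes only what each operation takes and returns in broad terms.

```python
from abc import ABC, abstractmethod
from typing import List


class IdentityServer(ABC):
    """Hypothetical interface mirroring the five operations named above."""

    @abstractmethod
    def resolve(self, mention: str, entity_type: str) -> str:
        """Return a canonical ID, or the ID of a newly created provisional entity."""

    @abstractmethod
    def promote(self, entity_id: str) -> bool:
        """Upgrade a provisional entity to canonical; report whether it happened."""

    @abstractmethod
    def find_synonyms(self, canonical_id: str) -> List[str]:
        """Return all known surface forms for a canonical ID."""

    @abstractmethod
    def merge(self, entity_id_a: str, entity_id_b: str) -> str:
        """Merge two entities and return the surviving ID."""

    def on_entity_added(self, entity_id: str) -> None:
        """Hook called after an add or update; a no-op unless overridden."""
```

Making the hook a concrete no-op is one possible design choice: implementations that need no notifications can ignore it, while the four abstract methods must be supplied before the class can be instantiated.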