Part I: The Problem of Identity

Chapter 1: What Is Canonical Identity and Why Does It Matter?

The Same Thing, Many Names

Pick any well-studied drug and search for it across a corpus of biomedical papers. You will find it referred to by its generic name, its brand names, its chemical name, its abbreviation, and occasionally a misspelling that has propagated through citations. Desmopressin appears as "desmopressin", "DDAVP", "dDAVP", "1-deamino-8-D-arginine vasopressin", "desmopressin acetate", and in older papers simply as "the synthetic vasopressin analogue." In a graph built from extracted mentions without identity resolution, these are six unconnected nodes. Every relationship involving desmopressin is split across them. Queries return partial results. Confidence aggregation is meaningless. The graph is sophisticated extraction masquerading as structured knowledge.

This is not a corner case. It is the default. Every entity in every technical domain accumulates surface form variation over time. Genes have official symbols and common names and names that were superseded when two research groups discovered the same gene independently. Diseases have clinical names, eponyms, and ICD codes. Chemicals have IUPAC names, trade names, and CAS registry numbers. The variation is not noise to be cleaned up -- it is a faithful record of how human knowledge actually develops, in parallel, across communities that do not always talk to each other.

Canonical identity resolution is the process of deciding that all these surface forms refer to the same thing and assigning them a single stable identifier. The identity server is the service that does this.

Nothing about that idea is specific to a knowledge graph as a storage shape. The same problem appears in a relational warehouse, a document database, a lakehouse table, or a folder of CSV exports: if the same real-world entity can appear under more than one string, you either resolve those strings to one stable identifier or accept broken joins, wrong aggregates, and inconsistent merges across systems. This book speaks the language of graphs because the companion volumes build and query a graph, and because multi-hop structure makes the failure modes vivid -- but canonical identity is a requirement of faithful data representation, not of any particular physical schema.

Identity Is Load-Bearing

A knowledge graph without canonical identity is not a degraded version of a knowledge graph with canonical identity. It is a different kind of artifact entirely -- one that cannot support multi-hop reasoning across sources, cannot aggregate evidence across papers, cannot compose with other graphs, and cannot be trusted in high-stakes applications. Identity is not a quality improvement. It is load-bearing structure.

Consider what becomes possible when every entity has a canonical ID:

Multi-hop reasoning works correctly. A query asking "what drugs have been used to treat conditions caused by the gene this mutation affects" requires traversing three relationship types. If the gene appears under two different names in two different papers, the traversal breaks at the second hop. Canonical identity closes the gap.

Evidence aggregation is meaningful. The claim "desmopressin inhibits cortisol secretion" appearing in twelve papers is stronger than the same claim appearing in one. But this aggregation is only possible if all twelve instances resolve to the same entity. Without canonical identity, you have twelve claims scattered across as many as six different nodes, and no single node carries the combined weight of the evidence.

Composition across graphs is automatic. When a graph built from PubMed papers and a graph built from clinical trial data both anchor their drug entities to RxNorm, a query can traverse from a research finding to a clinical trial outcome without any special bridging logic. The shared authority is the bridge.
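The evidence-aggregation point above can be made concrete with a few lines of Python. This is a minimal sketch, not the identity server's implementation: the claims, the `CANONICAL` table, and the `canonical_subject` helper are all hypothetical, and the table stands in for whatever the identity server would return. The identifier `RxNorm:3251` is the one used for desmopressin elsewhere in this chapter.

```python
from collections import Counter

# Hypothetical extracted claims: (subject mention, predicate, object mention).
claims = [
    ("desmopressin", "inhibits", "cortisol secretion"),
    ("DDAVP", "inhibits", "cortisol secretion"),
    ("dDAVP", "inhibits", "cortisol secretion"),
    ("desmopressin acetate", "inhibits", "cortisol secretion"),
]

# Hypothetical synonym table keyed by case-folded surface form.
# In a real pipeline this mapping would come from the identity server.
CANONICAL = {
    "desmopressin": "RxNorm:3251",
    "ddavp": "RxNorm:3251",
    "desmopressin acetate": "RxNorm:3251",
}


def canonical_subject(mention: str) -> str:
    """Return the canonical ID for a mention, or the mention itself if unknown."""
    return CANONICAL.get(mention.casefold(), mention)


# Without resolution, each surface form is its own node with one claim.
raw_counts = Counter(subject for subject, _, _ in claims)

# With resolution, all four claims land on a single node.
resolved_counts = Counter(canonical_subject(subject) for subject, _, _ in claims)

print(len(raw_counts), "nodes before resolution")
print(len(resolved_counts), "node after resolution")
```

The unresolved counter spreads the evidence over four keys with one claim each; the resolved counter attaches all four claims to one identifier, which is the only form in which "how many sources support this?" has a meaningful answer.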

The Epistemic Commons

The authorities the identity server consults -- MeSH, RxNorm, HGNC, UniProt, ChEMBL -- are not bureaucratic naming systems. They are the accumulated judgment of expert communities about how to organize their domain of knowledge. When you anchor an entity to a MeSH term, you are not just assigning a unique key. You are connecting that entity to its place in a taxonomy built by the National Library of Medicine over decades: its definition, its hierarchical position among related concepts, its known synonyms, its cross-references to related terms in adjacent domains.

This is what it means to place a fact. An unanchored claim that "desmopressin inhibits cortisol" is a string in a database. An anchored claim that RxNorm:3251 inhibits MeSH:D003345 is a fact located in the edifice of human biomedical knowledge, connected to everything the biomedical community knows about desmopressin and cortisol, traceable to the source that made the claim, and composable with every other graph that uses the same authorities.

The epistemic commons -- the shared identifier infrastructure built by the biomedical, chemical, legal, and geographic communities -- was built for human use. The identity server makes it available to machines. That is not a small thing.

What the Identity Server Does

The identity server is responsible for five operations:

Resolve: Given a mention string and an entity type, return a canonical ID. This is the primary operation. It consults the lookup chain -- exact match, fuzzy match, embedding similarity -- and falls back to creating a provisional entity if no match is found.

Promote: Given a provisional entity that has accumulated sufficient evidence, upgrade it to canonical status. The promotion threshold is domain-configurable.

Find synonyms: Given a canonical ID, return all known surface forms. Used for query-time synonym expansion and graph inspection.

Merge: Given two entities determined to be the same, produce one canonical record. Survivor selection is domain-configurable. Provenance from both entities is preserved.

On entity added: A hook called after any entity is added or updated. Used for downstream notifications, cache invalidation, and logging.

These five operations are the complete interface. Everything else -- the lookup chain, the caching strategy, the Postgres schema, the domain service HTTP calls -- is implementation detail in service of these five operations.
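To make the shape of that interface concrete, here is a toy in-memory sketch of the five operations. Everything in it is an assumption made for illustration: the class name `InMemoryIdentityServer`, the `Entity` fields, the `prov:` ID format, and the use of a simple mention count as "evidence". It performs exact matching only, keeps the same ID on promotion, and takes the survivor as an explicit argument to `merge`, whereas the text above describes survivor selection as domain-configurable.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Entity:
    entity_id: str
    entity_type: str
    surface_forms: set[str] = field(default_factory=set)
    provisional: bool = True
    evidence_count: int = 0


class InMemoryIdentityServer:
    """Toy stand-in: exact match only, no fuzzy or embedding stages."""

    def __init__(self, promotion_threshold: int = 3,
                 on_entity_added: Optional[Callable[[Entity], None]] = None):
        self.promotion_threshold = promotion_threshold
        # Hook called after any entity is added or updated.
        self.on_entity_added = on_entity_added or (lambda entity: None)
        self.entities: dict[str, Entity] = {}
        self._index: dict[tuple[str, str], str] = {}
        self._counter = itertools.count(1)

    def resolve(self, mention: str, entity_type: str) -> str:
        """Return an ID for the mention, creating a provisional entity if needed."""
        key = (entity_type, mention.casefold().strip())
        entity_id = self._index.get(key)
        if entity_id is None:
            entity_id = f"prov:{next(self._counter)}"
            self.entities[entity_id] = Entity(entity_id, entity_type)
            self._index[key] = entity_id
        entity = self.entities[entity_id]
        entity.surface_forms.add(mention)
        entity.evidence_count += 1
        self.on_entity_added(entity)
        return entity_id

    def promote(self, entity_id: str) -> bool:
        """Mark the entity canonical once it has enough evidence."""
        entity = self.entities[entity_id]
        if entity.provisional and entity.evidence_count >= self.promotion_threshold:
            entity.provisional = False
            self.on_entity_added(entity)
        return not entity.provisional

    def find_synonyms(self, entity_id: str) -> set[str]:
        """Return a copy of all surface forms recorded for the entity."""
        return set(self.entities[entity_id].surface_forms)

    def merge(self, survivor_id: str, other_id: str) -> str:
        """Fold other_id into survivor_id, keeping surface forms and evidence."""
        survivor = self.entities[survivor_id]
        other = self.entities.pop(other_id)
        survivor.surface_forms |= other.surface_forms
        survivor.evidence_count += other.evidence_count
        for key in [k for k, v in self._index.items() if v == other_id]:
            self._index[key] = survivor_id
        self.on_entity_added(survivor)
        return survivor_id
```

A short usage example: after `resolve("DDAVP", "drug")` and `resolve("desmopressin", "drug")` produce two provisional IDs, calling `merge` on them leaves one record whose `find_synonyms` result contains both strings, and any later `resolve("desmopressin", "drug")` returns the survivor's ID.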

Chapter 2: The Scale of the Problem

Multiplicity at Corpus Scale

A corpus of one thousand biomedical papers contains, conservatively, tens of thousands of entity mentions. A well-studied disease like Cushing's disease will appear under its eponym, its clinical description ("hypercortisolism"), its ICD-10 code, and several abbreviated forms. A gene like POMC will appear under its official symbol, its full name ("pro-opiomelanocortin"), and older names used in papers from the 1980s and 1990s. A drug used in diagnosis like desmopressin will appear under its generic name, its brand name, its chemical name, and abbreviations.

Across a thousand papers, a single well-studied entity might generate fifty distinct surface forms. Across ten thousand papers, it might generate a hundred. The multiplicity scales with corpus size, with the breadth of time covered, and with the diversity of research communities that contributed papers.

Manual deduplication does not scale. An expert might be able to reconcile the entity mentions in a hundred papers with reasonable effort. At a thousand papers it becomes a full-time job. At ten thousand papers it is impossible. The identity server exists because the problem cannot be solved by hand at the scale where knowledge graphs become useful.

Sources of Variation

Surface form variation has several sources, each requiring a different resolution strategy:

Abbreviations and acronyms: ACTH for adrenocorticotropic hormone, DDAVP for desmopressin. Abbreviations are often defined at first use in a paper and then used without expansion. A system that only sees the abbreviation has no way to resolve it without consulting either the paper's own definition of it or an external authority.

Synonyms and alternate nomenclatures: Different research communities sometimes develop independent naming systems for the same concepts before converging on a standard. In genetics, two groups that independently discover the same gene often give it different names; official symbols are assigned later by nomenclature committees.

Misspellings and OCR artifacts: Papers from older literature, or papers processed through optical character recognition, contain systematic misspellings. These are a small fraction of mentions but they are present in every large corpus.

Evolving terminology: Medical terminology changes. Older literature may use "Cushing's syndrome" and "Cushing's disease" interchangeably, while newer literature distinguishes them: the former refers to hypercortisolism from any cause, the latter specifically to hypercortisolism caused by a pituitary adenoma. A system that treats these as the same entity conflates distinct clinical concepts; a system that treats them as always different misses genuine synonymy in papers that use them interchangeably.

Cross-language variants: In a corpus drawn from international literature, the same entity may appear under its English name, its name in another language, or a transliteration.

No single resolution strategy handles all of these. The lookup chain addresses this by applying strategies in sequence, from cheapest and most precise to most expensive and most approximate.

The Lookup Chain

The lookup chain is the identity server's resolution strategy. It applies three stages in order, stopping when a match is found:

Exact match: Compare the normalized mention string against known surface forms in the identity server's database and against the authority's own synonym list. Fast, zero false positives, handles the majority of mentions in a well-studied domain.

Fuzzy match: Apply edit-distance or token-based similarity to catch misspellings and minor variations. Requires a similarity threshold to avoid false positives; the threshold is domain-configurable.

Embedding similarity: Embed the mention string and search for nearby vectors in the entity database using pgvector. Handles semantic equivalence that lexical methods cannot -- cases where two surface forms share no characters but refer to the same concept. Most expensive; used only when the first two stages fail.

If all three stages fail, the identity server creates a provisional entity. The mention is not discarded -- it participates in the graph immediately, under a provisional ID -- but it is flagged for later resolution or promotion.

The three-stage design is a cost optimization. Most mentions in a well-studied domain will resolve at the exact match stage. Fuzzy and embedding stages are invoked only for the residue. In a large corpus run, this keeps the total cost manageable without sacrificing resolution quality on the hard cases.
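The control flow of the chain can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the identity server's code: the `LookupChain` class, the `normalize` function, and the `provisional:` ID format are hypothetical; the fuzzy stage uses the standard library's `difflib` as one possible similarity measure; and the embedding stage is an injected callable standing in for a pgvector nearest-neighbour query.

```python
import difflib
import itertools
from typing import Callable, Optional


def normalize(mention: str) -> str:
    """Case-fold and collapse whitespace; real normalization is likely richer."""
    return " ".join(mention.casefold().split())


class LookupChain:
    def __init__(self, surface_forms: dict[str, str], fuzzy_cutoff: float = 0.85,
                 embedding_search: Optional[Callable[[str], Optional[str]]] = None):
        # Known surface forms mapped to canonical IDs, stored normalized.
        self.surface_forms = {normalize(k): v for k, v in surface_forms.items()}
        self.fuzzy_cutoff = fuzzy_cutoff          # stands in for the domain-configurable threshold
        self.embedding_search = embedding_search  # placeholder for a vector search
        self._provisional = itertools.count(1)

    def resolve(self, mention: str) -> tuple[str, str]:
        """Return (entity ID, name of the stage that produced it)."""
        key = normalize(mention)

        # Stage 1: exact match on the normalized string.
        if key in self.surface_forms:
            return self.surface_forms[key], "exact"

        # Stage 2: fuzzy match, accepted only above the similarity cutoff.
        close = difflib.get_close_matches(
            key, list(self.surface_forms), n=1, cutoff=self.fuzzy_cutoff)
        if close:
            return self.surface_forms[close[0]], "fuzzy"

        # Stage 3: embedding similarity, the most expensive stage.
        if self.embedding_search is not None:
            hit = self.embedding_search(key)
            if hit is not None:
                return hit, "embedding"

        # Fallback: nothing matched, so mint a provisional ID.
        return f"provisional:{next(self._provisional)}", "provisional"


chain = LookupChain(
    {"desmopressin": "RxNorm:3251", "DDAVP": "RxNorm:3251"},
    # Stub: a real embedding stage would compare vectors, not substrings.
    embedding_search=lambda m: "RxNorm:3251" if "vasopressin analogue" in m else None,
)
for text in ["  Desmopressin ", "desmopresin", "the synthetic vasopressin analogue"]:
    print(repr(text), "->", chain.resolve(text))
```

Returning the stage name alongside the ID is a design choice of this sketch; it makes it easy to measure how much of a corpus resolves at each stage and to audit the approximate matches separately from the exact ones.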

Chapter 3: The Epistemic Commons

Authorities as Infrastructure

The identity server does not invent canonical identifiers. It borrows them from communities that have been building shared identity infrastructure for decades.

MeSH -- Medical Subject Headings -- is maintained by the National Library of Medicine and covers diseases, drugs, biological processes, and anatomical structures. It has been the standard vocabulary for biomedical literature indexing since 1963. Its hierarchical structure encodes relationships among concepts that would otherwise have to be extracted from text.

HGNC -- HUGO Gene Nomenclature Committee -- maintains official symbols and names for human genes. When a paper from 1987 uses a gene name that was superseded in 1995, HGNC records both names and the relationship between them. The identity server can resolve the old name to the current symbol without any domain-specific logic.

RxNorm, maintained by the National Library of Medicine, provides normalized names for clinical drugs. UniProt maintains the authoritative database for protein sequences and functional information. ChEMBL covers bioactive molecules.

NCBI Taxonomy

The Linnaean hierarchy -- kingdom, phylum, class, order, family, genus, species -- is the picture most people carry from school biology. When a knowledge graph needs organisms (strains, species, higher taxa) to sit in a stable tree, not just diseases and drugs, a separate class of authority applies. NCBI Taxonomy, maintained by the National Center for Biotechnology Information, is the taxonomy that backs GenBank, RefSeq, BLAST, and the organism lines in UniProt and related resources. In practice it is the shared hierarchy most biomedical pipelines assume when they say "this sequence is from Homo sapiens" or "this clade."

NCBI Taxonomy is not the same thing as MeSH. MeSH encodes clinical and literature concepts (including some organism terms for indexing); NCBI Taxonomy encodes taxonomic parent/child relationships, used to name and classify organisms for sequence and database work. Other curated name lists exist for specialized domains (marine taxa, fungi, viruses under ICTV rules, and so on); a production domain service may consult more than one.

This book treats NCBI Taxonomy as the canonical placeholder for "the official online organism tree" in a biomedical stack -- with the understanding that a fuller treatment would spell out API usage, version stability, and when to fall back to embedding-based resolution for organisms without a clean database hit.

These authorities share a common property: they were built to solve the same problem the identity server solves, at the level of a single domain, by a community of experts who needed shared identity to communicate. The identity server aggregates them. It is a client of the epistemic commons, not a replacement for it.

What You Inherit When You Anchor

Anchoring an entity to an authority identifier does more than assign a unique key. It connects the entity to the authority's full record for that identifier: its definition, its synonyms, its taxonomic position, its cross-references to related identifiers in adjacent authorities.

A disease entity anchored to MeSH:D003480 (Cushing Syndrome) inherits the MeSH tree's knowledge that Cushing Syndrome is a subtype of Adrenal Cortex Diseases, which is a subtype of Endocrine System Diseases, which is a subtype of Pathological Conditions, Anatomical. It inherits the MeSH-recorded synonyms: "Hypercortisolism", "Adrenal Cortex Hyperfunction". It inherits the cross-references to ICD-10-CM codes.

None of this has to be extracted from the corpus. It is already encoded in the authority. Anchoring is the operation that makes it available to the graph.
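A small sketch shows what "inheriting" the hierarchy buys at query time. The parent links below are a fragment copied from the description above rather than fetched from MeSH, and they use labels as keys because the text gives an identifier only for Cushing Syndrome; the `PARENT` table and the `ancestors` and `is_a` helpers are hypothetical names for illustration.

```python
# Fragment of the hierarchy as described in the text (child -> parent).
# Assumption: a real system would load these links from the authority itself.
PARENT = {
    "Cushing Syndrome": "Adrenal Cortex Diseases",
    "Adrenal Cortex Diseases": "Endocrine System Diseases",
}


def ancestors(term: str, parent: dict[str, str]) -> list[str]:
    """Walk child -> parent links and return the chain, nearest ancestor first."""
    chain = []
    seen = {term}
    while term in parent:
        term = parent[term]
        if term in seen:  # guard against accidental cycles in the table
            break
        seen.add(term)
        chain.append(term)
    return chain


def is_a(term: str, candidate_ancestor: str, parent: dict[str, str]) -> bool:
    """True if candidate_ancestor appears anywhere above term."""
    return candidate_ancestor in ancestors(term, parent)


print(ancestors("Cushing Syndrome", PARENT))
print(is_a("Cushing Syndrome", "Endocrine System Diseases", PARENT))
```

With the links in place, a question such as "which entities in the graph are endocrine diseases?" becomes a walk over inherited structure, with no need for any paper in the corpus to have stated the classification.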

Cross-Domain Composition

The consequence of anchoring to shared authorities extends beyond a single graph. When two graphs -- one built from research papers, one built from clinical trial records -- both anchor their disease entities to MeSH and their drug entities to RxNorm, a query can traverse from a research finding to a clinical trial outcome. The shared identifiers are the bridges.
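The bridging effect can be illustrated with two tiny edge lists. This is a sketch under assumptions: the triples, the `trial:EXAMPLE-001` identifier, the predicate names, and the `bridge` function are all hypothetical, and the two identifiers for desmopressin and cortisol are the ones used in Chapter 1. The research graph deliberately includes one unanchored mention to show what happens to it.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject_id, predicate, object_id)

# Hypothetical graph built from research papers.
research_graph: list[Triple] = [
    ("RxNorm:3251", "inhibits", "MeSH:D003345"),  # anchored subject
    ("DDAVP", "inhibits", "MeSH:D003345"),        # unanchored mention string
]

# Hypothetical graph built from clinical trial records.
trial_graph: list[Triple] = [
    ("trial:EXAMPLE-001", "evaluates", "RxNorm:3251"),
]


def bridge(research: list[Triple], trials: list[Triple]) -> dict[Triple, list[str]]:
    """Map each research claim to the trials that evaluate the same drug ID."""
    trials_by_drug: dict[str, list[str]] = defaultdict(list)
    for trial_id, predicate, drug_id in trials:
        if predicate == "evaluates":
            trials_by_drug[drug_id].append(trial_id)
    # The join key is the shared identifier; no bridging logic beyond equality.
    return {
        claim: trials_by_drug[claim[0]]
        for claim in research
        if claim[0] in trials_by_drug
    }


bridged = bridge(research_graph, trial_graph)
for claim, trial_ids in bridged.items():
    print(claim, "->", trial_ids)
```

Only the anchored claim reaches the trial. The claim whose subject is the raw string "DDAVP" finds nothing to join against, even though it is about the same drug.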

This is not a feature of BFS-QL, or of any query protocol. It is a consequence of the decision to anchor to shared authorities. The identity server makes that decision systematic and enforced rather than optional and inconsistent.

The practical implication for graph builders: every entity that could be anchored to an authority should be. Provisional entities that remain unanchored are islands -- they participate in their local graph but cannot bridge to other graphs. The authority lookup performed within the lookup chain is not an optimization. It is the operation that connects the graph to the epistemic commons.