
Chapter 2: The Scale of the Problem

Multiplicity at Corpus Scale

A corpus of one thousand biomedical papers contains, conservatively, tens of thousands of entity mentions. A well-studied disease like Cushing's disease will appear under its eponym, its clinical description ("hypercortisolism"), its ICD-10 code, and several abbreviated forms. A gene like POMC will appear under its official symbol, its full name ("pro-opiomelanocortin"), and older names used in papers from the 1980s and 1990s. A drug used in diagnosis like desmopressin will appear under its generic name, its brand name, its chemical name, and abbreviations.

Across a thousand papers, a single well-studied entity might generate fifty distinct surface forms. Across ten thousand papers, it might generate a hundred. The multiplicity scales with corpus size, with the breadth of time covered, and with the diversity of research communities that contributed papers.

Manual deduplication does not scale. An expert might be able to reconcile the entity mentions in a hundred papers with reasonable effort. At a thousand papers it becomes a full-time job. At ten thousand papers it is impossible. The identity server exists because the problem cannot be solved by hand at the scale where knowledge graphs become useful.

Sources of Variation

Surface form variation has several sources, each requiring a different resolution strategy:

Abbreviations and acronyms: ACTH for adrenocorticotropic hormone, DDAVP for desmopressin. Abbreviations are often defined at first use in a paper and then used without expansion. A system that sees only the abbreviation cannot resolve it without consulting either the paper's own first-use definition or an external authority.

Synonyms and alternate nomenclatures: Different research communities sometimes develop independent naming systems for the same concepts before converging on a standard. In genetics, two groups that independently discover the same gene often give it different names; official symbols are assigned later by nomenclature committees.

Misspellings and OCR artifacts: Papers from older literature, or papers processed through optical character recognition, contain systematic misspellings. These are a small fraction of mentions but they are present in every large corpus.

Evolving terminology: Medical terminology changes. Older literature may use "Cushing's syndrome" broadly, while newer literature distinguishes it from "Cushing's disease": the former refers to hypercortisolism from any cause, the latter specifically to hypercortisolism caused by a pituitary adenoma. A system that treats these as the same entity conflates distinct clinical concepts; a system that treats them as always different misses genuine synonymy in papers that use them interchangeably.

Cross-language variants: In a corpus drawn from international literature, the same entity may appear under its English name, its name in another language, or a transliteration.

No single resolution strategy handles all of these. The lookup chain addresses this by applying strategies in sequence, from cheapest and most precise to most expensive and most approximate.
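The abbreviation case above lends itself to a small illustration. The sketch below is a simplified version of the Schwartz-Hearst heuristic, which pairs a parenthesized short form with the words that precede it. It is offered as an assumption about how first-use definitions might be harvested, not as a description of the identity server's own method; the function names `best_long_form` and `extract_abbreviations` are hypothetical.

```python
import re
from typing import Dict, Optional

# A parenthesized token of 2-10 uppercase letters/digits, e.g. "(ACTH)".
_SHORT_FORM = re.compile(r"\(([A-Z][A-Z0-9]{1,9})\)")


def best_long_form(short: str, candidate: str) -> Optional[str]:
    """Match the short form's characters right-to-left inside `candidate`.

    The first character of the short form must fall at the start of a word.
    Returns the matched long form, or None if the characters cannot be found.
    """
    l = len(candidate) - 1
    for s in range(len(short) - 1, -1, -1):
        ch = short[s].lower()
        if not ch.isalnum():
            continue
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
    # Extend left to the beginning of the word where the match started.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_abbreviations(text: str) -> Dict[str, str]:
    """Return {short form: long form} for definitions of the shape 'long (SHORT)'."""
    found: Dict[str, str] = {}
    for match in _SHORT_FORM.finditer(text):
        short = match.group(1)
        words = text[: match.start()].split()
        # Heuristic window size: min(|short| + 5, 2 * |short|) preceding words.
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form and short not in found:
            found[short] = long_form
    return found


sample = (
    "Plasma adrenocorticotropic hormone (ACTH) was measured. "
    "ACTH rose after desmopressin (DDAVP)."
)
definitions = extract_abbreviations(sample)
```

In this sample the heuristic recovers the long form for ACTH but finds nothing for DDAVP, because the letters of DDAVP derive from the drug's chemical name rather than from the word "desmopressin". That is the kind of case where an external authority is still needed.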

The Lookup Chain

The lookup chain is the identity server's resolution strategy. It applies three stages in order, stopping when a match is found:

Exact match: Compare the normalized mention string against known surface forms in the identity server's database and against the authority's own synonym list. Fast and highly precise: a false positive can arise only when the same surface form is registered for more than one entity. Handles the majority of mentions in a well-studied domain.

Fuzzy match: Apply edit-distance or token-based similarity to catch misspellings and minor variations. Requires a similarity threshold to avoid false positives; the threshold is domain-configurable.

Embedding similarity: Embed the mention string and search for nearby vectors in the entity database using pgvector. Handles semantic equivalence that lexical methods cannot -- cases where two surface forms share no characters but refer to the same concept. Most expensive; used only when the first two stages fail.
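To make the embedding stage concrete, here is a minimal in-memory sketch of a nearest-neighbor search with a similarity floor. It assumes cosine similarity as the metric and uses tiny hand-written vectors in place of real embeddings; the entity IDs, the `nearest_entity` function, and the 0.8 floor are all hypothetical. In a pgvector deployment the comparison would normally be done in SQL, where the `<=>` operator computes cosine distance (one minus cosine similarity).

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_entity(
    query: List[float],
    index: Dict[str, List[float]],
    min_similarity: float = 0.8,  # hypothetical floor; tune per domain
) -> Optional[Tuple[str, float]]:
    """Return (entity_id, similarity) for the closest vector, or None if
    nothing in the index reaches the similarity floor."""
    best_id: Optional[str] = None
    best_sim = -1.0
    for entity_id, vector in index.items():
        sim = cosine_similarity(query, vector)
        if sim > best_sim:
            best_id, best_sim = entity_id, sim
    if best_id is not None and best_sim >= min_similarity:
        return best_id, best_sim
    return None


# Toy three-dimensional "embeddings", invented purely for illustration.
toy_index = {
    "entity:cushing_disease": [0.9, 0.1, 0.0],
    "entity:desmopressin": [0.0, 0.2, 0.9],
}
hit = nearest_entity([0.8, 0.2, 0.1], toy_index)   # close to the first vector
miss = nearest_entity([0.5, 0.5, 0.5], toy_index)  # close to neither
```

The similarity floor plays the same role here as the fuzzy-match threshold: without it, every query would return its nearest neighbor no matter how distant.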

If all three stages fail, the identity server creates a provisional entity. The mention is not discarded -- it participates in the graph immediately, under a provisional ID -- but it is flagged for later resolution or promotion.
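The control flow described above can be sketched in a few lines. This is an illustrative model under stated assumptions, not the identity server's implementation: `LookupChain`, `Resolution`, and the `provisional:N` ID format are hypothetical names, normalization is assumed to mean lowercasing and whitespace collapsing, `difflib` from the Python standard library stands in for whatever edit-distance or token-based measure is actually used, and the embedding stage is an injected callable so that the example stays self-contained.

```python
import difflib
import itertools
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


def normalize(mention: str) -> str:
    """Assumed normalization: trim, lowercase, collapse runs of whitespace."""
    return re.sub(r"\s+", " ", mention.strip().lower())


@dataclass
class Resolution:
    entity_id: str
    stage: str  # "exact", "fuzzy", "embedding", or "provisional"


class LookupChain:
    def __init__(
        self,
        surface_forms: Dict[str, str],
        fuzzy_threshold: float = 0.9,  # hypothetical default
        embedding_lookup: Optional[Callable[[str], Optional[str]]] = None,
    ) -> None:
        self.surface_forms = {normalize(k): v for k, v in surface_forms.items()}
        self.fuzzy_threshold = fuzzy_threshold
        self.embedding_lookup = embedding_lookup
        self._provisional_ids = itertools.count(1)

    def resolve(self, mention: str) -> Resolution:
        key = normalize(mention)

        # Stage 1: exact match on the normalized string.
        if key in self.surface_forms:
            return Resolution(self.surface_forms[key], "exact")

        # Stage 2: fuzzy match; difflib's ratio stands in for edit distance.
        close = difflib.get_close_matches(
            key, list(self.surface_forms), n=1, cutoff=self.fuzzy_threshold
        )
        if close:
            return Resolution(self.surface_forms[close[0]], "fuzzy")

        # Stage 3: embedding similarity, only if a lookup was supplied.
        if self.embedding_lookup is not None:
            entity_id = self.embedding_lookup(key)
            if entity_id is not None:
                return Resolution(entity_id, "embedding")

        # Fallback: mint a provisional entity and remember its surface form
        # so later mentions of the same string reuse the same ID.
        provisional_id = f"provisional:{next(self._provisional_ids)}"
        self.surface_forms[key] = provisional_id
        return Resolution(provisional_id, "provisional")


chain = LookupChain({"Cushing's disease": "entity:1", "desmopressin": "entity:2"})
exact = chain.resolve("  Cushing's   Disease ")
fuzzy = chain.resolve("desmopresin")          # one letter dropped
unknown = chain.resolve("hypercortisolism")   # no embedding stage configured
again = chain.resolve("Hypercortisolism")     # reuses the provisional ID
```

Recording the provisional ID against its surface form is a design assumption of this sketch; it keeps repeated mentions of an unresolved string attached to a single node until that node is resolved or promoted.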

The three-stage design is a cost optimization. Most mentions in a well-studied domain will resolve at the exact match stage. Fuzzy and embedding stages are invoked only for the residue. In a large corpus run, this keeps the total cost manageable without sacrificing resolution quality on the hard cases.
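The cost argument can be written as a short calculation. If each stage has a per-mention cost and resolves some fraction of the mentions that reach it, the expected cost per mention is the sum of each stage's cost weighted by the probability that a mention gets that far. The function below is a sketch of that arithmetic; the relative costs and hit rates in the example are invented for illustration and are not measurements from any real corpus.

```python
from typing import Sequence


def expected_cost_per_mention(
    stage_costs: Sequence[float],
    stage_hit_rates: Sequence[float],
) -> float:
    """Expected cost of a sequential chain that stops at the first hit.

    stage_hit_rates[i] is the fraction of mentions *reaching* stage i
    that stage i resolves (a conditional rate, not an overall one).
    """
    if len(stage_costs) != len(stage_hit_rates):
        raise ValueError("one hit rate is required per stage")
    expected = 0.0
    reach = 1.0  # probability that a mention reaches the current stage
    for cost, hit_rate in zip(stage_costs, stage_hit_rates):
        expected += reach * cost
        reach *= 1.0 - hit_rate
    return expected


# Hypothetical relative costs (exact=1, fuzzy=10, embedding=100) and
# hypothetical conditional hit rates, chosen only to show the shape.
chained = expected_cost_per_mention([1.0, 10.0, 100.0], [0.8, 0.5, 0.5])
run_everything = sum([1.0, 10.0, 100.0])
```

With these made-up inputs the chained cost works out to roughly 13 units per mention against 111 if every stage ran on every mention. The saving comes almost entirely from how rarely the embedding stage is reached.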