Chapter 11: The Identity Server

The identity server is the component responsible for entity identity across the knowledge graph: resolving a mention to a canonical ID, tracking provisional entities until they can be confirmed, detecting synonyms, and merging duplicates. Its full architecture -- domain plugin contract, authority lookup chain, synonym cache, promotion policy, Docker deployment -- is covered in the companion volume The Identity Server: Canonical Identity for Knowledge Graphs. This chapter covers only what the ingestion pipeline needs to know about calling it.

Identity Is Load-Bearing

Canonical entities with canonical IDs are the design decision that most separates a useful knowledge graph from a sophisticated extraction exercise. Without identity resolution, you have a collection of mentions that look like a graph but don't support cross-document reasoning. With it, "Drug A treats Disease B" means the same thing whether it came from a 2010 review article or a 2024 clinical trial, because both assertions resolve to the same nodes.

The Pipeline's View

The ingestion pipeline treats the identity server as a black box. After the LLM extraction pass produces raw mentions, the ingest stage calls resolve(mention, entity_type) for each one and receives back a stable ID -- canonical if an authority matched, provisional otherwise. A provisional ID identifies a valid graph node: relationships that reference it are valid edges, and evidence continues to accumulate against it across any later promotion or merge. The pipeline therefore does not handle provisional entities specially.
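To make the calling pattern concrete, here is a minimal sketch of the ingest stage resolving both ends of an extracted relationship. Only the resolve(mention, entity_type) call comes from the text above. The StubIdentityServer class, the in-memory authority table, the ingest_relationship helper, and the "canon:" / "prov:" ID formats are all assumptions made for illustration; the real server's lookup chain and ID scheme are described in the companion volume and may look quite different.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class StubIdentityServer:
    """Hypothetical in-memory stand-in for the identity server (illustration only)."""

    # (normalized mention, entity_type) -> canonical ID. The entries are made up.
    authority: dict = field(default_factory=dict)

    def resolve(self, mention: str, entity_type: str) -> str:
        """Return a stable ID: canonical if the authority table matches, else provisional."""
        key = (mention.strip().lower(), entity_type)
        if key in self.authority:
            return self.authority[key]
        # No authority matched. Derive the provisional ID deterministically so
        # the same mention always maps to the same node. (Assumed scheme.)
        digest = hashlib.sha1("|".join(key).encode("utf-8")).hexdigest()[:12]
        return f"prov:{entity_type}:{digest}"


def ingest_relationship(server, subject, predicate, obj):
    """Resolve both ends of an extracted relationship and return an edge triple.

    subject and obj are (mention, entity_type) pairs from the extraction pass.
    There is deliberately no branch on canonical vs. provisional IDs here.
    """
    subj_id = server.resolve(*subject)
    obj_id = server.resolve(*obj)
    return (subj_id, predicate, obj_id)


server = StubIdentityServer(authority={("aspirin", "drug"): "canon:drug:0001"})

# Two documents mention the same entities with different surface forms.
edge_a = ingest_relationship(
    server, ("Aspirin", "drug"), "TREATS", ("rare syndrome x", "disease")
)
edge_b = ingest_relationship(
    server, ("aspirin ", "drug"), "TREATS", ("Rare Syndrome X", "disease")
)
```

In this sketch both edges come out identical: the drug resolves to the canonical ID and the disease to the same provisional ID each time, so evidence from both documents lands on one pair of nodes without the pipeline ever checking which kind of ID it received.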

Papers, authors, and citations from document metadata enter with their canonical ID already known (a PMC ID, an ORCID) and bypass the lookup chain entirely. The citation graph this produces -- CITES(Paper, Paper) edges derived directly from reference lists, confidence 1.0 -- is a built-in corpus expansion mechanism: frequently cited papers not yet ingested become natural candidates for the next ingest run.
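One way the citation graph could drive corpus expansion is sketched below: count how often each not-yet-ingested paper is the target of a CITES edge and rank the results. The expansion_candidates function, the tuple representation of edges, and the placeholder paper IDs are assumptions for illustration, not the pipeline's actual interface.

```python
from collections import Counter


def expansion_candidates(cites_edges, ingested, top_n=10):
    """Rank papers that are cited but not yet ingested by citation count.

    cites_edges: iterable of (citing_paper_id, cited_paper_id) pairs
    ingested:    set of paper IDs already in the corpus
    """
    counts = Counter(
        cited for _citing, cited in cites_edges if cited not in ingested
    )
    return counts.most_common(top_n)


# Placeholder IDs, not real PMC records.
ingested = {"PMC1", "PMC2", "PMC3"}
cites_edges = [
    ("PMC1", "PMC9"),
    ("PMC2", "PMC9"),
    ("PMC3", "PMC9"),
    ("PMC1", "PMC2"),  # target already ingested, so it is ignored
    ("PMC2", "PMC7"),
    ("PMC3", "PMC7"),
]

candidates = expansion_candidates(cites_edges, ingested, top_n=2)
# candidates == [("PMC9", 3), ("PMC7", 2)]
```

Because the edges come straight from reference lists with confidence 1.0, a plain count is enough for this sketch; a real ingest scheduler might also weight by recency or topic.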