Chapter 6: The Base Identity Server

Domain-Agnostic Core

The base identity server contains everything that is true of identity resolution regardless of domain:

The provisional/canonical/merged state machine. Every entity starts as provisional or enters directly as canonical (for provenance-derived entities like papers and authors). Provisional entities accumulate evidence and are promoted when a threshold is met. Merged entities are absorbed into a survivor and cease to exist as independent nodes.
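The state machine can be sketched in a few lines of Python. This is a minimal illustration rather than the server's actual code: the names EntityState, ALLOWED_TRANSITIONS, transition, and maybe_promote are hypothetical, the promotion threshold is modelled as a plain evidence count, and the sketch assumes that both provisional and canonical entities may be merged.

```python
from enum import Enum


class EntityState(Enum):
    PROVISIONAL = "provisional"
    CANONICAL = "canonical"
    MERGED = "merged"


# Assumed transition table. MERGED is terminal because a merged entity
# no longer exists as an independent node.
ALLOWED_TRANSITIONS = {
    EntityState.PROVISIONAL: {EntityState.CANONICAL, EntityState.MERGED},
    EntityState.CANONICAL: {EntityState.MERGED},
    EntityState.MERGED: set(),
}


def transition(current: EntityState, target: EntityState) -> EntityState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def maybe_promote(state: EntityState, evidence_count: int, threshold: int) -> EntityState:
    """Promote a provisional entity once its evidence meets the threshold."""
    if state is EntityState.PROVISIONAL and evidence_count >= threshold:
        return transition(state, EntityState.CANONICAL)
    return state
```

Keeping the allowed transitions in one table makes illegal moves, such as reviving a merged entity, fail loudly instead of silently corrupting the graph.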

The lookup chain. Exact match against known surface forms. Fuzzy match via edit distance. Embedding similarity via pgvector. These three stages are universal; only the thresholds and the authority consulted at each stage are domain-specific.
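The three stages can be sketched as a single function that falls through from one stage to the next. The sketch below is illustrative only: resolve, surface_forms, max_edits, and embedding_lookup are hypothetical names, the embedding stage is passed in as a callable rather than implemented, and the authority consultation at each stage is omitted.

```python
from typing import Callable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def resolve(
    mention: str,
    surface_forms: dict[str, str],
    max_edits: int,
    embedding_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return an entity ID for the mention, or None if no stage matches."""
    # Stage 1: exact match against known surface forms.
    if mention in surface_forms:
        return surface_forms[mention]

    # Stage 2: fuzzy match; accept the closest form if it is within max_edits.
    best = min(surface_forms, key=lambda form: edit_distance(mention, form), default=None)
    if best is not None and edit_distance(mention, best) <= max_edits:
        return surface_forms[best]

    # Stage 3: embedding similarity, delegated to the supplied callable.
    return embedding_lookup(mention)
```

Here max_edits stands in for the domain-specific threshold; a real deployment would likely use a different threshold per entity type.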

Idempotency. All operations must be safe to retry. Ingestion pipelines fail. Runs restart from checkpoints. The identity server must produce the same result whether a resolve call is the first or the fifteenth for a given mention.

Postgres locking. Multiple ingestion workers run concurrently. Advisory locks prevent race conditions on entity creation and merging without serializing the entire pipeline.
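One way to use advisory locks here is to derive a stable 64-bit key from the entity being created and take a transaction-scoped lock on it. The sketch below assumes that approach; advisory_lock_key and LOCK_SQL are hypothetical names, the choice of hash is arbitrary, and the %s placeholder assumes a psycopg-style driver. pg_advisory_xact_lock itself is a standard Postgres function that takes a bigint and releases the lock at the end of the transaction.

```python
import hashlib


def advisory_lock_key(entity_type: str, surface_form: str) -> int:
    """Map (entity_type, surface_form) to a signed 64-bit advisory lock key."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = f"{entity_type}\x1f{surface_form}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    # Postgres bigint is signed, so interpret the 8 bytes as a signed integer.
    return int.from_bytes(digest, byteorder="big", signed=True)


# Executed inside the transaction that creates or merges the entity.
LOCK_SQL = "SELECT pg_advisory_xact_lock(%s)"
```

Two workers that resolve the same new mention compute the same key, so one waits while the other creates the entity. Workers handling unrelated mentions compute different keys and proceed in parallel. A hash collision only causes unnecessary waiting, not incorrect results.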

pgvector similarity search. Embedding vectors are stored in Postgres using the pgvector extension. The cosine distance query ORDER BY embedding <=> $1 LIMIT k implements the embedding similarity stage of the lookup chain.
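To make the query concrete, the sketch below shows one way it might be issued from Python, together with a pure-Python version of the distance that the <=> operator computes (one minus cosine similarity). The table name entity_embeddings, the column entity_id, and the helper names are hypothetical, and the %s placeholders assume a psycopg-style driver.

```python
import math
from typing import Sequence

# Hypothetical schema: entity_embeddings(entity_id, embedding vector).
NEAREST_SQL = """
SELECT entity_id
FROM entity_embeddings
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(vec: Sequence[float]) -> str:
    """Format a vector in pgvector's text input form, e.g. '[1.0,0.5]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Reference computation of cosine distance: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

A caller would pass to_vector_literal(query_embedding) and k as the two parameters. Smaller distances mean closer matches, so ascending order returns the nearest neighbours first.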

None of this is domain-specific. A graph of legal entities uses the same state machine, the same lookup chain structure, the same idempotency requirements, and the same locking strategy as a graph of biomedical entities. The base server handles all of it.

The Plugin Contract

The base server calls out to a domain service for four things it cannot know:

Authority lookup: Given a mention string and entity type, consult the appropriate external authority and return a canonical ID if one exists. The base server does not know which authorities exist, which APIs to call, or how to interpret their responses. The domain service knows all of this.

Synonym criteria: Given two entity records, are they close enough to be considered synonyms? The threshold for synonym detection varies by domain and entity type. A gene symbol and a gene full name that share no characters may be synonyms; two drug names that differ by one character may not be.

Survivor selection: Given two entities being merged, which record becomes the survivor? The domain may prefer the record with an authority ID over a provisional one, the record with more supporting evidence, or the more recently updated record. The domain service implements this preference.

Confidence weighting: Given a list of provenance records, compute an aggregate confidence score. The base server provides the list; the domain service provides the weights and the aggregation logic. In biomedicine, an RCT outweighs a case report; in other domains, the weights are different or absent.

These four hooks are the complete plugin contract. The domain service implements them. The base server calls them. Neither needs to know anything about the other's internal implementation.
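The four hooks can be written down as a typed interface. The sketch below uses a Python Protocol for that purpose; the method names, argument types, and the choice of plain mappings for entity and provenance records are assumptions made for illustration, not the server's actual signatures.

```python
from typing import Any, Mapping, Optional, Protocol, Sequence, runtime_checkable

Record = Mapping[str, Any]  # hypothetical shape for entity and provenance records


@runtime_checkable
class DomainService(Protocol):
    def resolve_authority(self, mention: str, entity_type: str) -> Optional[str]:
        """Authority lookup: return a canonical ID, or None if none exists."""
        ...

    def are_synonyms(self, a: Record, b: Record) -> bool:
        """Synonym criteria: decide whether two entity records are synonyms."""
        ...

    def select_survivor(self, a: Record, b: Record) -> Record:
        """Survivor selection: return the record that survives a merge."""
        ...

    def confidence(self, provenance: Sequence[Record]) -> float:
        """Confidence weighting: aggregate provenance records into a score."""
        ...
```

Any object that provides these four methods satisfies the interface. In the deployed system the hooks are reached over HTTP, so a client class on the base server side would implement the same four methods by calling the domain service.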

The Docker Image

The base identity server ships as a Docker image. The image contains:

The Python service implementing the five identity server operations
Postgres client libraries and pgvector support
An HTTP client for calling the domain service
An LRU cache layer wrapping the domain service client
A stub domain service that returns nulls and defaults

The stub domain service makes the image functional without any domain configuration. It resolves nothing to authorities (all entities start as provisional), treats every candidate pair as non-synonyms, always selects the first entity as the survivor, and returns a confidence of 0.5 for every provenance list. This is correct behavior for a system with no domain knowledge -- it is not an error state.
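The stub's behavior is simple enough to state in code. The class below is a sketch of that behavior only; the class name, method names, and record types are hypothetical, and the real stub is a separate HTTP service rather than an in-process object.

```python
from typing import Any, Mapping, Optional, Sequence

Record = Mapping[str, Any]  # hypothetical record shape


class StubDomainService:
    """Domain service with no domain knowledge: nulls and defaults only."""

    def resolve_authority(self, mention: str, entity_type: str) -> Optional[str]:
        # No authorities are configured, so nothing resolves.
        return None

    def are_synonyms(self, a: Record, b: Record) -> bool:
        # Without domain criteria, no pair is treated as synonymous.
        return False

    def select_survivor(self, a: Record, b: Record) -> Record:
        # Always keep the first entity.
        return a

    def confidence(self, provenance: Sequence[Record]) -> float:
        # Fixed neutral score for every provenance list.
        return 0.5
```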

To deploy the identity server for a real domain, replace the stub with a real domain service pointed at the appropriate authorities. The identity server image does not change. The domain service is a separate container.

Caching

Why caching is not optional

The lookup chain calls the domain service for every entity mention that does not resolve at the exact match stage. In a corpus of ten thousand papers, this means tens of thousands of HTTP calls to the domain service, each of which may call an external authority API. Without caching, a large corpus run is slow, incurs high API costs, and may be rate-limited before it completes.

The caching strategy has two levels: an LRU cache in the identity server that caches domain service HTTP responses, and a long-TTL cache in the domain service that caches external authority API responses. Together they ensure that the expensive operations -- external API calls -- happen once per unique entity, not once per mention.

LRU cache in the identity server

The identity server wraps its HTTP client to the domain service with an LRU cache keyed on (mention, entity_type). A call to /resolve-authority for "desmopressin" with type "drug" will hit the external authority API on the first mention in the corpus and return the cached result for every subsequent mention.

The hit rate for this cache is high in practice. Entity mentions are not uniformly distributed across a corpus -- a paper about Cushing's disease will mention ACTH, cortisol, and desmopressin many times, and these same entities will appear in many other papers about the same disease. The most-mentioned entities are exactly the ones that benefit most from caching.

Cache size is configurable. For a corpus run that processes all papers sequentially in a single worker, an LRU size of 10,000 entries is sufficient to capture the hot set of entities in most domains. For parallel workers, each worker maintains its own cache; there is no shared cache between workers, which avoids coordination overhead at the cost of some redundant API calls at the start of each worker's run.
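A per-worker LRU cache of this kind can be sketched with functools.lru_cache from the Python standard library. The wrapper below is an assumption about how such a layer might look, not the server's actual code; cached_authority_lookup and fetch are hypothetical names, and fetch stands in for the HTTP call to /resolve-authority.

```python
from functools import lru_cache
from typing import Callable, Optional


def cached_authority_lookup(
    fetch: Callable[[str, str], Optional[str]],
    maxsize: int = 10_000,
) -> Callable[[str, str], Optional[str]]:
    """Wrap a domain-service call in an LRU cache keyed on (mention, entity_type)."""

    @lru_cache(maxsize=maxsize)
    def lookup(mention: str, entity_type: str) -> Optional[str]:
        # Runs only on a cache miss; the result, including None, is then cached.
        return fetch(mention, entity_type)

    return lookup
```

Because None results are cached as well, mentions that no authority recognizes do not trigger repeated calls. Each worker process that builds its own wrapper gets its own independent cache, which matches the no-shared-cache design described above.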

Long-TTL cache in the domain service

The domain service caches external authority API responses with a long TTL -- hours or days, or for the duration of a batch run. Authority records are stable: a MeSH term's canonical ID and synonyms do not change between the start and end of a corpus ingestion run. There is no value in fetching the same authority record twice.

The domain service uses Redis for this cache. Redis provides TTL-based expiration and handles cache persistence across domain service restarts. If the domain service is restarted mid-run, the cache survives the restart and the run can continue without re-fetching all previously resolved authority records.

This is the most important cache in the system. External authority API calls are the bottleneck for resolution performance. The domain service cache eliminates them after the first call for each unique entity, for as long as the TTL holds.
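The domain service cache follows the usual cache-aside pattern: check Redis, call the authority only on a miss, and store the result with a TTL. The sketch below assumes a redis-py-compatible client, using only its get(name) and set(name, value, ex=seconds) calls; cached_authority_record, the key format, the JSON encoding, and the one-day default TTL are illustrative assumptions.

```python
import json
from typing import Any, Callable, Optional


def cached_authority_record(
    client: Any,
    key: str,
    fetch: Callable[[], Optional[dict]],
    ttl_seconds: int = 86_400,
) -> Optional[dict]:
    """Return the authority record for key, fetching it at most once per TTL."""
    raw = client.get(key)
    if raw is not None:
        # Cache hit. A stored JSON null decodes to None, so misses at the
        # authority are cached too.
        return json.loads(raw)

    # Cache miss: call the external authority, then store with an expiry.
    record = fetch()
    client.set(key, json.dumps(record), ex=ttl_seconds)
    return record
```

A caller might build the key from the authority name, entity type, and mention, for example "authority:drug:desmopressin", and pass a closure that performs the external API request as fetch.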

Co-location

The identity server, domain service, Postgres, and Redis run in the same docker-compose network. HTTP calls between them traverse a virtual network interface; latency is sub-millisecond. The caching strategy is designed for this topology -- it assumes that cache misses are cheap (a local network call to Redis or Postgres) and that the expensive operations (external authority APIs) are eliminated by caching.

If the identity server and domain service are deployed in separate network regions, the latency assumptions change. The caching strategy remains correct but the per-call cost of cache misses increases. Co-location is a deployment requirement for the performance characteristics described here, not just a convenience.