
Chapter 7: Entity Lifecycle

Three Statuses

Every entity in the identity server has one of three statuses:

Provisional: The entity was created from an extracted mention that did not resolve to a known authority. It has a provisional ID (a UUID generated by the identity server) and participates fully in the graph -- relationships can reference it, evidence accumulates against it -- but it is flagged as unanchored.

Canonical: The entity is anchored to an external authority. It has a canonical ID derived from the authority (e.g., RxNorm:3251). It is the authoritative node for all surface forms that resolve to it.

Merged: The entity was determined to be a duplicate of another entity and absorbed into the survivor. Merged entities retain their history -- their provisional ID, their evidence records, their mention strings -- but they are no longer active graph nodes. All relationships that referenced them are transparently redirected to the survivor.
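As an illustration only, the three statuses could be modeled along the following lines. The names `Status`, `Entity`, `merged_into`, `new_provisional`, and `resolve_active` are hypothetical and not taken from the identity server; the sketch assumes a merged entity stores a pointer to its survivor so that references can be redirected.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional
import uuid


class Status(Enum):
    PROVISIONAL = "provisional"
    CANONICAL = "canonical"
    MERGED = "merged"


@dataclass
class Entity:
    entity_id: str                      # provisional UUID or authority-derived ID
    status: Status
    mentions: List[str] = field(default_factory=list)
    merged_into: Optional[str] = None   # assumed field: set only when status is MERGED


def new_provisional(mention: str) -> Entity:
    """Create an unanchored entity for a mention that matched no known authority."""
    return Entity(entity_id=str(uuid.uuid4()),
                  status=Status.PROVISIONAL,
                  mentions=[mention])


def resolve_active(entities: Dict[str, Entity], entity_id: str) -> Entity:
    """Follow merged_into pointers until an active (non-merged) entity is reached."""
    entity = entities[entity_id]
    while entity.status is Status.MERGED:
        entity = entities[entity.merged_into]
    return entity
```

In this sketch a merged entity keeps its ID and mention strings; only lookups are redirected to the survivor.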

Promotion

A provisional entity accumulates evidence as more papers are processed. When the evidence crosses a promotion threshold -- configurable per entity type and per domain -- the identity server triggers a promotion attempt: it calls the domain service's /resolve-authority endpoint with the entity's most common surface form. If a match is found, the entity is promoted to canonical and assigned the authority ID. If no match is found, the entity remains provisional.

Promotion is not a one-time event. A provisional entity that fails to promote early in the corpus run may promote later when additional surface forms have accumulated and the fuzzy or embedding match finds a better candidate. The identity server tracks failed promotion attempts and retries them at configurable intervals.
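The promotion attempt and its retry behavior might be sketched as follows. This is a minimal illustration under stated assumptions: `resolve_authority` stands in for a call to the domain service's /resolve-authority endpoint and returns an authority ID or `None`; the field names, the fixed `retry_interval`, and the reading of "crosses the threshold" as "greater than or equal" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProvisionalEntity:
    entity_id: str                      # provisional UUID, kept after promotion
    top_surface_form: str               # most common surface form seen so far
    evidence_count: int                 # e.g. number of independent papers
    status: str = "provisional"
    canonical_id: Optional[str] = None  # authority ID once promoted
    failed_attempts: int = 0
    next_retry_at: float = 0.0          # epoch seconds; 0.0 means eligible now


def attempt_promotion(
    entity: ProvisionalEntity,
    threshold: int,
    resolve_authority: Callable[[str], Optional[str]],
    now: float,
    retry_interval: float = 3600.0,     # assumed default, in seconds
) -> bool:
    """Try to promote the entity; return True if it became canonical."""
    if entity.status != "provisional":
        return False
    if entity.evidence_count < threshold or now < entity.next_retry_at:
        return False
    authority_id = resolve_authority(entity.top_surface_form)
    if authority_id is None:
        # No match: stay provisional, record the failure, schedule a retry.
        entity.failed_attempts += 1
        entity.next_retry_at = now + retry_interval
        return False
    entity.canonical_id = authority_id
    entity.status = "canonical"
    return True
```

Passing `now` explicitly rather than reading the clock inside the function keeps the retry logic easy to test.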

The promotion threshold is a quality dial. A low threshold promotes entities quickly but risks premature promotion to incorrect authorities. A high threshold keeps entities provisional longer but ensures that promotion decisions are well-evidenced. The medlit reference implementation uses a threshold of three independent papers for biological process entities and five for drug entities, reflecting the relative ease of resolving each type.
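A per-domain, per-type threshold table could look roughly like this. The values three and five come from the medlit example above; the table layout, the `DRUG` type name, the fallback default, and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

DEFAULT_THRESHOLD = 3  # assumed fallback; the text does not specify one

# (domain, entity_type) -> number of independent papers required
PROMOTION_THRESHOLDS: Dict[Tuple[str, str], int] = {
    ("medlit", "BIOLOGICAL_PROCESS"): 3,
    ("medlit", "DRUG"): 5,
}


def promotion_threshold(domain: str, entity_type: str) -> int:
    """Look up the configured threshold, falling back to the default."""
    return PROMOTION_THRESHOLDS.get((domain, entity_type), DEFAULT_THRESHOLD)


def is_eligible(domain: str, entity_type: str, paper_ids: Iterable[str]) -> bool:
    """Count distinct papers so repeated mentions in one paper count once."""
    return len(set(paper_ids)) >= promotion_threshold(domain, entity_type)
```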

Merging

Merging occurs when two entities are determined to be the same thing. This can happen in three ways: two provisional entities resolve to the same authority ID through promotion, two entities have surface forms whose similarity exceeds the fuzzy similarity threshold, or two entities have embedding vectors within the cosine distance threshold.
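The three merge triggers can be sketched as a single check. This is illustrative only: the standard library's `difflib.SequenceMatcher` is swapped in for whatever fuzzy matcher the identity server actually uses, and the function names and threshold defaults (0.9 and 0.1) are assumed values, not settings from the reference implementation.

```python
import math
from difflib import SequenceMatcher
from typing import Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; treat a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


def merge_reason(
    authority_a: Optional[str],
    authority_b: Optional[str],
    form_a: str,
    form_b: str,
    vec_a: Sequence[float],
    vec_b: Sequence[float],
    fuzzy_threshold: float = 0.9,       # assumed value
    max_cosine_distance: float = 0.1,   # assumed value
) -> Optional[str]:
    """Return the first merge condition that holds, or None if none does."""
    if authority_a is not None and authority_a == authority_b:
        return "same_authority_id"
    similarity = SequenceMatcher(None, form_a.lower(), form_b.lower()).ratio()
    if similarity > fuzzy_threshold:
        return "fuzzy_surface_form"
    if cosine_distance(vec_a, vec_b) <= max_cosine_distance:
        return "embedding_distance"
    return None
```

Returning the reason rather than a bare boolean lets the caller record why a merge was proposed.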

The merge operation calls the domain service's /select-survivor endpoint to determine which entity becomes the survivor. All relationships that referenced the non-survivor are updated to reference the survivor. The non-survivor's status is set to merged, and its history -- including the fact that it was merged and when -- is preserved in the merge log.

Merge is idempotent. If two entities have already been merged and the identity server encounters evidence that they should be merged again (because a new paper uses a surface form that triggers the same merge condition), the operation returns the existing merge result without creating a new merge record.
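A minimal in-memory sketch of an idempotent merge follows. The `Graph` and `MergeRecord` classes are hypothetical, edges are simplified to (subject, predicate, object) tuples, and `select_survivor` stands in for a call to the domain service's /select-survivor endpoint; it is assumed to return one of the two IDs it is given.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple


@dataclass
class MergeRecord:
    survivor_id: str
    absorbed_id: str
    merged_at: str  # ISO 8601 timestamp


@dataclass
class Graph:
    status: Dict[str, str]             # entity_id -> status
    edges: List[Tuple[str, str, str]]  # (subject, predicate, object)
    merged_into: Dict[str, str] = field(default_factory=dict)
    merge_log: List[MergeRecord] = field(default_factory=list)

    def active_id(self, entity_id: str) -> str:
        """Follow merge pointers to the current survivor."""
        while entity_id in self.merged_into:
            entity_id = self.merged_into[entity_id]
        return entity_id

    def merge(self, a: str, b: str,
              select_survivor: Callable[[str, str], str]) -> str:
        a, b = self.active_id(a), self.active_id(b)
        if a == b:
            return a  # already merged: return existing result, no new record
        survivor = select_survivor(a, b)
        absorbed = b if survivor == a else a
        # Redirect every edge endpoint that referenced the absorbed entity.
        self.edges = [
            (survivor if s == absorbed else s, p, survivor if o == absorbed else o)
            for s, p, o in self.edges
        ]
        self.status[absorbed] = "merged"
        self.merged_into[absorbed] = survivor
        self.merge_log.append(
            MergeRecord(survivor, absorbed, datetime.now(timezone.utc).isoformat())
        )
        return survivor
```

Resolving both IDs to their active survivors before comparing is what makes the repeated call a no-op in this sketch.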

When the ontology changes

The domain spec is not a one-time artifact. The epistemic commons itself evolves -- MeSH terms are deprecated, renamed, or restructured; research communities develop new consensus on how to categorize things; new predicates become necessary as the domain matures. When the ontology changes, the graph must have a principled response. That response should follow the same rule that governs everything else in a typed graph: make the state visible, not silent.

Deprecated predicates. When a predicate is retired from the domain spec, edges that used it do not disappear. They are flagged -- either at schema reload time or on the next linter pass -- as carrying a deprecated predicate. The flag is a structured attribute on the edge, not a deletion. Provenance is never retroactively erased. The linter emits DEPRECATED_PREDICATE violations for these edges, routing them to a review queue. Actual removal is a deliberate, auditable operation, not an automatic consequence of the schema change.
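A linter pass for deprecated predicates might look like the sketch below. The `Edge` and `Violation` shapes, the `flags` attribute, and the function name are hypothetical; only the DEPRECATED_PREDICATE code comes from the text. Note that the pass flags edges and never deletes them.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Edge:
    edge_id: str
    subject: str
    predicate: str
    obj: str
    flags: Dict[str, bool] = field(default_factory=dict)  # structured attributes


@dataclass
class Violation:
    code: str
    edge_id: str


def lint_deprecated(edges: Iterable[Edge], deprecated: Set[str]) -> List[Violation]:
    """Flag edges that carry a retired predicate and return review-queue items."""
    review_queue: List[Violation] = []
    for edge in edges:
        if edge.predicate in deprecated:
            edge.flags["deprecated_predicate"] = True  # flag, do not delete
            review_queue.append(Violation("DEPRECATED_PREDICATE", edge.edge_id))
    return review_queue
```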

Tightened domain or range constraints. Suppose the treats predicate originally allowed BIOLOGICAL_PROCESS as an object, and a schema revision restricts it to DISEASE only. Existing edges with BIOLOGICAL_PROCESS objects are now constraint violations -- but they were valid when written. The schema version recorded at ingest time (see below) is what allows the linter to distinguish "was valid under the schema in force at the time of extraction" from "is valid under the current schema." That distinction matters for prioritizing remediation: edges that were malformed at extraction time are errors; edges that became malformed due to a schema revision are migration items.
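The distinction between errors and migration items can be sketched by validating an edge against two schema versions. The in-memory `SCHEMAS` table, the version labels, and the function names below are assumptions for illustration; the sketch only checks the object type of a predicate, using the treats example from the paragraph above.

```python
from typing import Dict, Set

# predicate -> allowed object types, for one schema version
Schema = Dict[str, Set[str]]

# Hypothetical schema history keyed by version label.
SCHEMAS: Dict[str, Schema] = {
    "2.1": {"treats": {"DISEASE", "BIOLOGICAL_PROCESS"}},
    "2.3": {"treats": {"DISEASE"}},
}


def is_valid(schema: Schema, predicate: str, object_type: str) -> bool:
    return object_type in schema.get(predicate, set())


def classify(predicate: str, object_type: str,
             ingest_version: str, current_version: str) -> str:
    """Return 'valid', 'migration_item', or 'error' for one edge."""
    if is_valid(SCHEMAS[current_version], predicate, object_type):
        return "valid"
    if is_valid(SCHEMAS[ingest_version], predicate, object_type):
        # Valid when written, invalidated by a later schema revision.
        return "migration_item"
    # Malformed even under the schema in force at extraction time.
    return "error"
```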

Predicate renaming or splitting. A predicate that is renamed or split into two more specific predicates is handled as deprecate-old plus introduce-new. Edges carrying the old predicate are flagged as deprecated. A migration script -- not the identity server -- moves them to the new predicate with a provenance note recording the transformation. The migration script is an explicit, reviewable artifact; the transformation is logged in the merge history alongside entity merges and promotions.
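A rename migration could be sketched as follows. This covers the simple rename case only; a split would additionally need a rule for choosing the new predicate per edge. The `Edge` shape, the provenance note fields, and the `history` list (standing in for the merge history) are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterable, List


@dataclass
class Edge:
    edge_id: str
    predicate: str
    provenance: List[Dict[str, str]] = field(default_factory=list)


def migrate_predicate(edges: Iterable[Edge], old: str, new: str,
                      script_name: str, history: List[Dict[str, str]]) -> int:
    """Move edges from `old` to `new`, recording a provenance note for each."""
    moved = 0
    for edge in edges:
        if edge.predicate != old:
            continue
        note = {
            "op": "predicate_migration",
            "from": old,
            "to": new,
            "script": script_name,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        edge.predicate = new
        edge.provenance.append(note)                     # note travels with the edge
        history.append({"edge_id": edge.edge_id, **note})  # shared audit log
        moved += 1
    return moved
```

Running the function a second time moves nothing, since no edge still carries the old predicate.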

Versioned schemas. The domain service's GET /schema response should carry a version field -- a semantic version or a content hash. The identity server records which schema version was active when each edge was ingested. This makes the migration state auditable: you can query "show me all edges ingested under schema version 2.1 that fail validation under schema version 2.3" and get a concrete work list.
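Both halves of this idea can be sketched briefly: deriving a content-hash version from a schema document, and filtering edges by the version recorded at ingest. The function names, the 12-character truncation, and the dictionary shape of an edge (with a `schema_version` key) are assumptions, not part of the GET /schema contract.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, List


def schema_version(schema: dict) -> str:
    """Content hash of a JSON-serializable schema, stable under key reordering."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def work_list(edges: Iterable[Dict[str, str]], ingested_under: str,
              validate: Callable[[Dict[str, str]], bool]) -> List[Dict[str, str]]:
    """Edges ingested under one schema version that fail the given validator."""
    return [
        edge for edge in edges
        if edge["schema_version"] == ingested_under and not validate(edge)
    ]
```

Here `validate` would be a check against the current schema, so the result is the list of edges from the older version that need attention.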

The philosophical point follows directly from Chapter 12: the graph's response to ontology evolution is the same as its response to any other form of conflict or structural tension -- surface it, record it, and resolve it deliberately. An ontology change that silently invalidates existing edges is a hidden violation of the provenance guarantee. An ontology change that makes violations visible and auditable is just another form of graph linting.