Chapter 8: Identity During Extraction

The Ingestion Pipeline's View

The ingestion pipeline treats the identity server as a black box. It calls resolve(mention, entity_type) and receives a canonical or provisional ID. It stores that ID in the relationship record. It does not know or care whether the ID was resolved from an authority, created as provisional, or returned from cache.

This separation is the design goal. The ingestion pipeline is responsible for extracting structured claims from text. The identity server is responsible for deciding what those claims are about. Keeping these responsibilities separate means the pipeline can be tested against a stub identity server, and the identity server can be upgraded without changing the pipeline.
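As an illustration of that testability claim, here is a minimal sketch of a stub identity server in Python. Only the resolve(mention, entity_type) call comes from the text above. The names IdentityResolver, StubIdentityServer, and resolve_all, and the "stub:" ID format, are assumptions made for this example.

```python
from typing import Protocol


class IdentityResolver(Protocol):
    """The only surface the pipeline depends on (hypothetical interface name)."""

    def resolve(self, mention: str, entity_type: str) -> str:
        ...


class StubIdentityServer:
    """Test double: returns a deterministic fake ID and records each call."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def resolve(self, mention: str, entity_type: str) -> str:
        self.calls.append((mention, entity_type))
        # Assumed ID format, chosen only so test output is easy to read.
        return f"stub:{entity_type}:{mention.strip().lower()}"


def resolve_all(
    resolver: IdentityResolver, mentions: list[tuple[str, str]]
) -> list[str]:
    """Pipeline-side code: it sees IDs, never how they were produced."""
    return [resolver.resolve(text, entity_type) for text, entity_type in mentions]
```

Because resolve_all depends only on the resolve signature, a real identity server client could be passed in place of the stub without changing the pipeline code.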

The Ingest Stage

After the LLM extraction pass produces raw mentions and relationships, the ingest stage calls the identity server to resolve each mention to an ID. The ingest stage processes one paper at a time. For each mention in the paper's extraction output, it calls resolve and records the returned ID, which may be canonical or provisional. For each relationship, it replaces the subject and object mentions with their resolved IDs before storing the relationship.

The ingest stage is the point where the graph's entities gain their stable identities. Before ingest, entities are mention strings. After ingest, they are canonical or provisional IDs. The graph is built from IDs, not strings.
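The per-paper flow described in this section might look roughly like the following Python sketch. The record types (Mention, RawRelationship, ResolvedRelationship) and the function ingest_paper are hypothetical; the text only specifies that each mention is passed to resolve and that relationships are stored with IDs in place of mention strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Mention:
    text: str
    entity_type: str


@dataclass(frozen=True)
class RawRelationship:
    subject: Mention
    predicate: str
    obj: Mention


@dataclass(frozen=True)
class ResolvedRelationship:
    subject_id: str
    predicate: str
    object_id: str


# Stand-in for the identity server's resolve(mention, entity_type) call.
Resolve = Callable[[str, str], str]


def ingest_paper(
    mentions: list[Mention],
    relationships: list[RawRelationship],
    resolve: Resolve,
) -> tuple[dict[Mention, str], list[ResolvedRelationship]]:
    """Resolve every mention in one paper, then rewrite relationships to IDs."""
    ids: dict[Mention, str] = {}
    for mention in mentions:
        ids[mention] = resolve(mention.text, mention.entity_type)

    resolved: list[ResolvedRelationship] = []
    for rel in relationships:
        for mention in (rel.subject, rel.obj):
            # Assumption: a relationship may mention something that was not
            # in the mention list, so resolve it on demand.
            if mention not in ids:
                ids[mention] = resolve(mention.text, mention.entity_type)
        resolved.append(
            ResolvedRelationship(ids[rel.subject], rel.predicate, ids[rel.obj])
        )
    return ids, resolved
```

Nothing in the returned records distinguishes a canonical ID from a provisional one, which matches the pipeline's black-box view of the identity server.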

Handling Provisional Entities

Provisional entities participate in the graph exactly like canonical entities. A relationship between two provisional entities is a valid graph edge. Queries that traverse provisional entities return results. Evidence accumulates against provisional entities and is preserved through promotion and merging.

This is a deliberate design choice. An alternative design would defer graph construction until all entities are resolved, but that creates a chicken-and-egg problem: resolution quality improves with more evidence, yet evidence can only accumulate once entities are in the graph. By allowing provisional entities to participate immediately, the identity server enables a pipeline that processes papers in any order and resolves entities progressively as evidence accumulates.
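To make the behavior of provisional entities concrete, the sketch below shows a toy in-memory graph in Python where edges and evidence are keyed by ID, and where promoting a provisional ID rewrites its edges and carries its evidence forward. The Graph class, its promote method, and the "prov:" and "canon:" ID prefixes are all assumptions for illustration; the text does not say how promotion or merging is actually implemented.

```python
from collections import defaultdict


class Graph:
    """Toy graph keyed by entity ID, whether canonical or provisional."""

    def __init__(self) -> None:
        self.edges: set[tuple[str, str, str]] = set()
        self.evidence: defaultdict[str, list[str]] = defaultdict(list)

    def add_edge(
        self, subject_id: str, predicate: str, object_id: str, source: str
    ) -> None:
        """Store an edge and record the source as evidence for both endpoints."""
        self.edges.add((subject_id, predicate, object_id))
        self.evidence[subject_id].append(source)
        self.evidence[object_id].append(source)

    def neighbors(self, entity_id: str) -> set[str]:
        """Traversal works the same for provisional and canonical IDs."""
        return {o for s, _, o in self.edges if s == entity_id}

    def promote(self, provisional_id: str, canonical_id: str) -> None:
        """Rewrite edges to the canonical ID and move the evidence across."""
        if provisional_id == canonical_id:
            return
        self.edges = {
            (
                canonical_id if s == provisional_id else s,
                p,
                canonical_id if o == provisional_id else o,
            )
            for s, p, o in self.edges
        }
        self.evidence[canonical_id].extend(self.evidence.pop(provisional_id, []))
```

In this sketch an edge between two provisional IDs is queryable as soon as it is added, and after promote the same neighbors and evidence are reachable under the canonical ID.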