
Part III: Integration

Chapter 8: Identity During Extraction

The Ingestion Pipeline's View

The ingestion pipeline treats the identity server as a black box. It calls resolve(mention, entity_type) and receives a canonical or provisional ID. It stores that ID in the relationship record. It does not know or care whether the ID was resolved from an authority, created as provisional, or returned from cache.

This separation is the design goal. The ingestion pipeline is responsible for extracting structured claims from text. The identity server is responsible for deciding what those claims are about. Keeping these responsibilities separate means the pipeline can be tested against a stub identity server, and the identity server can be upgraded without changing the pipeline.
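A stub of the kind mentioned above might look like the following minimal sketch. Only the resolve(mention, entity_type) call shape comes from the text. The class names, the "prov:" and "auth:" ID formats, and the case-insensitive matching are assumptions made for illustration, not the identity server's actual behavior.

```python
from typing import Dict, Optional, Protocol, Tuple


class IdentityResolver(Protocol):
    """The only surface the ingestion pipeline depends on."""

    def resolve(self, mention: str, entity_type: str) -> str: ...


class StubIdentityServer:
    """Test double: deterministic IDs, no authority lookup, no cache."""

    def __init__(self, known: Optional[Dict[Tuple[str, str], str]] = None) -> None:
        # Pre-seeded "canonical" answers, keyed by (normalized mention, type).
        self._ids: Dict[Tuple[str, str], str] = {
            (mention.strip().lower(), etype): entity_id
            for (mention, etype), entity_id in (known or {}).items()
        }
        self._next = 1

    def resolve(self, mention: str, entity_type: str) -> str:
        key = (mention.strip().lower(), entity_type)
        if key not in self._ids:
            # Unknown mention: mint a provisional ID (hypothetical format).
            self._ids[key] = f"prov:{self._next:04d}"
            self._next += 1
        return self._ids[key]
```

Because the pipeline sees only an opaque ID string, swapping this stub for the real service requires no pipeline changes.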

The Ingest Stage

After the LLM extraction pass produces raw mentions and relationships, the ingest stage calls the identity server to resolve each mention to a canonical ID. The ingest stage processes one paper at a time. For each mention in the paper's extraction output, it calls resolve and records the returned ID. For each relationship, it substitutes the canonical IDs for both the subject and object mentions before storing the relationship.

The ingest stage is the point where the graph's entities gain their stable identities. Before ingest, entities are mention strings. After ingest, they are canonical or provisional IDs. The graph is built from IDs, not strings.
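The per-paper loop described above can be sketched as follows. The record types, the field names, and the shape of the extraction output (a mapping from mention string to entity type plus a list of relationships) are assumptions for illustration. The text specifies only that each mention is resolved and that relationships are stored with IDs in place of mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RawRelationship:
    subject: str    # mention string from the extraction pass
    predicate: str
    object: str     # mention string from the extraction pass


@dataclass(frozen=True)
class ResolvedRelationship:
    subject_id: str  # canonical or provisional ID
    predicate: str
    object_id: str   # canonical or provisional ID


def ingest_paper(
    mentions: Dict[str, str],                 # mention text -> entity type
    relationships: List[RawRelationship],
    resolve: Callable[[str, str], str],       # the identity server call
) -> List[ResolvedRelationship]:
    # Resolve every mention in the paper once and remember the returned ID.
    ids = {text: resolve(text, etype) for text, etype in mentions.items()}
    # Substitute IDs for both ends of each relationship before storing it.
    return [
        ResolvedRelationship(ids[r.subject], r.predicate, ids[r.object])
        for r in relationships
    ]
```

In this sketch a relationship that refers to a mention missing from the mention list raises KeyError. A real pipeline would need to decide how to handle that case.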

Handling Provisional Entities

Provisional entities participate in the graph exactly like canonical entities. A relationship between two provisional entities is a valid graph edge. Queries that traverse provisional entities return results. Evidence accumulates against provisional entities and is preserved through promotion and merging.

This is a deliberate design choice. The alternative, deferring graph construction until all entities are resolved, creates a chicken-and-egg problem: resolution quality improves with more evidence, but evidence can only accumulate if entities are in the graph. By allowing provisional entities to participate immediately, the identity server enables a pipeline that processes papers in any order and resolves entities progressively as evidence accumulates.
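One way to picture evidence surviving a merge is the toy edge store below. It is a sketch under stated assumptions, not the system's storage model. The EvidenceGraph class, the edge tuple layout, and the ID strings are all hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Edge = Tuple[str, str, str]  # (subject_id, predicate, object_id)


class EvidenceGraph:
    """Toy edge store: each edge keeps the list of sources that support it."""

    def __init__(self) -> None:
        self.evidence: DefaultDict[Edge, List[str]] = defaultdict(list)

    def add(self, edge: Edge, source: str) -> None:
        # Provisional and canonical IDs are treated identically here.
        self.evidence[edge].append(source)

    def merge(self, old_id: str, new_id: str) -> None:
        """Re-point every edge from old_id to new_id, keeping all evidence."""
        merged: DefaultDict[Edge, List[str]] = defaultdict(list)
        for (s, p, o), sources in self.evidence.items():
            s = new_id if s == old_id else s
            o = new_id if o == old_id else o
            merged[(s, p, o)].extend(sources)
        self.evidence = merged
```

If a provisional entity is later found to be the same as a canonical one, the two edges collapse into one and their evidence lists are concatenated rather than discarded.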

Chapter 9: Identity During Querying

search_entities and the Identity Server

BFS-QL's search_entities tool accepts a natural-language string and returns a list of matching entity IDs. Under the hood, this is an identity server operation: embed the query string, search for nearby entity vectors in the identity server's database, return the canonical IDs of the matching entities.

The caller -- the LLM using the BFS-QL interface -- does not know that it is calling the identity server. It provides a string and receives IDs. The identity server provides the matching. This is the correct abstraction: the query layer is responsible for traversal, the identity server is responsible for resolution.
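The embed-then-search flow can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration. toy_embed is a letter-count stand-in for a real embedding model, EntityIndex stands in for the identity server's database, and the limit parameter is hypothetical.

```python
import math
from typing import Dict, List


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: a 26-dim vector of letter counts."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine_distance(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


class EntityIndex:
    """Toy identity-server index: entity ID -> embedding of its name."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, entity_id: str, name: str) -> None:
        self._vectors[entity_id] = toy_embed(name)

    def search_entities(self, query: str, limit: int = 5) -> List[str]:
        # Embed the query, then rank stored entities by distance to it.
        q = toy_embed(query)
        ranked = sorted(
            self._vectors,
            key=lambda eid: cosine_distance(q, self._vectors[eid]),
        )
        return ranked[:limit]
```

The caller passes a string and gets IDs back. The embedding function and the ranking never leave the index.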

Embeddings Are an Identity Server Concern

BFS-QL Chapter 10, which describes the Postgres/pgvector backend, notes that "embedding model consistency between ingest and query time must be explicit metadata, not convention." The identity server satisfies this requirement by owning all embeddings.

Because the identity server manages both ingest-time embedding (during entity creation and the embedding-similarity stage of the lookup chain) and query-time embedding (during search_entities), it guarantees consistency without any coordination between the ingestion pipeline and the query layer. The query layer calls search_entities with a string; the identity server embeds it with the same model it used during ingest; the cosine distances are meaningful.

The embedding model, vector dimensions, and distance metric are internal implementation details of the identity server. The ingestion pipeline does not know which embedding model is in use. The query layer does not know. Only the identity server knows, and it is consistent because it is a single service.
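A minimal sketch of what "explicit metadata" could look like inside the identity server follows. The EmbeddingSpec and VectorStore names, their fields, and the model name used in the example are hypothetical. The text states only that the model, dimensions, and metric are owned by the identity server.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str        # hypothetical model identifier
    dimensions: int
    metric: str       # e.g. "cosine"


class VectorStore:
    """Toy store that records its embedding spec and rejects mismatches."""

    def __init__(self, spec: EmbeddingSpec) -> None:
        self.spec = spec
        self.rows: Dict[str, List[float]] = {}

    def _check(self, vector: List[float], spec: EmbeddingSpec) -> None:
        if spec != self.spec:
            raise ValueError(f"spec mismatch: store uses {self.spec}, got {spec}")
        if len(vector) != self.spec.dimensions:
            raise ValueError(f"expected {self.spec.dimensions} dimensions")

    def put(self, entity_id: str, vector: List[float], spec: EmbeddingSpec) -> None:
        # Ingest-time write: only vectors from the recorded spec are accepted.
        self._check(vector, spec)
        self.rows[entity_id] = vector

    def validate_query(self, vector: List[float], spec: EmbeddingSpec) -> None:
        # Query-time check: the query vector must come from the same spec.
        self._check(vector, spec)
```

Because a single service holds the spec and performs both writes and queries, neither the ingestion pipeline nor the query layer ever needs to pass it around.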

Cross-Graph Composition

When a BFS-QL client has connections to two graphs -- one built from research papers, one built from clinical trial records -- and both graphs anchor their entities to the same authorities, the client can traverse from one graph to the other using canonical IDs as bridges.

This traversal requires no special protocol support in BFS-QL. The client calls bfs_query on the first graph starting from a canonical ID. The response includes the canonical IDs of neighboring entities. The client calls search_entities on the second graph with those IDs. If the second graph contains entities with matching canonical IDs, the traversal crosses graphs.

The identity server is why this works. Both graphs used the same authorities. Both graphs anchored their entities to those authorities. The shared IDs are the bridges. The identity server made them available; BFS-QL traverses them.
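The crossing described above can be sketched with two toy in-memory graphs. The tool names bfs_query and search_entities come from the text, but their signatures here are assumptions, and this toy search_entities does only exact ID matching as a stand-in for the real lookup. ToyGraph, bridge_ids, and all ID strings are hypothetical.

```python
from collections import deque
from typing import Dict, List


class ToyGraph:
    """In-memory stand-in for one BFS-QL graph connection."""

    def __init__(self, edges: Dict[str, List[str]]) -> None:
        self._edges = edges

    def bfs_query(self, start_id: str, depth: int = 1) -> List[str]:
        """Return IDs reachable from start_id within `depth` hops, in BFS order."""
        seen = {start_id}
        order: List[str] = []
        frontier = deque([(start_id, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == depth:
                continue
            for nxt in self._edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    frontier.append((nxt, dist + 1))
        return order

    def search_entities(self, query: str) -> List[str]:
        """Toy lookup: return the ID if this graph contains it, else nothing."""
        known = set(self._edges) | {n for ns in self._edges.values() for n in ns}
        return [query] if query in known else []


def bridge_ids(first: ToyGraph, second: ToyGraph, start_id: str) -> List[str]:
    # Neighbors in the first graph whose IDs also exist in the second graph.
    return [eid for eid in first.bfs_query(start_id) if second.search_entities(eid)]
```

A provisional ID minted for one graph would not appear in the other, so in this sketch only entities anchored to a shared authority act as bridges.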