Part II: The Typed Graph¶
Chapter 4: What a Typed Graph Is¶
Beyond the Triple¶
The foundational unit of the Semantic Web is the RDF triple: (subject, predicate, object). In its purest form, an untyped graph is a collection of these triples where any node can be a subject or object, and any string can be a predicate. While this flexibility was a design goal for the "Web of Data," it is a liability for an engineering artifact. In an untyped graph, you can assert that a drug "inhibits" a city, or that a gene "is_prescribed_for" a protein. The system has no grounds to object; it merely records the triple.
A typed graph abandons this infinite flexibility in favor of structural guarantees. It declares a finite set of entity types (e.g., DRUG, GENE, DISEASE) and a finite vocabulary of predicates. Crucially, every predicate in a typed graph carries a domain and a range: the set of entity types that may appear as its subject and object, respectively. A predicate like inhibits might have a domain of (DRUG, GENE) and a range of (GENE, BIOLOGICAL_PROCESS). Any attempt to create an edge that violates these constraints is not a "bad fact"—it is a structural failure, as meaningless as a syntax error in a compiled language.
The Ontology as Contract¶
In a typed graph, the ontology is not documentation; it is a machine-checkable contract that governs every edge. This distinction is foundational. Documentation is aspirational—it describes how the data should look. A contract is enforceable—it defines what the data is permitted to look like.
When a graph is governed by a contract, the software that interacts with it can make strong assumptions. A query optimizer knows exactly which entity types it will encounter after traversing a specific predicate. A visualization tool knows which icons to use for nodes based on their declared type. Most importantly, an ingestion pipeline can reject malformed extractions before they ever reach the database. By moving constraints from the application layer into the graph's own structure, we ensure that the graph's integrity is an architectural property rather than a convention that must be remembered by every developer.
PredicateSpec and EntityType¶
To make these constraints concrete, we represent the ontology as a Domain Spec. In the reference implementation, this is defined using Pydantic models and Python enums.
    from enum import Enum
    from typing import FrozenSet, Optional

    from pydantic import BaseModel


    class EntityType(str, Enum):
        DRUG = "drug"
        GENE = "gene"
        DISEASE = "disease"
        PROCESS = "biological_process"


    class PredicateSpec(BaseModel):
        name: str
        domain: FrozenSet[EntityType]  # entity types allowed as the subject
        range: FrozenSet[EntityType]   # entity types allowed as the object
        description: str
        is_functional: bool = False    # at most one such outgoing edge per subject
        negation_of: Optional[str] = None

        class Config:
            # Pydantic v1-style config; v2 spells this model_config = ConfigDict(frozen=True).
            frozen = True
The EntityType enum defines the closed world of things that can exist. The PredicateSpec carries the rules for their interaction. The domain and range are sets, allowing a predicate to bridge multiple type pairs (e.g., a DRUG can inhibit a GENE, but a GENE can also inhibit another GENE). The is_functional flag indicates that a subject can have at most one such outgoing edge—a structural way to represent unique properties.
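A minimal sketch of how an ingestion pipeline might enforce these rules before writing an edge. To stay dependency-free it uses a frozen dataclass as a stand-in for the Pydantic model; the `Edge` tuple layout and the `check_edge` helper are hypothetical names, not part of the reference implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List, Tuple


class EntityType(str, Enum):
    DRUG = "drug"
    GENE = "gene"
    DISEASE = "disease"
    PROCESS = "biological_process"


@dataclass(frozen=True)
class PredicateSpec:
    # Stdlib stand-in for the Pydantic model (assumption, to avoid a dependency).
    name: str
    domain: FrozenSet[EntityType]
    range: FrozenSet[EntityType]
    is_functional: bool = False


# Hypothetical edge layout: (subject_id, subject_type, predicate, object_id, object_type)
Edge = Tuple[str, EntityType, str, str, EntityType]


def check_edge(spec: PredicateSpec, edge: Edge, existing: Iterable[Edge] = ()) -> List[str]:
    """Return a list of violations; an empty list means the edge is well-formed."""
    subj_id, subj_type, predicate, obj_id, obj_type = edge
    problems = []
    if predicate != spec.name:
        problems.append(f"edge uses {predicate!r}, spec is for {spec.name!r}")
    if subj_type not in spec.domain:
        problems.append(f"{subj_type.name} is not in the domain of {spec.name}")
    if obj_type not in spec.range:
        problems.append(f"{obj_type.name} is not in the range of {spec.name}")
    # Functional predicates allow one object per subject; re-asserting the same edge is fine.
    if spec.is_functional and any(
        e[0] == subj_id and e[2] == spec.name and e[3] != obj_id for e in existing
    ):
        problems.append(f"{spec.name} is functional and {subj_id} already has an edge")
    return problems


inhibits = PredicateSpec(
    name="inhibits",
    domain=frozenset({EntityType.DRUG, EntityType.GENE}),
    range=frozenset({EntityType.GENE, EntityType.PROCESS}),
)
ok = check_edge(inhibits, ("d1", EntityType.DRUG, "inhibits", "g1", EntityType.GENE))
bad = check_edge(inhibits, ("d1", EntityType.DRUG, "inhibits", "x1", EntityType.DISEASE))
```

Returning a list of violations rather than raising on the first one lets a caller log every reason an extraction was rejected.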
Where the Ontology Comes From¶
The engineer does not invent this schema from first principles. Instead, the ontology is derived from the epistemic commons. The biomedical community has already done the hard work of defining what these types and relationships are.
MeSH's category hierarchy provides the implicit entity types. RxNorm's drug-disease relationships provide the predicates. HGNC's gene-protein associations define the domain and range constraints. The typed graph schema simply makes these implicit structures explicit and computable. By deriving the ontology from the same authorities used for identity resolution, we ensure that the graph's structure is aligned with the community's own knowledge. If the National Library of Medicine says that a drug treats a disease, the treats predicate in our schema will have a domain of DRUG and a range of DISEASE.
Finite vs. Open-World¶
The typed graph is a closed-world artifact. This is the key difference from the RDF/OWL open-world assumption. In an open-world system, the absence of a statement means its truth is unknown. In a closed-world typed graph, predicates outside the schema do not exist. If upregulates is not in the domain spec, it cannot be asserted.
This limitation is the source of the graph's expressive power. By bounding the vocabulary, we make the graph's contents predictable and searchable. We move from a "bag of triples" to a structured knowledge base that can be linted, validated, and queried with mathematical precision. The typed graph does not try to represent everything; it tries to represent its specific domain perfectly.
Chapter 5: The Domain Service and the Schema¶
What the Domain Provides¶
The domain service is where domain knowledge lives. It is a small HTTP service -- four endpoints, each doing one thing -- that the identity server calls when it needs to make a domain-specific decision.
The biomedical domain service for the medlit reference implementation calls the PubChem API for chemical entities, the MeSH API for disease and biological process entities, the HGNC REST API for gene entities, and the RxNorm API for drug entities. It implements synonym detection thresholds tuned for biomedical nomenclature. It selects survivors by preferring authority-anchored records over provisional ones. It computes confidence from a study type weight table aligned with evidence-based medicine principles.
A domain service for legal entities would call different authorities -- perhaps a court document database for case citations, a legislative database for statute references -- with different synonym criteria and different confidence weights (or none at all). The domain service for a materials science corpus would consult different authorities again.
The base server does not know or care about any of this. It knows the four endpoint contracts. The domain service fulfills them.
Evidence Quality Weighting¶
In evidence-based medicine, not all evidence is equal. A randomized controlled trial (RCT) is the strongest form of primary evidence for a clinical claim. A meta-analysis that synthesizes multiple RCTs can be stronger still, but only to the extent that its constituent trials are sound. An observational study is weaker; a single case report is the weakest form of published evidence.
The domain service encodes this hierarchy in a weight table:
    STUDY_WEIGHTS = {
        "meta_analysis": 0.95,
        "rct": 1.0,
        "cohort": 0.8,
        "case_control": 0.7,
        "observational": 0.6,
        "review": 0.5,
        "case_report": 0.4,
    }
When the identity server asks the domain service to compute confidence for a list of provenance records, the domain service looks up the study type of each record, retrieves its weight, and aggregates. The aggregation formula is configurable -- a simple maximum, a weighted mean, or a formula that rewards replication across independent studies.
The weight table is a model, not ground truth. A well-replicated observational finding across five independent cohorts may be more reliable than a single small RCT. The weights are a defensible starting point; the domain service makes them transparent and filterable rather than hiding them inside a black box.
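The three aggregation options can be sketched as follows. This is an illustration under assumptions: the function names, the `strategy` parameter, the choice of a noisy-OR formula as the "rewards replication" option, and the 0.0 result for an empty list are all hypothetical. The 0.99 cap matches the cap the medlit implementation applies to its weighted mean.

```python
from math import prod
from typing import Callable, Dict, List

STUDY_WEIGHTS: Dict[str, float] = {
    "meta_analysis": 0.95,
    "rct": 1.0,
    "cohort": 0.8,
    "case_control": 0.7,
    "observational": 0.6,
    "review": 0.5,
    "case_report": 0.4,
}


def aggregate_max(weights: List[float]) -> float:
    # The single strongest study decides the score.
    return max(weights)


def aggregate_mean(weights: List[float]) -> float:
    # Every study counts equally; weak studies pull the score down.
    return sum(weights) / len(weights)


def aggregate_noisy_or(weights: List[float]) -> float:
    # Rewards replication: each additional study can only raise the score.
    return 1.0 - prod(1.0 - w for w in weights)


AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "max": aggregate_max,
    "mean": aggregate_mean,
    "noisy_or": aggregate_noisy_or,
}


def compute_confidence(study_types: List[str], strategy: str = "mean", cap: float = 0.99) -> float:
    """Aggregate study-type weights into one capped confidence score."""
    if not study_types:
        return 0.0  # assumption: no provenance means no confidence
    # Unknown study types raise KeyError rather than being scored silently.
    weights = [STUDY_WEIGHTS[s] for s in study_types]
    return min(cap, AGGREGATORS[strategy](weights))
```

With two case reports, the mean stays at 0.4 while the noisy-OR variant rises to about 0.64, which is the kind of difference the choice of formula makes for replicated but individually weak evidence.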
The Schema as a Runtime Artifact¶
In traditional database design, the schema is a static artifact—a set of SQL DDL statements or a compiled Protobuf definition that remains fixed until the next deployment. In the typed graph architecture, we treat the schema as a dynamic runtime artifact served by the domain service.
When the base identity server initializes, it is
semantically empty. It understands the mechanics of resolution and the state
machine of entities, but it has no knowledge of the specific entity types or
predicates that define a domain. Its first action is to query the domain
service's GET /schema endpoint. The response is a serialized
ontology: a complete declaration of the finite set of
EntityType enums and PredicateSpec objects that govern the graph.
This late-binding of the ontology is what enables the separation of concerns
between the engine and the domain. Because the base server discovers its
constraints at runtime, it can perform predicate validation, type checking, and conflict
detection without being recompiled for every new
project. If the medlit domain service adds a new predicate—for instance,
contraindicated_in(drug, disease)—the identity server immediately inherits
the knowledge of that predicate's domain and range constraints.
By elevating the schema to a runtime artifact, we move it from being passive documentation to an active, executable specification. This same artifact seeds the graph linter (Chapter 13) and the BFS-QL compiler (Chapter 10), ensuring that every component in the stack is synchronized against a single, authoritative definition of what a well-formed claim looks like. The schema is not just a description of the data; it is the machine-readable contract that makes the data trustworthy.
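As a rough sketch of late binding, the snippet below parses a schema payload into an in-memory registry and answers validity questions from it. The JSON shape, the field names, and the `LoadedSchema` and `load_schema` names are assumptions made for illustration; the text does not specify the wire format of GET /schema, and a real server would fetch the payload over HTTP rather than from a string.

```python
import json
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Hypothetical GET /schema response body.
SCHEMA_JSON = """
{
  "version": "2.3",
  "entity_types": ["drug", "gene", "disease", "biological_process"],
  "predicates": [
    {"name": "treats", "domain": ["drug"], "range": ["disease"]},
    {"name": "contraindicated_in", "domain": ["drug"], "range": ["disease"]}
  ]
}
"""


@dataclass(frozen=True)
class LoadedPredicate:
    name: str
    domain: FrozenSet[str]
    range: FrozenSet[str]


@dataclass(frozen=True)
class LoadedSchema:
    version: str
    entity_types: FrozenSet[str]
    predicates: Dict[str, LoadedPredicate]

    def allows(self, subject_type: str, predicate: str, object_type: str) -> bool:
        spec = self.predicates.get(predicate)
        # Closed world: a predicate missing from the schema cannot be asserted.
        return spec is not None and subject_type in spec.domain and object_type in spec.range


def load_schema(payload: str) -> LoadedSchema:
    """Parse the payload and reject predicates that mention undeclared entity types."""
    raw = json.loads(payload)
    types = frozenset(raw["entity_types"])
    predicates: Dict[str, LoadedPredicate] = {}
    for p in raw["predicates"]:
        domain, rng = frozenset(p["domain"]), frozenset(p["range"])
        unknown = (domain | rng) - types
        if unknown:
            raise ValueError(f"predicate {p['name']!r} uses undeclared types: {sorted(unknown)}")
        predicates[p["name"]] = LoadedPredicate(p["name"], domain, rng)
    return LoadedSchema(raw["version"], types, predicates)


schema = load_schema(SCHEMA_JSON)
```

Adding `contraindicated_in` required no change to the loader or the `allows` check, only a new entry in the payload.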
Implementing the Domain Service¶
The medlit domain service is implemented in Python using FastAPI and Pydantic. FastAPI provides automatic OpenAPI documentation and request validation. Pydantic models define the request and response schemas for each endpoint.
The /resolve-authority endpoint accepts a mention string and entity type.
It dispatches to the appropriate authority API based on entity type, normalizes
the response to a canonical ID and authority name, and returns the result. On
a cache miss, it calls the external API and caches the response for the duration
of the run.
The /select-survivor endpoint accepts two entity records and returns the
preferred one. The medlit implementation prefers the record with an authority
ID; if both have authority IDs from the same authority, it prefers the one with
more supporting evidence; if evidence counts are equal, it prefers the more
recently updated record.
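The preference order can be expressed as a single sort key. This is a simplified sketch: the `EntityRecord` fields are hypothetical, and the tie-breakers are applied whenever both records have the same anchoring status, without checking that two authority IDs come from the same authority.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class EntityRecord:
    # Hypothetical record shape for illustration.
    entity_id: str
    authority_id: Optional[str]
    evidence_count: int
    updated_at: datetime


def select_survivor(a: EntityRecord, b: EntityRecord) -> EntityRecord:
    """Prefer an authority ID, then more evidence, then the more recent update."""
    def rank(r: EntityRecord):
        # Tuples compare element by element, so earlier fields dominate later ones.
        return (r.authority_id is not None, r.evidence_count, r.updated_at)

    # max() returns its first argument on a complete tie, so the result is deterministic.
    return max(a, b, key=rank)
```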
The /compute-confidence endpoint accepts a list of provenance records and
returns a float. The medlit implementation looks up the study type of each
record, applies the weight table, and returns a weighted mean capped at 0.99.
The /synonym-criteria endpoint returns a static configuration object defining
the similarity thresholds for fuzzy and embedding-based synonym detection.
Chapter 6: The Base Identity Server¶
Domain-Agnostic Core¶
The base identity server contains everything that is true of identity resolution regardless of domain:
The provisional/canonical/merged state machine. Every entity starts as provisional or enters directly as canonical (for provenance-derived entities like papers and authors). Provisional entities accumulate evidence and are promoted when a threshold is met. Merged entities are absorbed into a survivor and cease to exist as independent nodes.
The lookup chain. Exact match against known surface forms. Fuzzy match via edit distance. Embedding similarity via pgvector. These three stages are universal; only the thresholds and the authority consulted at each stage are domain-specific.
Idempotency. All operations must be safe to retry. Ingestion pipelines fail. Runs restart from checkpoints. The identity server must produce the same result whether a resolve call is the first or the fifteenth for a given mention.
Postgres locking. Multiple ingestion workers run concurrently. Advisory locks prevent race conditions on entity creation and merging without serializing the entire pipeline.
pgvector similarity search. Embedding vectors are stored in Postgres using the
pgvector extension. The cosine distance query ORDER BY embedding <=> $1 LIMIT k
is the implementation of the embedding similarity stage of the lookup chain.
None of this is domain-specific. A graph of legal entities uses the same state machine, the same lookup chain structure, the same idempotency requirements, the same locking strategy as a graph of biomedical entities. The base server handles all of it.
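The three-stage lookup chain can be sketched in pure Python. Several things here are stand-ins: `difflib.SequenceMatcher` approximates the edit-distance stage, an in-memory dictionary of vectors replaces the pgvector query, and the thresholds and the `resolve` signature are hypothetical.

```python
from difflib import SequenceMatcher
from math import sqrt
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def resolve(
    mention: str,
    mention_vector: Sequence[float],
    surface_forms: Dict[str, str],            # lowercased surface form -> entity id
    vectors: Dict[str, Sequence[float]],      # entity id -> embedding
    fuzzy_threshold: float = 0.9,             # hypothetical value
    embedding_threshold: float = 0.85,        # hypothetical value
) -> Optional[Tuple[str, str]]:
    """Return (entity_id, stage) for the first stage that matches, else None."""
    key = mention.casefold()

    # Stage 1: exact match against known surface forms.
    if key in surface_forms:
        return surface_forms[key], "exact"

    # Stage 2: fuzzy match; the similarity ratio stands in for edit distance.
    def ratio(form: str) -> float:
        return SequenceMatcher(None, key, form).ratio()

    best_form = max(surface_forms, key=ratio, default=None)
    if best_form is not None and ratio(best_form) >= fuzzy_threshold:
        return surface_forms[best_form], "fuzzy"

    # Stage 3: embedding similarity; in the real system this is a pgvector query.
    def similarity(entity_id: str) -> float:
        return cosine_similarity(mention_vector, vectors[entity_id])

    best_id = max(vectors, key=similarity, default=None)
    if best_id is not None and similarity(best_id) >= embedding_threshold:
        return best_id, "embedding"
    return None


surface_forms = {"desmopressin": "E1", "cortisol": "E2"}
vectors = {"E1": [1.0, 0.0], "E2": [0.0, 1.0]}
```

Returning the stage alongside the entity ID is a design choice of this sketch; it makes it easy to audit how many resolutions rely on the weaker stages.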
The Plugin Contract¶
The base server calls out to a domain service for four things it cannot know:
Authority lookup: Given a mention string and entity type, consult the appropriate external authority and return a canonical ID if one exists. The base server does not know which authorities exist, which APIs to call, or how to interpret their responses. The domain service knows all of this.
Synonym criteria: Given two entity records, are they close enough to be considered synonyms? The threshold for synonym detection varies by domain and entity type. A gene symbol and a gene full name that share no characters may be synonyms; two drug names that differ by one character may not be.
Survivor selection: Given two entities being merged, which record becomes the survivor? The domain may prefer the record with an authority ID over a provisional one, the record with more supporting evidence, or the more recently updated record. The domain service implements this preference.
Confidence weighting: Given a list of provenance records, compute an aggregate confidence score. The base server provides the list; the domain service provides the weights and the aggregation logic. In biomedicine, an RCT outweighs a case report; in other domains, the weights are different or absent.
These four hooks are the complete plugin contract. The domain service implements them. The base server calls them. Neither needs to know anything about the other's internal implementation.
The Docker Image¶
The base identity server ships as a Docker image. The image contains:
- The Python service implementing the five identity server operations
- Postgres client libraries and pgvector support
- An HTTP client for calling the domain service
- An LRU cache layer wrapping the domain service client
- A stub domain service that returns nulls and defaults
The stub domain service makes the image functional without any domain configuration. It resolves nothing to authorities (so all entities start as provisional), treats every candidate pair as non-synonyms, always selects the first entity as the survivor, and returns a confidence of 0.5 for all provenance lists. This is correct behavior for a system with no domain knowledge -- it is not an error state.
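One way to picture the four hooks and the stub is as a `typing.Protocol` with a trivial implementation. The method names, the use of plain dictionaries for records, and the `initial_status` helper are hypothetical; the actual contract is an HTTP API, and this sketch only mirrors its shape in-process.

```python
from typing import Optional, Protocol, Sequence


class DomainService(Protocol):
    """The four hooks the base server relies on (method names are illustrative)."""

    def resolve_authority(self, mention: str, entity_type: str) -> Optional[str]: ...
    def are_synonyms(self, a: dict, b: dict) -> bool: ...
    def select_survivor(self, a: dict, b: dict) -> dict: ...
    def compute_confidence(self, provenance: Sequence[dict]) -> float: ...


class StubDomainService:
    """Behaves as the text describes for a system with no domain knowledge."""

    def resolve_authority(self, mention: str, entity_type: str) -> Optional[str]:
        return None  # nothing resolves to an authority

    def are_synonyms(self, a: dict, b: dict) -> bool:
        return False  # no pair is ever treated as synonymous

    def select_survivor(self, a: dict, b: dict) -> dict:
        return a  # always the first entity

    def compute_confidence(self, provenance: Sequence[dict]) -> float:
        return 0.5  # fixed default for every provenance list


def initial_status(service: DomainService, mention: str, entity_type: str) -> str:
    # With the stub, every mention lands in the provisional state.
    resolved = service.resolve_authority(mention, entity_type)
    return "canonical" if resolved is not None else "provisional"


stub = StubDomainService()
```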
To deploy the identity server for a real domain, replace the stub with a real domain service pointed at the appropriate authorities. The identity server image does not change. The domain service is a separate container.
Caching¶
Why caching is not optional¶
The lookup chain calls the domain service for every entity mention that does not resolve at the exact match stage. In a corpus of ten thousand papers, this means tens of thousands of HTTP calls to the domain service, each of which may call an external authority API. Without caching, a large corpus run is slow, expensive in API costs, and potentially rate-limited out of completion.
The caching strategy has two levels: an LRU cache in the identity server that caches domain service HTTP responses, and a long-TTL cache in the domain service that caches external authority API responses. Together they ensure that the expensive operations -- external API calls -- happen once per unique entity, not once per mention.
LRU cache in the identity server¶
The identity server wraps its HTTP client to the domain service with an LRU
cache keyed on (mention, entity_type). A call to /resolve-authority for
"desmopressin" with type "drug" will hit the external authority API on the
first mention in the corpus and return the cached result for every subsequent
mention.
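A minimal sketch of this wrapper using `functools.lru_cache`. The HTTP call is replaced by a local function that records each invocation, and the `FAKE_AUTHORITY` table and its `AUTH:0001` identifier are placeholders, not real authority data.

```python
from functools import lru_cache
from typing import Dict, List, Optional, Tuple

# Placeholder for what the domain service would return (not real authority data).
FAKE_AUTHORITY: Dict[Tuple[str, str], str] = {("desmopressin", "drug"): "AUTH:0001"}

calls_to_domain_service: List[Tuple[str, str]] = []


def call_resolve_authority(mention: str, entity_type: str) -> Optional[str]:
    # Stand-in for the HTTP request to /resolve-authority; records every real call.
    calls_to_domain_service.append((mention, entity_type))
    return FAKE_AUTHORITY.get((mention, entity_type))


@lru_cache(maxsize=10_000)
def resolve_authority_cached(mention: str, entity_type: str) -> Optional[str]:
    # The cache key is the argument tuple (mention, entity_type).
    # A None result is cached too, so unresolvable mentions are not retried.
    return call_resolve_authority(mention, entity_type)


for _ in range(3):
    result = resolve_authority_cached("desmopressin", "drug")

info = resolve_authority_cached.cache_info()
```

After the loop, `info.hits` and `info.misses` show how many calls were served from the cache, which is a cheap way to measure the hit rate during a run.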
The hit rate for this cache is high in practice. Entity mentions are not uniformly distributed across a corpus -- a paper about Cushing's disease will mention ACTH, cortisol, and desmopressin many times, and these same entities will appear in many other papers about the same disease. The most-mentioned entities are exactly the ones that benefit most from caching.
Cache size is configurable. For a corpus run that processes all papers sequentially in a single worker, an LRU size of 10,000 entries is sufficient to capture the hot set of entities in most domains. For parallel workers, each worker maintains its own cache; there is no shared cache between workers, which avoids coordination overhead at the cost of some redundant API calls at the start of each worker's run.
Long-TTL cache in the domain service¶
The domain service caches external authority API responses with a long TTL -- hours or days, or for the duration of a batch run. Authority records are stable: a MeSH term's canonical ID and synonyms do not change between the start and end of a corpus ingestion run. There is no value in fetching the same authority record twice.
The domain service uses Redis for this cache. Redis provides TTL-based expiration and handles cache persistence across domain service restarts. If the domain service is restarted mid-run, the cache survives the restart and the run can continue without re-fetching all previously resolved authority records.
This is the most important cache in the system. External authority API calls are the bottleneck for resolution performance. The domain service cache eliminates them after the first call.
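The read-through pattern behind this cache can be sketched without a Redis server. The `TTLCache` class below is an in-memory stand-in written for illustration, and `fetch_authority_record` and the injected clock are hypothetical names; unlike Redis, this stand-in would not survive a process restart.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a TTL-expiring key-value store."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired entries are dropped lazily on read
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def fetch_authority_record(cache: TTLCache, key: str, fetch: Callable[[str], str]) -> str:
    """Read-through: only call the external authority when the cache has no live entry."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = fetch(key)
    cache.set(key, value)
    return value


now = [0.0]
fetches = []


def fake_fetch(key: str) -> str:
    # Stand-in for an external authority API call.
    fetches.append(key)
    return f"record for {key}"


cache = TTLCache(ttl_seconds=3600, clock=lambda: now[0])
fetch_authority_record(cache, "mesh:example", fake_fetch)
fetch_authority_record(cache, "mesh:example", fake_fetch)
```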
Co-location¶
The identity server, domain service, Postgres, and Redis run in the same docker-compose network. HTTP calls between them traverse a virtual network interface; latency is sub-millisecond. The caching strategy is designed for this topology -- it assumes that cache misses are cheap (a local network call to Redis or Postgres) and that the expensive operations (external authority APIs) are eliminated by caching.
If the identity server and domain service are deployed in separate network regions, the latency assumptions change. The caching strategy remains correct but the per-call cost of cache misses increases. Co-location is a deployment requirement for the performance characteristics described here, not just a convenience.
Chapter 7: Entity Lifecycle¶
Three Statuses¶
Every entity in the identity server has one of three statuses:
Provisional: The entity was created from an extracted mention that did not resolve to a known authority. It has a provisional ID (a UUID generated by the identity server) and participates fully in the graph -- relationships can reference it, evidence accumulates against it -- but it is flagged as unanchored.
Canonical: The entity is anchored to an external authority. It has a canonical ID derived from the authority (e.g., RxNorm:3251). It is the authoritative node for all surface forms that resolve to it.
Merged: The entity was determined to be a duplicate of another entity and absorbed into the survivor. Merged entities retain their history -- their provisional ID, their evidence records, their mention strings -- but they are no longer active graph nodes. All relationships that referenced them are transparently redirected to the survivor.
Promotion¶
A provisional entity accumulates evidence as more papers are processed. When the
evidence crosses a promotion threshold -- configurable per entity type and per
domain -- the identity server triggers a promotion attempt. The domain service's
/resolve-authority endpoint is called with the entity's most common surface
form. If a match is found, the entity is promoted to canonical and assigned the
authority ID. If no match is found, the entity remains provisional.
Promotion is not a one-time event. A provisional entity that fails to promote early in the corpus run may promote later when additional surface forms have accumulated and the fuzzy or embedding match finds a better candidate. The identity server tracks failed promotion attempts and retries them at configurable intervals.
The promotion threshold is a quality dial. A low threshold promotes entities quickly but risks premature promotion to incorrect authorities. A high threshold keeps entities provisional longer but ensures that promotion decisions are well-evidenced. The medlit reference implementation uses a threshold of three independent papers for biological process entities and five for drug entities, reflecting the relative ease of resolving each type.
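A promotion attempt might look like the sketch below. The thresholds are the ones quoted above; the `ProvisionalEntity` fields, the `try_promote` helper, and counting evidence as the number of distinct papers are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set

# Thresholds from the medlit reference implementation, in independent papers.
PROMOTION_THRESHOLDS: Dict[str, int] = {"biological_process": 3, "drug": 5}


@dataclass
class ProvisionalEntity:
    # Hypothetical entity shape for illustration.
    entity_type: str
    surface_forms: Counter = field(default_factory=Counter)
    paper_ids: Set[str] = field(default_factory=set)
    status: str = "provisional"
    canonical_id: Optional[str] = None
    failed_attempts: int = 0

    def add_mention(self, surface_form: str, paper_id: str) -> None:
        self.surface_forms[surface_form] += 1
        self.paper_ids.add(paper_id)  # a set, so repeat mentions in one paper count once


def try_promote(
    entity: ProvisionalEntity,
    resolve_authority: Callable[[str, str], Optional[str]],
) -> bool:
    """Attempt promotion; return True only if the entity became canonical."""
    if entity.status != "provisional":
        return False
    if len(entity.paper_ids) < PROMOTION_THRESHOLDS[entity.entity_type]:
        return False  # not enough evidence yet
    most_common_form = entity.surface_forms.most_common(1)[0][0]
    authority_id = resolve_authority(most_common_form, entity.entity_type)
    if authority_id is None:
        entity.failed_attempts += 1  # stays provisional; may be retried later
        return False
    entity.status = "canonical"
    entity.canonical_id = authority_id
    return True
```

Tracking `failed_attempts` on the entity is one way to support the retry behavior described above; the scheduling of retries is left out of the sketch.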
Merging¶
Merging occurs when two entities are determined to be the same thing. This can happen in several ways: two provisional entities resolve to the same authority ID through promotion, two entities have surface forms that exceed the fuzzy similarity threshold, or two entities have embedding vectors within the cosine distance threshold.
The merge operation calls the domain service's /select-survivor endpoint to
determine which entity becomes the survivor. All relationships that referenced
the non-survivor are updated to reference the survivor. The non-survivor's
status is set to merged, and its history -- including the fact that it was merged
and when -- is preserved in the merge log.
Merge is idempotent. If two entities have already been merged and the identity server encounters evidence that they should be merged again (because a new paper uses a surface form that triggers the same merge condition), the operation returns the existing merge result without creating a new merge record.
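The idempotency property can be illustrated with a small in-memory merge log. The `MergeLog` class and its method names are hypothetical, and a real implementation would persist this state in Postgres under a lock; the sketch only shows why a repeated merge produces no second record.

```python
from typing import Callable, Dict, List, Tuple


class MergeLog:
    """Tracks which entities were absorbed into which survivors."""

    def __init__(self) -> None:
        self.redirects: Dict[str, str] = {}        # merged id -> survivor id
        self.records: List[Tuple[str, str]] = []   # (non_survivor, survivor), in order

    def resolve(self, entity_id: str) -> str:
        # Follow redirects until reaching an entity that has not been merged away.
        while entity_id in self.redirects:
            entity_id = self.redirects[entity_id]
        return entity_id

    def merge(self, a: str, b: str, select_survivor: Callable[[str, str], str]) -> str:
        a, b = self.resolve(a), self.resolve(b)
        if a == b:
            return a  # already merged: return the existing result, write nothing
        survivor = select_survivor(a, b)
        non_survivor = b if survivor == a else a
        self.redirects[non_survivor] = survivor
        self.records.append((non_survivor, survivor))
        return survivor

    def redirect_edges(self, edges: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
        # Rewrite (subject, predicate, object) triples to point at survivors.
        return [(self.resolve(s), p, self.resolve(o)) for s, p, o in edges]


def prefer_first(a: str, b: str) -> str:
    # Stand-in for the domain service's survivor selection.
    return a


log = MergeLog()
first_result = log.merge("p1", "p2", prefer_first)
second_result = log.merge("p2", "p1", prefer_first)  # same pair again, arguments swapped
```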
When the ontology changes¶
The domain spec is not a one-time artifact. The epistemic commons itself evolves -- MeSH terms are deprecated, renamed, or restructured; research communities develop new consensus on how to categorize things; new predicates become necessary as the domain matures. When the ontology changes, the graph must have a principled response. That response should follow the same rule that governs everything else in a typed graph: make the state visible, not silent.
Deprecated predicates. When a predicate is retired from the domain spec,
edges that used it do not disappear. They are flagged -- either at schema reload
time or on the next linter pass -- as carrying a deprecated predicate. The flag
is a structured attribute on the edge, not a deletion. Provenance is never
retroactively erased. The linter emits DEPRECATED_PREDICATE violations for
these edges, routing them to a review queue. Actual removal is a deliberate,
auditable operation, not an automatic consequence of the schema change.
Tightened domain or range constraints. Suppose the treats predicate
originally allowed BIOLOGICAL_PROCESS as an object, and a schema revision
restricts it to DISEASE only. Existing edges with BIOLOGICAL_PROCESS objects
are now constraint violations -- but they were valid when written. The schema
version recorded at ingest time (see below) is what allows
the linter to distinguish "was valid under the schema in force at the time of
extraction" from "is valid under the current schema." That distinction matters
for prioritizing remediation: edges that were malformed at extraction time are
errors; edges that became malformed due to a schema revision are migration items.
Predicate renaming or splitting. A predicate that is renamed or split into two more specific predicates is handled as deprecate-old plus introduce-new. Edges carrying the old predicate are flagged as deprecated. A migration script -- not the identity server -- moves them to the new predicate with a provenance note recording the transformation. The migration script is an explicit, reviewable artifact; the transformation is logged in the merge history alongside entity merges and promotions.
Versioned schemas. The domain service's GET /schema response should carry
a version field -- a semantic version or a content hash. The identity
server records which schema version was active when each edge was ingested. This
makes the migration state auditable: you can query "show me all edges ingested
under schema version 2.1 that fail validation under schema version 2.3" and get
a concrete work list.
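The work-list query and the error versus migration-item distinction can be sketched over in-memory data. The `SCHEMAS` table, the `Edge` fields, and the function names are hypothetical; the two schema versions reuse the `treats` example, where a revision narrows the range to DISEASE only.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical store: version -> predicate -> (domain, range)
SCHEMAS: Dict[str, Dict[str, Tuple[FrozenSet[str], FrozenSet[str]]]] = {
    "2.1": {"treats": (frozenset({"drug"}), frozenset({"disease", "biological_process"}))},
    "2.3": {"treats": (frozenset({"drug"}), frozenset({"disease"}))},
}


@dataclass(frozen=True)
class Edge:
    subject_type: str
    predicate: str
    object_type: str
    schema_version: str  # the version in force when the edge was ingested


def is_valid(edge: Edge, version: str) -> bool:
    spec = SCHEMAS[version].get(edge.predicate)
    if spec is None:
        return False  # predicate does not exist in that version
    domain, rng = spec
    return edge.subject_type in domain and edge.object_type in rng


def classify(edge: Edge, current_version: str) -> str:
    """'ok', 'migration_item' (valid when written), or 'error' (never valid)."""
    if is_valid(edge, current_version):
        return "ok"
    if is_valid(edge, edge.schema_version):
        return "migration_item"
    return "error"


def failing_edges(edges: List[Edge], ingested_under: str, current_version: str) -> List[Edge]:
    # "All edges ingested under version X that fail validation under version Y."
    return [
        e for e in edges
        if e.schema_version == ingested_under and not is_valid(e, current_version)
    ]


edges = [
    Edge("drug", "treats", "biological_process", "2.1"),  # valid in 2.1, not in 2.3
    Edge("gene", "treats", "disease", "2.1"),             # never valid
    Edge("drug", "treats", "disease", "2.1"),             # valid in both
]
work_list = failing_edges(edges, "2.1", "2.3")
```

Classifying each entry of the work list then separates edges that need migration from edges that were extraction errors all along.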
The philosophical point follows directly from Chapter 12: the graph's response to ontology evolution is the same as its response to any other form of conflict or structural tension -- surface it, record it, and resolve it deliberately. An ontology change that silently invalidates existing edges is a hidden violation of the provenance guarantee. An ontology change that makes violations visible and auditable is just another form of graph linting.