Chapter 4: What a Typed Graph Is

Beyond the Triple

The foundational unit of the Semantic Web is the RDF triple: (subject, predicate, object). In its purest form, an untyped graph is a collection of these triples in which any node can be a subject or object and any IRI can serve as a predicate. This flexibility was a design goal for the "Web of Data," but it is a liability for an engineering artifact. In an untyped graph, you can assert that a drug "inhibits" a city, or that a gene "is_prescribed_for" a protein. The system has no grounds to object; it merely records the triple.

A typed graph gives up this unbounded flexibility in exchange for structural guarantees. It declares a finite set of entity types (e.g., DRUG, GENE, DISEASE) and a finite vocabulary of predicates. Crucially, every predicate in a typed graph carries a domain and a range: the sets of entity types that may appear as its subject and as its object, respectively. A predicate like inhibits might have the domain {DRUG, GENE} and the range {GENE, BIOLOGICAL_PROCESS}. An edge that violates these constraints is not a "bad fact"; it is a structural failure, as meaningless as a syntax error in a compiled language.
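
The domain and range check described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the reference implementation: the `Predicate` dataclass and the `edge_is_well_typed` helper are hypothetical names introduced here, and the type names follow the example in the text.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(str, Enum):
    DRUG = "drug"
    GENE = "gene"
    DISEASE = "disease"
    BIOLOGICAL_PROCESS = "biological_process"


@dataclass(frozen=True)
class Predicate:
    """Hypothetical stand-in for a schema entry: a name plus domain and range."""
    name: str
    domain: frozenset
    range: frozenset


INHIBITS = Predicate(
    name="inhibits",
    domain=frozenset({EntityType.DRUG, EntityType.GENE}),
    range=frozenset({EntityType.GENE, EntityType.BIOLOGICAL_PROCESS}),
)


def edge_is_well_typed(subject_type, predicate, object_type):
    """Return True only if both endpoints satisfy the predicate's constraints."""
    return subject_type in predicate.domain and object_type in predicate.range


print(edge_is_well_typed(EntityType.DRUG, INHIBITS, EntityType.GENE))     # True
print(edge_is_well_typed(EntityType.DISEASE, INHIBITS, EntityType.GENE))  # False
```

The second call fails because DISEASE is not in the domain of inhibits. Whether the disease "really" inhibits the gene never comes up; the edge is rejected on structure alone.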

The Ontology as Contract

In a typed graph, the ontology is not documentation; it is a machine-checkable contract that governs every edge. This distinction is foundational. Documentation is aspirational—it describes how the data should look. A contract is enforceable—it defines what the data is permitted to look like.

When a graph is governed by a contract, the software that interacts with it can make strong assumptions. A query optimizer knows exactly which entity types it will encounter after traversing a specific predicate. A visualization tool knows which icons to use for nodes based on their declared type. Most importantly, an ingestion pipeline can reject malformed extractions before they ever reach the database. By moving constraints from the application layer into the graph's own structure, we ensure that the graph's integrity is an architectural property rather than a convention that must be remembered by every developer.
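
To illustrate the kind of assumption downstream software can make, the sketch below computes which entity types a traversal can possibly yield, using nothing but the schema. The `SCHEMA` dictionary, the `associated_with` predicate, and the `types_after` function are assumptions made for this example, not part of the reference implementation.

```python
from enum import Enum


class EntityType(str, Enum):
    DRUG = "drug"
    GENE = "gene"
    DISEASE = "disease"


# Hypothetical schema: predicate name -> (domain, range).
SCHEMA = {
    "inhibits": (frozenset({EntityType.DRUG}), frozenset({EntityType.GENE})),
    "associated_with": (frozenset({EntityType.GENE}), frozenset({EntityType.DISEASE})),
}


def types_after(start_types, path):
    """Entity types that may be reached by following `path` from `start_types`."""
    current = frozenset(start_types)
    for name in path:
        domain, range_ = SCHEMA[name]
        # If no current type is a legal subject, the traversal cannot match anything.
        current = range_ if current & domain else frozenset()
    return current


# A planner can tell, without touching the data, that this path ends on diseases.
print(types_after({EntityType.DRUG}, ["inhibits", "associated_with"]))
# It can also tell that this path is empty by construction.
print(types_after({EntityType.DISEASE}, ["inhibits"]))
```

The result is an upper bound: it lists the types that may appear, not the types that actually do in a given dataset.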

PredicateSpec and EntityType

To make these constraints concrete, we represent the ontology as a Domain Spec. In the reference implementation, this is defined using Pydantic models and Python enums.

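from enum import Enum
from typing import FrozenSet, Optional

from pydantic import BaseModel


class EntityType(str, Enum):
    """Closed set of node types the graph may contain."""
    DRUG = "drug"
    GENE = "gene"
    DISEASE = "disease"
    PROCESS = "biological_process"


class PredicateSpec(BaseModel):
    """Schema entry for one predicate in the domain spec."""
    name: str
    domain: FrozenSet[EntityType]      # entity types allowed as the subject
    range: FrozenSet[EntityType]       # entity types allowed as the object
    description: str
    is_functional: bool = False        # at most one outgoing edge per subject
    negation_of: Optional[str] = None  # name of the predicate this one negates, if any

    class Config:
        # Pydantic v1-style config; Pydantic v2 prefers
        # model_config = ConfigDict(frozen=True).
        frozen = True  # instances are immutable and hashable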

The EntityType enum defines the closed world of things that can exist. The PredicateSpec carries the rules for their interaction. The domain and range are sets, allowing a predicate to bridge multiple type pairs (e.g., a DRUG can inhibit a GENE, but a GENE can also inhibit another GENE). Because the two sets are checked independently, every combination of a domain type with a range type is permitted; this representation cannot exclude one particular pairing while allowing the others. The is_functional flag indicates that a subject can have at most one such outgoing edge, a structural way to represent unique properties.
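
The is_functional rule lends itself to a mechanical check. The sketch below is one possible way to find violations in a batch of edges; the function name, the `has_primary_target` predicate, and the identifiers are all hypothetical, and edges are modeled as plain (subject, predicate, object) tuples for brevity.

```python
from collections import defaultdict


def functional_violations(edges, functional_predicates):
    """Return (subject, predicate) pairs that point at more than one object."""
    targets = defaultdict(set)
    for subj, pred, obj in edges:
        if pred in functional_predicates:
            targets[(subj, pred)].add(obj)
    return sorted(key for key, objs in targets.items() if len(objs) > 1)


# Placeholder identifiers, for illustration only.
edges = [
    ("drug:1", "has_primary_target", "gene:1"),
    ("drug:1", "has_primary_target", "gene:2"),  # second target: violation
    ("drug:2", "has_primary_target", "gene:1"),
    ("drug:1", "inhibits", "gene:1"),
    ("drug:1", "inhibits", "gene:2"),            # fine: inhibits is not functional
]

print(functional_violations(edges, {"has_primary_target"}))
# [('drug:1', 'has_primary_target')]
```

Collecting objects in a set means a repeated identical edge is not reported; only a subject with two distinct objects counts as a violation.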

Where the Ontology Comes From

The engineer does not invent this schema from first principles. Instead, the ontology is derived from the epistemic commons. The biomedical community has already done the hard work of defining what these types and relationships are.

MeSH's category hierarchy provides the implicit entity types. RxNorm, together with the drug-class relationships the National Library of Medicine publishes alongside it (such as MED-RT's may_treat), provides the drug-disease predicates. HGNC's gene-protein associations define the domain and range constraints. The typed graph schema simply makes these implicit structures explicit and computable. By deriving the ontology from the same authorities used for identity resolution, we ensure that the graph's structure is aligned with the community's own knowledge. If the National Library of Medicine says that a drug treats a disease, the treats predicate in our schema will have a domain of DRUG and a range of DISEASE.
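
As a rough illustration of making an implicit structure computable, the sketch below maps a MeSH tree number to an entity type by its leading category letter. MeSH tree numbers do begin with such a letter (C for Diseases, D for Chemicals and Drugs), but collapsing an entire category into a single entity type is a simplifying assumption of this example, and `CATEGORY_TO_TYPE` and `entity_type_for` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class EntityType(str, Enum):
    DRUG = "drug"
    DISEASE = "disease"


# Assumption: one entity type per top-level MeSH category.
# A real mapping would likely need to look deeper into the tree.
CATEGORY_TO_TYPE = {
    "C": EntityType.DISEASE,  # MeSH category C: Diseases
    "D": EntityType.DRUG,     # MeSH category D: Chemicals and Drugs
}


def entity_type_for(tree_number: str) -> Optional[EntityType]:
    """Map a MeSH tree number to an entity type, or None if it is out of scope."""
    return CATEGORY_TO_TYPE.get(tree_number[:1].upper())


print(entity_type_for("C04"))  # EntityType.DISEASE
print(entity_type_for("D02"))  # EntityType.DRUG
print(entity_type_for("A01"))  # None: category A is not modeled in this sketch
```

Returning None for unmapped categories keeps the type set closed: a term outside the declared types is simply not admitted.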

Finite vs. Open-World

The typed graph is a closed-world artifact, and the closure applies first to its vocabulary. This is the key difference from RDF/OWL, which operate under the open-world assumption: there, the absence of a statement means its truth is unknown, and a new predicate can be introduced at any time. In a closed-world typed graph, predicates outside the schema do not exist. If upregulates is not in the domain spec, it cannot be asserted.
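
A closed vocabulary amounts to a membership test at the point of assertion. The following sketch assumes a hypothetical `DECLARED_PREDICATES` set and a hypothetical `UnknownPredicateError`; the predicate names are taken from the examples in this chapter.

```python
class UnknownPredicateError(KeyError):
    """Raised when an edge uses a predicate the domain spec does not declare."""


# Hypothetical vocabulary; a real one would be built from the domain spec.
DECLARED_PREDICATES = frozenset({"inhibits", "treats"})


def require_declared(predicate: str) -> str:
    """Return the predicate unchanged, or raise if it is outside the vocabulary."""
    if predicate not in DECLARED_PREDICATES:
        raise UnknownPredicateError(predicate)
    return predicate


require_declared("treats")  # accepted

try:
    require_declared("upregulates")
except UnknownPredicateError as exc:
    print(f"rejected: {exc}")
```

An open-world store would accept the second call and record a new predicate; here it fails before anything is written.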

This limitation is the source of the graph's practical power. By bounding the vocabulary, we make the graph's contents predictable and searchable. We move from a "bag of triples" to a structured knowledge base that can be linted, validated, and queried against a known schema. The typed graph does not try to represent everything; it tries to represent its specific domain well.
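
Because the vocabulary is finite, the schema itself can be linted before any data arrives. The sketch below shows two checks one might run; it models each predicate as a plain dictionary instead of a PredicateSpec to stay dependency-free, assumes that negation_of holds the name of another declared predicate, and uses hypothetical predicate names such as `does_not_inhibit`.

```python
def lint_schema(schema):
    """Return a list of human-readable problems found in the schema."""
    problems = []
    for name, spec in schema.items():
        if not spec["domain"]:
            problems.append(f"{name}: empty domain")
        if not spec["range"]:
            problems.append(f"{name}: empty range")
        target = spec.get("negation_of")
        if target is not None and target not in schema:
            problems.append(
                f"{name}: negation_of refers to undeclared predicate {target!r}"
            )
    return problems


# Hypothetical schema with one dangling negation_of reference.
schema = {
    "inhibits": {"domain": {"drug", "gene"}, "range": {"gene"}},
    "does_not_inhibit": {
        "domain": {"drug", "gene"},
        "range": {"gene"},
        "negation_of": "inhibits",
    },
    "does_not_treat": {
        "domain": {"drug"},
        "range": {"disease"},
        "negation_of": "treats",  # "treats" is never declared
    },
}

for problem in lint_schema(schema):
    print(problem)
# does_not_treat: negation_of refers to undeclared predicate 'treats'
```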