Part IV: Trustworthiness¶
Chapter 10: Provenance as Architecture¶
Provenance Is Not Optional¶
In high-stakes domains -- medicine, law, materials safety -- every claim in the knowledge graph must be traceable to its source. This is not a feature. It is a constraint. A graph that cannot answer "where did this claim come from?" is not suitable for use in these domains regardless of how sophisticated its extraction pipeline is.
Provenance must be architectural, not retrofitted. Adding provenance to an existing graph requires touching every relationship record. The data model for provenance affects the schema, the extraction output format, the ingest stage, the confidence aggregation logic, and the query interface. Getting it right at the start costs little. Getting it wrong costs a full re-extraction.
What Provenance Records¶
A complete provenance record contains:
- paper_id: Which paper made this claim
- section_type: Where in the paper (abstract, introduction, methods, results, discussion, conclusion)
- paragraph_idx: Exact paragraph within the section
- extraction_method: How the claim was extracted (LLM model and version, prompt version)
- confidence: Confidence in this specific piece of evidence
- study_type: The study design (RCT, meta-analysis, cohort, case report, etc.)
The section type is meaningful for evidence quality: a claim stated in the results section carries more weight than the same claim in the discussion, where it may be speculative. The paragraph index enables a reader to find the exact sentence in the paper that supports the claim -- essential for human verification.
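The fields above map naturally onto a small immutable record. The following is a minimal sketch, not the schema of any particular implementation: the names `ProvenanceRecord` and `SectionType`, the assumption that confidence lies in [0, 1], and all example values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class SectionType(Enum):
    ABSTRACT = "abstract"
    INTRODUCTION = "introduction"
    METHODS = "methods"
    RESULTS = "results"
    DISCUSSION = "discussion"
    CONCLUSION = "conclusion"


@dataclass(frozen=True)
class ProvenanceRecord:
    """One piece of evidence for one claim (hypothetical field layout)."""
    paper_id: str
    section_type: SectionType
    paragraph_idx: int
    extraction_method: str  # e.g. model name and version plus prompt version
    confidence: float       # assumed here to be a score in [0, 1]
    study_type: str

    def __post_init__(self) -> None:
        # Reject records that could not be traced or weighed later.
        if not self.paper_id or not self.extraction_method:
            raise ValueError("paper_id and extraction_method are required")
        if self.paragraph_idx < 0:
            raise ValueError("paragraph_idx must be non-negative")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


record = ProvenanceRecord(
    paper_id="paper-0001",                         # placeholder identifier
    section_type=SectionType.RESULTS,
    paragraph_idx=3,
    extraction_method="example-llm-v1/prompt-v2",  # placeholder value
    confidence=0.9,
    study_type="RCT",
)
```

Freezing the record is a design choice in this sketch: evidence, once recorded, is appended to rather than edited in place.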
Multi-Source Claims¶
A claim that appears in multiple papers is stronger than a claim that appears in one. The identity server aggregates evidence across sources as part of its normal operation: when the same relationship is extracted from multiple papers and both subject and object entities resolve to the same canonical IDs, the identity server records a multi-source claim with a composite confidence.
The composite confidence is computed by the domain service. In the medlit reference implementation, it is a weighted mean of the individual confidence scores, where weights are determined by study type. A claim supported by two RCTs and one cohort study has a composite confidence higher than the same claim supported by two case reports.
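A weighted mean of this kind can be sketched in a few lines. The weights below are invented for illustration and are not the values used by the medlit reference implementation; the function name `composite_confidence` is likewise an assumption.

```python
# Hypothetical study-type weights; real values belong to the domain service.
STUDY_TYPE_WEIGHTS = {
    "meta-analysis": 1.0,
    "RCT": 0.9,
    "cohort": 0.6,
    "case report": 0.3,
}


def composite_confidence(evidence):
    """Weighted mean over (confidence, study_type) pairs.

    Each piece of evidence is weighted by the weight of its study type.
    """
    if not evidence:
        raise ValueError("at least one piece of evidence is required")
    total_weight = sum(STUDY_TYPE_WEIGHTS[study] for _, study in evidence)
    weighted_sum = sum(conf * STUDY_TYPE_WEIGHTS[study] for conf, study in evidence)
    return weighted_sum / total_weight


# Two RCTs and one cohort study versus two case reports (made-up scores).
strong = composite_confidence([(0.9, "RCT"), (0.85, "RCT"), (0.7, "cohort")])
weak = composite_confidence([(0.6, "case report"), (0.5, "case report")])
```

One property worth keeping in mind: a weighted mean never exceeds the largest individual score, so in this sketch the weights pull the composite toward the better-designed sources rather than rewarding the sheer number of sources.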
Replication is a signal of robustness, not a guarantee of correctness. The identity server records replication faithfully; the interpretation of that replication is a human judgment informed by the provenance records.
Chapter 11: Making Bad Ideas Inexpressible¶
Hilbert's Dream¶
In the early 20th century, David Hilbert proposed a grand program for mathematics: to find a formal system in which every true statement could be proved and no contradiction could ever be derived. Read loosely, he wanted a system where bad mathematics was ruled out by construction, not merely discouraged. In 1931, Kurt Gödel proved that this is impossible for any consistent formal system expressive enough to include arithmetic.
However, for a domain-constrained knowledge graph, we can actually achieve a version of Hilbert's dream. We are not trying to represent all of human thought; we are trying to represent a finite set of biomedical or legal claims. By using a typed schema, we can build a boundary that structurally refuses to hold certain classes of "bad ideas."
What Becomes Inexpressible¶
The typed schema operates at four distinct layers to make unwarranted or malformed assertions inexpressible:
The Type Layer: Edges where the subject or object violates the predicate's domain or range are structurally rejected. You cannot state that "Aspirin (DRUG) inhibits New York (LOCATION)" if the inhibits predicate only accepts GENE or PROCESS as its object. Predicates outside the finite vocabulary are simply not available for use.
The Identity Layer: Assertions involving unresolvable or "provisional" entities that fail to meet minimum evidence thresholds are sequestered. A claim about a thing that cannot be named and anchored to an authority is an incomplete idea; the system can choose to refuse these until identity is established.
The Provenance Layer: The schema can require that every edge carries a pointer to its warrant. An assertion without a source, or a claim whose extraction method is undeclared, is not a fact in the system's eyes—it is a malformed record. The "undifferentiated provenance bag" common in untyped systems is structurally replaced by mandatory, typed provenance records.
The Consistency Layer: Contradictory assertions—such as a functional predicate having two different values, or both a predicate and its negation_of pair coexisting—are not permitted to "just sit there." They must either be resolved or wrapped in a conflict record that acknowledges the dispute. The "bad idea" of unacknowledged contradiction becomes inexpressible.
The Functional Programming Analogy¶
This approach mirrors the "make illegal states unrepresentable" mantra of functional programming in languages like Haskell or Rust. In those languages, you don't write runtime checks for null values if your type system can guarantee a value is always present. You move the invariant into the type system.
A typed knowledge graph applies this principle to assertions. We move the "rules of evidence" into the graph's structure. If the schema compiles and the linter passes, we know that certain classes of errors—type mismatches, missing sources, unanchored identities—simply cannot exist in the data.
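Python cannot enforce these invariants at compile time the way Haskell or Rust can, but the same idea can be approximated by making the constructor the only way to obtain an edge and having it refuse malformed input. This is a sketch under stated assumptions: the `Entity` and `Edge` classes, the `PREDICATES` table, and the choice of `DRUG` as the subject type for `inhibits` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical vocabulary: predicate -> (allowed subject types, allowed object types).
PREDICATES = {"inhibits": ({"DRUG"}, {"GENE", "PROCESS"})}


@dataclass(frozen=True)
class Entity:
    canonical_id: str
    entity_type: str


@dataclass(frozen=True)
class Edge:
    subject: Entity
    predicate: str
    obj: Entity
    sources: Tuple[str, ...]  # pointers to provenance records

    def __post_init__(self) -> None:
        # An Edge that violates the schema never comes into existence.
        if self.predicate not in PREDICATES:
            raise ValueError(f"unknown predicate: {self.predicate}")
        allowed_subjects, allowed_objects = PREDICATES[self.predicate]
        if self.subject.entity_type not in allowed_subjects:
            raise TypeError(f"{self.predicate} cannot have subject type {self.subject.entity_type}")
        if self.obj.entity_type not in allowed_objects:
            raise TypeError(f"{self.predicate} cannot have object type {self.obj.entity_type}")
        if not self.sources:
            raise ValueError("an edge must cite at least one source")


aspirin = Entity("drug:aspirin", "DRUG")        # placeholder IDs
gene_x = Entity("gene:x", "GENE")
new_york = Entity("place:new-york", "LOCATION")

ok_edge = Edge(aspirin, "inhibits", gene_x, ("paper-0001",))
```

Constructing `Edge(aspirin, "inhibits", new_york, ("paper-0001",))` raises `TypeError`, so code downstream of the constructor never has to check for that case.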
What This Requires of the Ontology¶
This level of enforcement is only as good as the ontology it enforces. To make bad ideas inexpressible, the ontology must be rich:
- Functional predicates must be declared (e.g., has_official_symbol).
- Negation pairs must be identified (e.g., activates vs. inhibits).
- Provenance completeness rules must be defined (e.g., "every extraction must have a confidence score").
The Domain Spec becomes the constitution of the graph. If it is weak, the graph is noisy. If it is rigorous, the graph becomes a high-fidelity instrument for reasoning.
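To make the requirements above concrete, here is one way such declarations might look as data. The `PredicateSpec` layout, the entity type names, and the `spec_problems` helper are assumptions made for illustration; an actual Domain Spec may be organized quite differently.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class PredicateSpec:
    name: str
    domain: FrozenSet[str]   # allowed subject entity types
    range_: FrozenSet[str]   # allowed object entity types
    functional: bool = False
    negation_of: Optional[str] = None
    required_provenance: FrozenSet[str] = frozenset(
        {"paper_id", "extraction_method", "confidence"}
    )


def make_spec(*predicates: PredicateSpec) -> Dict[str, PredicateSpec]:
    return {p.name: p for p in predicates}


def spec_problems(spec: Dict[str, PredicateSpec]) -> List[str]:
    """Check that every negation pair is declared from both sides."""
    problems = []
    for p in spec.values():
        if p.negation_of is None:
            continue
        other = spec.get(p.negation_of)
        if other is None:
            problems.append(f"{p.name}: negation_of names unknown predicate {p.negation_of}")
        elif other.negation_of != p.name:
            problems.append(f"{p.name}: negation pair with {other.name} is not reciprocal")
    return problems


DRUG = frozenset({"DRUG"})
TARGET = frozenset({"GENE", "PROCESS"})

DOMAIN_SPEC = make_spec(
    PredicateSpec("activates", DRUG, TARGET, negation_of="inhibits"),
    PredicateSpec("inhibits", DRUG, TARGET, negation_of="activates"),
    PredicateSpec("has_official_symbol", frozenset({"GENE"}), frozenset({"SYMBOL"}),
                  functional=True),
)
```

Validating the spec itself, as `spec_problems` does, is a cheap way to catch a weak constitution before any data is checked against it.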
The Limits: Gödel's Revenge¶
We must be honest about the boundary: the typed graph enforces structural well-formedness, not semantic correctness. A well-typed, well-sourced edge can still be factually wrong. An LLM might extract "Aspirin treats Cancer" and correctly link it to RxNorm and MeSH. The linter will pass it because the types match and the provenance is present.
This is not a defect; it is a feature. It separates the structural integrity of the knowledge base from the truth value of the claims it contains. We can guarantee the former; for the latter, we provide the provenance so a human (or a more sophisticated agent) can decide.
Chapter 12: The Graph Linter¶
Linting as Explicit Epistemics¶
In the Unix philosophy, tools are small, focused, and composable. The ingestion pipeline can enforce schema constraints at insertion time, but there is a complementary tool worth building: a standalone linter that audits the graph independently of the insertion path. Call it kglint.
The idea is straightforward. A code linter checks source files for structural violations that the compiler might not catch -- style problems, unused variables, suspicious patterns. A graph linter does the same for knowledge claims. It doesn't just check for broken data; it checks for broken epistemics. It asks, of every edge in the graph: is this claim well-formed? Is it sourced? Does it contradict something else without acknowledgment?
This separation -- enforcement at insertion, auditing after the fact -- reflects the Unix philosophy. The insertion path needs to be fast. A linter can be thorough. Running them as distinct tools makes both easier to reason about and test.
What a Graph Linter Checks¶
The key design insight is that a graph linter should not have hardcoded rules. Its rule set should be derived entirely from the Domain Spec at runtime. Every predicate in the spec has a domain, a range, a provenance requirement, and optionally a functional flag or a negation pair. The linter reads the spec and generates its checks from those declarations.
The checks fall into four categories, which correspond to the type, provenance, and consistency layers described in Chapter 11:
- Vocabulary violations: predicates in the graph not defined in the schema
- Domain/range violations: edges where subject or object entity type violates the predicate's declared constraints
- Provenance gaps: edges without a valid provenance record, or with provenance that does not declare an extraction method
- Unacknowledged contradictions: functional predicates with multiple values for the same subject, or edges whose predicates are declared negation pairs and both exist between the same entity pair
Adding a new predicate to the Domain Spec automatically adds lint coverage for it. As long as the schema evolves by adding declarations of these existing kinds, the linter itself needs no new rules.
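The four categories of check can be sketched as a single pass over the edges, with every rule read from the spec rather than written into the linter. This is an illustration only: the dictionary shapes for the spec and for edges, the `edge` helper, and the violation names other than `DOMAIN_RANGE_MISMATCH` are assumptions.

```python
from collections import defaultdict

# Hypothetical spec; the linter below contains no predicate-specific code.
SPEC = {
    "inhibits": {"domain": {"DRUG"}, "range": {"GENE", "PROCESS"}, "negation_of": "activates"},
    "activates": {"domain": {"DRUG"}, "range": {"GENE", "PROCESS"}, "negation_of": "inhibits"},
    "has_official_symbol": {"domain": {"GENE"}, "range": {"SYMBOL"}, "functional": True},
}


def lint(edges, spec):
    """Return a list of (violation_type, edge_id) pairs."""
    violations = []
    values = defaultdict(set)     # (subject, predicate) -> objects seen so far
    asserted = defaultdict(list)  # (subject, object) -> predicates seen so far
    for e in edges:
        rule = spec.get(e["predicate"])
        if rule is None:
            violations.append(("UNKNOWN_PREDICATE", e["id"]))
            continue
        if e["subject_type"] not in rule["domain"] or e["object_type"] not in rule["range"]:
            violations.append(("DOMAIN_RANGE_MISMATCH", e["id"]))
        provenance = e.get("provenance") or {}
        if not provenance.get("extraction_method"):
            violations.append(("PROVENANCE_GAP", e["id"]))
        if rule.get("functional"):
            seen = values[(e["subject"], e["predicate"])]
            if seen and e["object"] not in seen:
                violations.append(("FUNCTIONAL_CONFLICT", e["id"]))
            seen.add(e["object"])
        pair = asserted[(e["subject"], e["object"])]
        if rule.get("negation_of") in pair:
            violations.append(("NEGATION_CONFLICT", e["id"]))
        pair.append(e["predicate"])
    return violations


def edge(id_, subj, subj_type, pred, obj, obj_type, method="example-llm-v1"):
    """Build a test edge; pass method=None to omit provenance."""
    prov = {"extraction_method": method} if method else None
    return {"id": id_, "subject": subj, "subject_type": subj_type, "predicate": pred,
            "object": obj, "object_type": obj_type, "provenance": prov}


EDGES = [
    edge("e1", "aspirin", "DRUG", "inhibits", "gene_x", "GENE"),
    edge("e2", "aspirin", "DRUG", "activates", "gene_x", "GENE"),
    edge("e3", "aspirin", "DRUG", "inhibits", "new_york", "LOCATION", method=None),
    edge("e4", "aspirin", "DRUG", "cures", "headache", "DISEASE"),
    edge("e5", "gene_x", "GENE", "has_official_symbol", "SYM1", "SYMBOL"),
    edge("e6", "gene_x", "GENE", "has_official_symbol", "SYM2", "SYMBOL"),
]
violations = lint(EDGES, SPEC)
```

In this sketch a conflict is reported against the later of the two edges; a fuller version would name both, as the Conflict Record discussion later in this chapter suggests.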
Violation Structure¶
A useful linter emits structured output, not free text. Each violation should be a typed record that downstream tools can handle programmatically:
{
    "violation_type": "DOMAIN_RANGE_MISMATCH",
    "severity": "ERROR",
    "edge_id": "edge_789",
    "subject_type": "DRUG",
    "predicate": "inhibits",
    "object_type": "LOCATION",
    "message": "Predicate 'inhibits' cannot have object type 'LOCATION'. Expected 'GENE' or 'PROCESS'.",
    "remediation": "Check entity resolution for object 'New York'."
}
Emitting one such record per line as JSONL (the example above is pretty-printed for readability) makes the linter composable: pipe it into a dashboard, a review queue, a CI step, or a script that filters by severity.
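A severity filter of the kind just mentioned takes only a few lines. The sketch below assumes one JSON object per line and a `severity` field as in the record above; the `WARNING` level and the `PROVENANCE_GAP` type in the sample data are invented for the example.

```python
import json


def filter_by_severity(lines, severity="ERROR"):
    """Yield parsed violation records with the given severity, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("severity") == severity:
            yield record


sample = [
    '{"violation_type": "DOMAIN_RANGE_MISMATCH", "severity": "ERROR", "edge_id": "edge_789"}',
    "",
    '{"violation_type": "PROVENANCE_GAP", "severity": "WARNING", "edge_id": "edge_790"}',
]
errors = list(filter_by_severity(sample))
```

Because the function accepts any iterable of lines, the same code works on a file object or on `sys.stdin` when the linter's output is piped into it.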
Conflict Records as First-Class Data¶
One worthwhile design choice for a graph linter: when it detects that two papers disagree -- one says a drug activates a gene and another says it inhibits it -- it should not simply report the contradiction as an error and stop. Instead, it should emit a Conflict Record: a structured record naming both edges, the conflict type, and the resolution status.
The graph is richer for containing the dispute rather than suppressing it. In a typed graph, contradiction is information, not failure. Unresolved scientific disagreements are real and worth representing. A linter that turns unacknowledged contradictions into first-class records allows the graph to represent the messiness of scientific discourse without sacrificing structural rigor.
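A Conflict Record along these lines might look as follows. This is a sketch rather than a specification: the class layout, the `NEGATION_PAIR` label, and the `UNRESOLVED` status value are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ConflictRecord:
    conflict_type: str                # hypothetical label, e.g. "NEGATION_PAIR"
    edge_ids: Tuple[str, str]         # the two disagreeing edges
    resolution_status: str = "UNRESOLVED"


# Hypothetical declaration of which predicates negate each other.
NEGATION_PAIRS = {frozenset({"activates", "inhibits"})}


def conflict_from_edges(edge_a, edge_b, negation_pairs=NEGATION_PAIRS):
    """Return a ConflictRecord if the edges assert negated predicates
    between the same subject and object; otherwise return None."""
    same_pair = (edge_a["subject"], edge_a["object"]) == (edge_b["subject"], edge_b["object"])
    negated = frozenset((edge_a["predicate"], edge_b["predicate"])) in negation_pairs
    if same_pair and negated:
        return ConflictRecord("NEGATION_PAIR", (edge_a["id"], edge_b["id"]))
    return None


edge_1 = {"id": "edge_1", "subject": "drug_a", "predicate": "activates", "object": "gene_g"}
edge_2 = {"id": "edge_2", "subject": "drug_a", "predicate": "inhibits", "object": "gene_g"}

conflict = conflict_from_edges(edge_1, edge_2)
conflict_line = json.dumps(asdict(conflict))  # one JSONL line for downstream tools
```

Both original edges stay in the graph; the record only acknowledges that they disagree and tracks whether anyone has resolved the dispute.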
A Note on What Exists¶
None of this requires the graph linter to exist before the graph is useful. The typed schema and identity server provide meaningful guarantees at insertion time without any separate linter. But as a corpus grows and multiple ingestion runs accumulate, the value of an independent audit pass increases. A linter built along these lines -- schema-driven, structured output, conflict records as data -- would be a natural next tool to build once the core pipeline is running.
Chapter 13: Bias, Limits, and Responsibility¶
What the Graph Cannot Know¶
A knowledge graph built from a corpus knows only what that corpus contains. PubMed indexes a large fraction of biomedical literature, but not all of it. Papers published only in languages other than English are underrepresented. Research from institutions in lower-income countries is underrepresented. Research that was conducted but never published -- because the results were negative, because the funding ran out, because the research group disbanded -- is absent entirely.
The identity server cannot correct for these absences. It can only process what it is given. A graph that achieves high internal consistency through careful identity resolution is not a complete picture of a domain; it is a consistent picture of what the corpus contains.
This is a limitation, not a failure. Every knowledge system has coverage boundaries. The important thing is that the boundaries are known and communicated to users of the graph, not obscured by the system's apparent sophistication.
Bias Encoded at Scale¶
Source biases propagate into the graph and are amplified by confidence weighting. If the corpus contains more RCTs on a particular drug than on comparable drugs -- because the manufacturer funded more research -- the drug's claims will have higher confidence scores than the claims of unfunded comparators. This is not a bug in the confidence weighting formula. It is a faithful representation of what the evidence shows. But it may mislead users who do not understand the relationship between publication patterns and confidence scores.
The identity server cannot eliminate this bias. It can make it visible: by recording the study type of every source, by exposing the provenance of every confidence score, by providing query interfaces that let users examine the evidence distribution behind any claim. Transparency about bias is not the same as absence of bias, but it is a necessary condition for informed use.
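Examining the evidence distribution behind a claim can be as simple as counting its sources by study type. The record shape and the example data below are hypothetical; the point is only that the query needs nothing beyond the provenance already stored.

```python
from collections import Counter


def evidence_distribution(provenance_records):
    """Count the sources behind a claim by study type."""
    return Counter(rec["study_type"] for rec in provenance_records)


# Made-up provenance for the same kind of claim about two comparable drugs.
heavily_funded = [
    {"paper_id": "p1", "study_type": "RCT"},
    {"paper_id": "p2", "study_type": "RCT"},
    {"paper_id": "p3", "study_type": "cohort"},
]
comparator = [
    {"paper_id": "p4", "study_type": "case report"},
]

funded_dist = evidence_distribution(heavily_funded)
comparator_dist = evidence_distribution(comparator)
```

Shown next to a confidence score, a distribution like this lets a user see that a gap between two scores may reflect how much research was funded rather than how well each drug works.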
The Builder's Responsibility¶
Building a knowledge graph that is used in high-stakes decisions carries responsibilities that do not end at deployment. The graph's coverage boundaries should be documented. The confidence weighting methodology should be transparent and auditable. The provenance records should be sufficient for a user to verify any claim independently.
The identity server's architecture supports these responsibilities: every merge is logged, every promotion is logged, every confidence computation is reproducible from the provenance records. The infrastructure for verification is built in. Using it is a commitment that extends beyond the code.
Credit, Priority, and Provenance¶
When a machine surfaces a connection -- a drug-disease relationship that no single paper states but that the graph implies from combining multiple sources -- who gets credit? The authors of the papers that contributed the underlying facts? The builders of the graph? The user who ran the query? The question matters for scientific priority, intellectual property, and the sociology of research. Scientists are rewarded for discovery. If the discovery is made by a system, the reward structure gets complicated.
Provenance tracking, which this book has treated as a technical concern throughout, turns out to have significant ethical implications. How you record where a fact came from determines who can be credited. A relationship with full provenance -- source document, passage, extraction method -- makes it possible to trace the contribution back to the original authors. A relationship stored without provenance makes that impossible. The technical decision about schema design is also a decision about how credit will flow. The same is true for conflicts: when two sources assert contradictory relationships, provenance lets you represent the conflict rather than silently merging. That representation matters for how disputes get resolved and how the community understands what's known versus what's contested. The builder of the graph is making choices that affect the sociology of the domain, whether or not they intend to.
Who Owns the Graph¶
Open versus proprietary is not a new tension in science. GenBank, the repository of genetic sequences, was built as a public resource; the decision to make it open and freely accessible shaped how molecular biology developed. Clinical trial data, by contrast, has often been held proprietary by sponsors; the fight for access has been long and only partially won. The question of who owns a comprehensive knowledge graph over a significant scientific domain will have similar consequences.
If a single entity -- a company, a government, a consortium -- controls the graph, that entity controls who can query it, what they can do with the results, and how the graph evolves. The incentives may align with the scientific commons, or they may not. A company that built a drug-discovery KG might restrict access to protect competitive advantage. A government might restrict access for national security reasons. An open consortium might make the graph freely available but lack the resources to maintain it. The historical analogies are instructive: GenBank succeeded because the community agreed that sequence data should be a commons; clinical trial data remains contested because the incentives are mixed. A knowledge graph over a domain like medicine or materials science will face the same tensions. What it would mean for a single entity to control it -- the power to shape what gets synthesized, what gets surfaced, what gets updated -- is worth thinking about before it happens.
Capability Is Not Bounded by Intent¶
Consider what it means to build a system that encodes the architecture of expertise for a domain. You built a graph for drug discovery; a user runs a traversal that surfaces a drug-pathway combination that could be repurposed for something harmful. You built a graph for medical literature; a query connects the dots in a way that reveals something about a person's health that they didn't intend to share. You built a graph for materials science; the same structural similarity query that finds promising battery compounds could find promising explosives. None of these are edge cases or failures. They follow directly from the system working as designed.
The graph encodes structure; structure supports inference; inference doesn't respect the boundaries of what you had in mind. A reasoning system with access to rich, typed, provenance-tracked knowledge will expose connections its builders didn't anticipate -- because the value of the system is precisely that it can traverse the graph more exhaustively than any individual human would. That traversal doesn't stop at the edges of your intended use case. Capability is not neatly bounded by intent.
That doesn't mean you shouldn't build. It means you should build with your eyes open. The inferences the system can expose are a feature when they advance science and a risk when they don't. The difference is often context, use case, and the choices you make about access, provenance, and what gets logged. Those choices deserve to be taken seriously.
Dual Use at Graph Scale¶
The drug interaction that saves lives and the synthesis route that enables harm are both pattern-matching problems over structured knowledge. A graph that encodes "compound X inhibits enzyme Y" and "reaction A produces compound X" can answer "what inhibits Y?" for a clinician looking for treatments and for someone looking for precursors. The same query interface serves both. The graph doesn't know the difference. Dual use is not a bug; it's inherent to how knowledge works. Facts don't come with moral valence. The same fact can support healing or harm depending on who uses it and how.
What does responsible construction and deployment look like? There's no clean answer, but there are practices that help:
- Access control: who can query the graph, and for what? Some graphs should be broadly available; others may need to be restricted to credentialed researchers or vetted use cases.
- Provenance and transparency: when the system surfaces a connection, can the user trace it to sources? That traceability supports verification and accountability.
- Logging and monitoring: if the graph is used for something harmful, can you detect it?
- Auditing: who reviews how the system is used?
These are operational questions, not just technical ones. They don't eliminate dual use. They make it harder to misuse the system without leaving a trace, and they create channels for accountability when misuse occurs. The right response to dual use isn't to not build. It's to build with these questions in mind.
The Epistemic Responsibility of the Builder¶
What do you owe to the users of the system you build? At minimum, you owe them honesty about what the system is and isn't. It's a synthesis of the literature, not a representation of ground truth. It has gaps, biases, and limits. Users who don't understand that may over-trust it. You also owe them the infrastructure for verification: provenance, so they can trace claims to sources; confidence, so they can weight what they find; and documentation, so they know what the schema captures and what it doesn't.
Beyond that, the builder's choices about provenance, transparency, access, and schema design are ethical choices, not just technical ones. Deciding what to extract, how to represent it, who gets to query it, and what gets logged -- these decisions shape how the system will be used and what consequences it will have. That doesn't mean every builder must solve every ethical problem before shipping. It means the builder is a stakeholder, with some power to shape outcomes. The right response isn't paralysis. It's to take the responsibility seriously, to build with the foreseeable consequences in mind, and to create the conditions for accountability when things go wrong.
Chapter 14: What This Makes Possible¶
The Three-Book Arc¶
Knowledge Graphs from Unstructured Text solves the extraction problem: how to get structured claims out of unstructured text at scale. This book solves the trustworthiness problem: how to ensure those claims are anchored to stable identities, sourced to their evidence, and aggregated correctly across sources. BFS-QL solves the interface problem: how to get those claims to a language model in a form it can actually reason over.
The three books are independent in the sense that each addresses a distinct problem. They are interdependent in the sense that each one's solution depends on the others being solved. An extraction pipeline without an identity server produces an unusable graph. An identity server without an extraction pipeline has nothing to process. A query protocol without a trustworthy graph is an interface to noise.
The identity server is the connective tissue. It is called by the extraction pipeline and queried by the query layer. It is the service that makes the graph trustworthy, and trustworthiness is what makes the system useful.
Cross-Domain Reasoning¶
The shared canonical ID infrastructure makes cross-domain reasoning possible in a way that was not previously practical. A graph built from biomedical literature can compose with a graph built from clinical trial data, a drug adverse event database, and a genomics resource -- not because any of these sources were designed to interoperate, but because they all anchor to the same authorities.
The biomedical community built MeSH, HGNC, RxNorm, and UniProt over decades for their own purposes: to organize their literature, to name their discoveries, to communicate across research groups. The identity server treats these authorities as the interoperability layer they accidentally became. The cross-domain reasoning capability is an emergent property of the decision to anchor to shared authorities, not a designed feature of any single system.
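The mechanism is simple enough to show in miniature: two graphs built independently can be joined wherever they mention the same canonical ID. The triples and identifiers below are placeholders, not real RxNorm, MeSH, or HGNC entries, and the function names are assumptions.

```python
def entity_ids(triples):
    """All canonical IDs that appear as subject or object."""
    return {entity for s, _, o in triples for entity in (s, o)}


def shared_entities(graph_a, graph_b):
    """IDs on which two independently built graphs can be joined."""
    return entity_ids(graph_a) & entity_ids(graph_b)


# Placeholder identifiers standing in for authority-issued IDs.
literature_graph = [
    ("rxnorm:EXAMPLE_DRUG", "treats", "mesh:EXAMPLE_DISEASE"),
    ("hgnc:EXAMPLE_GENE", "associated_with", "mesh:EXAMPLE_DISEASE"),
]
adverse_event_graph = [
    ("rxnorm:EXAMPLE_DRUG", "has_adverse_event", "mesh:EXAMPLE_SYMPTOM"),
]

join_points = shared_entities(literature_graph, adverse_event_graph)
combined = literature_graph + adverse_event_graph
```

No mapping table between the two sources is needed here, because both were anchored to the same authority before they were ever combined.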
Democratization and Its Limits¶
Building and maintaining a serious knowledge graph still requires significant resources. You need a corpus, which may be behind paywalls. You need compute for extraction, which costs money. You need domain expertise to design the schema and validate the output. The result is that the first generation of domain-spanning knowledge graphs will likely be built by those who can afford to build them -- pharmaceutical companies, large universities, government agencies, well-funded startups. The question of who gets access then becomes a question of licensing, openness, and governance.
The promise the technology holds out is real nonetheless. A researcher at a small institution, or in a developing country, with access to a comprehensive KG over their domain would have the same structural view of the literature as a researcher at a well-funded lab. The graph doesn't care who queries it. The capability to expose connections that citation networks hide, to ground an LLM in curated knowledge -- that capability could be democratized. The technology enables it; policy and incentive will decide whether it happens.
Grounding LLM Inference¶
The pattern that changes what a language model can do: instead of asking a model to reason from its training data, give it structured, typed, provenance-tracked claims from your graph and ask it to reason from those. The difference in reliability is substantial. A model hallucinating over raw text and a model reasoning over a curated graph with explicit provenance are doing qualitatively different things, even if they look similar from the outside. This is the integration that makes a knowledge graph more than a database.
The mechanics are straightforward. A user asks a question. Your system retrieves relevant subgraphs -- entities and relationships that match the question's scope -- and injects them into the model's context. The model reasons over that context and produces an answer. The answer is grounded in the retrieved graph, not in the model's training. You can cite the sources. You can trace the reasoning path. When the graph is wrong, you fix the graph; you don't retrain the model.
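The retrieval and injection steps can be sketched without committing to any particular model API. Everything here is an assumption for illustration: the edge shape, the one-hop retrieval strategy, and the wording of the instruction line. The call to the model itself is deliberately left out.

```python
def retrieve_subgraph(edges, entity_ids):
    """Return the edges that touch any of the given entities (a one-hop neighbourhood)."""
    return [e for e in edges if e["subject"] in entity_ids or e["object"] in entity_ids]


def build_context(question, subgraph):
    """Format the retrieved claims, each with its source, followed by the question."""
    lines = ["Answer using only the numbered claims below and cite the claim numbers."]
    for i, e in enumerate(subgraph, start=1):
        lines.append(
            f"[{i}] {e['subject']} {e['predicate']} {e['object']} "
            f"(source: {e['paper_id']}, confidence: {e['confidence']})"
        )
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Made-up edges; identifiers are placeholders.
EDGES = [
    {"subject": "drug_a", "predicate": "treats", "object": "disease_d",
     "paper_id": "paper-0001", "confidence": 0.9},
    {"subject": "gene_g", "predicate": "associated_with", "object": "disease_d",
     "paper_id": "paper-0002", "confidence": 0.8},
    {"subject": "drug_c", "predicate": "treats", "object": "disease_e",
     "paper_id": "paper-0003", "confidence": 0.7},
]

subgraph = retrieve_subgraph(EDGES, {"disease_d"})
context = build_context("What is known about disease_d?", subgraph)
```

Because each claim carries its number and source into the prompt, an answer that cites claim numbers can be traced back to specific papers.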
Hypothesis Generation¶
Graph traversal as a discovery tool: not "what do we know about X" but "what's adjacent to X that hasn't been studied," "what entities are structurally similar to X in the graph," "what relationships exist between X and Y that no single paper asserts but that follow from combining multiple sources." These are queries that are impossible over raw text and natural over a well-constructed graph.
Consider a concrete example. Drug A treats disease D. Gene G is associated with disease D. Drug B modulates gene G. No single paper may state that drug B is worth testing for disease D. The inference follows from combining three relationships that exist in the graph. A researcher who had read all the relevant papers might make that connection; the graph makes it queryable. The results are candidate hypotheses -- drug-disease pairs that the graph implies but that may not have been studied together. The graph doesn't decide which are worth pursuing. It surfaces candidates that a human can filter and prioritize.
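The example above can be written as a small traversal. The predicate names `treats`, `associated_with`, and `modulates` and the triple representation are assumptions; a real graph would query canonical IDs through its own interface.

```python
from collections import defaultdict


def candidate_hypotheses(triples):
    """Find (drug, disease, gene) where the drug modulates a gene associated
    with the disease and no 'treats' edge already links drug and disease."""
    known_treatments = {(s, o) for s, p, o in triples if p == "treats"}
    diseases_by_gene = defaultdict(set)
    for s, p, o in triples:
        if p == "associated_with":
            diseases_by_gene[s].add(o)

    candidates = []
    for s, p, o in triples:
        if p != "modulates":
            continue
        for disease in sorted(diseases_by_gene.get(o, ())):
            if (s, disease) not in known_treatments:
                candidates.append((s, disease, o))
    return candidates


TRIPLES = [
    ("drug_a", "treats", "disease_d"),
    ("gene_g", "associated_with", "disease_d"),
    ("drug_b", "modulates", "gene_g"),
]

candidates = candidate_hypotheses(TRIPLES)
```

In this sketch the `treats` edges serve to filter out pairs that are already known; each remaining candidate keeps the gene through which it was found, so a reviewer can inspect the path and its provenance before deciding whether it is worth pursuing.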
What Has Changed¶
The extraction bottleneck that held back knowledge representation for fifty years is now broken. The epistemic commons -- the shared identifier infrastructure built by the biomedical, chemical, legal, and geographic communities -- has existed for decades. The identity server is the bridge between the two: the service that takes extracted mentions, anchors them to shared authorities, aggregates their evidence, and makes the resulting graph trustworthy.
The vision of machine reasoning over explicit, traceable, cross-domain knowledge -- a vision that animated researchers from McCarthy to Lenat to Berners-Lee -- is now achievable with tools that exist today, at a cost that is no longer prohibitive, for domains that matter.