Chapter 10: Provenance as Architecture

Provenance Is Not Optional

In high-stakes domains -- medicine, law, materials safety -- every claim in the knowledge graph must be traceable to its source. This is not a feature. It is a constraint. A graph that cannot answer "where did this claim come from?" is not suitable for use in these domains regardless of how sophisticated its extraction pipeline is.

Provenance must be architectural, not retrofitted. Adding provenance to an existing graph requires touching every relationship record. The data model for provenance affects the schema, the extraction output format, the ingest stage, the confidence aggregation logic, and the query interface. Getting it right at the start costs little. Getting it wrong costs a full re-extraction.

What Provenance Records

A complete provenance record contains:

  • paper_id: Which paper made this claim
  • section_type: Where in the paper (abstract, introduction, methods, results, discussion, conclusion)
  • paragraph_idx: Index of the paragraph within the section
  • extraction_method: How the claim was extracted (LLM model and version, prompt version)
  • confidence: Confidence in this specific piece of evidence
  • study_type: The study design (RCT, meta-analysis, cohort, case report, etc.)

The section type is meaningful for evidence quality: a claim stated in the results section carries more weight than the same claim in the discussion, where it may be speculative. The paragraph index lets a reader go straight to the paragraph in the paper that supports the claim, which is essential for human verification.
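
The fields listed above could be modeled roughly as follows. This is a minimal sketch, not the reference implementation's schema: the class names, the zero-based paragraph index, the [0, 1] confidence range, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SectionType(Enum):
    """Sections named in the text; the enum itself is illustrative."""
    ABSTRACT = "abstract"
    INTRODUCTION = "introduction"
    METHODS = "methods"
    RESULTS = "results"
    DISCUSSION = "discussion"
    CONCLUSION = "conclusion"


@dataclass(frozen=True)
class ProvenanceRecord:
    paper_id: str              # which paper made the claim
    section_type: SectionType  # where in the paper
    paragraph_idx: int         # paragraph within the section (assumed zero-based)
    extraction_method: str     # model, model version, prompt version
    confidence: float          # assumed to lie in [0, 1]
    study_type: str            # e.g. "rct", "cohort", "case_report"

    def __post_init__(self) -> None:
        # Reject records that could not be traced or interpreted later.
        if self.paragraph_idx < 0:
            raise ValueError("paragraph_idx must be non-negative")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


# Hypothetical example values, not taken from any real paper.
record = ProvenanceRecord(
    paper_id="PMC0000000",
    section_type=SectionType.RESULTS,
    paragraph_idx=3,
    extraction_method="example-llm-v1/prompt-v2",
    confidence=0.85,
    study_type="rct",
)
```

Making the record immutable (frozen) is one way to reflect the idea that evidence, once recorded, is not edited in place; whether the real system does this is not stated here.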

Multi-Source Claims

A claim that appears in multiple papers is stronger than a claim that appears in one. The identity server aggregates evidence across sources as part of its normal operation: when the same relationship is extracted from multiple papers and both subject and object entities resolve to the same canonical IDs, the identity server records a multi-source claim with a composite confidence.
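
The grouping step described above can be sketched as follows, assuming entity resolution has already replaced subject and object mentions with canonical IDs. The ExtractedClaim type, the function names, and the example IDs are hypothetical; the identity server's actual interface is not shown in this chapter.

```python
from collections import defaultdict
from typing import NamedTuple


class ExtractedClaim(NamedTuple):
    subject_id: str   # canonical ID after entity resolution
    predicate: str
    object_id: str
    paper_id: str
    confidence: float


def group_by_claim(claims):
    """Group evidence by (subject, predicate, object) canonical key."""
    grouped = defaultdict(list)
    for claim in claims:
        key = (claim.subject_id, claim.predicate, claim.object_id)
        grouped[key].append(claim)
    return dict(grouped)


def multi_source_claims(claims):
    """Keep only claims supported by more than one distinct paper."""
    result = {}
    for key, evidence in group_by_claim(claims).items():
        papers = {e.paper_id for e in evidence}
        if len(papers) > 1:
            result[key] = evidence
    return result


# Hypothetical canonical IDs and paper IDs, for illustration only.
claims = [
    ExtractedClaim("drug:001", "treats", "disease:042", "paper-A", 0.9),
    ExtractedClaim("drug:001", "treats", "disease:042", "paper-B", 0.7),
    ExtractedClaim("drug:002", "causes", "disease:042", "paper-A", 0.6),
]
multi = multi_source_claims(claims)
```

Counting distinct papers rather than raw extractions keeps a claim repeated several times within one paper from being treated as replicated.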

The composite confidence is computed by the domain service. In the medlit reference implementation, it is a weighted mean of the individual confidence scores, where weights are determined by study type. A claim supported by two RCTs and one cohort study has a composite confidence higher than the same claim supported by two case reports.
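
A weighted mean of this kind might look like the sketch below. The weight values and study-type keys are invented for illustration and are not the medlit weights; only the general shape (a mean weighted by study type) comes from the text.

```python
# Hypothetical weights by study type; the real values are not given here.
STUDY_TYPE_WEIGHTS = {
    "meta_analysis": 1.0,
    "rct": 0.9,
    "cohort": 0.6,
    "case_report": 0.2,
}


def composite_confidence(evidence):
    """Weighted mean of confidences.

    `evidence` is a list of (confidence, study_type) pairs.
    """
    if not evidence:
        raise ValueError("at least one piece of evidence is required")
    weighted_sum = 0.0
    total_weight = 0.0
    for confidence, study_type in evidence:
        if study_type not in STUDY_TYPE_WEIGHTS:
            raise ValueError(f"unknown study type: {study_type}")
        weight = STUDY_TYPE_WEIGHTS[study_type]
        weighted_sum += confidence * weight
        total_weight += weight
    return weighted_sum / total_weight


# Two RCTs and one cohort study, with made-up confidence scores.
strong = composite_confidence([(0.9, "rct"), (0.8, "rct"), (0.6, "cohort")])
# Two case reports, with made-up confidence scores.
weak = composite_confidence([(0.5, "case_report"), (0.4, "case_report")])
```

Under a plain weighted mean, the weights control how strongly each source pulls the composite toward its own score, so the resulting ordering depends on the individual confidence scores as well as on the study types.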

Replication is a signal of robustness, not a guarantee of correctness. The identity server records replication faithfully; the interpretation of that replication is a human judgment informed by the provenance records.