Chapter 12: The Graph Linter

Linting as Explicit Epistemics

In the Unix philosophy, tools are small, focused, and composable. The ingestion pipeline can enforce schema constraints at insertion time, but there is a complementary tool worth building: a standalone linter that audits the graph independently of the insertion path. Call it kglint.

The idea is straightforward. A code linter checks source files for structural violations that the compiler might not catch -- style problems, unused variables, suspicious patterns. A graph linter does the same for knowledge claims. It doesn't just check for broken data; it checks for broken epistemics. It asks, of every edge in the graph: is this claim well-formed? Is it sourced? Does it contradict something else without acknowledgment?

This separation -- enforcement at insertion, auditing after the fact -- reflects the Unix philosophy. The insertion path needs to be fast; the linter can afford to be thorough. Running them as distinct tools makes both easier to reason about and test.

What a Graph Linter Checks

The key design insight is that a graph linter should not have hardcoded rules. Its rule set should be derived entirely from the Domain Spec at runtime. Every predicate in the spec has a domain, a range, a provenance requirement, and optionally a functional flag or a negation pair. The linter reads the spec and generates its checks from those declarations.

The checks fall into the same four layers described in Chapter 11:

  • Vocabulary violations: predicates in the graph not defined in the schema
  • Domain/range violations: edges whose subject or object entity type violates the predicate's declared constraints
  • Provenance gaps: edges without a valid provenance record, or with provenance that does not declare an extraction method
  • Unacknowledged contradictions: functional predicates with multiple values for the same subject, or two edges between the same entity pair whose predicates are declared as a negation pair

Adding a new predicate to the Domain Spec automatically adds lint coverage for it. Because every check is generated from the spec, the linter's rule set needs no separate maintenance as the schema evolves.
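The four layers can be sketched in a few dozen lines of Python. This is a minimal illustration, not kglint itself: the PredicateSpec and Edge shapes, the lint function, the provenance keys, and every violation name other than DOMAIN_RANGE_MISMATCH are assumptions made for the example, as are the sample entities.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PredicateSpec:
    """Hypothetical shape of one predicate declaration in the Domain Spec."""
    domain: frozenset                  # allowed subject entity types
    range_: frozenset                  # allowed object entity types
    functional: bool = False           # at most one object per subject
    negation_of: Optional[str] = None  # opposing predicate, if declared


@dataclass
class Edge:
    edge_id: str
    subject_id: str
    subject_type: str
    predicate: str
    object_id: str
    object_type: str
    provenance: Optional[dict] = None  # assumed keys: "source", "extraction_method"


def lint(edges, spec):
    """Return (violation_type, edge_ids) tuples. No rule is hardcoded:
    every check is driven by the declarations in `spec`."""
    violations = []
    functional = defaultdict(dict)  # (subject, predicate) -> {object_id: edge_id}
    by_pair = defaultdict(dict)     # (subject, object) -> {predicate: edge_id}

    for e in edges:
        decl = spec.get(e.predicate)
        if decl is None:  # layer 1: vocabulary
            violations.append(("UNKNOWN_PREDICATE", (e.edge_id,)))
            continue
        # layer 2: domain/range
        if e.subject_type not in decl.domain or e.object_type not in decl.range_:
            violations.append(("DOMAIN_RANGE_MISMATCH", (e.edge_id,)))
        # layer 3: provenance must exist and declare an extraction method
        if not e.provenance or not e.provenance.get("extraction_method"):
            violations.append(("PROVENANCE_GAP", (e.edge_id,)))
        if decl.functional:
            functional[(e.subject_id, e.predicate)].setdefault(e.object_id, e.edge_id)
        by_pair[(e.subject_id, e.object_id)][e.predicate] = e.edge_id

    # layer 4a: functional predicates with more than one value per subject
    for objects in functional.values():
        if len(objects) > 1:
            violations.append(("FUNCTIONAL_CONFLICT", tuple(sorted(objects.values()))))

    # layer 4b: both halves of a declared negation pair on the same entity pair
    seen = set()
    for preds in by_pair.values():
        for name, edge_id in preds.items():
            other = spec[name].negation_of
            if other in preds:
                ids = tuple(sorted((edge_id, preds[other])))
                if ids not in seen:  # report each pair once
                    seen.add(ids)
                    violations.append(("NEGATION_PAIR", ids))
    return violations


# Illustrative spec and edges (all names are made up for this sketch).
TARGETS = frozenset({"GENE", "PROCESS"})
SPEC = {
    "inhibits": PredicateSpec(frozenset({"DRUG"}), TARGETS, negation_of="activates"),
    "activates": PredicateSpec(frozenset({"DRUG"}), TARGETS, negation_of="inhibits"),
}
PROV = {"source": "paper_a", "extraction_method": "llm"}
EDGES = [
    Edge("e1", "drug:1", "DRUG", "inhibits", "loc:1", "LOCATION", PROV),
    Edge("e2", "drug:1", "DRUG", "cures", "gene:1", "GENE", PROV),
    Edge("e3", "drug:1", "DRUG", "inhibits", "gene:1", "GENE"),
    Edge("e4", "drug:1", "DRUG", "activates", "gene:1", "GENE", PROV),
]
for violation in lint(EDGES, SPEC):
    print(violation)
```

In this sketch, lint contains no predicate names at all. Adding a functional predicate to SPEC is enough for the functional check to start applying to it, which is the property the paragraph above describes.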

Violation Structure

A useful linter emits structured output, not free text. Each violation should be a typed record that downstream tools can handle programmatically:

{
  "violation_type": "DOMAIN_RANGE_MISMATCH",
  "severity": "ERROR",
  "edge_id": "edge_789",
  "subject_type": "DRUG",
  "predicate": "inhibits",
  "object_type": "LOCATION",
  "message": "Predicate 'inhibits' cannot have object type 'LOCATION'. Expected 'GENE' or 'PROCESS'.",
  "remediation": "Check entity resolution for object 'New York'."
}

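The record above is pretty-printed for readability. Emitted as JSONL, with one compact record per line, the output makes the linter composable: pipe it into a dashboard, a review queue, a CI step, or a script that filters by severity.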
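As a rough sketch of that composability, the following Python writes violation records as JSONL and filters them by severity on the consuming side. The function names, the severity ranking (only "ERROR" appears in the example record), and the second sample record are assumptions for illustration.

```python
import io
import json

# Assumed ordering; the text only shows the "ERROR" level.
SEVERITY_RANK = {"INFO": 0, "WARNING": 1, "ERROR": 2}


def write_jsonl(violations, stream):
    """Write one compact JSON object per line."""
    for record in violations:
        stream.write(json.dumps(record, sort_keys=True) + "\n")


def filter_by_severity(lines, minimum="ERROR"):
    """Yield parsed records at or above `minimum` from an iterable of JSONL lines."""
    threshold = SEVERITY_RANK[minimum]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if SEVERITY_RANK.get(record.get("severity"), 0) >= threshold:
            yield record


violations = [
    {
        "violation_type": "DOMAIN_RANGE_MISMATCH",
        "severity": "ERROR",
        "edge_id": "edge_789",
        "predicate": "inhibits",
    },
    {   # hypothetical second record, added only to show the filter at work
        "violation_type": "PROVENANCE_GAP",
        "severity": "WARNING",
        "edge_id": "edge_790",
        "predicate": "inhibits",
    },
]

buffer = io.StringIO()  # stands in for stdout or a file
write_jsonl(violations, buffer)
buffer.seek(0)
errors = list(filter_by_severity(buffer, minimum="ERROR"))
```

Because each line is a complete JSON document, a consumer can process the stream record by record without loading the whole report, and a CI step can fail the build simply by checking whether any ERROR record survives the filter.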

Conflict Records as First-Class Data

One worthwhile design choice for a graph linter: when it detects that two papers disagree -- one says a drug activates a gene and another says it inhibits it -- it should not simply report the contradiction as an error and stop. Instead, it should emit a Conflict Record: a structured record naming both edges, the conflict type, and the resolution status.

The graph is richer for containing the dispute rather than suppressing it. In a typed graph, contradiction is information, not failure. Unresolved scientific disagreements are real and worth representing. A linter that turns unacknowledged contradictions into first-class records allows the graph to represent the messiness of scientific discourse without sacrificing structural rigor.
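One possible shape for such a record, sketched in Python. The field names, the "UNRESOLVED" status value, the edge dictionaries, and the negation_conflict helper are all hypothetical; the text only specifies that the record names both edges, the conflict type, and the resolution status.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConflictRecord:
    conflict_type: str                     # e.g. "NEGATION_PAIR"
    edge_ids: tuple                        # the two edges that disagree
    sources: tuple                         # where each claim came from
    resolution_status: str = "UNRESOLVED"  # assumed status vocabulary


def negation_conflict(edge_a, edge_b, negation_pairs):
    """Return a ConflictRecord if the two edges assert opposing predicates
    about the same (subject, object) pair, otherwise None."""
    same_pair = (edge_a["subject"], edge_a["object"]) == (edge_b["subject"], edge_b["object"])
    opposed = frozenset((edge_a["predicate"], edge_b["predicate"])) in negation_pairs
    if not (same_pair and opposed):
        return None
    return ConflictRecord(
        conflict_type="NEGATION_PAIR",
        edge_ids=(edge_a["edge_id"], edge_b["edge_id"]),
        sources=(edge_a["source"], edge_b["source"]),
    )


# Negation pairs would come from the Domain Spec; this one is illustrative.
NEGATION_PAIRS = {frozenset({"activates", "inhibits"})}

edge_a = {"edge_id": "edge_101", "subject": "drug:1", "predicate": "activates",
          "object": "gene:1", "source": "paper_a"}
edge_b = {"edge_id": "edge_202", "subject": "drug:1", "predicate": "inhibits",
          "object": "gene:1", "source": "paper_b"}

record = negation_conflict(edge_a, edge_b, NEGATION_PAIRS)
serialized = json.dumps(asdict(record), sort_keys=True)
```

Both original edges stay in the graph untouched. The record sits alongside them, so a later reviewer can update resolution_status without deleting either claim.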

A Note on What Exists

The graph does not need a linter to be useful. The typed schema and identity server provide meaningful guarantees at insertion time without any separate linter. But as a corpus grows and multiple ingestion runs accumulate, the value of an independent audit pass increases. A linter built along these lines -- schema-driven, structured output, conflict records as data -- would be a natural next tool to build once the core pipeline is running.