Architecture Overview

Two-Pass Ingestion Pipeline

The framework processes documents in two passes:

┌─────────────┐     ┌─────────────┐     ┌─────────────────┐
│  Raw Docs   │────▶│   Parser    │────▶│  BaseDocument   │
└─────────────┘     └─────────────┘     └────────┬────────┘
                                                 │
                        ┌────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                      PASS 1                             │
│  ┌─────────────────┐     ┌──────────────────┐           │
│  │ Entity Extractor│────▶│  Entity Resolver │           │
│  └─────────────────┘     └────────┬─────────┘           │
│         │                         │                     │
│         ▼                         ▼                     │
│  EntityMention[]           BaseEntity[]                 │
│                            (canonical or provisional)   │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                      PASS 2                             │
│  ┌───────────────────────┐                              │
│  │ Relationship Extractor│                              │
│  └───────────┬───────────┘                              │
│              │                                          │
│              ▼                                          │
│       BaseRelationship[]                                │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼
                ┌───────────────┐
                │    Storage    │
                └───────────────┘

Pass 1: Entity Extraction

EntityExtractor scans documents and produces EntityMention objects (raw text spans with type classifications)
EntityResolver maps mentions to existing canonical entities or creates provisional entities
EmbeddingGenerator creates semantic vectors for new entities
Entities are stored; when a mention resolves to an entity that is already stored, that entity's usage count is incremented instead of a duplicate being created
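
The Pass 1 steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the framework's code: stdlib dataclasses stand in for the Pydantic models, and MentionSketch, EntitySketch, resolve_mention, the name-keyed store, and the "prov:" ID scheme are all hypothetical. The embedding step is omitted.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class MentionSketch:
    """Hypothetical stand-in for EntityMention: a text span plus a type."""
    text: str
    entity_type: str


@dataclass(frozen=True)
class EntitySketch:
    """Hypothetical stand-in for BaseEntity."""
    entity_id: str
    name: str
    entity_type: str
    status: str  # "canonical" or "provisional"
    usage_count: int = 1


# Assumption: entities are keyed by (normalized name, type).
Store = Dict[Tuple[str, str], EntitySketch]


def resolve_mention(mention: MentionSketch, store: Store) -> EntitySketch:
    """Map a mention to a stored entity, or create a provisional one."""
    key = (mention.text.strip().lower(), mention.entity_type)
    existing = store.get(key)
    if existing is not None:
        # Duplicate: bump the usage count on a new copy of the entity.
        entity = replace(existing, usage_count=existing.usage_count + 1)
    else:
        entity = EntitySketch(
            entity_id=f"prov:{len(store) + 1}",  # hypothetical ID scheme
            name=mention.text,
            entity_type=mention.entity_type,
            status="provisional",
        )
    store[key] = entity
    return entity


store: Store = {}
mentions = [
    MentionSketch("aspirin", "drug"),
    MentionSketch("Aspirin", "drug"),
    MentionSketch("headache", "symptom"),
]
for mention in mentions:
    resolve_mention(mention, store)
```

After the loop the store holds two provisional entities, and the one for "aspirin" carries a usage count of 2 because two mentions resolved to it.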

Pass 2: Relationship Extraction

RelationshipExtractor identifies connections between entities found in Pass 1
Relationships are validated against the domain schema
Stored relationships reference their endpoints by entity ID
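
One way to picture the schema validation step in Pass 2 is a check of each extracted relationship against a table of allowed endpoint types. The sketch below assumes such a table; RelationshipSketch, ALLOWED_PREDICATES, and validate_relationship are hypothetical names, and the real DomainSchema interface may look quite different.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RelationshipSketch:
    """Hypothetical stand-in for BaseRelationship; endpoints are entity IDs."""
    subject_id: str
    predicate: str
    object_id: str


# Assumption: the domain schema reduces to predicate -> (subject type, object type).
ALLOWED_PREDICATES: Dict[str, Tuple[str, str]] = {
    "treats": ("drug", "disease"),
}


def validate_relationship(
    rel: RelationshipSketch,
    entity_types: Dict[str, str],
    allowed: Dict[str, Tuple[str, str]] = ALLOWED_PREDICATES,
) -> bool:
    """Return True if the predicate is known and both endpoint types match."""
    expected = allowed.get(rel.predicate)
    if expected is None:
        return False
    subject_type = entity_types.get(rel.subject_id)
    object_type = entity_types.get(rel.object_id)
    # An endpoint missing from Pass 1 yields None and therefore fails the check.
    return (subject_type, object_type) == expected


# Entity IDs and types as they might come out of Pass 1.
entity_types = {"e1": "drug", "e2": "disease", "e3": "symptom"}

valid = validate_relationship(RelationshipSketch("e1", "treats", "e2"), entity_types)
wrong_type = validate_relationship(RelationshipSketch("e1", "treats", "e3"), entity_types)
unknown_entity = validate_relationship(RelationshipSketch("e1", "treats", "e9"), entity_types)
```

Because relationships carry only entity IDs, validation needs the Pass 1 output to look up endpoint types, which is one reason the passes run in this order.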

Entity Lifecycle

                 ┌──────────────┐
                 │   Mention    │
                 │ (text span)  │
                 └──────┬───────┘
                        │ resolve
                        ▼
          ┌─────────────────────────┐
          │                         │
          ▼                         ▼
┌─────────────────┐      ┌─────────────────┐
│   Provisional   │      │    Canonical    │
│    Entity       │      │     Entity      │
└────────┬────────┘      └─────────────────┘
         │                        ▲
         │ promote (when          │
         │ thresholds met)        │
         └────────────────────────┘

Promotion Criteria

Provisional entities are promoted to canonical when they meet thresholds defined in PromotionConfig:

min_usage_count: Minimum number of document appearances
min_confidence: Minimum confidence score from extraction
require_embedding: Whether embedding must be present
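
A promotion check over these three fields might look like the sketch below. The field names come from the list above; the default values, the inclusive comparisons, the ProvisionalSketch type, and the meets_promotion_thresholds helper are assumptions made for illustration, and dataclasses stand in for the Pydantic models.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PromotionConfigSketch:
    """Mirrors the PromotionConfig fields; the defaults here are invented."""
    min_usage_count: int = 3
    min_confidence: float = 0.8
    require_embedding: bool = True


@dataclass(frozen=True)
class ProvisionalSketch:
    """Hypothetical provisional entity with just the fields the check needs."""
    name: str
    usage_count: int
    confidence: float
    embedding: Optional[Tuple[float, ...]] = None


def meets_promotion_thresholds(
    entity: ProvisionalSketch, config: PromotionConfigSketch
) -> bool:
    """Return True only if every configured threshold is satisfied."""
    if entity.usage_count < config.min_usage_count:
        return False
    if entity.confidence < config.min_confidence:
        return False
    if config.require_embedding and entity.embedding is None:
        return False
    return True


config = PromotionConfigSketch()
ready = ProvisionalSketch("aspirin", usage_count=5, confidence=0.9, embedding=(0.1, 0.2))
too_rare = ProvisionalSketch("rare term", usage_count=1, confidence=0.95, embedding=(0.3, 0.4))
no_vector = ProvisionalSketch("headache", usage_count=4, confidence=0.9)

results = [meets_promotion_thresholds(e, config) for e in (ready, too_rare, no_vector)]
```

With require_embedding set to False, the third entity would also qualify, since it already meets the usage and confidence thresholds.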

Entity Merging

When canonical entities are detected as duplicates (via embedding similarity), they can be merged:

Synonyms and usage counts are combined
All relationship references are updated to point to the target entity
Source entities are removed
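
The three merge effects above can be sketched as a single function. This is a minimal illustration with hypothetical types (dataclasses in place of the Pydantic models) and one added assumption: the name of each source entity is kept as a synonym of the target. Duplicate detection via embedding similarity is out of scope here.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class EntitySketch:
    entity_id: str
    name: str
    synonyms: Tuple[str, ...] = ()
    usage_count: int = 0


@dataclass(frozen=True)
class RelationshipSketch:
    subject_id: str
    predicate: str
    object_id: str


def merge_entities(
    target_id: str,
    source_ids: Sequence[str],
    entities: Dict[str, EntitySketch],
    relationships: List[RelationshipSketch],
) -> List[RelationshipSketch]:
    """Fold the sources into the target and return the rewritten relationships."""
    target = entities[target_id]
    synonyms = list(target.synonyms)
    usage = target.usage_count
    for source_id in source_ids:
        source = entities.pop(source_id)  # the source entity is removed
        # Assumption: the source's own name becomes a synonym of the target.
        for name in (source.name, *source.synonyms):
            if name != target.name and name not in synonyms:
                synonyms.append(name)
        usage += source.usage_count
    entities[target_id] = replace(target, synonyms=tuple(synonyms), usage_count=usage)

    # Repoint every relationship endpoint that referenced a removed source.
    removed = set(source_ids)
    return [
        replace(
            rel,
            subject_id=target_id if rel.subject_id in removed else rel.subject_id,
            object_id=target_id if rel.object_id in removed else rel.object_id,
        )
        for rel in relationships
    ]


entities = {
    "e1": EntitySketch("e1", "aspirin", ("ASA",), 5),
    "e2": EntitySketch("e2", "acetylsalicylic acid", ("ASA",), 2),
}
relationships = [RelationshipSketch("e2", "treats", "e3")]
relationships = merge_entities("e1", ["e2"], entities, relationships)
```

Because the models are immutable, the merged entity and the repointed relationships are new instances; the entities dictionary is the only thing updated in place.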

Canonical ID System

The framework provides abstractions for working with canonical IDs (stable identifiers from authoritative sources):

CanonicalId: Pydantic model representing a canonical identifier with ID, URL, and synonyms
CanonicalIdCacheInterface: Abstract interface for caching canonical ID lookups
CanonicalIdLookupInterface: Abstract interface for looking up canonical IDs
Helper functions: Utilities for extracting canonical IDs from entities (extract_canonical_id_from_entity, check_entity_id_format)

Promotion policies use these abstractions to assign canonical IDs to entities. See Canonical IDs and Entity Resolution for details.
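
To make the division of labor between the cache and the lookup concrete, here is a rough sketch of a cache-first lookup. Only the CanonicalId field list (ID, URL, synonyms) comes from the description above; the class names, the get/put/lookup method names, the table-backed "authority", and the example identifier are all assumptions and should not be read as the real interfaces.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class CanonicalIdSketch:
    """Stand-in for CanonicalId: an ID, a URL, and synonyms."""
    id: str
    url: Optional[str] = None
    synonyms: Tuple[str, ...] = ()


class CacheSketch(ABC):
    """Hypothetical shape for a canonical ID cache interface."""

    @abstractmethod
    def get(self, term: str) -> Optional[CanonicalIdSketch]: ...

    @abstractmethod
    def put(self, term: str, canonical_id: CanonicalIdSketch) -> None: ...


class DictCache(CacheSketch):
    """In-memory cache; a file-backed cache would persist the same mapping."""

    def __init__(self) -> None:
        self._items: Dict[str, CanonicalIdSketch] = {}

    def get(self, term: str) -> Optional[CanonicalIdSketch]:
        return self._items.get(term)

    def put(self, term: str, canonical_id: CanonicalIdSketch) -> None:
        self._items[term] = canonical_id


class TableLookup:
    """Hypothetical lookup that consults the cache before the authority table."""

    def __init__(self, table: Dict[str, CanonicalIdSketch], cache: CacheSketch) -> None:
        self._table = table  # stands in for a remote authoritative source
        self._cache = cache
        self.authority_calls = 0

    def lookup(self, term: str) -> Optional[CanonicalIdSketch]:
        key = term.strip().lower()
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        self.authority_calls += 1
        found = self._table.get(key)
        if found is not None:
            self._cache.put(key, found)
        return found


# The identifier and URL below are placeholders, not real authority records.
table = {"aspirin": CanonicalIdSketch("EX:0001", "https://example.org/EX/0001", ("ASA",))}
lookup = TableLookup(table, DictCache())
first = lookup.lookup("Aspirin")
second = lookup.lookup("aspirin ")
calls_after_two_lookups = lookup.authority_calls
```

The second lookup is served from the cache, so the authority is consulted only once. Misses are not cached in this sketch; whether to cache negative results is a separate design decision.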

Major Components and How They Relate

Component	Role
kgschema	Data structures and ABCs only (entities, relationships, documents, domain, storage). No runtime logic.
kgraph	Ingestion pipeline: orchestrator, promotion, export, canonical_id lookup, in-memory storage, pipeline interfaces.
kgbundle	Lightweight Pydantic models for bundle exchange; used by both kgraph (producer) and kgserver (consumer).
kgserver	Query server: loads bundles, exposes REST, GraphQL, MCP; PostgreSQL/SQLite backends; optional Chainlit chat UI.
examples	Domain implementations (medlit, sherlock) and medlit_schema: concrete schemas, parsers, extractors, scripts.

Data flow: Documents → kgraph pipeline (using kgschema types and domain from examples) → bundle (kgbundle format) → kgserver loads bundle and serves queries.

Module Structure

The codebase is organized into these packages:

kgschema/ (Data Structure Definitions)

Submodule within kgraph containing only Pydantic models and ABC interfaces, no functional code:

kgschema/
├── entity.py              # BaseEntity, EntityStatus, EntityMention, PromotionConfig
├── relationship.py        # BaseRelationship
├── document.py            # BaseDocument
├── domain.py              # DomainSchema ABC
└── storage.py             # Storage interface ABCs

kgbundle/ (Bundle Exchange Models)

Separate lightweight package with minimal dependencies (pydantic only):

kgbundle/
├── models.py              # EntityRow, RelationshipRow, BundleManifestV1
└── pyproject.toml         # Standalone package configuration

Used by both kgraph (producer) and kgserver (consumer) for bundle file exchange.
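
The producer/consumer arrangement can be illustrated with a shared row type that one side serializes and the other parses. Everything specific in this sketch is an assumption: the fields of EntityRowSketch, the JSON Lines encoding, and the write_rows/read_rows helpers are invented for illustration and do not describe the actual EntityRow model or bundle file layout.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class EntityRowSketch:
    """Hypothetical row shape shared by producer and consumer."""
    entity_id: str
    name: str
    entity_type: str


def write_rows(rows: Iterable[EntityRowSketch]) -> str:
    """Producer side: encode each row as one JSON object per line."""
    return "\n".join(json.dumps(asdict(row), sort_keys=True) for row in rows)


def read_rows(payload: str) -> List[EntityRowSketch]:
    """Consumer side: parse lines back into the same row type."""
    return [
        EntityRowSketch(**json.loads(line))
        for line in payload.splitlines()
        if line.strip()
    ]


exported = [
    EntityRowSketch("e1", "aspirin", "drug"),
    EntityRowSketch("e2", "headache", "symptom"),
]
payload = write_rows(exported)
loaded = read_rows(payload)
```

Since both sides import the same row definition, a field added on the producer side without a matching change to the shared model fails loudly in read_rows rather than being silently dropped.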

kgraph/ (Main Framework)

Functional code implementing the ingestion pipeline:

kgraph/
├── ingest.py              # IngestionOrchestrator
├── promotion.py           # PromotionPolicy ABC
├── export.py              # Bundle export functionality
├── builders.py            # Builder utilities
├── canonical_id/          # Canonical ID system
│   ├── models.py          # CanonicalId model, CanonicalIdCacheInterface ABC
│   ├── json_cache.py      # JsonFileCanonicalIdCache implementation
│   ├── lookup.py          # CanonicalIdLookupInterface ABC
│   └── helpers.py         # Helper functions for promotion policies
├── storage/               # Storage implementations
│   └── memory.py          # In-memory storage
└── pipeline/              # Pipeline component interfaces
    ├── interfaces.py      # Parser, Extractor, Resolver ABCs
    └── embedding.py       # EmbeddingGeneratorInterface

kgserver/ (Query Server)

FastAPI app that loads a bundle and exposes:

REST: entities, relationships
GraphQL + GraphiQL
MCP server for LLM/agent tooling
Optional Chainlit chat UI at /chat
Graph visualization

Storage backends: PostgreSQL, SQLite. Producer (kgraph) and consumer (kgserver) share the kgbundle schema so the bundle is the contract.

examples/

medlit — Medical literature: JATS/PMC parser, LLM extraction, authority lookup (UMLS, etc.), dedup, bundle build. Reference implementation.
medlit_schema — Domain schema (entities, relationships, documents) for medlit.
sherlock — Simpler literary example (characters, stories, co-occurrence) showing framework generality.

Immutability

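All data models (entities, relationships, documents) are Pydantic models declared with frozen=True, so assigning to a field on an existing instance raises an error. Freezing is shallow: a mutable field value such as a list or dict can still be changed in place. Within that limit, immutability gives:

Safer sharing of instances across threads, since fields cannot be reassigned
Clear data flow (updates create new instances)
Predictable behavior in storage operations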

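Use model_copy(update={...}) to create a modified copy. Pydantic does not validate the values passed in update, so they must already have the correct types: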

updated_entity = entity.model_copy(update={"usage_count": entity.usage_count + 1})
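
For a version that runs without Pydantic installed, the same copy-on-update pattern can be shown with a frozen stdlib dataclass. This is an analogue rather than the framework's models: EntitySketch is hypothetical, and dataclasses.replace plays the role that model_copy(update={...}) plays for Pydantic models.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class EntitySketch:
    """Hypothetical frozen model standing in for a Pydantic entity."""
    name: str
    usage_count: int = 0


entity = EntitySketch(name="aspirin", usage_count=1)

# Direct assignment is rejected on a frozen instance.
try:
    entity.usage_count = 2  # type: ignore[misc]
    assignment_allowed = True
except FrozenInstanceError:
    assignment_allowed = False

# Updates produce a new instance; the original is left untouched.
updated_entity = replace(entity, usage_count=entity.usage_count + 1)
```

The original instance still reports a usage count of 1 after the update, which is what makes it safe to hand the same object to several consumers.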