Architecture Overview¶
Two-Pass Ingestion Pipeline¶
The framework processes documents in two passes:
┌─────────────┐     ┌─────────────┐     ┌─────────────────┐
│  Raw Docs   │────▶│   Parser    │────▶│  BaseDocument   │
└─────────────┘     └─────────────┘     └────────┬────────┘
                                                 │
                        ┌────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                         PASS 1                          │
│  ┌─────────────────┐     ┌──────────────────┐           │
│  │ Entity Extractor│────▶│ Entity Resolver  │           │
│  └─────────────────┘     └────────┬─────────┘           │
│           │                       │                     │
│           ▼                       ▼                     │
│    EntityMention[]          BaseEntity[]                │
│                      (canonical or provisional)         │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                         PASS 2                          │
│  ┌───────────────────────┐                              │
│  │ Relationship Extractor│                              │
│  └───────────┬───────────┘                              │
│              │                                          │
│              ▼                                          │
│     BaseRelationship[]                                  │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
                     ┌───────────────┐
                     │    Storage    │
                     └───────────────┘
Pass 1: Entity Extraction¶
- EntityExtractor scans documents and produces EntityMention objects (raw text spans with type classifications)
- EntityResolver maps mentions to existing canonical entities or creates provisional entities
- EmbeddingGenerator creates semantic vectors for new entities
- Entities are stored with usage counts incremented for duplicates
Pass 2: Relationship Extraction¶
- RelationshipExtractor identifies connections between entities found in Pass 1
- Relationships are validated against the domain schema
- Stored relationships link to entity IDs
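The two passes can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the framework's implementation: `Mention`, `Entity`, `extract_mentions`, `resolve`, `extract_relationships`, and `ingest` are hypothetical stand-ins, the "extractor" is a simple vocabulary match, the schema is a dictionary of allowed type pairs, and the embedding step is omitted.

```python
from dataclasses import dataclass, replace


# Hypothetical stand-ins for illustration only; not the kgschema classes.
@dataclass(frozen=True)
class Mention:
    text: str
    entity_type: str


@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    entity_type: str
    status: str = "provisional"
    usage_count: int = 1


def extract_mentions(doc, vocabulary):
    """Pass 1a: toy extractor that reports every known term found in the text."""
    lowered = doc.lower()
    return [Mention(term, etype) for term, etype in vocabulary.items() if term in lowered]


def resolve(mentions, store):
    """Pass 1b: reuse a stored entity (bumping its usage count) or create a provisional one."""
    resolved = []
    for mention in mentions:
        key = mention.text.lower()
        if key in store:
            existing = store[key]
            store[key] = replace(existing, usage_count=existing.usage_count + 1)
        else:
            store[key] = Entity(f"prov:{key}", mention.text, mention.entity_type)
        resolved.append(store[key])
    return resolved


def extract_relationships(entities, schema):
    """Pass 2: pair up entities from Pass 1 and keep only pairs the schema allows."""
    relationships = []
    for subj in entities:
        for obj in entities:
            predicate = schema.get((subj.entity_type, obj.entity_type))
            if predicate and subj.entity_id != obj.entity_id:
                # Relationships refer to entities by ID, not by object reference.
                relationships.append((subj.entity_id, predicate, obj.entity_id))
    return relationships


def ingest(docs, vocabulary, schema, store):
    """Run both passes over each document and collect the relationships."""
    all_relationships = []
    for doc in docs:
        entities = resolve(extract_mentions(doc, vocabulary), store)
        all_relationships.extend(extract_relationships(entities, schema))
    return all_relationships


store = {}
relationships = ingest(
    ["Aspirin relieves headache.", "Aspirin is widely used."],
    vocabulary={"aspirin": "drug", "headache": "disease"},
    schema={("drug", "disease"): "treats"},
    store=store,
)
```

After both documents are ingested, the `aspirin` entry in `store` has been seen twice and the single relationship links two entity IDs. Running entity resolution first means Pass 2 only ever deals with resolved IDs, which is the reason for the two-pass split.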
Entity Lifecycle¶
               ┌──────────────┐
               │   Mention    │
               │ (text span)  │
               └──────┬───────┘
                      │ resolve
                      ▼
         ┌─────────────────────────┐
         │                         │
         ▼                         ▼
┌─────────────────┐       ┌─────────────────┐
│   Provisional   │       │    Canonical    │
│     Entity      │       │     Entity      │
└────────┬────────┘       └─────────────────┘
         │                         ▲
         │  promote (when          │
         │  thresholds met)        │
         └─────────────────────────┘
Promotion Criteria¶
Provisional entities are promoted to canonical when they meet thresholds defined in PromotionConfig:
- min_usage_count: Minimum number of document appearances
- min_confidence: Minimum confidence score from extraction
- require_embedding: Whether embedding must be present
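A promotion check over these three thresholds might look like the sketch below. Only the three field names come from the list above. The class names `PromotionConfigSketch` and `ProvisionalEntitySketch`, the default values, and the `is_promotable` helper are assumptions made for illustration, and the real PromotionConfig may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PromotionConfigSketch:
    # Field names follow the documented thresholds; the defaults are assumed.
    min_usage_count: int = 3
    min_confidence: float = 0.8
    require_embedding: bool = True


@dataclass(frozen=True)
class ProvisionalEntitySketch:
    # Hypothetical minimal entity carrying just what the check needs.
    name: str
    usage_count: int
    confidence: float
    embedding: Optional[Sequence[float]] = None


def is_promotable(entity, config):
    """Return True only if the entity clears every configured threshold."""
    if entity.usage_count < config.min_usage_count:
        return False
    if entity.confidence < config.min_confidence:
        return False
    if config.require_embedding and entity.embedding is None:
        return False
    return True


config = PromotionConfigSketch()
seen_often = ProvisionalEntitySketch("aspirin", usage_count=5, confidence=0.9, embedding=(0.1, 0.2))
seen_once = ProvisionalEntitySketch("rare term", usage_count=1, confidence=0.95, embedding=(0.3, 0.4))
no_vector = ProvisionalEntitySketch("ibuprofen", usage_count=5, confidence=0.9)
```

With the assumed defaults, `seen_often` qualifies, `seen_once` fails the usage threshold, and `no_vector` qualifies only when `require_embedding` is switched off.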
Entity Merging¶
When canonical entities are detected as duplicates (via embedding similarity), they can be merged:
- Synonyms and usage counts are combined
- All relationship references are updated to point to the target entity
- Source entities are removed
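The merge steps above can be sketched as follows. Everything here is a hypothetical illustration: `EntitySketch`, `RelationshipSketch`, `cosine_similarity`, and `merge_entities` are not the framework's API, and adding each source entity's name to the target's synonyms, as well as dropping relationships that become identical after redirection, are assumptions.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EntitySketch:
    entity_id: str
    name: str
    synonyms: frozenset = frozenset()
    usage_count: int = 0


@dataclass(frozen=True)
class RelationshipSketch:
    subject_id: str
    predicate: str
    object_id: str


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def merge_entities(target, sources, entities, relationships):
    """Fold `sources` into `target` and return (merged, entities, relationships)."""
    source_ids = {s.entity_id for s in sources}

    # Combine synonyms and usage counts.
    synonyms = set(target.synonyms)
    usage = target.usage_count
    for source in sources:
        synonyms |= set(source.synonyms) | {source.name}
        usage += source.usage_count
    synonyms.discard(target.name)
    merged = replace(target, synonyms=frozenset(synonyms), usage_count=usage)

    # Remove the source entities and store the merged target.
    remaining = {eid: e for eid, e in entities.items() if eid not in source_ids}
    remaining[target.entity_id] = merged

    # Point every relationship reference at the target entity.
    def redirect(entity_id):
        return target.entity_id if entity_id in source_ids else entity_id

    redirected = [
        replace(r, subject_id=redirect(r.subject_id), object_id=redirect(r.object_id))
        for r in relationships
    ]
    # dict.fromkeys keeps first-seen order while dropping exact duplicates.
    return merged, remaining, list(dict.fromkeys(redirected))


aspirin = EntitySketch("C1", "aspirin", frozenset({"ASA"}), usage_count=4)
duplicate = EntitySketch("C2", "acetylsalicylic acid", usage_count=2)
merged, entities, relationships = merge_entities(
    aspirin,
    [duplicate],
    {"C1": aspirin, "C2": duplicate},
    [RelationshipSketch("C2", "treats", "D1"), RelationshipSketch("C1", "treats", "D1")],
)
```

Because the models are immutable, the merge returns new collections instead of editing entities in place. In practice a similarity threshold (not specified here) would decide which pairs are passed to `merge_entities`.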
Canonical ID System¶
The framework provides abstractions for working with canonical IDs (stable identifiers from authoritative sources):
- CanonicalId: Pydantic model representing a canonical identifier with ID, URL, and synonyms
- CanonicalIdCacheInterface: Abstract interface for caching canonical ID lookups
- CanonicalIdLookupInterface: Abstract interface for looking up canonical IDs
- Helper functions: Utilities for extracting canonical IDs from entities (extract_canonical_id_from_entity, check_entity_id_format)
Promotion policies use these abstractions to assign canonical IDs to entities. See Canonical IDs and Entity Resolution for details.
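To make the division of labor between the lookup and cache abstractions concrete, here is a rough sketch of a cache-first lookup. The method names (`lookup`, `get`, `put`), their signatures, and all class names below are assumptions chosen for illustration. The actual CanonicalIdLookupInterface and CanonicalIdCacheInterface may look quite different, and the identifier and URL are made-up placeholders.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CanonicalIdSketch:
    # Mirrors the documented shape (ID, URL, synonyms); not the real model.
    id: str
    url: Optional[str] = None
    synonyms: tuple = ()


class LookupSketch(ABC):
    @abstractmethod
    def lookup(self, term: str) -> Optional[CanonicalIdSketch]:
        """Return the canonical ID for a term, or None if unknown."""


class CacheSketch(ABC):
    @abstractmethod
    def get(self, term: str) -> Optional[CanonicalIdSketch]: ...

    @abstractmethod
    def put(self, term: str, canonical_id: CanonicalIdSketch) -> None: ...


class InMemoryCache(CacheSketch):
    def __init__(self):
        self._entries = {}

    def get(self, term):
        return self._entries.get(term.lower())

    def put(self, term, canonical_id):
        self._entries[term.lower()] = canonical_id


class StaticAuthority(LookupSketch):
    """Stands in for a remote authority; counts calls so caching is visible."""

    def __init__(self, table):
        self._table = table
        self.calls = 0

    def lookup(self, term):
        self.calls += 1
        return self._table.get(term.lower())


class CachedLookup(LookupSketch):
    """Consults the cache first and only falls back to the authority on a miss."""

    def __init__(self, authority, cache):
        self._authority = authority
        self._cache = cache

    def lookup(self, term):
        cached = self._cache.get(term)
        if cached is not None:
            return cached
        found = self._authority.lookup(term)
        if found is not None:
            self._cache.put(term, found)
        return found


authority = StaticAuthority(
    {"aspirin": CanonicalIdSketch("EX:0001", "https://example.org/EX:0001", ("ASA",))}
)
lookup = CachedLookup(authority, InMemoryCache())
first = lookup.lookup("Aspirin")
second = lookup.lookup("aspirin")
missing = lookup.lookup("unobtainium")
```

The second lookup is served from the cache, so the authority is called only once for that term. This sketch does not cache misses, so repeated unknown terms still reach the authority.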
Major Components and How They Relate¶
| Component | Role |
|---|---|
| kgschema | Data structures and ABCs only (entities, relationships, documents, domain, storage). No runtime logic. |
| kgraph | Ingestion pipeline: orchestrator, promotion, export, canonical_id lookup, in-memory storage, pipeline interfaces. |
| kgbundle | Lightweight Pydantic models for bundle exchange; used by both kgraph (producer) and kgserver (consumer). |
| kgserver | Query server: loads bundles, exposes REST, GraphQL, MCP; PostgreSQL/SQLite backends; optional Chainlit chat UI. |
| examples | Domain implementations (medlit, sherlock) and medlit_schema: concrete schemas, parsers, extractors, scripts. |
Data flow: Documents → kgraph pipeline (using kgschema types and domain from examples) → bundle (kgbundle format) → kgserver loads bundle and serves queries.
Module Structure¶
The codebase is organized into these packages:
kgschema/ (Data Structure Definitions)¶
Submodule within kgraph containing only Pydantic models and ABC interfaces, no functional code:
kgschema/
├── entity.py # BaseEntity, EntityStatus, EntityMention, PromotionConfig
├── relationship.py # BaseRelationship
├── document.py # BaseDocument
├── domain.py # DomainSchema ABC
└── storage.py # Storage interface ABCs
kgbundle/ (Bundle Exchange Models)¶
Separate lightweight package with minimal dependencies (pydantic only):
kgbundle/
├── models.py # EntityRow, RelationshipRow, BundleManifestV1
└── pyproject.toml # Standalone package configuration
Used by both kgraph (producer) and kgserver (consumer) for bundle file exchange.
kgraph/ (Main Framework)¶
Functional code implementing the ingestion pipeline:
kgraph/
├── ingest.py # IngestionOrchestrator
├── promotion.py # PromotionPolicy ABC
├── export.py # Bundle export functionality
├── builders.py # Builder utilities
├── canonical_id/ # Canonical ID system
│   ├── models.py # CanonicalId model, CanonicalIdCacheInterface ABC
│   ├── json_cache.py # JsonFileCanonicalIdCache implementation
│   ├── lookup.py # CanonicalIdLookupInterface ABC
│   └── helpers.py # Helper functions for promotion policies
├── storage/ # Storage implementations
│   └── memory.py # In-memory storage
└── pipeline/ # Pipeline component interfaces
    ├── interfaces.py # Parser, Extractor, Resolver ABCs
    └── embedding.py # EmbeddingGeneratorInterface
kgserver/ (Query Server)¶
FastAPI app that loads a bundle and exposes:
- REST: entities, relationships
- GraphQL + GraphiQL
- MCP server for LLM/agent tooling
- Optional Chainlit chat UI at /chat
- Graph visualization
Storage backends: PostgreSQL, SQLite. Producer (kgraph) and consumer (kgserver) share the kgbundle schema so the bundle is the contract.
examples/¶
- medlit — Medical literature: JATS/PMC parser, LLM extraction, authority lookup (UMLS, etc.), dedup, bundle build. Reference implementation.
- medlit_schema — Domain schema (entities, relationships, documents) for medlit.
- sherlock — Simpler literary example (characters, stories, co-occurrence) showing framework generality.
Immutability¶
All data models (entities, relationships, documents) are immutable Pydantic models with frozen=True. This ensures:
- Thread safety for concurrent access
- Clear data flow (updates create new instances)
- Predictable behavior in storage operations
Use model_copy(update={...}) to create modified copies:
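A minimal sketch of this pattern, assuming Pydantic v2. `ExampleEntity` and its fields are hypothetical and stand in for the framework's own models.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class ExampleEntity(BaseModel):
    # frozen=True makes instances reject attribute assignment.
    model_config = ConfigDict(frozen=True)

    entity_id: str
    name: str
    usage_count: int = 1


entity = ExampleEntity(entity_id="prov:aspirin", name="aspirin")

# model_copy returns a new instance; the original is left untouched.
updated = entity.model_copy(update={"usage_count": entity.usage_count + 1})

# Direct mutation of a frozen model raises a ValidationError in Pydantic v2.
try:
    entity.usage_count = 5
except ValidationError:
    mutation_rejected = True
else:
    mutation_rejected = False
```

Note that `model_copy(update=...)` does not re-validate the values it is given, so callers are responsible for passing values of the correct type.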