Overview
This codebase is a domain-agnostic knowledge graph framework for extracting entities and relationships from unstructured text. It addresses one core problem: turning documents (papers, legal text, reports) into a queryable graph in which every entity has a canonical identity and every fact carries provenance.
For the conceptual foundation — why knowledge graphs, why extraction from text, and why this approach — see the book Knowledge Graphs from Unstructured Text (kg-book). The technical docs here describe how the framework is built and how to use or extend it.
What this repo does
- Two-pass ingestion: Pass 1 extracts entities and resolves them to canonical or provisional IDs; Pass 2 extracts relationships between those entities.
- Pluggable domains: Each domain (medical literature, legal, literary, etc.) defines its own schema (entity types, relationship types, documents) and pipeline components.
- Canonical identity: Entities can be tied to external authorities (e.g. UMLS or RxNorm in medicine) or remain provisional until they meet promotion rules.
- Bundle export: Pipelines produce a validated bundle (entities, relationships, manifest) that the query server loads read-only.
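To make the flow above concrete, here is a minimal toy sketch of two-pass ingestion followed by bundle export. It is not the framework's API: every name in it (`Entity`, `Relationship`, `AUTHORITY`, `pass_one`, `pass_two`, `build_bundle`, `ingest`) is hypothetical, the authority IDs are placeholders rather than real UMLS or RxNorm codes, and simple string and regex matching stands in for real extraction.

```python
import re
from dataclasses import asdict, dataclass

# Hypothetical stand-in for an external authority lookup (placeholder IDs).
AUTHORITY = {"aspirin": "AUTH:0001", "headache": "AUTH:0002"}


@dataclass(frozen=True)
class Entity:
    entity_id: str   # canonical ID from the authority, or a provisional ID
    name: str
    canonical: bool


@dataclass(frozen=True)
class Relationship:
    subject_id: str
    predicate: str
    object_id: str
    source_doc: str  # provenance: which document asserted this


def pass_one(text, known_terms):
    """Pass 1: find entity mentions and resolve each to a canonical or provisional ID."""
    entities = {}
    for term in known_terms:
        if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            key = term.lower()
            if key in AUTHORITY:
                entities[key] = Entity(AUTHORITY[key], key, canonical=True)
            else:
                entities[key] = Entity(f"prov:{key}", key, canonical=False)
    return entities


def pass_two(doc_id, text, entities):
    """Pass 2: extract relationships, keeping only those between pass-1 entities."""
    relationships = []
    for match in re.finditer(r"(\w+) treats (\w+)", text, re.IGNORECASE):
        subject = entities.get(match.group(1).lower())
        obj = entities.get(match.group(2).lower())
        if subject and obj:
            relationships.append(
                Relationship(subject.entity_id, "treats", obj.entity_id, doc_id)
            )
    return relationships


def build_bundle(entities, relationships):
    """Validate that every relationship endpoint exists, then assemble the bundle."""
    ids = {entity.entity_id for entity in entities.values()}
    for rel in relationships:
        if rel.subject_id not in ids or rel.object_id not in ids:
            raise ValueError(f"relationship references unknown entity: {rel}")
    return {
        "entities": [asdict(e) for e in sorted(entities.values(), key=lambda e: e.entity_id)],
        "relationships": [asdict(r) for r in relationships],
        "manifest": {
            "entity_count": len(entities),
            "relationship_count": len(relationships),
        },
    }


def ingest(doc_id, text, known_terms):
    """Run both passes over one document and return a validated bundle."""
    entities = pass_one(text, known_terms)
    relationships = pass_two(doc_id, text, entities)
    return build_bundle(entities, relationships)
```

Calling `ingest("doc-1", "Aspirin treats headache. Willowbark treats headache.", ["aspirin", "headache", "willowbark"])` in this sketch resolves two entities against the toy authority and leaves `willowbark` with a provisional ID. Running entity resolution first means pass 2 can refer to stable IDs rather than raw surface strings.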
Where to go next
- Architecture — Components and how they relate.
- Schema Design Guide — Define your domain with kgschema.
- The Pipeline — Parsing, extraction, dedup, resolution, bundle building.
- Adapting to Your Domain — Step-by-step guide to add a new domain.