The Ingestion Pipeline
Placeholder — content to be migrated and expanded from
../pipeline.md.
Two-pass architecture
The pipeline uses a two-pass architecture:
- Pass 1 — Entity extraction: parse documents, extract entity mentions, resolve to canonical or provisional entities, build the entity table.
- Pass 2 — Relationship extraction: with a stable entity vocabulary in hand, identify relationships between resolved entities within each document.
This ordering matters: resolving entities first means the relationship extractor can refer to canonical IDs rather than raw text spans, which improves consistency and enables cross-document linking.
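The effect of resolving first can be seen in a minimal sketch. Everything here is an illustrative assumption rather than the pipeline's actual API: the `EntityTable` class, the `canon:`/`prov:` ID prefixes, and the use of simple co-occurrence as a stand-in for real relationship extraction.

```python
from dataclasses import dataclass, field


@dataclass
class EntityTable:
    """Hypothetical entity table: normalized surface form -> entity ID."""
    ids: dict[str, str] = field(default_factory=dict)

    def resolve(self, mention: str) -> str:
        key = mention.strip().lower()
        if key not in self.ids:
            # Unknown mentions become provisional entities (assumed ID scheme).
            self.ids[key] = f"prov:{key}"
        return self.ids[key]


def pass_one(docs: dict[str, list[str]], table: EntityTable) -> dict[str, list[str]]:
    """Pass 1: resolve every mention in every document to an entity ID."""
    return {doc_id: [table.resolve(m) for m in mentions]
            for doc_id, mentions in docs.items()}


def pass_two(resolved: dict[str, list[str]]) -> set[tuple[str, str, str]]:
    """Pass 2: emit edges between resolved entities within each document.

    Co-occurrence is only a placeholder for a real relationship extractor.
    """
    edges = set()
    for ids in resolved.values():
        unique = sorted(set(ids))
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                edges.add((a, "co_occurs_with", b))
    return edges


docs = {
    "doc1": ["Aspirin", "headache"],
    "doc2": ["aspirin", "Headache", "fever"],
}
table = EntityTable(ids={"aspirin": "canon:aspirin"})
resolved = pass_one(docs, table)
edges = pass_two(resolved)
```

Because both documents resolve "Aspirin" and "aspirin" to the same ID before pass 2 runs, the aspirin/headache edge produced from `doc1` and from `doc2` is the same tuple, which is what makes cross-document linking possible.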
Stages
document
→ parser (raw bytes → structured chunks)
→ extractor (chunks → mentions)
→ resolver (mentions → entities)
→ embedder (entities → vector embeddings)
→ bundle builder (entities + relationships → exportable bundle)
Each stage is defined as an abstract interface, and domain pipelines provide the concrete implementations.
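As a rough illustration of how the stages chain together, the sketch below wires toy stand-ins for each stage into a single `run` function. All function names and the toy logic (capitalized words as mentions, a two-number placeholder embedding) are assumptions made for the example. The relationships list is left empty because the stage diagram does not show where relationship extraction plugs in.

```python
def parse(raw: bytes) -> list[str]:
    """Parser: raw bytes -> structured chunks (here, non-empty lines)."""
    return [line.strip() for line in raw.decode("utf-8").splitlines() if line.strip()]


def extract(chunks: list[str]) -> list[str]:
    """Extractor: chunks -> mentions (here, capitalized words)."""
    return [w.strip(".,") for c in chunks for w in c.split() if w[0].isupper()]


def resolve(mentions: list[str]) -> list[str]:
    """Resolver: mentions -> entities (here, deduplicated lowercase IDs)."""
    return sorted({m.lower() for m in mentions})


def embed(entities: list[str]) -> dict[str, list[float]]:
    """Embedder: entities -> vectors (a trivial placeholder, not a real model)."""
    return {e: [float(len(e)), float(e.count("a"))] for e in entities}


def build_bundle(vectors: dict[str, list[float]], relationships: list) -> dict:
    """Bundle builder: entities + relationships -> exportable bundle."""
    return {"entities": vectors, "relationships": relationships}


def run(raw: bytes) -> dict:
    """Run the stages in the order given by the diagram."""
    chunks = parse(raw)
    mentions = extract(chunks)
    entities = resolve(mentions)
    vectors = embed(entities)
    return build_bundle(vectors, relationships=[])


bundle = run(b"Aspirin treats Headache.\n\nAspirin is cheap.\n")
```

Each function consumes exactly what the previous one returns, so any stage can be swapped for a domain-specific implementation as long as it keeps the same input and output shapes.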
Component interfaces
- DocumentParserInterface: takes a raw document, returns chunks.
- EntityExtractorInterface: takes a chunk, returns entity mentions.
- RelationshipExtractorInterface: takes a chunk + resolved entities, returns edges.
- ResolverInterface: maps mentions to canonical or provisional entities.
- EmbedderInterface: generates vector embeddings for entities.
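One plausible shape for these interfaces, sketched with Python's `abc` module, is shown below for two of the five. Only the interface names come from the list above. The method names (`extract`, `resolve`), the `Mention` and `Entity` types, and the `KeywordExtractor` and `DictResolver` implementations are hypothetical and chosen purely for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention type: a raw text span found in a chunk."""
    text: str


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity type: an ID plus a canonical/provisional flag."""
    entity_id: str
    provisional: bool


class EntityExtractorInterface(ABC):
    @abstractmethod
    def extract(self, chunk: str) -> list[Mention]:
        """Take a chunk, return entity mentions."""


class ResolverInterface(ABC):
    @abstractmethod
    def resolve(self, mention: Mention) -> Entity:
        """Map a mention to a canonical or provisional entity."""


class KeywordExtractor(EntityExtractorInterface):
    """Toy domain implementation: matches words against a keyword set."""

    def __init__(self, keywords: set[str]) -> None:
        self.keywords = keywords

    def extract(self, chunk: str) -> list[Mention]:
        return [Mention(w) for w in chunk.lower().split() if w in self.keywords]


class DictResolver(ResolverInterface):
    """Toy domain implementation: looks mentions up in a canonical dictionary."""

    def __init__(self, canonical: dict[str, str]) -> None:
        self.canonical = canonical

    def resolve(self, mention: Mention) -> Entity:
        if mention.text in self.canonical:
            return Entity(self.canonical[mention.text], provisional=False)
        # No canonical match: fall back to a provisional entity.
        return Entity(f"prov:{mention.text}", provisional=True)


extractor = KeywordExtractor({"aspirin", "fever"})
resolver = DictResolver({"aspirin": "canon:aspirin"})
entities = [resolver.resolve(m) for m in extractor.extract("Aspirin reduces fever")]
```

Declaring the methods with `@abstractmethod` means the base interfaces cannot be instantiated directly, so a domain pipeline that forgets to implement a stage fails at construction time rather than midway through a run.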