The Ingestion Pipeline

Placeholder — content to be migrated and expanded from ../pipeline.md.

Two-pass architecture

The pipeline uses a two-pass architecture:

Pass 1 — Entity extraction: parse documents, extract entity mentions, resolve to canonical or provisional entities, build the entity table.
Pass 2 — Relationship extraction: with a stable entity vocabulary in hand, identify relationships between resolved entities within each document.

This ordering matters: resolving entities first means the relationship extractor can refer to canonical IDs rather than raw text spans, which improves consistency and enables cross-document linking.
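The two passes can be sketched as follows. This is a minimal illustration, not the project's implementation: `run_two_pass`, `extract_mentions`, the `Edge` dataclass, the capitalized-token rule, and the `E1`, `E2` ID scheme are all hypothetical, and co-occurrence stands in for a real relationship extractor.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source_id: str
    target_id: str
    doc_id: str


def extract_mentions(text: str) -> list[str]:
    # Toy extractor (assumption): capitalized tokens count as entity mentions.
    return [tok.strip(".,") for tok in text.split() if tok[:1].isupper()]


def run_two_pass(docs: dict[str, str]) -> tuple[dict[str, str], list[Edge]]:
    # Pass 1: build the entity table (normalized mention -> entity ID)
    # across ALL documents before any relationship is extracted.
    entity_table: dict[str, str] = {}
    for text in docs.values():
        for mention in extract_mentions(text):
            key = mention.lower()
            if key not in entity_table:
                entity_table[key] = f"E{len(entity_table) + 1}"

    # Pass 2: with the table stable, emit edges between resolved IDs
    # that co-occur in the same document (a stand-in for real extraction).
    edges: list[Edge] = []
    for doc_id, text in docs.items():
        ids = sorted({entity_table[m.lower()] for m in extract_mentions(text)})
        for i, src in enumerate(ids):
            for dst in ids[i + 1:]:
                edges.append(Edge(src, dst, doc_id))
    return entity_table, edges


table, edges = run_two_pass({"d1": "Alice met Bob.", "d2": "Bob emailed Carol."})
```

Because "Bob" is resolved once in pass 1, both documents refer to it by the same ID in pass 2, which is what makes the edges linkable across documents.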

Stages

document
  → parser          (raw bytes → structured chunks)
  → extractor       (chunks → mentions)
  → resolver        (mentions → entities)
  → embedder        (entities → vector embeddings)
  → bundle builder  (entities + relationships → exportable bundle)

Each stage is defined as an abstract interface; domain pipelines supply the concrete implementations.
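As a rough sketch of how the stages chain together, the example below wires five toy functions in the order shown in the diagram. Every function name and rule here is an assumption made for illustration. The diagram does not show where relationships enter the flow, so the sketch passes an empty list to the bundle builder.

```python
from __future__ import annotations


def parse(raw: bytes) -> list[str]:
    # raw bytes -> structured chunks (toy rule: one chunk per non-empty line)
    return [ln.strip() for ln in raw.decode("utf-8").splitlines() if ln.strip()]


def extract(chunks: list[str]) -> list[str]:
    # chunks -> mentions (toy rule: capitalized tokens)
    return [
        tok.strip(".,")
        for chunk in chunks
        for tok in chunk.split()
        if tok[:1].isupper()
    ]


def resolve(mentions: list[str]) -> dict[str, str]:
    # mentions -> entities (entity ID -> name), deduplicated case-insensitively
    entities: dict[str, str] = {}
    for mention in mentions:
        entities.setdefault(f"ent:{mention.lower()}", mention)
    return entities


def embed(entities: dict[str, str]) -> dict[str, list[float]]:
    # entities -> vector embeddings (toy 2-d vector: name length, vowel count)
    return {
        eid: [float(len(name)), float(sum(c in "aeiou" for c in name.lower()))]
        for eid, name in entities.items()
    }


def build_bundle(entities: dict, embeddings: dict, relationships: list) -> dict:
    # entities + relationships -> exportable bundle (here just a plain dict)
    return {
        "entities": entities,
        "embeddings": embeddings,
        "relationships": relationships,
    }


def run(raw: bytes) -> dict:
    chunks = parse(raw)
    mentions = extract(chunks)
    entities = resolve(mentions)
    embeddings = embed(entities)
    return build_bundle(entities, embeddings, relationships=[])


bundle = run(b"Alice met Bob.\n\nBob left.\n")
```

Each function consumes exactly what the previous one produces, so a domain pipeline can swap any single stage without touching the others.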

Component interfaces

DocumentParserInterface — takes a raw document, returns chunks.
EntityExtractorInterface — takes a chunk, returns entity mentions.
RelationshipExtractorInterface — takes a chunk + resolved entities, returns edges.
ResolverInterface — maps mentions to canonical or provisional entities.
EmbedderInterface — generates vector embeddings for entities.
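One way these interfaces might look in Python is sketched below using `abc.ABC`. The interface names come from the list above; the method names, signatures, and the `Chunk`, `Mention`, `Entity`, and `Edge` data types are assumptions, as is the `DictionaryResolver` example implementation.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


# Hypothetical data types passed between stages.
@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


@dataclass(frozen=True)
class Mention:
    text: str


@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    provisional: bool


@dataclass(frozen=True)
class Edge:
    source_id: str
    target_id: str
    label: str


class DocumentParserInterface(ABC):
    @abstractmethod
    def parse(self, raw: bytes, doc_id: str) -> list[Chunk]: ...


class EntityExtractorInterface(ABC):
    @abstractmethod
    def extract(self, chunk: Chunk) -> list[Mention]: ...


class RelationshipExtractorInterface(ABC):
    @abstractmethod
    def extract(self, chunk: Chunk, entities: list[Entity]) -> list[Edge]: ...


class ResolverInterface(ABC):
    @abstractmethod
    def resolve(self, mention: Mention) -> Entity: ...


class EmbedderInterface(ABC):
    @abstractmethod
    def embed(self, entity: Entity) -> list[float]: ...


class DictionaryResolver(ResolverInterface):
    """Toy domain implementation: known names are canonical, the rest provisional."""

    def __init__(self, canonical: dict[str, str]) -> None:
        self._canonical = canonical  # lowercase name -> canonical ID

    def resolve(self, mention: Mention) -> Entity:
        key = mention.text.lower()
        if key in self._canonical:
            return Entity(self._canonical[key], mention.text, provisional=False)
        return Entity(f"prov:{key}", mention.text, provisional=True)


resolver = DictionaryResolver({"aspirin": "C001"})
known = resolver.resolve(Mention("Aspirin"))
unknown = resolver.resolve(Mention("Foo"))
```

Declaring the methods with `@abstractmethod` means an incomplete domain implementation fails at instantiation time rather than partway through a pipeline run.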
