Chunking Strategies

Placeholder — this page needs to be written.

Before an LLM can extract entities and relationships, a document must be segmented into chunks that fit within a context window and preserve enough surrounding context to make extraction meaningful.

Goals of chunking

Keep related content together (a claim and its evidence should be in the same chunk).
Avoid cutting across sentence or paragraph boundaries when possible.
Produce chunks small enough to fit the LLM context window but large enough to carry the context that extraction needs.
Preserve chunk identity so extracted mentions can be traced back to a specific location.

Strategies

Fixed-size with overlap

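Split at a fixed token count (e.g. 512 or 1024 tokens) with a sliding overlap (e.g. 10% of the chunk size), so that content near a boundary appears in both neighboring chunks. Simple and predictable. Works poorly when chunk boundaries fall mid-sentence or mid-paragraph, because the split ignores document structure.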
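A minimal sketch of a sliding window, assuming the document has already been tokenized. The function name `chunk_fixed` and its parameters are illustrative, and whitespace splitting stands in for whatever tokenizer the target LLM uses.

```python
def chunk_fixed(tokens, size=512, overlap=51):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
# chunks == [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

The final window may be shorter than `size` when the token count does not divide evenly; whether to keep, pad, or merge that tail is a design choice this sketch leaves open.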

Structure-aware splitting

Use document structure (headings, sections, paragraphs) as natural split points. Works well for structured formats like JATS XML or HTML. The parser stage identifies structure; the chunker respects it.
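One way this division of labor might look, as a sketch: the parser is assumed to emit sections with a heading path and a list of paragraphs, and the chunker packs whole paragraphs up to a token budget without crossing a section boundary. The `Section` type and `chunk_by_structure` function are hypothetical names, not an existing API.

```python
from dataclasses import dataclass


@dataclass
class Section:
    heading_path: tuple  # e.g. ("Methods", "Sampling"); assumed parser output
    paragraphs: list     # paragraph strings in document order


def chunk_by_structure(sections, max_tokens=512):
    """Pack whole paragraphs into chunks; never merge across sections."""
    chunks = []
    for section in sections:
        current, current_len = [], 0
        for para in section.paragraphs:
            n = len(para.split())  # whitespace count stands in for a tokenizer
            if current and current_len + n > max_tokens:
                # Budget exceeded: close the chunk at the paragraph boundary.
                chunks.append((section.heading_path, "\n\n".join(current)))
                current, current_len = [], 0
            current.append(para)
            current_len += n
        if current:
            chunks.append((section.heading_path, "\n\n".join(current)))
    return chunks


sections = [
    Section(("Intro",), ["one two three", "four five"]),
    Section(("Methods", "Sampling"), ["six seven eight nine", "ten"]),
]
chunks = chunk_by_structure(sections, max_tokens=4)
```

In this sketch a single paragraph longer than `max_tokens` becomes its own oversized chunk; a fuller implementation would probably fall back to sentence-level or fixed-size splitting for that case.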

Semantic splitting

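Use embedding similarity between adjacent sentences or paragraphs to find natural topic shifts, and split where similarity drops. More expensive, because every segment must be embedded, but produces chunks that are semantically coherent. Useful for long-form prose without clear structure.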
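The idea can be sketched by comparing each sentence with the one before it and starting a new chunk when similarity falls below a threshold. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the threshold of 0.2 is arbitrary.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_semantic(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever adjacent sentences fall below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([])  # topic shift detected: open a new chunk
        chunks[-1].append(sentence)
    return chunks


sentences = [
    "the enzyme binds the substrate",
    "the substrate changes the enzyme shape",
    "funding came from a national grant",
    "a national grant covered equipment costs",
]
chunks = chunk_semantic(sentences)  # two chunks: sentences 1-2 and 3-4
```

A real pipeline would pass its own `embed` function and would likely also enforce a maximum chunk size, since a long stretch on one topic never triggers a split here.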

Chunk metadata

Each chunk should carry:

Document identifier and version.
Section or heading path (if available).
Character or token offset within the document.
Chunk index and total chunk count.

This metadata flows through extraction and becomes part of provenance records.
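A possible shape for such a record, as a sketch only: the field names below are hypothetical and simply mirror the list above, using character offsets rather than token offsets. The helper shows how a mention found inside a chunk could be mapped back to a document-level location.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChunkMetadata:
    doc_id: str
    doc_version: str
    heading_path: tuple  # empty tuple when the source has no headings
    char_start: int      # offset of the chunk's first character in the document
    char_end: int        # offset one past the chunk's last character
    chunk_index: int     # zero-based position of this chunk
    chunk_count: int     # total number of chunks produced for the document


def mention_location(meta, local_start, local_end):
    """Map a span found inside a chunk back to document-level offsets."""
    return meta.char_start + local_start, meta.char_start + local_end


# Illustrative values only.
meta = ChunkMetadata("doc-001", "v2", ("Methods", "Sampling"), 1200, 1850, 3, 12)
span = mention_location(meta, 10, 25)  # (1210, 1225)
record = asdict(meta)  # plain dict, ready to attach to a provenance record
```

Making the record immutable (`frozen=True`) is one way to keep provenance stable as the chunk passes through later stages.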