Deduplication¶

Placeholder — this page needs to be written.

Deduplication handles the case where two entity records refer to the same real-world thing but have not yet been linked by canonical ID — either because both are provisional, or because authority lookup returned different IDs that are actually synonymous.

When dedup runs¶

Dedup runs after resolution, as a pass over the entity table. It looks for near-duplicate pairs and either merges them (if confidence is high) or flags them for review.

Signals used¶

Name similarity — string distance, normalized forms, known synonyms.
Embedding similarity — semantic distance between entity description embeddings.
Type agreement — two entities of different declared types are rarely the same thing.
Co-occurrence — entities that appear together in the same contexts are more likely to be distinct.

Merge vs. flag¶

Merge — one entity absorbs the other; all edges are rewritten to the survivor. The absorbed entity's ID is recorded as an alias.
Flag — the pair is marked as a candidate merge and surfaced for human review. Useful when confidence is below threshold or types conflict.

Provenance impact¶

When entities are merged, provenance is preserved: every source mention that pointed to either entity continues to be recorded. The audit trail is never collapsed.

Deduplication¶

When dedup runs¶

Signals used¶

Merge vs. flag¶

Provenance impact¶

See also¶