The Sherlock Example¶
A simpler, literary contrast case that shows the framework’s generality. It ingests Sherlock Holmes stories (e.g. from Project Gutenberg), extracts characters, locations, and stories, and builds relationships such as appears_in and co_occurs_with. No biomedical authorities — a good template for non-medical domains.
Schema¶
- Documents: Plain text or structured story documents (BaseDocument).
- Entities: Character, location, story; entity_id can be domain-minted (e.g.
holmes:char:SherlockHolmes). - Relationships:
appears_in(character → story),co_occurs_with(character ↔ character), etc. - Domain: DomainSchema defines the types and predicates; promotion config can be minimal (e.g. single use → canonical).
Pipeline¶
- Parser — Fetch or read Gutenberg text; produce a document per story (or chunk).
- Entity extraction — LLM or rule-based: identify characters, locations, story titles.
- Resolution — Map mentions to canonical or provisional entities (no external authority; IDs minted by domain).
- Relationship extraction — Who appears in which story; who co-occurs with whom.
- Export — Write bundle (manifest + entities.jsonl + relationships.jsonl) for kgserver.
Code layout¶
examples/sherlock/domain.py— Domain schema and entity/relationship types.examples/sherlock/pipeline/— Parser, extractors, resolver (and optional embeddings).examples/sherlock/sources/gutenberg.py— Fetching Gutenberg content.examples/sherlock/data.py— Data helpers if needed.
Why it’s useful¶
- No external APIs — Good for local runs and demos.
- Small corpus — Fast iteration on schema and prompts.
- Same patterns — DomainSchema, pipeline interfaces, bundle export; reuse the same ideas for legal, financial, or other domains.
Use medlit for authority-backed, production-style ingestion; use Sherlock for learning and for domains without a single canonical ID source.