Adapting to Your Domain
A step-by-step guide to adding a new domain to the framework: define your schema, write extraction prompts, configure your pipeline, and validate output.
1. Define your schema
- Entity types — What gets a node? Subclass BaseEntity for each type; register in your DomainSchema.entity_types.
- Relationship types — Which predicates? Subclass BaseRelationship; register in DomainSchema.relationship_types.
- Document type — Subclass BaseDocument for your input format (e.g. JATS for articles, plain sections for stories).
- Provenance — Include source document, section, and confidence in your relationship and evidence models so the graph stays traceable.
See Schema Design Guide. Use medlit_schema or sherlock as references.
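As a rough illustration of the schema step, the sketch below uses small stand-in dataclasses rather than the framework's real base classes, whose fields and constructors may differ. The names Provenance, Drug, Disease, Treats, the ID strings, and the "toy-medical" schema are all hypothetical; BaseDocument is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import ClassVar


# Stand-ins for the framework's base classes (assumed shapes, not the real API).
@dataclass(frozen=True)
class BaseEntity:
    entity_id: str
    name: str


@dataclass(frozen=True)
class Provenance:
    document_id: str
    section: str
    confidence: float


@dataclass(frozen=True)
class BaseRelationship:
    subject_id: str
    object_id: str
    provenance: Provenance
    predicate: ClassVar[str] = "related_to"


@dataclass
class DomainSchema:
    name: str
    entity_types: dict[str, type[BaseEntity]] = field(default_factory=dict)
    relationship_types: dict[str, type[BaseRelationship]] = field(default_factory=dict)


# Hypothetical domain types: one subclass per node type and per predicate.
class Drug(BaseEntity):
    """A pharmaceutical compound."""


class Disease(BaseEntity):
    """A medical condition."""


class Treats(BaseRelationship):
    predicate: ClassVar[str] = "treats"


# Register every type so later stages can look up the allowed vocabulary.
schema = DomainSchema(name="toy-medical")
schema.entity_types["drug"] = Drug
schema.entity_types["disease"] = Disease
schema.relationship_types[Treats.predicate] = Treats

aspirin = Drug(entity_id="drug:aspirin", name="aspirin")
migraine = Disease(entity_id="disease:migraine", name="migraine")
claim = Treats(
    subject_id=aspirin.entity_id,
    object_id=migraine.entity_id,
    provenance=Provenance(document_id="doc-1", section="results", confidence=0.9),
)
```

Carrying provenance as a required field on the relationship means a relationship cannot be constructed without a source document, section, and confidence.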
2. Write extraction prompts
If you use an LLM for extraction:
- Describe your entity types and relationship types in natural language so the model knows the allowed vocabulary.
- Prefer a single source of truth: one module (e.g. domain_spec.py) that defines entity types, predicates, and prompt instructions. The extraction prompt, validation, and dedup all consume it; one edit, no drift.
- Handle hedging and negation explicitly (e.g. "did not inhibit" → no relationship or low confidence).
- Ask for provenance (source span, confidence) in the same pass.
- Iterate on the prompt; test on a small corpus before scaling.
The medlit pipeline uses structured prompts in its extract step and in relationship extraction; see The medlit Example.
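The single-source-of-truth idea might look something like the sketch below. It is not medlit's actual domain_spec.py: the entity types, predicates, instruction wording, and the function names build_extraction_prompt and is_allowed are assumptions made for illustration. The point is that the prompt text and the validation check both read the same dictionaries.

```python
# domain_spec.py (hypothetical contents): the one place the vocabulary lives.
ENTITY_TYPES = {
    "drug": "a pharmaceutical compound",
    "disease": "a medical condition",
}

# predicate -> (subject type, object type)
PREDICATES = {
    "treats": ("drug", "disease"),
    "causes": ("drug", "disease"),
}

INSTRUCTIONS = (
    "Report only relationships the text asserts. "
    "If a statement is negated or hedged (e.g. 'did not inhibit'), "
    "omit it or give it low confidence. "
    "For each relationship, return the source span and a confidence in [0, 1]."
)


def build_extraction_prompt(text: str) -> str:
    """Render the allowed vocabulary and instructions into one prompt string."""
    entity_lines = [f"- {name}: {desc}" for name, desc in ENTITY_TYPES.items()]
    predicate_lines = [
        f"- {subj} {pred} {obj}" for pred, (subj, obj) in PREDICATES.items()
    ]
    return "\n".join(
        [
            "Allowed entity types:",
            *entity_lines,
            "Allowed relationships:",
            *predicate_lines,
            INSTRUCTIONS,
            "Text:",
            text,
        ]
    )


def is_allowed(subject_type: str, predicate: str, object_type: str) -> bool:
    """Validation reads the same table the prompt was built from."""
    return PREDICATES.get(predicate) == (subject_type, object_type)


prompt = build_extraction_prompt("Aspirin did not inhibit the enzyme.")
```

Adding a predicate to PREDICATES changes both the prompt and the validator, so the two cannot drift apart.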
3. Configure your pipeline
- Parser — Implement or reuse a parser that produces your BaseDocument (e.g. JATS XML → medlit document with sections).
- Chunking — If documents are long, chunk by section or sentence so each piece fits model limits; keep boundaries so provenance can point to sections.
- Entity extractor — LLM or NER; output EntityMention with type and confidence.
- Resolver — Use a synonym cache and optional authority lookup (e.g. UMLS) to map mentions to canonical or provisional entities.
- Dedup — Merge duplicate entities (e.g. by embedding similarity or name normalization) before or during resolution.
- Relationship extractor — Input: document + resolved entities; output: BaseRelationship list with subject_id, object_id, predicate, provenance.
- Bundle builder — Write entities and relationships to the kgbundle format (manifest + JSONL); run export from your storage.
Wire these in a script or orchestrator (e.g. medlit’s fetch_vocab → extract → ingest → build_bundle). See The Pipeline and Storage and Export.
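To make the wiring concrete, here is a toy end-to-end sketch of the parser, chunking, extractor, and resolver stages. It does not use the framework's classes: Section, Mention, the vocabulary lookup standing in for an LLM or NER model, the synonym cache contents, and the "provisional:" ID prefix are all assumptions. Relationship extraction and bundle building are omitted.

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    section_id: str
    text: str


@dataclass
class Mention:
    text: str
    entity_type: str
    confidence: float
    section_id: str


# Hypothetical synonym cache: lowercased surface form -> canonical ID.
SYNONYM_CACHE = {"aspirin": "drug:aspirin", "asa": "drug:aspirin"}

# Toy vocabulary standing in for an LLM or NER extractor.
VOCAB = {"aspirin": "drug", "asa": "drug", "migraine": "disease"}


def parse(raw: str) -> list[Section]:
    """Split on blank lines; each block becomes one section (the chunk unit)."""
    blocks = [b.strip() for b in raw.split("\n\n") if b.strip()]
    return [Section(f"sec-{i}", b) for i, b in enumerate(blocks, start=1)]


def extract_mentions(section: Section) -> list[Mention]:
    """Emit a mention for every token found in the toy vocabulary."""
    mentions = []
    for token in re.findall(r"[A-Za-z]+", section.text):
        entity_type = VOCAB.get(token.lower())
        if entity_type:
            mentions.append(Mention(token, entity_type, 1.0, section.section_id))
    return mentions


def resolve(mention: Mention) -> str:
    """Map a mention to a canonical ID, or to a stable provisional one."""
    key = mention.text.lower()
    return SYNONYM_CACHE.get(key, f"provisional:{key}")


def run_pipeline(raw: str) -> dict[str, set[str]]:
    """Return each resolved entity ID with the sections that mention it."""
    entities: dict[str, set[str]] = {}
    for section in parse(raw):
        for mention in extract_mentions(section):
            entities.setdefault(resolve(mention), set()).add(mention.section_id)
    return entities


result = run_pipeline("Aspirin relieves migraine.\n\nASA is a common analgesic.")
```

Keying the result on the resolved ID is a crude form of dedup by name normalization: "Aspirin" and "ASA" collapse into one entity, and the section IDs kept alongside it preserve provenance.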
4. Validate output
- Schema — Ensure every entity and relationship conforms to your domain schema (types, required fields).
- IDs — Canonical IDs must match the authority format; provisionals should have a unique, stable placeholder ID.
- Bundle — Validate against kgbundle models; load a test bundle in kgserver and run a few REST/GraphQL queries.
- Provenance — Spot-check that source document and (if applicable) section are present and correct.
Add unit tests for parser, extractor, resolver, and bundle build; use the in-memory storage for fast tests. See Contributing for testing approach.
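The checks above can be sketched as a small validator over plain dictionaries. This is not the kgbundle model validation; the ID patterns, the allowed predicates, the field names, and the example IDs such as "AUTH:D001" are hypothetical and would need to match your authority's format and your own schema.

```python
import re

# Hypothetical conventions, not taken from any authority or from kgbundle.
CANONICAL_ID = re.compile(r"^[A-Z]+:[A-Za-z0-9]+$")
PROVISIONAL_ID = re.compile(r"^provisional:[a-z0-9_-]+$")
ALLOWED_PREDICATES = {"treats", "causes"}
REQUIRED_FIELDS = ("subject_id", "object_id", "predicate", "provenance")


def is_valid_id(entity_id: str) -> bool:
    """Accept either a canonical ID or a provisional placeholder."""
    return bool(CANONICAL_ID.match(entity_id) or PROVISIONAL_ID.match(entity_id))


def validate_relationship(rel: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rel]
    if errors:
        return errors
    for role in ("subject_id", "object_id"):
        if not is_valid_id(rel[role]):
            errors.append(f"bad {role}: {rel[role]!r}")
    if rel["predicate"] not in ALLOWED_PREDICATES:
        errors.append(f"unknown predicate: {rel['predicate']!r}")
    provenance = rel["provenance"]
    if not provenance.get("document_id"):
        errors.append("provenance lacks document_id")
    confidence = provenance.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number in [0, 1]")
    return errors


good = {
    "subject_id": "AUTH:D001",
    "object_id": "provisional:migraine",
    "predicate": "treats",
    "provenance": {"document_id": "doc-1", "section": "results", "confidence": 0.8},
}
bad = {
    "subject_id": "aspirin",
    "object_id": "AUTH:X9",
    "predicate": "cures",
    "provenance": {"confidence": 1.5},
}
```

Returning a list of messages instead of raising on the first failure makes it easier to report every problem in a bundle in one pass, and the same function can be called from unit tests.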