Chapter 5: describe_schema -- Self-Orienting Graphs
In the early days of the web, connecting to a new API meant reading its
documentation. The documentation was a separate artifact -- a PDF, a wiki
page, a sequence of example curl commands -- maintained by humans, often
out of sync with the actual API, and unavailable to the software that needed
it. A client that wanted to know what endpoints were available had to be
told by a human who had read the docs.
This was not a fundamental limitation. Roy Fielding's REST
dissertation, published in 2000, included
hypermedia as a first-class constraint: a well-designed REST API should
carry, in its responses, the information a client needs to navigate it.
Links, not documentation. The API tells you what it can do; you don't need
to be told separately. This principle -- that interfaces should be
self-describing -- has become standard in modern API design. OpenAPI
specifications, GraphQL introspection, FastAPI's /docs endpoint: all are
expressions of the same idea.
describe_schema is BFS-QL's implementation of
this principle for knowledge graphs. An LLM connecting to a graph it has
never seen -- a private Fuseki instance, a domain-specific SPARQL endpoint,
a kgraph-derived Postgres store for a hospital's clinical data -- needs to
know what entity types and predicates exist before it can construct a
meaningful query. In the SPARQL world, this has typically meant reading
documentation. In BFS-QL, it requires one tool call.
What It Returns
A describe_schema response contains three things:
- graph_description: A human-readable string describing the graph and its domain -- what the data represents, where it came from, what kinds of questions it is meant to answer. This is provided by the graph operator when the BFS-QL server is configured. A well-written description tells the LLM whether this is the right graph for its current question.
- entity_types: The complete list of valid entity type names in the graph. These are exactly the values the LLM can pass as node_types in a bfs_query call. Not approximate names, not documentation -- the actual strings the query engine understands.
- predicates: The complete list of valid predicate names. These are exactly the values the LLM can pass as predicates in a bfs_query call.
The medlit graph, for example, returns 19 entity types and 16 predicates.
After one call, the LLM knows that drug, disease, and procedure are
valid node types -- and that protein and enzyme are also present, which
tells it something about the level of mechanistic detail in the graph. It
knows that TREATS, CAUSES, and INHIBITS are valid predicates -- and
that CITES and AUTHORED are also present, which tells it that the graph
includes bibliographic structure alongside clinical knowledge.
This is orientation in the strict sense. The LLM knows what it is looking at before it starts navigating.
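As a rough illustration, a describe_schema response could be modeled as a plain dictionary and used to check query arguments before they are sent. Only the three field names come from the text above; the dictionary layout, the truncated sample values, and the helper validate_query_args are assumptions made for this sketch, not BFS-QL's actual wire format or API.

```python
from __future__ import annotations

from typing import Iterable

# Hypothetical response shape: only the three field names are taken from the text.
schema = {
    "graph_description": "36 PubMed papers on Cushing disease and related endocrinology.",
    # Truncated samples; the real medlit graph reports 19 entity types and 16 predicates.
    "entity_types": ["drug", "disease", "procedure", "protein", "enzyme"],
    "predicates": ["TREATS", "CAUSES", "INHIBITS", "CITES", "AUTHORED"],
}


def validate_query_args(
    schema: dict, node_types: Iterable[str], predicates: Iterable[str]
) -> list[str]:
    """Return every name that is absent from the schema (empty list means all valid)."""
    valid_types = set(schema["entity_types"])
    valid_preds = set(schema["predicates"])
    unknown = [t for t in node_types if t not in valid_types]
    unknown += [p for p in predicates if p not in valid_preds]
    return unknown
```

Because the lists hold the exact strings the query engine understands, the check is a case-sensitive membership test: under this sketch, "treats" would be rejected even though "TREATS" is valid.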
Two Delivery Modes
The describe_schema tool can be called explicitly or made unnecessary
through a second mechanism: schema injection.
At startup, the BFS-QL server calls entity_types() and predicates()
on the backend and holds the results in memory. If the schema is small
enough -- the implementation uses a threshold of 20 entity types and 30
predicates -- the server injects the valid values directly into the
bfs_query tool description. The LLM reads the tool description before
it calls the tool, so it arrives at bfs_query already knowing what
node_types and predicates values are valid. No explicit describe_schema
call required.
For small schemas, this saves a round trip at almost no cost. The LLM doesn't spend a tool call on orientation; the orientation is already embedded in the interface.
The tradeoff is tool description size. A graph with 19 entity types and
16 predicates adds roughly 200 characters to the bfs_query description --
negligible. A graph with 200 entity types and 500 predicates would make the
tool description unwieldy and consume context before the LLM has done
anything. Above the threshold, injection is suppressed and an explicit
describe_schema call is the only path.
Both modes are supported transparently. The server chooses based on schema size. The outcome is the same either way: the LLM knows the schema before it issues its first query, whether that knowledge came from injection or from a tool call.
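The choice between the two modes can be sketched as a size check made when the tool description is built. The thresholds of 20 and 30 come from the text; everything else here is an assumption for illustration, including the function names, the placeholder description wording, and the guess that both limits must hold and are inclusive.

```python
MAX_INJECTED_ENTITY_TYPES = 20  # threshold stated in the text
MAX_INJECTED_PREDICATES = 30    # threshold stated in the text

# Placeholder wording, not the real bfs_query tool description.
BASE_DESCRIPTION = "Run a breadth-first query over the graph."


def should_inject(entity_types, predicates):
    """Decide whether the schema is small enough to embed in the tool description.

    Assumption: both limits must hold and the comparison is inclusive.
    """
    return (
        len(entity_types) <= MAX_INJECTED_ENTITY_TYPES
        and len(predicates) <= MAX_INJECTED_PREDICATES
    )


def build_bfs_query_description(entity_types, predicates):
    """Return the tool description, with valid values injected when the schema is small."""
    if not should_inject(entity_types, predicates):
        # Large schema: keep the description short and point at the explicit tool.
        return (
            BASE_DESCRIPTION
            + " Call describe_schema first to list valid node_types and predicates."
        )
    return (
        BASE_DESCRIPTION
        + " Valid node_types: " + ", ".join(entity_types) + "."
        + " Valid predicates: " + ", ".join(predicates) + "."
    )
```

Doing this once at startup, against the values already held in memory, means the decision adds nothing to individual queries.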
The graph_description as a First-Class Signal
The graph description is worth more attention than it usually receives. In the medlit example, it reads: "36 PubMed papers on Cushing disease and related endocrinology." That sentence tells an LLM several things that affect how it should reason:
- The corpus is small (36 papers). Claims that seem universal may be specific to this literature.
- The domain is focused (Cushing disease). Entities and relationships outside that domain are unlikely to be well-represented.
- The data source is biomedical literature. Relationships have provenance and carry confidence scores.
A graph operator deploying BFS-QL should treat the description as they
would treat a system prompt: an opportunity to shape how the LLM approaches
the data. "This graph contains inferred relationships; verify important
claims against source documents." "The entity type provisional indicates
entities whose canonical IDs could not be resolved." "Predicates are
directional; TREATS runs from drug to disease, not the reverse."
The server instructions mechanism serves a similar function. BFS-QL's
server sends a block of instructions to the LLM at session initialization,
before any tool calls. These instructions can include graph-specific
guidance that doesn't fit in the tool descriptions -- in the medlit
deployment, for example, the instructions note that entity IDs beginning
with prov: are provisional artifacts from the ingestion pipeline, carry
no external canonical meaning, and should be treated as anonymous
placeholders. Without that note, an LLM might waste reasoning cycles
wondering what a provisional ID like
prov:2e02b663d97c45499d4ce644abf81b8a refers to.
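The same convention can also be applied mechanically on the client side. The sketch below separates provisional IDs from the rest so that only canonical ones are looked up or shown by name. The prov: prefix is from the medlit example above; the helper names and the canonical ID used in the usage note are hypothetical.

```python
PROVISIONAL_PREFIX = "prov:"  # prefix described in the medlit server instructions


def is_provisional(entity_id: str) -> bool:
    """True when the ID is an ingestion-pipeline placeholder with no external meaning."""
    return entity_id.startswith(PROVISIONAL_PREFIX)


def split_ids(entity_ids):
    """Partition IDs into (canonical, provisional), preserving input order."""
    canonical = [e for e in entity_ids if not is_provisional(e)]
    provisional = [e for e in entity_ids if is_provisional(e)]
    return canonical, provisional
```

For instance, split_ids(["example:123", "prov:2e02b663d97c45499d4ce644abf81b8a"]) places the made-up "example:123" in the canonical list and the prov: ID in the provisional one.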
Self-description is not just schema. It is everything the graph operator knows about the data that the LLM would benefit from knowing before it starts.