Chapter 5: describe_schema -- Self-Orienting Graphs

In the early days of the web, connecting to a new API meant reading its documentation. The documentation was a separate artifact -- a PDF, a wiki page, a sequence of example curl commands -- maintained by humans, often out of sync with the actual API, and unavailable to the software that needed it. A client that wanted to know what endpoints were available had to be told by a human who had read the docs.

This was not a fundamental limitation. Roy Fielding's REST dissertation, published in 2000, included hypermedia as a first-class constraint: a well-designed REST API should carry, in its responses, the information a client needs to navigate it. Links, not documentation. The API tells you what it can do; you don't need to be told separately. This principle -- that interfaces should be self-describing -- has become standard in modern API design. OpenAPI specifications, GraphQL introspection, FastAPI's /docs endpoint: all are expressions of the same idea.

describe_schema is BFS-QL's implementation of this principle for knowledge graphs. An LLM connecting to a graph it has never seen -- a private Fuseki instance, a domain-specific SPARQL endpoint, a kgraph-derived Postgres store for a hospital's clinical data -- needs to know what entity types and predicates exist before it can construct a meaningful query. In the SPARQL world, this typically requires reading documentation. In BFS-QL, it requires one tool call.

What It Returns

A describe_schema response contains three things:

  • graph_description: A human-readable string describing the graph and its domain -- what the data represents, where it came from, what kinds of questions it is meant to answer. This is provided by the graph operator when the BFS-QL server is configured. A well-written description tells the LLM whether this is the right graph for its current question.

  • entity_types: The complete list of valid entity type names in the graph. These are exactly the values the LLM can pass as node_types in a bfs_query call. Not approximate names, not documentation -- the actual strings the query engine understands.

  • predicates: The complete list of valid predicate names. These are exactly the values the LLM can pass as predicates in a bfs_query call.

The medlit graph, for example, returns 19 entity types and 16 predicates. After one call, the LLM knows that drug, disease, and procedure are valid node types -- and that protein and enzyme are also present, which tells it something about the level of mechanistic detail in the graph. It knows that TREATS, CAUSES, and INHIBITS are valid predicates -- and that CITES and AUTHORED are also present, which tells it that the graph includes bibliographic structure alongside clinical knowledge.
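As a rough illustration of the three fields, the sketch below models a response as a small Python record and checks query arguments against it. Only the field names, the description string, and the listed type and predicate names come from this chapter; the class name SchemaDescription, the helper invalid_query_args, the exact-match (case-sensitive) comparison, and the name "medication" are assumptions made for the example, and the lists are a truncated subset of the medlit schema rather than the full 19 and 16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaDescription:
    """Hypothetical container for the three fields of a describe_schema response."""
    graph_description: str
    entity_types: tuple[str, ...]
    predicates: tuple[str, ...]


def invalid_query_args(schema, node_types, predicates):
    """Return the requested names that the schema does not list.

    Assumes names must match exactly, including case.
    """
    known_types = set(schema.entity_types)
    known_predicates = set(schema.predicates)
    return {
        "node_types": [t for t in node_types if t not in known_types],
        "predicates": [p for p in predicates if p not in known_predicates],
    }


# Truncated subset of the medlit schema, using only names mentioned in the text.
schema = SchemaDescription(
    graph_description="36 PubMed papers on Cushing disease and related endocrinology.",
    entity_types=("drug", "disease", "procedure", "protein", "enzyme"),
    predicates=("TREATS", "CAUSES", "INHIBITS", "CITES", "AUTHORED"),
)

# "medication" and lowercase "treats" are deliberately wrong guesses.
problems = invalid_query_args(schema, ["drug", "medication"], ["TREATS", "treats"])
print(problems)
```

A client that runs a check like this before calling bfs_query can catch a guessed name locally instead of discovering it through a failed query.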

This is orientation in the strict sense. The LLM knows what it is looking at before it starts navigating.

Two Delivery Modes

The describe_schema tool can be called explicitly or made unnecessary through a second mechanism: schema injection.

At startup, the BFS-QL server calls entity_types() and predicates() on the backend and holds the results in memory. If the schema is small enough -- the implementation uses a threshold of 20 entity types and 30 predicates -- the server injects the valid values directly into the bfs_query tool description. The LLM reads the tool description before it calls the tool, so it arrives at bfs_query already knowing which node_types and predicates values are valid. No explicit describe_schema call is required.

For small schemas, this optimization is nearly free. The LLM doesn't spend a tool call on orientation; the orientation is already embedded in the interface.

The tradeoff is tool description size. A graph with 19 entity types and 16 predicates adds roughly 200 characters to the bfs_query description -- negligible. A graph with 200 entity types and 500 predicates would make the tool description unwieldy and consume context before the LLM has done anything. Above the threshold, injection is suppressed and explicit calling is the path.

Both modes are supported transparently. The server chooses based on schema size. The LLM's behavior is the same either way: it knows the schema before it constructs its first query, whether that knowledge came from injection or from a tool call.
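The server's choice between the two modes can be sketched as a size check followed by string assembly. This is a minimal sketch, not the BFS-QL source: the limits of 20 and 30 come from the text, but the constant and function names, the base description wording, the inclusive comparison, and the requirement that both limits hold at once are assumptions.

```python
# Limits taken from the text; whether they are inclusive is an assumption.
MAX_INJECTED_ENTITY_TYPES = 20
MAX_INJECTED_PREDICATES = 30

# Hypothetical base wording for the bfs_query tool description.
BASE_DESCRIPTION = "Run a breadth-first query over the graph."


def should_inject(entity_types, predicates):
    """Inject only when both lists fit under their limits (assumed rule)."""
    return (
        len(entity_types) <= MAX_INJECTED_ENTITY_TYPES
        and len(predicates) <= MAX_INJECTED_PREDICATES
    )


def build_bfs_query_description(entity_types, predicates):
    """Return the tool description, with or without the injected schema."""
    if not should_inject(entity_types, predicates):
        # Large schema: point the LLM at the explicit tool instead.
        return (
            BASE_DESCRIPTION
            + " Call describe_schema first to list valid node_types and predicates."
        )
    return (
        BASE_DESCRIPTION
        + " Valid node_types: " + ", ".join(entity_types) + "."
        + " Valid predicates: " + ", ".join(predicates) + "."
    )


small = build_bfs_query_description(["drug", "disease"], ["TREATS", "CAUSES"])
large = build_bfs_query_description(
    [f"type_{i}" for i in range(200)], [f"PRED_{i}" for i in range(500)]
)
print(small)
print(large)
```

With this shape the decision is made once, when the tool description is built, so the LLM never has to know which mode it is in.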

The graph_description as a First-Class Signal

The graph description is worth more attention than it usually receives. In the medlit example, it reads: "36 PubMed papers on Cushing disease and related endocrinology." That sentence tells an LLM several things that affect how it should reason:

  • The corpus is small (36 papers). Claims that seem universal may be specific to this literature.
  • The domain is focused (Cushing disease). Entities and relationships outside that domain are unlikely to be well-represented.
  • The data source is biomedical literature. Relationships have provenance and carry confidence scores.

A graph operator deploying BFS-QL should treat the description as they would treat a system prompt: an opportunity to shape how the LLM approaches the data. Useful additions are caveats and conventions such as "This graph contains inferred relationships; verify important claims against source documents," "The entity type provisional indicates entities whose canonical IDs could not be resolved," or "Predicates are directional; TREATS runs from drug to disease, not the reverse."

The server instructions mechanism serves a similar function. BFS-QL's server sends a block of instructions to the LLM at session initialization, before any tool calls. These instructions can include graph-specific guidance that doesn't fit in the tool descriptions -- in the medlit deployment, for example, the instructions note that entity IDs beginning with prov: are provisional artifacts from the ingestion pipeline, carry no external canonical meaning, and should be treated as anonymous placeholders. Without that note, an LLM might waste reasoning cycles wondering what a provisional ID like prov:2e02b663d97c45499d4ce644abf81b8a refers to.
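The prov: convention is simple enough that client-side code can apply it mechanically as well. The sketch below separates provisional IDs from the rest of a result set. The prov: prefix and the sample provisional ID come from the text, while the function names and the ID "example:123" are hypothetical.

```python
PROVISIONAL_PREFIX = "prov:"  # prefix named in the medlit server instructions


def is_provisional(entity_id: str) -> bool:
    """True for IDs the ingestion pipeline minted as anonymous placeholders."""
    return entity_id.startswith(PROVISIONAL_PREFIX)


def split_provisional(entity_ids):
    """Partition IDs into (canonical, provisional), preserving input order."""
    canonical = [e for e in entity_ids if not is_provisional(e)]
    provisional = [e for e in entity_ids if is_provisional(e)]
    return canonical, provisional


ids = [
    "prov:2e02b663d97c45499d4ce644abf81b8a",  # sample ID from the text
    "example:123",                            # hypothetical canonical ID
]
canonical, provisional = split_provisional(ids)
print(canonical, provisional)
```

A check like this does not replace the instruction to the LLM; it only shows how little logic the convention requires once someone has written it down.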

Self-description is not just schema. It is everything the graph operator knows about the data that the LLM would benefit from knowing before it starts.