Chapter 6: The Query Model
The core of BFS-QL is a single query structure with a small, fixed set of parameters. Understanding why each parameter is present -- and why others are not -- is the key to using the protocol well and to implementing it correctly.
The Parameters
seeds is a list of canonical entity IDs. This is the starting point
of the traversal. Multiple seeds are supported because many useful questions
are inherently relational: not "what connects to this entity?" but "what do
these two entities have in common?" A multi-seed query issues a single BFS
from all seeds simultaneously and returns their combined neighborhood,
deduplicated. The LLM doesn't need to issue separate queries and merge the
results manually.
max_hops is an integer controlling traversal depth. A value of 1
returns only immediate neighbors; 2 also returns neighbors of neighbors; and
so on up to a maximum of 5. The practical guidance is to start at 1 and expand
only if the first result doesn't contain what you need. A 2-hop traversal
from a well-connected node in the medlit graph returns 84 nodes and 99
edges. A 3-hop traversal from the same node would return most of the graph.
Depth is a context budget decision, not a correctness decision -- the graph
is the same either way.
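The interaction of seeds and max_hops can be sketched as a multi-source breadth-first search. This is a minimal illustration, not the BFS-QL implementation: the function name multi_seed_bfs, the adjacency-list representation, and the toy node IDs are all assumptions made for the example.

```python
from collections import deque


def multi_seed_bfs(adjacency, seeds, max_hops):
    """Return {node_id: hop_distance} for every node within max_hops of any seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)  # every seed starts at depth 0
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # depth budget exhausted on this branch
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:  # deduplicates across seeds
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Toy undirected path graph A - B - C - D - E (hypothetical IDs).
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

one_hop = multi_seed_bfs(graph, ["A", "E"], max_hops=1)
two_hop = multi_seed_bfs(graph, ["A", "E"], max_hops=2)
```

With seeds A and E, the 1-hop result contains A, B, D, and E; the 2-hop result adds C, which both seeds reach but which appears only once. The graph is unchanged between the two calls; only the amount returned differs.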
node_types is an optional list of entity type names. Nodes whose type
matches receive full metadata in the response. Nodes whose type does not
match are returned as stubs -- present in the result with their ID and type,
but no metadata. Omitting node_types gives full metadata for all nodes,
which is appropriate when the graph is small or when the LLM needs
comprehensive information. Providing node_types focuses the context budget
on what matters.
predicates is an optional list of predicate names. Edges whose
predicate matches receive full metadata in the response, including confidence
scores, source documents, and provenance. Edges whose predicate does not
match are returned as bare subject-predicate-object triples. The behavior
is symmetric with node_types: topology is always present, detail is
selectively paid for.
topology_only is a boolean that, when true, suppresses all metadata
from the response. Every node is returned as a bare ID and type; every edge
as a bare subject-predicate-object triple. No node metadata, no edge
metadata, no provenance. The response is pure structural skeleton.
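The three detail-level parameters (node_types, predicates, topology_only) can be sketched as a rendering step applied after traversal. The record layout below (an "id", a "type", and a "metadata" field) and the helper names are assumptions for illustration; the sketch also assumes that omitting predicates gives full metadata for all edges, by symmetry with node_types.

```python
def render_node(node, node_types=None, topology_only=False):
    """Return the full node record, or a stub with only its ID and type."""
    wants_detail = node_types is None or node["type"] in node_types
    if wants_detail and not topology_only:
        return node
    return {"id": node["id"], "type": node["type"]}


def render_edge(edge, predicates=None, topology_only=False):
    """Return the full edge record, or a bare subject-predicate-object triple."""
    wants_detail = predicates is None or edge["predicate"] in predicates
    if wants_detail and not topology_only:
        return edge
    return {key: edge[key] for key in ("subject", "predicate", "object")}


# Toy records; IDs and metadata fields are illustrative only.
drug = {"id": "drug:1", "type": "drug", "metadata": {"name": "example drug"}}
disease = {"id": "disease:1", "type": "disease", "metadata": {"name": "example disease"}}
treats = {
    "subject": "drug:1",
    "predicate": "treats",
    "object": "disease:1",
    "metadata": {"confidence": 0.9},
}

full_drug = render_node(drug, node_types=["drug"])        # type matches: full record
stub_disease = render_node(disease, node_types=["drug"])  # no match: stub, still present
bare_edge = render_edge(treats, topology_only=True)       # metadata suppressed
```

In every case the node or edge is still returned; only the amount of detail changes.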
exclude_node_types is an optional list of entity type names to remove
entirely from the result. Unlike node_types (which demotes non-matching
nodes to stubs but keeps them), exclude_node_types removes the specified
types and all edges that touch them. The topology is no longer guaranteed
complete when this parameter is used -- that is the point. Use it to
suppress high-volume types that dominate large traversals without adding
conceptual value. The canonical use case is exclude_node_types=["paper",
"author"] on a concept-oriented query: papers and authors are the
connective tissue of a literature-derived graph and account for the majority
of nodes in a deep traversal, but an LLM reasoning about disease mechanisms
rarely needs them.
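As a rough sketch of the exclusion semantics, assuming nodes and edges are plain dictionaries and that every edge endpoint appears in the node list (the helper name apply_exclusions and the predicate names "mentions" and "wrote" are hypothetical):

```python
def apply_exclusions(nodes, edges, exclude_node_types):
    """Drop nodes of the excluded types and every edge that touches them."""
    excluded = set(exclude_node_types)
    kept_nodes = [n for n in nodes if n["type"] not in excluded]
    kept_ids = {n["id"] for n in kept_nodes}
    kept_edges = [
        e for e in edges
        if e["subject"] in kept_ids and e["object"] in kept_ids
    ]
    return kept_nodes, kept_edges


nodes = [
    {"id": "disease:1", "type": "disease"},
    {"id": "gene:1", "type": "gene"},
    {"id": "paper:1", "type": "paper"},
    {"id": "author:1", "type": "author"},
]
edges = [
    {"subject": "gene:1", "predicate": "associated_with", "object": "disease:1"},
    {"subject": "paper:1", "predicate": "mentions", "object": "disease:1"},
    {"subject": "author:1", "predicate": "wrote", "object": "paper:1"},
]

concept_nodes, concept_edges = apply_exclusions(nodes, edges, ["paper", "author"])
```

Only the disease and gene nodes and the single edge between them survive. Unlike a stub, an excluded node leaves no trace in the result.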
min_mentions is an optional integer (default 1, no filtering) that
removes nodes whose total_mentions field in metadata is below the
threshold, along with all edges touching them. This suppresses
low-confidence provisional entities that appear in only one or two source
documents and are structurally present but semantically unreliable. Nodes
without a total_mentions field are always included regardless of
threshold, so the filter is safe on backends that do not populate it. Note
that min_mentions filters the result, not the traversal -- a
low-mention node can still serve as a bridge to high-mention nodes at deeper
hops, but it will not appear in the returned result.
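The point that min_mentions filters the result rather than the traversal can be illustrated with a small sketch. It assumes the traversal has already produced node and edge lists, and that total_mentions lives under a "metadata" key; the helper name and the toy IDs are hypothetical.

```python
def apply_min_mentions(nodes, edges, min_mentions=1):
    """Drop nodes below the mention threshold, plus edges touching them."""

    def passes(node):
        mentions = node.get("metadata", {}).get("total_mentions")
        # Nodes without a total_mentions field are always kept.
        return mentions is None or mentions >= min_mentions

    kept_nodes = [n for n in nodes if passes(n)]
    kept_ids = {n["id"] for n in kept_nodes}
    kept_edges = [
        e for e in edges
        if e["subject"] in kept_ids and e["object"] in kept_ids
    ]
    return kept_nodes, kept_edges


# Output of a 2-hop traversal: "target" was reached only through "bridge".
traversal_nodes = [
    {"id": "seed", "type": "disease", "metadata": {"total_mentions": 12}},
    {"id": "bridge", "type": "gene", "metadata": {"total_mentions": 1}},
    {"id": "target", "type": "drug", "metadata": {"total_mentions": 5}},
    {"id": "untracked", "type": "pathway", "metadata": {}},
]
traversal_edges = [
    {"subject": "seed", "predicate": "associated_with", "object": "bridge"},
    {"subject": "bridge", "predicate": "targets", "object": "target"},
    {"subject": "seed", "predicate": "associated_with", "object": "untracked"},
]

filtered_nodes, filtered_edges = apply_min_mentions(
    traversal_nodes, traversal_edges, min_mentions=2
)
```

Here "target" stays in the result even though its only connection to the seed ran through the filtered-out "bridge" node, and "untracked" stays because it has no total_mentions field.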
limit and offset are optional integers for paginating large
results. limit caps the number of nodes returned; offset skips the
first N nodes. Together they allow an LLM to page through a large
neighborhood without requesting everything at once. node_count and
edge_count always reflect the full traversal regardless of pagination, so
the LLM can see the total size and decide whether to request more pages.
Edges are restricted to those with both endpoints in the returned node
window, so each page is a self-consistent subgraph. When neither parameter
is specified, the full result is returned unchanged.
The Flat Format
These parameters are passed as a flat JSON object. There is no nesting, no sub-query structure, no boolean expression language. The query either specifies seeds, a depth, and optional filters, or it doesn't. This flatness is a deliberate choice.
Query languages like SPARQL and GraphQL support arbitrarily nested structures because they need to -- they are designed to express complex constraints precisely. BFS-QL is not designed for precise constraint expression. It is designed for reliable generation by a language model. Every level of nesting in a query format is an opportunity for the model to make a structural error -- a misplaced bracket, a wrong level of indentation, a filter applied at the wrong scope. A flat format has no levels. The model either provides the parameter or it doesn't.
This is not a limitation on expressiveness. The parameters cover the full space of what BFS-QL needs to express. The flat format is exactly as expressive as the operation requires.
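To make the flatness concrete, here is a sketch of a query as a Python dictionary serialized to JSON. The parameter names come from this chapter; the assumption is that they appear on the wire under the same keys. The is_flat helper is hypothetical and exists only to show what "no nesting" means.

```python
import json

query = {
    "seeds": ["RxNorm:3251", "MeSH:D003480"],
    "max_hops": 1,
    "node_types": ["drug", "disease"],
    "predicates": ["treats"],
    "topology_only": False,
}


def is_flat(obj):
    """True if every value is a scalar or a list of scalars (no nested objects)."""
    scalars = (str, int, float, bool, type(None))
    for value in obj.values():
        if isinstance(value, list):
            if not all(isinstance(item, scalars) for item in value):
                return False
        elif not isinstance(value, scalars):
            return False
    return True


payload = json.dumps(query)  # the flat JSON object sent to the server
```

Every value is either a scalar or a list of scalars, so there is no scope in which a filter could be misplaced.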
Context Budget Management
The central design constraint of the query model is the context window. Every token in the response consumes context budget; too many tokens degrade reasoning. The query parameters are the mechanism for managing that budget.
The recommended query progression reflects this:
First: topology survey. Call bfs_query with topology_only=True
and max_hops=2.
This returns the complete structural skeleton of the
neighborhood -- every node and edge -- at minimum token cost. For the
medlit desmopressin example, this is 14,000 characters for 84 nodes and
99 edges. The LLM can read the full topology and identify what matters
before committing context budget to metadata.
Second: selective expansion. Call describe_entities with the IDs of
the nodes the topology survey identified as significant. This retrieves full
metadata for multiple nodes in a single call. The LLM pays for exactly the
information it has decided it needs, and nothing else. (The single-node
describe_entity remains available for one-off lookups; use
describe_entities when expanding several stubs at once.)
Third: targeted re-query. If a follow-up traversal is needed -- perhaps
the topology survey revealed an unexpected cluster that warrants its own
exploration -- issue a new bfs_query with node_types and predicates
filters focused on what matters. The third query is more expensive than the
first but more targeted: it retrieves full metadata only for the entity
types and predicates the LLM has decided are relevant.
This progression from cheap-and-broad to expensive-and-targeted is the working set principle in practice. The first query establishes the topological working set. The second and third queries fill in detail selectively.
Alternative for concept-dense graphs. On large literature-derived graphs, a topology survey at max_hops=2 may itself exceed the context budget -- hundreds of paper and author nodes dominate the result. In this case, skip the topology survey and issue a direct concept-only query:
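As a sketch, the parameters for such a query could be assembled as follows. The values are inferred from the surrounding description (paper and author nodes excluded, a minimum of 2 mentions, 1 hop); the seed MeSH:D001943 (breast cancer) is an assumption chosen for illustration, and the call shown in the comment assumes bfs_query accepts these names as keyword arguments.

```python
concept_query = {
    "seeds": ["MeSH:D001943"],                  # assumed seed: breast cancer
    "max_hops": 1,                              # expand to 2 only if too sparse
    "exclude_node_types": ["paper", "author"],  # drop the connective tissue
    "min_mentions": 2,                          # keep entities with 2+ mentions
}

# Assumed invocation, not executed here:
#   bfs_query(**concept_query)
```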
This returns only concept entities (diseases, genes, drugs, pathways, etc.)
with 2 or more corpus mentions -- high-signal nodes with full metadata --
in a single in-band response. The breast cancer 1-hop query on the
graphwright corpus returns 73 nodes and 86 edges this way, compared to
1,347 nodes in the unfiltered 2-hop result. Use max_hops=1 as the default
and expand to 2 only if the 1-hop result is too sparse.
Multi-Seed Queries
The multi-seed case deserves more attention than it typically receives, because it is the natural form for a large class of clinically and scientifically interesting questions.
bfs_query with multiple seeds returns the union of their neighborhoods,
deduplicated. This is useful for many questions: "What connects this disease
to this gene?" returns the combined neighborhood of both seeds, and the
structural answer -- the nodes that appear in both halves of the union --
is present in the result for an LLM to inspect. For small result sets, this
works well.
For larger graphs, union-and-inspect becomes unreliable. When each seed's
1-hop neighborhood contains hundreds of nodes, asking the LLM to identify
which nodes appear in both is structured bookkeeping that language models
do poorly -- they miss nodes, conflate similar IDs, and produce inconsistent
results. This is the problem intersect_subgraphs solves: it returns only
the nodes within k hops of every seed, without the LLM performing any
manual set operations.
The medlit example illustrates the bfs_query case. A 1-hop multi-seed
query from desmopressin (RxNorm:3251) and Cushing syndrome (MeSH:D003480)
returns 35 nodes and 37 edges. Of those, exactly two nodes are in the
direct neighborhood of both seeds: PMC11128938, the paper that co-describes
both entities, and DBPedia:Cushing's_disease, the specific disease subtype
that desmopressin treats. For a 36-paper graph at 1-hop depth, the LLM can
inspect the union reliably. For a larger graph or deeper traversal,
intersect_subgraphs is the right tool.
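The difference between union and intersection can be sketched with plain set operations. This illustrates the semantics described above ("nodes within k hops of every seed"), not the actual intersect_subgraphs implementation; the function names and the toy graph are assumptions, and whether the real tool includes the seeds themselves in its result is not specified here.

```python
from collections import deque


def within_k_hops(adjacency, seed, k):
    """Return the set of nodes at most k hops from a single seed."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def intersect_neighborhoods(adjacency, seeds, k):
    """Return the nodes that lie within k hops of every seed."""
    if not seeds:
        return set()
    neighborhoods = [within_k_hops(adjacency, seed, k) for seed in seeds]
    return set.intersection(*neighborhoods)


# Toy graph: a paper links a drug and a disease; each has one private neighbor.
graph = {
    "drug": ["paper", "gene"],
    "disease": ["paper", "symptom"],
    "paper": ["drug", "disease"],
    "gene": ["drug"],
    "symptom": ["disease"],
}

shared = intersect_neighborhoods(graph, ["drug", "disease"], k=1)
union = within_k_hops(graph, "drug", 1) | within_k_hops(graph, "disease", 1)
```

The union contains all five nodes, but only "paper" lies within one hop of both seeds. Computing that on the server removes the set bookkeeping from the LLM's side.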
What the Response Contains
A BFS-QL response contains:
- seeds: The seed IDs used. Included for reference -- in a multi-turn session, the LLM may need to recall which seeds were used for a given result.
- max_hops: The depth used.
- node_count and edge_count: Total counts. These are useful for calibrating follow-up queries -- a result with 200 nodes warrants a more targeted re-query than a result with 15.
- nodes: A list of node records. Each is either a full Node (with metadata) or a stub EntityStub (ID and type only), depending on whether its type matched node_types.
- edges: A list of edge records. Each is either a full EdgeWithMetadata (with confidence, source documents, and provenance) or a bare Edge (subject, predicate, object only), depending on whether its predicate matched predicates.
- schema_summary: The entity types and predicates actually present in this result subgraph, regardless of the filters applied. See the next section.
One design choice worth noting: stub nodes are always included. If a
Disease node is present in the topology but node_types=["drug"], the
Disease node appears as a stub -- ID and type, no metadata. It is not
omitted. The topology is always complete. This is the separation of
topology from presentation that Chapter 3 argued for: filtering controls
detail level, not presence.
Schema Discovery in Results
Every BFS-QL query response includes a schema_summary field containing the
entity types and predicates actually present in that result subgraph.
This applies to both bfs_query and intersect_subgraphs.
This is a first-class feature, not an implementation detail.
"schema_summary": {
"entity_types_found": ["disease", "drug", "gene", "paper"],
"predicates_found": ["associated_with", "targets", "treats"]
}
The value of schema_summary is especially clear in two situations.
Large or open-world graphs. describe_schema may return
comprehensive=False when the graph is too large to enumerate entity
types and predicates exhaustively -- a Wikidata endpoint, for instance,
has thousands of predicates that cannot all be listed upfront. In this
case, the LLM cannot know what filters are valid before issuing a query.
schema_summary solves the problem by reporting the vocabulary
actually present in the neighborhood. After a topology_only survey,
the LLM can read schema_summary and use those values as node_types
and predicates filters in a targeted follow-up query. No documentation
needed, no guessing at predicate names.
Paginated results. When limit and offset are used to page through
a large traversal, schema_summary always reflects the full traversal,
not just the current page. The LLM sees the complete vocabulary of the
neighborhood even if it is only reading a window of nodes. This matters
because the decision about which types and predicates to filter on should
be made with knowledge of the whole subgraph, not just the first page.
schema_summary closes the loop that describe_schema opens. Together
they ensure an LLM always has valid filter values available, whether from
the static schema at startup or from the live vocabulary of a result.
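One way an LLM-side client might use schema_summary is to validate candidate filter values before a follow-up query. The sketch below assumes the field names shown in the example above; the helper follow_up_filters and the candidate values "protein" and "inhibits" are hypothetical.

```python
def follow_up_filters(schema_summary, wanted_types, wanted_predicates):
    """Keep only the requested filter values that the result actually contains."""
    types_found = set(schema_summary["entity_types_found"])
    predicates_found = set(schema_summary["predicates_found"])
    return {
        "node_types": [t for t in wanted_types if t in types_found],
        "predicates": [p for p in wanted_predicates if p in predicates_found],
    }


summary = {
    "entity_types_found": ["disease", "drug", "gene", "paper"],
    "predicates_found": ["associated_with", "targets", "treats"],
}

filters = follow_up_filters(
    summary,
    wanted_types=["drug", "protein"],         # "protein" is absent from the result
    wanted_predicates=["treats", "inhibits"],  # "inhibits" is absent as well
)
```

The resulting filters contain only "drug" and "treats", so the follow-up query cannot reference vocabulary the neighborhood does not contain.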
Name Disambiguation in search_entities
search_entities accepts a node_types parameter that restricts results
to entities of the specified types. This exists to address a common
disambiguation problem.
Common scientific terms match multiple entity types. "Breast cancer"
matches the disease concept (MeSH:D001943) and also dozens of papers
whose titles contain the phrase. When an LLM calls search_entities to
resolve a disease name, it typically wants the disease concept, not the
papers. Without node_types, the results may be dominated by papers; the
disease entity may not appear in the top results at all.
```python
search_entities("breast cancer", node_types=["disease"])
```