Chapter 6: The Query Model

The core of BFS-QL is a single query structure: five core parameters, plus a few optional filters and pagination controls. Understanding why each parameter is present -- and why others are not -- is the key to using the protocol well and to implementing it correctly.

The Parameters

seeds is a list of canonical entity IDs. This is the starting point of the traversal. Multiple seeds are supported because many useful questions are inherently relational: not "what connects to this entity?" but "what do these two entities have in common?" A multi-seed query issues a single BFS from all seeds simultaneously and returns their combined neighborhood, deduplicated. The LLM doesn't need to issue separate queries and merge the results manually.

max_hops is an integer controlling traversal depth. A value of 1 returns only immediate neighbors; 2 adds neighbors of neighbors; and so on up to a maximum of 5. The practical guidance is to start at 1 and expand only if the first result doesn't contain what you need. A 2-hop traversal from a well-connected node in the medlit graph returns 84 nodes and 99 edges. A 3-hop traversal from the same node would return most of the graph. Depth is a context budget decision, not a correctness decision -- the graph is the same either way.
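The interaction of seeds and max_hops can be sketched as a bounded multi-source breadth-first search. This is a minimal illustration, not the protocol's implementation: it assumes the graph is available as an in-memory adjacency mapping, and the function name bfs_neighborhood and the toy graph are hypothetical.

```python
from collections import deque


def bfs_neighborhood(adjacency, seeds, max_hops):
    """Return {node_id: hop_distance} for nodes within max_hops of any seed."""
    if not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be between 1 and 5")
    distance = {seed: 0 for seed in seeds}  # every seed starts at hop 0
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # depth budget spent; do not expand this node
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:  # deduplicates across seeds
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical toy graph: a simple chain A - B - C - D - E.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

one_hop = bfs_neighborhood(toy_graph, ["A", "E"], max_hops=1)
two_hop = bfs_neighborhood(toy_graph, ["A", "E"], max_hops=2)
```

Because both seeds enter the queue before any expansion, a node reachable from either seed is recorded once, at its distance from the nearest seed.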

node_types is an optional list of entity type names. Nodes whose type matches receive full metadata in the response. Nodes whose type does not match are returned as stubs -- present in the result with their ID and type, but no metadata. Omitting node_types gives full metadata for all nodes, which is appropriate when the graph is small or when the LLM needs comprehensive information. Providing node_types focuses the context budget on what matters.

predicates is an optional list of predicate names. Edges whose predicate matches receive full metadata in the response, including confidence scores, source documents, and provenance. Edges whose predicate does not match are returned as bare subject-predicate-object triples. The behavior is symmetric with node_types: topology is always present, detail is selectively paid for.

topology_only is a boolean that, when true, suppresses all metadata from the response. Every node is returned as a bare ID and type; every edge as a bare subject-predicate-object triple. No node metadata, no edge metadata, no provenance. The response is pure structural skeleton.
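The detail-level rules for node_types, predicates, and topology_only can be sketched as two small rendering functions. The dictionary field names (id, type, metadata, subject, predicate, object), the function names, and the sample records are assumptions made for illustration; the real record types may differ.

```python
def render_node(node, node_types=None, topology_only=False):
    """Return a full node record, or a stub carrying only ID and type."""
    stub = {"id": node["id"], "type": node["type"]}
    if topology_only:
        return stub  # topology_only suppresses all metadata
    if node_types is not None and node["type"] not in node_types:
        return stub  # non-matching types are demoted, never dropped
    return {**stub, "metadata": node.get("metadata", {})}


def render_edge(edge, predicates=None, topology_only=False):
    """Return a full edge record, or a bare subject-predicate-object triple."""
    triple = {key: edge[key] for key in ("subject", "predicate", "object")}
    if topology_only:
        return triple
    if predicates is not None and edge["predicate"] not in predicates:
        return triple
    return {**triple, "metadata": edge.get("metadata", {})}


# Hypothetical sample records.
drug = {"id": "ex:drug-1", "type": "drug", "metadata": {"name": "example drug"}}
edge = {
    "subject": "ex:drug-1",
    "predicate": "treats",
    "object": "ex:disease-1",
    "metadata": {"confidence": 0.9},
}

stub_view = render_node(drug, node_types=["disease"])
full_view = render_node(drug)
bare_edge = render_edge(edge, predicates=["targets"])
```

In every branch the node or edge is still returned; only the amount of detail changes.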

exclude_node_types is an optional list of entity type names to remove entirely from the result. Unlike node_types (which demotes non-matching nodes to stubs but keeps them), exclude_node_types removes the specified types and all edges that touch them. The topology is no longer guaranteed complete when this parameter is used -- that is the point. Use it to suppress high-volume types that dominate large traversals without adding conceptual value. The canonical use case is exclude_node_types=["paper", "author"] on a concept-oriented query: papers and authors are the connective tissue of a literature-derived graph and account for the majority of nodes in a deep traversal, but an LLM reasoning about disease mechanisms rarely needs them.
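A minimal sketch of the removal semantics, assuming nodes and edges are plain dictionaries with id, type, subject, and object fields; the function name, the predicate names, and the sample data are hypothetical.

```python
def apply_exclude_node_types(nodes, edges, exclude_node_types):
    """Drop nodes of the excluded types and every edge that touches them."""
    excluded = set(exclude_node_types)
    kept_nodes = [n for n in nodes if n["type"] not in excluded]
    kept_ids = {n["id"] for n in kept_nodes}
    kept_edges = [
        e for e in edges
        if e["subject"] in kept_ids and e["object"] in kept_ids
    ]
    return kept_nodes, kept_edges


# Hypothetical data: a disease and a gene linked directly and via a paper.
nodes = [
    {"id": "ex:disease-1", "type": "disease"},
    {"id": "ex:paper-1", "type": "paper"},
    {"id": "ex:gene-1", "type": "gene"},
]
edges = [
    {"subject": "ex:paper-1", "predicate": "mentions", "object": "ex:disease-1"},
    {"subject": "ex:paper-1", "predicate": "mentions", "object": "ex:gene-1"},
    {"subject": "ex:gene-1", "predicate": "associated_with", "object": "ex:disease-1"},
]

kept_nodes, kept_edges = apply_exclude_node_types(nodes, edges, ["paper", "author"])
```

The two paper edges disappear along with the paper, so the path through the paper is no longer visible in the result. That is the sense in which the topology stops being complete.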

min_mentions is an optional integer (default 1, no filtering) that removes nodes whose total_mentions field in metadata is below the threshold, along with all edges touching them. This suppresses low-confidence provisional entities that appear in only one or two source documents and are structurally present but semantically unreliable. Nodes without a total_mentions field are always included regardless of threshold, so the filter is safe on backends that do not populate it. Note that min_mentions filters the result, not the traversal -- a low-mention node can still serve as a bridge to high-mention nodes at deeper hops, but it will not appear in the returned result.
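The threshold rule, including the pass-through for nodes that lack total_mentions, might look like the sketch below. It assumes total_mentions sits under a metadata dictionary on each node; the function name and sample data are hypothetical.

```python
def apply_min_mentions(nodes, edges, min_mentions=1):
    """Drop nodes below the mention threshold, plus edges touching them."""

    def passes(node):
        mentions = node.get("metadata", {}).get("total_mentions")
        # Nodes without the field are always kept, whatever the threshold.
        return mentions is None or mentions >= min_mentions

    kept_nodes = [n for n in nodes if passes(n)]
    kept_ids = {n["id"] for n in kept_nodes}
    kept_edges = [
        e for e in edges
        if e["subject"] in kept_ids and e["object"] in kept_ids
    ]
    return kept_nodes, kept_edges


# Hypothetical data: one well-attested node, one weak node, one unscored node.
nodes = [
    {"id": "ex:a", "type": "gene", "metadata": {"total_mentions": 5}},
    {"id": "ex:b", "type": "gene", "metadata": {"total_mentions": 1}},
    {"id": "ex:c", "type": "gene", "metadata": {}},
]
edges = [
    {"subject": "ex:a", "predicate": "associated_with", "object": "ex:b"},
    {"subject": "ex:a", "predicate": "associated_with", "object": "ex:c"},
]

filtered_nodes, filtered_edges = apply_min_mentions(nodes, edges, min_mentions=2)
```

The filter runs on an already-computed node and edge list, which matches the note that min_mentions filters the result rather than the traversal.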

limit and offset are optional integers for paginating large results. limit caps the number of nodes returned; offset skips the first N nodes. Together they allow an LLM to page through a large neighborhood without requesting everything at once. node_count and edge_count always reflect the full traversal regardless of pagination, so the LLM can see the total size and decide whether to request more pages. Edges are filtered to those whose endpoints both appear in the returned node window, so each page is a self-consistent subgraph. When neither parameter is specified the full result is returned unchanged.
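A sketch of the pagination rule under the same dictionary-shaped assumptions as before; the function name paginate and the four-node chain are hypothetical.

```python
def paginate(nodes, edges, limit=None, offset=0):
    """Return one page of nodes plus the edges internal to that page."""
    end = None if limit is None else offset + limit
    window = nodes[offset:end]
    window_ids = {n["id"] for n in window}
    page_edges = [
        e for e in edges
        if e["subject"] in window_ids and e["object"] in window_ids
    ]
    return {
        "node_count": len(nodes),  # counts describe the full traversal
        "edge_count": len(edges),
        "nodes": window,
        "edges": page_edges,
    }


# Hypothetical chain: a - b - c - d.
nodes = [{"id": i, "type": "gene"} for i in ("a", "b", "c", "d")]
edges = [
    {"subject": "a", "predicate": "associated_with", "object": "b"},
    {"subject": "b", "predicate": "associated_with", "object": "c"},
    {"subject": "c", "predicate": "associated_with", "object": "d"},
]

page_one = paginate(nodes, edges, limit=2, offset=0)
page_two = paginate(nodes, edges, limit=2, offset=2)
```

Under this reading of the rule, the b-c edge appears on neither page because its endpoints fall in different windows; the unchanged edge_count is the signal that some edges are not shown.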

The Flat Format

All of these parameters are passed as a flat JSON object. There is no nesting, no sub-query structure, no boolean expression language. The query either specifies seeds, a depth, and optional filters, or it doesn't. This flatness is a deliberate choice.

Query languages like SPARQL and GraphQL support arbitrarily nested structures because they need to -- they are designed to express complex constraints precisely. BFS-QL is not designed for precise constraint expression. It is designed for reliable generation by a language model. Every level of nesting in a query format is an opportunity for the model to make a structural error -- a misplaced bracket, a wrong level of indentation, a filter applied at the wrong scope. A flat format has no levels. The model either provides the parameter or it doesn't.
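To make the flat shape concrete, here is what a query could look like on the wire, with a small check that no value is nested. The parameter names and seed IDs come from this chapter; the helper is_flat is a hypothetical illustration, not part of the protocol.

```python
import json

# Every value is a scalar or a list of scalars; nothing is nested.
query = {
    "seeds": ["RxNorm:3251", "MeSH:D003480"],
    "max_hops": 1,
    "node_types": ["disease", "drug"],
    "predicates": ["treats"],
    "topology_only": False,
}

SCALARS = (str, int, float, bool, type(None))


def is_flat(obj):
    """True if each value is a scalar or a list containing only scalars."""
    return all(
        isinstance(value, SCALARS)
        or (isinstance(value, list) and all(isinstance(i, SCALARS) for i in value))
        for value in obj.values()
    )


payload = json.dumps(query)
nested_counterexample = {"filter": {"and": [{"type": "drug"}]}}
```

A generator that can only emit keys at one level has no scope to get wrong, which is the property the flat format is designed around.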

This is not a limitation on expressiveness. The parameters cover the full space of what BFS-QL needs to express; a flat format is exactly as expressive as the operation requires.

Context Budget Management

The central design constraint of the query model is the context window. Every token in the response consumes context budget; too many tokens degrade reasoning. The query parameters are the mechanism for managing that budget.

The recommended query progression reflects this:

First: topology survey. Call bfs_query with topology_only=True and max_hops=2. This returns the complete structural skeleton of the neighborhood -- every node and edge -- at minimum token cost. For the medlit desmopressin example, this is 14,000 characters for 84 nodes and 99 edges. The LLM can read the full topology and identify what matters before committing context budget to metadata.

Second: selective expansion. Call describe_entities with the IDs of the nodes the topology survey identified as significant. This retrieves full metadata for multiple nodes in a single call. The LLM pays for exactly the information it has decided it needs, and nothing else. (The single-node describe_entity remains available for one-off lookups; use describe_entities when expanding several stubs at once.)

Third: targeted re-query. If a follow-up traversal is needed -- perhaps the topology survey revealed an unexpected cluster that warrants its own exploration -- issue a new bfs_query with node_types and predicates filters focused on what matters. The third query is more expensive than the first but more targeted: it retrieves full metadata only for the entity types and predicates the LLM has decided are relevant.

This progression from cheap-and-broad to expensive-and-targeted is the working set principle in practice. The first query establishes the topological working set. The second and third queries fill in detail selectively.
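The three steps can be sketched as one driver function. Everything here is an assumption layered on the tool names bfs_query and describe_entities: the client object, its method signatures, the choose_ids callback, the reuse of max_hops=2 for the re-query, and the RecordingClient stand-in, which records calls instead of contacting a server.

```python
def progressive_query(client, seed_id, choose_ids, node_types=None, predicates=None):
    """Run the survey -> expand -> re-query progression against a client."""
    # First: topology survey at minimum token cost.
    survey = client.bfs_query(seeds=[seed_id], max_hops=2, topology_only=True)

    # Second: selective expansion of the nodes judged significant.
    wanted = choose_ids(survey)
    details = client.describe_entities(wanted) if wanted else []

    # Third: targeted re-query, only when filters were chosen.
    targeted = None
    filters = {}
    if node_types:
        filters["node_types"] = node_types
    if predicates:
        filters["predicates"] = predicates
    if filters:
        targeted = client.bfs_query(seeds=[seed_id], max_hops=2, **filters)
    return survey, details, targeted


class RecordingClient:
    """Hypothetical stand-in that logs calls and returns canned data."""

    def __init__(self):
        self.calls = []

    def bfs_query(self, **params):
        self.calls.append(("bfs_query", params))
        return {
            "nodes": [
                {"id": "ex:drug-1", "type": "drug"},
                {"id": "ex:disease-1", "type": "disease"},
            ],
            "edges": [],
        }

    def describe_entities(self, ids):
        self.calls.append(("describe_entities", list(ids)))
        return [{"id": entity_id, "metadata": {}} for entity_id in ids]


client = RecordingClient()
survey, details, targeted = progressive_query(
    client,
    "ex:drug-1",
    choose_ids=lambda result: [n["id"] for n in result["nodes"] if n["type"] == "disease"],
    node_types=["disease"],
    predicates=["treats"],
)
```

In practice the choose_ids step is the LLM's own judgment after reading the survey; the callback only marks where that decision sits in the sequence.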

Alternative for concept-dense graphs. On large literature-derived graphs, a topology survey at max_hops=2 may itself exceed the context budget -- hundreds of paper and author nodes dominate the result. In this case, skip the topology survey and issue a direct concept-only query:

bfs_query(
    seeds=[seed_id],
    max_hops=1,
    exclude_node_types=["paper", "author"],
    min_mentions=2,
)

This returns only concept entities (diseases, genes, drugs, pathways, etc.) with 2 or more corpus mentions -- high-signal nodes with full metadata -- in a single in-band response. The breast cancer 1-hop query on the graphwright corpus returns 73 nodes and 86 edges this way, compared to 1,347 nodes in the unfiltered 2-hop result. Use max_hops=1 as the default and expand to 2 only if the 1-hop result is too sparse.

Multi-Seed Queries

The multi-seed case deserves more attention than it typically receives, because it is the natural form for a large class of clinically and scientifically interesting questions.

bfs_query with multiple seeds returns the union of their neighborhoods, deduplicated. This is useful for many questions: "What connects this disease to this gene?" returns the combined neighborhood of both seeds, and the structural answer -- the nodes that appear in both halves of the union -- is present in the result for an LLM to inspect. For small result sets, this works well.

For larger graphs, union-and-inspect becomes unreliable. When each seed's 1-hop neighborhood contains hundreds of nodes, asking the LLM to identify which nodes appear in both is structured bookkeeping that language models do poorly -- they miss nodes, conflate similar IDs, and produce inconsistent results. This is the problem intersect_subgraphs solves: it returns only the nodes within k hops of every seed, without the LLM performing any manual set operations.
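The set operation that intersect_subgraphs takes off the LLM's hands can be sketched as one bounded BFS per seed followed by a set intersection. This illustrates the idea only: the function names, the adjacency-mapping representation, and the toy graph are hypothetical, and the real tool may compute the result differently.

```python
from collections import deque


def within_k_hops(adjacency, seed, k):
    """Return the set of nodes at most k hops from seed (seed included)."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def intersect_neighborhoods(adjacency, seeds, k):
    """Return the nodes that lie within k hops of every seed."""
    neighborhoods = [within_k_hops(adjacency, seed, k) for seed in seeds]
    return set.intersection(*neighborhoods) if neighborhoods else set()


# Hypothetical graph: two seeds sharing two neighbors, plus one private
# neighbor each.
toy_graph = {
    "drug": ["paper", "subtype", "x"],
    "disease": ["paper", "subtype", "y"],
    "paper": ["drug", "disease"],
    "subtype": ["drug", "disease"],
    "x": ["drug"],
    "y": ["disease"],
}

shared = intersect_neighborhoods(toy_graph, ["drug", "disease"], k=1)
union = within_k_hops(toy_graph, "drug", 1) | within_k_hops(toy_graph, "disease", 1)
```

The union holds six nodes and the intersection two. Computing the intersection server-side means the LLM receives only the shared nodes rather than having to find them in the larger union.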

The medlit example illustrates the bfs_query case. A 1-hop multi-seed query from desmopressin (RxNorm:3251) and Cushing syndrome (MeSH:D003480) returns 35 nodes and 37 edges. Of those, exactly two nodes are in the direct neighborhood of both seeds: PMC11128938, the paper that co-describes both entities, and DBPedia:Cushing's_disease, the specific disease subtype that desmopressin treats. For a 36-paper graph at 1-hop depth, the LLM can inspect the union reliably. For a larger graph or deeper traversal, intersect_subgraphs is the right tool.

What the Response Contains

A BFS-QL response contains:

  • seeds: The seed IDs used. Included for reference -- in a multi-turn session, the LLM may need to recall which seeds were used for a given result.
  • max_hops: The depth used.
  • node_count and edge_count: Total counts. These are useful for calibrating follow-up queries -- a result with 200 nodes warrants a more targeted re-query than a result with 15.
  • nodes: A list of node records. Each is either a full Node (with metadata) or a stub EntityStub (ID and type only), depending on whether its type matched node_types.
  • edges: A list of edge records. Each is either a full EdgeWithMetadata (with confidence, source documents, and provenance) or a bare Edge (subject, predicate, object only), depending on whether its predicate matched predicates.
  • schema_summary: The entity types and predicates actually present in this result subgraph, regardless of the filters applied. See the next section.

One design choice worth noting: stub nodes are always included. If a Disease node is present in the topology but node_types=["drug"], the Disease node appears as a stub -- ID and type, no metadata. It is not omitted. Under node_types and predicates filtering the topology is always complete; only the explicitly subtractive parameters (exclude_node_types, min_mentions, and pagination via limit and offset) remove nodes from a result. This is the separation of topology from presentation that Chapter 3 argued for: filtering controls detail level, not presence.
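The response shape listed above could be modelled with dataclasses along the following lines. The class names Node, EntityStub, Edge, and EdgeWithMetadata come from the list; the field names, the inheritance relationships, the BfsResponse name, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EntityStub:
    id: str
    type: str


@dataclass
class Node(EntityStub):
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    subject: str
    predicate: str
    object: str


@dataclass
class EdgeWithMetadata(Edge):
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class BfsResponse:
    seeds: list[str]
    max_hops: int
    node_count: int
    edge_count: int
    nodes: list[EntityStub]  # full Node records and stubs, mixed
    edges: list[Edge]        # full and bare edges, mixed
    schema_summary: dict[str, list[str]]


# Hypothetical response for node_types=["drug"]: the disease is a stub.
response = BfsResponse(
    seeds=["ex:drug-1"],
    max_hops=1,
    node_count=2,
    edge_count=1,
    nodes=[
        Node(id="ex:drug-1", type="drug", metadata={"name": "example drug"}),
        EntityStub(id="ex:disease-1", type="disease"),
    ],
    edges=[Edge(subject="ex:drug-1", predicate="treats", object="ex:disease-1")],
    schema_summary={
        "entity_types_found": ["disease", "drug"],
        "predicates_found": ["treats"],
    },
)

stubs = [n for n in response.nodes if not isinstance(n, Node)]
```

The stub list is a natural input for a follow-up describe_entities call, since it names exactly the nodes whose metadata was withheld.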

Schema Discovery in Results

Every BFS-QL query response includes a schema_summary field containing the entity types and predicates actually present in that result subgraph. This applies to both bfs_query and intersect_subgraphs. It is a first-class feature, not an implementation detail.

"schema_summary": {
  "entity_types_found": ["disease", "drug", "gene", "paper"],
  "predicates_found": ["associated_with", "targets", "treats"]
}

The value of schema_summary is especially clear in two situations.

Large or open-world graphs. describe_schema may return comprehensive=False when the graph is too large to enumerate entity types and predicates exhaustively -- a Wikidata endpoint, for instance, has thousands of predicates that cannot all be listed upfront. In this case, the LLM cannot know what filters are valid before issuing a query. schema_summary solves the problem by reporting the vocabulary actually present in the neighborhood. After a topology_only survey, the LLM can read schema_summary and use those values as node_types and predicates filters in a targeted follow-up query. No documentation needed, no guessing at predicate names.

Paginated results. When limit and offset are used to page through a large traversal, schema_summary always reflects the full traversal, not just the current page. The LLM sees the complete vocabulary of the neighborhood even if it is only reading a window of nodes. This matters because the decision about which types and predicates to filter on should be made with knowledge of the whole subgraph, not just the first page.

schema_summary closes the loop that describe_schema opens. Together they ensure an LLM always has valid filter values available, whether from the static schema at startup or from the live vocabulary of a result.
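A sketch of how a schema_summary could be derived from a full traversal and then reused as filter values. The key names entity_types_found and predicates_found follow the example above; the function name, the alphabetical ordering, and the sample data are assumptions.

```python
def build_schema_summary(nodes, edges):
    """Collect the vocabulary actually present in a traversal result."""
    return {
        "entity_types_found": sorted({n["type"] for n in nodes}),
        "predicates_found": sorted({e["predicate"] for e in edges}),
    }


# Hypothetical full traversal, computed before any pagination is applied.
nodes = [
    {"id": "ex:1", "type": "drug"},
    {"id": "ex:2", "type": "disease"},
    {"id": "ex:3", "type": "gene"},
    {"id": "ex:4", "type": "paper"},
    {"id": "ex:5", "type": "paper"},
]
edges = [
    {"subject": "ex:1", "predicate": "treats", "object": "ex:2"},
    {"subject": "ex:1", "predicate": "targets", "object": "ex:3"},
    {"subject": "ex:3", "predicate": "associated_with", "object": "ex:2"},
]

summary = build_schema_summary(nodes, edges)

# Filter values for a follow-up query come straight from the summary,
# so they are valid for this neighborhood by construction.
follow_up_node_types = [t for t in summary["entity_types_found"] if t != "paper"]
```

Building the summary from the unpaginated node and edge lists is what lets it describe the whole neighborhood even when only one page of nodes is returned.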

Name Disambiguation in search_entities

search_entities accepts a node_types parameter that restricts results to entities of the specified types. This exists to address a common disambiguation problem.

Common scientific terms match multiple entity types. "Breast cancer" matches the disease concept (MeSH:D001943) and also dozens of papers whose titles contain the phrase. When an LLM calls search_entities to resolve a disease name, it typically wants the disease concept, not the papers. Without node_types, the results may be dominated by papers; the disease entity may not appear in the top results at all.

```python
search_entities("breast cancer", node_types=["disease"])