Chapter 10: The SPARQL Backend

The Postgres backend covers kgraph-derived graphs -- graphs that were built by the extraction pipeline and live in a database the developer controls. But there are thousands of knowledge graphs that predate BFS-QL, that were built by other teams for other purposes, and that expose themselves through SPARQL 1.1 endpoints. DBpedia, Wikidata, UniProt, ChEMBL, the Gene Ontology, the NCI Thesaurus -- these are public graphs with public endpoints, accumulated over decades, containing knowledge that no extraction pipeline will soon replicate. A SPARQL backend makes all of them accessible through the same six-tool interface.

The SPARQL Endpoint Model

A SPARQL 1.1 endpoint accepts HTTP GET or POST requests with a query parameter containing a SPARQL query string and returns results as JSON, XML, CSV, or TSV, depending on the Accept header. The endpoint URL is the only required configuration. The backend sends queries over HTTP POST and parses the SPARQL JSON results format, in which each result row is a set of variable bindings.
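The request and response shapes described above can be sketched with the standard library alone. This is a minimal illustration, not the backend's actual code: the function names build_request and parse_bindings are hypothetical, and the request is built but never sent so the example runs offline.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a SPARQL 1.1 Protocol POST request."""
    # The query travels as a form-encoded "query" parameter.
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            # Ask for the SPARQL JSON results format.
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def parse_bindings(payload: str) -> list[dict[str, dict]]:
    """Return the result rows of a SPARQL JSON results document.

    Each row maps a variable name to a term such as
    {"type": "uri", "value": "http://..."}.
    """
    doc = json.loads(payload)
    return doc["results"]["bindings"]


# A hand-written response in the JSON results format, for illustration only.
sample = json.dumps({
    "head": {"vars": ["predicate", "object"]},
    "results": {"bindings": [{
        "predicate": {"type": "uri",
                      "value": "http://www.w3.org/2000/01/rdf-schema#seeAlso"},
        "object": {"type": "uri",
                   "value": "http://www.wikidata.org/entity/Q183417"},
    }]},
})
rows = parse_bindings(sample)
```

Sending the request is then a matter of passing it to urllib.request.urlopen with a timeout and feeding the decoded body to parse_bindings.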

The edges_from and edges_to methods translate directly to SPARQL basic triple patterns, with the entity fixed in the subject or object position:

# edges_from(entity_id)
SELECT ?predicate ?object WHERE {
    <{entity_id}> ?predicate ?object .
    FILTER(!isBlank(?object))
}
LIMIT 500

# edges_to(entity_id)
SELECT ?subject ?predicate WHERE {
    ?subject ?predicate <{entity_id}> .
    FILTER(!isBlank(?subject))
}
LIMIT 500

The FILTER(!isBlank(?object)) clause excludes blank nodes -- anonymous intermediate nodes that appear in RDF data but have no canonical ID and cannot be meaningfully referenced in BFS-QL. Blank nodes are a modeling convenience in RDF; they are a navigational dead end for graph traversal.
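Because the entity URI is spliced into the query text, a backend would presumably want to reject strings that cannot legally appear between angle brackets. The following is one possible sketch of a query builder for edges_from; the helper names are hypothetical, and the character check follows the IRIREF production of the SPARQL 1.1 grammar.

```python
import re

# Characters that may not appear inside <...> according to SPARQL's
# IRIREF production: control characters, space, and < > " { } | ^ ` \
_FORBIDDEN_IN_IRI = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def _iri(uri: str) -> str:
    """Wrap a URI in angle brackets, refusing anything that could break out."""
    if _FORBIDDEN_IN_IRI.search(uri):
        raise ValueError(f"not a safe IRI: {uri!r}")
    return f"<{uri}>"


def edges_from_query(uri: str, limit: int = 500) -> str:
    """Build the outgoing-edge query, excluding blank-node objects."""
    return (
        "SELECT ?predicate ?object WHERE {\n"
        f"    {_iri(uri)} ?predicate ?object .\n"
        "    FILTER(!isBlank(?object))\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


query = edges_from_query("http://dbpedia.org/resource/Cushing%27s_disease")
```

The edges_to builder would be symmetric, with the validated IRI in the object position and the filter applied to ?subject.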

URI Normalization

SPARQL endpoints represent entities as URIs:

<http://dbpedia.org/resource/Cushing%27s_disease>
<http://www.wikidata.org/entity/Q183417>

BFS-QL canonical IDs are strings. The SPARQL backend must map between them.

The mapping strategy is endpoint-specific. For DBpedia, the URI prefix http://dbpedia.org/resource/ maps to the prefix DBpedia:. For Wikidata, http://www.wikidata.org/entity/ maps to Wikidata:. The backend's initialization takes a prefix map:

backend = SparqlBackend(
    endpoint="https://dbpedia.org/sparql",
    prefixes={
        "DBpedia": "http://dbpedia.org/resource/",
        "DBpedia-owl": "http://dbpedia.org/ontology/",
    }
)

On the way out, canonical IDs are expanded to full URIs before insertion into SPARQL queries. On the way in, URIs in query results are compressed using the prefix map. URIs that match no known prefix are included as-is -- they are valid canonical IDs, just opaque to the user.
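The two directions of the mapping can be sketched as a pair of small functions. This is an illustration under assumptions: the function names expand and compress are hypothetical, and the local part of the ID is assumed to be carried over unchanged (no percent-decoding).

```python
PREFIXES = {
    "DBpedia": "http://dbpedia.org/resource/",
    "DBpedia-owl": "http://dbpedia.org/ontology/",
}


def compress(uri: str, prefixes: dict[str, str]) -> str:
    """Turn a full URI into a prefixed canonical ID when a prefix matches."""
    # Try the longest namespace first so nested namespaces resolve correctly.
    by_length = sorted(prefixes.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, namespace in by_length:
        if uri.startswith(namespace):
            return f"{name}:{uri[len(namespace):]}"
    return uri  # unknown namespace: the full URI is the canonical ID


def expand(canonical_id: str, prefixes: dict[str, str]) -> str:
    """Turn a prefixed canonical ID back into a full URI."""
    name, sep, local = canonical_id.partition(":")
    if sep and name in prefixes:
        return prefixes[name] + local
    return canonical_id  # already a full URI (e.g. "http://...")


short = compress("http://dbpedia.org/ontology/Disease", PREFIXES)
full = expand("DBpedia:Cushing%27s_disease", PREFIXES)
```

A full URI such as http://example.org/x passes through expand untouched because its scheme, http, is not a key in the prefix map.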

Schema Discovery

entity_types and predicates are the BFS-QL methods that return the graph's vocabulary. For SPARQL endpoints, these translate to SELECT DISTINCT queries:

# entity_types()
SELECT DISTINCT ?type WHERE {
    ?s a ?type .
}
ORDER BY ?type
LIMIT 200

# predicates()
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?pred WHERE {
    ?s ?pred ?o .
    FILTER(?pred != rdf:type)
}
ORDER BY ?pred
LIMIT 500

These queries can be slow on large endpoints -- Wikidata holds billions of triples, and SELECT DISTINCT over all predicates is not instantaneous. The CachedGraphDb wrapper handles this: both methods are cached indefinitely after the first call. The server's lifespan handler calls them once at startup and caches the results in _state.

For endpoints where SELECT DISTINCT is prohibitively slow, an alternative is to probe from a known seed: start from a well-connected entity and collect the entity types and predicates that appear in its neighborhood. This produces a partial schema -- sufficient for the BFS-QL server to inject into tool descriptions -- without scanning the entire graph.
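Seed probing amounts to a bounded breadth-first walk that records what it sees. A minimal sketch follows, assuming a hypothetical edges_from callable that yields (predicate, object) pairs of full URIs; the real backend method may return a different shape, and the max_nodes bound is an illustrative parameter.

```python
from collections import deque
from typing import Callable, Iterable

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

Edge = tuple[str, str]  # (predicate, object)


def probe_schema(
    seed: str,
    edges_from: Callable[[str], Iterable[Edge]],
    max_nodes: int = 50,
) -> tuple[list[str], list[str]]:
    """Collect entity types and predicates seen near a seed entity."""
    types: set[str] = set()
    predicates: set[str] = set()
    seen = {seed}
    queue = deque([seed])
    visited = 0
    while queue and visited < max_nodes:
        node = queue.popleft()
        visited += 1
        for predicate, obj in edges_from(node):
            if predicate == RDF_TYPE:
                types.add(obj)  # type objects are classes, not neighbors
                continue
            predicates.add(predicate)
            if obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return sorted(types), sorted(predicates)


# Toy in-memory graph standing in for a live endpoint.
toy = {
    "ex:A": [(RDF_TYPE, "ex:Disease"), ("ex:treatedBy", "ex:B")],
    "ex:B": [(RDF_TYPE, "ex:Drug"), ("ex:targets", "ex:C")],
    "ex:C": [(RDF_TYPE, "ex:Gene")],
}
types, predicates = probe_schema("ex:A", lambda n: toy.get(n, []), max_nodes=2)
```

With max_nodes=2 the walk expands only ex:A and ex:B, so ex:Gene never enters the result. That is the trade-off the text describes: the schema is partial, and its coverage depends on how well connected the seed is.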

`search_entities` Against SPARQL

SPARQL endpoints vary in their full-text search support. Virtuoso (which backs DBpedia) supports bif:contains for full-text matching. GraphDB supports Lucene-backed text search. Many endpoints support no full-text search at all.

The most portable approach is rdfs:label matching with FILTER(CONTAINS(...)):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?type WHERE {
    ?entity rdfs:label ?label ;
            a ?type .
    FILTER(CONTAINS(LCASE(?label), LCASE("{query}")))
}
LIMIT 20

This is not fast on large graphs, but it works everywhere and avoids endpoint-specific extensions. For production use against a specific endpoint, the backend should be configured with that endpoint's preferred search mechanism.
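Since the search text comes from the caller and lands inside a quoted string literal, it needs escaping before it is substituted into the query. One way this might look is sketched below; escape_literal and search_query are hypothetical names, and the escape set covers the backslash sequences SPARQL defines for double-quoted strings.

```python
def escape_literal(text: str) -> str:
    """Escape text for use inside a double-quoted SPARQL string literal."""
    replacements = {
        "\\": "\\\\",  # backslash
        '"': '\\"',    # double quote
        "\n": "\\n",
        "\r": "\\r",
        "\t": "\\t",
    }
    # Character-by-character, so an inserted backslash is never re-escaped.
    return "".join(replacements.get(ch, ch) for ch in text)


def search_query(text: str, limit: int = 20) -> str:
    """Build the portable rdfs:label substring search."""
    needle = escape_literal(text)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entity ?type WHERE {\n"
        "    ?entity rdfs:label ?label ;\n"
        "            a ?type .\n"
        f'    FILTER(CONTAINS(LCASE(?label), LCASE("{needle}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


query = search_query('cushing"s')
```

An endpoint-specific variant, for example one using Virtuoso's bif:contains, could replace only the FILTER line while keeping the same escaping step.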

Handling Endpoint Variance

SPARQL 1.1 is a standard, but implementations differ. Virtuoso requires DEFINE sql:describe-mode "CBD" for some queries. GraphDB has different timeout behavior. Stardog enforces stricter blank node handling. Amazon Neptune does not support all property path expressions.

The SPARQL backend handles this through a small set of configuration knobs: query timeout (in seconds), result set size limit (LIMIT clause), and a flag for whether SELECT DISTINCT over the full graph is safe to issue. These are set at initialization and applied to all generated queries.
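The three knobs could be grouped into a single immutable options object that query generation consults. The sketch below is illustrative only: EndpointOptions, its field names, and its defaults are assumptions rather than the backend's real configuration surface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndpointOptions:
    """Per-endpoint settings fixed at initialization (hypothetical names)."""
    timeout_seconds: float = 30.0   # passed to the HTTP client
    result_limit: int = 500         # appended as a LIMIT clause
    full_scan_safe: bool = True     # may we SELECT DISTINCT over everything?

    def apply_limit(self, query_body: str) -> str:
        """Append the configured LIMIT to a query that has none."""
        return f"{query_body.rstrip()}\nLIMIT {self.result_limit}"


def predicates_query(options: EndpointOptions) -> Optional[str]:
    """Return the full-graph predicate scan, or None if it is unsafe here."""
    if not options.full_scan_safe:
        return None  # caller falls back to another strategy, e.g. seed probing
    return options.apply_limit("SELECT DISTINCT ?pred WHERE { ?s ?pred ?o . }")


cautious = EndpointOptions(timeout_seconds=10.0, result_limit=100,
                           full_scan_safe=False)
permissive = EndpointOptions(result_limit=100)
```

Freezing the dataclass matches the statement that the settings are set once at initialization; the timeout would be handed to whatever HTTP client issues the request, such as the timeout argument of urllib.request.urlopen.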

The abstraction boundary is clean: an LLM querying a BFS-QL server backed by Virtuoso, GraphDB, or Neptune sees identical behavior. The endpoint variance is confined entirely to the backend implementation.