Skip to content

Chapter08

Chapter 8: The GraphDbInterface ABC

There is a tension in interface design between expressiveness and simplicity. An expressive interface gives the caller more power -- more ways to phrase a request, more control over execution, more options to tune. A simple interface gives the implementor less to build and the caller fewer things to get wrong. The usual engineering response is to find a balance, to offer enough expressiveness without crossing into complexity.

BFS-QL resolves this tension by separating it into two layers. The GraphDbInterface abstract base class is the implementor-facing interface. It is deliberately primitive -- eight methods, all basic graph navigation, no traversal intelligence. The BFS-QL server layer is the caller-facing interface, where all the expressiveness lives: multi-seed BFS, stub/full filtering, topology mode, caching, the full six-tool surface. These are not in tension because they are in different places. The ABC is simple so that backends are easy to write. The server is expressive because it has to be. Neither layer compromises the other.

Eight Methods

The complete interface:

class GraphDbInterface(ABC):

    @abstractmethod
    async def search_entities(
        self, query: str, node_types: list[str] | None = None
    ) -> list[EntityStub]:
        """Resolve a name or alias to candidate stubs."""

    @abstractmethod
    async def edges_from(self, entity_id: str) -> list[Edge]:
        """Return all outgoing edges from the given entity."""

    @abstractmethod
    async def edges_to(self, entity_id: str) -> list[Edge]:
        """Return all incoming edges to the given entity."""

    @abstractmethod
    async def get_node(self, entity_id: str) -> Node:
        """Return the node record for the given entity ID."""

    @abstractmethod
    async def metadata_for_node(
        self, entity_id: str
    ) -> dict[str, Any]:
        """Return all available metadata for the entity."""

    @abstractmethod
    async def metadata_for_edge(
        self, edge: Edge
    ) -> dict[str, Any]:
        """Return full metadata, including provenance."""

    @abstractmethod
    async def entity_types(self) -> list[str]:
        """Return valid entity type names in this graph."""

    @abstractmethod
    async def predicates(self) -> list[str]:
        """Return valid predicate names in this graph."""

Three pairs and two singletons. edges_from / edges_to are the traversal primitives -- directed graph navigation in both directions. get_node / metadata_for_node separate identity from detail: the first returns just the entity ID and type, the second returns everything else. metadata_for_edge pairs with the traversal primitives to provide full provenance. Then the two singletons: search_entities maps natural-language names to canonical IDs, and entity_types / predicates expose the graph's own vocabulary.

The separation of get_node from metadata_for_node is deliberate. During BFS expansion, the engine calls both concurrently for nodes that need full records, but calls only get_node for nodes that will become stubs. A backend that fetches metadata lazily -- or from a separate service -- can implement both cheaply without conflating the two concerns.

What the Interface Does Not Contain

There is no bfs_query method. There is no count_neighbors method. There is no find_shortest_path method. There is no filter parameter, no hop limit, no traversal mode.

All of that is in the server layer. The bfs_query function in engine.py implements multi-seed BFS entirely in terms of edges_from, edges_to, get_node, and metadata_for_node. The stub/full decision is made at the server layer. The topology-only mode is a serialization choice at the server layer. The caching layer wraps the backend transparently.

The consequence is that a backend implementor answers only one question: how do I perform these eight operations against this particular graph store? Not: how do I implement BFS? Not: how do I cache edge lists efficiently? Not: how do I serialize results? Those questions are already answered, once, in the server layer, and the answers apply to every backend automatically.

The Caching Layer

CachedGraphDb wraps any backend in a dict-keyed cache at the primitive level. Every call to edges_from(id) checks a dict before hitting the backend. Every metadata_for_node(id) is cached after the first fetch. entity_types and predicates are cached indefinitely -- they are stable for the lifetime of a session.

class CachedGraphDb(GraphDbInterface):
    def __init__(
        self, backend: GraphDbInterface, maxsize: int = 1024
    ) -> None:
        self._backend = backend
        self._edges_from_cache: dict[str, list[Edge]] = {}
        self._node_meta_cache: dict[str, dict[str, Any]] = {}
        # ... per-method dicts for all eight methods

    async def edges_from(self, entity_id: str) -> list[Edge]:
        if entity_id not in self._edges_from_cache:
            res = await self._backend.edges_from(entity_id)
            self._edges_from_cache[entity_id] = res
        return self._edges_from_cache[entity_id]

The critical property is that caching operates at the level where it pays. BFS traversal at depth 2 may visit the same node from multiple directions. Without caching, each visit triggers a backend round-trip. With primitive-level caching, the second visit is a dict lookup. The backend sees each distinct call at most once per session. A multi-hop traversal over a well-connected graph can reduce backend round-trips by an order of magnitude.

Because the cache is transparent -- CachedGraphDb implements GraphDbInterface -- backends do not implement caching themselves. They return fresh data on every call. The server layer decides caching policy. Backends stay simple.

All Methods Are Async

Every method in GraphDbInterface is async. This is not a formality. BFS expansion calls edges_from and edges_to for every node in the current frontier concurrently, via asyncio.gather. A 2-hop BFS over a frontier of 40 nodes issues 80 concurrent edge queries. Against a Postgres backend, these resolve as concurrent connection pool requests. Against a SPARQL endpoint, they resolve as concurrent HTTP requests. Against a Neo4j backend, they resolve as concurrent Bolt protocol calls.

A synchronous interface would serialize this work unnecessarily. The async design is a performance contract: backends that can serve concurrent queries concurrently will. Backends that cannot (in-memory dicts, file-backed stores) pay no penalty -- async def with no await inside is just a regular function in async clothing.