Overview
When you run a query in the MotherDuck UI, the system has to shuttle data between the cloud and your browser. Every round trip adds latency. For interactive data exploration — where you run a query, scan the results, tweak a filter, run it again — those round trips add up fast.
This thesis tackles that problem head-on with a new concept called Accelerated Approximate Views (AAVs). The idea: create a partial, client-side cache of query results that builds in the background. When a subsequent query can be satisfied from the cache, it reads locally instead of round-tripping to the server. The result is lower latency for the interactive exploration patterns that dominate notebook and dashboard use.
AAVs have been implemented in DuckDB and DuckDB Wasm within MotherDuck, with a custom SQL syntax for creating, reading, and managing them.
PDF: https://homepages.cwi.nl/~boncz/msc/2024-NiclasHaderer.pdf
The Problem
MotherDuck’s hybrid execution model splits queries between client and server. This is powerful — it lets you process data locally when it makes sense, and offload heavy lifting to the cloud when it doesn’t. But it also means query results often need to travel from the cloud to your browser before you can see them.
In the MotherDuck UI, a single SELECT statement triggers multiple follow-up queries: a tabular preview of the data, column-level statistics in the explorer panel, pivot table operations. Each of these re-executes parts of the original query. If the data lives in the cloud, that’s multiple round trips for what feels like one operation.
The old approach used a CREATE TABLE AS SELECT (CTAS) to materialize the full result locally before running follow-up queries against it. This works, but it’s all-or-nothing: you have to wait for the entire result to materialize before seeing anything. For large result sets, that means a long wait before the first row appears.
The MotherDuck UI after executing a SELECT statement, showing the pivot table and column explorer that rely on cached query results.
How Accelerated Approximate Views Work
AAVs introduce a new database object — a view with an attached cache. You create one with a custom SQL extension:
CREATE RESULT my_result AS SELECT * FROM large_table ORDER BY id;
Behind the scenes, the system:
- Registers the view like a normal database view
- Starts building a cache in the background — populating it incrementally without blocking other queries
- Intercepts queries against the view — if the cache can satisfy the query (e.g., the requested rows are already cached), it reads locally; otherwise, it falls back to the original query
The key insight is the decision tree for cache usage. When a query accesses an AAV, the optimizer checks whether the cache has finished building and whether the query is guaranteed satisfiable from the partial cache, then routes to one of three outcomes:
Figure 4.2: The cache decision tree. If the cache is sufficient, read from it directly. If the query is guaranteed satisfiable but the cache is still building, wait. Otherwise, fall back to re-executing the original query.
This decision happens during query optimization, after transformations like limit push-downs have been applied.
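The routing logic can be sketched as a small decision function. This is a simplified illustration under assumed types and names (`Query`, `CacheState`, `routeQuery` are hypothetical, not MotherDuck's actual implementation), treating "guaranteed satisfiable" as "the query has a known row bound after limit push-down":

```typescript
// Hypothetical, simplified model of the cache decision tree.
type CacheState = { complete: boolean; cachedRows: number };
type Query = { limit?: number }; // simplified: only a pushed-down LIMIT matters here

type Route = "read-cache" | "wait-for-cache" | "fallback";

function routeQuery(q: Query, cache: CacheState): Route {
  // A query on a partial cache is "guaranteed satisfiable" when its result
  // size is bounded, e.g. by a LIMIT that survived limit push-down.
  const satisfiable = q.limit !== undefined;
  if (cache.complete) return "read-cache"; // cache is done: always safe to read
  if (satisfiable && q.limit! <= cache.cachedRows) {
    return "read-cache"; // requested rows are already cached
  }
  if (satisfiable) return "wait-for-cache"; // rows will arrive: wait for the builder
  return "fallback"; // unbounded query on a partial cache: re-execute the original
}
```

An unbounded `SELECT *` against a still-building cache falls back to the original query, while a `LIMIT 10` whose rows are already cached reads locally.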
Background Materialization
One of the trickiest parts of the implementation is building the cache without interfering with the user's actual queries. In native C++ environments, this uses DuckDB's thread pool — the cache builds on background threads while the main query runs in the foreground.
In WebAssembly, things get harder. Browser JavaScript executes on a single main thread, so background work has to be cooperative. The thesis implements a custom background execution driver that periodically allocates time slices for cache building within the JavaScript event loop. It works like a cooperative scheduler: run cache work for a fixed interval, yield control, then continue on the next cycle.
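The slice-and-yield pattern can be sketched in a few lines. This is a minimal illustration, not the thesis's actual driver (which is integrated with DuckDB Wasm's task system); the `step` callback and the 8 ms slice budget are assumptions:

```typescript
// Minimal sketch of a cooperative background driver: run cache-building work
// for a fixed time slice, then yield to the event loop so foreground queries
// and UI rendering are not starved.
async function runInBackground(
  step: () => boolean, // performs one unit of cache work; returns true when done
  sliceMs = 8          // time budget per slice (assumed value)
): Promise<void> {
  for (;;) {
    const deadline = Date.now() + sliceMs;
    // Work until the slice budget is exhausted or the cache is fully built.
    while (Date.now() < deadline) {
      if (step()) return; // cache finished
    }
    // Yield control back to the JavaScript event loop before the next slice.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}
```

The coarse granularity this implies — work only makes progress in discrete slices, and a long-running `step` cannot be preempted mid-call — is exactly the limitation the thesis identifies in the Wasm results below.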
The full caching workflow in the MotherDuck UI shows how these pieces fit together:
Figure 5.1: The complete query lifecycle in the MotherDuck UI. A local cache is created, its size is checked, and downstream queries (tabular preview, row count, column explorer) are routed through either the cache or the original query depending on fit.
This approach works, but Wasm’s single-threaded nature limits the gains. The thesis proposes a future improvement: a centralized task executor for DuckDB Wasm that would enable fairer scheduling between foreground queries and background cache work.
User Study: How People Actually Query MotherDuck
To evaluate whether AAVs would help real users, the thesis includes a study of actual MotherDuck UI workloads. Key findings:
- Most queries return small result sets — the median is well under 1,024 rows, meaning most results fit entirely in the AAV cache
- SELECT is dominant — the vast majority of UI queries are SELECT statements, exactly the pattern AAVs are designed for
- Simple operators prevail — WHERE, LIMIT, and ORDER BY are by far the most common. Complex joins and subqueries are relatively rare in interactive exploration
- Data types cluster around strings, integers, and timestamps — consistent with patterns seen in Redshift and Snowflake workload studies
These findings confirm that AAVs target the right workload: interactive, exploratory queries that return moderate-sized results and are often re-queried with slight modifications.
Performance Results
In native C++ environments, AAVs improve the time-to-first-tuple metric — the time from issuing a query until the first row appears. This matters for interactive use: seeing results start to appear immediately feels fast, even if the full result takes longer.
AAVs outperform CTAS statements on time-to-first-tuple because they’re designed to be buffer-less — results stream out as soon as they’re available, rather than waiting for full materialization. In many cases, AAVs even beat DuckDB’s streaming query API, which is specifically designed for early tuple emission.
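The buffered-versus-streaming difference is easy to see in miniature. The sketch below (hypothetical names, an async generator standing in for query execution — not DuckDB's API) shows why a buffer-less consumer sees its first row after one production step, while a CTAS-style consumer waits for the whole result:

```typescript
// Stand-in for query execution: rows arrive one at a time.
async function* produceRows(total: number): AsyncGenerator<number> {
  for (let i = 0; i < total; i++) yield i;
}

// CTAS-style: materialize the full result, then hand over the buffer.
// The first row is only visible after `total` production steps.
async function firstRowBuffered(total: number): Promise<number> {
  const buffer: number[] = [];
  for await (const row of produceRows(total)) buffer.push(row);
  return buffer[0];
}

// AAV-style (buffer-less): emit each row as soon as it is produced.
// The first row is visible after a single production step.
async function firstRowStreaming(total: number): Promise<number> {
  for await (const row of produceRows(total)) return row;
  throw new Error("empty result");
}
```

Both functions return the same first row; the difference is how much work happens before it becomes available — which is precisely what the time-to-first-tuple metric measures.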
The picture is more nuanced in Wasm. The background execution driver’s coarse granularity limits gains, and bridge operations that transfer data between local and remote contexts introduce blocking that slows cache building. The thesis identifies this as the primary area for future improvement and has already contributed patches to DuckDB to lay the groundwork.
Contributions to DuckDB
This thesis resulted in six merged contributions across the DuckDB ecosystem (the main DuckDB repository and duckdb-node), covering scheduler improvements, streaming query enhancements, and infrastructure needed for background task execution. These changes benefit not just AAVs but the broader DuckDB ecosystem.
Key Takeaways
- Declarative caching at the SQL level lets the database make caching decisions automatically, removing the need for manual cache management in application code
- Partial caching is effective — you don’t need to materialize the entire result to get benefits. Most interactive queries only need the first few hundred or thousand rows
- Wasm remains challenging for background work, but the cooperative scheduling approach demonstrates feasibility and points toward better solutions
- Real workload analysis confirms that interactive MotherDuck users run exactly the kind of queries that benefit from client-side caching


