Results on Results: Building New Results from Cached Partial Results
2025/08/25
By Marcio Brito Barbosa
Overview
The interactive experience of writing SQL has not kept up with the performance advancements of modern analytic engines. Analytical cloud services almost always incur WAN latency, meaning even short queries can feel slow to analysts. Users are left in a “write–run–wait” cycle that hampers exploratory analysis.
This thesis — conducted at MotherDuck’s Amsterdam office under the supervision of Peter Boncz and Boaz Leskes — develops a framework for powering real-time query previews like MotherDuck’s Instant SQL feature. The core idea: treat cached partial results as composable building blocks, and when composition fails, use intelligent heuristics to recover cheaply.
Figure: The Instant SQL feature in action, providing a real-time result preview that updates as the user edits their query.
PDF: https://homepages.cwi.nl/~boncz/msc/2025-MarcioBarbosa.pdf
The Problem: Insufficient Cardinality from Composed Caches
MotherDuck’s hybrid architecture runs DuckDB both in the cloud and in the browser (via WebAssembly). To make SQL editing feel instant, the system caches partial query results on the client side — called Results. When a user modifies their query, the system tries to compose an answer from these cached Results rather than re-executing on the server.
This works well until it doesn’t. When Results are built on data samples, composing them (especially through joins and filters) can produce a result set too small to be useful — a MINROWS failure. The naive fix is to throw away all caches and re-execute the full query on the server, but that’s expensive and defeats the purpose of caching.
Figure: Initial plan using only cached Results (no base-table access). If the tuples produced are too few, a MINROWS failure is triggered.
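To make the failure mode concrete, here is a minimal sketch of the compose-then-check flow. All names (`Result` as a dict, `MINROWS`, `compose_preview`, `MinRowsFailure`) are illustrative stand-ins, not MotherDuck's actual API:

```python
# Hypothetical sketch: answer a query from cached partial Results and
# signal a MINROWS failure when the composed preview is too small.
MINROWS = 20  # minimum rows a preview must contain to be useful

class MinRowsFailure(Exception):
    def __init__(self, produced):
        super().__init__(f"only {produced} rows produced, need {MINROWS}")
        self.produced = produced

def compose_preview(cached_results, predicate):
    """Compose a preview by filtering rows from cached sample Results."""
    rows = [row for res in cached_results
            for row in res["rows"] if predicate(row)]
    if len(rows) < MINROWS:
        # Too few tuples survived composition: hand off to a recovery
        # strategy instead of silently showing a near-empty preview.
        raise MinRowsFailure(len(rows))
    return rows
```

The interesting design question is what happens in the `except` branch: naive full replacement re-executes everything remotely, while the strategies below try to fix exactly one cache.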
Targeted Re-creation: A Smarter Recovery
Instead of full replacement, this thesis proposes Targeted Re-creation: identify a single upstream cached Result to re-fetch from the server, choosing the one that will resolve the cardinality shortfall at minimum cost. Three heuristics are developed and evaluated:
1. Optimizer Cardinality Estimation
Uses the query optimizer’s built-in cardinality estimates to predict how many rows each candidate re-creation would produce. Advantage: zero local cost. Problem: optimizer estimates proved highly unreliable — often off by orders of magnitude — leading to poor re-creation choices.
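In sketch form, this heuristic reduces to a cheap selection over candidate re-creations, each annotated with the optimizer's estimated output cardinality and a server-side cost. The field names and numbers below are illustrative assumptions:

```python
# Hypothetical sketch of the optimizer-estimate heuristic: pick the
# cheapest candidate whose estimated row count covers the shortfall.
def pick_by_estimate(candidates, rows_needed):
    viable = [c for c in candidates if c["est_rows"] >= rows_needed]
    if not viable:
        return None  # no candidate looks sufficient: fall back to full replacement
    return min(viable, key=lambda c: c["cost"])
```

The selection itself is free; the weakness is entirely in the inputs, since a misestimate by orders of magnitude makes `viable` meaningless.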
2. Dynamic Sampling
Runs the candidate query locally on progressively larger samples and extrapolates via linear regression. Advantage: accurate predictions. Problem: the exhaustive local sampling is computationally expensive, especially in DuckDB Wasm’s single-threaded environment, making it impractical for interactive use.
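The extrapolation step can be sketched as ordinary least squares over (sample size, output rows) pairs. `run_on_sample` and the sample sizes are placeholders for the real local-execution machinery:

```python
# Illustrative sketch of the dynamic-sampling heuristic: execute the
# candidate on progressively larger samples, fit rows_out = a * size + b,
# and extrapolate to the full input size.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx  # slope, intercept

def predict_full_cardinality(run_on_sample, sample_sizes, full_size):
    ys = [run_on_sample(s) for s in sample_sizes]  # the expensive part
    a, b = fit_line(sample_sizes, ys)
    return a * full_size + b
```

The accuracy comes from actually executing the candidate, which is exactly why the approach is too costly on a single-threaded client.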
3. Data Lineage (Recommended)
Augments each cached Result with lightweight provenance metadata — tracking which source Results contributed to each tuple. When a MINROWS failure occurs, the system inspects the lineage to identify which upstream Result is the bottleneck (i.e., contributed the fewest tuples relative to what’s needed). Advantage: near-zero local overhead with strong predictive ability.
Figure: Targeted recovery — re-create only the selected cached Result predicted to resolve the cardinality shortfall, rather than bypassing all caches.
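Bottleneck identification can be sketched as a frequency count over origins. For clarity this toy version tags each surviving tuple with a single origin Result; the thesis tracks lineage as set-valued metadata on Results rather than per-row tags:

```python
from collections import Counter

# Illustrative sketch: on a MINROWS failure, re-create the upstream
# Result that contributed the fewest surviving tuples. An origin that
# contributed nothing at all is the tightest bottleneck.
def pick_bottleneck(surviving_rows, all_origins):
    counts = Counter(row["origin"] for row in surviving_rows)
    return min(all_origins, key=lambda o: counts.get(o, 0))
```

Because the counts are a by-product of composition, the choice costs essentially nothing locally.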
Experimental Results
The strategies were benchmarked with multi-step analytical workloads generated by LLMs on both TPC-H (uniform) and JCC-H (skewed) data:
Figure: Predictive accuracy (Hit Ratio) of each strategy’s non-baseline choices. Data Lineage achieves 81.6% accuracy with near-zero local cost.
- Data Lineage achieved the best overall performance: lowest total execution time, fewest user interruptions, and negligible client-side overhead
- Cardinality Estimation had zero local cost but made poor selections due to unreliable optimizer estimates
- Dynamic Sampling was accurate but too expensive for interactive use in a single-threaded Wasm environment
- All intelligent strategies significantly outperformed naive full replacement
Figure: Final total execution time (remote + local) for each strategy. Data Lineage closely approaches the Oracle (optimal) baseline.
The Data Lineage approach requires only a simple metadata column on cached results — storing sets of origin Result identifiers that are propagated and unioned as new Results are composed. No row-level index tracking is needed, keeping the mechanism lightweight.
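The metadata column described above amounts to a set union on every composition. A minimal sketch, with `make_result` and `compose` as hypothetical names (not MotherDuck's implementation):

```python
# Each cached Result carries the set of origin Result ids it was built
# from; composing two Results unions their lineage sets. No row-level
# tracking is involved.
def make_result(result_id, rows):
    return {"rows": rows, "lineage": {result_id}}

def compose(left, right, combine_rows):
    return {
        "rows": combine_rows(left["rows"], right["rows"]),
        "lineage": left["lineage"] | right["lineage"],
    }
```

Set union is cheap and associative, so lineage survives arbitrarily deep chains of composed Results without growing beyond the number of distinct origins.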
Key Takeaways
- Composing partial caches is powerful but brittle — joins and filters on sampled data can produce empty or near-empty results
- Data Lineage provides the best cost/accuracy tradeoff for selecting which cache to re-create
- A naive fallback is still essential — when no targeted fix works, full replacement must remain available
- Client-side overhead matters enormously in Wasm-based hybrid architectures where the browser runtime is single-threaded
- The framework directly enables features like MotherDuck’s Instant SQL, making real-time query previews practical in production hybrid database systems