
DuckLake: Making BIG DATA feel small (Coalesce 2025)

2025/10/14

TL;DR: DuckLake is a new open lakehouse format that combines the simplicity of a database catalog with the scalability of open data formats—eliminating the "big data tax" and enabling a full lakehouse setup in just 5 minutes.

The Big Data Tax

Today's cloud data warehouses were designed around 2012, when hardware was far weaker. Their distributed architecture comes with penalties:

  • Latency: Small queries take longer than they should due to coordination overhead
  • Cost: Network shuffling between nodes isn't free
  • Complexity: Scheduling, planning, and routing across nodes adds operational burden

The key insight: "Big compute is dead" (though it's not as catchy as "big data is dead"). Even at the 99th percentile, queries touch under 256 GB of data, well within single-node capability.

DuckDB: Pushing Single-Node Performance

  • In-process: Runs inside Python, Node.js, Go, Rust, and 15+ other languages (see the sketch after this list)
  • Lightweight: 20MB binary, zero dependencies, installs in seconds
  • Fast: #1 on ClickBench, beating ClickHouse, Snowflake, Redshift, and BigQuery
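
Because DuckDB runs in-process and reads open formats natively, querying Parquet on object storage is a single statement. A minimal sketch; the bucket path and column names are hypothetical placeholders:

    -- Query Parquet straight from object storage: no cluster, no load step.
    -- (S3 credentials are configured separately, e.g. via a DuckDB secret.)
    INSTALL httpfs; LOAD httpfs;

    SELECT user_id, count(*) AS events
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10;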

DuckLake vs Iceberg Architecture

Iceberg                                                        | DuckLake
---------------------------------------------------------------|--------------------------------------------------
Multiple metadata layers (manifests, metadata files, catalog)  | Single transactional database holds all metadata
Metadata overhead grows with commits                           | Database scales efficiently
Complex setup                                                  | 5-minute setup
Requires Java ecosystem                                        | Pure SQL, any language that wraps DuckDB

Key insight: DuckLake uses the same architecture as Snowflake (FoundationDB) and BigQuery (Spanner)—a transactional database for metadata.
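
Because that metadata is just rows in one transactional database, you can inspect it with plain SQL from any client. A sketch assuming a Postgres catalog whose tables follow the DuckLake spec's naming (an assumption about your setup):

    -- Inside the metadata catalog: every snapshot is an ordinary row.
    -- Table and column names per the DuckLake spec; verify against your catalog.
    SELECT snapshot_id, snapshot_time
    FROM ducklake_snapshot
    ORDER BY snapshot_time DESC;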

5-Minute Lakehouse Demo with dbt

The demo shows setting up a complete lakehouse using the following pieces; a minimal SQL sketch of the equivalent statements follows the list:

  • A dbt profile configured to use DuckDB with the DuckLake extension
  • Postgres as the metadata catalog backend
  • Local or cloud storage for the actual data files
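
Under the hood, that dbt profile amounts to a few DuckDB statements. A minimal sketch, assuming a local Postgres catalog and an S3 data path; connection details, aliases, and paths are placeholders:

    -- Attach a DuckLake: metadata goes to Postgres, data files to object storage.
    INSTALL ducklake;
    INSTALL postgres;

    ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost'
        AS my_lake (DATA_PATH 's3://my-bucket/lake/');
    USE my_lake;

    -- From here it is ordinary SQL; tables become Parquet files plus catalog rows.
    CREATE TABLE trips AS
        SELECT * FROM read_parquet('s3://my-bucket/raw/trips/*.parquet');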

Maintenance operations include merging small files, expiring old snapshots, and cleaning up expired files—all callable through dbt run-operation.
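
In SQL, those operations map onto the DuckLake extension's maintenance functions; a hedged sketch (the catalog alias and retention window are illustrative, and parameter names may vary by extension version):

    -- Compact small files, expire old snapshots, then delete orphaned files.
    CALL ducklake_merge_adjacent_files('my_lake');
    CALL ducklake_expire_snapshots('my_lake', older_than => now() - INTERVAL 7 DAYS);
    CALL ducklake_cleanup_old_files('my_lake', cleanup_all => true);

Wrapping calls like these in dbt macros is what makes them runnable via dbt run-operation, on the same schedule as the models.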

Production Considerations

  1. Cloud compute: Serverless preferred for simplicity
  2. Large instances: Sometimes you need beefy compute for repartitioning or full scans
  3. Access control: Lock down your lakehouse
  4. Caching: Lakehouse files are immutable—perfect for caching
  5. Scheduled maintenance: Automate file compaction and snapshot expiration

MotherDuck: Ducklings of Unusual Size

Size     | Specs                 | Use Case
---------|-----------------------|------------------------
Standard | Various               | Day-to-day queries
Mega     | 64 cores, 256 GB RAM  | Heavy transformations
Giga     | 192 cores, 1.5 TB RAM | Most problems fit here

Real-World Migration

A customer replaced a 5-server distributed Iceberg cluster running on the largest AWS instances with a single serverless DuckLake deployment on MotherDuck.

  • Migration: Metadata-only (no data copying)
  • Iceberg import: Supported for bringing in existing Iceberg data
  • Iceberg export: Also supported for interoperability
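
Since both formats store data as Parquet, the migration moves only metadata. The DuckDB iceberg extension offers a migration call for this; the sketch below is an assumption about the exact signature, and both catalog aliases are hypothetical:

    -- With the Iceberg catalog and the target DuckLake both attached,
    -- a metadata-only migration is one call; data files stay in place.
    -- Verify the function name and arguments against your iceberg
    -- extension version.
    CALL iceberg_to_ducklake('iceberg_lake', 'my_lake');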

Key Takeaways

  • 10-100x data scale with existing SQL/dbt skills—no new stack or team required
  • Instant import from Iceberg—leverage existing data investments
  • Local dev parity: Same lakehouse runs on laptop and in production
  • Future: Spark connector in development for multi-engine support
