Hey, friend!
I hope you're doing well. I'm Simon, and I am excited to share the last monthly newsletter of the year, with highlights and the latest updates about DuckDB, delivered straight to your inbox.
In this December issue, I gathered the usual highlights, including insights on key developments like data-at-rest encryption, new community extensions (Gaggle and osmextract), various ecosystem integrations (DuckLake, dlt, Ibis, Marimo), and data engineering advice drawn from Reddit.
If you have feedback, news, or any insights, they are always welcome: duckdbnews@motherduck.com.
Featured Community Member

Nico Ritschel
Nico Ritschel is director of engineering at ATM.com and the creator of Sidequery, a privacy-first SQL analytics notebook for Mac that uses DuckDB as its core engine. He recently released two community extensions that push DuckDB into new territory.
Yardstick implements Julian Hyde's "Measures in SQL" paper, adding measure-aware SQL to DuckDB. Measures are aggregations that know how to re-aggregate themselves when query context changes, enabling percent-of-total calculations and year-over-year comparisons without the usual CTE gymnastics. With syntax like AGGREGATE(revenue) AT (ALL region), you can compute totals across dimensions directly in your query.
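To make that concrete, here is a rough sketch of a percent-of-total query in the spirit of that syntax; the sales table, the revenue measure, and the install name are illustrative assumptions rather than Yardstick's documented API, so check the extension's README for the real thing.

```sql
-- Hypothetical sketch: assumes a sales(region, ...) table on which a
-- revenue measure has already been declared (declaration omitted here).
INSTALL yardstick FROM community;  -- extension name assumed
LOAD yardstick;

-- The measure re-aggregates itself in two contexts within one query:
-- once per region, and once across ALL regions for the denominator.
SELECT
    region,
    AGGREGATE(revenue)                                      AS region_revenue,
    AGGREGATE(revenue) / AGGREGATE(revenue) AT (ALL region) AS pct_of_total
FROM sales
GROUP BY region;
```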
duckdb-acp brings natural language querying to DuckDB via the Agent Client Protocol. Write queries like CLAUDE what is the total revenue by product? and an AI agent explores your schema and generates the SQL. It supports Claude Code, Codex, and Gemini, with built-in safety controls to block mutations by default.
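If you want to try it, the flow presumably follows the usual community-extension pattern; the extension name below is my guess, and the CLAUDE prefix is taken straight from the project's example.

```sql
INSTALL duckdb_acp FROM community;  -- extension name assumed; see the duckdb-acp README
LOAD duckdb_acp;

-- The agent inspects the attached schema, generates the SQL, and runs it;
-- mutating statements are blocked by default.
CLAUDE what is the total revenue by product?
```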
Thanks, Nico, for tackling two hard problems and making them accessible through DuckDB extensions!
Top DuckDB Links this Month
Gaggle: A DuckDB extension for working with Kaggle datasets
TL;DR: Gaggle is a Rust-based DuckDB extension that enables direct SQL queries on Kaggle datasets via the Kaggle API, eliminating manual download/extraction workflows.
Ever had to search for a good dataset? Heard of Kaggle.com? Gaggle uses the Kaggle API to search, download, and manage datasets, supporting CSV, Parquet, JSON, and XLSX files. The extension introduces a kaggle: prefix for replacement scans, allowing queries like SELECT * FROM 'kaggle:vikrishnan/boston-house-prices/housing.csv' LIMIT 5 to read files directly without explicit path management, similar to the Hugging Face extension.
Setup requires Kaggle API credentials (export KAGGLE_USERNAME and KAGGLE_KEY). Key functions include gaggle_ls('owner/dataset') for listing dataset contents, gaggle_download() for explicit caching, and gaggle_file_path() for resolving local paths.
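Put together, a session could look roughly like this; the dataset path is borrowed from the example above, and the exact function signatures may differ slightly, so treat it as a sketch:

```sql
-- Credentials come from the environment:
--   export KAGGLE_USERNAME=...  export KAGGLE_KEY=...
INSTALL gaggle FROM community;  -- assumed install path for the community extension
LOAD gaggle;

-- List the files in a dataset, then query one directly via the kaggle: prefix.
SELECT * FROM gaggle_ls('vikrishnan/boston-house-prices');
SELECT * FROM 'kaggle:vikrishnan/boston-house-prices/housing.csv' LIMIT 5;

-- Optionally cache the dataset up front and resolve its local path.
SELECT gaggle_download('vikrishnan/boston-house-prices');
SELECT gaggle_file_path('vikrishnan/boston-house-prices/housing.csv');
```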
Book on Spatial Data Management with DuckDB
TL;DR: Qiusheng has authored a new comprehensive book, Spatial Data Management with DuckDB, that serves as a practical guide with open-source code for implementing geospatial analytics workflows using DuckDB's spatial capabilities.
The book covers working with spatial data in DuckDB through both the SQL and Python APIs, from the basics up to more advanced operations. Key technical topics covered include loading and exporting spatial data formats, executing geometry operations and functions, performing advanced spatial joins, and working with cloud-native vector tiles such as PMTiles.
Qiusheng demonstrates these concepts using real datasets, including the US National Wetlands Inventory and global building footprints. All code examples are provided as Python notebooks, runnable directly on MyBinder or Google Colab, offering a hands-on resource for building and optimizing spatial data pipelines within the DuckDB ecosystem, and are available on GitHub.
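To give you a taste of the kind of workflow the book walks through, here is a small generic example using DuckDB's spatial extension; the file names and column names are placeholders, not the book's actual datasets.

```sql
INSTALL spatial;
LOAD spatial;

-- Load a vector file (any GDAL-supported format) into a table.
CREATE TABLE wetlands AS
SELECT * FROM ST_Read('wetlands.geojson');  -- placeholder file

-- A simple geometry function plus a spatial join against building footprints.
SELECT b.id, ST_Area(b.geom) AS footprint_area
FROM ST_Read('buildings.fgb') AS b          -- placeholder file
JOIN wetlands AS w
  ON ST_Intersects(b.geom, w.geom);
```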
osmextract: OpenStreetMap data extraction tool powered by DuckDB
TL;DR: Osmextract is a high-performance, single-binary Rust tool that leverages DuckDB's spatial extension to extract and filter OpenStreetMap PBF data directly into GeoParquet or DuckDB files.
This ties into the spatial book above and provides a tool for a zero-copy pipeline from PBF input to Parquet output. It offers flexible filtering via JSON for tags, geometry filters, and a --custom-sql-filter option for applying custom DuckDB SQL WHERE clauses, such as "cardinality(tags) > 5", directly during extraction. The application automatically detects system resources like CPU cores and memory to optimize performance settings, though these can be manually overridden.
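Since the output is plain (Geo)Parquet, downstream analysis stays in DuckDB. A hedged sketch, assuming the extract keeps the OSM tags as a MAP column (the output file name is made up):

```sql
-- Query the Parquet file osmextract produced (file name is a placeholder),
-- reusing the tag-cardinality predicate from the example above.
SELECT cardinality(tags) AS n_tags, count(*) AS n_features
FROM read_parquet('extract.parquet')
WHERE cardinality(tags) > 5
GROUP BY n_tags
ORDER BY n_tags;
```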
4 Senior Data Engineers Answer 10 Top Reddit Questions
TL;DR: Four senior data engineers share advice on common challenges, emphasizing robust design, iterative development, and focusing on business fundamentals over chasing trends.
In this article, I collaborated with Ben, Julien, and Mehdi to answer the most frequently asked questions from the r/dataengineering subreddit. The Q&A distills practical knowledge we've picked up over years of data engineering and the common problems that come with it.
Topics range from managing schema change, such as freezing schemas and always adding new columns rather than editing existing ones to avoid downstream breaks, to the cognitive debt of schema auto-evolution, data quality, and much more.
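To make the "add, don't edit" advice concrete, here is a minimal illustration in DuckDB SQL; the table and columns are invented for the example.

```sql
-- Risky: changing a column's type in place can silently break downstream readers.
-- ALTER TABLE orders ALTER COLUMN amount SET DATA TYPE DECIMAL(18, 2);

-- Safer: add a new column, backfill it, and let consumers migrate at their own pace.
ALTER TABLE orders ADD COLUMN amount_v2 DECIMAL(18, 2);
UPDATE orders SET amount_v2 = CAST(amount AS DECIMAL(18, 2));
```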
Data-at-Rest Encryption in DuckDB
TL;DR: DuckDB v1.4 introduces transparent, data-at-rest encryption for database files, the Write-Ahead Log (WAL), and temporary files with negligible performance overhead.
The feature supports AES-GCM-256 and AES-CTR-256 ciphers. The implementation keeps the main database header in plaintext but adds an encryption flag, a salt, and an encrypted "canary" to verify the key. All subsequent data blocks, WAL entries, and temporary files used in out-of-core operations are fully encrypted.
How do I use it? Encryption is enabled via the ATTACH statement, e.g., ATTACH 'encrypted.db' AS enc (ENCRYPTION_KEY 'secret');. While a fallback Mbed TLS implementation exists, DuckDB auto-loads the httpfs extension to use its hardware-accelerated OpenSSL backend, which results in minimal performance impact.
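A full round trip, using the same placeholder key and file name as above, looks something like this:

```sql
-- Create and populate an encrypted database file.
ATTACH 'encrypted.db' AS enc (ENCRYPTION_KEY 'secret');
CREATE TABLE enc.events AS SELECT 42 AS id, 'hello' AS payload;
DETACH enc;

-- Later, or on another machine: reattach with the same key to read the data.
ATTACH 'encrypted.db' AS enc (ENCRYPTION_KEY 'secret');
SELECT * FROM enc.events;
```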
This enables new ways of working with DuckDB, for example safely passing around DuckDB database files with all the information inside, or new deployment models where you put an encrypted DuckDB database file on a Content Delivery Network (CDN).
Tech Review: DuckLake - From Parquet to Powerhouse
TL;DR: This review says that DuckLake provides core lakehouse capabilities with minimal infrastructure, making it a straightforward alternative to more complex systems for managing versioned data.
The article demonstrates executing ACID-compliant DML like UPDATE and DDL like ALTER TABLE ... ADD COLUMN directly on the Parquet-based table. The versioning system allows for time-travel queries to specific data states using ... AT (VERSION => 1). Hoyt also shows how to add detailed commit messages and author metadata to transactions for governance purposes by wrapping DDL in a transaction block and using CALL my_ducklake.set_commit_message(...). Also make sure to check out part 2, where he showcases using MotherDuck to create a DuckLake.
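For flavor, here is a condensed sketch of the kind of snippets the article works through; the attach options, table, and version number are placeholders, so double-check the DuckLake docs for the exact syntax.

```sql
INSTALL ducklake;
LOAD ducklake;

-- Attach a DuckLake catalog (metadata file and data path are placeholders).
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'lake_data/');

CREATE TABLE my_ducklake.orders (id INTEGER, status VARCHAR);
INSERT INTO my_ducklake.orders VALUES (1, 'new');

-- ACID DML and DDL directly on the Parquet-backed table.
UPDATE my_ducklake.orders SET status = 'shipped' WHERE id = 1;
ALTER TABLE my_ducklake.orders ADD COLUMN shipped_at TIMESTAMP;

-- Attach a commit message to a change, then time-travel to an earlier snapshot.
BEGIN TRANSACTION;
CALL my_ducklake.set_commit_message('backfill shipped_at');
ALTER TABLE my_ducklake.orders ADD COLUMN carrier VARCHAR;
COMMIT;

SELECT * FROM my_ducklake.orders AT (VERSION => 1);
```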
Related: Check out another deep dive, Is DuckLake a Step Backward? by Alireza Sadeghi, which compares DuckLake to log-oriented formats like Iceberg and explains how storing detailed file and column statistics in SQL tables avoids Hive's performance pitfalls.
KEYNOTE: Data Architecture Turned Upside Down | PyData Amsterdam 2025
TL;DR: The keynote walks through the history of data architecture and shows how we arrive at pushing compute down to client devices, leveraging modern single-node query engines and lakehouse formats to eliminate centralized data warehouse bottlenecks.
In this great PyData Amsterdam 2025 keynote, Hannes goes all the way back to 1985 and works up to today, demonstrating DuckDB processing a 265 GB dataset (6 billion rows), with a COUNT(DISTINCT l_orderkey) query completing in 46 seconds using only 2 GB of memory, thanks to larger-than-memory intermediate result handling.
He illustrates how his earlier research paper showed database client-server data transfer protocols performing worse than raw netcat CSV transfers, which he attributes to "strategic incompetence" that creates vendor lock-in. The proposed 2025 architecture flips the traditional pyramid. Clients become first-class compute nodes while storage (S3/object stores) and metadata (via formats like Iceberg or DuckLake) become commoditized infrastructure layers.
dlt + MotherDuck: Workshop material for Small Data SF 2025
TL;DR: The small-data-sf workshop repository provides a guide for building a modern ELT pipeline using Python's dlt library, with DuckDB for local development and MotherDuck for production deployment.
In the workshop material, contributors Thierry and Elvis demonstrate the full lifecycle of a data project, from ingesting data from the GitHub REST API to building a data product. The technical stack is centered around a local-first workflow, using duckdb for initial data exploration and transformation, complemented by tools like LLM scaffolding, Continue, Marimo, and Ibis.
There's a tutorial portion where the speakers explain and demonstrate the concepts, plus exercises you can work through if you'd like to learn how to solve data engineering tasks hands-on.
A Deep Dive into DuckDB for Data Scientists
TL;DR: Khuyen's deep dive demonstrates how DuckDB offers a zero-configuration, high-performance alternative to pandas for local data analysis, enabling direct SQL queries on various file formats and in-memory DataFrames.
The article highlights DuckDB's ability to directly query pandas and Polars DataFrames using duckdb.sql("SELECT * FROM my_df"). A key performance benchmark on a 1-million-row dataset shows an 8.7x speedup over pandas for aggregations. Khuyen showcases streamlined file handling, including automatic delimiter detection in read_csv, querying multiple files with wildcards like 'data/sales/*.csv', and flattening nested Parquet/JSON structures with simple dot notation.
The piece also covers crucial database features like ACID transactions (BEGIN TRANSACTION, COMMIT) and safe, parameterized queries using ? placeholders.
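A few of those conveniences in plain SQL, with placeholder paths and columns:

```sql
-- Wildcard ingestion with automatic CSV dialect detection (paths are placeholders).
CREATE TABLE sales AS
SELECT * FROM read_csv('data/sales/*.csv');

-- Dot notation flattens nested structures; assumes a 'customer' STRUCT
-- column with a 'city' field in the Parquet files.
SELECT customer.city AS city, sum(amount) AS revenue
FROM read_parquet('data/orders/*.parquet')
GROUP BY city;

-- Plain ACID transactions.
BEGIN TRANSACTION;
DELETE FROM sales WHERE amount < 0;
COMMIT;
```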
Upcoming Events
DuckDB Developer Meeting #1
Pakhuis de Zwijger, Amsterdam: Jan 30, 4:00 PM GMT+1
The first ever DuckDB Developer Meeting, organized by DuckDB Labs. The event will feature talks from DuckDB developers and is aimed at developers who build DuckDB extensions or complex applications on top of DuckDB.
Virtual Workshop: Build a Serverless Lakehouse with DuckLake
Online, Dec 17, 10:00 AM PST
In this hands-on webinar (an encore of our Small Data SF session), Jacob Matson will walk you through building a serverless lakehouse from scratch using DuckLake. Ditch the heavy JVM baggage and get straight to the query.
PREVIOUS POSTS

2025/11/24 - Simon Späti
Branch, Test, Deploy: A Git-Inspired Approach for Data
This article explores how to bring Git style workflows like branching, testing, and deploying to your data stack. Learn how concepts like zero copy cloning and metadata pointers can finally give you isolated test environments.