Hey, friend 👋
This month brings some big performance news: DuckDB's new spatial join operator delivers up to 58× speedups, and pg_duckdb 1.0 officially launched, bringing vectorized analytics directly into PostgreSQL. Plus, real-world cost-saving stories, including one team that slashed their Snowflake BI spend by 79% using DuckDB as a smart caching layer.
- Simon
Featured Community Member

Matt Martin
Matt Martin is a Staff Engineer at State Farm. He is a highly experienced data architect and ETL practitioner with strong BI expertise.
Matt also shares his knowledge through his newsletter, High Performance DE Newsletter, where he writes about topics like DuckDB and the Semantic Layer. He recently made his first contribution to DuckDB, giving back to the open-source community.
Thanks, Matt, for your energy and for spreading your DuckDB knowledge!
Top DuckDB Links this Month
Interactive SQL & DuckDB Tutorial
TL;DR: DB Quacks introduces an interactive, browser-based SQL learning platform powered by DuckDB, targeting developers and data practitioners through gamified, progressive learning challenges.
The project creates a hands-on tutorial environment where users can write and execute real SQL queries. By integrating an interactive tutorial with a potential data-driven adventure game, the platform aims to provide an engaging learning experience that allows immediate query execution and result visualization. The tutorial follows a character named Duckbert and offers progressively challenging exercises that help users master SQL fundamentals. It's a fun way of learning more about DuckDB.
Spatial Joins in DuckDB
TL;DR: DuckDB v1.3.0 introduced a dedicated spatial operator that delivers up to 58× performance improvement over the previous version.
The new SPATIAL_JOIN operator builds an R-tree index on-the-fly for the smaller (right) table during join execution, then streams the left table's rows through this index to efficiently find matches. This approach replaces the previous PIECEWISE_MERGE_JOIN optimization that relied on bounding box inequality checks.
In benchmarks using NYC bike trip data with 58 million rows, Max demonstrated that the new point-in-polygon join now finishes in just 28.7 seconds compared to 107.6 seconds with the previous approach and 1799.6 seconds with a naive nested loop join. Future improvements will include support for larger-than-memory build sides, increased parallelism, and faster native predicate functions.
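To make the shape of such a query concrete, here is a minimal, self-contained sketch of a point-in-polygon join using the spatial extension; the tables and data are illustrative, not taken from Max's benchmark.

```sql
-- A minimal sketch of a point-in-polygon join with the DuckDB spatial extension.
-- Table, column, and data values are illustrative, not from the benchmark.
INSTALL spatial;
LOAD spatial;

CREATE TABLE zones AS
    SELECT 1 AS zone_id,
           ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))') AS geom;

CREATE TABLE trips AS
    SELECT i AS trip_id, ST_Point(i % 20, i % 20) AS pickup
    FROM range(100) t(i);

-- In DuckDB v1.3+ this is planned as a SPATIAL_JOIN: an R-tree is built on the
-- smaller side and the other table's rows are streamed through it.
SELECT z.zone_id, count(*) AS trips_in_zone
FROM trips t
JOIN zones z ON ST_Contains(z.geom, t.pickup)
GROUP BY z.zone_id;
```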
Querying Billions of GitHub Events Using Modal and DuckDB (Part 1: Ingesting Data)
TL;DR: Patrick demonstrates how to download and process 3.5TB of GitHub event data using Modal's serverless infrastructure and DuckDB in just 15 minutes.
Modal is a serverless cloud engine for Python that provides highly concurrent infrastructure to efficiently download 50K+ files of GitHub event data from GitHub Archive. The implementation uses pycurl for faster downloads with failure handling, container-level concurrency, and parallelization across hundreds of containers simultaneously. While DuckDB can directly query remote files, downloading the files first enables faster and more reliable querying across billions of events.
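As a rough sketch of the querying side, once the archive files are on local disk, DuckDB can scan the gzipped JSON directly; the glob path below is a placeholder, not the article's actual layout.

```sql
-- Illustrative only: scan locally downloaded GH Archive files (gzipped JSON)
-- and count events by type. The glob path is a placeholder.
SELECT type, count(*) AS events
FROM read_json_auto('data/gharchive/2024-*.json.gz')
GROUP BY type
ORDER BY events DESC;
```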
DuckDB In Production
TL;DR: Jordan demonstrates how DuckDB can serve as a lightweight federated query engine to validate ELT pipelines and detect deduplication issues between SQL Server and Snowflake.
When facing unexplained duplicate records in Snowflake despite proper CDC configuration in Airbyte, Jordan leveraged DuckDB's ODBC extension to query both the source and target databases side by side. The implementation uses straightforward SQL to compare row counts and distinct primary key counts and to identify duplicates, with connection strings held in variables such as `SET ss_conn = 'Driver={ODBC Driver 18 for SQL Server};[...]'` and `SET sf_conn = 'Driver=SnowflakeDSIIDriver;Server=[...]'`.
This approach required minimal setup with no additional infrastructure, making it an ideal validation layer that can be further extended through Python scripting and containerization.
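To illustrate the pattern (this is not Jordan's exact code), the checks themselves are plain SQL once both systems are reachable from DuckDB; here the two sides are assumed to already be exposed as src_orders and tgt_orders.

```sql
-- Stand-ins for the SQL Server source and Snowflake target; in the real setup
-- these would come from the ODBC connections configured via the SET variables above.
CREATE TABLE src_orders (order_id INTEGER);
CREATE TABLE tgt_orders (order_id INTEGER);

-- Compare row counts and distinct primary keys on each side.
SELECT
    (SELECT count(*) FROM src_orders)                 AS src_rows,
    (SELECT count(*) FROM tgt_orders)                 AS tgt_rows,
    (SELECT count(DISTINCT order_id) FROM tgt_orders) AS tgt_distinct_keys;

-- Primary keys that appear more than once in the target, i.e. the duplicates.
SELECT order_id, count(*) AS copies
FROM tgt_orders
GROUP BY order_id
HAVING count(*) > 1;
```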
DuckDB Can Query Your PostgreSQL. We Built a UI For It.
TL;DR: DataKit integrates DuckDB with PostgreSQL, allowing users to perform OLAP queries on OLTP data through a browser-based UI without needing data replicas.
The integration leverages DuckDB's PostgreSQL extension to create virtual tables on top of PostgreSQL tables. The UI includes data preview, querying, notebooks, visualization, and an assistant. The architecture lets users work seamlessly with multiple data sources in a single interface, opening 15 GB CSV files in seconds while also connecting to PostgreSQL databases, HuggingFace datasets, and other sources.
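Under the hood this builds on DuckDB's postgres extension; a minimal sketch of that mechanism looks like this (connection string and table names are placeholders).

```sql
-- Attach a live PostgreSQL database and query it from DuckDB, no replica needed.
-- Connection details and table names are placeholders.
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=mydb user=postgres host=localhost' AS pg (TYPE postgres);

SELECT *
FROM pg.public.orders
LIMIT 10;
```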
How we used DuckDB to save 79% on Snowflake BI spend
TL;DR: A smart caching layer built on DuckDB cut one team's Snowflake BI spend by 79% while improving query performance, as shared on Reddit.
The r/dataengineering post details a cost-saving strategy implemented using DuckDB as a caching layer for a Snowflake-based BI setup. The key implementation involves unloading Snowflake tables as Parquet files, which DuckDB then reads. A custom-built proxy routes queries to either DuckDB or Snowflake based on table size, operators, and explain plan heuristics.
The cluster setup with DuckDB nodes (32 CPU, 128 GB RAM spot instances) handles a peak of 48K daily queries, with average query time dropping from 3.7 s on Snowflake to 0.455 s on DuckDB. Essentially, it's a smart caching layer for Snowflake.
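A hedged sketch of the DuckDB side of such a setup (the bucket path, table, and column names are placeholders, not details from the post):

```sql
-- Answer a BI-style query straight from the Parquet files unloaded from Snowflake.
-- The S3 path and columns are placeholders.
INSTALL httpfs;
LOAD httpfs;

SELECT customer_id, sum(amount) AS revenue
FROM read_parquet('s3://bi-cache/snowflake_unloads/orders/*.parquet')
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 100;
```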
news-nlp-pipeline: A serverless, event-driven data pipeline for real-time news
TL;DR: A fully serverless, event-driven data pipeline for financial news sentiment analysis built entirely on AWS free-tier services and managed with Terraform.
This project claims to provide production-grade architecture for real-time financial news processing, utilizing AWS Lambda functions triggered by EventBridge schedules and S3 events. The pipeline processes the data with VADER sentiment analysis, validates data quality using Pandas, and stores results as Parquet files in S3. It uses DuckDB to query Parquet files directly from S3 and display results through Streamlit.
The entire infrastructure is defined as code with Terraform, including Lambda functions, S3 buckets, IAM roles, and event triggers. This makes it reproducible.
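For illustration, the serving query behind such a dashboard might look roughly like this (bucket path and column names are placeholders):

```sql
-- Read the enriched Parquet files straight from S3 and aggregate sentiment per day.
-- The S3 path and columns are placeholders.
INSTALL httpfs;
LOAD httpfs;

SELECT date_trunc('day', published_at) AS day,
       avg(sentiment_score)            AS avg_sentiment,
       count(*)                        AS articles
FROM read_parquet('s3://news-nlp-pipeline/enriched/*.parquet')
GROUP BY day
ORDER BY day;
```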
MySQL's New Storage and Execution Engine: DuckDB
TL;DR: Alibaba Cloud RDS has integrated DuckDB as a new storage engine to enhance analytical query performance within MySQL.
The integration uses MySQL's pluggable storage engine architecture to run DuckDB alongside InnoDB, with data automatically replicated and converted via Binlog replication in read-only instances. A key technical challenge was ensuring compatibility between MySQL and DuckDB, which involved extending DuckDB's parser and rewriting many functions to achieve 99% SQL compatibility based on a test suite of 170,000 SQL tests.
Announcing pg_duckdb Version 1.0
TL;DR: pg_duckdb 1.0 is now available, embedding DuckDB's vectorized analytical engine directly into PostgreSQL for faster analytical queries without leaving your database environment.
The 1.0 release introduces deeper MotherDuck integration, expanded support for various data types, improved stability, and better performance, including parallel table scanning. It allows you to join PostgreSQL data with remote data lake files in a single query, making it possible to perform in-database ETL operations that traditionally required external tools. See the full release notes.
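As a hedged sketch of that headline capability, a query like the following (run inside PostgreSQL with pg_duckdb installed) reads a remote Parquet file without leaving the database; the path and column name are placeholders, and the exact read_parquet syntax may vary by version.

```sql
-- Illustrative only: count matching rows in a data-lake Parquet file from PostgreSQL.
SELECT count(*) AS purchases
FROM read_parquet('s3://lake/events/*.parquet') r
WHERE r['event_type'] = 'purchase';
```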
Why Semantic Layers Matter — and How to Build One with DuckDB
TL;DR: My own article on building a simple semantic layer with DuckDB, Ibis, and YAML to manage and query data consistently across different tools. It explains what a semantic layer is and shows how to define metrics and dimensions in YAML files, abstracting away the physical data layer.
Upcoming Events
AI Native Summit 2025 (by Zetta)
September 10 - Online - 9:00 PM CET
This event brings together AI leaders across research, startups and global companies for a day of discussion about the state of enterprise AI. MotherDuck CEO Jordan Tigani is speaking. We are also sponsoring and have a demo booth.
Modern Data Infra Summit
September 18 - 🇺🇸 San Francisco, CA - 9:30 AM Pacific Time
MotherDuck'ing Big Data London Party
September 24 - Kindred, London - 7:00 PM GMT+1
Join us for Beers, Bites, & Good Ducking Fun at Big Data London! Following day 1 of BDL, escape the expo floor and come shake your tail feathers with us. Expect pints aplenty and flocking-fun games.
Subscribe to DuckDB Newsletter
