This Month in the DuckDB Ecosystem: July 2025

2025/07/08 - 7 min read


Hey, friend 👋

I hope you're doing well. I'm Simon, and I am excited to share another monthly newsletter with highlights and the latest updates about DuckDB, delivered straight to your inbox.

In this July issue, I gathered 9 (+2 DuckLake) links highlighting updates and news from DuckDB's ecosystem. The highlights this time are Tributary, which brings seamless Kafka integration to DuckDB, and YamlQL, which lets you query your YAML files with SQL, a convenient fit for declarative data stacks. Additionally, we explore Foursquare's SQLRooms framework for browser-based data applications and various integrations with PostgreSQL, AWS SageMaker, and other enterprise tools that continue to expand DuckDB's reach across the data ecosystem.


Rusty Conover

Rusty Conover is an experienced software executive and engineer with a deep background in distributed systems, databases, and real-time data processing. At DuckCon 2025, he presented Airport for DuckDB: Letting DuckDB Take Apache Arrow Flights, exploring how to connect DuckDB to Apache Arrow Flight for high-performance data transfer.

He also recently released Tributary, a DuckDB community extension built at Query.Farm that enables real-time SQL access to Kafka streams—making it possible to query Kafka topics directly in DuckDB without external pipelines.

Rusty is focused on practical solutions that simplify complex systems, and on building strong engineering teams that deliver meaningful tools for developers.


YamlQL: Query your YAML files with SQL and Natural Language

TL;DR: YamlQL is a new tool that transforms YAML files into queryable relational databases using DuckDB, allowing users to run SQL queries against complex YAML structures.

YamlQL converts YAML structures into relational schemas by flattening nested objects with underscore separators, transforming lists of objects into multi-row tables, and extracting nested lists into separate tables that can be JOINed. The tool offers both a CLI and a Python library, with commands for SQL querying (yamlql sql), schema discovery (yamlql discover), and natural language querying through various LLM providers, all without sending your data externally. In practice, it shines with complex configuration files such as Kubernetes manifests, where traditional tools like jq/yq fall short for relational queries.
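To make that concrete, here is a hypothetical query against a flattened Kubernetes Deployment manifest. The table and column names are illustrative, derived from the flattening rules above, and the CLI invocation in the comment is an assumption rather than exact YamlQL syntax:

```sql
-- Hypothetical: run via something like `yamlql sql deployment.yaml`
-- with the query below (exact CLI flags may differ).
-- Nested keys become underscore-separated columns, and the list of
-- containers becomes a multi-row table.
SELECT
    name,
    image,
    resources_limits_memory
FROM spec_template_spec_containers
WHERE resources_limits_memory IS NOT NULL;
```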

Kafka: Tributary DuckDB Extension

TL;DR: The Tributary DuckDB extension provides Apache Kafka integration, enabling real-time data streaming and querying directly within DuckDB's SQL interface.

The Tributary extension, developed by Query.Farm, introduces native Kafka topic scanning through SQL functions like tributary_scan_topic(), which lets developers consume messages from Kafka topics with minimal configuration. It supports the standard Kafka connection parameters and multi-threaded consumption across topic partitions, acting as a "bridge between the stream of data and the data lake".
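A minimal sketch of what that can look like, assuming a local broker and a topic named 'orders' (both placeholders); the named-parameter style follows the extension's announced tributary_scan_topic() function:

```sql
-- One-time install of the community extension.
INSTALL tributary FROM community;
LOAD tributary;

-- Scan a Kafka topic directly from SQL; broker and topic are placeholders.
SELECT *
FROM tributary_scan_topic(
    'orders',
    "bootstrap.servers" := 'localhost:9092'
);
```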

Foursquare Introduces SQLRooms

TL;DR: Foursquare has released SQLRooms, an open-source React framework for building single-node data applications powered by DuckDB, which run entirely in browsers or on laptops without requiring backend infrastructure.

SQLRooms combines five core components: RoomShell (UI container), RoomStore (state management), an embedded DuckDB instance, an AI-powered analytics assistant, and a reusable component library.

The framework automatically handles DuckDB operations, including format recognition (CSV, Parquet, JSON, Arrow), schema inference, and table registration for immediate querying. It leverages recent advances in browser capabilities (PWAs, WebAssembly, OPFS) and local AI deployment, enabling data applications that process multi-gigabyte datasets with sub-second query times while keeping data private. The code and a dedicated website are available.

Quacks & Stacks: DuckLake's One‑Table Wonder vs Iceberg's Manifest Maze

TL;DR: DuckLake introduces a simplified metadata management approach for data lakes by centralizing metadata tracking in SQL tables, contrasting with Apache Iceberg's distributed file-based approach.

Thomas demonstrates how DuckLake reimagines table metadata management by storing all tracking information directly in SQL tables, utilizing functions such as ducklake_snapshots() and ducklake_table_info() to provide transparent metadata access. Unlike Iceberg's complex manifest hierarchy (involving JSON → manifest lists → manifests → data files), DuckLake uses a single-transaction model for updates: UPDATE lake.sales_data SET amount = amount * 1.15 WHERE region = 'North'.
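A rough sketch of the flow, assuming a local DuckDB file as the metadata catalog; the paths, the catalog alias, and the exact argument style of the metadata functions are illustrative, and the UPDATE is the example from the post:

```sql
-- Attach a DuckLake catalog; metadata lives in plain SQL tables.
INSTALL ducklake;
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data/');

-- Inspect metadata through table functions instead of manifest files.
SELECT * FROM ducklake_snapshots('lake');
SELECT * FROM ducklake_table_info('lake');

-- Changes are committed as a single transaction against the catalog.
UPDATE lake.sales_data SET amount = amount * 1.15 WHERE region = 'North';
```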

More about DuckLake: 

📺 Understanding DuckLake: A Table Format with a Modern Architecture 

📰 MotherDuck Managed DuckLakes Now in Preview: Scale to Petabytes 

📝 Digging into Ducklake

DuckDB Wizard: A DuckDB extension that executes JS and returns a table

TL;DR: Nico's Wizard extension for DuckDB enables natural language queries and direct JavaScript execution within SQL via an embedded V8 interpreter.

The Wizard extension leverages LLMs (OpenAI/Anthropic) to translate natural language into JavaScript code that executes in a sandboxed Deno environment, returning results as DuckDB tables. Users can either use the wizard() function for natural language queries like SELECT * FROM wizard('bitcoin price') or execute arbitrary JavaScript directly with js(). Nico emphasizes that this is highly experimental and not for production use. If you need a production-ready option, check out MotherDuck's PROMPT() function.
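For flavor, the two entry points look roughly like this; the js() payload is a made-up illustration, and since the extension is experimental the exact argument and return shapes may change:

```sql
-- Natural-language prompt, answered via an LLM plus sandboxed JavaScript.
SELECT * FROM wizard('bitcoin price');

-- Run JavaScript directly and get the result back as a table
-- (the payload shown here is only an illustrative guess).
SELECT * FROM js('[{answer: 21 * 2}]');
```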

How to Enable DuckDB/Smallpond to Use High-Performance DeepSeek 3FS

TL;DR: The Open3FS community has developed a DuckDB-3FS plugin enabling DuckDB and Smallpond to access DeepSeek's 3FS storage using its high-performance user-space interface (hf3fs_usrbio).

The plugin supports two path formats (3fs://3fs/path and /3fs/path) and requires minimal configuration. DeepSeek reported that with 3FS and Smallpond, 50 compute nodes sorted 110.5 TiB of data in just over 30 minutes (3.66 TiB/minute throughput). The implementation is available in two open-source repositories: duckdb-3fs and smallpond-3fs, allowing the DuckDB ecosystem to leverage 3FS storage performance fully.
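Once the plugin is loaded, reads look like any other DuckDB file scan. Here is a hedged sketch using placeholder paths in both supported formats:

```sql
-- Both path styles point at the same 3FS-backed data (paths are placeholders).
SELECT count(*) FROM read_parquet('3fs://3fs/warehouse/events/*.parquet');
SELECT count(*) FROM read_parquet('/3fs/warehouse/events/*.parquet');
```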

Using Amazon SageMaker Lakehouse with DuckDB

TL;DR: Tobias demonstrates how to integrate Amazon SageMaker Lakehouse with DuckDB using AWS Glue Iceberg REST endpoints to query S3 Tables.

In this technical walkthrough, we learn how to connect DuckDB to AWS SageMaker Lakehouse, starting with the necessary IAM setup. Once the AWS infrastructure is configured, the DuckDB integration is straightforward, requiring only two key commands: CREATE SECRET with STS assume role configuration and ATTACH with ICEBERG type and GLUE endpoint parameters. After this setup, users can run standard SQL queries directly against the data lake. The resulting DuckDB integration provides a lightweight, SQL-based access layer to data stored in S3 Tables.
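In rough strokes, and with placeholder account, bucket, and table names, the DuckDB side looks something like this; the exact secret parameters for the STS assume-role setup follow Tobias's walkthrough rather than this sketch:

```sql
INSTALL iceberg;
LOAD iceberg;

-- Credentials via the AWS credential chain; the assume-role details
-- are configured as described in the walkthrough.
CREATE SECRET glue_secret (
    TYPE s3,
    PROVIDER credential_chain,
    CHAIN 'sts'
);

-- Attach the Glue Iceberg REST endpoint; identifiers are placeholders.
ATTACH '111122223333:s3tablescatalog/my-table-bucket' AS lakehouse (
    TYPE iceberg,
    ENDPOINT_TYPE glue
);

SELECT count(*) FROM lakehouse.analytics.events;
```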

PostgreSQL and Ducks: The Perfect Analytical Pairing

TL;DR: This article explores three methods for integrating PostgreSQL with DuckDB/MotherDuck for analytical workloads: DuckDB Postgres Extension, pg_duckdb, and Supabase's ETL (CDC).

The DuckDB Postgres Extension offers the most straightforward approach, requiring minimal setup with commands like INSTALL postgres; LOAD postgres; ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE postgres, READ_ONLY); to query PostgreSQL data remotely. The pg_duckdb extension embeds DuckDB directly within PostgreSQL, delivering impressive performance gains (up to 1,500x speedup on one TPC-DS query, according to Jacob and Aditya), but requires careful resource management, ideally on a dedicated read replica. And finally, Supabase's ETL provides near real-time data synchronization through PostgreSQL's logical decoding capabilities.
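To make the first option concrete, the end-to-end flow is short; the table name below is a placeholder:

```sql
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE postgres, READ_ONLY);

-- Analytical query against live PostgreSQL data, executed by DuckDB.
SELECT region, count(*) AS orders
FROM db.public.orders
GROUP BY region
ORDER BY orders DESC;
```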

Announcing DuckDB 1.3.0

TL;DR: DuckDB 1.3.0 "Ossivalis" introduces a file cache for remote data, a new spatial join operator, and improved Parquet handling alongside several breaking changes.

Besides the major DuckLake announcement, we also got release 1.3.0 (and the bug-fix release 1.3.1). The 1.3 release introduces performance improvements through an external file cache that dynamically stores data from remote files, reducing query times on subsequent runs (e.g., S3 queries see a 4x speedup).

New features include Python-style lambda syntax (lambda x: x + 1), the TRY expression for error handling (TRY(log(0)) returns NULL instead of erroring), UUID v7 support, and a specialized spatial join operator that's up to 100x faster than previous implementations. Internal improvements include a complete rewrite of the Parquet reader/writer and a new string compression method (DICT_FSST).
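A quick taste of the new syntax, based on the examples in the release notes; the UUID v7 function name shown is my assumption:

```sql
-- Python-style lambda syntax.
SELECT list_transform([1, 2, 3], lambda x: x + 1);  -- [2, 3, 4]

-- TRY returns NULL instead of raising an error.
SELECT TRY(log(0));  -- NULL

-- UUID v7 support (function name assumed here).
SELECT uuidv7();
```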


Upcoming Events

Small Data SF: Workshop Day!

San Francisco, CA, USA - 12:00 PM Pacific Time - In Person

Make your big data feel small, and your small data feel valuable. Join leading data and AI innovators on November 4th and 5th in San Francisco!

Small Data SF: Keynotes and Sessions

San Francisco, CA, USA - 8:30 AM Pacific Time - In Person

Make your big data feel small, and your small data feel valuable. Join leading data and AI innovators on November 4th and 5th in San Francisco!
