DuckDB Monthly #41: DuckDB internals course, FTS walkthrough, and a satellite pipeline with H3 + Parquet

- 7 min read

HEY, FRIEND 👋

I hope you're doing well. I'm Simon, and I am happy to share another monthly newsletter with highlights and the latest updates about DuckDB, delivered straight to your inbox.

In this May issue, I gathered the usual 10 updates and news highlights from DuckDB's ecosystem. This month leans toward the depth that makes DuckDB itself interesting: Torsten's 15-week University of Tübingen course on DuckDB internals, Pete's hands-on full-text search walkthrough on a multi-GB email corpus, and Mark's satellite-tracking pipeline that turns the GCAT catalog into H3-cell heatmaps and ZSTD Parquet. You'll also find Adam's duck_lineage extension for automatic column-level data lineage, a story-driven SQL learning game, and notable releases including DuckDB 1.5.0 "Variegata" and DuckLake v1.0.

📍 PS If you're in SF June 1-3, swing by The Dive, MotherDuck's home base during Snowflake Summit, with talks & panels from folks at Anthropic, Lovable, Notion, a16z and more. No Summit badge needed. Register here.

If you have feedback, news, or any insights, they are always welcome. 👉🏻 duckdbnews@motherduck.com.

Hoyt Emerson

Data engineer & DuckDB educator

Hoyt Emerson has become one of the most consistent and approachable voices teaching DuckDB on the open web. Through his YouTube channel and Substack, Hoyt publishes hands-on walkthroughs that turn DuckDB features into practical workflows, from local-first analytics patterns to thoughtful deep dives like his recent piece on where DuckDB fits in the modern data stack.

What stands out is the educator's instinct: short, focused content that meets people exactly where they are. Catch Hoyt live on June 3 in San Francisco at the MotherDuck + DuckDB June Meetup, where he'll be talking about Agents & ETL with DuckLake.

Connect with Hoyt on LinkedIn.


Automatic Column-Level Data Lineage for DuckDB

TL;DR: Adam released duck_lineage, an open-source DuckDB extension that provides automatic column-level data lineage by intercepting the logical plan pre-optimization.

It integrates with OpenLineage, an open framework for data lineage collection and analysis, by generating OpenLineage events. Load it with LOAD duck_lineage; SET duck_lineage_url = 'http://localhost:5000..'; and you get a local web interface with ilum showing column-level lineage. Since duck_lineage is fully open source, it works with any OpenLineage-compatible backend. It's a great addition to DuckDB for diagnosing data quality issues, something everyone deals with.

For orchestration, it links DuckDB runs to parent pipelines (e.g., Airflow, Dagster) by reading environment variables like OPENLINEAGE_PARENT_RUN_ID. The code is at GitHub, and you can read a more technical deep dive in addition to the above LinkedIn overview post.
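Based on the post, the setup reduces to loading the extension and pointing it at a collector. A minimal sketch, assuming duck_lineage is distributed via the community repository; the endpoint URL, table names, and columns below are illustrative, not from the post:

```sql
-- Install and load the extension (assumes community distribution)
INSTALL duck_lineage FROM community;
LOAD duck_lineage;

-- Point it at any OpenLineage-compatible backend (URL is illustrative)
SET duck_lineage_url = 'http://localhost:5000/api/v1/lineage';

-- Queries now emit OpenLineage events with column-level lineage; here
-- amount_usd would trace back to orders.amount and fx_rates.rate
CREATE TABLE enriched AS
SELECT o.order_id,
       o.amount * fx.rate AS amount_usd
FROM orders AS o
JOIN fx_rates AS fx USING (currency);
```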

Full-Text Search with DuckDB

TL;DR: DuckDB's Full-Text Search (FTS) extension offers a powerful and easily deployable solution for initial text data exploration and analysis.

Pete pre-processes .eml files into JSON with Python before ingesting them via read_json('*.eml.json'), enabling rapid indexing and querying of a multi-GB email corpus. He walks through the simple installation and PRAGMA create_fts_index('table', 'id', 'column1', 'column2') for indexing multiple columns with configurable stemming, stop words, and accent stripping. Relevance ranking can be fine-tuned via Okapi BM25 parameters, and exact phrase matching is supported as well.
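The end-to-end flow can be sketched with DuckDB's FTS API; the emails table and its subject/body columns are assumptions standing in for the corpus described in the post:

```sql
INSTALL fts;
LOAD fts;

-- Ingest the pre-processed emails (one JSON document per .eml file)
CREATE TABLE emails AS
SELECT * FROM read_json('*.eml.json');

-- Build the index over two columns; stemming, stop words, and
-- accent stripping are configurable via named parameters
PRAGMA create_fts_index('emails', 'id', 'subject', 'body', stemmer := 'porter');

-- BM25-ranked search: k tunes term-frequency saturation,
-- b tunes document-length normalization
SELECT id, subject,
       fts_main_emails.match_bm25(id, 'quarterly forecast',
                                  k := 1.2, b := 0.75) AS score
FROM emails
WHERE score IS NOT NULL
ORDER BY score DESC
LIMIT 10;
```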

SQL Protocol: The SQL Game That Teaches Real Queries

TL;DR: SQL Protocol offers a free, browser-based game enabling users to write and execute SQL queries through interview drills and 1v1 PvP.

SQL Protocol teaches SQL through story-driven missions: you play as a character, explore the world with the arrow keys, and trigger quests by hitting space. The quests are SQL quizzes, and every solved quest levels you up. A really fun way to learn.

Announcing DuckDB 1.5.0

TL;DR: DuckDB 1.5.0 "Variegata" introduces technical improvements, including a revamped CLI, native semi-structured and geospatial data types, and performance gains.

The CLI is more ergonomic, with dynamic prompts that display the current database.schema, a .tables dot command, and result paging, alongside an experimental PEG parser. A native VARIANT type now stores typed, binary semi-structured data with functions like variant_typeof() and variant_extract(), offering better compression and query performance than JSON.
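The VARIANT functions might look like this in practice. The announcement only names the functions, so the path-style second argument to variant_extract() is an assumption:

```sql
-- VARIANT holds typed, binary semi-structured values
CREATE TABLE events (payload VARIANT);
INSERT INTO events
SELECT {'user': 42, 'tags': ['signup', 'mobile']}::VARIANT;

-- Inspect the stored type and pull out a single field
SELECT variant_typeof(payload)          AS top_level_type,
       variant_extract(payload, 'user') AS user_id
FROM events;
```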

Other notable additions include COPY support for Azure writes (az://...), an ODBC scanner, and a configurable geometry_always_xy setting to manage a gradual breaking change in spatial axis order.

📝 Also see the two newer minor releases:

  • Announcing DuckDB 1.5.1: A patch release with bugfixes, performance improvements and support for the Lance lakehouse format.
  • Announcing DuckDB 1.5.2: A patch release with bugfixes and performance improvements, and support for the DuckLake v1.0 lakehouse format.

10K+ Satellites in Space

TL;DR: Mark details a data pipeline for converting the General Catalog of Artificial Space Objects (GCAT) TSV datasets into optimized Parquet files using DuckDB for comprehensive spatial and attribute analysis.

Mark ingests several GCAT TSV files, including organizations, launch platforms, launch sites, launch vehicles, and satellites, into DuckDB using the H3, JSON, Lindel, Parquet, and Spatial extensions. He applies robust data cleaning and type casting during the COPY process and exports to ZSTD-compressed Parquet.
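A sketch of that ingest-and-export step; the file name, column names, and null marker are illustrative stand-ins for the real GCAT layout:

```sql
-- Read one GCAT TSV with explicit types and cleaning at ingest time
CREATE TABLE satellites AS
SELECT *
FROM read_csv('satcat.tsv',
              delim   = '\t',
              header  = true,
              nullstr = '-',                          -- treat '-' as NULL
              types   = {'Mass': 'DOUBLE',
                         'Launch_Date': 'VARCHAR'});  -- cast further as needed

-- Export to ZSTD-compressed Parquet
COPY satellites TO 'satellites.parquet' (FORMAT parquet, COMPRESSION zstd);
```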

With H3, he generates heatmaps of organization and launch site locations by converting latitude/longitude pairs to H3 cells using H3_LATLNG_TO_CELL, then to WKT boundaries via H3_CELL_TO_BOUNDARY_WKT.
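Using the community H3 extension, the heatmap aggregation reduces to roughly this; the table and column names, and the choice of resolution 3, are assumptions:

```sql
INSTALL h3 FROM community;
LOAD h3;

-- Bucket launch sites into resolution-3 H3 cells, count per cell,
-- and emit each cell's boundary as WKT for mapping
SELECT h3_latlng_to_cell(latitude, longitude, 3) AS cell,
       count(*)                                  AS n_sites,
       h3_cell_to_boundary_wkt(cell)             AS boundary_wkt
FROM launch_sites
GROUP BY cell;
```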

Design and Implementation of DuckDB Internals

TL;DR: Torsten's "Design and Implementation of DuckDB Internals" (DiDi) course provides an in-depth exploration of core engineering principles underpinning DuckDB's analytical capabilities.

Torsten's 15-week course, developed at the University of Tübingen, systematically unpacks the internal components of DuckDB and the advanced techniques that enable its high-performance analytical query processing. It covers efficient memory management, sophisticated grouped aggregation, and optimized strategies for sorting large tables.

You'll find the course slides and code example materials on GitHub.

DuckDB uses RDBMS to tackle lakehouse 'small changes' issue

TL;DR: DuckDB Labs has introduced the DuckLake v1.0 format to address the inefficiency of handling small database changes in lakehouse architectures.

DuckLake v1.0 uses an RDBMS to manage metadata for lakehouse implementations, such as those built on the Apache Iceberg and Delta Lake formats. Instead of writing a new file to the object store for every small change, the approach batches those changes through the metadata database, such as PostgreSQL or DuckDB.
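In DuckDB terms, attaching a DuckLake catalog backed by Postgres looks roughly like this; the connection string, bucket path, and table are illustrative, not from the article:

```sql
INSTALL ducklake;

-- Metadata lives in Postgres, data files in object storage
ATTACH 'ducklake:postgres:dbname=lake_meta host=localhost' AS lake
       (DATA_PATH 's3://my-bucket/lake/');

-- Small writes flow through the metadata database instead of
-- producing one tiny Parquet file per commit
CREATE TABLE lake.events (id INTEGER, ts TIMESTAMP);
INSERT INTO lake.events VALUES (1, now());
```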

Why I'm replacing Polars with DuckDB

TL;DR: Daniel is replacing Polars with DuckDB in his AWS Lambda data processing workflows due to recurring production stability issues and concerns over Polars' maintainer support and developer experience.

Daniel is a heavy Polars user in AWS Lambdas for S3-based data ingestion, transformation, and more. However, he encountered constant challenges, including memory issues that were dismissed upstream and unexpected breaking changes when upgrading to polars==1.31.0 in a public.ecr.aws/lambda/python:3.13 environment, leading to Lambda failures. This is a paid post, but the first part is free to read.

Rethinking the Semantic Layer: AI Query Discovery vs. Manual Data Modeling

TL;DR: Jacob proposes rethinking the semantic layer from a static definition problem to a dynamic search problem using AI to discover business logic from query history.

The system mines query logs to learn from how data is actually queried, instead of relying on manually configured metric definitions.

Jacob compares the semantic layer approach with an LLM approach, illustrating when an LLM is enough and what the semantic layer is used for.

Internal vs. External Storage: What's the Limit of External Tables?

TL;DR: External tables offer significant cost benefits for archival data storage but involve a performance tradeoff.

Full disclosure: this article was written by me, but as external tables keep being re-added to new platforms, I decided to include it. External tables act as pointers to data files, allowing SQL querying without moving the data. The article traces their history from Oracle's 2001 implementation to modern incarnations in cloud warehouses, dbt, and DuckLake.

External tables can drastically lower storage costs by using cheap object storage, although you pay for it in performance. I also note that modern external tables aren't that external anymore: they are increasingly managed.
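In DuckDB, the external-table pattern is simply a view over files in object storage; the bucket path and column below are illustrative and assume the httpfs extension plus S3 credentials are configured:

```sql
-- A pointer to data, not a copy of it
CREATE VIEW archived_orders AS
SELECT * FROM read_parquet('s3://archive-bucket/orders/*.parquet');

-- Queried like any table, at object-storage speeds
SELECT count(*) AS n
FROM archived_orders
WHERE order_year = 2020;
```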


Upcoming Events

MotherDuck + DuckDB May Meetup

2026-05-21, 18:00. San Francisco, CA, USA

DuckDB and MotherDuck users, we're back for May. Come hang, hear what people are building, and grab a drink. Co-hosted with Greybeam. Talks, food, drinks, and merch!

The Dive — MotherDuck at Snowflake Summit 2026

2026-06-01 to 2026-06-03. San Francisco, CA, USA

MotherDuck's home base during Snowflake Summit. Free coffee, workshops, panels and a party with talks from folks at Anthropic, Lovable, Notion, a16z, Datadog, Hex and more. No Summit badge needed.

MotherDuck + DuckDB June Meetup (with Hoyt Emerson)

2026-06-03, 17:30. San Francisco, CA, USA

Special edition meetup at The Dive! Featuring Hoyt Emerson on Agents and ETL with DuckLake, and Alex Monahan on contributing C++ to DuckDB with AI.

Subscribe to the MotherDuck blog

PREVIOUS POSTS

Vibe Coding has come for BI

2026/04/30 - Jacob Matson

The agent writes the SQL and the data viz now. Here's what's still your job — and how to implement data-layer changes that took a very hard natural language to SQL benchmark from 30% to 93% accuracy with Gemini Flash.