I hope you're doing well and enjoying the summer vibes. I'm Simon, and I am excited to share another monthly newsletter with highlights and the latest updates about DuckDB, delivered straight to your inbox.
In this August issue, I gathered 11 links highlighting updates and news from DuckDB's ecosystem. This month, we've got data analysis showing DuckDB's explosive 50.7% growth in developer interest, practical serverless RAG architectures, and several innovative open-source projects demonstrating DuckDB's versatility from Apple Health analytics to real-time news pipelines.
![]() | Featured Community Member |

hafenkran
hafenkran is a Berlin-based software engineer and the author of the BigQuery community extension for DuckDB—one of the top 5 most downloaded community extensions last week with 21.7k downloads. He’s been actively collaborating to improve performance. Thanks for your contributions to the DuckDB ecosystem! If you ever need to move or query data between BigQuery and DuckDB, give it a try!
![]() | Top DuckDB Links this Month |
Analyzing Database Trends Through 1.8 Million Hacker News Headlines
TL;DR: Miguel analyzed 1.8 million Hacker News headlines spanning 18 years to track database popularity trends, revealing PostgreSQL's dominance, DuckDB's explosive 50.7% year-over-year growth, and declining interest in cloud SaaS database engines.
The analysis examined both raw headline counts and engagement metrics (points/comments) across 13 database engines from 2007 to 2025. DuckDB topped the growth chart with half its lifetime headlines appearing in the last year alone, averaging 22.2 points per story. The data shows open-source engines driving most new discussions, with analytics-focused stores like DuckDB and ClickHouse gaining traction as work shifts from batch to interactive. Miguel noted that proprietary cloud databases (DynamoDB, BigQuery, Redshift) showed declines, with Redshift mentions down 84% from its peak.
Quacking Performance: DuckDB
TL;DR: Stuart provides an introduction to DuckDB's column-based architecture and its performance advantages for analytical workloads.
The article explains how DuckDB's columnar storage excels at analytical queries by only processing relevant columns, reducing memory usage and runtime compared to traditional row-based databases. Stuart uses a warehouse analogy to illustrate the difference: row-based systems store complete orders together (efficient for retrieving full records), while column-based systems organize data by column type (efficient for aggregations across many records).
He recommends using DuckDB as an analytics cache with pre-joined datasets stored in columnar formats like Parquet for faster speed.
Summer Data Engineering Roadmap
TL;DR: I have written a 3-week roadmap for learning data engineering fundamentals, from SQL basics to advanced topics like streaming and DevOps.
The roadmap is structured into three progressive weeks: Week 1 covers foundations (SQL, Git, Linux basics), Week 2 addresses core engineering concepts (Python, cloud platforms, data modeling), and Week 3 explores advanced topics (streaming, data quality, DevOps).
AI Write Perfect SQL
TL;DR: Two complementary approaches enable AI assistants to work effectively with DuckDB: ensuring access to current documentation standards, and creating self-correcting SQL workflows that let AI autonomously debug queries.
Mehdi addresses the core challenge of outdated LLM knowledge about DuckDB (where version differences between 0.10 and current 1.3.2 can break code) by implementing the llms.txt specification at duckdb.org/docs/stable/ and motherduck.com/docs/. This standard provides AI-friendly documentation through /llms.txt (structured index) and /llms-full.txt (comprehensive documentation dump), accessible in Cursor via Settings > Features > Docs with @docs references.
Jacob takes this further by building a self-correcting workflow combining Cursor, MotherDuck, and DuckDB that enables autonomous SQL debugging through safe local replication, schema extraction as XML context, and execution rules (duckdb local.db -f {file}) that create a closed feedback loop. In his spatial analysis case study, the AI independently diagnosed empty results by running diagnostic queries and adjusting parameters.
LLM Takeaway Combining current documentation access with isolated execution environments eliminates traditional AI-SQL debugging cycles, letting AI assistants see and fix their own mistakes while working with the latest DuckDB capabilities.
DuckLake 0.2
TL;DR: DuckLake v0.2 introduces credential management with secrets, enhanced Parquet file settings, and improved file organization capabilities.
Key updates include named/unnamed secrets for simplified connection management, granular Parquet controls (compression, versioning, row group sizing), and three-layer relative paths for prefix-based access control. The release also adds name mapping to integrate existing Parquet files without field identifiers.
MCP server for querying Apple Health data with natural language and SQL
TL;DR: Neil provides a Model Context Protocol (MCP) server that enables natural language and SQL querying of Apple Health data using DuckDB as the underlying engine.
This MCP server implementation allows you to analyze your Apple Health data through direct SQL queries or natural language processing via Claude Desktop or other MCP clients. The server expects CSV files exported from the Simple Health Export CSV iOS app and processes various health metrics, including quantitative measurements, categorical health data, and workout information. The project uses DuckDB for efficient data analysis.
Leveraging Claude Code to Build a dlt & Visivo Project
TL;DR: Jared built a complete Spotify analytics solution for Coldplay's music catalog in under 15 minutes using Claude Code, dlt, DuckDB, and Visivo.
Jared demonstrated how Claude Code orchestrated the entire development process, automatically implementing a dlt pipeline that extracted data from Spotify's API (including artist information, 110 albums, and 405 tracks), storing it efficiently in DuckDB, and generating optimized SQL queries for analysis. Claude handled complex analytical requirements, such as musical evolution trends and album release patterns, through automatically generated SQL. It demonstrates how AI-native development can accelerate BI implementation, especially with declarative tools that provide configs and a fast SQL engine.
Serverless single tenant RAG with DuckDB
TL;DR: Summer introduces DuckRAG, a serverless single-tenant RAG architecture powered by DuckDB for secure, pre-computed vector search without burdening production systems.
DuckRAG creates per-user DuckDB database files (4-6MB each) stored on S3 containing embeddings of all content a user has access to. When users prompt the AI, the system generates embeddings for the prompt and uses DuckDB's VSS extension to find relevant content, followed by vector similarity search. Despite data duplication, the architecture is cost-effective (approximately $0.001/month for their team's usage) while eliminating load on production systems. Future improvements include hot caching for repeated prompts and batched writes.
A fully serverless, event-driven data pipeline that ingests, enriches, validates, and visualizes real-time news data using AWS services
TL;DR: This project demonstrates a cost-efficient, serverless data pipeline for real-time financial news sentiment analysis using AWS services and DuckDB for querying Parquet files directly from S3.
The architecture implements an event-driven pipeline where AWS EventBridge triggers Lambda functions to fetch news data twice daily, process it with VADER sentiment analysis, and store enriched data as Parquet files in S3, leveraging DuckDB to query Parquet files directly from S3. The entire infrastructure is deployed as code using Terraform, including S3 buckets, Lambda functions, IAM roles, and event notifications, making it reproducible and maintainable.
A great showcase of how combining serverless components with DuckDB's ability to query Parquet files can create efficient, low-cost data pipelines.
![]() | Upcoming Events |
MotherDuck x molab Show-and-Tell
August 12 - Online : 9:00 PM CET
Live demo: Connect MotherDuck cloud data to Marimo's reactive notebooks in molab. See how to query and visualize your data seamlessly in an interactive Python environment.
Modern Data Infra Summit
September 18 - 🇺🇸 San Francisco, CA - 9:30 AM US, Pacific
Big Data London
September 24 - Olympia, London - 9:00 AM GMT-1
Don’t miss MotherDuck CEO Jordan Tigani speaking in the Data Engineer Theatre—and swing by booth B27 to say hello! We've got some exciting news we'll be sharing that you won't want to miss!
CONTENT
- Featured Community Member
- Top DuckDB Links this Month
- Upcoming Events
Subscribe to DuckDB Newsletter
