Hey, friend 👋
Hello. I'm Simon, and I am excited to share another monthly newsletter with highlights and the latest updates about DuckDB delivered straight to your inbox. But first, I wish you a happy new year and the best start to 2025.
In this January issue, I gathered ten exciting links, ranging from PyIceberg with a SQLite catalog to 0$ data distribution and using AWS Lambda + DuckDB as a simplified pipeline. We also examine Arrow Flight and gRPC as a middle layer in front of DuckDB, LLM-driven dbt models, and much more. Please enjoy. If you have feedback, news, or any insights, they are always welcome. 👉🏻 duckdbnews@motherduck.com.
Featured Community Member
Julien Hurault
Julien, based in Geneva, is an experienced data engineering consultant specializing in building modern data platforms for organizations aiming to become AI-ready. He is no stranger to this newsletter: we have previously featured several insightful DuckDB posts from his blog, and one of his articles is included in this edition as well. A big thank you to Julien for consistently contributing great technical content to the community!
Top DuckDB Links this Month
PyIceberg: Trying out the SQLite Catalog
Tyler showcases a local SQLite-backed catalog: he loads the Star Wars dataset, creates an Iceberg table, populates it, and then queries it, using both Ibis and PyIceberg along the way. Nifty features include table operations such as deleting rows and exploring snapshots via PyIceberg's API.
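If you want to try the SQLite-backed catalog yourself, a minimal sketch in Python could look like the following. The catalog path, namespace, and table contents below are illustrative rather than taken from Tyler's post, and the exact PyIceberg calls may vary slightly between versions:

```python
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

# A local catalog backed by a SQLite file; the warehouse is where data files land.
catalog = SqlCatalog(
    "local",
    uri="sqlite:///iceberg_catalog.db",
    warehouse="file:///tmp/warehouse",
)

catalog.create_namespace("starwars")

# Create an Iceberg table from a PyArrow schema and append a few rows.
characters = pa.table({
    "name": ["Luke Skywalker", "Leia Organa"],
    "homeworld": ["Tatooine", "Alderaan"],
})
table = catalog.create_table("starwars.characters", schema=characters.schema)
table.append(characters)

# Scan the table back; scans can also be handed to DuckDB or pandas.
print(table.scan().to_arrow())
```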
0$ Data Distribution
In this article, Julien explores "0$ Data Distribution" using Apache Iceberg and DuckDB, leveraging Cloudflare R2 buckets because R2 doesn't charge for egress (data going out, i.e., access by users). He demonstrates how, once the data is uploaded to R2, you can read it for free with ATTACH 'https://catalog.boringdata.io/catalog' AS boringdata; (if you want to do the same with Bluesky data, check: How to Extract Analytics from Bluesky). Julien also discusses potential applications of this approach, such as integrating data directly from services like Stripe, LinkedIn, and Notion with a single command. The key innovation: data providers pay for storage, while consumers pay only for compute.
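In practice, reading the shared data comes down to a couple of statements. Here is a rough sketch using DuckDB's Python client; the commented-out table name is hypothetical, since the actual tables live in the attached catalog:

```python
import duckdb

con = duckdb.connect()
# httpfs is needed to attach a database served over HTTPS.
con.sql("INSTALL httpfs;")
con.sql("LOAD httpfs;")

# Attach the catalog published on Cloudflare R2 (egress is free for the reader).
con.sql("ATTACH 'https://catalog.boringdata.io/catalog' AS boringdata;")

# Browse what is available, then query it like any local table.
con.sql("SHOW ALL TABLES;").show()
# e.g. con.sql("SELECT * FROM boringdata.main.some_table LIMIT 10;").show()
```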
Learning SQLFlow Using the Bluesky Firehose
SQLFlow is a new stream processing engine powered by DuckDB: a lightweight, Python-powered service that executes SQL against streaming data such as Kafka topics or webhooks. Think of it as a way to run SQL against a continuous data stream, with the outputs shipped to sinks such as Kafka. The article walks through examples such as streaming data from the Bluesky Firehose directly to Kafka, transforming streams, and writing to stdout. A key feature is support for rolling window aggregations, which reduce thousands of events into summarized time-based buckets (e.g., 5-minute windows), making it efficient for processing high-volume data streams.
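To give a feel for what such an aggregation looks like, here is a plain DuckDB query bucketing toy events into 5-minute windows. This only illustrates the shape of the SQL; it is not SQLFlow's actual pipeline or configuration format:

```python
import duckdb

con = duckdb.connect()

# A toy events table standing in for a batch of stream records; columns are made up.
con.sql("""
    CREATE TABLE events AS
    SELECT now() - i * INTERVAL 1 SECOND AS event_time,
           'post_' || (i % 3)::VARCHAR  AS event_type
    FROM range(600) t(i);
""")

# Reduce individual events into 5-minute buckets, the way a rolling window
# aggregation summarizes a high-volume stream.
con.sql("""
    SELECT time_bucket(INTERVAL 5 MINUTE, event_time) AS bucket,
           event_type,
           count(*) AS events
    FROM events
    GROUP BY bucket, event_type
    ORDER BY bucket, event_type;
""").show()
```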
AWS Lambda + DuckDB (and Delta Lake)
Daniel checks out DuckDB once more, this time with Lambda functions, and asks, "Is it the Ultimate Data Pipeline?". Thanks to DuckDB, he moves CSV files from S3 into Delta Lake with minimal infrastructure complexity. He builds a Docker image, creates an AWS ECR repository, configures a Lambda function, and demonstrates how data can be processed in real time as files are uploaded to an S3 bucket. The example uses hard drive test data from Backblaze to showcase the pipeline's capabilities. All code is available on GitHub.
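A stripped-down version of such a Lambda handler could look like this. The bucket names, Delta table path, and event shape are assumptions for illustration; Daniel's full, working code is in his GitHub repo:

```python
import duckdb
from deltalake import write_deltalake

def handler(event, context):
    # Triggered by S3 "object created" notifications; grab the new CSV's location.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    con = duckdb.connect()
    con.sql("INSTALL httpfs;")
    con.sql("LOAD httpfs;")
    # Pick up the Lambda execution role's credentials; settings are illustrative.
    con.sql("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")

    # Read the freshly uploaded CSV straight from S3 into Arrow...
    rows = con.sql(f"SELECT * FROM read_csv_auto('s3://{bucket}/{key}')").arrow()

    # ...and append it to a Delta table (target path is hypothetical; writing to S3
    # also needs credentials, e.g. via environment variables or storage_options).
    write_deltalake("s3://my-delta-bucket/hard-drive-stats", rows, mode="append")
    return {"rows_processed": rows.num_rows}
```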
Databases in 2024: A Year in Review
Andy reviews the year across the whole database landscape. In his discussion of DuckDB, he notes that, according to the Fivetran article, the median amount of data scanned by queries is only 100 MB, a volume that a single DuckDB instance can easily handle. Beyond this, Andy goes into the Redis and Elasticsearch license changes, examines the ongoing rivalry between Snowflake and Databricks, and shares fascinating backstories about Oracle's legendary founder, Larry Ellison.
Unlocking DuckDB from Anywhere: A Guide to Remote Access with Apache Arrow and Flight RPC (gRPC)
Mike demonstrates remote access to DuckDB using Apache Arrow and Flight RPC (built on top of gRPC), and shares it as a web app built with Streamlit. The Flight protocol acts as an intermediate layer between the various clients and the DuckDB server, instead of each client accessing DuckDB directly. The code is shared in a Git repo.
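The gist of the server side fits in a few lines; here is a rough Python sketch (not Mike's exact code, and the port, database file, and ticket handling are assumptions):

```python
import duckdb
import pyarrow.flight as flight

class DuckDBFlightServer(flight.FlightServerBase):
    """Serves DuckDB query results over Arrow Flight (gRPC)."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self._con = duckdb.connect("my_database.duckdb")

    def do_get(self, context, ticket):
        # Treat the ticket payload as SQL to execute; a real server would
        # validate and authorize it instead of running arbitrary queries.
        sql = ticket.ticket.decode("utf-8")
        return flight.RecordBatchStream(self._con.sql(sql).arrow())

if __name__ == "__main__":
    DuckDBFlightServer().serve()

# Any client on the network can then fetch results, e.g.:
#   client = flight.connect("grpc://localhost:8815")
#   reader = client.do_get(flight.Ticket(b"SELECT 42 AS answer"))
#   print(reader.read_all())
```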
Should You Ditch Spark for DuckDB or Polars?
Miles investigates single-machine compute engines like DuckDB and Polars and compares them to Spark. He wants to determine which engine comes out ahead based on his benchmark (testing at both 10GB and 100GB scales). His research reveals that Spark remains competitive, especially at larger scales. He evaluates more than raw performance, also considering development cost, engine maturity, and compatibility. The takeaway is not to abandon Spark completely, but to strategically integrate these engines based on specific use cases: Polars and DuckDB for interactive queries, embedded database operation, and other specialized capabilities.
LLM-driven data pipelines with prompt() in MotherDuck and dbt
The new prompt() function enables the transformation of unstructured data sitting in a data warehouse into structured data that can easily be analyzed. It applies LLM-based operations to each row in a dataset while automatically handling parallel model requests, batching, and data type conversions in the background. Adithya demonstrates this capability by transforming individual customer product reviews into multiple extracted attributes using dbt and MotherDuck. This approach is particularly valuable for processing thousands of free-text reviews with varying attributes, a task that would be difficult to automate without LLMs.
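To illustrate the idea outside of dbt, a row-wise extraction with prompt() looks roughly like this; the table, columns, and prompt text are made up for the example, and MotherDuck's structured-output options are left out for brevity:

```python
import duckdb

# Connect to MotherDuck; assumes a motherduck_token is configured in the environment.
con = duckdb.connect("md:")

# Apply an LLM prompt to every row; MotherDuck handles the parallel model
# requests and batching behind the scenes. Table and column names are hypothetical.
con.sql("""
    SELECT
        review_id,
        prompt('Summarize the sentiment of this product review in one word: '
               || review_text) AS sentiment
    FROM product_reviews
    LIMIT 10;
""").show()
```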
DuckDB Node Neo Client
The new DuckDB Node client, Neo, provides a powerful and friendly way to use your favorite database from Node.js. It replaces the old callback-based Node.js API, offering native TypeScript support and intuitive methods for data handling. It lets developers easily access column names and types and read data in both column-major and row-major formats, making it more developer-friendly than its predecessor. While currently in alpha status, Neo's roadmap includes completing several features for the upcoming DuckDB 1.2 release.
owl: Web-based SQL query editor
A simple, open-source, web-based SQL query editor for your files, databases (e.g. Postgres & DuckDB), and cloud storage data.
Upcoming Events
Webinar | Shifting Left and Moving Forward with MotherDuck and Dagster
14 January, Online - 9 AM PT
Explore how MotherDuck and Dagster streamline data workflows, empower teams, and enable seamless transitions from local development to cloud analytics. Perfect for optimizing your processes and accelerating insights.
Compete for a $10,000 prize pool with the Airbyte + MotherDuck Hackathon!
21 January, Online
Webinar | Getting Started with MotherDuck
23 January, Online - 9 AM PT
Looking to get started with MotherDuck and DuckDB? Join us for a live session to learn how MotherDuck makes analytics fun, frictionless, and ducking awesome!
Supercharge DuckDB with MotherDuck: Scale, Share, and Simplify Analytics
31 January, Amsterdam NL - 9 AM CET
Level up your DuckDB experience with a MotherDuck Workshop.
DuckCon #6: Amsterdam
31 January, Amsterdam NL - 3 PM CET
DuckCon #6 is DuckDB's next user group meeting, held in Amsterdam, the Netherlands. The event will take place in person and be streamed online on the DuckDB YouTube channel. Talks will be announced in late October / early November.
Post-DuckCon Drinks: Quack & Cheers
31 January, Amsterdam NL - 7:30 PM CET
Join us for a relaxed and casual gathering with the data community, just a 10-minute walk from DuckCon!