This Month in the DuckDB Ecosystem: August 2024

2024/08/01


Hey, friend 👋

It's Mehdi for this edition. And yes, if I'm not behind the camera, I'm behind the keyboard. This month is full of pragmatic projects from the community and interesting blogs from both the DuckDB team and MotherDuck. It's great to see the community starting to build more complex end-to-end solutions! Also note that the Stack Overflow Survey 2024 is out: DuckDB usage has grown from 0.6% to 1.4%, and it ranks #3 among the most desired databases!

MotherDuck, Cloudflare, and Turso also announced Small Data SF, an IRL gathering in San Francisco for data people and developers to learn together and celebrate the simple joys of local development and building with small data and AI. DuckDB Newsletter readers get $100 off tickets with code ‘DuckDB100’. With only 250 tickets total and speakers like Chris Laffra (PySheets) and Wes McKinney (Posit, Pandas), once they’re gone, they’re gone.

Finally, the book DuckDB in Action is officially out 🎉, and you can get a free sample here.

If you have feedback, news, or any insight, they are always welcome. 👉🏻 duckdbnews@motherduck.com.


Mark Needham, Michael Hunger, and Michael Simons

Can you guess what these three folks have in common? Well, I just spoiled the answer above—they all contributed to the DuckDB in Action book! Mark Needham is not only a skilled blogger and video creator at @LearnDataWithMark, but he's also an avid educator in the data community. Michael Hunger has been pioneering product innovation at Neo4j, a leader in graph databases. Lastly, Michael Simons, a Java Champion and Engineer at Neo4j, adds his profound expertise to the mix.

Together, they've contributed to the very first DuckDB book. This is a tremendous effort, especially considering that DuckDB 1.0 was just released—adapting and ensuring everything is current was no small feat. Congratulations to them!

Thank you, Mark, Michael, and Michael, for your significant contributions and for pushing the boundaries of data technology.

Food Transparency in the Palm of Your Hand: Explore the Largest Open Food Database using DuckDB

In this blog, Jeremy tackles a medium-sized dataset (10 - 43 GB) of compressed JSON with ease using DuckDB. He showcases how effectively DuckDB can parse JSON files. The blog provides clear, step-by-step code samples and introduces an interesting dataset about food!

Memory Management in DuckDB

Memory management might seem boring in the sense that if it works, it "just works." However, it is a critical component for a high-performance analytics engine. In this blog, Mark, co-creator of DuckDB, dives into three main behind-the-scenes features that make DuckDB great: streaming execution, intermediate spilling, and the buffer manager. If you are curious about how DuckDB can process files larger than memory, or if you want to learn more about tuning and profiling memory usage, this is a must-read!

Build a Dashboard to Monitor Your Python Package Usage with DuckDB & MotherDuck

This is the last part of a series on an end-to-end data engineering project using DuckDB. I started this series a couple of months ago, and in this blog, we explore how to build a dashboard using Evidence and MotherDuck.

The project is live at duckdbstats.com, and you can find the full source code on GitHub. There's also a video tutorial you can watch here.

A Hybrid Information Retriever with DuckDB

Search is a very hot topic around vector databases and AI, and DuckDB doesn't have to shy away from it: several features enable it to offer search functionality with embeddings. Francois Pacull explores how to implement search functions in Python with DuckDB and open-source embedding models, and applies them to a DBpedia text dataset. For those new to these concepts, he also provides a gentle introduction to hybrid search, lexical search, and fused scores.
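
To make the "fused score" idea concrete, here is a small sketch of reciprocal rank fusion, one common way to merge a lexical ranking with an embedding-based ranking. This is not necessarily the fusion method used in the blog. The function name, the document IDs, and the constant `k = 60` are assumptions for illustration; in practice the two input lists would come from the lexical and semantic searches.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    and the contributions are summed. A higher fused score is better.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: best match first in each list.
lexical = ["doc_b", "doc_a", "doc_c"]   # e.g. from a keyword search
semantic = ["doc_a", "doc_d", "doc_b"]  # e.g. from an embedding similarity search

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

Documents that rank well in both lists (here `doc_a` and `doc_b`) rise to the top, which is the intuition behind hybrid search.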

Crunchy Bridge Adds Iceberg to Postgres & Powerful Analytics Features

Crunchy Data, a cloud Postgres provider, is extending Postgres with DuckDB functionality. This makes sense: DuckDB's Postgres extension is already quite powerful for querying Postgres tables directly, but what if you could use the power of DuckDB without leaving Postgres?

Note: They are not the only ones working on this; watch out for other Postgres Cloud providers 👀.

DuckDB Community Extensions

We shared this in our last newsletter, but there is now an official announcement from DuckDB regarding Community Extensions, along with a website to highlight them. If you want to add your extension there, head over to the community extensions repository and open a PR!

Querying Datasets with the Datasets Explorer Chrome Extension

DuckDB Wasm is great because it enables you to run DuckDB directly in a browser! This opens up interesting use cases for browser extensions, like creating a Firefox extension to display Parquet's metadata or, in this blog, exploring HuggingFace datasets. Caleb Fahlgren walks us through various creative case studies using the spatial extension of DuckDB and some HuggingFace datasets. It's great to see how we can enhance our querying capabilities in our browser, directly on the client, with just an extension!

Data Stack in a Box — New South Wales Department of Education

Data Stack in a Box is not a new concept. As the landscape of data tools becomes more complicated, data professionals are looking for ways to consolidate things. David walks us through another pragmatic end-to-end case study using DuckDB, and you can play with your own data stack in a box with just a click on GitHub Codespaces.

Using DuckDB+dbt, FastAPI for Real-Time Analytics

This is an interesting setup if you need to provide an external interface for common pipelines. The idea here is to put DuckDB + dbt in front of an API using FastAPI. I have already seen such a setup when providing "pipelines as a service" to software engineers where the only thing they would need to do is make an API call. Or, if you have a front-end with lightweight transformations that you want to run, everything can operate here within a Python process with DuckDB!

Delta Lake Meets DuckDB via Delta Kernel

During the DATA+AI Summit 2024 by Databricks, a major announcement was the support of Delta Lake in DuckDB through an extension. The talk is now online and dives into how this extension works. I also delved into that topic during a livestream of Quack&Code with Holly from Databricks, where we discussed table formats, how Delta works generally, and especially with DuckDB.

Upcoming Events

Data Discoverability with Secoda and MotherDuck

31 July

Join Secoda and MotherDuck for a masterclass in using dbt, MotherDuck, and Secoda to enable data producers and consumers, regardless of technical ability, to easily locate and access the data they need.

Location: Online 🌐 - 7:00 PM Central European Summer Time

Type: Online

MotherDuck/DuckDB Meetup: NYC Edition

7 August, New York, NY, USA

We are pleased to announce our next in-person user group meetup in NYC to talk about MotherDuck, DuckDB, and all things data and analytics featuring talks from Nick Ursa, Matt Forrest, and Joseph Machado!

Location: New York, NY 🗽 - 5:00 PM America/New_York

Type: In Person

DuckCon #5 in Seattle

15 August, Seattle, WA, USA

Join us for DuckCon #5, the DuckDB user group meeting, at the SIFF Cinema Egyptian.

Location: Seattle, WA, USA 🇺🇸 - 1:30 PM US/Pacific

Type: In Person

Small Data SF

24 September, San Francisco, CA, USA

Small data and AI is more powerful than you think. Data and AI that was once "Big" can now be handled by a single machine. Join MotherDuck, Ollama, Turso, and Cloudflare in San Francisco.

Location: San Francisco, CA 🌁 - 8:00 AM America/Los_Angeles

Type: In Person
