Python, Talk

Taming file zoos: Data science with DuckDB database files

2025/06/02

Problem statement

Data scientists working in Python often spend the majority of their time cleaning input data, frequently from files. These files come in many formats, can be located anywhere, and sometimes have names like ‘data_final_final_v3.csv’. Data scientists often produce similar files themselves! We call these collections “file zoos”.

Taming file zoos with DuckDB

DuckDB fits perfectly with Python

The MIT-licensed DuckDB database management system was designed to fit into data scientists’ workflows. It installs from pip as a pre-compiled, dependency-free binary. It can read and write dataframes (Pandas, Polars, and Apache Arrow) for interoperability. It also has an advanced persistent file format.

Read and write files with confidence

DuckDB can read and write CSV, Parquet, and JSON files, and even XLSX and Google Sheets. The CSV reader in DuckDB is world-class, quickly querying even messy CSVs. DuckDB interoperates with object stores across clouds and reads lakehouse formats like Delta and Iceberg.

Organize using the DuckDB format

Use DuckDB’s highly compressed columnar file format to persist many large tables all in the same file. Store processing logic in views and functions and even update just portions of the file. DuckDB serves as a catalog when files should remain in place.

Beyond the format itself, DuckDB provides ACID transactional safety and parallel processing. The file format can be read from 15+ languages and is guaranteed to remain readable for years to come. DuckDB also unlocks larger-than-memory analyses, so you can solve 2TB problems, not just 16GB ones!

Extensions

Community extensions enable DuckDB to read additional formats and are provided through a pip-like package repository.

Takeaways

Attendees will learn how to install and use DuckDB locally, how to integrate it seamlessly in their existing Python scripts or Jupyter Notebooks, and how to smoothly manage the deluge of files in their workflow.
