PythonTalk

Taming file zoos: Data science with DuckDB database files

2025/06/02

Problem statement

Data scientists working in Python often spend the majority of their time cleaning input data, frequently from files. These files have many formats, can be located anywhere, and sometimes have names like ‘data_final_final_v3.csv’. Data scientists often produce similar files! We call these “file zoos”.

Taming file zoos with DuckDB

DuckDB fits perfectly with Python

The MIT-licensed DuckDB database management system was designed to fit perfectly into data scientists’ workflows. Install DuckDB’s pre-compiled, dependency-free binary from pip. It can read and write dataframes (Pandas, Polars, and Apache Arrow) for interoperability. It also has an advanced persistent file format.

Read and write files with confidence

DuckDB can read and write to and from csv, parquet, json - even xlsx and Google Sheets. The csv reader in DuckDB is world-class, quickly querying even messy csvs. DuckDB interoperates with object stores across clouds and reads lakehouse formats like Delta and Iceberg.

Organize using the DuckDB format

Use DuckDB’s highly compressed columnar file format to persist many large tables all in the same file. Store processing logic in views and functions and even update just portions of the file. DuckDB serves as a catalog when files should remain in place.

Beyond the format itself, DuckDB provides ACID transactional safety and parallel processing, it can be read in 15+ languages, and is guaranteed to be readable for years to come. It unlocks larger-than-memory analyses to solve 2TB problems, not 16GB ones!

Extensions

Community extensions enable DuckDB to read additional formats and are provided through a pip-like package repository.

Takeaways

Attendees will learn how to install and use DuckDB locally, how to integrate it seamlessly in their existing Python scripts or Jupyter Notebooks, and how to smoothly manage the deluge of files in their workflow.

Related Videos

"Data-based: Going Beyond the Dataframe" video thumbnail

2025-11-20

Data-based: Going Beyond the Dataframe

Learn how to turbocharge your Python data work using DuckDB and MotherDuck with Pandas. We walk through performance comparisons, exploratory data analysis on bigger datasets, and an end-to-end ML feature engineering pipeline.

Webinar

Python

AI, ML and LLMs

"DuckLake: Making BIG DATA feel small (Coalesce 2025)" video thumbnail

2025-10-14

DuckLake: Making BIG DATA feel small (Coalesce 2025)

MotherDuck’s managed DuckLake data lakehouse blends the cost efficiency, scale, and openness of a lakehouse with the speed of a warehouse for truly joyful dbt pipelines. They will show you how!

Talk

"What can Postgres learn from DuckDB? (PGConf.dev 2025)" video thumbnail

20:44

2025-06-13

What can Postgres learn from DuckDB? (PGConf.dev 2025)

DuckDB an open source SQL analytics engine that is quickly growing in popularity. This begs the question: What can Postgres learn from DuckDB?

YouTube

Ecosystem

Talk