Python, Talk

Taming file zoos: Data science with DuckDB database files

2025/06/02

Problem statement

Data scientists working in Python often spend the majority of their time cleaning input data, frequently from files. These files have many formats, can be located anywhere, and sometimes have names like ‘data_final_final_v3.csv’. Data scientists often produce similar files! We call these “file zoos”.

Taming file zoos with DuckDB

DuckDB fits perfectly with Python

The MIT-licensed DuckDB database management system was designed to fit perfectly into data scientists’ workflows. Install DuckDB’s pre-compiled, dependency-free binary with pip. It can read and write dataframes (pandas, Polars, and Apache Arrow) for interoperability. It also has an advanced persistent file format.

Read and write files with confidence

DuckDB can read and write CSV, Parquet, and JSON files, and even XLSX and Google Sheets. The CSV reader in DuckDB is world-class, quickly querying even messy CSVs. DuckDB interoperates with object stores across clouds and reads lakehouse formats like Delta and Iceberg.

Organize using the DuckDB format

Use DuckDB’s highly compressed columnar file format to persist many large tables all in the same file. Store processing logic in views and functions and even update just portions of the file. DuckDB serves as a catalog when files should remain in place.

Beyond the format itself, DuckDB provides ACID transactional safety and parallel processing. Its files can be read from 15+ languages and are guaranteed to remain readable for years to come. It unlocks larger-than-memory analyses to solve 2TB problems, not 16GB ones!

Extensions

Community extensions enable DuckDB to read additional formats and are provided through a pip-like package repository.

Takeaways

Attendees will learn how to install and use DuckDB locally, how to integrate it seamlessly into their existing Python scripts or Jupyter Notebooks, and how to smoothly manage the deluge of files in their workflow.

