Integrating with the Python Ecosystem
This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the book for free to read the complete chapter.

DuckDB offers a sophisticated Python integration that goes far beyond the standard Python Database API (DB-API 2.0). While it fully supports the traditional cursor-based approach used by libraries like sqlite3, the real power lies in DuckDB's Relational API. This API blurs the line between the database engine and the Python runtime, allowing for high-performance analytics without the overhead of constantly moving data between processes.
6.1 Getting started
Getting up and running with DuckDB is intentionally minimal, designed to allow data analysts to integrate the embedded database into their local environment or CLI-based pipelines in seconds.
Note: While the chapter focuses on the native DuckDB API, the book briefly mentions that enterprise users can also use SQLAlchemy, the popular Python SQL toolkit, via the duckdb_engine driver.
6.1.1 Installing the Python package
The package is published to PyPI and requires no external server configuration. You can install it with a single command:
pip install duckdb
Once installed, you can verify the setup by importing the library and checking duckdb.__version__.
6.1.2 Opening up a database connection
Unlike client-server databases (like PostgreSQL or MySQL), DuckDB runs embedded within your Python process. This means there is no network overhead. The API provides two primary ways to connect:
- Implicit in-memory connection: by simply running commands on the duckdb module (e.g., duckdb.sql()), you interact with a default, global in-memory database.
- Explicit connections: for better control, you can create connection objects using duckdb.connect(). This allows you to connect to:
  - In-memory databases, using :memory:.
  - Persistent files, by providing a file path (e.g., 'my-db.duckdb').
  - Read-only mode, to safely access production data files.
6.2 Using the relational API
The centerpiece of DuckDB’s Python integration is the DuckDBPyRelation object. This object represents a symbolic query or a relation—essentially a generalized version of a table composed of tuples and attributes.
The API follows the "Duck Test" philosophy: If it looks like a table, you can query it. This allows DuckDB to execute SQL queries directly against diverse Python objects. The relational API is lazy; defining a relation does not execute the query. Execution is only triggered when results are requested via methods like show(), fetchall(), or print().
6.2.1 Ingesting CSV data with the Python API
While you can use SQL COPY commands, the Python API offers a native read_csv function that integrates seamlessly with Python workflows. When you call read_csv, it returns a relational object immediately without loading all data into memory. The book demonstrates this by ingesting a remote CSV file directly via HTTP:
population = con.read_csv("https://bit.ly/3KoiZR0")
For iterative analysis, the chapter highlights a crucial optimization: using the .to_table("table_name") method. Because the relational object represents a view of the file, repeated actions might trigger repeated file reads. Persisting the relation to a DuckDB table in memory ensures that subsequent queries are instantaneous.
6.2.2 Composing queries
The Relational API serves as a powerful query builder that is safer and more composable than raw SQL strings. Instead of concatenating strings—which is prone to SQL injection—you can use context-aware methods:
- Filter & project: relation.filter('col > 10').project('col1, col2')
- Aggregation: relation.aggregate('avg(price)')
- Set operations: the API supports multi-relation algebra, including .union(), .join(), .intersect(), and .except_().
A key advantage demonstrated in the chapter is reusability. You can assign a filtered relation to a Python variable (e.g., high_value_sales) and use that variable as the base for multiple distinct downstream queries.
6.2.3 SQL querying
While the Relational API is powerful, there are times when standard SQL is preferred, particularly for parameterized queries. DuckDB allows you to transition fluidly between Python objects and SQL text.
- duckdb.sql(): used for quick, ad-hoc queries on the default connection.
- duckdb.execute(): conforms to the Python DB-API 2.0. This is critical for parameterized queries, which protect against SQL injection and improve performance by letting the engine parse the query once and reuse its plan with different values.
Example of a parameterized query:
con.execute("SELECT * FROM sales WHERE id = $id", {"id": 5})
6.3 Querying pandas DataFrames
DuckDB’s popularity in the Data Science community stems from its seamless integration with pandas. You can execute SQL queries directly against pandas DataFrames held in memory.
If you have a DataFrame named df, you can simply run duckdb.sql("SELECT * FROM df WHERE col > 100"). DuckDB reads the memory structures of pandas directly, bypassing the slow serialization/deserialization steps usually required by other databases. You can also export query results back to pandas efficiently using the .fetchdf() method.
Note: The SQLAlchemy driver also supports querying DataFrames by registering them on an instance of the driver.
6.4 User-defined functions
When built-in SQL functions aren't enough, DuckDB allows you to register Python functions as User-Defined Functions (UDFs). This brings the vast Python ecosystem (libraries like locale, numpy, or specialized math libraries) into your SQL queries.
The workflow involves:
- Defining a standard Python function with type hints.
- Registering it via con.create_function().
- Invoking it within a SQL statement.
The chapter provides a practical example of cleaning messy data using Python's locale library to parse non-standard currency formats, solving data quality issues that would be incredibly difficult using standard SQL alone.
6.5 Interoperability with Apache Arrow and Polars
DuckDB acts as a high-performance bridge in the modern data stack by supporting Apache Arrow. Since libraries like Polars and PyArrow share this columnar memory format, DuckDB can query them with zero-copy overhead.
- Polars integration: you can convert a DuckDB relation to a Polars DataFrame using .pl().
- Arrow integration: you can export results to Arrow tables using .to_arrow_table().
This interoperability allows developers to "delay the transition" to Python objects as long as possible, pushing the heavy lifting to DuckDB's robust query engine, and only converting to Polars or Arrow when specific non-SQL processing is required.
Summary
- DuckDB's Python API seamlessly blends SQL and Python, allowing you to treat in-memory objects and database tables uniformly.
- The Relational API ("fluent API") offers a secure, programmatic way to build and reuse query logic.
- Zero-copy integrations with pandas, Polars, and Apache Arrow make DuckDB a high-performance hub for data workflows.
- User-Defined Functions (UDFs) allow you to extend SQL capabilities using standard Python libraries.
- The system facilitates easy data export and conversion between formats, making it an ideal tool for modern data engineering pipelines.


