Chapter 1

An Introduction to DuckDB

This is a summary of a chapter from DuckDB in Action, published by Manning. Download the book for free to read the complete chapter.

[Image: 'DuckDB in Action' book cover]

1.1 What is DuckDB?

DuckDB is a modern, embedded analytics database designed for efficient processing and querying of gigabytes of data from various sources. It functions as an embeddable SQL OLAP (Online Analytical Processing) database management system. To break that down:

  • Embeddable: DuckDB runs directly within your application, eliminating the need for a separate server process.
  • SQL: It offers full SQL support, making it immediately familiar to anyone with database experience.
  • OLAP: It is optimized for analytical processing and excels at complex queries on large datasets. Its vectorized query engine processes data in parallel chunks to take full advantage of modern multicore CPUs.

Unlike traditional client-server databases, DuckDB operates within the same process as your application or notebook, eliminating network overhead. Developed by researchers at CWI, the project is governed by the non-profit DuckDB Foundation (ensuring it remains open-source under the MIT License) and supported by DuckDB Labs. The ecosystem also includes MotherDuck, which extends DuckDB into a distributed cloud analytics system. There is also a thriving community on platforms like Discord and GitHub.
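
To make the in-process model concrete, here is a minimal Python sketch (assuming the duckdb package is installed, e.g. via pip install duckdb):

    import duckdb

    # No server to install or start: the engine runs inside this Python process.
    # connect() with no argument gives an in-memory database; pass a file name
    # (e.g., "analytics.duckdb") to persist data between runs.
    con = duckdb.connect()

    con.sql("CREATE TABLE readings (sensor VARCHAR, value DOUBLE)")
    con.sql("INSERT INTO readings VALUES ('a', 1.5), ('a', 2.5), ('b', 4.0)")

    # Results come straight back into the calling process.
    print(con.sql("SELECT sensor, avg(value) FROM readings GROUP BY sensor").fetchall())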

Its ability to handle diverse data formats—such as CSV, JSON, Parquet, and Apache Arrow, as well as remote files from cloud buckets—is a key strength. Furthermore, it integrates seamlessly with other databases (MySQL, Postgres) and allows you to query Python objects like pandas and Polars DataFrames directly without copying data.
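
As a small illustration (file names hypothetical), the same query pattern works across formats because DuckDB infers format and schema from the file itself:

    import duckdb

    # Hypothetical local files; no CREATE TABLE or schema definition needed.
    duckdb.sql("SELECT * FROM 'orders.csv' LIMIT 5").show()
    duckdb.sql("SELECT * FROM 'orders.parquet' LIMIT 5").show()
    duckdb.sql("SELECT * FROM 'orders.json' LIMIT 5").show()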

[Figure: DuckDB receives data from many sources, runs SQL on it, and outputs results to pandas DataFrames and charts.]

1.2 Why should you care about DuckDB?

DuckDB simplifies data analytics by providing a fast and efficient way to process datasets—ranging up to a few hundred gigabytes—locally or close to the source, without the overhead of distributed computing frameworks (like Spark) or expensive cloud data warehouses. It offers a unique combination of features that solve modern data problems:

  • Compute Where Data Lives: It can access data directly (e.g., S3 buckets) without copying it over the wire, significantly reducing data transfer costs and complexity compared to services like AWS Athena (see the sketch after this list).
  • Resource Efficiency: It handles larger-than-memory workloads gracefully, requiring fewer resources than pandas and avoiding the complex setup of PostgreSQL or Redshift.
  • Speed: Its columnar-vectorized engine outperforms SQLite for analytical tasks and processes data in parallel batches.
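
A sketch of the "compute where data lives" pattern, assuming the httpfs extension and configured S3 credentials (the bucket and path are hypothetical):

    import duckdb

    duckdb.sql("INSTALL httpfs")  # one-time download of the extension
    duckdb.sql("LOAD httpfs")

    # Only the columns and row groups the query needs are fetched from S3,
    # so an aggregation can transfer a small fraction of the file.
    duckdb.sql("""
        SELECT status, count(*) AS requests
        FROM 's3://my-bucket/logs/2024-01.parquet'
        GROUP BY ALL
    """).show()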

Beyond standard SQL, DuckDB improves developer productivity with powerful dialect extensions:

  • GROUP BY ALL & ORDER BY ALL: Eliminates the need to type out every column name.
  • SELECT * EXCLUDE / REPLACE: Makes filtering columns from wide tables effortless.
  • PIVOT / UNPIVOT: Simplifies reshaping rows and columns.
  • STRUCT: Provides native support for complex nested data structures.
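
A quick sketch of these dialect extensions (table and data invented for illustration):

    import duckdb

    duckdb.sql("CREATE TABLE sales (city VARCHAR, year INTEGER, amount DOUBLE)")
    duckdb.sql("""
        INSERT INTO sales VALUES
            ('Berlin', 2023, 10), ('Berlin', 2024, 12),
            ('Paris',  2023,  7), ('Paris',  2024,  9)
    """)

    # GROUP BY ALL: group by every non-aggregated column without listing them.
    duckdb.sql("SELECT city, year, sum(amount) FROM sales GROUP BY ALL").show()

    # EXCLUDE drops columns from *; REPLACE rewrites one in place.
    duckdb.sql("SELECT * EXCLUDE (year) REPLACE (amount * 100 AS amount) FROM sales").show()

    # PIVOT turns the distinct year values into columns.
    duckdb.sql("PIVOT sales ON year USING sum(amount)").show()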

1.3 When should you use DuckDB?

DuckDB is suitable for analytical tasks involving structured data (tables or documents) up to a few hundred gigabytes, provided the data is already available (not streaming). Its columnar engine excels at handling both wide tables (many columns) and large tables (many rows).

DuckDB specifically excels in these real-world scenarios:

  • In-Situ Log Analysis: Analyzing log files where they are stored, without copying or moving them first.

  • Privacy-First Analytics: Ideal for analyzing sensitive private data—such as financial statements or personal medical data (e.g., heart rate monitoring)—because the data never leaves the local system or privacy-compliant host.

  • Edge Computing: Efficiently processing sensor data on edge devices, such as reporting on power consumption from smart meters.

  • Machine Learning Pipelines: Rapid pre-processing and cleaning of user-generated data for training ML models.

  • Data Science Acceleration: It offers zero-copy integration with Python/R. You can query pandas/Polars DataFrames directly, and crucially, the query results can be used as DataFrames without additional memory usage or data copying (see the sketch after this list).

  • Prototyping & Testing: Rapidly testing database schemas and complex data models locally before deploying them to larger, expensive production data warehouses.
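
A minimal sketch of the zero-copy DataFrame integration described above (assuming pandas and the duckdb package are installed):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"species": ["duck", "duck", "goose"],
                       "weight": [1.1, 1.3, 3.2]})

    # DuckDB resolves 'df' to the local DataFrame variable: no import step, no copy.
    result = duckdb.sql("""
        SELECT species, avg(weight) AS avg_weight
        FROM df
        GROUP BY ALL
    """).df()  # .df() hands the result back as a pandas DataFrame

    print(result)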

1.4 When should you not use DuckDB?

DuckDB is explicitly an analytics database. While it provides strong ACID guarantees to ensure data integrity, it is not optimized for high-concurrency transactional applications (OLTP) or backend APIs that must process and store arbitrary input data as it arrives.

It also has specific limitations regarding scale and latency:

  • Scale: DuckDB focuses on the "long tail" of analytics (up to a few hundred gigabytes). While it supports out-of-memory processing (spilling to disk), this feature is intended for exceptional situations where the final portion of a query exceeds RAM, not for routine processing of multi-terabyte enterprise datasets.

  • Streaming: It is not a real-time streaming database. It cannot handle continuous live updates; instead, updates must happen in bulk. To process streams, you must implement your own "mini-batching" architecture to buffer data into chunks before loading them into DuckDB.
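
A minimal sketch of such a mini-batching layer in Python (the event hook, stream source, and batch size are assumptions, not a DuckDB API):

    import duckdb

    con = duckdb.connect("events.duckdb")
    con.sql("CREATE TABLE IF NOT EXISTS events (ts TIMESTAMP, payload VARCHAR)")

    BATCH_SIZE = 10_000  # assumption: tune to your latency/throughput needs
    buffer = []

    def on_event(ts, payload):
        """Called by your stream consumer (Kafka, MQTT, ...) for each event."""
        buffer.append((ts, payload))
        if len(buffer) >= BATCH_SIZE:
            flush()

    def flush():
        # One bulk insert per batch instead of one transaction per event.
        con.executemany("INSERT INTO events VALUES (?, ?)", buffer)
        buffer.clear()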

1.5 What are the best use cases for DuckDB?

DuckDB's versatility allows it to function as the "SQLite for analytics," becoming the standard embedded engine for analytical processing just as SQLite is for transactional workloads. Its primary use cases include:

  • Edge Computing & Privacy: It is ideal for applications where data should not leave the user's device, such as health, financial, or home automation data. By processing sensor data (e.g., from smart meters) locally, it significantly reduces the volume of data that needs to be transported to the cloud.
  • Cloud Cost Reduction: It serves as a cheaper alternative to expensive cloud services like BigQuery, Redshift, or AWS Athena (which charge by scanned data volume). A common pattern is to replace these services with DuckDB running inside scheduled cloud functions, allowing you to process log files directly in storage and chain functions together for auditing.
  • Data Science Empowerment: Beyond just outperforming pandas in speed, DuckDB simplifies the infrastructure required for data science. It allows complex analysis in Python/R notebooks without needing to involve a dedicated "data operations group" to manage clusters or databases.
  • Hybrid Analysis: It enables distributed analysis workflows where data is queried across cloud storage, the edge network, and local devices simultaneously (a capability being developed by MotherDuck).

1.6 Where does DuckDB fit in?

DuckDB serves as a powerful engine for analyzing and transforming data residing in various formats like CSV, Parquet, JSON, or databases such as PostgreSQL and SQLite. Note: You can find the example data used in the book's code snippets in the DuckDB in Action GitHub repository.

It fits into data pipelines in two primary ways:

  • Transient Transformation: You can use DuckDB as a temporary processing engine to filter, clean, and pass data through to another format (e.g., converting JSON to Parquet) without ever storing the data permanently.

  • Persistent Analysis: Alternatively, you can ingest data into DuckDB tables for high-performance local analysis.
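
A minimal sketch of both patterns (file names hypothetical):

    import duckdb

    # Transient transformation: read JSON, filter it, write Parquet.
    # Nothing is stored in DuckDB itself.
    duckdb.sql("""
        COPY (
            SELECT * FROM read_json_auto('events.json')
            WHERE payload IS NOT NULL
        ) TO 'events.parquet' (FORMAT parquet)
    """)

    # Persistent analysis: ingest the same data into a DuckDB table instead.
    con = duckdb.connect("analytics.duckdb")
    con.sql("CREATE TABLE events AS SELECT * FROM read_json_auto('events.json')")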

Crucially, it bridges the gap between raw data and relational analysis by allowing you to:

  • Unnest and flatten deeply nested structures (like JSON documents) into flat tables.

  • Clean and correct column names, data types, and values during ingestion.

1.7 How does the data processing flow work?

DuckDB's data processing flow involves a series of four steps:

1. Loading & Ingestion: DuckDB minimizes setup ceremony. Unlike traditional databases, it uses sensible defaults to infer schema information automatically. It supports predicate pushdown for Parquet files (filtering data at the storage layer to minimize I/O) and handles complex types natively, including STRUCTs, MAPs, LISTs, and ENUMs.

2. Exploratory Data Analysis (EDA): Before diving into complex queries, you can instantly understand your data's distribution using the DESCRIBE statement and the SUMMARIZE command. SUMMARIZE automatically calculates column statistics like min/max, approximate unique counts, null percentages, and quartiles (q25, q50, q75).
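
For example, against a hypothetical Parquet file:

    import duckdb

    # DESCRIBE shows the inferred column names and types; SUMMARIZE adds
    # min/max, approximate distinct counts, null percentage, and the
    # q25/q50/q75 quartiles per column.
    duckdb.sql("DESCRIBE SELECT * FROM 'orders.parquet'").show()
    duckdb.sql("SUMMARIZE SELECT * FROM 'orders.parquet'").show()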

3. Analytical Queries: DuckDB extends standard SQL with advanced analytical functions. Beyond basic sums and averages, it supports statistical aggregations like histogram, entropy, skewness, and approx_count_distinct.
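
A sketch of these aggregates against a hypothetical request log:

    import duckdb

    duckdb.sql("""
        SELECT
            approx_count_distinct(user_id) AS approx_users,  -- fast cardinality estimate
            skewness(duration_ms)          AS duration_skew,
            entropy(status)                AS status_entropy,
            histogram(status)              AS status_counts  -- MAP of value -> count
        FROM 'requests.parquet'
    """).show()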

4. Consumption & Visualization: Results can be exported to formats like CSV, JSON, or Apache Arrow. For quick insights directly in the terminal, DuckDB offers a built-in bar function to render inline charts. For richer visualization, results can be converted to DataFrames for use with matplotlib, ggplot, or D3.js.
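
For instance, the bar function renders a proportional text bar per row (the file and the 0-1000 scale are assumptions):

    import duckdb

    duckdb.sql("""
        SELECT status,
               count(*) AS n,
               bar(count(*), 0, 1000) AS chart  -- bar(value, min, max)
        FROM 'requests.parquet'
        GROUP BY ALL
        ORDER BY n DESC
    """).show()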

[Figure: Flowchart of DuckDB's processing steps: load data, populate tables, analyze with SQL, use the results.]

1.7.1 Data formats and sources

DuckDB's strength lies in its "no-ceremony" approach to data: it uses sensible defaults to automatically infer schemas, allowing you to query data immediately without defining table structures upfront. It includes specialized optimizations for each format:

  • Parquet: Supports predicate pushdown, meaning query filters are evaluated at the storage layer to drastically reduce the amount of data loaded.
  • CSV: Files are loaded in bulk and in parallel, with columns mapped automatically.
  • JSON: Nested data can be destructured, flattened, and transformed into relational tables, supported by a native JSON data type.
  • DataFrames: It accesses pandas or Polars DataFrames directly within the same Python process without copying data.
  • Apache Arrow: Accessible via ADBC (Arrow Database Connectivity) to minimize serialization overhead.
  • Cloud Storage: It queries S3 or GCP buckets directly, allowing for cheap processing of large volumes without moving data.
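
Two of these optimizations sketched with hypothetical files:

    import duckdb

    # Parquet: the filter is pushed into the scan, so row groups whose
    # min/max statistics rule out year = 2024 are never read at all.
    duckdb.sql("""
        SELECT count(*)
        FROM read_parquet('orders.parquet')
        WHERE year = 2024
    """).show()

    # JSON: nested arrays can be flattened into one row per element.
    duckdb.sql("""
        SELECT order_id, unnest(items) AS item
        FROM read_json_auto('orders.json')
    """).show()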

1.7.2 Data structures

DuckDB supports a robust set of data types. Beyond traditional SQL types like DECIMAL, INTERVAL, and BLOB, it natively handles complex, nested structures:

  • Enums: Indexed, named elements of a set that are stored and processed efficiently.
  • Lists: Arrays holding multiple elements of the same type, supported by a wide range of manipulation functions.
  • Maps: Efficient key-value pairs, particularly useful for processing JSON data.
  • Structs: Consistent key-value structures (where specific keys always map to specific types), allowing for highly optimized storage and processing.

Additionally, DuckDB allows you to define "virtual" (or derived) columns generated dynamically from expressions, and even create custom types via extensions.
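
A short sketch of these nested types using DuckDB's literal syntax:

    import duckdb

    duckdb.sql("""
        SELECT
            {'name': 'Grace', 'langs': ['COBOL', 'FORTRAN']} AS person,  -- STRUCT containing a LIST
            [1, 2, 3]            AS nums,    -- LIST
            MAP {'a': 1, 'b': 2} AS counts   -- MAP
    """).show()

    # Struct fields use dot notation; list indexing is 1-based.
    duckdb.sql("""
        SELECT person.name, person.langs[1] AS first_lang
        FROM (SELECT {'name': 'Grace', 'langs': ['COBOL', 'FORTRAN']} AS person)
    """).show()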

1.7.3 Developing the SQL

DuckDB supports an iterative workflow for developing SQL. The book recommends starting with DESCRIBE to inspect schemas and using LIMIT to prototype queries on a small data subset before scaling up. Its SQL dialect includes powerful features for advanced analytics:

  • Date Handling: Functions like strptime simplify date/time conversion.
  • Advanced Aggregations: Beyond standard math, it supports histogram, bitstring_agg, list (creating arrays from rows), and approx_count_distinct.
  • Statistical Functions: Built-in support for entropy, skewness, regression, and percentiles.
  • Window Functions: Full support for complex moving calculations, including PARTITION BY, ORDER BY, and RANGE frames for time-series analysis.
  • Modularity: Complex logic can be organized using Common Table Expressions (CTEs).
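
A compact sketch combining several of these features (data invented for illustration):

    import duckdb

    duckdb.sql("""
        WITH daily AS (                               -- CTE keeps the logic modular
            SELECT strptime(day, '%Y-%m-%d') AS day,  -- parse text into a timestamp
                   amount
            FROM (VALUES ('2024-01-01', 10.0),
                         ('2024-01-02', 12.0),
                         ('2024-01-03',  9.0)) AS t(day, amount)
        )
        SELECT day, amount,
               avg(amount) OVER (
                   ORDER BY day
                   RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND CURRENT ROW
               ) AS moving_avg                        -- time-based moving average
        FROM daily
        LIMIT 10                                      -- prototype on a small slice first
    """).show()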

1.7.4 Use or process the results

Once processed, DuckDB results can be handled in three distinct ways:

  • Persist & Export: You can save results internally using CREATE TABLE ... AS SELECT, or export them to formats including Excel, Apache Arrow, CSV, JSON, and Parquet. You can even write results directly into other databases (like SQLite or Postgres) via extensions (see the sketch after this list).

  • Visualizing: For quick checks, you can use the built-in bar function to render charts directly in the terminal. For rich visualizations, results can be seamlessly passed to Python/R libraries (matplotlib, ggplot2) or embedded into applications using Streamlit.

  • Serving APIs: Because DuckDB is so fast, it can be used to serve aggregated results directly via APIs to mobile or web clients. This is ideal for scenarios where the source data is massive (too big to move), but the query results are comparatively small (<1% of the volume).

[Figure: DuckDB connects to databases, Jupyter, and Python, and outputs to diverse charts and tables.]
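
A minimal sketch of the persist-and-export options mentioned above (file names hypothetical):

    import duckdb

    con = duckdb.connect("analytics.duckdb")

    # Persist results inside DuckDB ...
    con.sql("CREATE TABLE top_products AS SELECT * FROM 'sales.parquet' LIMIT 100")

    # ... or export them to a format of your choice.
    con.sql("COPY top_products TO 'top_products.csv' (HEADER)")
    con.sql("COPY top_products TO 'top_products.parquet' (FORMAT parquet)")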

1.8 Summary

DuckDB emerges as a powerful in-process analytical database, excelling at processing and querying gigabytes of data efficiently. Its support for an extended SQL dialect, diverse data formats, and seamless integration with popular programming languages makes it a versatile tool for data professionals. DuckDB's ability to handle complex data structures, perform efficient aggregations, and visualize results directly enhances its analytical capabilities. Its ease of use, performance, and versatility position it as a valuable asset for data-intensive tasks ranging from ad-hoc analysis to building data pipelines.
