What is DuckDB? The Complete Introduction and Use Cases
INFO: This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the book for free to read the complete chapter.
1.1 What is DuckDB?
DuckDB is a modern embedded analytics database that runs on your machine and lets you efficiently process and query gigabytes of data from different sources. Often described as "the SQLite for analytics," it's an open-source, in-process SQL OLAP database optimized for analytical queries on local data.
To understand what makes DuckDB unique:
- Embedded: DuckDB runs directly within your application or notebook—not over a network. This eliminates server setup and network overhead entirely. [Editor's Note: to run over a network, you can use a service like the MotherDuck cloud data warehouse, or a community extension]
- Full SQL Support: It implements standard SQL with powerful extensions, making it immediately familiar to anyone with database experience.
- Vectorized Query Engine: It processes data in parallel chunks, fully utilizing modern multicore CPU architectures for fast analytical processing.
The ecosystem includes MotherDuck, which extends DuckDB into a distributed cloud analytics system that can process data in the cloud and on the edge, adding collaboration and sharing capabilities. The DuckDB community is active and helpful—you can find them on Discord and GitHub, and the documentation is comprehensive enough to answer most questions.
DuckDB handles diverse data formats natively—CSV, JSON, Parquet, Apache Arrow, and Excel files—as well as remote files from cloud buckets like S3 and GCP. It integrates with databases like MySQL, SQLite, and PostgreSQL, and lets you query pandas and Polars DataFrames directly from Python scripts or Jupyter notebooks without copying data.
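To give a flavor of how little ceremony this takes, here is a minimal sketch using the Python API (the DataFrame and its columns are invented for illustration). DuckDB finds the in-scope pandas DataFrame by its variable name and scans it without copying its memory:

```python
import duckdb
import pandas as pd

# An ordinary in-memory DataFrame; DuckDB can scan it by variable name
orders = pd.DataFrame({"customer": ["a", "b", "a"], "amount": [10.0, 5.5, 7.25]})

# Plain SQL over the DataFrame: no copy, no server. Swapping in a file path
# such as 'orders.parquet' or 'orders.csv' (illustrative) works the same way.
totals = duckdb.sql("SELECT customer, sum(amount) AS total FROM orders GROUP BY ALL")
print(totals.df())  # results come back as a pandas DataFrame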

DuckDB was created in 2018 by Mark Raasveldt and Hannes Mühleisen, researchers in database systems at Centrum Wiskunde & Informatica (CWI)—the national research institute for mathematics and computer science in the Netherlands. The founders spun off DuckDB Labs as a startup to further develop the database, while the nonprofit DuckDB Foundation governs the project, safeguarding its intellectual property and ensuring continuity under the MIT license.
1.2 Why Should You Care About DuckDB?
DuckDB makes data analytics fast and fun again. You no longer need to set up large Apache Spark clusters or run a cloud data warehouse (What is a data warehouse?) just to process a few hundred gigabytes of data. Accessing data from many sources directly—and running the processing where the data resides without copying it over the wire—makes your work faster, simpler, and cheaper.
How DuckDB Compares to Other Tools
The book makes these direct comparisons:
| Tool | DuckDB Advantage |
|---|---|
| SQLite | Faster for analytical workloads (DuckDB vs SQLite comparison) |
| Apache Spark | Easier to set up—no cluster configuration required |
| pandas | Lower resource requirements, handles larger-than-memory datasets |
| Polars | More predictable behavior and error messages |
| PostgreSQL / Redshift | Simpler setup—no server administration needed |
| Talend | Faster and more powerful for data transformations |
Key Benefits
- Compute Where Data Lives: Access data directly from S3 buckets without expensive transfers. For example, instead of paying AWS Athena to scan compressed JSON log files (charged by data volume), you can deploy DuckDB to an EC2 instance close to the data and query for a fraction of the cost.
- Rapid Experimentation: Run experiments and validate hypotheses quickly and locally using SQL—no infrastructure setup required.
- Extended SQL Dialect: Beyond ANSI SQL, DuckDB adds productivity-boosting extensions (a short sketch follows this list):
  - `SELECT * EXCLUDE()`/`REPLACE()` — Filter or transform columns from wide tables effortlessly
  - `GROUP BY ALL`/`ORDER BY ALL` — Automatically include all relevant columns without typing them out
  - `PIVOT`/`UNPIVOT` — Transpose rows and columns with simple syntax
  - `STRUCT` — Native support for complex nested data structures
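To make the dialect extensions concrete, here is a small self-contained sketch (table and data invented for illustration):

```python
import duckdb

con = duckdb.connect()  # in-memory database
con.sql("CREATE TABLE trips (city VARCHAR, day DATE, distance DOUBLE, internal_id INTEGER)")
con.sql("INSERT INTO trips VALUES ('Berlin', DATE '2024-05-01', 3.2, 1), ('Berlin', DATE '2024-05-02', 4.1, 2)")

# EXCLUDE drops columns without listing every column you want to keep
con.sql("SELECT * EXCLUDE (internal_id) FROM trips").show()

# GROUP BY ALL groups by every non-aggregated column automatically
con.sql("SELECT city, sum(distance) AS total_km FROM trips GROUP BY ALL").show()
```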
1.3 When Should You Use DuckDB?
Use DuckDB for analytics tasks that can be expressed in SQL and work on structured data (tables or documents), as long as your data is already available (not streaming) and volumes don't exceed a few hundred gigabytes. Its columnar engine handles both wide tables with many columns and large tables with many rows efficiently.
Because data doesn't leave your system (local machine or privacy-compliant hosting), DuckDB is ideal for analyzing private data like health information, home automation data, patient records, personal identifying information, and financial statements.
Ideal Use Cases
- In-Situ Log Analysis: Analyze log files exactly where they're stored without copying them to new locations.
- Privacy-First Analytics: Process sensitive data—financial statements, medical records, personal information—locally without data ever leaving your system.
- Edge Computing: Efficiently process sensor data on edge devices, such as reporting on power generation and consumption from smart meters.
- Machine Learning Preprocessing: Rapid pre-cleaning of user-generated data for ML training pipelines.
- Data Science Acceleration: Query pandas or Polars DataFrames directly with zero-copy integration. Results can be used as DataFrames without additional memory usage or data transfer. This puts advanced analytics capabilities in the hands of data scientists without involving a data operations group.
- Rapid Prototyping: Test database schemas and complex data models locally before deploying to expensive production data warehouses.
- Transport Optimization: Analyze ride data from modern transport operations for bikes and cars.
1.4 When Should You NOT Use DuckDB?
DuckDB is explicitly an analytics database (OLAP), not a transactional database (OLTP). It has minimal support for transactions and parallel write access, so it is not suited to applications and APIs that must store input data arriving at arbitrary times from many concurrent clients.
Scale Limitations
The data volumes you can process are mostly limited by main memory. While DuckDB supports out-of-memory processing (spilling to disk), this feature targets exceptional situations where the final portion of processing won't fit into memory—not routine processing of massive datasets.
DuckDB focuses on the "long tail" of data analytics use cases. If you're in an enterprise environment with complex data sources, tools, and applications processing many terabytes, DuckDB may not be the right choice.
No Real-Time Streaming
DuckDB does not process live data streams that update continuously. Data updates should happen in bulk by loading new tables or large chunks of data at once. To process streams, you must implement your own mini-batching architecture—setting up a process to create batches from the stream and store them somewhere DuckDB can query.
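Here is a minimal sketch of such a mini-batching setup (paths, schema, and scheduling are assumptions, not from the chapter): a scheduled producer writes each micro-batch as its own Parquet file, and DuckDB queries the accumulated files with a glob.

```python
import os
from datetime import datetime, timezone

import duckdb
import pandas as pd

os.makedirs("batches", exist_ok=True)

# Producer (run on a schedule): persist one micro-batch taken from the stream
batch = pd.DataFrame({"sensor": ["s1", "s2"], "temp": [21.5, 19.8]})
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
duckdb.sql(f"COPY (SELECT * FROM batch) TO 'batches/{stamp}.parquet' (FORMAT parquet)")

# Consumer: one analytical query over every batch accumulated so far
duckdb.sql("SELECT sensor, avg(temp) AS avg_temp FROM read_parquet('batches/*.parquet') GROUP BY ALL").show()
```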
1.5 What Are the Best Use Cases for DuckDB?
DuckDB can serve as the "SQLite for analytics"—the standard embedded engine for analytical processing, just as SQLite is for transactional storage. Its primary use cases include:
Edge Computing & Privacy
When analyzing data that shouldn't leave the user's device—health, training, financial, or home automation data—an efficient local infrastructure is essential. Local analytics and preprocessing also reduce the volume of data transported from edge devices like smart meters or sensors.
Cloud Cost Reduction
DuckDB serves as a cheaper alternative to expensive cloud analytics services like BigQuery, Amazon Redshift, AWS Athena, and Snowflake, which charge by processed data volume. A common pattern: replace these services with DuckDB running inside scheduled cloud functions, processing log files directly in storage. You can chain processing functions by writing intermediate results to cloud storage for auditing, helping you reduce cloud data warehouse costs.
Data Science Empowerment
Using DuckDB's state-of-the-art query engine makes data preparation, analysis, filtering, and aggregation more efficient than using pandas—all without leaving the comfortable environment of a Python or R notebook. This puts advanced analytics capabilities in the hands of data scientists working with larger data volumes while reducing complexity and eliminating the need for dedicated data operations support.
Hybrid Cloud-Edge Analysis
A key emerging use case is distributed analysis of data between cloud storage, the edge network, and local devices. MotherDuck enables this by allowing DuckDB to run both in the cloud and locally.
1.6 Where Does DuckDB Fit In?
DuckDB analyzes and transforms data residing in flat files (CSV, Parquet, JSON) or other database systems (PostgreSQL, SQLite). You can find example data from the book in the DuckDB in Action GitHub repository.
DuckDB fits into data pipelines in two ways:
Transient Transformation
Use DuckDB as a temporary processing engine to transform, filter, and pass data through to another format without persisting anything. This is ideal for format conversion (e.g., JSON to Parquet) or data cleaning.
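A hedged sketch of such a pass-through conversion (file names and the filter column are illustrative):

```python
import duckdb

# One-shot JSON-to-Parquet conversion; nothing is persisted in DuckDB itself
duckdb.sql("""
    COPY (
        SELECT *
        FROM read_json_auto('events.json')
        WHERE status IS NOT NULL  -- light cleaning on the way through
    ) TO 'events.parquet' (FORMAT parquet)
""")
```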
Persistent Analysis
Create tables for your data to persist it for subsequent high-performance analysis. During ingestion, you can transform and correct column names, data types, and values. If your input consists of nested documents, you can unnest and flatten the data to make relational analysis easier and more efficient.
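For example, a sketch of ingestion that fixes names and types on the way in (the database file name, source columns, and date format are assumptions):

```python
import duckdb

con = duckdb.connect("analytics.db")  # persistent database file

con.sql("""
    CREATE TABLE readings AS
    SELECT
        "Sensor ID"::INTEGER            AS sensor_id,  -- rename and cast
        strptime("Read At", '%d.%m.%Y') AS read_at,    -- text to timestamp
        "Value"::DOUBLE                 AS value
    FROM read_csv('readings.csv')
""")
```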
1.7 How Does the Data Processing Flow Work?
The data processing flow follows four stages (see our step-by-step DuckDB tutorial):

1. Load Data
DuckDB minimizes setup ceremony. Unlike databases like SQL Server, you don't need to specify schema details upfront. When reading data, it uses sensible defaults and inherent schema information, which you can override when needed. Because DuckDB is open source and built by practitioners, there's strong emphasis on usability—if something is too hard to use, someone in the community will propose a fix. And if built-in functionality doesn't reach far enough, there's likely an extension for it (e.g., geospatial data or full-text search).
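As a sketch of this guess-then-override flow (file and column names are invented): the CSV sniffer detects delimiter, header, and types on its own, and you pin down only what it gets wrong.

```python
import duckdb

# Defaults: delimiter, header, and column types are inferred automatically
duckdb.sql("SELECT * FROM 'sales.csv' LIMIT 5").show()

# Override a single inferred type while keeping the rest of the detection
duckdb.sql("SELECT * FROM read_csv('sales.csv', types={'postal_code': 'VARCHAR'}) LIMIT 5").show()
```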
2. Populate Tables
Determine which SQL capabilities or DuckDB features can help you perform your analysis or transformation. You can also perform exploratory data analysis (EDA) to quickly get an overview of the distribution, ranges, and relationships in your data.
3. Analyze with SQL
Build relevant SQL statements incrementally, verifying at each step that sample results match your expectations. At this stage, you might create additional tables or views before using advanced SQL features like window functions, common table expressions, and pivots.
4. Use Results
Decide how results will be consumed: turning them into files or databases again, serving them through an application or API, or visualizing them in a Jupyter notebook or dashboard.
1.7.1 Data Formats and Sources
DuckDB supports a large number of data formats and sources with little ceremony:
| Format | Capabilities |
|---|---|
| CSV | Loaded in bulk and parallel with automatic column mapping |
| DataFrames | Memory handled directly by DuckDB in the same Python process—no data copying required |
| JSON | Destructured, flattened, and transformed into relational tables; native JSON type for storage (Analyze JSON data) |
| Parquet | Schema metadata queried directly; predicate pushdown evaluates filters at the storage layer to reduce data loaded—the ideal columnar format for data lakes |
| Apache Arrow | Read via ADBC (Arrow Database Connectivity) without data copying or transformation |
| Cloud Buckets (S3, GCP) | Query data directly, reducing transfer infrastructure and enabling cheap processing of large volumes |
See the Python Quickstart guide to get started with DataFrames.
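For the cloud-bucket row above, a minimal sketch (bucket and path are placeholders; credentials are assumed to come from your environment or a configured secret):

```python
import duckdb

duckdb.sql("INSTALL httpfs")  # extension for HTTP(S) and S3-compatible storage
duckdb.sql("LOAD httpfs")

# Query Parquet files in place: only the needed columns and row groups are fetched
duckdb.sql("SELECT count(*) FROM read_parquet('s3://my-bucket/logs/2024/*.parquet')").show()
```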
1.7.2 Data Structures
DuckDB handles a variety of tables, views, and data types. Beyond traditional types (VARCHAR, INTEGER, FLOAT, DECIMAL, DATE, TIMESTAMP, INTERVAL, BOOLEAN, BLOB), it supports complex structured data types:
- Enums — Indexed, named elements of a set that are stored and processed efficiently
- Lists (Arrays) — Hold multiple elements of the same type, with a variety of functions for list operations
- Maps — Efficient key-value pairs for keyed data points, commonly used during JSON processing
- Structs — Consistent key-value structures where the same key always has values of the same data type, enabling more efficient storage and processing
DuckDB also allows you to create custom types and database extensions that provide additional data types. You can create virtual (derived) columns generated from other data via expressions.
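A quick sketch of these nested types in action (all values invented):

```python
import duckdb

# Structs, lists, and maps are first-class values
duckdb.sql("""
    SELECT
        {'name': 'Ada', 'langs': ['SQL', 'Python']} AS person,
        map(['q25', 'q75'], [1.2, 3.4])             AS quartiles
""").show()

# Dot notation reaches into struct fields; list functions work on arrays
duckdb.sql("""
    SELECT person.name, len(person.langs) AS num_langs
    FROM (SELECT {'name': 'Ada', 'langs': ['SQL', 'Python']} AS person)
""").show()
```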
1.7.3 Developing the SQL
When analyzing data, start by understanding the shape of your data, then work from simple queries to increasingly complex ones:
- Inspect Structure: Use `DESCRIBE` to learn about columns and data types of your data sources, tables, and views.
- Get Statistics: Use the `SUMMARIZE` clause to get per-column statistics: `count`, `min`, `max`, `avg`, `std` (deviation), `approx_unique` (estimated count of distinct values), percentiles (q25, q50, q75), and `null_percentage`.
- Build Incrementally: Start with `LIMIT` or a single input file to prototype. Outline the result columns you need (converting dates with `strptime` if necessary), then add aggregations and filters.
- Advanced Aggregations: DuckDB provides many aggregation functions—from traditional (`min`, `avg`, `sum`) to advanced (`histogram`, `bitstring_agg`, `list`, `approx_count_distinct`) to statistical (percentiles, `entropy`, regression, `skewness`).
- Window Functions: For running totals and comparisons with previous/next rows, use `aggregation OVER (PARTITION BY column ORDER BY column2 [RANGE ...])`.
- Organize with CTEs: Extract repeatedly used parts of your analytics statement into named common table expressions (CTEs) or views for readability. (A sketch combining several of these steps follows this list.)
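Here is that combined sketch (the CSV file and its columns are placeholders for your own data):

```python
import duckdb

duckdb.sql("DESCRIBE SELECT * FROM 'trips.csv'").show()   # column names and types
duckdb.sql("SUMMARIZE SELECT * FROM 'trips.csv'").show()  # per-column statistics

# A window function for a running total per city, prototyped with LIMIT
duckdb.sql("""
    SELECT city, day, distance,
           sum(distance) OVER (PARTITION BY city ORDER BY day) AS running_km
    FROM 'trips.csv'
    LIMIT 10
""").show()
```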
1.7.4 Using or Processing the Results
Once you have results from DuckDB, you have several options:
Persist & Export
- Create a table from results: `CREATE TABLE <name> AS SELECT ...` (a sketch follows this list)
- Write to formats: CSV, JSON, Parquet, Excel, Apache Arrow
- Export to other databases (SQLite, PostgreSQL) via extensions
- Output smaller result sets as CSV or JSON from the CLI
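A brief sketch of the persist-and-export options above (table name and output paths are illustrative):

```python
import duckdb

# Persist a result set as a table, then write it out in other formats
duckdb.sql("CREATE TABLE top_cities AS SELECT 'Berlin' AS city, 42 AS trips")
duckdb.sql("COPY top_cities TO 'top_cities.parquet' (FORMAT parquet)")
duckdb.sql("COPY top_cities TO 'top_cities.csv' (HEADER, DELIMITER ',')")
```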
Visualize
Because a picture tells more than 1,000 rows, visualization is often the preferred choice:
- Terminal: Use the built-in `bar` function for inline bar charts, or command-line tools like `youplot` (a small `bar` example follows this list)
- Python: Convert results to DataFrames and render with matplotlib, ggplot, or Plotly
- R: Use ggplot2 for visualization
- JavaScript: Use D3, Nivo, or Observable
- Dashboards: Embed in applications built with Streamlit
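For the terminal option, a tiny sketch of the built-in `bar` function (data invented):

```python
import duckdb

# bar(value, min, max, width) renders a proportional text bar per row
duckdb.sql("""
    SELECT city, trips, bar(trips, 0, 100, 20) AS chart
    FROM (VALUES ('Berlin', 80), ('Paris', 45)) AS t(city, trips)
""").show()
```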
Serve via APIs
Because DuckDB is so fast, you can serve results directly from queries via an API that web, command-line, or mobile clients consume. You only need a traditional client-server database setup if your source data is too big to move and your results are comparatively small (much less than 1% of the volume). Otherwise, embed DuckDB into your application or dashboarding tool and run it on local raw data or a local DuckDB database.

1.8 Summary
- DuckDB is a modern embedded analytical database created in 2018 by Mark Raasveldt and Hannes Mühleisen, excelling at in-memory processing of gigabytes of data.
- It supports an extended SQL dialect with productivity features like `GROUP BY ALL`, `SELECT * EXCLUDE`, and `PIVOT`, and gains new capabilities through extensions.
- It reads diverse formats natively—CSV, JSON, Parquet, Apache Arrow, Excel—from both local and remote sources including S3 and GCP.
- Python, R, and other language integrations are seamless and efficient, with zero-copy DataFrame access.
- As an in-process database, it processes data efficiently without copying or network overhead.
- Beyond traditional types, DuckDB supports lists, maps, structs, and enums for complex data structures.
- Rich built-in functions for data types, aggregations, and statistical analysis make data processing and shaping straightforward.
- An iterative SQL workflow—starting with `DESCRIBE` and `SUMMARIZE`—helps you understand data before building complex queries.
- Flexible output options let you generate reports, visualize charts, export to various formats, or serve results directly via APIs.
