What is DuckDB? The Complete Introduction and Use Cases
INFO: This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the book for free to read the complete chapter.
1.1 What is DuckDB?
DuckDB is a modern embedded analytics database that runs on your machine and lets you efficiently process and query gigabytes of data from different sources. Often described as "the SQLite for analytics," it's an open-source, in-process SQL OLAP database optimized for analytical queries on local data.
To understand what makes DuckDB unique:
- Embedded: DuckDB runs directly within your application or notebook—not over a network. This eliminates server setup and network overhead entirely. [Editor's Note: to run over a network, you can use a service like the MotherDuck cloud data warehouse, or a community extension]
- Full SQL Support: It implements standard SQL with powerful extensions, making it immediately familiar to anyone with database experience.
- Vectorized Query Engine: It processes data in parallel chunks, fully utilizing modern multicore CPU architectures for fast analytical processing.
The ecosystem includes MotherDuck, which extends DuckDB into a distributed cloud analytics system that can process data in the cloud and on the edge, adding collaboration and sharing capabilities. The DuckDB community is active and helpful—you can find them on Discord and GitHub, and the documentation is comprehensive enough to answer most questions.
DuckDB handles diverse data formats natively—CSV, JSON, Parquet, Apache Arrow, and Excel files—as well as remote files from cloud buckets like S3 and GCP. It integrates with databases like MySQL, SQLite, and PostgreSQL, and lets you query pandas and Polars DataFrames directly from Python scripts or Jupyter notebooks without copying data.
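To give a flavor of how little ceremony this takes, here is a minimal sketch using the Python API (the DataFrame and its columns are invented for illustration). DuckDB finds the in-scope pandas DataFrame by its variable name and scans it without copying its memory:

```python
import duckdb
import pandas as pd

# An ordinary in-memory DataFrame; DuckDB can scan it by variable name
orders = pd.DataFrame({"customer": ["a", "b", "a"], "amount": [10.0, 5.5, 7.25]})

# Plain SQL over the DataFrame: no copy, no server. Swapping in a file path
# such as 'orders.parquet' or 'orders.csv' (illustrative) works the same way.
totals = duckdb.sql("SELECT customer, sum(amount) AS total FROM orders GROUP BY ALL")
print(totals.df())  # results come back as a pandas DataFrame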

DuckDB was created in 2018 by Mark Raasveldt and Hannes Mühleisen, researchers in database systems at Centrum Wiskunde & Informatica (CWI)—the national research institute for mathematics and computer science in the Netherlands. The founders spun off DuckDB Labs as a startup to further develop the database, while the nonprofit DuckDB Foundation governs the project, safeguarding its intellectual property and ensuring continuity under the MIT license.
1.2 Why Should You Care About DuckDB?
DuckDB makes data analytics fast and fun again. You no longer need to set up large Apache Spark clusters or run a cloud data warehouse (What is a data warehouse?) just to process a few hundred gigabytes of data. Accessing data from many sources directly—and running the processing where the data resides without copying it over the wire—makes your work faster, simpler, and cheaper.
How DuckDB Compares to Other Tools
The book makes these direct comparisons:
| Tool | DuckDB Advantage |
|---|---|
| SQLite | Faster for analytical workloads (DuckDB vs SQLite comparison) |
| Apache Spark | Easier to set up—no cluster configuration required |
| pandas | Lower resource requirements, handles larger-than-memory datasets |
| Polars | More predictable behavior and error messages |
| PostgreSQL / Redshift | Simpler setup—no server administration needed |
| Talend | Faster and more powerful for data transformations |
Key Benefits
- Compute Where Data Lives: Access data directly from S3 buckets without expensive transfers. For example, instead of paying AWS Athena to scan compressed JSON log files (charged by data volume), you can deploy DuckDB to an EC2 instance close to the data and query for a fraction of the cost.
- Rapid Experimentation: Run experiments and validate hypotheses quickly and locally using SQL—no infrastructure setup required.
- Extended SQL Dialect: Beyond ANSI SQL, DuckDB adds productivity-boosting extensions (a short sketch follows this list):
  - `SELECT * EXCLUDE()`/`REPLACE()` — Filter or transform columns from wide tables effortlessly
  - `GROUP BY ALL`/`ORDER BY ALL` — Automatically include all relevant columns without typing them out
  - `PIVOT`/`UNPIVOT` — Transpose rows and columns with simple syntax
  - `STRUCT` — Native support for complex nested data structures
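To make the dialect extensions concrete, here is a small self-contained sketch (table and data invented for illustration):

```python
import duckdb

con = duckdb.connect()  # in-memory database
con.sql("CREATE TABLE trips (city VARCHAR, day DATE, distance DOUBLE, internal_id INTEGER)")
con.sql("INSERT INTO trips VALUES ('Berlin', DATE '2024-05-01', 3.2, 1), ('Berlin', DATE '2024-05-02', 4.1, 2)")

# EXCLUDE drops columns without listing every column you want to keep
con.sql("SELECT * EXCLUDE (internal_id) FROM trips").show()

# GROUP BY ALL groups by every non-aggregated column automatically
con.sql("SELECT city, sum(distance) AS total_km FROM trips GROUP BY ALL").show()
```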
1.3 When Should You Use DuckDB?
Use DuckDB for analytics tasks that can be expressed in SQL and work on structured data (tables or documents), as long as your data is already available (not streaming) and volumes don't exceed a few hundred gigabytes. Its columnar engine handles both wide tables with many columns and large tables with many rows efficiently.
Because data doesn't leave your system (local machine or privacy-compliant hosting), DuckDB is ideal for analyzing private data like health information, home automation data, patient records, personal identifying information, and financial statements.
Ideal Use Cases
- In-Situ Log Analysis: Analyze log files exactly where they're stored without copying them to new locations.
- Privacy-First Analytics: Process sensitive data—financial statements, medical records, personal information—locally without data ever leaving your system.
- Edge Computing: Efficiently process sensor data on edge devices, such as reporting on power generation and consumption from smart meters.
- Machine Learning Preprocessing: Rapid pre-cleaning of user-generated data for ML training pipelines.
- Data Science Acceleration: Query pandas or Polars DataFrames directly with zero-copy integration. Results can be used as DataFrames without additional memory usage or data transfer. This puts advanced analytics capabilities in the hands of data scientists without involving a data operations group.
- Rapid Prototyping: Test database schemas and complex data models locally before deploying to expensive production data warehouses.
- Transport Optimization: Analyze ride data from modern transport operations for bikes and cars.
1.4 When Should You NOT Use DuckDB?
DuckDB is explicitly an analytics database (OLAP), not a transactional database (OLTP). It has minimal support for transactions and parallel write access, so it is not suited to applications and APIs that must store input data arriving at arbitrary times from many concurrent clients.
Scale Limitations
The data volumes you can process are mostly limited by main memory. While DuckDB supports out-of-memory processing (spilling to disk), this feature targets exceptional situations where the final portion of processing won't fit into memory—not routine processing of massive datasets.
DuckDB focuses on the "long tail" of data analytics use cases. If you're in an enterprise environment with complex data sources, tools, and applications processing many terabytes, DuckDB may not be the right choice.
No Real-Time Streaming
DuckDB does not process live data streams that update continuously. Data updates should happen in bulk by loading new tables or large chunks of data at once. To process streams, you must implement your own mini-batching architecture—setting up a process to create batches from the stream and store them somewhere DuckDB can query.
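Here is a minimal sketch of such a mini-batching setup (paths, schema, and scheduling are assumptions, not from the chapter): a scheduled producer writes each micro-batch as its own Parquet file, and DuckDB queries the accumulated files with a glob.

```python
import os
from datetime import datetime, timezone

import duckdb
import pandas as pd

os.makedirs("batches", exist_ok=True)

# Producer (run on a schedule): persist one micro-batch taken from the stream
batch = pd.DataFrame({"sensor": ["s1", "s2"], "temp": [21.5, 19.8]})
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
duckdb.sql(f"COPY (SELECT * FROM batch) TO 'batches/{stamp}.parquet' (FORMAT parquet)")

# Consumer: one analytical query over every batch accumulated so far
duckdb.sql("SELECT sensor, avg(temp) AS avg_temp FROM read_parquet('batches/*.parquet') GROUP BY ALL").show()
```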
1.5 What Are the Best Use Cases for DuckDB?
DuckDB can serve as the "SQLite for analytics"—the standard embedded engine for analytical processing, just as SQLite is for transactional storage. Its primary use cases include:
Edge Computing & Privacy
When analyzing data that shouldn't leave the user's device—health, training, financial, or home automation data—an efficient local infrastructure is essential. Local analytics and preprocessing also reduce the volume of data transported from edge devices like smart meters or sensors.
Cloud Cost Reduction
DuckDB serves as a cheaper alternative to expensive cloud analytics services like BigQuery, Amazon Redshift, AWS Athena, and Snowflake, which charge by processed data volume. A common pattern: replace these services with DuckDB running inside scheduled cloud functions, processing log files directly in storage. You can chain processing functions by writing intermediate results to cloud storage for auditing, helping you reduce cloud data warehouse costs.
Data Science Empowerment
Using DuckDB's state-of-the-art query engine makes data preparation, analysis, filtering, and aggregation more efficient than using pandas—all without leaving the comfortable environment of a Python or R notebook. This puts advanced analytics capabilities in the hands of data scientists working with larger data volumes while reducing complexity and eliminating the need for dedicated data operations support.
Hybrid Cloud-Edge Analysis
A key emerging use case is distributed analysis of data between cloud storage, the edge network, and local devices. MotherDuck enables this by allowing DuckDB to run both in the cloud and locally.
1.6 Where Does DuckDB Fit In?
DuckDB analyzes and transforms data residing in flat files (CSV, Parquet, JSON) or other database systems (PostgreSQL, SQLite). You can find example data from the book in the DuckDB in Action GitHub repository.
DuckDB fits into data pipelines in two ways:
Transient Transformation
Use DuckDB as a temporary processing engine to transform, filter, and pass data through to another format without persisting anything. This is ideal for format conversion (e.g., JSON to Parquet) or data cleaning.
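A hedged sketch of such a pass-through conversion (file names and the filter column are illustrative):

```python
import duckdb

# One-shot JSON-to-Parquet conversion; nothing is persisted in DuckDB itself
duckdb.sql("""
    COPY (
        SELECT *
        FROM read_json_auto('events.json')
        WHERE status IS NOT NULL  -- light cleaning on the way through
    ) TO 'events.parquet' (FORMAT parquet)
""")
```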
Persistent Analysis
Create tables for your data to persist it for subsequent high-performance analysis. During ingestion, you can transform and correct column names, data types, and values. If your input consists of nested documents, you can unnest and flatten the data to make relational analysis easier and more efficient.
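For example, a sketch of ingestion that fixes names and types on the way in (the database file name, source columns, and date format are assumptions):

```python
import duckdb

con = duckdb.connect("analytics.db")  # persistent database file

con.sql("""
    CREATE TABLE readings AS
    SELECT
        "Sensor ID"::INTEGER            AS sensor_id,  -- rename and cast
        strptime("Read At", '%d.%m.%Y') AS read_at,    -- text to timestamp
        "Value"::DOUBLE                 AS value
    FROM read_csv('readings.csv')
""")
```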
1.7 How Does the Data Processing Flow Work?
The data processing flow follows four stages (see our step-by-step DuckDB tutorial):

1. Load Data
DuckDB minimizes setup ceremony. Unlike databases like SQL Server, you don't need to specify schema details upfront. When reading data, it uses sensible defaults and inherent schema information, which you can override when needed. Because DuckDB is open source and built by practitioners, there's strong emphasis on usability—if something is too hard to use, someone in the community will propose a fix. And if built-in functionality doesn't reach far enough, there's likely an extension for it (e.g., geospatial data or full-text search).
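As a sketch of this guess-then-override flow (file and column names are invented): the CSV sniffer detects delimiter, header, and types on its own, and you pin down only what it gets wrong.

```python
import duckdb

# Defaults: delimiter, header, and column types are inferred automatically
duckdb.sql("SELECT * FROM 'sales.csv' LIMIT 5").show()

# Override a single inferred type while keeping the rest of the detection
duckdb.sql("SELECT * FROM read_csv('sales.csv', types={'postal_code': 'VARCHAR'}) LIMIT 5").show()
```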
2. Populate Tables
Determine which SQL capabilities or DuckDB features can help you perform your analysis or transformation. You can also perform exploratory data analysis (EDA) to quickly get an overview of the distribution, ranges, and relationships in your data.
3. Analyze with SQL
Build relevant SQL statements incrementally, verifying at each step that sample results match your expectations. At this stage, you might create additional tables or views before using advanced SQL features like window functions, common table expressions, and pivots.
4. Use Results
Decide how results will be consumed: turning them into files or databases again, serving them through an application or API, or visualizing them in a Jupyter notebook or dashboard.
1.7.1 Data Formats and Sources
DuckDB supports a large number of data formats and sources with little ceremony:
| Format | Capabilities |
|---|---|
| CSV | Loaded in bulk and parallel with automatic column mapping |
| DataFrames | Memory handled directly by DuckDB in the same Python process—no data copying required |
| JSON | Destructured, flattened, and transformed into relational tables; native JSON type for storage (Analyze JSON data) |
| Parquet | Schema metadata queried directly; predicate pushdown evaluates filters at the storage layer to reduce data loaded—the ideal columnar format for data lakes |
| Apache Arrow | Read via ADBC (Arrow Database Connectivity) without data copying or transformation |
| Cloud Buckets (S3, GCP) | Query data directly, reducing transfer infrastructure and enabling cheap processing of large volumes |
See the Python Quickstart guide to get started with DataFrames.
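For the cloud-bucket row above, a minimal sketch (bucket and path are placeholders; credentials are assumed to come from your environment or a configured secret):

```python
import duckdb

duckdb.sql("INSTALL httpfs")  # extension for HTTP(S) and S3-compatible storage
duckdb.sql("LOAD httpfs")

# Query Parquet files in place: only the needed columns and row groups are fetched
duckdb.sql("SELECT count(*) FROM read_parquet('s3://my-bucket/logs/2024/*.parquet')").show()
```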
1.7.2 Data Structures
DuckDB handles a variety of tables, views, and data types. Beyond traditional types (VARCHAR, INTEGER, FLOAT, DECIMAL, DATE, TIMESTAMP, INTERVAL, BOOLEAN, BLOB), it supports complex structured data types:
- Enums — Indexed, named elements of a set that are stored and processed efficiently
- Lists (Arrays) — Hold multiple elements of the same type, with a variety of functions for list operations
- Maps — Efficient key-value pairs for keyed data points, commonly used during JSON processing
- Structs — Consistent key-value structures where the same key always has values of the same data type, enabling more efficient storage and processing
DuckDB also allows you to create custom types and database extensions that provide additional data types. You can create virtual (derived) columns generated from other data via expressions.
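A quick sketch of these nested types in action (all values invented):

```python
import duckdb

# Structs, lists, and maps are first-class values
duckdb.sql("""
    SELECT
        {'name': 'Ada', 'langs': ['SQL', 'Python']} AS person,
        map(['q25', 'q75'], [1.2, 3.4])             AS quartiles
""").show()

# Dot notation reaches into struct fields; list functions work on arrays
duckdb.sql("""
    SELECT person.name, len(person.langs) AS num_langs
    FROM (SELECT {'name': 'Ada', 'langs': ['SQL', 'Python']} AS person)
""").show()
```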
1.7.3 Developing the SQL
When analyzing data, start by understanding the shape of your data, then work from simple queries to increasingly complex ones:
- Inspect Structure: Use `DESCRIBE` to learn about columns and data types of your data sources, tables, and views.
- Get Statistics: Use the `SUMMARIZE` clause to get per-column statistics: `count`, `min`, `max`, `avg`, `std` (deviation), `approx_unique` (estimated count of distinct values), percentiles (q25, q50, q75), and `null_percentage`.
- Build Incrementally: Start with `LIMIT` or a single input file to prototype. Outline the result columns you need (converting dates with `strptime` if necessary), then add aggregations and filters.
- Advanced Aggregations: DuckDB provides many aggregation functions—from traditional (`min`, `avg`, `sum`) to advanced (`histogram`, `bitstring_agg`, `list`, `approx_count_distinct`) to statistical (percentiles, `entropy`, regression, `skewness`).
- Window Functions: For running totals and comparisons with previous/next rows, use `aggregation OVER (PARTITION BY column ORDER BY column2 [RANGE ...])`.
- Organize with CTEs: Extract repeatedly used parts of your analytics statement into named common table expressions (CTEs) or views for readability. (A sketch combining several of these steps follows this list.)
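Here is that combined sketch (the CSV file and its columns are placeholders for your own data):

```python
import duckdb

duckdb.sql("DESCRIBE SELECT * FROM 'trips.csv'").show()   # column names and types
duckdb.sql("SUMMARIZE SELECT * FROM 'trips.csv'").show()  # per-column statistics

# A window function for a running total per city, prototyped with LIMIT
duckdb.sql("""
    SELECT city, day, distance,
           sum(distance) OVER (PARTITION BY city ORDER BY day) AS running_km
    FROM 'trips.csv'
    LIMIT 10
""").show()
```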
1.7.4 Using or Processing the Results
Once you have results from DuckDB, you have several options:
Persist & Export
- Create a table from results: `CREATE TABLE <name> AS SELECT ...` (a sketch follows this list)
- Write to formats: CSV, JSON, Parquet, Excel, Apache Arrow
- Export to other databases (SQLite, PostgreSQL) via extensions
- Output smaller result sets as CSV or JSON from the CLI
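A brief sketch of the persist-and-export options above (table name and output paths are illustrative):

```python
import duckdb

# Persist a result set as a table, then write it out in other formats
duckdb.sql("CREATE TABLE top_cities AS SELECT 'Berlin' AS city, 42 AS trips")
duckdb.sql("COPY top_cities TO 'top_cities.parquet' (FORMAT parquet)")
duckdb.sql("COPY top_cities TO 'top_cities.csv' (HEADER, DELIMITER ',')")
```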
Visualize
Because a picture tells more than 1,000 rows, visualization is often the preferred choice:
- Terminal: Use the built-in `bar` function for inline bar charts, or command-line tools like `youplot` (a small `bar` example follows this list)
- Python: Convert results to DataFrames and render with matplotlib, ggplot, or Plotly
- R: Use ggplot2 for visualization
- JavaScript: Use D3, Nivo, or Observable
- Dashboards: Embed in applications built with Streamlit
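For the terminal option, a tiny sketch of the built-in `bar` function (data invented):

```python
import duckdb

# bar(value, min, max, width) renders a proportional text bar per row
duckdb.sql("""
    SELECT city, trips, bar(trips, 0, 100, 20) AS chart
    FROM (VALUES ('Berlin', 80), ('Paris', 45)) AS t(city, trips)
""").show()
```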
Serve via APIs
Because DuckDB is so fast, you can serve results directly from queries via an API that web, command-line, or mobile clients consume. You only need a traditional client-server database setup if your source data is too big to move and your results are comparatively small (much less than 1% of the volume). Otherwise, embed DuckDB into your application or dashboarding tool and run it on local raw data or a local DuckDB database.

1.8 Summary
- DuckDB is a modern embedded analytical database created in 2018 by Mark Raasveldt and Hannes Mühleisen, excelling at in-memory processing of gigabytes of data.
- It supports an extended SQL dialect with productivity features like `GROUP BY ALL`, `SELECT * EXCLUDE`, and `PIVOT`, and gains new capabilities through extensions.
- It reads diverse formats natively—CSV, JSON, Parquet, Apache Arrow, Excel—from both local and remote sources including S3 and GCP.
- Python, R, and other language integrations are seamless and efficient, with zero-copy DataFrame access.
- As an in-process database, it processes data efficiently without copying or network overhead.
- Beyond traditional types, DuckDB supports lists, maps, structs, and enums for complex data structures.
- Rich built-in functions for data types, aggregations, and statistical analysis make data processing and shaping straightforward.
- An iterative SQL workflow—starting with `DESCRIBE` and `SUMMARIZE`—helps you understand data before building complex queries.
- Flexible output options let you generate reports, visualize charts, export to various formats, or serve results directly via APIs.
