Performance Considerations for Large Datasets
INFO: This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the complete book for free to read the full chapter.
So far, we have explored DuckDB using small to medium-sized datasets. However, real-world data engineering often involves massive volumes of data. Chapter 10 of DuckDB in Action demonstrates that you don’t always need a separate distributed query engine to handle large-scale data. Instead, DuckDB is highly capable of processing huge datasets efficiently on a single machine.
This chapter focuses on two significant datasets to showcase performance tips and tricks: the Stack Overflow community database (11 GB compressed) and the New York City Taxi dataset (1.7 billion rows).
10.1 Loading and querying the full Stack Overflow database
Stack Overflow is a massive community-driven Q&A site. Analyzing its data provides deep insights into developer trends. The dataset used here is approximately 11 GB in compressed CSV format, containing 58 million posts and 20 million users. While this doesn't quite qualify as "Big Data" (read more about why Big Data is Dead), it is large enough to test DuckDB's limits.
10.1.1 Data dump and extraction
The raw data is available from the Internet Archive as compressed XML files. Since DuckDB does not natively parse XML, the chapter guides you through a transformation pipeline using command-line tools like curl for downloading, 7-Zip for extraction, and xidel combined with jq to convert the XML structure into a CSV format compatible with DuckDB.
10.1.2 The data model
Understanding the relationships between entities is crucial. The Stack Overflow model revolves around Posts (Questions and Answers), which are linked to Users, Comments, Votes, and Badges. Although the CSV dump lacks formal constraints, we can infer foreign keys (e.g., ownerUserId linking Posts to Users) to rebuild the schema within our analytical database.
10.1.3 Exploring the CSV file data
Before ingesting everything, it is best to inspect the data in place. DuckDB’s read_csv function allows you to query compressed CSV files directly.
- Metadata Inspection: Using the DESCRIBE statement on a read_csv call reveals column names and inferred types without loading the full file.
- Ad-hoc Analysis: You can run SQL queries directly on the files (e.g., finding the top 5 most popular tags) to validate the data structure before committing to a full load; see the sketch below.
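A minimal sketch of both patterns, assuming the posts dump has been converted to a compressed CSV file named posts.csv.gz (the file name and the Tags column are assumptions for illustration):

```sql
-- Inspect inferred column names and types without loading the full file
DESCRIBE SELECT * FROM read_csv('posts.csv.gz');

-- Ad-hoc query run directly against the compressed CSV
SELECT Tags, count(*) AS cnt
FROM read_csv('posts.csv.gz')
GROUP BY Tags
ORDER BY cnt DESC
LIMIT 5;
```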
10.1.4 Loading the data into DuckDB
Ingestion can be performed by creating tables explicitly or using CREATE TABLE AS SELECT (CTAS) to infer the schema automatically.
- Schema Evolution: The CREATE OR REPLACE TABLE syntax is useful during testing to iterate on data loading scripts.
- Data Summarization: The SUMMARIZE clause provides a powerful shortcut to generate statistics such as min, max, null percentage, and approximate unique counts for every column in your table, helping you understand data distribution instantly; see the sketch below.
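A minimal sketch of both techniques, assuming a users.csv.gz dump (file and table names are illustrative):

```sql
-- Infer the schema and (re)create the table in one step (CTAS)
CREATE OR REPLACE TABLE users AS
SELECT * FROM read_csv('users.csv.gz');

-- Per-column statistics: min, max, null percentage, approximate unique counts
SUMMARIZE users;
```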
10.1.5 Fast exploratory queries on large tables
Once the data is loaded, DuckDB’s columnar engine excels at analytical queries.
- Performance: Queries on 20 million users (e.g., finding top users by reputation) execute in sub-second time (0.126 seconds in the example).
- Visualization: The chapter demonstrates using the internal bar() function to generate ASCII bar charts directly in the terminal, visualizing reputation rates without external tools; see the sketch below.
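A minimal sketch combining both ideas, assuming the users table from above with displayName and reputation columns (the column names are assumptions):

```sql
-- Top users by reputation, with an ASCII bar chart rendered by bar()
SELECT displayName,
       reputation,
       bar(reputation, 0, max(reputation) OVER (), 40) AS chart
FROM users
ORDER BY reputation DESC
LIMIT 5;
```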
10.1.6 Posting on weekdays
The text demonstrates how to perform temporal analysis—such as determining which programming languages (like SQL vs. Rust) are discussed more on weekdays versus weekends. This involves grouping by day of the week and filtering by specific tags.
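A minimal sketch of this kind of analysis, assuming a posts table with CreationDate and Tags columns (the column names and tag filter are assumptions):

```sql
-- Number of SQL questions per day of the week
SELECT dayofweek(CreationDate) AS dow,
       dayname(CreationDate)   AS day,
       count(*)                AS questions
FROM posts
WHERE Tags LIKE '%<sql>%'
GROUP BY dow, day
ORDER BY dow;
```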
10.1.7 Using enums for tags
For large datasets, string operations can be costly. DuckDB supports ENUM types (enumerated types), which store strings as integers internally.
- Optimization: Converting the Tags column from raw strings to an ENUM type reduces storage size and improves query performance for filtering and aggregation.
- Implementation: The chapter walks through extracting distinct tags, creating an ENUM type, and using list_transform to convert arrays of string tags into arrays of enums; see the sketch below.
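A minimal sketch of the approach, assuming the tags have already been split into a VARCHAR[] column named tags (all names are assumptions):

```sql
-- Build an ENUM type from all distinct tag strings
CREATE TYPE tag_type AS ENUM (
    SELECT DISTINCT t
    FROM (SELECT unnest(tags) AS t FROM posts)
);

-- Rewrite the tags column as an array of enum values
CREATE OR REPLACE TABLE posts_enum AS
SELECT * REPLACE (
    list_transform(tags, t -> t::tag_type) AS tags
)
FROM posts;
```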
10.2 Query planning and execution
Efficient query execution is vital for large datasets. DuckDB employs a sophisticated pipeline to transform SQL text into an optimized, executable physical plan.
10.2.1 Planner and optimizer
The journey of a query involves several stages:
- Parser: Checks syntax.
- Binder: Resolves table and column names.
- Logical Planner: Generates a basic plan.
- Optimizer: Applies rules (like join reordering and filter pushdown) to create an optimized logical plan.
- Physical Planner: Converts the logical plan into executable operations suited for the hardware.
10.2.2 Runtime and vectorization
DuckDB uses a vectorized execution engine. Instead of processing one row at a time, it processes data in batches (vectors) of 2,048 values.
- Columnar Storage: Data is stored in row groups (horizontal partitions), allowing the engine to read only relevant columns.
- Morsel-Driven Parallelism: Execution is parallelized using "morsels," allowing multiple threads to process different chunks of data simultaneously, maximizing CPU usage.
10.2.3 Visualizing query plans with Explain and Explain Analyze
To debug performance, you can use:
- EXPLAIN: Shows the tree of operators (the plan) without running the query.
- EXPLAIN ANALYZE: Executes the query and reports actual runtimes and row counts for each operator. This helps identify bottlenecks, such as a scan returning too many rows or a sort operation taking too much memory. A usage sketch follows this list.
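A minimal usage sketch against the users table loaded earlier (the query itself is illustrative):

```sql
-- Show the optimized operator tree without executing the query
EXPLAIN
SELECT count(*) FROM users WHERE reputation > 10000;

-- Execute the query and report per-operator timings and row counts
EXPLAIN ANALYZE
SELECT count(*) FROM users WHERE reputation > 10000;
```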
10.3 Exporting the Stack Overflow data to Parquet
For long-term storage, Parquet is superior to CSV. It is a compressed, columnar binary format that supports schema retention.
- Export: The COPY ... TO ... (FORMAT PARQUET) command exports tables efficiently; see the sketch below.
- Multithreading: DuckDB can write Parquet files in parallel (one file per thread), significantly speeding up the export process (e.g., reducing write time for the users table from 10s to 1.7s).
- Performance Benefit: Reading the row count from a Parquet file is orders of magnitude faster (milliseconds vs. seconds) than CSV because DuckDB can read the metadata in the Parquet footer instead of scanning the entire file.
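A minimal sketch of a single-file export and a parallel, one-file-per-thread export (paths are illustrative):

```sql
-- Export a table to a single Parquet file
COPY users TO 'users.parquet' (FORMAT PARQUET);

-- Write one Parquet file per thread into a directory
COPY users TO 'users' (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);
```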
10.4 Exploring the New York City Taxi dataset from Parquet files
To push boundaries, the chapter explores the NYC Taxi dataset—over 1.7 billion rows stored in 175 Parquet files. This section demonstrates DuckDB's ability to act as a query engine over data stored in cloud storage (S3) without importing it.
10.4.1 Configuring credentials for S3 access
Accessing an S3 bucket requires the httpfs extension. Users can configure access using CREATE SECRET to securely store credentials (region, key ID, secret) for persistent access to private buckets.
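A minimal sketch with placeholder credential values:

```sql
-- Enable remote file access over HTTP/S3
INSTALL httpfs;
LOAD httpfs;

-- Store S3 credentials as a named secret (values are placeholders)
CREATE SECRET taxi_s3 (
    TYPE S3,
    KEY_ID 'my-key-id',
    SECRET 'my-secret-key',
    REGION 'us-east-1'
);
```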
10.4.2 Auto-inferring file types
DuckDB can query remote Parquet files directly using read_parquet. It supports glob patterns (wildcards), allowing you to aggregate data across hundreds of files (e.g., yellow_tripdata_*.parquet) in a single query.
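A minimal sketch, using an illustrative bucket path:

```sql
-- Aggregate across all matching remote Parquet files in one query
SELECT count(*)
FROM read_parquet('s3://my-taxi-bucket/yellow_tripdata_*.parquet');
```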
10.4.3 Exploring Parquet schema
Using the parquet_schema() function, you can inspect the schema of remote files to understand column names and types (like passenger_count or trip_distance) without downloading the payload data.
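A minimal sketch, again with an illustrative path:

```sql
-- Read column names and types from the Parquet footer metadata only
SELECT name, type
FROM parquet_schema('s3://my-taxi-bucket/yellow_tripdata_2023-01.parquet');
```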
10.4.4 Creating views
Instead of repeatedly typing long S3 URLs, you can create a view.
- Abstraction: A view acts like a virtual table. Creating a view over the wildcard path (.../yellow_tripdata_202*.parquet) allows you to treat the remote data lake as a local table.
- Union by Name: If the schema evolves over time, the union_by_name=true option ensures that files with slightly different columns are merged gracefully, filling missing columns with NULLs; see the sketch below.
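A minimal sketch of such a view (the bucket path is illustrative):

```sql
-- Virtual table over all matching remote files,
-- merging evolving schemas by column name
CREATE OR REPLACE VIEW taxi_rides AS
SELECT *
FROM read_parquet('s3://my-taxi-bucket/yellow_tripdata_202*.parquet',
                  union_by_name = true);
```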
10.4.5 Analyzing the data
Running SUMMARIZE on the view triggers a scan of the remote data. While slower than local data (due to network latency), it is efficient because DuckDB only pulls necessary column data. This allows for massive scale analysis (100M+ rows) from a local laptop or a connected cloud instance like MotherDuck.
10.4.6 Making use of the taxi dataset
The chapter concludes with advanced analytics on the taxi data, such as calculating the average fare per mile over different years.
- Star Expressions: The text introduces COLUMNS expressions (star expressions) to apply aggregate functions (like min, max, avg) to multiple columns at once using pattern matching (e.g., all columns containing "amount" or "fee"); see the sketch below.
- Filter Clauses: Analysis shows trends like the dip in rides during the 2020 pandemic, handled efficiently by filtering on the year derived from pickup timestamps.
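A minimal sketch of a COLUMNS expression over the taxi_rides view sketched earlier (the column name pattern and the tpep_pickup_datetime column are assumptions):

```sql
-- Average every column whose name contains "amount", per pickup year
SELECT year(tpep_pickup_datetime) AS trip_year,
       avg(COLUMNS('.*amount.*'))
FROM taxi_rides
GROUP BY trip_year
ORDER BY trip_year;
```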
Summary
- Scale: DuckDB effectively handles datasets ranging from gigabytes (Stack Overflow) to billions of rows (NYC Taxi) on a single node.
- Flexibility: It supports diverse ingestion paths, from complex XML-to-CSV transformations to direct querying of Parquet files on S3.
- Performance: Features like vectorized execution, enum optimization, and multithreaded exports ensure high performance.
- Metadata: Leveraging metadata (via SUMMARIZE or Parquet footers) allows for instant insights into massive datasets without full table scans.


