Chapter 10

Performance Considerations for Large Datasets

INFO: This is a summary of a book chapter from DuckDB in Action, published by Manning. Download the book for free to read the complete chapter.
'DuckDB In Action' book cover

So far, we have explored DuckDB using small to medium-sized datasets. However, real-world data engineering often involves massive volumes of data. Chapter 10 of DuckDB in Action demonstrates that you don’t always need a separate distributed query engine to handle large-scale data. Instead, DuckDB is highly capable of processing huge datasets efficiently on a single machine.

This chapter focuses on two significant datasets to showcase performance tips and tricks: the Stack Overflow community database (11 GB compressed) and the New York City Taxi dataset (1.7 billion rows).

10.1 Loading and querying the full Stack Overflow database

Stack Overflow is a massive community-driven Q&A site, and analyzing its data provides deep insights into developer trends. The dataset used here is approximately 11 GB in compressed CSV format, containing 58 million posts and 20 million users. While this doesn't quite qualify as "Big Data" (see the essay "Big Data is Dead"), it is large enough to test DuckDB's limits.

10.1.1 Data dump and extraction

The raw data is available from the Internet Archive as compressed XML files. Since DuckDB does not natively parse XML, the chapter guides you through a transformation pipeline using command-line tools like curl for downloading, 7-Zip for extraction, and xidel combined with jq to convert the XML structure into a CSV format compatible with DuckDB.

10.1.2 The data model

Understanding the relationships between entities is crucial. The Stack Overflow model revolves around Posts (Questions and Answers), which are linked to Users, Comments, Votes, and Badges. Although the CSV dump lacks formal constraints, we can infer foreign keys (e.g., ownerUserId linking Posts to Users) to rebuild the schema within our analytical database.
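As a rough illustration, the inferred relationships could be written out as explicit tables. The table and column names below are assumptions based on the public Stack Overflow dump, not necessarily the book's exact schema:

    -- Users: one row per Stack Overflow account
    CREATE TABLE users (
        id           BIGINT PRIMARY KEY,
        display_name VARCHAR,
        reputation   BIGINT
    );

    -- Posts: questions and answers, linked to their author
    CREATE TABLE posts (
        id            BIGINT,
        owner_user_id BIGINT,    -- inferred foreign key referencing users.id
        title         VARCHAR,
        tags          VARCHAR[],
        creation_date TIMESTAMP
    );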

10.1.3 Exploring the CSV file data

Before ingesting everything, it is best to inspect the data in place. DuckDB’s read_csv function allows you to query compressed CSV files directly.

  • Metadata Inspection: Using the DESCRIBE statement on a read_csv call reveals column names and inferred types without loading the full file.
  • Ad-hoc Analysis: You can run SQL queries directly on the files (e.g., finding the top 5 most popular tags) to validate the data structure before committing to a full load.
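Both steps might look roughly like this; the file name posts.csv.gz and the CreationDate column are assumptions rather than the exact names used in the chapter:

    -- Metadata inspection: column names and inferred types, without a full load
    DESCRIBE SELECT * FROM read_csv('posts.csv.gz');

    -- Ad-hoc validation query run directly against the compressed file
    SELECT count(*) AS posts,
           min(CreationDate) AS earliest,
           max(CreationDate) AS latest
    FROM read_csv('posts.csv.gz');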

10.1.4 Loading the data into DuckDB

Ingestion can be performed by creating tables explicitly or using CREATE TABLE AS SELECT (CTAS) to infer the schema automatically.

  • Schema Evolution: The CREATE OR REPLACE TABLE syntax is useful during testing to iterate on data loading scripts.
  • Data Summarization: The SUMMARIZE command provides a powerful shortcut that generates statistics (min, max, null percentage, approximate unique counts, and more) for every column in your table, helping you understand the data distribution instantly.
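A minimal sketch of both techniques, assuming a users.csv.gz file produced by the extraction step:

    -- CTAS with schema inference; CREATE OR REPLACE makes the script re-runnable
    CREATE OR REPLACE TABLE users AS
    SELECT * FROM read_csv('users.csv.gz');

    -- Per-column statistics: min, max, null percentage, approximate unique counts
    SUMMARIZE users;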

10.1.5 Fast exploratory queries on large tables

Once the data is loaded, DuckDB’s columnar engine excels at analytical queries.

  • Performance: Queries on 20 million users (e.g., finding top users by reputation) execute in sub-second time (0.126 seconds in the example).
  • Visualization: The chapter demonstrates the built-in bar() function, which generates ASCII bar charts directly in the terminal, visualizing reputation rates without external tools.
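A sketch of such a chart, assuming the users table has display_name and reputation columns (the book's column names may differ):

    -- Top users by reputation, with an inline bar chart scaled to the global maximum
    SELECT display_name,
           reputation,
           bar(reputation, 0, max(reputation) OVER (), 40) AS chart
    FROM users
    ORDER BY reputation DESC
    LIMIT 5;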

10.1.6 Posting on weekdays

The text demonstrates how to perform temporal analysis—such as determining which programming languages (like SQL vs. Rust) are discussed more on weekdays versus weekends. This involves grouping by day of the week and filtering by specific tags.
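One possible shape of such a query, assuming a posts table whose tags column has already been parsed into a list of strings:

    -- Number of questions carrying the 'sql' tag, per day of the week
    SELECT dayname(creation_date) AS day,
           count(*) AS questions
    FROM posts
    WHERE list_contains(tags, 'sql')
    GROUP BY ALL
    ORDER BY questions DESC;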

10.1.7 Using enums for tags

For large datasets, string operations can be costly. DuckDB supports ENUM types (enumerated types), which store strings as integers internally.

  • Optimization: Converting the Tags column from raw strings to an ENUM type reduces storage size and improves query performance for filtering and aggregation.
  • Implementation: The chapter walks through extracting distinct tags, creating an ENUM type, and using list_transform to convert arrays of string tags into arrays of enums.
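The overall pattern might look like this; the posts table and its list-typed tags column are assumptions carried over from the earlier sketches:

    -- 1. Build an ENUM type from the distinct tag strings
    CREATE TYPE tag_enum AS ENUM (
        SELECT DISTINCT unnest(tags) FROM posts WHERE tags IS NOT NULL
    );

    -- 2. Rewrite the tags column as a list of enum values in a new table
    CREATE TABLE posts_enum AS
    SELECT * REPLACE (list_transform(tags, t -> t::tag_enum) AS tags)
    FROM posts;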

10.2 Query planning and execution

Efficient query execution is vital for large datasets. DuckDB employs a sophisticated pipeline to transform SQL text into an optimized physical plan that its vectorized engine can execute.

10.2.1 Planner and optimizer

The journey of a query involves several stages:

  1. Parser: Checks syntax.
  2. Binder: Resolves table and column names.
  3. Logical Planner: Generates a basic plan.
  4. Optimizer: Applies rules (like join reordering and filter pushdown) to create an optimized logical plan.
  5. Physical Planner: Converts the logical plan into executable operations suited for the hardware.

10.2.2 Runtime and vectorization

DuckDB uses a vectorized execution engine. Instead of processing one row at a time, it processes data in batches (vectors) of 2,048 values.

  • Columnar Storage: Data is stored in row groups (horizontal partitions), allowing the engine to read only relevant columns.
  • Morsel-Driven Parallelism: Execution is parallelized using "morsels," allowing multiple threads to process different chunks of data simultaneously, maximizing CPU usage.
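Parallelism is largely automatic, but the thread count can be inspected and tuned; a quick check might look like this:

    -- How many threads is DuckDB currently allowed to use?
    SELECT current_setting('threads') AS threads;

    -- Cap or raise the degree of parallelism for this session
    SET threads = 8;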

10.2.3 Visualizing query plans with Explain and Explain Analyze

To debug performance, you can use:

  • EXPLAIN: Shows the tree of operators (the plan) without running the query.
  • EXPLAIN ANALYZE: Executes the query and reports actual runtimes and row counts for each operator. This helps identify bottlenecks, such as a scan returning too many rows or a sort operation taking too much memory.
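Both variants are simply prefixed to an ordinary query; posts and creation_date are the assumed names from the earlier sketches:

    -- Show the optimized plan without executing the query
    EXPLAIN
    SELECT count(*) FROM posts WHERE year(creation_date) = 2022;

    -- Execute the query and report per-operator timings and row counts
    EXPLAIN ANALYZE
    SELECT count(*) FROM posts WHERE year(creation_date) = 2022;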

10.3 Exporting the Stack Overflow data to Parquet

For long-term storage, Parquet is superior to CSV: it is a compressed, columnar binary format that preserves the schema.

  • Export: The COPY ... TO ... (FORMAT PARQUET) command exports tables efficiently.
  • Multithreading: DuckDB can write Parquet files in parallel (one file per thread), significantly speeding up the export process (e.g., reducing write time for the users table from 10s to 1.7s).
  • Performance Benefit: Reading the row count from a Parquet file is orders of magnitude faster (milliseconds vs. seconds) than CSV because DuckDB can read the metadata in the Parquet footer instead of scanning the entire file.
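A sketch of the export and the metadata-only count, using the users table from earlier; PER_THREAD_OUTPUT is the DuckDB copy option for writing one file per thread:

    -- Single-file export
    COPY users TO 'users.parquet' (FORMAT PARQUET);

    -- Parallel export: one Parquet file per thread, written into a directory
    COPY users TO 'users' (FORMAT PARQUET, PER_THREAD_OUTPUT true);

    -- Answered from the Parquet footers rather than by scanning all rows
    SELECT count(*) FROM 'users/*.parquet';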

10.4 Exploring the New York City Taxi dataset from Parquet files

To push boundaries, the chapter explores the NYC Taxi dataset—over 1.7 billion rows stored in 175 Parquet files. This section demonstrates DuckDB's ability to act as a query engine over data stored in cloud storage (S3) without importing it.

10.4.1 Configuring credentials for S3 access

Accessing an S3 bucket requires the httpfs extension. Users can configure access using CREATE SECRET to securely store credentials (region, key ID, secret) for persistent access to private buckets.
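A sketch of the setup, with placeholder region and credential values:

    INSTALL httpfs;
    LOAD httpfs;

    -- Store S3 credentials for subsequent queries (values are placeholders);
    -- CREATE PERSISTENT SECRET keeps the secret across sessions
    CREATE SECRET taxi_s3 (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    );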

10.4.2 Auto-inferring file types

DuckDB can query remote Parquet files directly using read_parquet. It supports glob patterns (wildcards), allowing you to aggregate data across hundreds of files (e.g., yellow_tripdata_*.parquet) in a single query.
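For instance, a single aggregate can span every matching file; the bucket name below is a placeholder:

    -- One query across all yellow-cab files matched by the glob pattern
    SELECT count(*) AS trips
    FROM read_parquet('s3://some-bucket/yellow_tripdata_*.parquet');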

10.4.3 Exploring Parquet schema

Using the parquet_schema() function, you can inspect the schema of remote files to understand column names and types (like passenger_count or trip_distance) without downloading the payload data.
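A quick look at the metadata only, again with a placeholder bucket and file name:

    -- Column names and Parquet types straight from the file metadata
    SELECT name, type
    FROM parquet_schema('s3://some-bucket/yellow_tripdata_2023-01.parquet');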

10.4.4 Creating views

Instead of repeatedly typing long S3 URLs, you can create a view.

  • Abstraction: A view acts like a virtual table. Creating a view over the wildcard path (.../yellow_tripdata_202*.parquet) allows you to treat the remote data lake as a local table.
  • Union by Name: If schema evolves over time, the union_by_name=true option ensures that files with slightly different columns are merged gracefully, filling missing columns with NULLs.
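Putting both points together, the view might be defined like this (placeholder bucket, assumed view name taxi_rides):

    -- Treat the remote files as one local, queryable relation
    CREATE OR REPLACE VIEW taxi_rides AS
    SELECT *
    FROM read_parquet('s3://some-bucket/yellow_tripdata_202*.parquet',
                      union_by_name = true);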

10.4.5 Analyzing the data

Running SUMMARIZE on the view triggers a scan of the remote data. While slower than querying local data (due to network latency), it is efficient because DuckDB only pulls the column data it needs. This enables massive-scale analysis (100M+ rows) from a local laptop or a connected cloud service like MotherDuck.

10.4.6 Making use of the taxi dataset

The chapter concludes with advanced analytics on the taxi data, such as calculating the average fare per mile over different years.

  • Star Expressions: The text introduces COLUMNS expressions (star expressions) to apply aggregate functions (like min, max, avg) to multiple columns at once using pattern matching, e.g., all columns containing "amount" or "fee" (see the sketch after this list).
  • Filter Clauses: Analysis shows trends like the dip in rides during the 2020 pandemic, handled efficiently by filtering on the year derived from pickup timestamps.
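Two sketches of these techniques against the taxi_rides view defined above; the column name tpep_pickup_datetime follows the public NYC Taxi schema and is an assumption here:

    -- Apply min, max, and avg to every column whose name contains "amount"
    SELECT min(COLUMNS('.*amount.*')),
           max(COLUMNS('.*amount.*')),
           avg(COLUMNS('.*amount.*'))
    FROM taxi_rides;

    -- Compare the pandemic year against the total using a FILTER clause
    SELECT count(*) AS all_rides,
           count(*) FILTER (WHERE year(tpep_pickup_datetime) = 2020) AS rides_2020
    FROM taxi_rides;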

Summary

  • Scale: DuckDB effectively handles datasets ranging from gigabytes (Stack Overflow) to billions of rows (NYC Taxi) on a single node.
  • Flexibility: It supports diverse ingestion paths, from complex XML-to-CSV transformations to direct querying of Parquet files on S3.
  • Performance: Features like vectorized execution, enum optimization, and multithreaded exports ensure high performance.
  • Metadata: Leveraging metadata (via SUMMARIZE or Parquet footers) allows for instant insights into massive datasets without full table scans.