Chapter 5

Exploring Data Without Persistence

This is a summary of a chapter from DuckDB in Action, published by Manning. Download the complete book for free to read the full chapter.

5.1 Why use a database without persisting any data?

In traditional data engineering, the first step is often ingesting data into a database's storage format. DuckDB offers a powerful alternative: analyzing data without persisting it at all. This approach is ideal when data resides in remote locations such as Amazon S3, or when privacy concerns rule out permanent storage.

By decoupling the compute engine from the storage layer, you can use SQL to understand the shape and volume of external datasets before building production pipelines. DuckDB integrates seamlessly with existing infrastructure, allowing you to attach to SQLite and Postgres databases directly. In many benchmarks, such as the TPC-H decision support benchmark, DuckDB’s analytical engine has proven faster at querying Postgres data (via binary transfer mode) than Postgres itself.
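For instance, with the httpfs extension you can profile a remote dataset in place. The following is a minimal sketch, assuming a hypothetical S3 bucket (credential setup omitted):

    INSTALL httpfs;
    LOAD httpfs;

    -- Understand shape and volume without writing anything locally
    -- (bucket and path are hypothetical).
    SELECT count(*) AS row_count
    FROM read_parquet('s3://my-bucket/events/*.parquet');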

5.2 Inferring file type and schema

One of DuckDB’s most user-friendly features is its ability to auto-infer file types and schemas. Whether you are working with CSV, JSON, or Parquet files, DuckDB can automatically detect the format and structure using functions like read_csv_auto. A short sketch follows the list below.

  • Dialect Detection: Automatically identifies delimiters, quoting rules, and header rows.
  • Type Sniffing: Uses sampling (reading the first 20,480 rows by default) to determine column types with a descending priority list—checking for BOOLEAN first, then BIGINT, all the way down to VARCHAR. For granular control over these parameters, refer to the DuckDB documentation.
  • Schema Unification: When reading multiple files via glob syntax (e.g., data*.json), DuckDB can unify schemas by column position or use union_by_name to handle evolving schemas.
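The sketch below shows these features in action, assuming hypothetical file names:

    -- Dialect and column types are sniffed from a sample of rows.
    SELECT * FROM read_csv_auto('people.csv') LIMIT 5;

    -- Inspect the inferred schema without scanning the whole file.
    DESCRIBE SELECT * FROM read_csv_auto('people.csv');

    -- Read many JSON files at once; union_by_name reconciles
    -- files whose schemas have drifted over time.
    SELECT *
    FROM read_json_auto('data*.json', union_by_name = true);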

5.3 Shredding nested JSON

Modern data often comes in semi-structured formats. This chapter delves into shredding nested JSON using DuckDB’s native JSON extension. Using a real-world dataset of Premier League football matches—created using the understatapi library to analyze "Expected Goals" (xG) from Understat—the text demonstrates how to transform nested JSON objects into flat, relational tables.

Key techniques covered include (sketched in the example after this list):

  • Unnesting: Using the unnest() function to unpack lists and structs into rows.
  • Flattening Structs: Using the .* syntax to expand nested objects into individual columns automatically.
  • Handling Schema Drift: Utilizing try_cast to manage data inconsistencies where types may vary across JSON documents, ensuring query resilience without execution errors.
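As a hedged illustration, consider a hypothetical match document with a result struct and a goals list (these field names are assumptions, not the chapter's exact schema):

    -- Expand a nested struct into individual columns with .* syntax.
    SELECT id, result.*          -- yields e.g. result.home, result.away
    FROM read_json_auto('matches.json');

    -- Unpack a list of structs into one row per element; try_cast
    -- tolerates documents where xG arrives as a string.
    SELECT
        id,
        goal.minute,
        try_cast(goal.xG AS DOUBLE) AS xg
    FROM (
        SELECT id, unnest(goals) AS goal
        FROM read_json_auto('matches.json')
    );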

5.4 Translating CSV to Parquet

A common data engineering task is converting row-based formats like CSV into high-performance, columnar formats like Parquet. This section guides you through building a memory-efficient transformation pipeline using the ATP tennis rankings dataset.

You will learn how to (see the sketch after this list):

  • Manage Memory: Use the SET memory_limit configuration to process datasets larger than RAM by flushing data to disk when necessary.
  • Fix Data Types: Use strptime to convert non-standard date strings into proper DATE types during the transformation.
  • Optimize Output: Configure the COPY ... TO command with specific codecs (like SNAPPY or GZIP) and row group sizes to balance file size and query speed.
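The following sketch puts the three pieces together, with assumed column names for the rankings CSV:

    -- Cap memory so the conversion spills to disk instead of failing.
    SET memory_limit = '1GB';

    COPY (
        SELECT
            player,
            -- Parse compact dates like 20230102 into a proper DATE.
            CAST(strptime(ranking_date::VARCHAR, '%Y%m%d') AS DATE) AS ranking_date,
            "rank",
            points
        FROM read_csv_auto('atp_rankings*.csv')
    ) TO 'rankings.parquet'
      (FORMAT parquet, COMPRESSION snappy, ROW_GROUP_SIZE 100000);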

5.5 Analyzing and querying Parquet files

Parquet files are "self-describing," meaning they carry their own metadata. DuckDB allows you to query this metadata directly to optimize your analytical workloads. By using the parquet_schema and parquet_metadata functions, you can inspect the internal structure of the files without reading the full dataset.

This section explains the difference between physical types (how data is stored on disk) and logical types (how it is interpreted). It also demonstrates how DuckDB leverages projection pushdown and predicate pushdown—reading only the specific columns and row groups required for a query—making it an incredibly fast engine for analyzing large Parquet lakes.
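Both functions are themselves queryable like ordinary tables; for example (reusing the file name assumed in the previous section):

    -- Physical vs. logical types per column, read from the file footer.
    SELECT name, type, logical_type
    FROM parquet_schema('rankings.parquet');

    -- Per-row-group statistics that make predicate pushdown possible.
    SELECT row_group_id, path_in_schema, stats_min, stats_max,
           total_compressed_size
    FROM parquet_metadata('rankings.parquet');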

5.6 Querying SQLite and other databases

DuckDB acts as a federated query engine capable of attaching to OLTP databases such as SQLite, letting you run heavy analytical queries on transactional data without moving it. This section demonstrates the capability using the European Soccer Database from Kaggle.

The chapter highlights a specific challenge: SQLite is weakly typed (a column can hold mixed data types), whereas DuckDB is strongly typed. You will learn how to handle type mismatches—such as "float" values appearing in "integer" columns—by creating views that cast columns to consistent types. The text also touches on the postgres extension for reading directly from PostgreSQL instances; further details can be found in the extension documentation.
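A minimal sketch of the pattern, assuming the Kaggle download is named database.sqlite and using illustrative column names:

    INSTALL sqlite;
    LOAD sqlite;
    ATTACH 'database.sqlite' AS soccer (TYPE sqlite);

    -- A view normalizes SQLite's weakly typed columns: try_cast
    -- yields NULL instead of erroring on stray non-integer values.
    CREATE VIEW match_goals AS
    SELECT
        id,
        try_cast(home_team_goal AS INTEGER) AS home_team_goal,
        try_cast(away_team_goal AS INTEGER) AS away_team_goal
    FROM soccer.Match;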

5.7 Working with Excel files

Despite the popularity of CSV and Parquet, much of the world's data lives in spreadsheets. DuckDB supports reading and writing Microsoft Excel (.xlsx) files via the spatial extension (which utilizes the underlying GDAL library).

While powerful, this integration comes with caveats. The chapter explains how to use st_read to query Excel sheets as tables and discusses limitations, such as the lack of automatic formula evaluation. It offers practical advice on when to query Excel directly and when to convert to CSV for better performance and type safety.
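As a short sketch, assuming a hypothetical workbook report.xlsx with a sheet named Sheet1:

    INSTALL spatial;
    LOAD spatial;

    -- Each worksheet is exposed as a GDAL layer; formulas are not
    -- re-evaluated, so only their last saved values come through.
    SELECT *
    FROM st_read('report.xlsx', layer = 'Sheet1');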

Summary

  • Non-Persistent Processing: DuckDB enables powerful SQL analysis on data stored in files or pipelines without needing to persist it into the database's internal storage.
  • Intelligent Inference: The engine automatically detects content and data types for major formats like JSON, CSV, and Parquet, simplifying data integration.
  • Complex JSON Handling: Native capabilities allow for querying and normalizing deeply nested, denormalized JSON documents into structured tabular data.
  • Ephemeral Transformations: You can filter, convert, flatten, and join data from various sources on the fly, removing the need for intermediate storage.
  • High-Performance Federation: The vectorized query engine often executes analytical queries on external databases more efficiently than the native stores themselves, without duplicating data.
  • Encapsulated Logic: Views provide a robust mechanism to abstract and reuse complex transformations applied to external data types.