
Leveraging DuckDB and Delta Lake Together

2024/07/24

If you work with data, you've almost certainly worked with Parquet files. It's the go-to columnar format for analytics, a massive leap forward from human-readable but machine-unfriendly formats like CSV. But as powerful as Parquet is as a file format, it's not a complete solution for managing data tables.

As our data pipelines grow more complex, the limitations of a folder full of Parquet files become painfully obvious. In a recent episode of "Quack and Code," I sat down with Holly Smith, a Developer Advocate at Databricks, to discuss the evolution beyond plain Parquet files and how the Delta Lake table format, combined with DuckDB and MotherDuck, solves these modern data challenges.

Holly shared a relatable journey: "Back when I was a data scientist, I'd grab things from somewhere like Teradata. And I'd never think about a table being a file format... It's just kind of like SELECT * FROM table and then it works."

TL;DR

  • The Problem: Using folders of Parquet files as tables leads to schema enforcement issues, inefficient updates/deletes, and the "small file problem."
  • The Solution: Delta Lake adds a transaction log (_delta_log) on top of Parquet files to provide ACID guarantees, solving these reliability issues.
  • How to Query: DuckDB's delta extension makes it easy to query Delta Lake tables locally or on S3 using a simple delta_scan() function.
  • Performance Boost: For cloud data, use MotherDuck to run queries in the same region as your S3 data, drastically reducing latency by transferring only the final results, not the raw data.

We'll start with the pain points of plain Parquet, see how Delta Lake addresses them, and then get hands-on with querying Delta tables using DuckDB and MotherDuck.

The Pain Points of Using Plain Parquet as a Table

Parquet is an excellent building block, but when you treat a directory of Parquet files as a single table, you invite problems that can break pipelines and cause massive headaches.

The Challenge of Schema Enforcement

A common issue with systems that use schema-on-read is that a single rogue Parquet file can break an entire query. As Holly explained, with plain Parquet, you could define a column as an integer but then write a string into it in a new file, and nothing would stop you.

If one file has a slightly different or incompatible schema, your query fails. Finding and fixing that one problematic file in a table composed of thousands requires tedious, manual work.
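Here's a minimal DuckDB sketch of how that failure mode sneaks in. The events/ directory, file names, and column names are made up for illustration:

-- Two writers drop files with conflicting types into the same "table"
-- directory, and nothing stops the second write (assumes events/ exists).
COPY (SELECT 1 AS id, 42 AS amount) TO 'events/part-1.parquet' (FORMAT parquet);
COPY (SELECT 2 AS id, 'oops' AS amount) TO 'events/part-2.parquet' (FORMAT parquet);

-- The problem only surfaces later, at read time:
SELECT * FROM read_parquet('events/*.parquet');
-- Fails (or behaves unexpectedly, depending on reader settings) because
-- the files disagree on the type of "amount".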

Inefficient Updates and Deletes

Parquet files are immutable. You can't just open one up and delete a row.

To handle updates or deletes (like a GDPR right-to-be-forgotten request), you have to read all the relevant Parquet files, filter out the records you don't want, and write out entirely new files.

Holly shared a powerful real-world example of this pain:

"I remember working with one customer... they had every click from their website from the last five years in this one monster table. For two hours a day, they'd have to take it offline. And then GDPR came in and they were like, we don't even know how we can delete data from this. This is a nightmare."

This process is not only complex and slow but also requires taking the table "offline," creating downtime where the data is unreliable.
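To make that concrete, here's a minimal sketch of the rewrite dance in DuckDB SQL; the paths and the user_id column are hypothetical:

-- Read everything, filter out the records to forget, write new files.
COPY (
    SELECT *
    FROM read_parquet('events/*.parquet')
    WHERE user_id <> 'user-to-forget'
) TO 'events_rewritten.parquet' (FORMAT parquet);
-- You still have to swap the old files for the new ones yourself, and
-- nothing protects queries that run while that swap is in progress.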

The "Small File Problem"

Query engines love large, uniform files. However, many data architectures, especially those involving streaming or certain partitioning schemes, create lots of small files.

For instance, an e-commerce platform partitioning by date, as Holly noted in her "Black Friday" example, might have one massive file for a busy shopping day and hundreds of tiny, inefficient files for slower days. Querying thousands of tiny files is incredibly inefficient due to the overhead of opening each file and reading its metadata.
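Without a table format, compaction is another manual chore. A hedged sketch of folding one partition's pile of tiny files back into a single larger file (paths are hypothetical):

COPY (
    SELECT * FROM read_parquet('events/date=2024-11-29/*.parquet')
) TO 'events_compacted_2024-11-29.parquet' (FORMAT parquet);
-- Table formats like Delta Lake handle this with a compaction (OPTIMIZE)
-- operation that is recorded safely in the transaction log.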

Delta Lake: Adding Transactional Reliability to Parquet

Delta Lake was created to solve these problems. It's not a new file type but a storage protocol that adds a crucial layer of reliability on top of the Parquet files you already have.

"Technically, Delta is a protocol," Holly clarified. "We still have those Parquet files, which are great, but what we're going to do is add a tiny bit of metadata with our Parquet files and store them next to them."

This "tiny bit of metadata" is the key: a transaction log that brings ACID properties (Atomicity, Consistency, Isolation, and Durability) to your data lake. Every operation, whether it's adding data, deleting records, or updating a schema, records as an atomic commit in this transaction log. A transaction either completely succeeds or completely fails, eliminating the risk of corrupted data or inconsistent reads.

Under the Hood: A Peek Inside the _delta_log

So what does this transaction log actually look like? When you create a Delta table, you'll see your familiar Parquet data files alongside a new directory: _delta_log.

/path/to/my_delta_table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   └── ...
├── part-00000-a-b-c.snappy.parquet
└── part-00001-d-e-f.snappy.parquet

Inside _delta_log, you'll find a series of sequentially numbered JSON files. Each file represents a single atomic transaction or "commit" to the table.

Here's a simplified example of what one of these JSON files might contain after adding a new Parquet file to the table:

{
  "commitInfo": {
    "timestamp": 1672531200000,
    "operation": "WRITE",
    "operationParameters": { "mode": "Append", "partitionBy": "[]" },
    "isBlindAppend": true
  },
  "add": {
    "path": "part-00001-d-e-f.snappy.parquet",
    "partitionValues": {},
    "size": 1024,
    "modificationTime": 1672531199000,
    "dataChange": true,
    "stats": "{\"numRecords\":100, \"minValues\":{\"id\":101}, \"maxValues\":{\"id\":200}}"
  }
}

This log entry tells us everything a query engine needs to know: a new file was added, its path, its size, and even file-level statistics like the number of records and min/max values, which let an engine skip reading the file entirely when a query's filter falls outside that range.
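For example, with the stats above saying id only ranges from 101 to 200 in that file, a filter outside that range can prune the file without opening it. A hedged illustration:

-- The file's recorded min/max for "id" is 101..200, so for this filter a
-- Delta-aware engine can (in principle) skip the file entirely.
SELECT *
FROM delta_scan('path/to/my_delta_table')
WHERE id > 500;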

A common question is, "Can I still query the Parquet files directly?" As Holly explained, while you physically can, you shouldn't.

When a row is "deleted" in Delta Lake, the transaction log simply adds a remove action pointing to the old Parquet file. The data still exists in that file, but the log tells any Delta-aware engine to ignore it. If you bypass the log and query the Parquet file directly, you'll see stale, incorrect data. Always query through the Delta protocol to get the correct, current state of your table.
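A quick way to see the difference, sketched with the hypothetical table path used above:

-- Count through the Delta protocol vs. straight from the raw Parquet files.
SELECT count(*) FROM delta_scan('path/to/my_delta_table');
-- current table state: rows removed in the log are excluded

SELECT count(*) FROM read_parquet('path/to/my_delta_table/*.parquet');
-- raw files: rows that were "deleted" may still show up in the count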

How to Query Delta Lake with DuckDB

The DuckDB delta extension makes it incredibly simple to read this transaction log and query Delta tables correctly. Built on the open-source Delta Kernel project, it ensures robust and efficient access.

Step 1: Install and Load the Extension

First, you'll need the delta extension. DuckDB will automatically download and install it the first time you use a Delta-related function, but you can also install it manually:

INSTALL delta;
LOAD delta;

Step 2: Query a Local Delta Table

To query a Delta table, you use delta_scan, a table function that interprets the Delta Lake transaction log to find the correct data files. Querying a table stored on your local machine is as simple as pointing the function to the root directory of the table.

-- Query a local Delta Lake table
SELECT * FROM delta_scan('path/to/my_delta_table');

DuckDB reads the _delta_log, determines the current set of active Parquet files, and queries them. That's it.
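If you'll be querying the same table repeatedly, one convenient (optional) pattern is to wrap the scan in a view; the view name here is just an example:

-- Downstream queries now read like an ordinary table.
CREATE VIEW my_delta_table AS
SELECT * FROM delta_scan('path/to/my_delta_table');

SELECT count(*) FROM my_delta_table;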

Step 3: Query a Delta Table in S3

The real power comes when your Delta tables are stored in cloud object storage like Amazon S3. The delta extension integrates with DuckDB's httpfs extension and secrets manager.

First, configure your AWS credentials. DuckDB can automatically use your environment variables or local AWS config, or you can create a secret explicitly:

-- Create a secret to use your local AWS credential chain
CREATE SECRET (
    TYPE s3,
    PROVIDER credential_chain
);
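If you'd rather not rely on the credential chain, you can also pass keys explicitly; the values and region below are placeholders, not real credentials:

-- Named secret with explicit keys (placeholders only).
CREATE SECRET my_s3_secret (
    TYPE s3,
    KEY_ID 'AKIA...',
    SECRET 'your-secret-access-key',
    REGION 'us-east-1'
);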

Now, you can query the Delta table by providing its S3 URI to delta_scan:

-- Query a Delta Lake table stored in S3
SELECT user_id, COUNT(*) AS event_count
FROM delta_scan('s3://my-bucket/path/to/delta_table')
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;

Level Up: Accelerating S3 Queries with MotherDuck

Querying a remote Delta table from your local machine works great, but it has a physical limitation: network latency. Your local DuckDB instance has to make multiple requests to S3 to read the transaction log and then download all the necessary Parquet data over the internet before it can do any processing.

As Holly pointed out, this becomes a major bottleneck for distributed teams: "If your data is in S3 and it's halfway across the planet, connecting locally is... you're just going to have latency regardless of the tool that you're using."

MotherDuck, a serverless analytics platform powered by DuckDB that runs in the cloud, addresses this directly: by running your query in MotherDuck, you move the compute next to the data.

Here's the workflow comparison:

Local DuckDB:

  1. Connect to S3 from your laptop.
  2. List files in _delta_log.
  3. Download relevant log files to your laptop.
  4. Parse logs to identify active data files.
  5. Download gigabytes of Parquet data from S3 to your laptop.
  6. Process the data locally.
  7. Display the small result set.

MotherDuck:

  1. Connect your local DuckDB client to MotherDuck.
  2. Send the SQL query to MotherDuck's cloud-based engine.
  3. MotherDuck's engine performs the equivalent of steps 1-6 above within the same cloud region as your data, over a high-speed network.
  4. MotherDuck sends only the small, final result set back to your laptop.

This shift dramatically reduces data transfer over the public internet, moving only the small final result set instead of the entire raw dataset.

The query is identical. The only difference is where it runs.

-- 1. Connect to MotherDuck (from the CLI)
--    duckdb md:
-- 2. Ensure your S3 secret is stored in MotherDuck
--    You can create it once at https://app.motherduck.com/
-- 3. Run the exact same query
SELECT user_id, COUNT(*) AS event_count
FROM delta_scan('s3://my-bucket/path/to/delta_table')
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;

For large datasets, the performance difference is dramatic. Instead of pulling gigabytes of raw data across the internet, you're only transmitting kilobytes of results.

The Bigger Picture: A Converging Ecosystem

While this guide focuses on Delta Lake, it's reassuring to know that the skills you're learning fit into a broader, converging ecosystem. The data world once had a "table format war" between Delta Lake, Apache Iceberg, and Apache Hudi. But the trend is now toward interoperability and convergence.

With Databricks acquiring Tabular (the company founded by Apache Iceberg's original creators), the focus is on making these formats work together. Features like Delta UniForm allow a single table to be read as both Delta and Iceberg, and future developments aim to make writing to multiple formats atomic.

"This is not taking away one project... this is the idea of how can we take great features from the various projects and make them work together?" Holly said. "Hopefully, it shouldn't matter in the future which one you pick."

Delta Lake vs. Iceberg vs. Hudi

At a high level, the main difference between the three open-source table formats is how they track changes and table state.

  • Delta Lake uses an ordered transaction log of JSON files in the _delta_log directory. To find the current state, a reader must process these logs sequentially.
  • Apache Iceberg takes a different approach, using a hierarchy of metadata files. It maintains a central metadata file that points to manifest lists, which in turn point to data files. This structure can make discovering the current files for a query faster, as it doesn't require listing and processing a potentially long transaction log.
  • Apache Hudi offers different table types (Copy on Write and Merge on Read) and maintains a timeline of actions performed on the table, similar to Delta's log. It was initially focused on providing fast upserts and deletes for streaming use cases.

Conclusion & Next Steps

Parquet is a fantastic file format, but for building reliable, scalable data tables, you need more. Delta Lake provides the transactional guarantees and performance features necessary for modern data platforms.

  • Use Delta Lake to add a transactional metadata layer that solves the core problems of plain Parquet.
  • Use DuckDB and its delta extension to query these tables locally or in the cloud with minimal setup.
  • Use MotherDuck to run queries on cloud-based Delta tables with compute next to the data, eliminating network bottlenecks.

Ready to try it yourself?

FAQ

What is the main advantage of Delta Lake over Parquet?

Delta Lake adds a transaction log to Parquet files, providing ACID transactions. The log addresses the major pain points of plain Parquet, such as missing schema enforcement and difficult updates/deletes, and ensures a level of data reliability that a simple folder of Parquet files cannot guarantee.

Can I query Delta Lake with standard SQL?

Yes. With tools like DuckDB and its delta extension, you can use standard SQL functions like delta_scan('path/to/table') to query Delta Lake tables as if they were regular database tables.

Why should I use MotherDuck to query a Delta Lake table in S3?

Querying a remote S3 Delta table from your local machine requires downloading large amounts of data, causing high latency. MotherDuck runs the DuckDB engine in the cloud, next to your data, processes the query remotely, and sends only the small final result back to you, which is significantly faster.

Can DuckDB write to Delta Lake tables?

Currently, DuckDB's delta extension primarily supports reading from Delta Lake tables. Writing or updating data in the Delta format is not yet supported directly through the extension.
