Escaping Catalog Hell: A Guide to Iceberg, DuckDB & the Data Lakehouse
2025/06/12

Building a modern data stack often feels like a choice between two extremes. You can go "full-SaaS" with a platform like Snowflake or Databricks, which gets you moving fast but risks vendor lock-in and spiraling costs. Or, you can build it all yourself with open-source tools, giving you ultimate flexibility but often requiring months of complex infrastructure work before you can deliver a single insight.
This is what Julien Hurault calls the "cold start problem". "There is no middle ground," he notes in a recent conversation with MotherDuck's Mehdi Ouazza. Every data team, from startups to large enterprises, faces this tension.
So, how do you find that middle ground? Open table formats like Apache Iceberg are the map, promising a future where data is decoupled from compute. But the catalog—the system that tracks the state of your tables—is the tricky terrain you must navigate.
In this article, we'll explore this terrain through the expert eyes of Julien, who has guided many companies on this journey. We'll break down the promise and the pain of the modern data stack, demystify the catalog, and walk through a hands-on tutorial to get you started with Iceberg and DuckDB in minutes, no cloud account required.
The Promise and the Pain of the Modern Data Stack
The dream of the modern data stack is flexibility. You want to use the best tool for the job without being locked into a single vendor's ecosystem. This is where open table formats like Apache Iceberg, Delta Lake, and Hudi come in. They allow you to store your data in a vendor-neutral format in your own object storage (like AWS S3 or Google Cloud Storage).
The Multi-Engine Lakehouse Vision
Once your data is in an open format, you can bring different query engines to it. This "multi-engine" approach is the future of data architecture.
As Julien puts it, "People are just going to start by dumping their data in Iceberg... and then just plug a warehouse on top of it". This turns the traditional data warehouse on its head. Instead of being the single source of truth for storage and compute, it becomes just one of many specialized tools you can use.
🎙️ Julien's Insight: Think of a powerful data warehouse like a "serverless function". You can spin it up to perform a compute-intensive task on your Iceberg data and then write the results back to the lakehouse. Nothing is permanently stored or locked inside the warehouse.
This model gives you incredible power:
- Use DuckDB 🦆 for fast, local analytical queries and development (sketched just after this list).
- Use Spark ✨ for large-scale ETL and batch processing.
- Use Snowflake or BigQuery ❄️ for massive, ad-hoc interactive queries when you need the horsepower.
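To make the multi-engine idea concrete, here is a minimal sketch of DuckDB reading an Iceberg table straight out of object storage. The bucket path is a placeholder, and it assumes DuckDB's iceberg and httpfs extensions are installed and S3 credentials are configured.

```sql
-- Minimal sketch: DuckDB as one engine among many reading the same Iceberg table.
-- The S3 path is a placeholder; assumes the iceberg and httpfs extensions plus S3 credentials.
INSTALL iceberg; LOAD iceberg;
INSTALL httpfs;  LOAD httpfs;

SELECT count(*)
FROM iceberg_scan('s3://my-bucket/warehouse/trips');
```

Spark or a cloud warehouse could read and write those same files; only the compute you bring changes.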
Your data remains in an open, accessible format, and you avoid getting locked into any single compute vendor. But there's a catch.
The Hidden Hurdle: Understanding the Apache Iceberg Catalog
Adopting Iceberg isn't just about writing Parquet files with a specific structure. It's about managing the state of your tables—what data is in the table, what the schema looks like, and how it has changed over time. This is the job of the catalog.
While powerful, the catalog is also what holds many teams back from adopting Iceberg. According to Julien, the main barriers are:
- Poor User Experience: The APIs and tooling can be complex, especially for developers outside the JVM ecosystem (e.g., Python and Node.js users).
- Table Maintenance: Suddenly, tasks like compaction, cleaning up old snapshots, and optimizing file layouts become your responsibility, not the warehouse's (see the sketch after this list).
- The Catalog Itself: It's another critical piece of infrastructure you have to choose, deploy, and manage. This is often the biggest source of complexity and frustration—what we call "catalog hell."
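To give a feel for what that maintenance work looks like, here is a hedged sketch using Iceberg's built-in maintenance procedures as exposed in Spark SQL; the catalog and table names are placeholders, and the exact arguments depend on your Iceberg version.

```sql
-- Sketch of Iceberg table maintenance via Spark SQL procedures (catalog and table names are placeholders).
-- Compact many small data files into fewer, larger ones:
CALL my_catalog.system.rewrite_data_files(table => 'db.trips');

-- Expire snapshots older than a cutoff so their data files can be cleaned up:
CALL my_catalog.system.expire_snapshots(table => 'db.trips', older_than => TIMESTAMP '2025-01-01 00:00:00');
```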
The Iceberg Catalog Landscape: REST, Serverless & More
The world of Iceberg catalogs can be confusing. Here's a quick breakdown of the main options discussed:
- Managed REST Catalogs: These are dedicated catalog services. The most common are AWS Glue Catalog, Databricks Unity Catalog, and the open-source Project Nessie. They provide a central endpoint to manage table state and handle concurrent writes, but they are yet another service to pay for and manage.
- "Serverless" Catalogs: A new wave of services tightly integrates the catalog with the storage layer. Amazon S3 Tables and Cloudflare R2 Tables are prime examples. As Julien highlights, these are a "great innovation because they bundle the catalog with the storage, simplifying setup and maintenance". You don't manage a separate catalog service; it's part of your storage bucket.
- File-Based Catalogs: At its core, a REST catalog is often just "a fancy service to point to a metadata file," as Julien notes. This complexity is what led to simpler, file-based approaches, which are perfect for local development and getting started.
This last approach is the key to escaping catalog hell and getting your hands dirty with Iceberg and DuckDB.
A Practical, Hands-On Approach with boring-catalog and DuckDB
To demonstrate just how simple an Iceberg setup can be, Julien created an open-source tool called `boring-catalog`. It implements a lightweight, file-based catalog using a single JSON file. It's the perfect way to learn how Iceberg works without needing a cloud account or a complex distributed setup.
Let's walk through it. 🚀
Goal: Go from zero to querying an Iceberg table with DuckDB in 5 minutes.
Step 1: Installation & Setup
First, install `boring-catalog` using pip:

```bash
pip install boringcatalog
```
Next, initialize your catalog. This is similar to running `git init`:

```bash
ice init
```
This simple command does two things:
- Creates a `warehouse/` directory to store your Iceberg table data.
- Creates a `.ice/index` file that points to your catalog file, which is `warehouse/catalog/catalog_boring.json`.
This `catalog_boring.json` file is your catalog. It's just a simple JSON file that will keep track of your tables and point to their latest metadata files. This elegantly demonstrates Julien's point: you don't always need a complex REST service to manage state.
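To make that concrete, here is a schematic of what such a catalog file might contain. The exact keys that boring-catalog writes may differ; the point is simply that the catalog is a mapping from table names to the location of their latest metadata file.

```json
{
  "catalog_name": "boring",
  "namespaces": { "ice_default": {} },
  "tables": {
    "ice_default.trips": {
      "metadata_location": "warehouse/ice_default.db/trips/metadata/00001-....metadata.json"
    }
  }
}
```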
Step 2: Committing Data to an Iceberg Table
Now, let's get some sample data and commit it to a new Iceberg table.
```bash
# Get some sample data (NYC taxi trips)
curl -L -o yellow_tripdata.parquet https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet

# Commit the Parquet file to a new Iceberg table called 'trips'
ice commit trips --source yellow_tripdata.parquet
```
That's it! You've just created a new Iceberg table and committed your first snapshot. The workflow is intentionally `git`-like. You can even view the history of your table:

```bash
ice log trips
```
You'll see output like this, showing the complete history of operations, which enables powerful features like time-travel queries.
```text
commit 5917812165563990664
Table:     ice_default.trips
Date:      2025-07-09 19:55:00 UTC
Operation: append
Summary:
  added-data-files : 1
  total-data-files : 1
  added-records    : 20000
  total-records    : 20000
```
Step 3: Querying with DuckDB
Now for the fun part. How do you query this table? `boring-catalog` comes with a handy command to fire up a DuckDB shell that's pre-configured to read your Iceberg catalog:

```bash
ice duck
```
This drops you right into a DuckDB CLI. You can now query your Iceberg table directly with SQL!
```sql
-- The 'ice duck' command automatically creates a view for your table
USE ice_default;
SELECT passenger_count, count(*)
FROM trips
GROUP BY 1
ORDER BY 2 DESC;
```

```text
+-----------------+--------------+
| passenger_count | count_star() |
|     double      |    int64     |
+-----------------+--------------+
|             1.0 |        14545 |
|             2.0 |         2997 |
|             3.0 |          883 |
|             0.0 |          585 |
|             4.0 |          424 |
|             5.0 |          335 |
|             6.0 |          221 |
|            NULL |            7 |
|             7.0 |            2 |
|             9.0 |            1 |
+-----------------+--------------+
```
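The snapshot history you saw with `ice log` is also what enables time travel. Here is a hedged sketch from the same DuckDB session, assuming the iceberg extension is loaded; the table path is a placeholder, and the snapshot-selection parameter name can vary between extension versions.

```sql
-- List the snapshots recorded for the table (path is a placeholder).
SELECT * FROM iceberg_snapshots('warehouse/ice_default.db/trips');

-- Read the table as of a specific snapshot id (parameter name may differ by extension version).
SELECT count(*)
FROM iceberg_scan('warehouse/ice_default.db/trips', snapshot_from_id => 5917812165563990664);
```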
You've successfully built a local, multi-engine data lakehouse. You used `boring-catalog` to manage the table format (Iceberg) and DuckDB as your query engine.
The Bigger Picture: Iceberg vs. DuckLake
This hands-on example helps clarify the philosophical differences between Iceberg and DuckLake.
The conversation between Mehdi and Julien shed light on this key distinction:
- Iceberg's Catalog: As we saw with `boring-catalog`, the catalog is a lightweight pointer to metadata files. Its primary job is to provide a central place for atomic commits, ensuring that concurrent writers don't corrupt the table. The metadata about the files (like Parquet file statistics) lives in separate `metadata.json` files on disk.
- DuckLake's Catalog: In the DuckLake approach, the catalog isn't just a pointer; it contains the actual metadata itself, typically within a SQL database. This removes the need for separate metadata files on disk and gives the catalog more responsibility, which can simplify the overall architecture and user experience (a minimal sketch follows this list).
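For contrast, here is a minimal DuckLake sketch, assuming DuckDB's ducklake extension; the file and table names are illustrative:

```sql
-- Minimal DuckLake sketch: the catalog is itself a database, not a pointer to metadata files.
-- File and table names are illustrative; assumes the ducklake extension is available.
INSTALL ducklake; LOAD ducklake;
ATTACH 'ducklake:my_catalog.ducklake' AS lake (DATA_PATH 'lake_data/');
CREATE TABLE lake.trips AS SELECT * FROM 'yellow_tripdata.parquet';
-- Table, schema, and file statistics now live as rows inside my_catalog.ducklake
-- rather than in separate metadata.json files on disk.
```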
As Julien perfectly summarized, the ideal future would be a marriage of these two worlds: "Iceberg's broad engine interoperability combined with DuckLake's simple, elegant user experience". That's the dream many data engineers share today.
Conclusion: Your Next Steps
The catalog is the central nervous system of an open data lakehouse. While historically a source of complexity, a new wave of tools and managed services is making the power of Iceberg more accessible than ever. For the modern data professional, understanding how catalogs work—and how to choose the right one for the job—is a crucial skill.
- Try it yourself: The best way to learn is by doing. We highly encourage you to try out Julien's `boring-catalog` on your own machine.
- Go deeper: To learn more from Julien, check out his Boring Data newsletter and data stack templates.
- Explore the DuckDB approach: Want to dive deeper into how DuckDB and MotherDuck are innovating to solve the catalog problem? Get started with MotherDuck and DuckLake today.
Happy building! 🦆