Escaping Catalog Hell: A Guide to Iceberg, DuckDB & the Data Lakehouse
2025/06/12

Building a modern data stack often feels like a choice between two extremes. You can go "full-SaaS" with a platform like Snowflake or Databricks, which gets you moving fast but risks vendor lock-in and spiraling costs. Or you can build it all yourself with open-source tools, giving you ultimate flexibility but often requiring months of complex infrastructure work before you can deliver a single insight.
This is what Julien Hurault calls the "cold start problem". "There is no middle ground," he notes in a recent conversation with MotherDuck's Mehdi Ouazza. Every data team, from startups to large enterprises, faces this tension.
So, how do you find that middle ground? Open table formats like Apache Iceberg are the map, promising a future where data is decoupled from compute. But the catalog—the system that tracks the state of your tables—is the tricky terrain you must navigate.
In this article, we'll explore this terrain through the expert eyes of Julien, who has guided many companies on this journey. We'll break down the promise and the pain of the modern data stack, demystify the catalog, and walk through a hands-on tutorial to get you started with Iceberg and DuckDB in minutes, no cloud account required.
The Promise and the Pain of the Modern Data Stack
The dream of the modern data stack is flexibility. You want to use the best tool for the job without being locked into a single vendor's ecosystem. This is where open table formats like Apache Iceberg, Delta Lake, and Hudi come in. They allow you to store your data in a vendor-neutral format in your own object storage (like AWS S3 or Google Cloud Storage).
The Multi-Engine Lakehouse Vision
Once your data is in an open format, you can bring different query engines to it. This "multi-engine" approach is the future of data architecture.
As Julien puts it, "People are just going to start by dumping their data in Iceberg... and then just plug a warehouse on top of it". This turns the traditional data warehouse on its head. Instead of being the single source of truth for storage and compute, it becomes just one of many specialized tools you can use.
🎙️ Julien's Insight: Think of a powerful data warehouse like a "serverless function". You can spin it up to perform a compute-intensive task on your Iceberg data and then write the results back to the lakehouse. Nothing is permanently stored or locked inside the warehouse.
This model gives you incredible power:
- Use DuckDB 🦆 for fast, local analytical queries and development.
- Use Spark ✨ for large-scale ETL and batch processing.
- Use Snowflake or BigQuery ❄️ for massive, ad-hoc interactive queries when you need the horsepower.
Your data remains in an open, accessible format, and you avoid getting locked into any single compute vendor. But there's a catch.
The Hidden Hurdle: Understanding the Apache Iceberg Catalog
Adopting Iceberg isn't just about writing Parquet files with a specific structure. It's about managing the state of your tables—what data is in the table, what the schema looks like, and how it has changed over time. This is the job of the catalog.
While powerful, the catalog is also what holds many teams back from adopting Iceberg. According to Julien, the main barriers are:
- Poor User Experience: The APIs and tooling can be complex, especially for developers outside the JVM ecosystem (e.g., Python and Node.js users).
- Table Maintenance: Suddenly, tasks like compaction, cleaning up old snapshots, and optimizing file layouts become your responsibility, not the warehouse's.
- The Catalog Itself: It's another critical piece of infrastructure you have to choose, deploy, and manage. This is often the biggest source of complexity and frustration—what we call "catalog hell."
The Iceberg Catalog Landscape: REST, Serverless & More
The world of Iceberg catalogs can be confusing. Here's a quick breakdown of the main options discussed:
- Managed REST Catalogs: These are dedicated catalog services. The most common are AWS Glue Catalog, Databricks Unity Catalog, and the open-source Project Nessie. They provide a central endpoint to manage table state and handle concurrent writes, but they are yet another service to pay for and manage.
- "Serverless" Catalogs: A new wave of services tightly integrates the catalog with the storage layer. Amazon S3 Tables and Cloudflare R2 Tables are prime examples. As Julien highlights, these are a "great innovation because they bundle the catalog with the storage, simplifying setup and maintenance". You don't manage a separate catalog service; it's part of your storage bucket.
- File-Based Catalogs: At its core, a REST catalog is often just "a fancy service to point to a metadata file," as Julien notes. This complexity is what led to simpler, file-based approaches, which are perfect for local development and getting started.
This last approach is the key to escaping catalog hell and getting your hands dirty with Iceberg and DuckDB.
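To make the catalog's role concrete, here is a minimal, purely illustrative Python sketch of the one job every Iceberg catalog has: map each table name to its current metadata file, and swap that pointer atomically so a stale writer cannot clobber a newer commit. The class, method names, and paths below are invented for illustration; this is not any real catalog's API.

```python
# Toy "catalog": its whole job is mapping a table name to the location of
# that table's current metadata file, and swapping that pointer atomically.
# Illustrative only -- not a real catalog implementation.

class ToyCatalog:
    def __init__(self):
        self._pointers = {}  # table name -> current metadata file location

    def current_metadata(self, table):
        """Return the current metadata pointer for a table, or None."""
        return self._pointers.get(table)

    def commit(self, table, expected, new_metadata):
        """Compare-and-swap: succeed only if nobody else committed first."""
        if self._pointers.get(table) != expected:
            raise RuntimeError(f"conflict: {table} changed since we read it")
        self._pointers[table] = new_metadata
        return new_metadata


catalog = ToyCatalog()
# First commit: the table did not exist, so the expected pointer is None.
catalog.commit("trips", None, "s3://lake/trips/metadata/v1.metadata.json")
# A writer that read v1 can commit v2; a writer still holding an older
# pointer would get a conflict error instead of corrupting the table.
catalog.commit("trips",
               "s3://lake/trips/metadata/v1.metadata.json",
               "s3://lake/trips/metadata/v2.metadata.json")
```

Everything else a managed REST catalog adds (auth, discovery, maintenance) is layered on top of this tiny compare-and-swap contract.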
A Practical, Hands-On Approach with boring-catalog and DuckDB
To demonstrate just how simple an Iceberg setup can be, Julien created an open-source tool called boring-catalog. It implements a lightweight, file-based catalog using a single JSON file. It's the perfect way to learn how Iceberg works without needing a cloud account or a complex distributed setup.
Let's walk through it. 🚀
Goal: Go from zero to querying an Iceberg table with DuckDB in 5 minutes.
Step 1: Installation & Setup
First, install boring-catalog using pip.
```shell
pip install boringcatalog
```
Next, initialize your catalog. This is similar to running git init.
```shell
ice init
```
This simple command does two things:
- Creates a `warehouse/` directory to store your Iceberg table data.
- Creates a `.ice/index` file that points to your catalog file, which is `warehouse/catalog/catalog_boring.json`.
This catalog_boring.json file is your catalog. It's just a simple JSON file that will keep track of your tables and point to their latest metadata files. This elegantly demonstrates Julien's point: you don't always need a complex REST service to manage state.
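To see how a plain JSON file can safely act as a catalog, here is a hedged Python sketch of the general pattern a file-based catalog can use: rewrite the JSON to a temporary file and swap it into place with an atomic rename, so a reader never observes a half-written catalog. The field names and file name are invented for this example and are not `boring-catalog`'s actual schema.

```python
# Sketch of a file-based catalog update. The "catalog" is one JSON file
# mapping table names to their latest metadata location; writes go through
# an atomic rename. Schema and paths are illustrative, not boring-catalog's.
import json
import os
import tempfile


def read_catalog(path):
    """Load the catalog JSON file."""
    with open(path) as f:
        return json.load(f)


def update_pointer(path, table, metadata_location):
    """Point `table` at a new metadata file, swapping the file atomically."""
    if os.path.exists(path):
        catalog = read_catalog(path)
    else:
        catalog = {"tables": {}}
    catalog["tables"][table] = metadata_location

    # Write to a temp file in the same directory, then atomically replace.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(catalog, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX: no half-written catalog visible


update_pointer("catalog_boring_demo.json", "trips",
               "warehouse/trips/metadata/00001.metadata.json")
```

This is the sense in which a catalog can be "just a fancy pointer": one small file, updated atomically.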
Step 2: Committing Data to an Iceberg Table
Now, let's get some sample data and commit it to a new Iceberg table.
```shell
# Get some sample data (NYC taxi trips)
curl -L -o yellow_tripdata.parquet https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet

# Commit the Parquet file to a new Iceberg table called 'trips'
ice commit trips --source yellow_tripdata.parquet
```
That's it! You've just created a new Iceberg table and committed your first snapshot. The workflow is intentionally git-like. You can even view the history of your table.
```shell
ice log trips
```
You'll see output like this, showing the complete history of operations, which enables powerful features like time-travel queries.
```
commit 5917812165563990664
Table: ice_default.trips
Date: 2025-07-09 19:55:00 UTC
Operation: append
Summary:
    added-data-files : 1
    total-data-files : 1
    added-records    : 20000
    total-records    : 20000
```
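This snapshot history is exactly what enables time travel: each commit appends a snapshot, and querying "as of" some timestamp just means picking the newest snapshot at or before it. Here is a toy Python sketch of that idea; the snapshot fields mirror the log output above, but the function is illustrative and not part of any real Iceberg API.

```python
# Toy model of a table's snapshot log. Time travel = "find the newest
# snapshot whose timestamp is <= the requested time". Illustrative only.
from datetime import datetime, timezone

snapshots = [
    {
        "id": 5917812165563990664,
        "ts": datetime(2025, 7, 9, 19, 55, tzinfo=timezone.utc),
        "operation": "append",
        "total_records": 20000,
    },
]


def snapshot_as_of(log, ts):
    """Return the newest snapshot at or before `ts`, or None if none exists."""
    eligible = [s for s in log if s["ts"] <= ts]
    return max(eligible, key=lambda s: s["ts"]) if eligible else None


# Reading the table "now" resolves to the latest snapshot...
current = snapshot_as_of(
    snapshots, datetime(2025, 7, 10, tzinfo=timezone.utc)
)
# ...while a timestamp before the first commit resolves to no table at all.
before_creation = snapshot_as_of(
    snapshots, datetime(2020, 1, 1, tzinfo=timezone.utc)
)
```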
Step 3: Querying with DuckDB
Now for the fun part. How do you query this table? boring-catalog comes with a handy command to fire up a DuckDB shell that's pre-configured to read your Iceberg catalog.
```shell
ice duck
```
This drops you right into a DuckDB CLI. You can now query your Iceberg table directly with SQL!
```sql
-- The 'ice duck' command automatically creates a view for your table
USE ice_default;
SELECT passenger_count, count(*)
FROM trips
GROUP BY 1
ORDER BY 2 DESC;
```

```
+------------------+---------------+
| passenger_count  | count_star()  |
|      double      |     int64     |
+------------------+---------------+
|       1.0        |     14545     |
|       2.0        |      2997     |
|       3.0        |       883     |
|       0.0        |       585     |
|       4.0        |       424     |
|       5.0        |       335     |
|       6.0        |       221     |
|       NULL       |         7     |
|       7.0        |         2     |
|       9.0        |         1     |
+------------------+---------------+
```
You've successfully built a local, multi-engine data lakehouse. You used boring-catalog to manage the table format (Iceberg) and DuckDB as your query engine.
The Bigger Picture: Iceberg vs. DuckLake
This hands-on example helps clarify the philosophical differences between Iceberg and DuckLake.
The conversation between Mehdi and Julien shed light on this key distinction:
- Iceberg's Catalog: As we saw with `boring-catalog`, the catalog is a lightweight pointer to metadata files. Its primary job is to provide a central place for atomic commits, ensuring that concurrent writers don't corrupt the table. The metadata about the files (like Parquet file statistics) lives in separate `metadata.json` files on disk.
- DuckLake's Catalog: In the DuckLake approach, the catalog isn't just a pointer; it contains the actual metadata itself, typically within a SQL database. This removes the need for separate metadata files on disk and gives the catalog more responsibility, which can simplify the overall architecture and user experience.
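To make the contrast concrete, here is a toy Python sketch of the DuckLake-style idea: table metadata lives as rows in a SQL database (sqlite here, purely for illustration), so a "commit" is just a database transaction, with no separate metadata files on disk. The schema below is invented for this example and is not DuckLake's actual catalog schema.

```python
# Toy "catalog-as-database": file-level metadata is stored as rows in SQL,
# and committing new data files is an ordinary transaction. The schema is
# invented for illustration -- not DuckLake's real metadata model.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE table_files (
        table_name TEXT,
        file_path  TEXT,
        row_count  INTEGER
    )
""")

# "Committing" a data file is a plain SQL transaction: either the row
# appears or it doesn't. No metadata.json pointer files to juggle.
with conn:
    conn.execute(
        "INSERT INTO table_files VALUES (?, ?, ?)",
        ("trips", "warehouse/trips/part-0.parquet", 20000),
    )

# An engine can answer planning questions straight from the catalog.
total = conn.execute(
    "SELECT SUM(row_count) FROM table_files WHERE table_name = ?",
    ("trips",),
).fetchone()[0]
```

In the file-based model, this same information would be scattered across metadata files that the catalog merely points to; here the database holds it directly.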
As Julien perfectly summarized, the ideal future would be a marriage of these two worlds: "Iceberg's broad engine interoperability combined with DuckLake's simple, elegant user experience". That's the dream many data engineers share today.
Conclusion: Your Next Steps
The catalog is the central nervous system of an open data lakehouse. While historically a source of complexity, a new wave of tools and managed services is making the power of Iceberg more accessible than ever. For the modern data professional, understanding how catalogs work—and how to choose the right one for the job—is a crucial skill.
- Try it yourself: The best way to learn is by doing. We highly encourage you to try out Julien's `boring-catalog` on your own machine.
- Go deeper: To learn more from Julien, check out his Boring Data newsletter and data stack templates.
- Explore the DuckDB approach: Want to dive deeper into how DuckDB and MotherDuck are innovating to solve the catalog problem? Get started with MotherDuck and DuckLake today.
Happy building! 🦆
Transcript
0:00 Anyway, I'm super happy to have Julien with me, who is a freelance
0:07 data engineer with a lot of experience to share, and we're going to talk about Iceberg, catalogs, table formats and all that jazz. But first, Julien, can you introduce yourself a bit? What's your background and your story? Yes, of course. So, first of all, hello everybody. Thanks for having me here
0:29 on the podcast, on the stream. Yeah, so my name is Julien. I'm based in Switzerland, close to Geneva.
0:36 So, Geneva-Lausanne, and I've been working in the data field for 10 years now. And
0:43 I work as a freelance data engineer, as you mentioned before, and I help various customers in Switzerland and in Europe set up their data stack: startups or corporates, big companies, it depends, but that's mostly what I've been doing now for exactly five years. Wow. I also started like a decade ago. So you
1:05 started with on-premise clusters, like Cloudera and so on? Or what's your story?
1:12 Yeah, I did some Spark and everything 10 years ago. My official title when I started was data scientist, like many people, but what I was actually doing was data engineering. So only five years ago I officially turned into a data engineer, but actually for 10 years I've been doing data engineering, just
1:32 with another title. Yeah. I think, like a lot of people, my title at first was BI developer, and then I needed to understand how distributed systems and other things work, and then, you know, with the job requirements your job title can switch, and also just depending on what you actually
1:57 like to do, right? I think that's also why Joe Reis defines himself as a recovering
2:08 data scientist, right? But even as a data scientist, if you like to do modeling 80% of the time, what you actually need to do is basically data cleaning and moving data from A to B, so basically data engineering. So in the end, everything converges to data engineering, right? Yes, yes. And even
2:30 AI, but we're going to try to not talk about that today. Let's try. Can you,
2:37 so, you've also built boringdata.io.
2:45 Can you introduce what it's
2:50 all about, and why did you actually build it? Yeah, of course. I do a lot of different projects for different companies, and while I was doing that I was wondering: hey, I'm always starting from scratch. For every project I have to write every single line of code, every time I start a new
3:09 data stack project. So I thought that maybe I could condense all the learning I've built up over these last 10 years into one single template that people can clone from git and then get started with super easily. Basically, what I've noticed is that when a company starts a data stack project there is this cold start problem:
3:32 you want to go fast, so you go full SaaS, but then you're locked in, because once the volume of data grows, your costs of course grow too. The other way to go is to build everything yourself, but then it takes so much time. I mean, you have to count at least
3:51 three months of engineering work just to get started, and you haven't delivered any insight, any dashboard, nothing. So there is no middle ground between doing everything yourself and going full SaaS, and this is the problem I tried to solve. Can you give names, like,
4:10 when you say SaaS, that's basically someone starting their data stack and going on Databricks or Snowflake, and putting everything there, instead of building their own infrastructure, which means, I guess, managing their own compute on AWS or Google Cloud. What do you mean by that middle ground? So there are
4:35 different levels. The top level is the platforms that bundle everything into one product, so these are basically built on top of Snowflake or Databricks, and they bundle an orchestrator, ingestion, transformation and a warehouse, basically. That's the first level. Then there is a level below, where you put everything in
4:54 Snowflake but then you use managed ingestion, for example, or managed dbt like dbt Cloud, and then it's just SaaS that you are piling up, and your bill grows as well. Most of them bill based on the volume of data you're working with, and that's also why your bill just grows over
5:13 time and you're kind of locked in. For example, let's take dbt: if you want to manage dbt yourself, you need to run it yourself, which means you need to build some infra, and building this infra is doable, but it takes time; you need to build the CI and everything. I tried to package that
5:31 in one repo, in one template that you can get, in order to solve this cold start problem. So this is interesting: it's really about striking a balance between what you can offload to SaaS and what you can self-host, basically.
5:54 Yeah, exactly. I think the business model of these SaaS tools is a bit tricky, because they charge you a variable cost, but setting up a stack is a one-shot cost. So basically you are responding to a one-shot problem with a variable-cost solution, and I have the feeling those two worlds do
6:12 not match. Hence the one-shot pricing model of Boring Data as well: you just buy once and the code is yours. You own the code, you self-host it, and you're free to customize it and make it evolve depending on how your project grows, or not. Yeah. So, I think,
6:32 let's open this up quickly, just to get some visuals to
6:39 chat over. So if,
6:49 let me put you there,
6:56 bear with me. So, for example, what's the common thing that you see people wanting to self-host and do themselves?
7:10 The first one is probably ingestion: self-hosting an ELT tool like dlt, for example,
7:19 how to run it yourself. And here in the templates I basically have pre-written Lambdas that can run dlt
7:30 for you; basically everything is ready, you just need to customize the connector and everything is ready to go. Then there is dbt, of course; dbt is included here. Inside the template there is a dbt project.
7:43 Yeah. And the CI running it. So that's already built in. And then there is, of
7:51 course, Iceberg. I have two templates today. One is Snowflake-
7:57 centered and the other one is Iceberg-centered, on AWS. I don't know if you can maybe scroll down a bit. Yeah, let me actually put ourselves
8:10 like this, so that we zoom out a bit on the screen, and I'm going to zoom out here as well.
8:19 Yes, exactly. So here's a small overview of what you get when you purchase a template. First, there are two kinds of templates, as I mentioned: one is warehouse-centered, on Snowflake, and the other one is Iceberg-centered. The one you're showing is the Iceberg one, and you see that everything is pre-set up to work with
8:41 Glue, which is a managed service on AWS. So when you deploy the template you already have your Glue table deployed, you have, as mentioned before, dbt running for you on top of Athena, and you have dlt running for you on Lambda. So basically almost only open source tools, running for you inside AWS. How do
9:05 you, so, the thing I'd like to challenge a bit here is that I think it depends on the size of the company too, right? Because those tools are easy to bootstrap, but at scale they might require a bit more maintenance. How do you see someone evolving within
9:25 the stack? Do you think at some point they should offload some parts to SaaS, or what's your thought?
9:32 Let's say your data team is not really scaling, but your business needs and so on are scaling, because that's often the case; hiring data engineers is really hard, right? So,
9:45 I mean, it's hard to have a general rule, but I think if you start to build skills internally, around Terraform, around these open source tools, that's also a way to maybe scale better, because if you can handle it, if you know how it works,
10:03 then I think it's a good way to scale your platform, both in terms of internal skills and maintenance, and in terms of cost as well. I don't know if that answers it. Yeah. Yeah. Yeah.
10:17 No, it's definitely an opinion. I think I would find a middle ground there, where sometimes you have a data team which is one person, a data analyst, and they don't have time to invest in that. I think it's about finding the right balance on what
10:39 you can offload to a SaaS, because hiring and training people is also as expensive as your cloud bill in the end, especially for data engineers. But that's great. How do you see Iceberg fitting into this picture? Because you mentioned you also have a template with
11:02 Snowflake. So what do you think Iceberg enables when kickstarting your data stack, versus a standard data warehouse?
11:16 I mean, that is probably independent of this template thing, but more generally I think today there are two ways to get started with a data stack: either you go full warehouse, or you jump on the open table format wagon.
11:31 It's more of a strategic decision that you need to take. I have the two templates because I want to support both, and I think they're going to kind of merge together, meaning if you have your data
11:48 in Iceberg, you can still use Snowflake on top of it, but they are not going to live next to each other. And I think, probably in the coming months or years, people are just going to start by dumping their data in Iceberg, or in an open table format, Iceberg or Delta, whatever, and then just plug a warehouse on top of it, depending on what their management wants to use, and also on the actual
12:11 current market, right? Because if Snowflake is good today, we don't know if Snowflake will be good tomorrow, right? Snowflake in 10 years will probably be the legacy warehouse of today. So, what do you think? Have you seen any performance challenges? Because the
12:31 problem is that, I mean, unless you use Spark... but I'm curious if you've seen stuff with DuckDB. I'm also open regarding performance, but the bold point is that all cloud data warehouses, including DuckDB, have their own data format, right, which is always going to be much more efficient at scale. So do you see
12:54 challenges there? How do you see those internal formats working together with those table formats like Iceberg? Let's say, you know, I have a Snowflake instance or I have DuckDB: should I store my data in Iceberg, or what's the pro of storing my data in the internal format
13:16 of the relevant engine? I think the best way to hedge yourself is probably to store things in Iceberg, because once you are in Iceberg you can still move to Snowflake, run computation there if you want to run compute-intensive stuff, and then write back to Iceberg. I wrote an
13:16of the of the relevant app engine. I think uh the best way the best way to edge yourself is probably to store things in iceberg because once you are in iceberg then you can still move to snowflake run computation there if you want to run comput intensive stuff and then write back to snowflake to iceberg sorry to iceberg I wrote I wrote an
13:38article about this last year about using snowflake or like cloud warehouses as a serverless function more or less basically you have the data store in iceberg use you you you process them in snowflake and then red back right back nothing is is persisted inside snowflake and I think if you if you if you go full iceberg or delta whatever then you still
13:59have kind of an escape strategy and you can move still to the warehouse world uh in case you need and I have actually I know a company here in Switzerland they have stored everything in delta uh in a local cloud and then they are using spark and asure in order to compute the data when they need when they need to do
14:20like large back back fields or when they have comput intensive task but then they always write back to that lake and that's why they're kind of flexible as well. Yeah. But I think I think if you have spark uh because spark is compatibility with uh iceberg and delta today is really good. um it it kind of makes sense but um if it's spark is not
14:44your default on giant like it's just like there is a couple of many reason why you wouldn't use that then um then it's a bit hard to kind of trying to find this workflow but I like your answer in a way that u having I think in the future having your things as a table format first and then loading data only
15:06only that you need for compute intensive work to the relevant warehouse. Uh that makes that makes a lot of sense. Um
15:16 Multi-engine, I mean, we've been working in multi-engine setups for years. Before, what I used to do was dump data in S3, pre-process it inside AWS with ECS tasks or Lambda, whatever, and then copy the data to Snowflake. So multi-engine has existed for years; it's just that with Iceberg it
15:37 makes everything easier, I think, to move data around and just pick the right engine that fits your usage, right? Yeah. So, speaking of which, you also have a newsletter, which I'm putting up there for people watching, where you've been talking about everything data engineering, and recently about
16:04 DuckLake and so on. But what do you think is missing today? Based on what we just talked about, with table formats and this world where people write first into Iceberg and then move to whatever multi-engine setup they want, what do you think is missing today for wider adoption?
16:28 Yeah, I think, so for example, Iceberg comes from big tech, which means it's adapted to big tech tools as well, and the API with the best support is basically the one for Spark. So if you use Spark you have super support, but for Python or other non-JVM
16:51 tools, it's nascent, it's still being built, so there are some things missing which make the experience a bit complicated, I would say. So the first thing is that the user experience is not that great for non-JVM users. Then you have this whole maintenance story that you need to think about, which
17:11 you don't have if you go to Snowflake and the like. And yeah, I think those are the two main problems we have right now, and of course the third one is the catalog, because the
17:24 standard way to use Iceberg is with a hosted catalog, so you either purchase it from a SaaS provider or self-host it, which is additional complexity.
17:36 Yeah. So, what's
17:40 the landscape today? Because I'm a bit behind. Can you share the different catalogs today? Do they support all the table formats? Can you give us a bit of an overview of what you know? I haven't had a deeper look these last weeks, to be honest. I wrote about it a couple of months
18:01 ago, but last time I checked, you have of course Polaris from Snowflake, which is one of the leading ones. You have AWS Glue, which is interesting as well, and you have Lakekeeper, which is the open source alternative, but one that you need to self-host. And you have, of
18:22 course, the one from Databricks, Unity Catalog, exactly. Those are the four leaders, I would say. But then there is a new wave of catalogs integrated inside buckets, like, for
18:39 example, S3: you have S3 Tables, basically a hosted REST catalog inside your bucket directly, so you don't need to do anything, and by exposing this REST endpoint you can then get access to the data from other engines, Snowflake or DuckDB or whatever. And how would this work? Do you have to
19:01 host the REST service that provides the REST API towards S3? You
19:09 basically create an S3 table, a nice S3 table, and then out of the box you get it, from the S3 Tables feature.
19:18 That's what you mean. Yeah, exactly. And Cloudflare has the same. OK, so basically you have a managed Iceberg table that you can use out of the box. Well, I think that's a great innovation, because it simplifies a lot of things when you want to move to Iceberg. Is that what they call the R2 managed
19:41 data catalog, or is it something else again? Yeah, it is. It is. OK. So, this thing exactly. And what's magical with Cloudflare is that they have zero egress cost, which means consuming your Iceberg data is basically free.
19:56 Yeah. And you can check my newsletter; I wrote about this last year, about zero-egress data distribution. Yeah.
20:04 Yeah, I remember that one. I mean, it's super cool. But, based on our initial discussion, this is actually a good example of where you can balance what you offload to a SaaS against what you own, because here you're going to manage how you structure your lake and how you write your data to the bucket, but you
20:28 offload the catalog, basically. Is that correct? Exactly. And the maintenance: I think these providers offer table maintenance out of the box as well. AWS, for example, maintains your tables for you automatically.
20:46 Yeah. What do you think of DuckLake, which, just for context, for people that have been living in a cave, is a new table format
20:59 released by DuckDB Labs, basically to compete with the other table formats we just talked about, like Iceberg or Delta Lake. What's your two cents on that? I think they did it mostly because they don't want to host a catalog. I think that was the trigger. The catalog
21:24 concept didn't make sense for them. That's why they built kind of their own, and then they built their own metadata format and everything, basically their own open table format. I think it definitely makes sense. I mean, I had the same observation; that's why I built boring-catalog as well.
21:41 Maybe we'll see that after. But I think the success of this will only depend on the integrations that get built. Open table formats are all about the integrations. I like Iceberg because there are integrations with Snowflake, with DuckDB, with Polars, with
22:01 everything. So basically I know that if I dump my data into Iceberg, I will be able to use it anywhere afterwards, and I think the question mark around DuckLake is: will it support enough integrations to be meaningful? And I have some news, because I've seen some threads around
22:25 Spark through JDBC, so this is going on, there are a couple of things being solved, and Trino support is already in a branch, not released yet, right, but those things are being actively worked on. I think you made a great point: I don't think any table format can take off if there is no adoption, right? And so
22:52 the DuckDB team probably knows that, but I think there is also a will, exactly as you said, to have different engines, right? And so for people saying, OK, I want to use DuckLake but I don't want to use only DuckDB,
23:15 there is probably an incentive for individuals in companies to contribute back to make these integrations work. And
23:27 what advice would you recommend for someone starting a table format lakehouse approach?
23:37 I mean, in order to respond to this question, I built the most boring catalog in the world: boring-catalog. Yeah, we're going to try that. But aside from using your tools, what kind of recommendation would you give to people on how to get started? I
23:59 mean, I think the best way to get started, for now, is to use the managed services. I think that's also why people are kind of frustrated by Iceberg: the best way to use it is probably through the managed services. For example, in AWS you have the Glue catalog, which is really well done to get started with.
24:19 Yeah. And it manages a lot of things for you. So it's probably a safe way to start, especially regarding the maintenance and everything of your tables.
24:29 So I think that's probably the advice I would give: start with the
24:38 managed services at the beginning, maybe, to get a feel for what's going on. And at the same time, if you go to Iceberg, it means you have additional responsibilities in your team, which means you probably need to build the associated knowledge and understand what's going on and
24:57 how it works and everything. It's an additional effort, but I think it's worth it, because by getting to understand how open table formats work, you really understand better how a cloud warehouse works too. Yeah. Because the structure is the same: basically you have files, you have metadata, you have an engine on top of it, and how they
25:21 interact with each other, what is stored inside the metadata, why you have metadata at all. I think it can only be a good investment to learn how it works and why we need it.
25:32 That's an interesting take: if you go the lakehouse and open table format way, you're going to have to level up your technical skills. It's not like a classic data analyst building, you know, maybe a dbt model. It upskills you, but there is a technical knowledge gap to
25:57 fill there, right? For now, I think. In the coming weeks, months, and years, that's going to be hidden. How do you think it's going to be hidden, inside the tools? I think it will just be packaged, and we won't have to interact with Iceberg except if we want to, if we
26:17 need to. How do you see — I'm just curious — how do you see this abstraction playing out? Can you give an example of a workflow: okay, today you need to be able to do this and this, and tomorrow this is going to go away?
26:36 I think the managed Iceberg tables from Cloudflare and AWS are a good idea, because you just create your bucket inside your cloud account and you have everything managed for you. It is automatically maintained; you just worry about writing your data and querying it. The technicalities of Iceberg are
27:00 hidden, but if you want, you can still access the data and go a level down. And I think that's what's important: for some edge cases you want to keep a bit of control, but at the same time you want it managed for the standard cases. So I think this way of doing managed Iceberg tables, with
27:20 the possibility to fine-tune things, is quite a good path to follow. So, going back to the advice:
27:29 maybe the easiest way is to just use S3 Tables or the Cloudflare equivalent.
27:37 And then — just for my understanding — S3 managed tables are not the same as AWS Glue, because with Glue you could still create and manage the catalog yourself, right, but you need to do the updates and the maintenance on the catalog yourself. Is that correct? So either you go S3 with Glue on top, and then you manage
28:02 all the files inside your bucket yourself — yes, and the Glue catalog offers some maintenance tasks you can run as well — or you go for S3 Tables, where everything is done for you. Okay. So that's a completely different product and interface. And that's why Iceberg gets hated: it's super
28:23 complex. There's a lot of different terminology; in Snowflake they have different names, and in GCP they have, once again, different names for this. So, no, that's great. I'm getting my knowledge up to date, because if it's confusing for me, it's probably confusing for you.
28:44 So yeah, long story short: you have fully managed Iceberg with S3 Tables, and
28:53 then a layer down, you pick your own catalog and your bucket and manage the two together yourself — a managed catalog like Glue, plus your bucket. Yeah, I think that's a good summary.
29:15 So, I'm curious, because we have a lot of people using DuckDB in the audience. We talked about adoption of the engine, but do you think offering a managed service like S3 Tables that is DuckLake-compatible is also key for
29:37 adoption? Because we've talked about adoption on the query engine side, but not really on the catalog side. I think people don't want to manage a catalog; they just want it out of the box. I just want to write to Iceberg and have an interface to create the data. I don't want to manage
29:57 anything. And that's actually what makes DuckLake interesting: you just provide a SQL database, and the metadata and catalog are simply stored there. You don't need to do anything, and the data is available to query. So yeah, I think that's the key — people don't want to host a catalog either, because the catalog
30:19 becomes the critical point. Yeah, exactly. And that's an additional responsibility for the team, and potentially maintenance tasks you'll need to run if the catalog goes down. So yeah, that makes sense.
30:35 So, we're already at the halfway point. I want to take a few minutes to get
30:43 hands-on by trying out
30:47 your open-source project, the Boring Catalog. I'm just going to share the
30:58 link with the folks here, and I want to go
31:09 through the README, see if it works or if it doesn't, and understand your reasoning behind it. We talked about managed catalogs like S3 Tables that are serverless, and ones where you need to do the work, like AWS Glue. How does the Boring
31:31 Catalog fit into this picture? So basically, as a user — and many people told me this — you don't want to worry about the catalog. You don't want to set it up; it's kind of complex. You just want to get started super easily, and most importantly, the goal is to better understand Iceberg: what's going on, what is a snapshot, what is this metadata,
31:54 and so on. That's why I wanted to build this Boring Catalog: to help people get started with Iceberg, understand how it works, and get going super easily. And I think that's also key for the adoption of Iceberg, because the tech is used in production in many
32:16 places now, but what's missing is wider adoption — lowering the technical barrier for others.
32:27 Is that correct? Yeah, exactly.
32:33 So, let's go for it. I'm going to share my entire screen here. You should see I'm just in a Cursor
32:44 environment. Let me zoom in. And let me also pull up the
32:52 Boring Data GitHub. So, how does that work? I need to install a Python package: uv pip install boringcatalog. I think I'm running in the container, so I'm going to do just the pip install. Oh, actually, I think it was already cached on this one — that's why it went so fast. Time saved.
33:20 Okay. So, it is installed.
33:24 So, I need to init the catalog. Is
33:29 that what this command does? Yeah, there is basically one CLI called ice, and it's pretty similar to the git workflow. To get started, you just do ice init. Mhm. And what is it
33:43 going to do? Yeah — it's on the right. Similar to the git workflow, you have an index file inside a .ice folder, which basically stores your local config while you work; it lets any other CLI command you run afterwards know everything about your context. Okay. So
34:07 basically it's pointing to this. Exactly. And this is local, but it can be remote, is that correct? Exactly. You can put an S3 URI and it will store the catalog data on S3. Okay.
34:28 And the concept of this catalog is just to use one JSON file. You don't need to host anything; all the catalog information is stored inside this file. If you can open it — okay, it's already open. Yeah. So you see the catalog name, and when you create new namespaces
34:46 they will be listed there, as well as the tables. And if you check the spec, the goal of the catalog is just to point to a specific metadata.json file. Yeah,
35:01 just a pointer. And that's the point: hosting a whole service just to point to a file is kind of overkill. Yes.
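The single-JSON-file idea can be sketched in a few lines: a commit is just swapping the pointer to a newer metadata file. The field names below are illustrative, not the actual boringcatalog schema — check the project README for the real layout.

```python
import json
import os
import tempfile

def commit(catalog_path: str, table: str, new_metadata: str) -> None:
    """A commit is just repointing a table at a newer metadata.json."""
    with open(catalog_path) as f:
        cat = json.load(f)
    cat.setdefault("tables", {})[table] = new_metadata
    with open(catalog_path, "w") as f:
        json.dump(cat, f)

# Create an empty single-file catalog, then "commit" a table to it.
path = os.path.join(tempfile.mkdtemp(), "catalog.json")
with open(path, "w") as f:
    json.dump({"catalog_name": "demo", "tables": {}}, f)

commit(path, "bronze.trips", "metadata/00001.metadata.json")
with open(path) as f:
    print(json.load(f)["tables"]["bronze.trips"])
```

Everything a reader (or another writer) needs is in that one file, which is why it can live on local disk or behind an s3:// URI equally well.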
35:11 Yeah, that's interesting. We haven't really dived into the DuckLake topic, because there are going to be other sessions on it — by the way, if you go to the MotherDuck YouTube channel, I've already done a video on it, and next week we have a webinar with Hannes and
35:30 Jordan talking about DuckLake specifically. Here I wanted the topic to be more about the catalog, but it's interesting to compare the two. As you mentioned, the catalog's responsibility is pretty low in Iceberg and Delta Lake, because it's mostly pointing to metadata files hosted on S3. Yeah. And
35:55 you need it because, in the case of concurrent writers, you need a single source of truth; you need to know who is writing. To handle this concurrency, you need a catalog. But you may think, hey, you're working with a file, that can't work — except S3 now has this
36:15 concurrency management: you can effectively take a lock on a file, and that's what I'm using ETags for on this file. When you write to S3, it checks whether the file has changed in the meantime, and if it has, the write from the second writer is rejected. So it's
36:34 using a built-in feature of AWS S3 to handle the concurrency, right? Okay. And just to circle back to DuckLake: with DuckLake, the catalog lives in any SQL database, but there
36:54 is much more than just a pointer to a file — it's all the metadata. It's not pointing to the metadata file; it actually contains the metadata. So the responsibility of the catalog in a DuckLake
37:13 setup is much more important, because it contains the actual metadata, not just a pointer.
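The ETag-based compare-and-swap Julien describes for the catalog file can be sketched as follows. A real implementation would use S3 conditional writes (the If-Match header); here an in-memory "object store" simulates the behavior, so the class and method names are illustrative.

```python
import uuid

class ObjectStore:
    """Minimal stand-in for an object store with conditional writes."""

    def __init__(self):
        self.blobs = {}  # key -> (etag, body)

    def get(self, key):
        return self.blobs[key]  # returns (etag, body)

    def put_if_match(self, key, body, expected_etag):
        """Write only if the object still has the ETag we read earlier."""
        etag, _ = self.blobs.get(key, (None, None))
        if etag != expected_etag:
            return False  # someone else committed first: reject the write
        self.blobs[key] = (uuid.uuid4().hex, body)
        return True

store = ObjectStore()
store.blobs["catalog.json"] = ("etag-0", "{}")

etag_a, _ = store.get("catalog.json")  # writer A reads the catalog
etag_b, _ = store.get("catalog.json")  # writer B reads the same version

assert store.put_if_match("catalog.json", "A's commit", etag_a)      # A wins
assert not store.put_if_match("catalog.json", "B's commit", etag_b)  # B is rejected
```

Writer B would then re-read the catalog, rebase its commit on A's state, and retry — the same optimistic-concurrency loop Iceberg catalogs implement.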
37:20 So, let's go and add some
37:25 data, I guess. What does this command do? So now you need to
37:32 first get some data locally that can be ingested. If you check the README, there is a curl command you can use. Okay. Because I already ran the init. Yeah.
37:47 Yes, I can see the yellow trip data. Yeah.
37:51 The classic. So this is a Parquet file. A simple Parquet
37:59 file. So it's not an Iceberg table? No, no, it's just the Parquet file that you want to ingest into your Iceberg table. Okay. So now you have it, and then there is a nice ice commit command you can use. So I need to commit, just
38:19 like this. So this will create a new namespace, create a new table, and append your Parquet data to this Iceberg table.
38:31 Okay. And that's it. So now we can see that the catalog has changed: it's pointing to the latest metadata.json file. You should see it if you check your file explorer on the left. Yeah. Maybe — actually, sorry,
38:51 I did something wrong here.
38:55 So, if I go to the warehouse,
39:03 then there is the iceberg db, and the
39:10 data and the metadata. So this is the table that I created — this is an Iceberg table. What are you using behind the scenes to write to Iceberg? So
39:20 it's just an interface on top of PyIceberg. Everything you can do with PyIceberg, you can do with the Boring Catalog, and if you want to interact from a Python script, you just create a PyIceberg catalog instance pointing to this catalog — it's in the README as well — and then you can interact as you would with PyIceberg.
39:42 And, once again similar to git, you can do an ice log command.
39:52 Okay, now you're not in the right directory. Just go up one. Yeah.
40:03 I may have changed
40:07 this one by accident when we did the edit.
40:12 So this is actually one thing: it's a bit dangerous to be able to edit this file manually. Wait, let me try something again.
40:23 Because here it doesn't find any snapshot, right? Exactly. Should I commit again? It's probably me — let me just —
40:39 or maybe you can ditch everything and try again. I don't know what happened.
40:45 Do I need to keep the index? What's the index again? It's just pointing to the catalog, so I guess you can keep it. Okay.
40:54 Anyway, so I'm doing ice init.
41:00 Yes. And then I'm fetching some data — yeah, you should already have it. And then I'm doing a commit, and then I can do a log. Mhm. All right.
41:15 Exactly. So it was me messing with the JSON file manually.
41:22 And here you see you have the complete history. You can try to commit again, and you will see a second entry in your log.
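Each entry in that log corresponds to a snapshot, and the time travel discussed next is just resolving whichever snapshot was current at a given moment. A minimal sketch of the idea — the structures below are illustrative, not the actual Iceberg metadata layout:

```python
# Two commits = two snapshots, each with its own file list and timestamp.
snapshots = [
    {"id": 1, "ts": 100, "files": ["a.parquet"]},
    {"id": 2, "ts": 200, "files": ["a.parquet", "b.parquet"]},
]

def files_as_of(ts: int) -> list:
    """Time travel: return the file list of the latest snapshot at or before ts."""
    live = [s for s in snapshots if s["ts"] <= ts]
    return max(live, key=lambda s: s["ts"])["files"]

print(files_as_of(150))  # only the first commit existed at ts=150
```

Because old snapshots keep referencing the old data files, reading "as of" a past commit needs no extra copies — until a maintenance job expires those snapshots and deletes the files they pin.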
41:32 You see every operation we've done. I think that's what's interesting with Iceberg: you get time travel. If you want to go back to one commit, you can, because everything is snapshotted. So you have time travel out of the box. Yeah. And I saw you have a command, ice duck,
41:52 which basically uses DuckDB to read the Iceberg table. Is that correct? Yeah. And even cooler, you can do ice duck --ui.
42:05 Ice duck. Ah. And it's going to launch the local DuckDB UI. Okay. Wait, I need to —
42:19 it's starting. I may need to forward the port. But yeah. All right. So this is basically how you use DuckDB in the tool: to read all
42:36 the Iceberg tables. Yeah, you can actually try it — you can, for example, run a SELECT over the catalog tables, which will list all your tables. Yeah. And you can also query a table directly if you want. Yeah.
42:51 So, okay.
42:56 So here — yeah, okay. But the
43:03 data itself is basically just here, as a local file.
43:11 So there is no DuckDB file; everything is in memory? There is no DuckDB file somewhere? Yeah, exactly. It's basically a SQL script that runs when you start the DuckDB shell. It sets all of this up, and you have access as well to all the snapshots — and potentially, you don't have it now, but
43:33 to the metadata itself as well. So, similar to DuckLake, but within the DuckDB shell. Yeah. No, that's great. I
43:43 think that's it for the hands-on, and for the things that I wanted to go through. To finish
43:55 up: what's the one thing you would like to see in Iceberg, and one thing you would like to see in DuckLake?
44:08 Probably both of them merging together, something like that. Yeah, I would like to have the user experience of DuckLake with the engine integrations of Iceberg.
44:22 If you could merge these two, we would have a nice setup. I think as a user — I mean, I don't know from other perspectives, but from a user perspective — having one table format that is a standard, that has interoperability and a good experience, that's probably the dream of many engineers today. Yeah, I think
44:45 so. If I rephrase that: the one thing missing in DuckLake is interoperability with other engines, right? And the one thing missing in Iceberg is the user experience, which is not really great. Does that summarize it well?
45:03 Yeah, exactly. That's my wish. Cool. Julien, thanks again for joining us today. As a reminder, you have a newsletter that I just put there in the highlight. You
45:22 also have your Boring Data
45:26 Stack offering, and the open-source Boring Catalog that we just played around with. I think it was super interesting — I actually updated my knowledge on the existing catalogs and where they are. And
45:42 I'm looking forward to the future implementations. I think within the coming six months we're going to see other things happening, hopefully. But yeah, that will be it. And again, for the audience: if you're interested in diving deeper into table formats and DuckLake specifically, we have a webinar coming up. I'll put
46:05 the link in the description. You can go to motherduck.com/events
46:15 if you want to register. That
46:21 would be it. Thank you, Julien. Thank you. Thank you for the invitation.
FAQs
What is an Iceberg catalog and why is it needed for data lakehouses?
An Iceberg catalog tracks the current state of your tables and handles concurrency control for writes. It stores pointers to metadata files and provides the compare-and-swap mechanism needed for ACID guarantees when multiple writers access the same table. Common catalog options include AWS Glue, Snowflake Polaris, Unity Catalog (Databricks), and self-hosted options like LakeKeeper. The catalog must be backed by something with ACID properties, typically a SQL database. For more on the data lakehouse architecture, see our open lakehouse guide.
How does DuckLake differ from Iceberg as a table format?
DuckLake is a new open table format from DuckDB Labs that stores all metadata directly in a SQL database (like PostgreSQL or SQLite), rather than using separate metadata files on object storage as Iceberg does. This eliminates the need for a separate catalog service because the database itself is the catalog. The trade-off is that DuckLake currently has fewer engine integrations than Iceberg, though work is underway on Spark and Trino connectors. Learn more about DuckLake.
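As a rough sketch of that idea — metadata as rows in a transactional SQL database rather than files on object storage — the snippet below uses SQLite. The schema is illustrative only, not DuckLake's actual schema; the point is that one database transaction gives you an atomic commit for free.

```python
import sqlite3

# The database itself is the catalog: snapshots and file lists are rows.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, table_name TEXT);
    CREATE TABLE data_files (snapshot_id INTEGER, path TEXT);
""")

with db:  # one transaction = one atomic commit (ACID from the database)
    db.execute("INSERT INTO snapshots VALUES (1, 'trips')")
    db.executemany("INSERT INTO data_files VALUES (1, ?)",
                   [("data/a.parquet",), ("data/b.parquet",)])

# A reader resolves the current file list of a table with plain SQL.
rows = db.execute(
    "SELECT path FROM data_files WHERE snapshot_id = "
    "(SELECT MAX(snapshot_id) FROM snapshots WHERE table_name = 'trips')"
).fetchall()
print([r[0] for r in rows])
```

Contrast this with Iceberg, where the same state lives in metadata.json and manifest files on object storage and the catalog only holds a pointer to the latest one.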
What are the main challenges with adopting Apache Iceberg today?
According to the discussion in this video, the three main challenges with Iceberg adoption are: (1) the user experience is poor for non-JVM users since Iceberg's best-supported API is for Spark/Java, (2) table maintenance responsibilities like compaction and snapshot expiration fall on your team rather than being handled automatically, and (3) the catalog setup adds complexity. You either self-host a catalog or purchase one from a SaaS provider, and each cloud provider uses different terminology and services for catalog management.
Should you store data in Iceberg or in your data warehouse's native format?
Storing data in Iceberg first gives you maximum flexibility and an escape strategy. You can plug in any compatible query engine (Snowflake, DuckDB, Spark, Trino, etc.) without being locked into one vendor. The recommended pattern discussed is to use cloud warehouses as "serverless compute functions": store your data in Iceberg, process it in whichever engine you need, and always write results back to Iceberg rather than persisting inside a proprietary warehouse format.