DuckDB & Iceberg: The Future of Lightweight Data Management
2024/05/16 · Featuring: Brian Olsen (Tabular). TL;DR: Deep dive into the Apache Iceberg table format: what it is, how it differs from Hive and Delta Lake, and hands-on examples using PyIceberg with DuckDB for reading Iceberg tables.
The Evolution from Hive to Modern Table Formats
Brian Olsen from Tabular joins Mehdi to explain why table formats like Iceberg exist. The journey started with Hadoop and Hive about 10 years ago:
- Hadoop era: Distributed computing enabled large-scale data processing but with terrible user experience (MapReduce, Java)
- Hive: Added SQL-like interface but lacked true ACID properties and was painfully slow ("a good day was running 6 queries")
- The problem: Data warehouses needed atomicity, consistency, isolation, and durability—Hive couldn't deliver
Why Table Formats Emerged
Three table formats emerged to solve these problems:
- Delta Lake (Databricks, ~2017): First to market, initially relied on filesystem for atomicity
- Apache Hudi (Uber): Optimized for streaming ingestion and real-time CDC
- Apache Iceberg (Netflix, 2018): Clean-slate redesign for cloud-native architecture
How Iceberg Works
Iceberg's architecture consists of:
- Catalog: Tracks current table state, handles compare-and-swap for atomicity (minimal metadata)
- Metadata files: JSON files pointing to manifest lists
- Manifest lists: Avro files listing manifest files
- Manifest files: Avro files pointing to actual data files
- Data files: Parquet (or ORC) containing the actual data
This tree structure enables:
- Snapshot isolation: Each commit creates a new snapshot, enabling time travel
- Efficient pruning: Query engines can skip irrelevant files using metadata
- True ACID: Atomicity through catalog compare-and-swap, not filesystem operations
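Once a table exists, you can walk this metadata tree yourself from Python. The snippet below is a minimal sketch, assuming a table created with PyIceberg exactly as in the hands-on section further down (attribute names may vary slightly between PyIceberg versions):
from pyiceberg.catalog import load_catalog

# Load the table from the same SQLite-backed catalog used in the demo below
catalog = load_catalog("default", type="sql", uri="sqlite:///catalog.db", warehouse="file:///local/path")
table = catalog.load_table("default.my_table")

# Catalog -> metadata file -> snapshot -> manifest list
snapshot = table.current_snapshot()
print(table.metadata_location)                        # the metadata .json file the catalog points at
print(snapshot.snapshot_id, snapshot.manifest_list)   # the Avro manifest list for this snapshot

# The scan planner walks manifests down to the Parquet data files
for task in table.scan().plan_files():
    print(task.file.file_path)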
Hands-On: PyIceberg with DuckDB
The demo shows how to:
import duckdb
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Create catalog with SQLite backend (local demo); in production this would
# typically be a REST catalog, Hive Metastore, or AWS Glue
catalog = load_catalog(
    "default",
    type="sql",
    uri="sqlite:///catalog.db",
    warehouse="file:///local/path",
)

# Create namespace and table (the schema comes from a PyArrow table)
arrow_table = pa.table({"city": ["Berlin", "Paris"], "pm25": [11.2, 9.4]})  # toy stand-in for the demo's air-quality data
catalog.create_namespace("default")
table = catalog.create_table("default.my_table", schema=arrow_table.schema)

# Append data
table.append(arrow_table)

# Read with DuckDB: materialize the scan as a DuckDB table
conn = table.scan().to_duckdb(table_name="my_table")
conn.execute("SELECT * FROM my_table").fetchall()
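Because every commit creates a new snapshot, the same table can also be read as of an earlier state. A short, hedged sketch of time travel with PyIceberg (indexing history()[0] assumes at least one earlier snapshot exists; the actual IDs are whatever your environment reports):
# List the snapshot log, then scan the table as of an older snapshot
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)

first_snapshot_id = table.history()[0].snapshot_id
old_data = table.scan(snapshot_id=first_snapshot_id).to_arrow()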
DuckDB + Iceberg: Current State
- Read support: DuckDB can read Iceberg tables via the Iceberg extension
- No predicate pushdown yet: Currently performs full scans (opportunity for contribution)
- Two implementations: PyIceberg library vs. DuckDB's native Iceberg extension (different tradeoffs)
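For the native path, the extension is loaded directly inside DuckDB. A minimal sketch; the table path is a placeholder, so point it at your Iceberg table's root directory (or at a specific metadata .json file if there is no version-hint file):
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")
# At the time of this episode, reads are full scans (no predicate pushdown)
rows = con.execute(
    "SELECT count(*) FROM iceberg_scan('warehouse/default.db/my_table')"
).fetchone()
print(rows)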
Why Iceberg Is Winning
Iceberg has gained adoption because:
- Neutral Apache governance (not tied to one vendor)
- Support from Snowflake, Databricks, AWS, and major query engines
- Clean break from Hive's legacy assumptions
- Network effects: more integrations create more value
Transcript
5:01 Welcome everybody to another episode of Quack & Code, where we quack and we code. Today we're going to talk about table formats. I know from a couple of meetups that "table format" still sounds a bit weird: a lot of people kind of understand what a file format is, but what are table
5:24 formats? And when you hear names like
5:29 Hudi, Delta Lake or Iceberg, which are going to be the topic of today, they start to ring a bell, but what exactly is their added value, and what's under the hood of these table formats? We'll dive into this, and we're also going to get into a practical hands-on where you can read, and
5:53 hopefully write, we'll see, with DuckDB and Iceberg. For that, of course, I welcome the expert on the Iceberg topic, Brian, who is with me. Brian, how are you doing? Welcome. Hey, hey Mehdi, how's it going? Thanks for having me on, and yeah, I'm doing great. How are you doing? Yeah, we're doing
6:17super great uh I mean it's uh it's summer now over there um yeah but uh let
6:24me just check that we are correctly uh live uh okay seem start just checking here and I think yeah got to make sure the audience can actually see and hear us if you're uh with us here uh please
6:40let us know but yeah we are uh live on both sides uh LinkedIn is not uh being LinkedIn on streaming it does work sometimes uh but yeah so where what's uh what's all about those table formats uh Brian but before before that who actually are are you and who what is tab as a company itself yeah so um that's a lot of uh
7:08good questions I I am a uh you know kind of I like to start off you know US military background that got me really interested in technology and databases in general um and from there I decided Well I should just you know learn more about how a lot of this satellite communication database type stuff works so I did an undergrad in uh computer
7:33science moved from there into getting into data type space still uh at that time Hadoop was just becoming a thing um
7:44and uh I was in fact doing research papers where all we did was talk about uh Hadoop ways of doing traditional things um and and how to do it in a distributed manner which was very new uh and exciting and then after that uh you know uh got my well I followed on to do my masters and then got a job um as a
8:07data basically data engineer I was still software engineer at the time but data engineer shortly came after and I just never changed my title um so you know really diving into a lot of these weird issues with like Hive and and all this stuff and seeing like um you know you
8:27 would have. I actually started out working a little bit in Trino, and one of the hilarious quotes that one of the Trino creators, Martin Traverso, would always kind
8:43 of spout out when they were initially creating Trino and building the initial stuff was: yeah, it's a good day whenever you can get like six queries run in Hive. And that's... I'm curious, a good day: is it because of, you know, the technical challenge, or the query
9:03language it's so this was you know in the early days when we were first trying to handle distributed data sets um done
9:13at this very high scale and done in a fault tolerant matter uh the issue was you know you were scaling up to a certain number of of uh data and you're
9:24ultimately going to um you know go through these large comput ations um at some point if you have a single you know kind of area of failure you don't have a way to like get that original time back you don't have to a way to recover in a way that you don't have to start back from square one yeah and so that was a
9:45really big focus on Hadoop and all these other things is Fault tolerance um and so fault tolerance came at the at the at the cost heavy cost of actual usability
9:57 first, in that there was really no SQL-type way to actually access this data. You would have to write MapReduce programs, you'd have to hire these people to write Java, or Python, and it was terrible, a very poor user experience, not very well thought through. But what this was
10:17 really doing was saying: hey, it's possible to evaluate data at this scale now, and look, this is what everybody's doing. As, you know, MotherDuck has pointed out, it's not that everybody really needs to have this size and scale of data, but at the time it was
10:39 very much like, oh, everything is going to be big data, you need to be able to understand so many different data formats to function in this world, and so everybody jumped on the bandwagon without even asking questions. Yeah, and then you get Hive, which is the SQL kind of layer that sat on top of this
10:59thing and it's trying to kind of get us back to the feeling of like oh yeah this is a SQL database um but we had kind of
11:09 violated a lot of the expectations that you would have from a warehouse. We had violated a lot of the speed, of course, but in particular we did not have any guarantees in terms of how updates occur. If you're dealing with a warehouse, sometimes you're just doing append-only and that's fine, but
11:37 sometimes you actually do care about those ACID properties that we would have in an OLTP-type data system. Warehouses were always able to handle this; the idea was that it wasn't going to be super frequent, but when it does happen, I need to make sure that I
11:54 don't duplicate records, or else I'm going to have inaccurate results in my analytics. So that was another big aspect: beyond the user experience being incredibly poor, it was an incredibly poor experience in terms of the
12:19actual end result the analytics that came out of that may or may not be correct so um so that that uh you know
12:28obviously took some time people were still kind of saying well we need to still be able to process this data we're just going to have to assume or take kind of assumptions that well the data is not going to be perfect but it's going to be in the ballpark and that was kind of the general running assumption
12:44for like almost 10 years uh after Big Data really took off and so you know get
12:51some ballpark answers not perfect answers and you know live with it over time we start to start you know kind of see this unfolding of well no if I'm in finan in financial data right if I'm in the fin fin uh um you know fintech or or something like that it's incredibly important that the data is actually
13:15correct um and I so basically I need you
13:20know my cake and I need to eat it too I I I really need to be able to you know have the this analytics Warehouse type type setup um and I can't give up the the quality of the data I can't give up uh all this stuff because I I need to actually ask real questions on this that need
13:40 precision. And so all the requirements like ACID and all these things started to come creeping back into our vernacular and became an area of concern for users. So, you know, Hive had
13:55 constructed, basically, this...
13:59 I don't want to call it a spec, because a spec requires you to write something down. Effectively, to understand how it works: it stores a bunch of metadata around these SQL tables, or Hive SQL tables, and with that
14:25 you'll be able to retrieve all of it from a catalog, which is going to be stored in a centralized little MySQL or Postgres database on the side, and that's going to track all of my table information. If you look at any regular database, you look at the information_schema database that gives you all the metadata
14:48 about any table that you've created in that database; this is pretty much where all of that data is getting stored. And then you'll have HDFS, which more frequently over time became an S3 object storage layer, that you would just dump the actual data into. So, to initially
15:10 address these ACID requirements, people started using
15:18 HDFS, S3, and different file systems as the mechanism to
15:26 basically do a compare-and-swap. Compare-and-swap, yeah, let's unpack, because that's great, but there is already a lot there, right? So we came from basically Hadoop, where
15:38we need to distribute the compute right so we're getting out of the data warehouse uh SQL simple SQL right approach and we need to dig uh to deal with a lot of different technology map ruce and Java developers and so on yeah
15:59but we only we we lost a lot of you know um features that we had from the data Ray house uh And Hive is kind of the premise where they store metad data on the object storage so back then on the hdfs and then now uh on hisory and so so
16:19far we're trying to understand basically you know where those table format is coming from and that's the the timeline right and we are getting so what what's the next step basically after um after this so we have I've running with some metadata right yeah but we're still relying which kind of file format is mostly popular at that point of time
16:43 Yeah, so that actually brings up a good point. At that time, Hive was interacting with these files that would actually get stored on object storage or HDFS. It started out with RCFile, eventually moved into ORC and all these other things, and eventually we get to
17:06 what we commonly use today, Parquet, in terms of a binary format. Obviously you still use JSON and CSV and all the other pieces, but if you want a performant system, especially on columnar, SQL, tabular-type data, you'll ultimately want to have that in a
17:26 binary format like ORC or Parquet. So those are file formats; think about those synonymously with JSON. Avro is another one of those; it isn't quite the same, it's more row-focused and not going to be as performant, but the goal is that there's some level of
17:47 metadata that gets stored with these things. So JSON has the titles of the fields, and maybe a CSV file might have the titles in the header row. That's some data about the data, but that's usually where it ends
18:04 for those formats. Parquet is also typed, right? So there is a schema. Exactly, and Parquet has this nested schema and a lot of information in the footer itself about how to find data within the file, so you don't have to scan the entire file, depending on what you know about what you're looking for. And so
18:27 these file formats had existed from the Hive days; they were built out of the Hive era. So Hive was able to do some of this and kind of replicate the look and feel of a warehouse, but as we mentioned, performance was terrible, and we also didn't have any true concept of atomicity,
18:52 because, getting back to some ACID stuff, and I know this is getting down to maybe boring, school-time education for people: for atomicity you basically need a compare-and-swap ability, particularly when you're implementing a framework called optimistic concurrency. So
19:15 effectively, two writers, and again we're distributing, we're trying to do things in parallel, two writers are trying to add data to a table at the same time, and they ultimately have this shared resource that they're trying to write to. One is going to finish first, commit, and be the winner, and
19:36 then there's going to be a second one that tries to commit based on the older state, and we've got to make sure that that one fails. Well, a lot of times this was being done at the file system layer: the compare-and-swap would use a file
19:55 being written as the way to see whether there was a new commit, and there are plenty of ways in which that's not an atomic operation; in fact, because it's a distributed system, you have to write it, and depending on which node you're reading from, you might get a different answer back. These features were
20:15 available in a data warehouse, right, in a classic SQL database? Yes, but here it was kind of pseudo-ACID stuff, and the moment you hit one of those cases where you depend on this being an atomic operation and it ultimately isn't and fails you, you'll have duplicate records, you'll have data that's broken.
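To make the compare-and-swap idea Brian is describing concrete, here is a toy, purely illustrative sketch (hypothetical code, not Iceberg's actual implementation): two writers both read snapshot 0, both prepare a commit, and the catalog only accepts the one whose expected parent still matches the current pointer.
# Toy illustration of optimistic concurrency via compare-and-swap (not the Iceberg API)
class TinyCatalog:
    def __init__(self):
        self.current_snapshot = 0

    def commit(self, expected_snapshot: int, new_snapshot: int) -> bool:
        # Atomic compare-and-swap: succeed only if nobody committed in between
        if self.current_snapshot != expected_snapshot:
            return False              # stale writer must retry on top of the new state
        self.current_snapshot = new_snapshot
        return True

catalog = TinyCatalog()
seen = catalog.current_snapshot          # both writers read snapshot 0
print(catalog.commit(seen, seen + 1))    # True: the first writer wins
print(catalog.commit(seen, seen + 1))    # False: the second writer's parent is stale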
20:39 And that can actually cause corruption in the data itself. So what's the timeline again? Here we're talking about roughly 10 years ago, right? Was it around 2014? Yeah, it's about a decade ago. I mean, people were starting to move to cloud and so on,
21:00but on average like if you were doing a distributed system like spark was still pretty new uh and uh and so par started
21:10to kind of raise because of spark I guess what's your thought on that I don't know I think a lot of I I think it's just the network effect you get with certain things adopting right so uh spark obviously had a lot of good support there but for instance trino started out with or heavy orc support and then later had to move closer to
21:31park because of the network effect so I think it comes I think pandas also had a lot to do with that because they had primarily paret support yeah and that's when you started getting more people on you had kind of the data engineer uh I mean type of person and then you had the uh kind of data science person and
21:49everybody was seeing parquet because I I think particularly there was one feature that people really loved which was the nested uh you you could do nested um structs and things like that whereas orc didn't have you could technically do it but it wasn't like supported from a uh you know indexing type way so um so people you know went to paret because
22:10well real data in the real world is nested commonly uh and so especially if you're trying to avoid joins so so then you know okay well we have this thing that doesn't actually give us full Warehouse features um it's really slow so you know we bring in engines like spark and engines like trino to speed it up in different ways um and and and
22:34handle edge cases and use cases to to make that experience better which is good you know there's a lot of cool user experience pieces that come out of the py spark area um analytics in Fr in terms of like making things interactively uh set up so that your bi person or anybody can kind of not have to wait uh you know and only run six
22:54queries a day yeah you know it's quite a business killer if if that's your your uh your speed that you're running at and so um so this this ultimately you know speeds some things some of the way up um but we're still missing a lot and so um next around oh I'm gonna totally botch the the dates on this but I maybe maybe
23:17data do you know when data bricks um uh folks started to open source Delta Lake I'm trying to remember when is it 2015 open I'm not sure I think it came later than that because they started working Delta Lake internally and then they they open source it uh but they would be like 17 yeah something something like that I
23:41would say later than that there in in my memory there were the first ones on on the scene yeah um shortly after hoodie
23:51comes out of uber yeah um and and so uh
23:56you know data bricks is basically uh uh open open sourcing their version of what's basically going to be replacing hives uh two couple things you need to be able to replace the mechanisms for acid capabilities and make that actually work yeah um Unfortunately they initially they were still doing and this this actually just recently changed uh they were still
24:21relying on the file system uh for for a lot of that and still not being Atomic in in all cases but over time right then they've recently I think just updated that uh feature but um to to you know
24:34have a Catal have an external catalog uh out of Simplicity it's not fun to have to write a you know run an external catalog but um this is one of the things one of the advantages and maybe maybe we can uh we can as we dive into the the table format right and first yes maybe we can dive directly in iceberg and
24:54coming back to the timeline to say what requires uh a table format usually so you already mentioned you know a catalog um yes and maybe we can grab the picture you mentioned yeah let me go ahead and share my screen then um I so you know a quick you know we
25:16we'll get back to how these connect um but essentially these three things come out and it's kind of we actually wrote down you know the the the things that we're tracking and the you know that we need to understand stand to know about a table and actually you know have a shared consensus of what that actually means versus in Hive it was kind of
25:36anybody's guess and reverse engineering you know things and so um so this is you know the the last person on the last project on uh uh this the scene was Iceberg uh I think coming around in you know 2017 out of Netflix 2017 2018 uh maybe 2018 was the official open sourcing um and don't to Apache so it
26:01comes out it goes in incubator status and then it you know basically goes right into uh uh kind of the the public use and at first people were like ah another one yeah because so far is like if we keep the timeline uh with us as we started where like this we had uh Delta Lake then hoody uh from Uber and then um
26:25uh Iceberg from Netflix uh yeah it's it's kind of funny do you have any comment on that like we have two table format coming from tech company and then like one coming from a data company what's what's your thought on that can we so so between all of them you know Delta was the first so obviously like
26:45they were just uh you know kind of exposing how they solved their their their problem uh to the space and uh I I don't actually know maybe more of the specific reasons but I think there was a Val a a an idea to get maybe Community more involved in that because they they do build around the open source
27:03Community pretty well so I think they were just the first trying to do this Warehouse thing and sharing the ideas there um it didn't work so so you know Uber's case they have real time uh streaming uh ingestion cases and so there is various trade-offs that come into how you bring data into a data warehouse and effectively trying to
27:27treat it instead of treating it like more of a centralized data repository like a warehouse you're trying to almost like use that generic data Lake architecture to architect something like a click house or like a you know like a a a streaming ingest system where you immediately can do reads on on data that's coming in and this was critical
27:52obviously for um a use case like uber because you got like time estimates and things like that about you know different cars moving around and people requesting at the same time and you're needing to run those real-time analytics um it's it's incred I mean say what you will about you know if if you believe you know real time where real time
28:13applies and where it doesn't uh I believe there is plenty of applications but I don't think that they dominate especially in the when you think about the analytics realm uh real time is not necessarily like so ubiquitous that it's going to replace the the you know kind of incumbent Warehouse batch type uh workload I think there are some
28:36companies who do have that need and Uber is definitely one of them and so hoodie was designed to basically be the dataw DAT like the data warehouse slash you know the master of warehouses and streaming inest and and there came a lot of like tradeoffs that that went into that uh and basically kind of
28:58maybe making some of the uh uh data consistency and data quality uh worse in favor of trying to support this like this use case and so H hood has a you know deep uh streaming need right so they build their own table format based on that but what about Netflix so Netflix then came in and were like well you know
29:23we you I mean at the core problem I think Ryan blue and he the creator of Iceberg um also CEO of tabular and we can get into that later um so you know Ryan uh was was an engineer uh he had worked on cloud era worked on the high format uh in open source and before he got into Netflix and so he was pretty
29:44familiar with the problem space um and probably I guess that already had some theories going on in terms of how to address those um and they ran into that same issue again and just realized that where where Delta uh Lake was and where uh you know around how they were still utilizing the file system um as as a
30:08means to handle that compare and swap for atomicity and all these other things plus there is just all of these kind of like you know kind of uh Legacy issues
30:21that still came in from hive um hoodie and and uh Delta both really struggled with like um you know like they they wanted to have backwards compatibility or almost like you know you can have your hive thing and you can have us running on the same on cluster and uh and not have to really change too much
30:42um but but there were intrinsic issues with that based on the fact that like Hive was built on the Assumption of running on hdfs uh and a and a contiguous file store versus you know Random Access object storage uh on the cloud and so there needed to be a full redesign in terms of how you you anticipated the data to be laid out and
31:05the the kind of storage technology that you would be using um and and and the new realities of of cloud technology ver versus where we were in the early days of of Hadoop and so this is where you start to kind of see this formation coming out so without getting too much like uh basically this is what uh uh you
31:27 know, they came up with at Netflix. You have this catalog, and a catalog is basically, at its core,
31:37 the thing that is tracking the
31:42 state, the kind of "find
31:46 me the current state of this table" at a given timestamp, something like that. And so it does two things. It's a compare-and-swap operation, so you need that for atomicity, and it needs to be something that already has ACID principles, so quite commonly it will be
32:08 a SQL database, similar to how it was in Hive. And the other
32:16 thing is, whereas in Hive a lot of the metadata resided in that database, here let's not have all the metadata stored and collocated in this one bottleneck of a single catalog; instead, let's use this catalog for the very minimal things that we need it
32:42 for. So one, as we said, is compare-and-swap for the optimistic concurrency guarantees around atomicity, and the second part is that we need the top pointer of our current tree, which represents the current state of our table. So you have
33:06that catalog it's doing those two primary things um and then you're going to then instead of you know having all of this metadata about where the tables are and all these things that used to be you know stored in this Central uh database well now instead of doing that um you can actually Outsource
33:28the the traversal of this tree to a query engine that understands Iceberg format and so instead of that query engine having to constantly check back in with this this catalog here as as was
33:42 done in Hive; it can now basically get that top pointer with an O(1), you know, first call, and after that it knows all of the rest of the information as it traverses that tree. These are all files that sit inside of an object store like S3, and it's important to
34:02 mention which kind of files. We'll look into these, but the metadata file is JSON, and the manifest list and the manifest files
34:24 are Avro at this level. So you have a JSON file; the metadata is basically "point me to where all the things I need to find are." You do a read of that, and then, based on the query, you might have some information in terms
34:41 of a predicate that you need to push down. When I say predicate pushdown: say WHERE city equals, I think we have this example here later, city equals Berlin, right? So
34:55 if I have a whole bunch of data around cities, and I have that predicate, that tells the query engine: hey, I'm going to read this manifest list, specifically the only one that would contain the Berlin Parquet file I'm looking for, so
35:16 I'll pull on that list and do another O(1) read from S3 to pull that manifest file. Now, that file is going to hold a whole list of data files, and these data files are just Parquet, or ORC or whatever. And so you'll then go
35:41 ahead and scan those data files; or rather, the manifest file is going to tell you specifically which Parquet files you want. As opposed to Hive, where what Hive would require you to do, pretty much, is to assume this is HDFS block storage and actually scan over the entire directory
36:04 listing of all these files and check each one; you would literally have to pull the footer out of each one of these Parquet files to see if the data is in there or not. Yeah, so let's take a bit of time to recap, because there is a lot. So we have the base files, where the data
36:22 actually is, still ORC or Parquet, and it's most often Parquet these days, right, correct me if I'm wrong? Yeah, it's pretty much Parquet. But we cannot update a Parquet file, one row within a Parquet file; you need to rewrite the entire Parquet file, and you cannot do any
36:45 delete. So there are a couple of limitations, and basically the metadata will tell you if there is a delete happening; that lives in some of the metadata layers, right, to see which kind of delete there has been? Yeah, so for deletes, at least in V2,
37:10 you'll actually add a new delete file, and then you can compact that out later. So yeah, a couple of operations that Parquet doesn't support; it's kind of a view, if we come back to the data warehouse world. But you're talking about compaction; what do you
37:32define as a as compaction basically so let's let's actually real quickly look at this this picture and try to think about like the regular operations so um before there's this part of the tree over here um let's you know just try to cover your your eyes and not see that right side um when you first write to
37:52something and we'll show this in the example here uh in the in the python example here with dctb um you know I have this Iceberg catalog um in our case that we're going to be running this is going to be SQL light um and so it's going to point to a uh you
38:09know a metadata file that's going to indicate you know the schema and all this stuff and the metadata file initially won't have any um schema or anything invol or sorry it won't have any data involved in it directly but um or maybe you know let's just in in this particular uh image you would you maybe consider this like a copy uh copy as or
38:33whatever like basically a copy operation where you do have data in the very first snapshot but quite commonly if you just create a table and then insert you'll you'll have one where it has like no uh no manifest list and and we'll see that here in a second but after you do that first you know kind of move in there
38:51this this this first snapshot and what we call you know this metadata file here uh correlates to what's called a snapshot in in iceberg and so in the catalog you will you will basically store a record that says this is the current state this the current snapshot um that that represents reality and then um so and that's going to give you a a
39:16file location that's gonna in in S3 that's going to point you to this file and then from this file you'll be able to look and find you know the the list of things that are are added and so so in this case whenever we're on Snapshot zero here um you'll notice that this manifest list um points to these two
39:35manifest files but not this third one right yeah so if you are uh if if in the very first insert um you're only talking to these data files and and these data files don't exist yet let's say then you'll go and write uh another uh more data a second round of data to this table then you're going to write a
39:58second snapshot that's going to create this new manifest list that now points at all three so now we actually have two states that we can hop back and forth between we can actually look at this table State at snapshot zero and we can actually look at the table State at snapshot one both you know kind of coexist how however long we want to keep
40:17them around um and uh and that metadata you know can uh can be compacted you know that can we could we could take these two snapshots and just make them the current snapshot and what that effectively does is it gets rid of this this tree um let's say over time you delete some files or you delete a
40:35partition or something like that then eventually these manifest files and data files associated with it would also go away um and so you're you're kind of seeing this like kind of moving um kind of progress progressing set of snapshots that that occur and anytime a snapshot's happening that's actually a commit that's that that's that optimistic currency you have some
40:58writer yeah doing some work to write the data and uh it gets a commit into the
41:04catalog because the catalog is the thing that guards us from going outside of the atomicity guarantees of of acid properties and things like that this is where really Iceberg started to gain a lot of momentum and shine was you know and gain early adoption was uh this ability to actually you know have a a system and a setup
41:27that not only gives you this you know asset transaction and and things like that but uh it intentionally broke a lot of the older um kind of compatibilities with Hive just so that it could be do a much better job at at being a uh a file or a table format that builds on our current day architectures in the cloud
41:49right with S3 storage and all these things and so when you look at this um
41:56you know and where we're at today there's still this kind of confusion around which one do I choose and all that kind of stuff yeah um ideally you know obviously the you see things like xtable and all these things to try to combine these the goal in my mind is we need to narrow it down to a single
42:16protocol just like we have and I I showed you this earlier just like with tcpip and all the protocol Wars right um we need to ultimately choose the one that is going to be serving the case of of of
42:32warehouses and so um so that you know this I'm not trying to get on my like soapbox Crusade thing but ultimately instead of trying to create another artificial layer you know that that kind of combines these um you know I would say for the most part Iceberg and and Delta uh after Delta opened up 2.0 and and further 3.0 they've caught up
42:55functionally very well with with Iceberg and um you know there's just kind of one-offs here and there of of different functions that each each support um so it doesn't really make sense at this point to have those two and Hoodie has the streaming ingest thing but it it ultimately can lead very easily to you know corrupt data and um and and
43:18ultimately is not the best user experience it's kind of still stuck in those Hadoop Hive type type ages and so
43:26the question question comes then into is like really most people kind of have come to accept unless you have a streaming use case um you're you're just going to you know choose have to choose between one of the two and usually the choice comes down to do I you know am am I thinking about you know kind of the
43:42future architecture yeah um as as this grew right um I more query engines
43:50started to you know uh adopt Iceberg and
43:55as that happens right uh there's this metcalf's law um as more things use you yeah this is the same thing happened with with duct DB there's a lot building around duck DB right and so it's a similar type of uh experience where you have all these things that that depend on and and Inter integrate and work with
44:14um the more you have that the the stickier the product is and the more value the product will bring for you and so iceberg is is getting at its extreme point I I wrote a Blog uh kind of provoc have intentionally on the title to say Iceberg won the table format War because yeah and and it was at the time when
44:37when you know Delta had had announced that they would be providing some support for for I um for Iceberg and oh sorry I I didn't mean to say Delta I meant to say data bricks yeah but um and and when uh and similarly you know snowflake had also done this and so you this is the only form
44:57that you know you could get Snowflake and data bricks collectively using because it's under aache it's it's uh you know has a lot of governance from Apple and AWS and and uh a lot and Netflix and a lot of other companies including tabular um okay let's let's let's uh let me stop you there uh because you're you're taking too much
45:20your hat of D tab but you did see a lot
45:24of things interesting the first thing is H you no you you said that there is other initiative to have basically abstraction on on on those stable format right uh X and Delta as also how is it called again data breakes Unity I think is is the so there there is also uh framework to abstract uh Delta Lake
45:48Iceberg for you but if you think so like
45:52it's already an abstraction a table format yes right because I've I've had also I mean we're not going into that path but I also I had a chat with people thinking we should have a new file format right the start there because it's based on park one to rule them all yeah exactly I think we it's exactly
46:12that we are always getting into that point but I think uh something which is
46:18obvious to me is that in the data ecosystem we have a lot of complexity in ter of Tooling in term of integration and I don't think building abstraction on top of abstraction is really future proof right it can solve maybe like a migration process or something right if you're changing a technology um but on the long run it's I don't think it's it's viable
46:42the question I have for you um you um
46:47you mentioned um regarding uh the the
46:51the context coming back to the graph basically because we talk about the components so we need a catalog we need the we have metadata file manifest list manifest file and then we have the data itself which going be park or what are the other components that are link it to a table format to these days so you mentioned the catalog where
47:15does usually the catalog live what do you seen and what's what's the execution inine you you mentioned a couple of them that is actually going to write those uh stable format here at Iceberg yeah so one one I guess uh kind
47:36of uh trying to think one pattern that I've seen pretty frequently um across a couple different open source things I I've seen it recently in uh uh I don't know if you have you heard of Matrix the uh uh communication platform that's trying to yeah yeah so uh you know the these uh
47:59basically the goal that I've seen a lot of cases is building out a standard that way you know and that's that's the kind of uh entry point and focal point what
48:10Iceberg uh the the project is is aiming to do is maintain that standard uh in a neutrally governed way and maintain um almost the SDK or the API in the libraries to those but try I mean try as
48:27hard as possible to avoid um any
48:32implementation details uh around uh anything in particular that that's official you know Iceberg there's um you know reference implementations that different companies will own we we have tabular has one um there's another company called uh
48:54gravitino and if you just go on to GitHub and you search for you know Iceberg rest uh catalog um these are you
49:04know these are these are good reference implementations that follow um a the specification and you you yourself could write one by looking at one of these similar like how you know this Matrix is a spec it has some libraries or or apis that it officially owns um but there's like I don't know 20 different ref reference implementations of the actual
49:28server that you could run yourself different ones have different licenses different ones have different support and activity and so I'm seeing this trend and it actually works super well in open source in general um but basically this is where where you know uh you know that Apache Iceberg comes in is the catalog there's no official Apache Iceberg uh catalog now there's some
49:55debate about bringing a reference implement ation into uh the Apache Iceberg so that people can kind of use something out of the box that is you know supported by by the uh the community but the concern there is we don't want to end up in a state where we were with Hive where we had a million libraries that supported everybody's
50:15unique use cases um rather you know you have at the at the most you'll have a very very very light reference implementation and then if somebody has like a special you know monitoring thing or has a special you know extra set of libraries you can take that reference implementation copy it over to your org or something like that and build that
50:38out and even share it yourself right like it's all part of like building around that Community rather than having you know one implementation that just gets bloated yeah um and and requires a lot of testing and focus of Maintenance it's better to you know resolve especially with open source resourcing right yeah trying to keep people you know focused and sustainable and all
51:02 that stuff. It's better to have the focus be on as few things as possible, unless it's an absolute necessity. So we have the REST catalog, a specification, for example, for building a catalog, and we have existing catalogs. What's the common one that you see on the cloud, Hive Metastore? Hive Metastore, but there are also
51:25 serverless ones; I would say Glue. Glue, and by the way, Glue is the AWS version of the Hive Metastore, but with different features, of course. Then you have SQL: there's a generic SQL catalog and it will basically use SQLAlchemy, so you
51:46 can connect to various SQL drivers there. So we'll do that, because time is running out; in a few minutes I just want to close on the components. We're going to use SQLite as the catalog.
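For reference, these catalog flavors are all reached through the same PyIceberg load_catalog call. A hedged sketch of what the configurations might look like (hosts, regions, and warehouse paths here are placeholders, not values from the episode):
from pyiceberg.catalog import load_catalog

# SQL catalog (what the demo uses): any SQLAlchemy-compatible database, here SQLite
sql_catalog = load_catalog("default", type="sql", uri="sqlite:///catalog.db", warehouse="file:///tmp/warehouse")

# Hive Metastore catalog
hive_catalog = load_catalog("hive", type="hive", uri="thrift://metastore-host:9083")

# AWS Glue catalog (roughly the AWS-managed counterpart of the Hive Metastore)
glue_catalog = load_catalog("glue", type="glue")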
52:08 And then what's the writing engine that is typically being used? So, it's whatever query engine; this is kind of thinking about the flip side. If you look at something like DuckDB, you can connect to local file systems and all these other potential systems that you can query from, via integrations and things like that.
52:27um this Iceberg looks at the world as a a in a flip side right all these different engines can talk to this one kind of uh layer of storage format uh and this is why I say I compare this to tcpip you think of like a SQL query layer you think of then the table format file format and then S3 right those are
52:50kind of like our our general protocols and in each layer you kind of have Iceberg for instance can be can talk to multiple query engines and duck DB is of course one of those query engines um with uh with with currently with just the read support uh and and no write support and so but like if we coming
53:08back to Netflix for example they were mostly using uh which kind of execution in trying to write uh so so writing was generally
53:19done through spark and then they also supported like a reading through through trino yeah um and so that that a very very common pattern with early uh adoption Iceberg but it's exploded to where you know big query uh red shift snowflake um so data warehouse basically have some features of data Lake where you can write t format Iceberg uh within
53:46some of those yes and and and and the cool thing is if you write in one provided everything is talking you know following the spec or basically utilizing those SDK libraries that are maintained in different languages um then then you're then then whatever is actually implemented within the query engine itself is actually what what level of support you have for Iceberg
54:11yeah and so EAS what what's kind where the the catalog is storing we still have the limitation where today those data warehouse they use internal catalog or what what do they typically use it depends I I would say most commonly people are still migrating away from their you know a lot of a lot of this is centered around big data so you know old
54:34Hive shops that that are trying to upgrade and get to the latest thing and so um there's a lot of migration tools by the way everybody has migration tools between everything like Iceberg has migration tools between Hive and this is why another reason why xtable and and the other one is like makes no sense because there's already tools to to
54:55manage the migration so anyways let's go on to the uh the code this yeah let's go go too much quacking yeah too muching um so I have uh the I
55:08can I can share my screen I have the so you'll be the driver I'll be the Newbie okay um okay sounds great yeah yeah so basically I'm gonna share quickly uh we have some issue on the LinkedIn live because LinkedIn is being LinkedIn but I'm sharing it on uh uh YouTube and I'll share on the comments uh give me a sec
55:30on uh the mod duck uh live uh
55:36 here, a notebook I'll put in the description later if you want to catch up. But the point is, we are going to use PyIceberg and DuckDB to show you what's possible. Can you give me a brief note about the difference between Iceberg and PyIceberg? So, think about Iceberg
56:01 as the spec; PyIceberg is that API layer for Python. It's not aiming to be a query engine in itself, the way DuckDB is a full-blown query engine. We want DuckDB to use PyIceberg, and other people who implement a Python library, we want them to utilize it too, so
56:24 that anytime there's a fix in our library, you upgrade and you get that fix automatically. This is being focused on and built out by the community. So here you're pulling in this CSV file, a very similar DuckDB thing that I'm sure everybody knows, and into a, well,
56:47 it's good to do a recap; we may have perfect strangers coming to the live stream to discover this. So here I'm just running a DuckDB query, and what's beautiful is that I can query a remote CSV file with one line of code; every dependency is built in, I don't need to install anything, and it's a public
57:06 bucket, so there is nothing to do. And I'm converting it to a PyArrow table, so
57:15 I'm going to run this, and I think I just need... it's done. So now I have my PyArrow table; it's about air quality. We see the schema and we have a sample of the data there. I need to install Py-
57:36 Iceberg, obviously, and SQLite. Why do I have to install SQLite? Oh, why do I... oh, you're asking me? Yeah, my turn, you're the expert, right? Yes, that's right. So SQLite is ultimately going to be the catalog in this case. You have some sort of SQL database, and
57:59 SQLite has all the ACID capabilities; even though it runs on a local file system, it has all of those neat features that we expect. I mean, at least it works on our Android phones, doesn't it? Yeah. So that will be our catalog for the demo. And here I'm just creating a folder, so this is
58:20uh my space and um I'm going to do so
58:24 What does this part do? It connects to the catalog, to SQLite. And what does this reference? Yeah, that's PyIceberg talking to the warehouse, so this is where we store the underlying file structures; generally this would exist in S3, but because we're using the SQL catalog and
58:45 because we're using a local file system, we're able to fake out S3 and write it locally here inside the notebook. So that warehouse is pointing to a local file system; usually it would be an S3 bucket or HDFS. And then the sqlite URI,
59:06 that is pointing to... you know, SQLite has the ability to write all of your SQL data to a local .db file, and that's where the catalog data lives: it holds the top-level pointer, what's the current snapshot that represents the
59:28 state of the table. Yeah, now that's clear. And then here we create, I guess, a namespace within the catalog, you're calling it default, yeah, and I create a table within that namespace, air_quality, which is my table. And so now, you're pulling the
59:51 schema from the Arrow table to pass that along, yeah, I pass the schema. So now we already have some metadata which appeared there, which is a JSON file, so
60:05 this file right here is a JSON file that directly correlates to a snapshot. Yeah, so any metadata.json file is basically the in-file version of a snapshot, and that's what the catalog stores: if you query that SQLite database, it points to this particular file, and this file then
60:29 points to, well, nothing at this point, because we have an empty table. Yeah, exactly. So now we're going to append: I'm getting the content of my PyArrow table, which is my source table, I'm going to run this, and I should have some
60:53 content. I need to refresh the file tree... there you go, so now we have metadata and we have data. Yes, so
61:06 you'll see that second metadata.json file: that's the second snapshot. And now, the same as in the graph above, you'll have that second entry in the SQLite database that points to that snapshot file, which points to
61:26 the Avro file that starts with snap (the manifest list), and from that Avro file it points to the manifest file, which actually points to the Parquet file. It's a small data set; usually you'll have multiple data files, and this would make more sense in that context.
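If you want to see this tree on disk, you can simply walk the local warehouse folder after the append. A small sketch (the "warehouse" path is a placeholder for whatever warehouse location you passed to the catalog); you should see something like *.metadata.json files (snapshots), snap-*.avro manifest lists, other *.avro manifest files, and the Parquet data files:
import os

for root, _dirs, files in os.walk("warehouse"):
    for name in files:
        print(os.path.join(root, name))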
61:45 But it's basically one file that points to another file, which points to another file. Yeah. And so here, what I'm going to do is, we have a function to create a DuckDB table
61:58 from that table, which is the
62:03 Iceberg table, so this is where I do the reading. If this works... I don't know if I lost my runtime, but it looks good, okay. And here I'm basically just running, so this is just a DuckDB connection, and I'm running a SQL command, and because DuckDB is FROM-statement first, I
62:27 don't need the SELECT *. So, above there, that scan operation is lazy: it basically just says here's where you'll find the Iceberg table, and then when you go to DuckDB and you say the table name, that Iceberg connection, nothing's been executed until you actually materialize it. And now what DuckDB understands as the table
62:50 air_quality actually executes the full scan
62:58 of the Iceberg table at that point. Yeah, so there is no filtering happening there. And so here, what
63:10 I... so if you look at this, this table is actually the PyIceberg table.scan, and so this is,
63:20 when you see this row_filter, this isn't actually being done in DuckDB; DuckDB currently only applies a full scan, rather than pushing something down. So if you go down to that one, this is actually just rebuilding the Arrow data frame and then
63:42 overriding the DuckDB table. Yeah, so there is no filtering happening at read time on the DuckDB side; there is a transition through PyArrow. Yep.
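The filtering Brian describes happens at the PyIceberg layer, not in DuckDB. A hedged sketch of that step, reusing the demo's table and the city column from the air-quality example (PyIceberg can prune manifests and data files using the metadata before anything reaches Arrow or DuckDB, and the filtered result can then be written back with overwrite):
# Filter at the Iceberg layer, materialize to Arrow, then write back as a new snapshot
berlin = table.scan(row_filter="city == 'Berlin'").to_arrow()
table.overwrite(berlin)

# Re-register the current table state in DuckDB for querying
con = table.scan().to_duckdb(table_name="air_quality")
print(con.execute("SELECT count(*) FROM air_quality").fetchone())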
63:54 So if you were to update that SQL query on the DuckDB side and put WHERE city equals Berlin, you would actually scan the entire data set without skipping anything, and that pruning wouldn't happen. I wanted to make that clear: if you do ever point
64:14 DuckDB at some giant table right now, it would take a long time, and you'd be like, why the heck is this taking so long? And I'm hoping we can encourage, if anybody's interested in messing with this stuff and playing around with it,
64:32 please ask for it. I think
64:36 we have some work we would be interested in helping out with, in terms of contribution, and I do think there is some early work, early PRs, that have been put forth to solve some of the pushdown capabilities that PyIceberg can offer. No, that's great. And so here, basically, what we do is simple: we're doing
64:57 a filter within the Arrow world, and
65:02 then overwriting, as you said, the same data frame. So here's my new data frame; now the state of my table basically represents this state, at least in the last snapshot, so it's only filtering on the city of Berlin. And if I overwrite it to DuckDB and display again, I only have 10 rows now, and those contain only
65:27uh the Berlin City so they kind of like like the good thing is that I I really love actually uh the dougb iceberg just to understand what's happening in local F because it's true that it's pretty fast right just within a notebook to explore step by step um what's happening
65:47 And now we have another Parquet file, and there is probably a diff happening. But one other thing to mention, let me get my terminal one sec, is that there is another way to read from DuckDB to
66:07 Iceberg, and it's through the Iceberg
66:14 extension. What's the actual difference between that and PyIceberg, do you know? I mean, the implementation is different, right? So, oh yeah, actually DuckDB does not use the PyIceberg implementation, you're right. In general we kind of hope people would use the centralized library,
66:43 because that's the one that the community supports. However, I think Hannes and a lot of the people on the core team at DuckDB Labs are very considerate and thoughtful: we might write a generic library, but it may not be performant inside
67:02 DuckDB, and their big thing is performance. So they basically have to reimplement everything we've already implemented within Python, and it comes with trade-offs, because now you get to make sure this runs optimally for whatever DuckDB's needs are; however, if there's a change or a fix or
67:25 something that comes through on our side, then you have to implement it there too, so everything is a duplication of effort. And it is managed by the DuckDB Labs folks, and Hannes is open to this; I talked to him when he came to Chicago a while back,
67:47 and he really wants to support this as the community needs it, or as somebody pays DuckDB Labs money. Yeah, DuckDB Labs,
67:59 sorry. No, no worries. So yeah, that's a really good point: it's a different implementation, and it's also limited to read support, right? Yeah. And the way this works is, here I'm in the DuckDB CLI, you can INSTALL
68:19 iceberg and then just LOAD iceberg, and I have, I think, this file,
68:27 and so what I'm doing is invoking an iceberg_scan over a given
68:35 Iceberg table. It's taking a bit of time, but here we go: we have about 50,000 rows returned for this one. But again, it's limited to read only, and it's a different implementation from the PyIceberg one. So that leaves you two options to start your DuckDB and Iceberg
69:00Journey but what's we are we are already over time but what's uh what's your closing thoughts about um basically lighter engine or execution engine like dgdb and and table format where where does you see uh it's placed against you know the other that we mentioned classically uh would it be spark or a classic data warehouse yeah so I I see uh a huge value in the
69:29developer experience of duct DB uh and people utilizing it for various uh purposes around data like local data engineering and local data science uh and there's a huge value in starting out like at a at a smaller scale with the same data that you're actually needing to point to and so um you know I uh a
69:52lot of times you know people will mock data or you know point to various other things I think it's incredibly valuable to um you know provided you have access and and you're you know it's good to do on your laptop you P you know pull this data locally uh get a small representative data set of of what real
70:11prod data is and sometimes when you're doing that experimentation it's it's incredibly valuable to point it directly to those tables like we have said having a full table scan is um is is not ideal
70:24um and I think that duck DB you know is generally not necessarily aiming at the at the Big Data stuff usually but I think when it comes to uh understanding and getting proof of Concepts and getting the idea of the L what eventually gets applied at larger scale um duct DB is invaluable in that process and it's an incredibly important process
70:45that a lot of people Overlook so um so I see a lot of value especially from you know the local developer workstation uh capabilities around big data and even if it's like Iceberg doesn't have to be like big data all the time either um and so there is plenty of value in just having this you know mutually
71:08represented format that if you need to you know write test data or or or quickly spit out test data to an ice per table in production and then do a quick you know kind of some sort of testing or validation um you know it's it's incredibly easy for for uh you know duct B to to see what the state is and just
71:28knowing the metadata and using it for Tooling in that sense it's incredibly incredibly value valuable so um all that being said and particularly for me as a developer Advocate uh I I care let's let's close that one by uh what's what's your plug around your position at daer and yeah basically what's what's your mission and it's kind of like your uh
71:52plug your at time for the rest of that's yeah that's exactly uh I mean what it is it's I want people to understand table formats uh and and and how they differ from you know paret or other file formats um and I really want people to see the value of this like you know of of of the things that we've solved um
72:17basically making it possible for the warehouse to be at a large scale and still have the original data warehouse features and much further interoperable features across you know data engines so you're not having to store data more than once uh you know if you're in big query or snowflake or all these other things you don't want to
72:37store that data multiple times and so having a format that everybody reads and understands is incredibly valuable and important and uh you know having it everywhere from Duck DB doing the local thing working on that same format to all the big like big data query engines uh that uh that are doing this it's it's incredibly valuable to have that and not
72:59have to move data around back and forth yeah no I'm looking forward like that was one of my worst nightmare is data migration especially between two different database that has different type uh so looking for what to not do that anymore um yes thank you Brian again uh for joining us than for having me it was uh a really history lesson I
73:23 didn't expect that, but I was pretty happy at the end; I did learn a couple of things myself, so thank you again for joining us. If you want to check out Tabular, it's tabular.io, and
73:37 I'll put all the links in the description. If you want to check out PyIceberg, yeah, we also have a virtual book on getting started with Iceberg; it's Iceberg-focused, not necessarily Tabular-focused. It's a virtual book you can look up on the internet, it's free, and it's on
73:55 Tabular's site as well; I'll send you the link, Mehdi, so you can add it to the show notes or wherever you add it. Good, thank you. See you next time on
74:05 the Quack & Code show, which is happening every other week, and I believe next time
74:13 will be debugging SQL queries, I think refactoring
74:20 SQL queries, so that's an interesting one: keeping, like, a long script, you've seen those, right? Yes, and how do you tackle that? So that would be interesting, stay tuned; it's not only about dbt. How do you know that that's how I tackle it? Don't spoil it. But not
74:40 only that. Anyway, check out our events page if you don't want to miss another live stream, and yes, have a beautiful day, beautiful morning, or beautiful evening, depending on where you are, and see you next time