
Leveraging DuckDB and Delta Lake Together

2024/07/24

If you work with data, you've almost certainly worked with Parquet files. It's the go-to columnar format for analytics, a massive leap forward from human-readable but machine-unfriendly formats like CSV. But as powerful as Parquet is as a file format, it's not a complete solution for managing data tables.

As our data pipelines grow more complex, the limitations of a folder full of Parquet files become painfully obvious. In a recent episode of "Quack and Code," I sat down with Holly Smith, a Developer Advocate at Databricks, to discuss the evolution beyond plain Parquet files and how the Delta Lake table format, combined with DuckDB and MotherDuck, solves these modern data challenges.

Holly shared a relatable journey: "Back when I was a data scientist, I'd grab things from somewhere like Teradata. And I'd never think about a table being a file format... It's just kind of like SELECT * FROM table and then it works."

TL;DR

  • The Problem: Using folders of Parquet files as tables leads to schema enforcement issues, inefficient updates/deletes, and the "small file problem."
  • The Solution: Delta Lake adds a transaction log (_delta_log) on top of Parquet files to provide ACID guarantees, solving these reliability issues.
  • How to Query: DuckDB's delta extension makes it easy to query Delta Lake tables locally or on S3 using a simple delta_scan() function.
  • Performance Boost: For cloud data, use MotherDuck to run queries in the same region as your S3 data, drastically reducing latency by transferring only the final results, not the raw data.

We'll start with the pain points of plain Parquet, see how Delta Lake addresses them, and then get hands-on with querying Delta tables using DuckDB and MotherDuck.

The Pain Points of Using Plain Parquet as a Table

Parquet is an excellent building block, but when you treat a directory of Parquet files as a single table, you invite problems that can break pipelines and cause massive headaches.

The Challenge of Schema Enforcement

A common issue with systems that use schema-on-read is that a single rogue Parquet file can break an entire query. As Holly explained, with plain Parquet, you could define a column as an integer but then write a string into it in a new file, and nothing would stop you.

If one file has a slightly different or incompatible schema, your query fails. Finding and fixing that one problematic file in a table composed of thousands requires tedious, manual work.
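Here's a minimal DuckDB sketch of how that failure mode sneaks in. The events/ directory, file names, and column names are made up for illustration:

-- Two writers drop files with conflicting types into the same "table"
-- directory, and nothing stops the second write (assumes events/ exists).
COPY (SELECT 1 AS id, 42 AS amount) TO 'events/part-1.parquet' (FORMAT parquet);
COPY (SELECT 2 AS id, 'oops' AS amount) TO 'events/part-2.parquet' (FORMAT parquet);

-- The problem only surfaces later, at read time:
SELECT * FROM read_parquet('events/*.parquet');
-- Fails (or behaves unexpectedly, depending on reader settings) because
-- the files disagree on the type of "amount".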

Inefficient Updates and Deletes

Parquet files are immutable. You can't just open one up and delete a row.

To handle updates or deletes (like a GDPR right-to-be-forgotten request), you have to read all the relevant Parquet files, filter out the records you don't want, and write out entirely new files.

Holly shared a powerful real-world example of this pain:

"I remember working with one customer... they had every click from their website from the last five years in this one monster table. For two hours a day, they'd have to take it offline. And then GDPR came in and they were like, we don't even know how we can delete data from this. This is a nightmare."

This process is not only complex and slow but also requires taking the table "offline," creating downtime where the data is unreliable.
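To make that concrete, here's a minimal sketch of the rewrite dance in DuckDB SQL; the paths and the user_id column are hypothetical:

-- Read everything, filter out the records to forget, write new files.
COPY (
    SELECT *
    FROM read_parquet('events/*.parquet')
    WHERE user_id <> 'user-to-forget'
) TO 'events_rewritten.parquet' (FORMAT parquet);
-- You still have to swap the old files for the new ones yourself, and
-- nothing protects queries that run while that swap is in progress.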

The "Small File Problem"

Query engines love large, uniform files. However, many data architectures, especially those involving streaming or certain partitioning schemes, create lots of small files.

For instance, an e-commerce platform partitioning by date, as Holly noted in her "Black Friday" example, might have one massive file for a busy shopping day and hundreds of tiny, inefficient files for slower days. Querying thousands of tiny files is incredibly inefficient due to the overhead of opening each file and reading its metadata.
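Without a table format, compaction is another manual chore. A hedged sketch of folding one partition's pile of tiny files back into a single larger file (paths are hypothetical):

COPY (
    SELECT * FROM read_parquet('events/date=2024-11-29/*.parquet')
) TO 'events_compacted_2024-11-29.parquet' (FORMAT parquet);
-- Table formats like Delta Lake handle this with a compaction (OPTIMIZE)
-- operation that is recorded safely in the transaction log.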

Delta Lake: Adding Transactional Reliability to Parquet

Delta Lake was created to solve these problems. It's not a new file type but a storage protocol that adds a crucial layer of reliability on top of the Parquet files you already have.

"Technically, Delta is a protocol," Holly clarified. "We still have those Parquet files, which are great, but what we're going to do is add a tiny bit of metadata with our Parquet files and store them next to them."

This "tiny bit of metadata" is the key: a transaction log that brings ACID properties (Atomicity, Consistency, Isolation, and Durability) to your data lake. Every operation, whether it's adding data, deleting records, or updating a schema, records as an atomic commit in this transaction log. A transaction either completely succeeds or completely fails, eliminating the risk of corrupted data or inconsistent reads.

Under the Hood: A Peek Inside the _delta_log

So what does this transaction log actually look like? When you create a Delta table, you'll see your familiar Parquet data files alongside a new directory: _delta_log.

/path/to/my_delta_table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   └── ...
├── part-00000-a-b-c.snappy.parquet
└── part-00001-d-e-f.snappy.parquet

Inside _delta_log, you'll find a series of sequentially numbered JSON files. Each file represents a single atomic transaction or "commit" to the table.

Here's a simplified example of what one of these JSON files might contain after adding a new Parquet file to the table:

{
  "commitInfo": {
    "timestamp": 1672531200000,
    "operation": "WRITE",
    "operationParameters": { "mode": "Append", "partitionBy": "[]" },
    "isBlindAppend": true
  },
  "add": {
    "path": "part-00001-d-e-f.snappy.parquet",
    "partitionValues": {},
    "size": 1024,
    "modificationTime": 1672531199000,
    "dataChange": true,
    "stats": "{\"numRecords\":100, \"minValues\":{\"id\":101}, \"maxValues\":{\"id\":200}}"
  }
}

This log entry tells us everything a query engine needs to know: a new file was added, its path, its size, and even file-level statistics like the number of records and min/max values, which let an engine skip reading the file entirely when a query's filter falls outside that range.
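For example, with the stats above saying id only ranges from 101 to 200 in that file, a filter outside that range can prune the file without opening it. A hedged illustration:

-- The file's recorded min/max for "id" is 101..200, so for this filter a
-- Delta-aware engine can (in principle) skip the file entirely.
SELECT *
FROM delta_scan('path/to/my_delta_table')
WHERE id > 500;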

A common question is, "Can I still query the Parquet files directly?" As Holly explained, while you physically can, you shouldn't.

When a row is "deleted" in Delta Lake, the transaction log simply adds a remove action pointing to the old Parquet file. The data still exists in that file, but the log tells any Delta-aware engine to ignore it. If you bypass the log and query the Parquet file directly, you'll see stale, incorrect data. Always query through the Delta protocol to get the correct, current state of your table.
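A quick way to see the difference, sketched with the hypothetical table path used above:

-- Count through the Delta protocol vs. straight from the raw Parquet files.
SELECT count(*) FROM delta_scan('path/to/my_delta_table');
-- current table state: rows removed in the log are excluded

SELECT count(*) FROM read_parquet('path/to/my_delta_table/*.parquet');
-- raw files: rows that were "deleted" may still show up in the count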

How to Query Delta Lake with DuckDB

The DuckDB delta extension makes it incredibly simple to read this transaction log and query Delta tables correctly. Built on the open-source Delta Kernel project, it ensures robust and efficient access.

Step 1: Install and Load the Extension

First, you'll need the delta extension. DuckDB will automatically download and install it the first time you use a Delta-related function, but you can also install it manually:

INSTALL delta;
LOAD delta;

Step 2: Query a Local Delta Table

To query a Delta table, you use delta_scan, a table function that interprets the Delta Lake transaction log to find the correct data files. Querying a table stored on your local machine is as simple as pointing the function to the root directory of the table.

-- Query a local Delta Lake table
SELECT * FROM delta_scan('path/to/my_delta_table');

DuckDB reads the _delta_log, determines the current set of active Parquet files, and queries them. That's it.
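If you'll be querying the same table repeatedly, one convenient (optional) pattern is to wrap the scan in a view; the view name here is just an example:

-- Downstream queries now read like an ordinary table.
CREATE VIEW my_delta_table AS
SELECT * FROM delta_scan('path/to/my_delta_table');

SELECT count(*) FROM my_delta_table;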

Step 3: Query a Delta Table in S3

The real power comes when your Delta tables are stored in cloud object storage like Amazon S3. The delta extension integrates with DuckDB's httpfs extension and secrets manager.

First, configure your AWS credentials. DuckDB can automatically use your environment variables or local AWS config, or you can create a secret explicitly:

-- Create a secret to use your local AWS credential chain
CREATE SECRET (
    TYPE s3,
    PROVIDER credential_chain
);
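If you'd rather not rely on the credential chain, you can also pass keys explicitly; the values and region below are placeholders, not real credentials:

-- Named secret with explicit keys (placeholders only).
CREATE SECRET my_s3_secret (
    TYPE s3,
    KEY_ID 'AKIA...',
    SECRET 'your-secret-access-key',
    REGION 'us-east-1'
);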

Now, you can query the Delta table by providing its S3 URI to delta_scan:

-- Query a Delta Lake table stored in S3
SELECT user_id, COUNT(*) AS event_count
FROM delta_scan('s3://my-bucket/path/to/delta_table')
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;

Level Up: Accelerating S3 Queries with MotherDuck

Querying a remote Delta table from your local machine works great, but it has a physical limitation: network latency. Your local DuckDB instance has to make multiple requests to S3 to read the transaction log and then download all the necessary Parquet data over the internet before it can do any processing.

As Holly pointed out, this becomes a major bottleneck for distributed teams: "If your data is in S3 and it's halfway across the planet, connecting locally is... you're just going to have latency regardless of the tool that you're using."

MotherDuck, a serverless analytics platform powered by DuckDB that runs in the cloud, addresses this directly: by running your query in MotherDuck, you move the compute next to the data.

Here's the workflow comparison:

Local DuckDB:

  1. Connect to S3 from your laptop.
  2. List files in _delta_log.
  3. Download relevant log files to your laptop.
  4. Parse logs to identify active data files.
  5. Download gigabytes of Parquet data from S3 to your laptop.
  6. Process the data locally.
  7. Display the small result set.

MotherDuck:

  1. Connect your local DuckDB client to MotherDuck.
  2. Send the SQL query to MotherDuck's cloud-based engine.
  3. MotherDuck's engine performs the equivalent of steps 1-6 above within the same cloud region as your data, over a high-speed network.
  4. MotherDuck sends only the small, final result set back to your laptop.

This shift dramatically reduces data transfer over the public internet, moving only the small final result set instead of the entire raw dataset.

The query is identical. The only difference is where it runs.

-- 1. Connect to MotherDuck (from the CLI)
--    duckdb md:
-- 2. Ensure your S3 secret is stored in MotherDuck
--    You can create it once at https://app.motherduck.com/
-- 3. Run the exact same query
SELECT user_id, COUNT(*) AS event_count
FROM delta_scan('s3://my-bucket/path/to/delta_table')
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;

For large datasets, the performance difference is dramatic. Instead of pulling gigabytes of raw data across the internet, you're only transmitting kilobytes of results.

The Bigger Picture: A Converging Ecosystem

While this guide focuses on Delta Lake, it's reassuring to know that the skills you're learning fit into a broader, converging ecosystem. The data world once had a "table format war" between Delta Lake, Apache Iceberg, and Apache Hudi. But the trend is now toward interoperability and convergence.

With Databricks acquiring Tabular (the company founded by Apache Iceberg's original creators), the focus is on making these formats work together. Features like Delta UniForm allow a single table to be read as both Delta and Iceberg, and future developments aim to make writing to multiple formats atomic.

"This is not taking away one project... this is the idea of how can we take great features from the various projects and make them work together?" Holly said. "Hopefully, it shouldn't matter in the future which one you pick."

Delta Lake vs. Iceberg vs. Hudi

At a high level, the main difference between the three open-source table formats is how they track changes and table state.

  • Delta Lake uses an ordered transaction log of JSON files in the _delta_log directory. To find the current state, a reader must process these logs sequentially.
  • Apache Iceberg takes a different approach, using a hierarchy of metadata files. It maintains a central metadata file that points to manifest lists, which in turn point to data files. This structure can make discovering the current files for a query faster, as it doesn't require listing and processing a potentially long transaction log.
  • Apache Hudi offers different table types (Copy on Write and Merge on Read) and maintains a timeline of actions performed on the table, similar to Delta's log. It was initially focused on providing fast upserts and deletes for streaming use cases.

Conclusion & Next Steps

Parquet is a fantastic file format, but for building reliable, scalable data tables, you need more. Delta Lake provides the transactional guarantees and performance features necessary for modern data platforms.

  • Use Delta Lake to add a transactional metadata layer that solves the core problems of plain Parquet.
  • Use DuckDB and its delta extension to query these tables locally or in the cloud with minimal setup.
  • Use MotherDuck to run queries on cloud-based Delta tables with compute next to the data, eliminating network bottlenecks.

Ready to try it yourself?

FAQ

What is the main advantage of Delta Lake over Parquet?

Delta Lake adds a transaction log to Parquet files, providing ACID transactions. The log addresses the major pain points of plain Parquet, such as missing schema enforcement and difficult updates/deletes, and ensures a level of data reliability that a simple folder of Parquet files cannot guarantee.

Can I query Delta Lake with standard SQL?

Yes. With tools like DuckDB and its delta extension, you can use standard SQL functions like delta_scan('path/to/table') to query Delta Lake tables as if they were regular database tables.

Why should I use MotherDuck to query a Delta Lake table in S3?

Querying a remote S3 Delta table from your local machine requires downloading large amounts of data, causing high latency. MotherDuck runs the DuckDB engine in the cloud, next to your data, processes the query remotely, and sends only the small final result back to you, which is significantly faster.

Can DuckDB write to Delta Lake tables?

Currently, DuckDB's delta extension primarily supports reading from Delta Lake tables. Writing or updating data in the Delta format is not yet supported directly through the extension.
