Can DuckDB revolutionize the data lake experience?

2024/11/22
Featuring: Mehdi

Mehdi explores DuckDB as a catalog for data lake and lakehouse patterns. He defines what we mean by "data catalog", gives clear examples of how they work, and dives into a pragmatic use case with DuckDB & MotherDuck.

0:00 Data lakes and lakehouses are highly discussed topics at the moment, and this is because it's much easier and more cost-effective to have central storage in an object store and be free to choose which compute engine you want to use against it. But many people forget an important part of the story: the data catalog. Over the past few years it has become more

0:21 important. But what is a catalog anyway? In this video we'll cover definitions, review some basics around data lakes and lakehouses to understand why data catalogs have become a central component, and finally we'll end up with some code around the DuckDB use case as a portable catalog. Let's dive into it. Drawing inspiration from a great blog from

0:42 Jeremiah Hanssen, catalogs can be broken down into two main categories. Data governance catalogs: these are informational and help to centrally define governance policies across different databases and provide searchable metadata. The other one is data object catalogs: this one is operational, is used directly by data platforms and query engines to read and write data, and is often referred to as a metastore.

1:10 While informational catalogs can be used for operational purposes, this definition clarifies how they relate to databases or query engines. So an operational catalog is used directly by the engine to query data, whereas an informational catalog is accessed by people for documentation and dataset discovery. Sometimes the distinction between the two categories can blur, and features from one may appear in the

1:37 other, but those are roughly the two categories of catalog that you can find. In this video we'll focus on data object catalogs. If this sounds too confusing, don't worry, we'll get into an example of how this works just a bit later. So why are catalogs essential for future data platforms? Well, in the past, data systems combined storage, compute,

2:00 and the catalog; it was just a built-in feature. For example, if you were using Oracle for your analytics, you couldn't switch to a different compute: the storage, the compute, and the catalog were all fixed together. Since the days of Hadoop we've begun to separate storage and compute, and the Hive Metastore was the first open catalog to emerge from

2:21 this change. With strategies like data lake and lakehouse, we've leaned into open file formats like Parquet and Arrow, and more recently table formats like Delta Lake, Apache Iceberg, and Hudi. These new formats introduce features like ACID properties and others, including schema evolution and deletes. If you're confused about the difference between a data lake and a lakehouse, to put it simply, a data lake is a

2:49 centralized storage solution that holds the raw data in its original format, be it CSV, Parquet, or JSON, and it still leverages classic object storage like AWS S3. A lakehouse builds on top of this, adding table formats like Delta Lake and Iceberg that enable ACID transactions and schema management while still using plain old object storage. So

3:13 as we separate storage from compute, we need a common and open place to manage the table states in our data lake. Let's go over a simple example to understand why having a catalog is so important. When using a Parquet data lake, managing the catalog was relatively straightforward, since Parquet files are immutable, meaning they cannot be changed.

3:36 You simply scan all the Parquet files needed to represent a table. Given the following files on object storage, the contents of my table would be the union of the data from file1.parquet and file2.parquet. If there were updates or deletions to the data, new Parquet files would replace the old ones, and all we have to do is to

4:00 scan them again from the compute engine. The task is relatively simple: just read all the Parquet files. For that you have two options: you interact with a catalog, which organizes all the data so that you don't need to worry about file locations, or you directly scan the object storage, where you can scan

4:21 the Parquet files using the base path location; most of these engines understand Hive partitioning too. Then table formats like Delta Lake and Apache Iceberg came into the picture, and unlike basic Parquet files they support operations like update and delete. These formats are also designed to reduce the amount of compute needed when accessing the stored data; we'll come back to

4:45 that. So here is how they work: these table formats are still based on Parquet files, but they include additional metadata files. Let's say we make an update or a delete: instead of having to rewrite an entire Parquet file, the query engine simply adds a line to a metadata file, usually in JSON format. So here is what a Delta Lake folder might look

5:08 like. For instance, you have the Delta log, which contains the metadata as JSON files, and then you have the actual data, still in Parquet. But here is where it gets a bit complex compared to vanilla Parquet: if you just scan the data from file1.parquet and file2.parquet after an update or a delete transaction, you may not see

5:29 the table in its correct current state. An update or delete operation may have occurred, and the information about it is stored in the metadata JSON files without changing the actual Parquet files. Because of this, our query engine must use the catalog to understand the correct current state of the table. Catalogs become essential when working with these advanced table formats. Another

5:55 way would be to have the query engine decode these metadata files and reconstruct the correct view; however, this pushes the complexity back to the query engine. So let's talk about DuckDB now. DuckDB has its own file format: it's storage-efficient and it supports ACID transactions. It's one file that contains all the tables' data and metadata, and so

6:19 DuckDB can interact with many databases, like Postgres and MySQL, and file formats, like Parquet, CSV, Delta Lake, and Apache Iceberg, be it locally or over object storage: AWS S3, Azure Blob Storage, etc. It is therefore a great candidate for a portable catalog. Working with data, especially when doing data wrangling or ad-hoc analysis, can be a messy journey; anyone working in data has

6:46 probably experienced this at least once in their life. So how do you avoid this? Well, you could actually share one file that contains all the metadata information, ready to be queried with DuckDB, but without the actual data itself. Plus, the file is super light. So let's get our hands dirty and start to code. I just opened VS Code,

7:06 and I'm going to launch a shell using the DuckDB CLI. You can check the video description for more information about how to set up the CLI if you want to follow along, but basically all the commands I'm going to run here are available no matter which DuckDB client you use, be it Python, R, or whatever is your jam.
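The opening steps of the demo can be sketched as a short DuckDB session; the URL below is a placeholder, not the actual location of the demo file:

```sql
-- Attach a remote DuckDB file as a read-only database
-- (the URL is hypothetical; requires the httpfs extension for remote files).
ATTACH 'https://example.com/ducky_catalog.duckdb' AS ducky_catalog (READ_ONLY);
USE ducky_catalog;

-- List what the catalog exposes.
SHOW TABLES;

-- Query it as if it were a regular table.
SELECT count(*) FROM air_quality;
```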

7:24 I'll also share a GitHub gist with all the playground queries I'm using here, if you want to copy, paste, and play around. I'll start with an ATTACH command, which is going to attach a public DuckDB database. So I'm just going to do that, and then I'll show you all the tables included. You see that in the ducky

7:43 catalog database we have four tables, and basically we can start querying those tables, and that's pretty much it: they are full tables. So what is exceptional? Well, I actually have the file downloaded locally, and its size is about

8:01 268 kilobytes. So what is happening there? Well, as you may have guessed from the introduction, this DuckDB file includes all the metadata but no data at all, and we do this through views. I can still query them as if they were regular tables, and I'm going to show you the actual definition of those views. I'm using a

8:23 table function which contains metadata about the views. Let me quickly switch the output mode to line mode so that we can see the full definitions. You see here that the air quality view is based on a Parquet file on S3; we have customers, which is a folder containing multiple CSVs on S3; we have a table from a

8:49 Postgres database, which is actually coming from a Neon-hosted database; and then finally we have a nice Iceberg table stored on Google Cloud Storage. This is pretty extreme and mostly for the sake of the demo, but with just one simple, small file I can attach a lot of datasets together. Now let's talk about how

9:12 you manage access to those datasets. There is no secret stored in this file, and that's the neat thing, because we want to rely on the user's authentication mechanism to make sure that this user actually has access to the data. In our example, when I showed you how to query customers, that was actually from a public

9:33 bucket, but if you query a private bucket, the only thing you need is to leverage the DuckDB secrets manager: you can create a secret based on AWS keys or directly based on an SSO mechanism. The latter assumes that you first authenticate using the AWS CLI, so typically you run aws sso login with your profile name,

9:56 and you're going to log in; it's going to generate your credentials, and then within DuckDB you create a secret that will fetch those credentials. You can do a similar thing for Google Cloud Storage. For Postgres you can also use the secrets manager, but the only thing here is that we are just linking one table, and the

10:16 secrets manager works at the database level as of this video. So a neat trick that you can also use for other things is to create a view with the postgres_scan function, and you see here that I'm using an environment variable; this is basically going to fetch my local Postgres connection settings (database, user, etc.). So

10:38 roughly, the only thing you need to do within your current environment is to make those environment variables available, whether you do an export or set them within a CI or whatever: you make them available there. To create the view, let me put it over here so that it's a bit more readable: I'm

10:57 simply passing the variable name to the getenv function to use the environment variable. Note that you can also do that when you have development and production databases and you don't want to hardcode your database name within your query; that's also a neat trick with the getenv function. All right, so now basically we have this

11:18 ducky catalog database, which is super light. But how can we actually share a common catalog with other users? Well, we could put it on an object storage, but how do you manage permissions and concurrent writes on it? This is where a single binary file has its limits. MotherDuck supercharges DuckDB in the cloud, and there is this MotherDuck

11:41 share feature where you can create a database and give access via a link to authorized people. So let me show you how you can upload a local DuckDB instance to MotherDuck pretty quickly. The only thing you need is a MotherDuck token, and for that you can sign up at motherduck.com, go to the UI, and generate your

12:04 token. You then make this MotherDuck token available as an environment variable: export motherduck_token= followed by the value of your token. Another way to do this, if you have a MotherDuck account, is to just follow the authentication web flow: if there isn't any MotherDuck token environment variable present, the CLI will pop up a link that you follow,

12:30 you authenticate, and it's going to populate a token for you. Once you have the authentication set up, the only thing you need to do is another ATTACH command with the md: prefix, and because my MotherDuck token is already populated as an environment variable, that's it: now I'm connected to MotherDuck. If I run SHOW DATABASES again,

12:53 you see I have plenty of databases (don't judge me, it's a bit messy) available in the cloud. So if I run a query against those cloud databases, I'm now leveraging not only the storage of MotherDuck but also the compute of MotherDuck, and the DuckDB CLI is acting as a client. Coming back to what I want to do here:

13:13 I have this ducky catalog here, which is my local DuckDB database, and I want to create a cloud ducky catalog. You see I've done it already, so I'll do it a third time. How to do this? Again, really simple: I just do a CREATE DATABASE cloud ducky catalog (let me suffix it with "three") FROM the local DuckDB

13:35 database that I want to upload, and that's it, it's done. As you can see, this was super fast because, again, it's just metadata: it's just uploading 268 kilobytes, and my cloud database is there. Now you can explore the tables over there, and you can query using MotherDuck's cloud compute, which is pretty nice, especially if you have tables stored on

13:58 S3: you leverage the in-cloud network path between MotherDuck and, for example, AWS S3, versus your local DuckDB going over your own internet connection to S3. Let me just quickly show you the MotherDuck UI for a sec. You see that we have the cloud ducky catalog here, and you can explore the different table

14:24 schemas. So this one is hosted on S3; this one is also hosted on S3, that's the CSV; this one is a table on Postgres; and this Iceberg table is actually hosted on Google Cloud Storage. Finally, not only can you leverage cloud compute, but you can also easily start to share databases with other people. So

14:45 here I'm going to create a share based on cloud ducky catalog three, because that's our database. I'm going to make it public, so anyone with the link can attach it; you can also restrict it to your organization. And "update automatic" means that if I update this share with new tables, for example, the users that have attached this share can see the update

15:06 directly. So this is it: I run the CREATE SHARE command, and you see I get a share URL, and any other user basically just needs to do an ATTACH with this share URL to be able to start querying this database. In short, with this example we saw how to share one database that spans datasets across

15:27 multiple cloud providers, even a Postgres database (it could also be a MySQL database). We saw how easy it is to push it to the cloud, thereby leveraging the cloud network bandwidth, and also how to manage access with shares so that we manage updates safely. So what's next? Well, we saw that DuckDB's capabilities continue to grow, including experimental support for other

15:50 catalogs like Unity Catalog, and there is also an exciting GitHub discussion that explores DuckDB as a meta-catalog concept, where DuckDB could host child catalogs. There are also other discussions about potential features, including materialized views and a more flexible refresh mechanism for views, similar to external tables in other systems. So can DuckDB be the best open, portable

16:16 catalog? Well, we've seen it has serious potential as of today, and for the rest, we have an exciting future ahead, full of possibilities. In the meantime, keep quacking and keep coding. I'll see you in the next one.


FAQS

What is the difference between a data governance catalog and a data object catalog?

A data governance catalog is informational. It helps centrally define governance policies and provides searchable metadata for documentation and data discovery. A data object catalog is operational. It is used directly by data platforms and query engines to read and write data, often referred to as a metastore. While the distinction can blur, understanding this difference clarifies why catalogs have become important as data lakehouse architectures separate storage from compute.

Why are data catalogs essential for data lake and lakehouse architectures?

With basic Parquet data lakes, managing the catalog is straightforward since Parquet files are immutable. You just scan all files to represent a table. However, table formats like Delta Lake and Apache Iceberg support updates and deletes through additional metadata files, meaning the query engine must consult the catalog to understand the correct current state of the data. Without a catalog, simply scanning the Parquet files after an update could show incorrect results.
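The scenario above can be sketched with DuckDB; the bucket and table paths are illustrative, and reading through the table format assumes DuckDB's delta extension:

```sql
-- A Delta Lake folder holds Parquet data plus a _delta_log of JSON metadata:
--   my_table/
--   ├── _delta_log/
--   │   ├── 00000000000000000000.json
--   │   └── 00000000000000000001.json   -- records the update/delete
--   ├── file1.parquet
--   └── file2.parquet

-- Scanning the raw Parquet ignores the log and can return stale rows:
SELECT * FROM read_parquet('s3://my-bucket/my_table/*.parquet');

-- Reading through the table format applies the metadata log first:
INSTALL delta; LOAD delta;
SELECT * FROM delta_scan('s3://my-bucket/my_table');
```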

How can DuckDB be used as a portable data catalog?

DuckDB can work as a portable catalog by creating a lightweight database file (as small as 268 KB) that contains views pointing to data stored across multiple cloud providers and databases. These views can reference Parquet files on S3, CSV folders on S3, Postgres tables, and Iceberg tables on Google Cloud Storage, all from a single file. Because DuckDB relies on the user's authentication mechanism (like AWS SSO) rather than storing secrets, the file can be safely shared.
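A minimal sketch of how such a catalog file could be built and inspected, assuming DuckDB's httpfs, aws, and postgres extensions; all paths, bucket names, and the PG_CONN variable are illustrative:

```sql
-- Views in the catalog file point at external data; no rows are stored locally.
CREATE VIEW air_quality AS
    SELECT * FROM read_parquet('s3://demo-bucket/air_quality/*.parquet');
CREATE VIEW customers AS
    SELECT * FROM read_csv('s3://demo-bucket/customers/*.csv');

-- Inspect the stored view definitions.
SELECT view_name, sql FROM duckdb_views() WHERE NOT internal;

-- For private buckets, credentials come from a DuckDB secret, not the file;
-- CREDENTIAL_CHAIN reuses credentials produced by `aws sso login`.
CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);

-- The getenv() trick (DuckDB CLI) keeps connection strings out of the catalog:
CREATE VIEW orders AS
    SELECT * FROM postgres_scan(getenv('PG_CONN'), 'public', 'orders');
```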

How do you share a DuckDB catalog database using MotherDuck?

You can upload a local DuckDB catalog to MotherDuck with a simple CREATE DATABASE ... FROM command, which is nearly instant since it only contains metadata. MotherDuck's share feature then lets you create public or organization-restricted links that other users can attach to query the data. This also takes advantage of cloud network bandwidth between MotherDuck and services like S3, providing much faster queries than running from a local connection. Get started with MotherDuck for free.
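The flow can be sketched as follows, assuming a motherduck_token environment variable is set; the database and share names are illustrative, and the exact CREATE SHARE options should be checked against MotherDuck's documentation:

```sql
-- Connect to MotherDuck; the token is read from the environment.
ATTACH 'md:';

-- Copy the local catalog file to the cloud: near-instant, it's only metadata.
CREATE DATABASE cloud_ducky_catalog FROM ducky_catalog;

-- Publish a share that updates automatically as the catalog evolves.
CREATE SHARE ducky_share FROM cloud_ducky_catalog
    (ACCESS UNRESTRICTED, UPDATE AUTOMATIC);

-- Other users attach the returned share URL and start querying:
-- ATTACH 'md:_share/cloud_ducky_catalog/<share-id>';
```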

Related Videos


2025-12-10

Watch Me Deploy a DuckLake to Production with MotherDuck!

In this video, Hoyt Emerson will show you the fastest way to get DuckLake into production using MotherDuck's beta implementation. If you've been following his DuckLake series, this is the next step you've been waiting for!



2025-09-18

Ducks on a Lake: Scaling Data Lakes to Warehouse Performance

Explore DuckLake, a new open table format from DuckDB's creators, designed to bring warehouse-grade performance and metadata control to data lakes.



2025-06-12

pg_duckdb: Ducking awesome analytics in Postgres

Supercharge your Postgres analytics! This talk shows how the pg_duckdb extension accelerates your slowest queries instantly, often with zero code changes. Learn practical tips and how to use remote columnar storage for even more speed.
