Watch Me Deploy a DuckLake to Production with MotherDuck!
2025/12/10
Featuring: Hoyt Emerson
In this video, Hoyt Emerson will show you the fastest way to get DuckLake into production using MotherDuck's beta implementation. If you've been following his DuckLake series, this is the next step you've been waiting for!
Transcript
0:00 If you clicked on this video, then you probably know what a DuckLake is. If you don't know what a DuckLake is yet, I definitely suggest you go and watch the two other videos I made about DuckLake, where I go through the quick start and then get into a deeper dive of how you can start a local DuckLake
0:15 instance and do some cool stuff with it. But if you've gone through those and you're ready to take the next step, then this video is for you. Today I'm going to show you the fastest way to get a DuckLake into production, and it is with the help of our friends at MotherDuck.
0:33 MotherDuck has once again offered a very nice and easy experience, this time a beta implementation in their ecosystem that lets you create a DuckLake and read off of files in an S3 bucket, for example, like we'll be doing today. I wrote a Substack article about this as well, and I will put the link to it for people
0:56 who would prefer to read instead of watching a video. But today we're working in MotherDuck specifically, and you can see I've got a MotherDuck notebook up. If you're not familiar with MotherDuck, then you should definitely go and view the video where I review the MotherDuck UI in its entirety and go through my thoughts
1:17 on how I think the tool works. The TL;DR is I think it's really good, and the way that it's built is going to allow us to effortlessly create a DuckLake. So, there are a few things to consider here before you get started. One is you do need Parquet files in a storage bucket,
1:37 specifically S3 in the us-east-1 region. For this demo specifically, and for MotherDuck in general, they suggest you use S3 and put your data in us-east-1. From what I understand, their infrastructure is centered around us-east-1, so for optimized reads on their system, that's where you want to put your data. Now, not everyone is an S3 person. If you
2:00 look, you can find some workarounds for GCS and Azure. I myself tend to use GCS, but I've been forced to start an S3 account and use S3, and I find myself using it more often now. So, just keep in mind that for this, you do need to go do some work
2:20 and get an S3 bucket started in that region and get your data in there. You also need to set up a secret with access to that bucket, because we're going to be bringing the secret in and saving it here. So, as long as you have your S3 bucket set up, you have your secrets
2:36 built for it, and you have a few batches of Parquet files, you should be able to follow right along with what I'm doing here and enjoy.
2:45 Okay, let's jump over to MotherDuck now. Here, I have set up a notebook. Like I said, we're going to go cell by cell, run some things, and see what we have. In MotherDuck, we've got a left-side slide-out menu that shows us the databases we have attached to this MotherDuck
3:05 instance, and over on the right is an inspector that pops up if we're looking at a table or anything else we're zooming into. Okay, it's very straightforward to start a DuckLake in MotherDuck. I already have one called my_ducklake, but we're going to create a new one called managed ducklake for today's video. You can see here we just use CREATE DATABASE, give it a name, and add a type of DUCKLAKE.
3:33 Very straightforward. Let's run this cell. It says zero rows returned, and that's fine; we're not going to have any. But over here the managed ducklake shows up, and we now have our main schema created in the DuckLake. There is no data in there yet, but we are going to change that very quickly.
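For reference, the cell we just ran is roughly this shape (the database name is just the one I'm using in this demo):

```sql
-- Create a new DuckLake-backed database in MotherDuck.
-- MotherDuck builds and hosts the catalog (metadata) tables for you.
CREATE DATABASE managed_ducklake (TYPE DUCKLAKE);
```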
3:54 One of the things about DuckLake is that it is a lakehouse architecture, and what that means is it has metadata that is saved in some sort of persistent storage. In the previous videos, we actually used a local DuckDB instance to save the metadata, and that works really well, but it doesn't work quite as well
4:21 in a production state. Normally what you would do for production is use a Postgres database. And while I'm not completely sure what MotherDuck is using under the hood, the service they're offering is that when you create a DuckLake here, they build the metadata tables for you and store them themselves. That, I assume, is why
4:44 they also suggest you put your data in the same region as the metadata tables they're storing. So when you run CREATE DATABASE with the DuckLake type, it is going to create the metadata tables.
5:01 It does let you run the ducklake_snapshots function, though, and this is essentially running a SQL query that grabs a couple of the metadata tables and gives you an idea of what has changed over time.
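A minimal sketch of that call, assuming the standard DuckLake snapshots table function and the demo database name from the cell above:

```sql
-- List the snapshots recorded in the DuckLake catalog.
-- Each row includes a snapshot id, timestamp, schema version, and the changes made.
SELECT * FROM ducklake_snapshots('managed_ducklake');
```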
5:15 If we click on the DuckLake snapshots, we can see that we only have one here
5:22 in the snapshots so far, from when we created the DuckLake schema called main. So that makes sense.
5:29 But we'll be referencing this snapshot function multiple times down in this notebook. Let's keep going. So now
5:38 that we've got the DuckLake created and we've peeked at the snapshots to make sure main is there, which makes sense, we also see it in our file hierarchy.
5:48 Next, we're going to save the secret that we created for the S3 bucket. Now, obviously, I'm not going to put the real values in here, and I've actually already created this, but all you need to do is put in your S3 access key, the secret key, the bucket region, and then the path to the bucket. And when you run
6:07 this SQL, it will actually save the secret in MotherDuck. You don't have to do it via the UI, which is just really convenient. I didn't know this was possible or that it existed.
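A sketch of that cell using the standard DuckDB CREATE SECRET syntax, with placeholder values (the secret name and bucket are illustrative):

```sql
-- Save an S3 secret in MotherDuck so the DuckLake can read from the bucket.
-- Replace the placeholders with your own credentials and bucket path.
CREATE SECRET my_s3_secret (
    TYPE S3,
    KEY_ID 'YOUR_ACCESS_KEY_ID',
    SECRET 'YOUR_SECRET_ACCESS_KEY',
    REGION 'us-east-1',
    SCOPE 's3://your-bucket-name/'
);
```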
6:20 So when I saw MotherDuck add this in their original article, I was really impressed that you can do this: you don't have to get into the UI to save your secret. Another pro tip is that you can look at the secrets you already have using a select statement against the DuckDB secrets function. I'm
6:38 bringing in just the names, but you can see I've got a few for some things that I've created, and that's just nice and convenient. So in general, if you're ever having to use a secret and you're not totally sure what secrets you have so far, you can go through the UI or you can just run that select
6:55 statement and check it out.
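For reference, a minimal version of that lookup using DuckDB's built-in duckdb_secrets() table function could be:

```sql
-- List the names of the secrets currently saved for this account.
SELECT name FROM duckdb_secrets();
```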
7:01 Okay. So from here, we have created a DuckLake and brought in our secrets, which means we should have access to the S3 bucket we need in order to read the Parquet files. You can see here we're going to create a table in the DuckLake and call it streaming_data. So within the managed ducklake, we'll call this streaming_data, and all I'm going to do
7:22 is select everything from a Parquet file of synthetic data
7:30 from the date of 11/27. Keep in mind this bucket is just a dummy bucket, so I don't really have a problem showing the URI, but yours will look something very similar to this. There are different ways for you to organize the data; I just have it this way.
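The cell is roughly this shape; the bucket name and file path below are placeholders, not the real URI from the video:

```sql
-- Create a DuckLake table from one day's Parquet file in S3.
-- The path is illustrative; point it at your own bucket layout.
CREATE TABLE managed_ducklake.main.streaming_data AS
SELECT *
FROM read_parquet('s3://your-bucket-name/streaming/2025-11-27/data.parquet');
```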
7:50 Let's run this and create the table. We do have a row output at this point, 9,120 rows, and I know the data set, and that is the correct number of rows. Let's get over to the file tree here. We now have streaming_data showing up under the main schema. So, if I click on this in MotherDuck, I can see 9,120
8:11 rows, which is what we want. And I can work through the schema here and get a view of some basic high-level metrics for it. It's the usual stuff we've seen in these types of tools, but it's nice because it gives us a validation
8:31 layer as well. So we have created a table called streaming_data and we have brought in a day's worth of data that represents 9,120 rows.
8:43 This is all pretty straightforward so far. Where the production level comes in is, well, one, the fact that you are not having to manage any of the metadata tables right now. MotherDuck continues to store changes and updates and keeps them in, I'm assuming, Postgres databases. All you have to do is basically call
9:08 the data, add it to your data set and your table in the DuckLake, and then you can look at the snapshots again. We see here that after creating this table we now have two entries in our snapshots, and again, these are coming from the metadata tables that MotherDuck is hosting and handling. But we see right
9:26 here there's our main streaming table added, so we are allowed time travel and schema evolution when we use this type of snapshot. And while I don't have that represented in this notebook, it is something to note: we can go back to different snapshot IDs and schema versions to see previous versions of how the data was. So this is
9:54 just one of the really cool features of the lakehouse, and I think of DuckLake in general. So great, we've got our table created.
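As a rough illustration, going back to an earlier snapshot typically looks something like this in DuckLake (the syntax and version number here are assumptions, not something run in the video):

```sql
-- Read the table as it existed at an earlier snapshot version.
SELECT *
FROM managed_ducklake.main.streaming_data AT (VERSION => 2);
```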
10:06 Let's actually look at the data now. So, just a simple FROM; we're going to select everything. And we see the created date of 11/27.
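That query is just DuckDB's FROM-first shorthand, something like:

```sql
-- Select everything from the DuckLake table using the FROM-first shorthand.
FROM managed_ducklake.main.streaming_data;
```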
10:15 My created_at UTC timestamps
10:20 are a little wonky and run into the next day; that's just a happenstance of the way the date is created. But 11/27 is the created date, and that's what we're going to lean into. You can see a number of data points here. I use dlt to actually load the data into S3, which is why you
10:36 have a dlt load ID connected to it. But I also just want to note that this data is a synthetic data set that I have running on a pipeline daily.
10:48 It's just very convenient to have something that creates dummy data for you so you can always try some tools out; I really highly suggest it. I have a Substack article on how I build those data sets as well, and I can add that in the description, too. Okay, so what we've done
11:07 to this point is we've created the DuckLake in MotherDuck, we've added the secret so we have access to the S3 bucket, and we've brought in the initial data from an S3 Parquet file to create our table called streaming_data.
11:22 Now let's just say it's the next day; we'll say it's 11/29. We're always going to grab the data from the day before, so that's 11/28 (previously we loaded the 11/27 data). A whole day of business has passed, and now our pipeline has spit out
11:40 data for 11/28 into our S3 bucket. We'd like to append that to the current table that we have in MotherDuck. It is
11:50 incredibly easy: you simply call ducklake_add_data_files. Now, there are some different ways to do this; this is a very rudimentary, fast way to show you, and it's an append of new files, just a batch append. A lot of times that's all you're going to have to worry
12:08 about. But make no mistake, DuckLake does support upserting, like merging and inserting, as well. I just don't use it that way.
12:21 That's a little more how a transactional database works, where you need to update a specific record, say for a customer. But in analytics, most of the time I just want to get everything that happened the day before, append that information into a data warehouse, and then I can
12:42 clean things as needed or sort and order things as needed. So this handles most of your use cases. All you do is call it: you name the DuckLake, you name the table, and then you point to the Parquet file that you want to append to the current table.
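A sketch of that append call, assuming a ducklake_add_data_files signature of (ducklake name, table name, file path) and a placeholder S3 path:

```sql
-- Register an existing Parquet file in S3 as new data for the DuckLake table.
-- This is a pure batch append: no existing rows are rewritten.
CALL ducklake_add_data_files(
    'managed_ducklake',
    'streaming_data',
    's3://your-bucket-name/streaming/2025-11-28/data.parquet'
);
```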
13:01 And so if we just run that, it says zero rows returned. However,
13:06 that's because we are just calling a function. If we refresh the table here (I actually ran it twice, which I thought I might have done), it's 27.4 thousand rows now. So the 11/28 data got in there twice. Not a problem. But that being said, you can see that's all you
13:26 have to run to get an append added to it. So if we now go to the snapshots, we see
13:35 right here that there were two table inserts. So there are two inserts.
13:41 I sort of ghost-ran that cell and then ran it again; I'll probably just use the play button next time.
13:48 But this is actually a very good example to show the way the snapshots tell you the behavior lineage.
13:57 So, let's think about where this would even come in handy. Well, hey, you're supposed to be getting roughly 9,120 rows a day. So, you would be expecting a
14:10 little over 18,000 rows, but now you've got 27.4 thousand rows, and that count is off.
14:17 Well, we look at this and we can say, ah, there was a double insert into the database. Now, one thing that we could do to eliminate that problem is to use an ACID transaction that requires that date of data to not be in the data set yet. But we will get into that later. I think
14:39 this is just a really cool accidental way to show that we can decipher that,
14:45 hey, this version of the schema has the first initial insert and this one has a second insert, and that actually explains why we have double rows for that day. If we go down here, we see 11/27, then we see 11/28, and then we're going to see a lot of 11/28.
15:11 If I do a DISTINCT,
15:15 we see the 18,240 rows. So, one of the things that we could even do right here, and I won't do it in real time, is go back a version in the schema and view that. I could also just recreate the table here as a DISTINCT.
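The distinct check I'm describing is roughly this (same illustrative table name as before):

```sql
-- Count distinct rows to confirm the expected 18,240 unique records
-- after the accidental double append.
SELECT COUNT(*) AS distinct_rows
FROM (SELECT DISTINCT * FROM managed_ducklake.main.streaming_data);
```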
15:39 So the fix is not much of an issue, but I think it's really cool to be able to see that. So, we've got appended data there; we've got the next day available. And let's say we do want to get in some logic
15:54 that says, hey, we don't want data being added if it's already in there. So, I have an ACID transaction set up down here.
16:04 A lot of people, I think, act like they know what ACID transactions are, but when I have asked before, it turns out most people don't. I think what's actually the case is that ACID is a term that's a little old, slightly outdated, but it references something that we already usually know as data people who are
16:25 engineering pipelines and working on data sets: we want a set of criteria to be successful before we move forward with adding data to a data set. And that's really what an ACID transaction is doing. It's essentially saying that a set of criteria, a number of functions or steps, need to successfully happen
16:50 before the data will be added to the data set. The example that's always given is transferring money. If you were transferring money to someone, you would first want to validate that that amount of money exists in your bank account. You would then want to validate that the bank account and routing number you are sending it to are valid.
17:07 And then you would want to validate, and have a successful response, that the dollars did actually get sent over to that bank account. When that whole set of criteria is successful, then you would say the transaction was a success, and you would ultimately respond to the sender and the receiver that this was
17:32 successful. So, it's a really good mental model to think about when it comes to an ACID transaction. What we have here is I am bringing in the next day of data, 11/29, and I'm asking to
17:47 validate first that that date does not already exist in the previous data. So, kind of coincidentally, what I did above would have been fixed by having this ACID transaction here.
18:00 But what I'm working off of is the created_at. I'm saying I want the minimum created_at of this new data to be greater than the maximum created_at of the data that's currently in the DuckLake table. And what that means, in terms of what this data is, is that if that's true,
18:18 then this data is fine to append; there won't be any duplicate rows. If it's not true, then we want to error out and say there is out-of-order data detected (that could be a different error message). And then there's also this idea of rolling back the transaction. It does not automatically roll back; you
18:36 have to manually run the rollback. But that being said, we're not going to worry about that; just understand it doesn't automatically roll back here.
18:45 If the CASE
18:49 WHEN statement returns one, then we will call the DuckLake add-data-files function again like we did before, but we will add the 11/29 data, the
19:00 new data, into the table. So it's taking the new data, making it a temp file, then comparing that temp file to the current managed ducklake streaming_data table, and returning a one if there isn't going to be any duplication, or an error if the
19:18 created_at timestamps are not distinct from each other between the max and the min.
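Here is a rough sketch of the shape of that transaction, with placeholder paths, a temp view for staging, and DuckDB's error() function assumed for the failure branch; it is not the exact cell from the video:

```sql
-- Only append 11/29 if its timestamps are strictly newer than what is already loaded.
BEGIN TRANSACTION;

-- Stage the new day's file as a temp view so we can inspect it first.
CREATE OR REPLACE TEMP VIEW new_day AS
SELECT * FROM read_parquet('s3://your-bucket-name/streaming/2025-11-29/data.parquet');

-- Return 1 if the new data is strictly newer; otherwise raise an error,
-- which aborts the transaction until you roll it back.
SELECT CASE
    WHEN (SELECT MIN(created_at) FROM new_day) >
         (SELECT MAX(created_at) FROM managed_ducklake.main.streaming_data)
    THEN 1
    ELSE error('Out of order data detected')
END AS ok_to_append;

-- If the check passed, register the new file in the DuckLake table and commit.
CALL ducklake_add_data_files(
    'managed_ducklake',
    'streaming_data',
    's3://your-bucket-name/streaming/2025-11-29/data.parquet'
);

COMMIT;
```

If the check fails, the error aborts the transaction, and as we're about to see, you have to issue the rollback yourself before starting a new one.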
19:24 So let's go ahead and run this.
19:29 All right. You can see it ran again with no error messages, but we've got zero rows returned. I'm going to come over here again and refresh, and I do now have roughly another 9,000 rows added. If we come down to this
19:48 cell, I've got an aggregate query here, and you can see we've got the doubling of 11/28, but we have 11/27, 11/28, and 11/29 added. I will add a DISTINCT to that just to get the right numbers, but basically all these dates are being referenced now. So that's great, and we were able to do that very effortlessly.
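That aggregate query is roughly a per-day row count, something like this (the timestamp column is assumed to be created_at):

```sql
-- Count rows per created date to spot the doubled 11/28 load at a glance.
SELECT CAST(created_at AS DATE) AS created_date,
       COUNT(*) AS row_count
FROM managed_ducklake.main.streaming_data
GROUP BY created_date
ORDER BY created_date;
```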
20:04 Now let's try something. Let me go back up to the ACID transaction here, because I want to
20:26 show rollback. Let me put 02 in the file path. I know
20:30 that this does not exist in my Parquet files, so this should give some sort of error, meaning, hey, there's probably a typo or something, maybe something got switched, and basically the transaction should error out. Let me run this. Great. So we get an error here.
20:51 All right: unable to connect to that file. So that's an error out, and essentially the transaction ends. Now I'm going to run this again, and I should be told: aha, transaction context error, the current transaction is aborted, please roll back. So when you get an error message within your transaction, the
21:16 transaction is aborted, and to start over in a new transaction, you have to roll back. You can set this up programmatically, but I'm trying to show it within the steps. If I run this cell, the ROLLBACK here, that does roll this back.
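The recovery step is just the standard statement:

```sql
-- Abort the failed transaction so a new one can begin.
ROLLBACK;
```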
21:35 And if I go into the path again, there is a 30 in
21:43 my S3 bucket. Let me run that.
21:48 I've got zero rows returned. Let me come over here and check that, and let me accept that. Cool, I've got all the dates accounted for here now. So you can see I tried to do a transaction, but the file was wrong, so I had to roll
22:08 it back. I made the update, and then it did successfully commit. Let's now look at the snapshots again after all of that happened. You've got the schema being made, you've got the table being created, and then you've got one, two, three, and four inserts in the snapshots.
22:29 Now, the transaction that failed is not part of the snapshots; it wasn't successful, so we're not going to see it. But we do see when one, two, three, and four were added in there. And initially, when we created the table, there was an insert. So this was the
22:49 first date. This one and this one are where I ran them back to back; you can see there are two seconds between them, which is why that metadata is really useful to have and can help you determine issues like this. So this was the doubling up of the 11/28 data.
23:05 Then I did a transaction for 11/29, then there was a failed transaction, I rolled that back, ran it again, and 11/30 went in. So at this point, this is how you simply build and interrogate a DuckLake and the data that's in it. From here, you could go over and set up something like a dlt
23:28 pipeline that's going to actually automate this, and then you can be in the warehouse just reading it like any other warehouse you have. The data will just make its way in; it'll append via these calls. Keep in mind, all of these calls I have here can just be run as SQL in Python with Duck
23:49 DB, and then you can orchestrate it with dlt and, say, Orchestra, which is a great orchestration tool. You can use that, and it all just happens in the background; you're just coming in and reading the data, but you're also offered the snapshots and some of the insight from the metadata. So if you
24:07 do see things that are wrong, and problems, you can roll back, and you can do some time travel. These are more advanced methods that I am going to get into, but today I just wanted to show you how to get something up and running in a production data warehouse like MotherDuck. And I think what MotherDuck has offered here is just a
24:23 fantastic opportunity to try it out for yourself. Stay tuned for something that might be getting built on my end to let you look at metadata. All that being said, I hope you really enjoyed this. I love what's happening with DuckLake, and MotherDuck is
24:43 right there in parallel offering some really great stuff. I think the speed at which DuckLake is moving is super awesome, and I'm really excited to just keep building off of it. I had my aha moment with DuckLake some six, seven months ago, and I've been smitten with it ever since. I've really
25:01 fallen in love with the lakehouse architecture; I really do enjoy it. So that's all for today's video. If you liked what you saw, please like and subscribe. I also have a Substack where I write articles very similar and closely related to the videos I do here on YouTube, in case you like to read and copy and paste. You can go
25:18 over there, and I'll have all those links in the description. And as always, everyone, I hope you enjoyed the video, and have a great day.