How to bootstrap a Data Warehouse with DuckDB

2024/08/19
Featuring: Guen Prawiroatmodjo

Guen Prawiroatmodjo, Software Engineer at MotherDuck, shares a talk at SciPy 2024 on how to bootstrap a data warehouse on your laptop using open source, including ETL, data pipelines, dashboard visualization, and sharing via the cloud.

0:05 Really happy to see Guen give a talk about QuackDB and how to make a... oh, DuckDB, I'm sorry, it's DuckDB. So take it away, Guen, let's give her a hand. Thanks

0:18 everybody. What a quack, right, Paul? All right, so thanks everybody for coming. My name is Guen Prawiroatmodjo and I'm going to talk to you about how to bootstrap a data warehouse with DuckDB. I'm here together with my colleagues Alex Monahan and Elena Felder; you'll recognize them by their beautiful duck shirts. We're also here

0:41 with the duck tub. It's a duck-shaped tub with duckies in it; it's at the door, so if you would like to grab one, feel free. And we gave a tutorial on Tuesday, if you want to check that out; the link is up on the slide. There we taught you all the SQL that you need

0:59 to know as a data scientist. So first of all, what is a data warehouse? A data warehouse is a place used to collect, store, and collaborate on your data. It's a database system with a query engine that you can use to collect all your data from all your different data sources, with the purpose of collaborating on it

1:22 with your colleagues and sharing it, and it can be a source for data analysis or other downstream data processes or visualization. Now, you might wonder, why would I need a data warehouse? I don't have big data. Well, our CEO Jordan Tigani wrote this blog post, "Big Data is Dead." What he means by that is, of course,

1:45 it's not dead dead, but ten years ago it was what everybody believed was the future. Jordan at the time was at Google; he was one of the founding engineers of Google BigQuery, which is a big data product, and he evangelized

2:02 this message: by the year current-year-plus-some-epsilon there will be an unfathomably large amount of data generated, spooky spooky spooky, your infrastructure will fall apart, buy our product, we will save you. So that was kind of the story. But what he found in that last decade working in big data is that actually the vast majority of enterprises don't have big

2:25 data; they have data warehouses, maybe smaller than a terabyte. You see this graphic, it's kind of like a power law or Pareto distribution. And even worse,

2:36 the customers that did have these giant data sizes almost never queried that data all together, maybe once a year when doing a big aggregate report or something like that. Most of the time, like the 99th percentile and below, it's maybe 10 gigabytes, which is not big data. So what about big data in science? We saw the keynote this morning on

2:59 the ATLAS experiment at CERN. Okay, that's big data, that's exabytes of storage. It's a little bit special: all this data is stored on tape, it's not something you'd run in an enterprise cloud database, but this is some solid big data, and I'm not going to say that

3:16 that is dead; that is absolutely alive. However, let's look at other big data sets. For example, there's Cornell's ornithology department's eBird; for the birders out there, you might be familiar with this website. It's a citizen science project: if you're a birder, you can go there and submit your checklists, and it has data from all over

3:38 the world of bird observations, going back up to about 50 years. And that whole data set you can actually download: it's about 40 gigabytes compressed, and if you decompress it, it takes about half a terabyte on your computer. So it's still something I can run on my computer, so I wouldn't really call that big data, even though

3:59 it's still a lot of data. And then Hugging Face is one of the biggest websites for machine learning models and data sets out there; actually over 99% of those data sets are below 5 gigabytes, which again is not big data. So here's a plot that I made earlier this week of just the data sets you can find

4:18 on their website: I took the first 42 pages of their data set browser, and as you can see, the vast majority are under a billion rows. So your question might be, okay, then why do I need a data warehouse or a database? Well, at MotherDuck we actually believe that databases are extremely valuable

4:40 for small to medium data workloads. First of all, you can go much faster than pandas, and you can handle larger-than-memory data. I mentioned the eBird database, right? The 500 gigabytes of bird data: try and load that into a pandas DataFrame; that's not super fun. There are a lot of other tools out there, of

4:59 course, to be able to do that, but databases can do that really, really well, and they've been built specifically for that purpose. Then, SQL is a beautiful language. I actually learned SQL before I learned Python. It's declarative: you just say what data you need, and the query engine figures out for you what to do, how to do it, and how to optimize it.
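As a rough illustration of the larger-than-memory point, here is a minimal sketch; the file path and column names are placeholders, but it shows the pattern the talk describes: DuckDB streams the scan (and can spill to disk), so the aggregate works even when the raw data doesn't fit in RAM.

```python
import duckdb

# Placeholder path: a directory of observation CSVs larger than this
# machine's RAM. DuckDB streams the scan, so the aggregate still works.
con = duckdb.connect()
top_species = con.execute("""
    SELECT species, count(*) AS n_observations
    FROM read_csv_auto('observations/*.csv')
    GROUP BY species
    ORDER BY n_observations DESC
    LIMIT 10
""").fetchdf()
print(top_species)
```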

5:19 Oh, folks in the back, feel free to take a seat; there's some more space over there. One other really great thing about using a database is that you can package all your tables into one file. For DuckDB, which I'll talk about shortly, it's just so much easier to put everything together instead of CSV hell, you know, where

5:40 you have table one, table two, table two final, table two final really final December 2023... yeah, you know where I'm coming from. And another advantage of using a database that implements SQL is that you can manage data pipelines with modern data engineering tools; I'll give some examples later, like dbt, for example,

6:05 which is very convenient. So now, back to DuckDB.

6:12 DuckDB is a lightweight, in-process SQL analytics engine that is taking the data world by storm. It's been developed at CWI, the center for mathematics and computer science in Amsterdam, the same place where, way back when, Python was developed. It was developed in the group of Peter Boncz by Hannes Mühleisen and Mark Raasveldt, who have started a company called DuckDB Labs that

6:35 we at MotherDuck collaborate with. And it's actually quite awesome, it's really taking off. It's MIT licensed, and there are clients in over 15 languages; I've listed some on the left here: Python, WASM, R, there's a CLI client, it works with Rust, Java, JavaScript. There are about one million downloads a week just on PyPI, and, you know, I

7:00 took this info from an older slide, and it used to be one million downloads a month just a few weeks ago, and now it's one million a week, so it's really taking off. It supports a lot of different data formats, so it's very similar to pandas in the sense that you can very easily load data, and it's very fast,

7:18 and it has great interoperability with the tools in the Python ecosystem like Arrow, pandas, NumPy, and Polars. It can read SQLite files, and it works really great with cloud systems: for example, if you have a bucket with data on AWS or some other blob storage, you can read it with DuckDB. And

7:39 check out the DuckDB stats site on the bottom left; if you are watching this talk a few weeks from now, the number that I'm quoting here will probably have changed. There's a live version of the plot on the top right that my colleague Mehdi put together, and it's really cool actually, so check that out, and it runs

7:56 on some DuckDB. So, just a little bit more about databases, for those of you who are familiar with Postgres or MySQL or SQL Server: the typical architecture of a database is one of a client and a server. I have my application that runs a client, and it connects over the network to some server, like a Postgres

8:19 server, like that little elephant guy on the right. It's a monolithic application: it hosts the data and a query engine, which are separated from your application, and the network is slow and untrusted. Actually, that network connection between Postgres and your application can be slower than just downloading a CSV. So it's very old

8:40 technology, very stable of course, and there are some really good use cases for it, but for the modern data analytics world it might not be exactly what you need. You connect to it with a host and a port and a username; it's all a lot of overhead. So the beautiful thing about

8:55 these in-process database architectures like SQLite or DuckDB is that everything lives in the same process, so there are no trust boundaries to traverse, it's super high bandwidth because there's no internet pipe or anything to traverse, and it's serverless. Serverless is a bit of a technical term, but what it means is that I don't think about my database as being a

9:18 separate entity somewhere on the internet. I just connect to it with DuckDB, as you see in this example here below: you specify a file name, duckdb my_db.duckdb, and I'm off to the races querying SQL.
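In Python, that looks roughly like the sketch below (the file and table names are just placeholders):

```python
import duckdb

# Open (or create) a persistent DuckDB database stored in one local file.
# No server, host, port, or credentials: everything runs in this process.
con = duckdb.connect("my_db.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS ducks (species VARCHAR, beak_length DOUBLE)")
print(con.execute("SELECT count(*) FROM ducks").fetchone())
```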

9:39 So now, what is MotherDuck? MotherDuck is a serverless DuckDB platform. We aim to enable low-cost, low-friction analytics that will scale to support more complex apps and data. And when I say serverless, it works very, very similar to DuckDB; in fact, anywhere you can DuckDB, you can MotherDuck, that's one of our mantras. So if you connect to DuckDB in the CLI, for

10:03 example, that's what that would look like. If I now want to connect to MotherDuck, all I have to do is replace that connection string that points to my persistent database on disk with md: and then the name of my remote database.
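In code, that swap is just a different connection string; a minimal sketch in Python (assuming a MotherDuck account and an auth token in the motherduck_token environment variable; the remote database name is a placeholder):

```python
import duckdb

# Local, persistent DuckDB file:
local_con = duckdb.connect("my_db.duckdb")

# The same code against MotherDuck: only the connection string changes.
# Assumes the motherduck_token environment variable holds a valid token;
# "my_ducky_dwh" is a placeholder remote database name.
cloud_con = duckdb.connect("md:my_ducky_dwh")
print(cloud_con.execute("SHOW TABLES").fetchall())
```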

10:21 What that does, as you see in the picture here: if you've got your application with your DuckDB, it creates or allocates another DuckDB instance in the cloud that's going to talk to your DuckDB instance on your laptop. And since this is SciPy and there are a lot of physicists in the audience, I'm going to use lasers to explain this. So let's say your DuckDB on your local machine and your

10:41 DuckDB on your cloud instance have laser eyes, and they talk to each other with lasers over the network. But the beauty of it is you don't notice any of that; it's all abstracted away. Everything happens both locally and remotely, but again, you use that declarative SQL querying language, it automatically uses a cache, and I already

11:05 mentioned serverless. So that query is optimized. I can tell you a little bit more; just to demonstrate with an example, let's say again you have that bucket in the cloud, say on AWS S3 storage. If you, you

11:20 know, connect to that with DuckDB, which you can, you would run a query like this, like on the bottom: SELECT * FROM read_parquet. What that does is it looks at all the Parquet files, so I do star-dot-parquet, which I think is already super cool that you can just do that.
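A sketch of that kind of query from Python (the bucket path is a placeholder, and a public bucket is assumed; private buckets would also need S3 credentials configured):

```python
import duckdb

con = duckdb.connect()
# The httpfs extension lets DuckDB read directly from S3/HTTP object storage.
con.execute("INSTALL httpfs; LOAD httpfs;")
rows = con.execute("""
    SELECT *
    FROM read_parquet('s3://my-bird-bucket/observations/*.parquet')
    LIMIT 100
""").fetchdf()
```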

11:39 It scans all of the Parquet files in your bucket, and then this query just gets the top 100 rows, right? So what that does is: I make a request and I get the data back. Let's say I'm somewhere with maybe a 100 megabit per second internet connection, and most of the data crunching, the munch-munch here in this picture, happens on your

12:01 laptop, right? DuckDB has some optimizations here, so it doesn't download all of those Parquet files and then run the query; it does actually do some subsetting, so not all of it is sent over the wire. But there's still a pretty big chunk of data that goes over that not-super-fast internet connection. So, enter

12:20 MotherDuck. The nice thing about MotherDuck is it's all deployed in the cloud, so we can take advantage of the 100-gigabit-per-second speeds between services in the cloud. So instead of that initial query, that initial data transfer, happening between your laptop and the cloud, now that's happening between your

12:43 MotherDuck DuckDB instance and your S3 server. So now some

12:51 of the munching, some of that data processing, is happening in the cloud, so that only those 100 rows that you requested are actually being sent over your slow internet connection. I've illustrated this again, because a lot of us are physicists: the green laser is higher energy than the red laser, so it's a green laser for more

13:10 data. The advantage here is you've got pretty fast ingestion of cloud-based data, a lot of that work you can outsource, and this is all, again, abstracted away. I don't have to think about this; there are a bunch of really smart computer scientists who figured out how to make this all very, very optimized. All you need to do is

13:31 just say, declaratively, this is the data I want, and it will show up. To go into this a little bit more, we've got this EXPLAIN statement in DuckDB SQL; EXPLAIN is going to show you what the query plan looks like. So this is a very simple query, right: SELECT * FROM read_parquet with a LIMIT.
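For instance, on plain DuckDB it looks roughly like this sketch (bucket path again a placeholder):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# EXPLAIN returns the query plan instead of executing the query. Plain DuckDB
# shows the operator tree; with MotherDuck's hybrid execution, operators are
# additionally marked as local (L) or remote (R).
plan = con.execute("""
    EXPLAIN
    SELECT *
    FROM read_parquet('s3://my-bird-bucket/observations/*.parquet')
    LIMIT 100
""").fetchall()
print(plan[0][1])  # the rendered plan text
```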

13:50 So I've got only two steps here: I've got READ_PARQUET and LIMIT, right? That is on DuckDB. Then on MotherDuck, because we have that dual execution engine for those hybrid queries, there are now going to be these L's and R's in my EXPLAIN statements. So some of the parts of the

14:09 query are happening remotely, that's the R, and some of them are happening locally. You can see here the READ_PARQUET, the LIMIT, and the download are all happening remotely, and then the last download is happening locally. So just to give you an example: my colleague Elena here, who lives in New York City with a very fast

14:25 internet connection (you know, with all the stock exchange trading you need fast internet), is also closer to the MotherDuck service. So I've run this on DuckDB without MotherDuck, and it takes about 300 seconds, which is still pretty good, I would say, if you're doing a

14:45 select like that. Then on MotherDuck it takes about 40 seconds, so pew pew pew pew, lasers, it's all very fast.

14:56 So we've got 10 minutes left. If you want to learn more about this, our colleague Stephanie talked about it in a recent Quack & Code episode on dual execution, and there's also a paper about it in more detail. So now, back to data warehouses: now that I have a really cool in-process database, how do I build a data

15:15 warehouse out of it? Let's extract, load, and transform some data into it. DuckDB and MotherDuck have a pretty great ecosystem of tools; here are all these different kinds of ingestion, data science, and business intelligence tools, and all of them have at least one open source option that you can use. I'm going to show you how to do that with DuckDB

15:37 and only open source tools. Step one is you just build your data pipeline: you come up with some analysis and some data crunching. Then step two is you share it, and that's something you can do with the tools I'm going to be showing you. And then step three is collaborate, and that's the only step that

15:55 we'll need MotherDuck for, but one and two can all happen just on your laptop with free tools. So I've created a cookiecutter for this; here you see the commands you can use to install it and to run it, and I'll just show you a really quick demo of what that looks like.
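The exact commands are on the slide and in the cookiecutter's README; roughly, it is the standard cookiecutter flow, sketched here via cookiecutter's Python API with a placeholder template URL:

```python
# pip install cookiecutter
from cookiecutter.main import cookiecutter

# Placeholder template location; the real repository link is on the slide
# (the bit.ly link shown later in the talk). This prompts for name, email,
# GitHub username, project name, and license, then generates the project.
cookiecutter("https://github.com/your-org/cookiecutter-duckdb-data-warehouse")
```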

16:12 So what tools are we using for this cookiecutter? I used a data set called AVONET, which has morphological, ecological, and geographical data for all birds. What I'm going to do is some quick ELT on that to get all the birds that are ducks; then I'm going to store that data in DuckDB, and in MotherDuck for the last

16:30 collaboration part, and then I'll make a quick data visualization with Plotly and Jupyter Book, and publish it on GitHub Pages. So yeah, let's go to the

16:45 demo. So let me see... here we go. So I've

16:50 already created a virtual environment, but if you go to the cookiecutter page it explains how to make a virtual environment and how to install all of the dependencies. So I'm just going to run this cookiecutter... yes, okay. Then it's going to ask for your full name; sure, I'm Ducky McDuckface, why

17:12 not. My email address, my GitHub username (this all just gets linked in), my project name, so I'm going to call it my ducky data warehouse, and then just a quick project

17:25 description, and there we go, and I can pick a license. So now my cookiecutter is cutting the cookie for me. Now it's created a folder, ducky-data-warehouse, and we go in there. So here is the README, which kind of just has the text that you put in your description. Then you go into your

17:46 Jupyter Book folder; there's the config and the table of contents. I've got an example in here, and the way you would build this locally is you would run jupyter-book build . in the book folder. What that does is it actually runs all the notebooks in your example folder. I've got a few notebooks, so I'm kind of

18:06 using Jupyter Book here both as a data pipeline tool, to execute my extract, load, and transform steps, and also to generate a report, so the data is the report, basically. As it's running here, it generates a summary at the end. As you can see, I've named all the IPython notebooks with 1, 2, 3, 4;

18:30 that's kind of to enforce the order, because Jupyter Book is going to run these notebooks in alphabetical order. All right, now I can open this here in my local folder. As you can see, I'm just fully local: I've got my little ducky data warehouse, got my index here. So yeah, here's the Iris data set... wait a

18:50 minute, those aren't irises, those are ducks! Oh yeah, of course, thank you, bigger, more ducks. So this is inspired by the classic Iris machine learning data set. I've taken a few different duck species that are easy to distinguish and took the measurements from the AVONET data set. So tarsus length is like the length of the

19:17 leg, then the beak length, and then the size is the beak width. You see the Egyptian goose is actually a duck, even though it's called a goose, so you learn something today. Then, yeah, more plots. This is kind of my summary report that I made; it's just a markdown file, and all the Jupyter cells

19:37 are hidden. Then my colleague Till, who's leading the machine learning team at MotherDuck, made a little, you know, 101 machine learning model for us using this data set, and it looks like it's about 80% accuracy, so that's pretty good given that I only have five or six data points

19:59 per duck species. And then I've created a data share, so this is now shared on MotherDuck, and I can actually attach it; if you want, you can try this out and attach it yourself. So just to go into this Jupyter Book a little more, this is the Jupyter notebook I used as

20:16 the first step. This crunches all the data: it loads the raw data from the bird data set, it gets the ducks, it does the join, and then, yeah, I do a little bit of data analysis here to make sure that I actually picked the right ducks, and then I save it to a database.

20:34 So here I create a table called duck_iris. If I actually go back into my terminal here and then into my example

20:43 folder, I see that it actually created a DuckDB database called my_data. So now all the data from my little pipeline has been stored in that DuckDB file, which is really nice, because now I can leave these CSVs behind; this is now history.
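The gist of that save step, sketched in Python; the CSV path and the duck filter are placeholders for whatever the notebook actually does, but the table and file names follow the demo:

```python
import duckdb

# Persist the prepped data as a table inside one DuckDB file, so downstream
# notebooks can read it without dragging CSVs around.
con = duckdb.connect("my_data.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE duck_iris AS
    SELECT species, tarsus_length, beak_length, beak_width
    FROM read_csv_auto('avonet_raw.csv')
    WHERE family = 'Anatidae'   -- keep only the ducks (placeholder filter)
""")
```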

21:04 Now I can go into that notebook that my colleague Till put together. I connect to my database where my prepped data is, and I can do some more prepping here: you know, I've got my target mapping, so I map the species onto a number, I create my machine learning table, and I do some sampling. I'm not going to go through this in detail, but basically now I can create my

21:24 model and do my predictions, and see if, based on these measurements of birds, I can predict what kind of duck it is. Then I did a little PCA analysis. So as you can see, I can just iterate on all my data, I can modularize my data analysis into different files without having to drag those CSV files with me, I

21:44 can pre-process, and all my steps are now in my database. So there's some PCA, the different components of the duck measurements. And then, if I want to create a share on MotherDuck, that's step three: if I want to share it, I can connect to MotherDuck and create a share from it. So first I create a

22:04 database called my_data from that DuckDB database, which uploads all my data into the cloud, and then I can create a share. Now I can send this to my colleagues, they can attach it in their MotherDuck account, and then they can start playing around with my duck data.
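A minimal sketch of that upload-and-share step, assuming a MotherDuck account with the motherduck_token environment variable set; the cloud database and share names here are placeholders (in the demo the cloud database is simply called my_data):

```python
import duckdb

# Open the local warehouse file, attach MotherDuck, copy the current
# database to the cloud, then create a share from the cloud copy.
con = duckdb.connect("my_data.duckdb")
con.execute("ATTACH 'md:'")  # needs motherduck_token in the environment
con.execute("CREATE DATABASE my_data_cloud FROM CURRENT_DATABASE()")
share = con.execute("CREATE SHARE ducky_share FROM my_data_cloud").fetchone()
print(share)  # a share URL that colleagues can ATTACH in their own account
```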

22:22 So yeah, this is now all local. So how do I actually publish this? The cookiecutter actually comes

22:29 with a GitHub workflow, so when I take the cookiecutter and push the code to a GitHub repository, it will actually run the Jupyter Book build step in there: it will pip install the requirements, it will run the jupyter-book build, here you can see that, and then it will upload the HTML artifacts to GitHub Pages. So you

22:53 can use this; it will automatically build everything when you push new code, just make sure to enable GitHub Pages before you do that. And here's an example: if you go to the cookiecutter data warehouse, it's on a bit.ly link here, and here is the link to the example that I just showed you, so it's all online. So

23:14 yeah, everybody can try it and do some duck characterization. Then I just wanted to give a quick plug to my colleague Mehdi: he created this really cool data pipeline called pypi-duck-flow, and that's the DuckDB stats page that I talked to you about earlier. This shows you

23:35 all the downloads from PyPI, and there is a GitHub workflow in there that periodically grabs the data from PyPI. So if you're interested in actually having a full data warehouse data pipeline, he uses dbt to get all of that set up. I've got about a minute left, so I'm going to just quickly wrap up

23:55 here. So check it out. I wanted to give you a really quick sneak peek, a save-the-date for Monday: follow us on social media, on LinkedIn; something small is coming soon, and we'll announce what that is on Monday. So thank you all so much for coming and for your attention, and my

24:15 slides are published here, and join us on Slack if you want to continue the conversation, and I'll take your questions now.

24:28 Thank you, Guen. Are there any questions? Here we go. Thanks, that was really nice. I have a question, maybe it's about the dual execution architecture. Earlier in the talk you said Postgres is great, but it's on the other end of this internet connection, which may be slow, and then you said MotherDuck is great, but

24:48 it's on this other end of the internet connection, which may be slow. What is it that DuckDB and MotherDuck do that kind of solves that problem? I don't think I fully understand that. Yeah, thank you, that's a really great question. Let me see if I can go back here... well, lasers, of course. So it's mainly, if we go back

25:10 here, it's mainly this query plan optimization, right? If I work with Postgres and I run a query, all of that is going to be happening on the server; it's not going to use the compute on your local machine. What MotherDuck does is we actually use your compute power on your local machine in

25:30 addition to the compute power in the cloud. So there's still that slow internet connection, but we're going to make sure that it, as a bottleneck, is minimized. Let's say you have a really big query with a bunch of steps: it's going to calculate what is the minimum data that needs to be transferred to make this query happen;

25:48 that gets planned to be transferred over the internet, and then the rest of the query will happen locally on your machine. So if you're like me and you've got a MacBook or some powerful machine, you're not going to notice the difference whether it's running on your local machine or on the server; it's probably even faster than

26:06 a cloud compute instance. Sure, great question. Other

26:19 questions? Is the remote instance configurable, and are there any benefits to changing the configuration, like the size of where that remote instance is running, how much memory you have or how much disk space, and what you're actually reading, or is it auto-optimized? That's a really great question. Right now we don't offer that option. If

26:41 you're interested, I would recommend you come talk to us afterwards; my colleague Alex here is a forward deployed engineer, and he works a lot with our customers to make sure that they have the compute they need. We intend to be, like I said, completely serverless, so it wouldn't be something you have to worry about, but

26:58 definitely, if you have more questions about this, come talk to us. All right, I didn't see anything on Slack either; let me check one last time. Maybe I'll ask a question: how do I know if I need a data store like this? At what point do I go and build one? Yeah, if you

27:20 feel like you're drowning in CSVs, there are too many CSVs to keep track of, or if you have a data pipeline with a lot of different steps and you would like to be able to go back to any point in time, it can be really valuable to save all those intermediate tables in a database and

27:36 not have to worry about, you know, versioning files and stuff like that. Cool, let's thank Guen one more

27:46 time. Oh, and there's some...

Related Videos

"LLMs Meet Data Warehouses: Reliable AI Agents for Business Analytics" video thumbnail

2025-11-19

LLMs Meet Data Warehouses: Reliable AI Agents for Business Analytics

LLMs excel at natural language understanding but struggle with factual accuracy when aggregating business data. Ryan Boyd explores the architectural patterns needed to make LLMs work effectively alongside analytics databases.

2025-11-05

In the Long Run, Everything is a Fad

Benn Stancil uses Olympics gymnastics scoring to argue data's quantification obsession is generational. We went from vibes to math and may return to AI-powered vibes. Will dashboards matter to the next generation?

2025-11-05

The Unbearable Bigness of Small Data

MotherDuck CEO Jordan Tigani shares why we built our data warehouse for small data first, not big data. Learn about designing for the bottom left quadrant, hypertenancy, and why scale doesn't define importance.
