
DuckDB + dbt: Accelerating the developer experience with local power

2024/02/29

Hands-on stream with Mehdi Ouazza and Josh Wills discussing how DuckDB can fix the data engineering workflow with dbt, using your laptop's local compute power!

0:04 Hello everybody, and welcome to another episode of Quack and Code.

0:12 Today I have a really special guest, I

0:17 would say. I'm really amazed that he accepted to be on this podcast, because he doesn't do that many appearances; we will discuss that. But we will also discuss DuckDB, and of course dbt, because he is the author of the adapter, the DuckDB adapter for dbt. So if you want to learn about DuckDB and

0:42 how DuckDB works with dbt, this session is for you. Josh, welcome!

0:50 Thank you, man, thank you so much for having me here. That's cool. So how is it going? It's breakfast time for you? Oh, it's noon, it's lunchtime for me. I'm in San Francisco. It is a breathtakingly beautiful day, one of the days that makes you very

1:05 happy you moved to California. Yeah, it's going good. I had breakfast with a venture capitalist friend of mine, and yeah, just having a great day, happy to be here. Cool. But I'm super curious:

1:22 why don't you do more talks or podcasts? I'm really curious. You know, I used to do this stuff professionally. I worked at Cloudera for four years, during the peak of the whole big data hype cycle and all that kind of stuff, and I used to fly 120-ish thousand miles a year and give talks

1:40 anywhere, to anyone who would invite me to do anything, and I don't want to do it anymore. I really am just very happy being an engineer, working on my little projects and helping people out in Slack or Discord or whatever when they have problems. But I'm not a company, you know, I'm just a

2:01 guy, just a person, not trying to sell anyone anything, for all these kinds of reasons. I don't do this stuff anymore. It was a great time, I'm glad I did it, and I'm grateful that I don't do it anymore. Yeah, but that's it, I mean, you already spoiled a bit of your

2:19 background: so you worked at Cloudera. I did. And I think your job description on LinkedIn is,

2:31 wait for it, can you remind me? Oh, my current one? Gainfully unemployed data person. Gainfully unemployed, yeah. Can you tell us a bit more about the recent years after your Cloudera background, what you did before Cloudera, or after? Sorry, after, after.

2:55 Oh yes, after: I worked at Slack for four years. I worked on Slack's data infrastructure, helped build that, helped build the team from, you know, not quite nothing but pretty close to nothing. And then I worked on Slack search for a couple of years, and so I wrote the Slack search indexing

3:11 pipeline, and it turns out the code I wrote back in 2017 still powers the Slack search indexing pipeline, which is just kind of insane to me. If you're using Slack right now and searching for things, you're exercising code I wrote, which is kind of crazy. Yeah.

3:31 And so now, after Slack, you went into coding the DuckDB adapter

3:40 for dbt. What's the story behind this? Yeah, so after Slack went public I did kind of the standard thing and became an angel investor, and dbt was, I think, technically, if not my very first angel investment, pretty close to it. And I had this sort of dumb idea that I

3:57 would become the world's most useful angel investor, which is really, well, angel investors generally speaking are not super useful, in general, so trying to be a useful angel investor is kind of an oxymoron. But my thought was that I still liked coding and I

4:15 liked building things, but I didn't like, you know, having a boss, or Jira, or you

4:22 know, anything related to anything. So my idea was: okay, I'll just look at my portfolio companies and I'll build open source stuff for them that will add value, but without me having to actually have a boss or a real job or any of those kinds of things. This was my

4:39 brilliant plan. And so my first dbt adapter was dbt-materialize. You know, Materialize is the streaming, differential-dataflow SQL database, so I built that and showed it off to Arjun and Tristan; Arjun's the CEO of Materialize and Tristan's the CEO of dbt Labs, and both of them were kind of like,

4:56 well, this is weird, I don't know why anyone would do this, very strange thing to do. But it's funny, I just saw Materialize posted a video of Ramp using dbt-materialize to power their real-time fraud detection system, which made me very happy. Yeah, and then I

5:15 discovered DuckDB during the pandemic, actually, when I was helping out some epidemiologists, some folks at the California Department of Public Health and Hopkins, and later FEMA and CDC and stuff like that, kind of doing analysis and trying to forecast the course of the pandemic. I discovered I needed something

5:34 kind of DuckDB-like: we were running these gigantic MCMC simulations that were kicking out enormous quantities of Parquet data into S3 after running on thousands and thousands of batch instances, and we wanted a way to analyze them, but we didn't want to have to freaking ETL all that data into a database.

5:52 Who wants to do that? It's terrible. No one wants to do that. So I'm like, I want something sort of DuckDB-shaped, so I went looking for it and I found DuckDB, and unfortunately it couldn't quite help me at the time; I was busy with other things, and it was kind of on

6:05 the path to becoming what it is today, but it wasn't there yet. But what's the timeline here, just to place it? Yeah, no, for sure: this is like April, May 2020 when I was looking at this kind of stuff. So that's pretty early for DuckDB. It was

6:23 early for DuckDB; I mean, I can't even remember what version. Like I've said before, I started working on the adapter in September of 2020, once my pandemic work had kind of wound down and I was looking for something to do again. I started working on dbt-duckdb just to see if

6:38 I could kind of make it work. I thought it'd be kind of interesting, and there was stuff DuckDB didn't have yet: there wasn't an information schema yet, all of the metadata functions were still SQLite-based, as opposed to kind of

6:52 Postgres-like, as they are now. No pg_catalog, none of that kind of stuff existed yet. And so I started adding that stuff to DuckDB, while, like I said, I had not written C++ in like a decade.

7:09 That can be painful. It was, it was. I mean, at least at the time, and I think it's still true, I still occasionally fix bugs in DuckDB when I find them, it's really one of the best C++ projects I've ever worked with. It's so great, it's so self-contained, it's so

7:26 well organized, the CMake stuff works great. I had a fairly easy time dipping back into it. And so yeah, I started adding the functionality I needed to make dbt-duckdb work. So yeah, that

7:38 was how I got started. I worked on it for a few months, cut the first release in November of 2020, and kind of like no one cared, you know. Much like a lot of my early ideas, people were like, wow Josh, that's a strange and odd thing to do and I don't

7:52 understand why you would have done that. And so yeah, I worked on it, put it out there, and that was kind of it for a while. Cool, no, it's really interesting to go into the history, because again it feels like yesterday, but it's still only four

8:13 years, right? But I feel like in the data ecosystem it almost counts like dog years, right? Yeah, totally, 100%, things change so quickly and stuff like that, without a doubt. How do you see the project having evolved since you started with the adapter?

8:34 I mean, I think there are really three different forces driving the evolution of the adapter since those early days. One, obviously, is dbt itself. dbt cuts new releases every three months, and they add new features and I need to support them. Or I don't,

8:55 I don't need to support them; I know not all the adapters are super great about keeping up to date with the latest and greatest. I like to, I guess, just because I'm curious, and, I don't know, I like to tinker with things. So I try to make a point of cutting a

9:09 dbt-duckdb release often within the first day of an actual dbt release, and people seem to like that, because dbt-duckdb is a great way to test out new dbt features. In fact, the dbt team itself uses dbt-duckdb to test dbt features, because it's so simple, right, it's so easy. So I try to

9:32 stay on top of that; dbt drives some of it. The main driver by far, obviously, is DuckDB itself. They just add so much new great stuff and they fill out so many gaps and things like that. The most recent 0.10 release really started adding commentable objects in DuckDB,

9:50 which means you can comment on tables and columns. This is a big deal for dbt-duckdb, because dbt docs relies on this functionality, and it's kind of been a stub for us for a very long time because DuckDB didn't have comments. Now they do, and the very last piece I needed is actually coming in

10:05 0.10.1, and so I will add full-fledged, complete dbt docs functionality to dbt-duckdb in my next release. All of the Parquet stuff, all of the fsspec stuff, all of these things have been hugely useful. And then, on top of that, there's the synergy between that and the DuckDB

10:24 community itself, and people coming up to me with ideas and saying, hey, I would love to be able to do this with dbt-duckdb. And that's generally speaking something I have a lot of fun doing, and I'm happy to work with people on, if they want to send pull requests, if they want me to do things.

10:41 We generally have a good time exploring the outer limits of how useful dbt could be if we didn't have it chained to an actual cloud data warehouse. An easy example of this is that dbt-duckdb has complete and full

11:01 Python support: you can run anything you like in your Python models. You're not limited to Snowpark or Databricks or whatever, none of that nonsense, just run Python, that's it, that's the whole thing.
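For readers following along, here is a minimal sketch of what such a Python model can look like in dbt-duckdb; the upstream model name and columns are illustrative, not from the episode:

```python
# models/downloads_rolling.py, a sketch of a dbt-duckdb Python model.
# `session` is a plain duckdb.DuckDBPyConnection, so any locally installed
# Python library works here; no Snowpark or Databricks runtime is required.
import pandas as pd  # assumed to be available in the local environment


def model(dbt, session):
    dbt.config(materialized="table")
    # dbt.ref() returns a DuckDB relation; .df() converts it to pandas.
    # "daily_downloads" is a hypothetical upstream model name.
    df = dbt.ref("daily_downloads").df()
    df = df.sort_values("download_date")
    df["rolling_7d"] = df["daily_download_sum"].rolling(7, min_periods=1).sum()
    return df  # the returned DataFrame is materialized as a table
```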

11:23 And then the other, I think, most popular feature is the external data file support, for working with Parquet files and S3, all that kind of stuff. We make a point of making that really easy, so that you can do all of the great data warehouse stuff you need without actually having to run a data warehouse. The other funny one for me is that most of my users are either very small companies, you know, just getting

11:38 started, or, my other favorite, giant institutions: huge financial institutions, huge healthcare institutions, where they have very little IT support and a lot of their data warehouse is basically a bunch of CSV files on a network drive. It's going to be like a shadow IT. Shadow IT,

12:00 exactly. The people there are good, they're smart, they're talented, they want to use modern development tools, they want testing, they want version control, they want all these things, but they can't get the IT team to set up a goddamn Postgres instance, let alone Snowflake, right? And being there for them and letting them work

12:17 productively, even in these fairly backward, fairly isolated environments, is great for me. These are my people, I love these people, so being able to help them work is tremendously rewarding. Yeah, I love that too. Yeah, you already spoiled a bit what we are going to

12:34 dive into afterwards, the coding with the dbt-duckdb adapter. I just want to zoom out first. So, people using dbt already, you mentioned

12:47 they have a dependency on the cloud data warehouse, and basically DuckDB can

12:54 cut this dependency. But how do you think this dependency actually hurts development in general? What is actually wrong with having this dependency through the life cycle of developing a data set? Yeah, I mean, it's funny you mention it, I've been

13:15 thinking about this a lot lately, and trying to figure out a good way to express what I think the problem is in data engineering in general right now, with the modern data stack, with all these kinds of things. My friend Kent Beck wrote a book recently called Tidy First?, and

13:35 he talks a lot in the book, which is about the philosophy of software design, and not just the philosophy but the economics of software design, about coupling versus cohesion. Coupling is fairly self-explanatory, right:

13:55 two things are coupled if, when you change one of them, you have to make a change in the other one; if you don't make the change in the other one, something's going to break. And then cohesion is really: are the things that are coupled organized together into a single unit of some sort, be it a module, a repo, whatever,

14:12 so that it's easy to change them together when you have to change one of them? And the data stack is basically a sort of coupling and cohesion disaster. At the ingestion level

14:27 we have Fivetran or Stitch or whatever tooling you're using for data ingestion into your data warehouse, sucking in everything from everywhere, coupling your upstream production systems, Salesforce, NetSuite, whatever the hell you use; all of this stuff is now coupled to your data pipeline code. Anything changes upstream in the systems

14:49 it's coupled to, the data pipeline might break with it. And then downstream of the data pipeline, downstream of dbt and whatever stuff you're using, you have the BI tool, which again is also a kind of coupling and cohesion mess: all the queries and visualizations and stuff in there are coupled to your dbt code,

15:06 and, by the transitive property of coupling, they're also coupled to the upstream systems. But again, we have a lack of cohesion: we have no idea that when we change something here, it's going to break things downstream. This is the world we've been working in, and this isn't just an aspect of the modern data stack,

15:24 it's been around for a long, long time. Yeah, indeed. And it's kind of, well, why is this happening? A lot of things, like data contracts on the ingestion and data pipeline side, and then metric trees on the BI side of things, are really just tools trying to address this

15:42 problem. We're trying to manage this coupling, make these things cohesive, not just coupled but cohesive. Data mesh, same thing: trying to manage coupling and cohesion, the whole thing. But architecturally speaking, I kind of blame the centralized data warehouse. I do, and this is potentially my heresy, right: I blame the

16:03 centralized Snowflake, Hadoop cluster, Redshift, Teradata instance, whatever it is you're talking about. I sort of blame that thing, and for me this is sort of an echo of the whole "big data is dead", "perf is not enough" stuff you guys talk about over at MotherDuck.

16:22 The need for scale has forced us to design our entire systems around this central, godlike data warehouse that is the repository of all things; it's the god object, it's where all the data goes. And this is incredibly powerful, but as a result,

16:41 for data engineers, all of our tooling, everything, is kind of oriented towards the centralized data warehouse. And what I like about DuckDB is that, in and of itself, it is kind of the antithesis of that. DuckDB is tiny, DuckDB is modular, DuckDB can run anywhere you need it to run. It can run locally on your developer

16:59 workstation, it can run in a GitHub Action or CircleCI, it can run as part of your CI/CD pipeline, it can run on a Raspberry Pi for God's sake; when we build neural implants, we will run DuckDB there as well. It can run in your browser; your BI tool's backend caching layer, whether it

17:17 is Mode or Rill or Omni or whatever you're using, is using DuckDB as the caching layer. And look, if you happen to actually need a cloud data warehouse, architecturally speaking, well, guess what, we've got MotherDuck too, so you're good to go. You can run DuckDB in all of these places,

17:34 anywhere you want to. Yeah, I think one thing I want to highlight in this is getting inspired by the software engineering workflow; I've said that multiple times. We actually had a panel at Coalesce with Matt, and

17:59 no, Joe was not on the panel,

18:04 but anyway, it was about the data engineering life cycle, and we talked about the development loop. I always take this example: as a web dev, if you want to build a website, you don't create any dependency on the cloud, you just write your HTML locally, you run your local server, you have a result, you can inspect the result. Totally. And I

18:24 feel like data engineers have been setting this norm where it's totally normal to wait for a CI pipeline to do something in order to inspect whether it's working. There is this huge loop that we've created, due to the central data warehouse that you mentioned, which is not

18:47 part of, you know, software engineering best practice. And it's funny, because this has been on my mind for years now; it started post-Hadoop, when the clouds were starting, and I felt like there was something wrong there. And actually, at the keynote of Coalesce

19:07 by dbt, they talked about that, they talked a lot about this development loop,

19:15 this life cycle, saying we want to shorten this loop. They talked about dbt Cloud, which I think is one option, but still, exactly,

19:25 there is an issue where we basically cannot run everything locally. I've trained some software engineers moving to data engineering, and I guess you get the same kind of feeling: if you consider yourself a software engineer, it's like, wait, I cannot run unit tests locally? I need to connect to my cloud data warehouse? Does that sound familiar

19:48 to you? I mean, it's not simply that. In principle there's nothing wrong with running something in the cloud; again, people do this all the time, it's great. To me it's not just the "in the cloud" part, it's the fact that because we designed these

20:04 systems for scale, there is this inevitable fixed overhead to doing anything with them. Even running a no-op query takes a decent amount of

20:15 time: it doesn't take a millisecond, it takes at least two or three hundred milliseconds. That's a long time, especially when you're not running one of these things, you're running several hundred of them, right? But you're talking milliseconds; I think the average person in the data world is used to seconds. I just kind of die

20:32 inside when I hear that, bud. That might as well be infinity for me. And I think what's weird for me, right, is that one of the jokes about me is that no one actually knows what I do. Really, I'm famous for a tweet I

20:43 wrote a million years ago, right? I haven't done data science proper, or fit a machine learning model in anger, in, I don't know, six years or something like that. I'm just a software engineer now, that's pretty much what I do. And, like you said, I've gotten

21:01 addicted to the fast feedback loop and the power, as a developer, of running everything locally. Compose has got my Docker containers, let's rock and roll, let's add features just as fast as we can, let's operate at the speed of thought; how else would we work? And so I just can't fathom going back. And

21:21 so I guess, bringing this back around, to the extent that I should do more podcasts and more talks and stuff, it's that I want to drag everyone in the data ecosystem kicking and screaming out of this, I don't know.

21:36 Yeah, they kind of need to wake up. I just shared the panel with the people on YouTube; we have a couple of people watching on YouTube, so please tell us: what's your average expectation when you're running a pipeline to test something, a couple of milliseconds or a couple of seconds? I'm curious to hear. I'm also curious:

21:55 what do you do with that downtime? Have you gotten really good at making espresso? Have you learned how to solve Rubik's Cubes? How do you spend this downtime, what are you doing with this free time you have? I don't know, have you taken up guitar? Anyway. Yeah,

22:11 and I think it's exactly as you said: it sounds minor, right, like, yeah, it's fine, it's three or four seconds, but how many times are you going to run it, and how many data people in your team, in your company, are going to suffer from the same thing? And when you add it up... Yeah, there's a generational thing here, I

22:30 would say. You know, I'm old, so I remember very old things, right? I remember using Chrome for the first time; you remember, you just had Firefox or whatever, and, again, this is a million years ago,

22:45 but Chrome was just a little bit faster, consistently, over and over again, all the things you're doing just a little bit faster, and you get addicted to it, and when you try to go back to doing it the other way, it just feels like: you've got to be kidding me with this. Yeah,

23:01 and that is true for a lot of technology, right? There are multiple studies showing that we get used to slowness quickly. When you get a fresh new laptop, it's always the case: a couple of months after, you feel like it's slower. It's probably you, maybe

23:22 you're doing something wrong, often the problem sits between the chair and the keyboard, but sometimes it's just that you got used to this speed and you're expecting a bit more, and then if you go back to something slower, yeah, you never go back. Maybe that's the thing. Exactly:

23:43 I can't, I refuse, I simply will not do it, absolutely not. Yeah, that's a good point. Maybe the success of DuckDB is also that when people try it, they get this feedback loop which is pretty fast, just for development's sake,

24:02 and that makes things just easier and faster, and people don't go back; when they do go back, they realize, wait, there is something wrong with this setup. Absolutely. Speed is a feature: if you can't figure out what else to do with a product, make it faster, that's the rule. Worked really great for

24:18 Chrome, works really great for DuckDB; highly recommend that as a product strategy in general. Yeah, cool. So we

24:26 can dive a bit into the code, and talk a bit about the adapter. What I suggest is, let's just go first to the README.

24:42 Are you going to shame me for my README? I clearly need to work on this README. I'm still waiting for the MotherDuck people to hire a technical writer to come fix this thing for me, but yes, just copy-paste it in here: GPT-4, fix this for me. Yes, yeah, you

24:59 know, definitely. So, yeah,

25:03 about the README, I have a small issue with the display. Let's fix it, let's open the pull request and fix it, man, I've got a browser here, we can do it. Yeah. So, I'll also share the link for people to read on their side. But so,

25:22 basically, for people not familiar, this is what needs to be built in dbt if you want to connect to a data warehouse. And something that

25:37 is actually not obvious for people who are not software engineers, people using dbt with less technical knowledge, is that dbt is just this client that's sending SQL, right? Yep. So there is nothing,

25:55 basically, computed on the client, and we're going to come back to that. And so

26:04 that's what basically creates this dependency on the cloud, always, because as far as I know most of the cloud warehouses don't have a way to run locally. I know that BigQuery, for example, has an emulator, and I played with it a couple of years ago; I don't know what the state of it is now, but I bet it's still

26:26 painful, because, as you mentioned, those technologies haven't been designed to run locally on a single node, so it's usually some kind of wrapper solution, whereas DuckDB just runs the same way locally as it runs in the cloud. Indeed. So, will you be able

26:54 to use DuckDB with dbt Cloud? There is something going on, I'm not sure where it is and when it's going to be there, but probably at some point, at least for MotherDuck, that's for sure, because MotherDuck is in the cloud. So, in your README you

27:15 basically mention how to set it up. So I think that's, let me move just, okay, a

27:25 bit, one sec: is my head blocking the view? There you go. Yeah, move, yeah, here it's perfect.

27:34 There we go, much better, sorry for that. Nobody wants me in a corner. Anyway, it's fine, go back in your corner. So, we

27:44 have here an example of how you

27:49 basically define it in your profile. So,

27:53 in your profile, as you would define another adapter like BigQuery or Snowflake, here you define duckdb, with a path to the DuckDB database if you want something persisted. Yep. And then the extensions. For people who don't know, extensions are basically DuckDB features, also built in C++, that

28:19 add support for things like querying data on S3, reading Parquet, and so on. I think actually most of the time now you don't need to specify them. I mean, this is a great callout for me here, man, because you're exactly right: this is documentation I wrote many, many moons ago, back before the present day,

28:37 and the httpfs and parquet extensions are bundled with all Python DuckDB releases now, so you don't actually need to specify them, they're just there for free. But, you know, for other extensions, maybe spatial, maybe other sorts of stuff, things that are very cool but are not part of that core distribution, this is

28:55 where to specify them, and this is fine, I agree with you. Yeah, totally. And when you specify them, do you install and load them in the adapter? Yeah, so, it's good

29:09 you called that out: again, most dbt adapters go out of their way to really just be a templating engine, they compile SQL and they execute SQL against the connection. Of course, we're a little different with dbt-duckdb; we do a lot of things locally, because DuckDB is local too, right, so there's nothing separate

29:27 to connect to. And so yes, we do in fact install and load extensions for you if you tell us about them, absolutely.
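For reference, a minimal sketch of the kind of profile being discussed here; the project name, file path, and extension list are illustrative:

```yaml
# profiles.yml, a minimal dbt-duckdb profile sketch
pypi_metrics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb   # a local database file; omit for in-memory runs
      # httpfs and parquet ship with the DuckDB Python wheels these days,
      # so listing them is optional; non-core extensions still belong here.
      extensions:
        - spatial
      threads: 4
```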

29:40 Yeah. And DuckDB used to work like that in the past, where for some operations, like reading Parquet, you needed to specify loading the Parquet extension, but now there are special hooks in the codebase that recognize this automatically and will load the extension for you. So even if the adapter is actually loading the extensions, most of the

30:02 time it actually doesn't need to. But again, totally agreed, for specific extensions like spatial and so on it's different. We're talking here about extensions that are supported by DuckDB, so there are trusted extensions, and then there is also a bunch of third-party extensions. That's right. Exactly, that you can also

30:27 install. And so here you're also setting the credentials; we had a discussion about this, I'm not going to go... No, bring it, come on, let's make this contentious, have a good time with it. Right, okay. So, again, I think it's important to know that DuckDB is revamping how it's

30:48 managing secrets, so they now have a secret object. Because, for people not familiar with DuckDB, there is this weird thing today where if you want to connect to Google Cloud Storage you can, but you also need to specify these S3-style values, which feels a bit weird. And so now there is kind of a secret object,

31:09 where you can define an S3 secret object and a Google Cloud secret object. And there is also the DuckDB AWS

31:19 extension, and with that you actually have

31:24 something to simplify even further how you load the credentials, specifying, sorry, specifying an AWS profile rather than the secrets directly. So you have different ways. But you also have that as part of the plugin, or an extension of the adapter, tell me. Yes, I mean, I think you've called out a

31:52 few different things here, man, which I think are important. You know, DuckDB moves so quickly, right? They just do, they add new stuff all the time. I think my favorite stuff recently is the new attachment feature, where you can attach an external MySQL or Postgres database and query it directly from DuckDB.
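A sketch of the attachment feature Josh mentions, using DuckDB's postgres extension; the connection string and table name are illustrative:

```sql
-- Attach a running Postgres database and query it from DuckDB directly.
ATTACH 'dbname=appdb user=app host=127.0.0.1' AS appdb (TYPE POSTGRES);
SELECT * FROM appdb.public.orders LIMIT 10;  -- hypothetical table
```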

32:11 I guess, when I'm creating adapter functionality for DuckDB, I'm basically trying to meet DuckDB where they are right now, and when I wrote all of this stuff back in the day, this was the only way to specify secrets, that was how you did this sort of thing.

32:29 Since that time they've added the credential chain, via the aws extension, and they've added the secrets functionality you were just talking about, the secret objects.
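A sketch of that newer secrets manager (DuckDB 0.10+), which replaces the old per-session S3 settings; the secret and bucket names are illustrative:

```sql
-- One secret object replaces SET s3_access_key_id / s3_secret_access_key.
CREATE SECRET aws_creds (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN  -- resolves env vars, ~/.aws/credentials, etc.
);
-- With the secret in place, S3 paths just work:
SELECT count(*) FROM read_parquet('s3://my-bucket/pypi/*.parquet');
```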

32:49 So when I build this stuff I have to be very flexible, I guess, in how I approach things, because they move so fast, and because I do also try to support a large range of versions. The minimum supported version right now in dbt-duckdb is 0.7.0, which is like the equivalent of supporting Oracle 4 or something; in DuckDB years that's ancient history. But I do

33:10 have those big institutional people running their super important financial calculations on NAS instances, and I don't want to go around forcing them to upgrade stuff that's critical for them if it's not absolutely essential that they do so. So the consequence of this is a sort of development philosophy, which is:

33:30 do it however you like, whatever way works for you is fine with me, and I'm not prescriptive. This is also a personality flaw, in some ways; I'm generally militantly anti-prescriptive about this stuff. But to your point, this is not the best way to do this now; all these things I'm doing here,

33:46 you can do better now, there are better tools for it. Yeah, but I think, to your credit, you've been following DuckDB pretty closely, and it's super hard to predict and to keep all those things updated. Even me,

34:03 there are literally people from the community making PRs on my example repos here and there to fix things, to say, hey, this has changed. And I realized that when you have an early project, which is zero-dot-something, right, which is not officially

34:25 GA, things can break or things can change, and you need to adapt, especially if you build a wrapper or things around it. So, it is hard, and I think you did a really good job of following that up, so I think it's nice to call that out too. Because, yeah, it's

34:49 not like the DuckDB project hasn't moved in three years, right? Yeah, I mean, three months is an eternity, really. And part of it too, again, is just the fact that the dbt releases are not aligned with the DuckDB releases and

35:05 stuff; these are just things I have to sort of deal with, and it's fine. Exactly, such is life. So, let's dive into an example. We are doing a series, an end-to-end data project, and basically, how does it look? It's like this; I'm going to share the repo if people

35:28 want to click on it. Make it bigger for me too, so I can... or can I make this bigger? Yeah, I can make it bigger, but I'll just share it. Yes, please. Here, so,

35:47 basically, how does this project look: in the first part we extract data using pure Python. We basically source data to get PyPI statistics; PyPI is where Python libraries are hosted, so if you want to know the usage of a given Python library, this is where you look at

36:08 the number of downloads, which Python version people are using, which CPU architecture, whether they are more on ARM or Windows or whatever. You actually have a lot of really interesting information. So what we did in the first part is basically just ingest data about the DuckDB Python library

36:29 project, and so we have this data now either on S3 or on MotherDuck; I don't want to put a hard dependency on MotherDuck itself. And now, basically, there is the processing, where you use dbt and DuckDB together. That's the second part, and the second part's video should come later this week, and this is

36:53 where we're going to dive in, basically, and give a preview of this repo. Let me zoom out. Yeah, awesome. So we have

37:02 a transformation folder here, and, simply, let's go back to

37:12 the profiles that we just looked at before in the README, right? So,

37:19 I basically simplified it, because, as we were just discussing, there's no need for a specific extension if I read from S3 and write to S3, because again that happens automatically behind the scenes, unless you require specific extensions. And then for MotherDuck, basically, the difference when you want to connect DuckDB to the cloud, to the MotherDuck

37:45 service, is that you just specify the md: prefix.

37:52 Yes, exactly. And the best practice is just to store your MotherDuck token, which you can get from the web UI, as an environment variable. That's why it's not specified anywhere here; it's specified in a .env file that I'm not going to open, because otherwise you're going to see my secrets.
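A sketch of the MotherDuck target being described; the database name is illustrative, and it assumes the token is picked up from the motherduck_token environment variable rather than written into the file:

```yaml
# profiles.yml output for MotherDuck (sketch)
pypi_metrics:
  target: prod
  outputs:
    prod:
      type: duckdb
      path: "md:pypi_metrics"  # the md: prefix routes the connection to MotherDuck
      # no token here on purpose: it lives in the motherduck_token env var / .env file
```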

38:14 Oh, that's bad, careful. Yes, but I have a template in the repo of basically what kind of .env you need to fill in. And so, one thing I like: do you use Makefiles? Do I personally use them? No, I'm just basically a perfect software engineer and instantly remember all commands all the time, and

38:38 so I have no need for Makefiles, I'm just, you know, flawless and never make mistakes, that kind of thing. I mean, to be honest, I have used Makefiles and I very much like Makefiles and stuff, especially when I'm working on a team, but since I

38:50 don't work on a team, it's just me, no, I don't use them for my regular projects. Yeah, okay. So here we do have a Makefile, because the repo starts to be pretty big; as I said, it's kind of a big monorepo for an end-to-end project where you have the

39:06 ingestion, the transformation, plus the dashboard after. And so we have a couple of entry points; this one, for example, was to ingest the data, and then, to transform, basically, we do a dbt run in the

39:22 dbt project folder, and basically

39:26 we can specify a start date and an end date, because the data for PyPI can be

39:38 pretty big. I mean, once you get it from the source and ingest it, it's pretty small actually, but, just for the sake of being able to rerun the pipeline on a specific window, we always specify the start date and the end date.
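A sketch of the kind of command such a Make target wraps; the directory and variable names are illustrative:

```bash
# Run the dbt project for a specific time window, passed in as dbt vars.
dbt run --project-dir transformation \
        --vars '{"start_date": "2023-04-01", "end_date": "2023-04-30"}'
```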

39:58 What is data engineering, if not running backfills over and over again, forever? That's it in a nutshell, yeah, that checks out. Exactly. So, yeah, that's about it, at least for the entry points. So let's go now into the code; we have just one model,

40:20 and here we are specifying an external source, right? Yes. So with DuckDB you can basically

40:32 read from S3, and it's computed

40:36 in the client, right, in the DuckDB client. And, maybe, let me just

40:45 get, oh my God, a quick diagram to

40:52 explain to people. So we have, basically, the dbt client which is running, and typically, usually, you have a

41:12 database out there somewhere. Yeah, somewhere far from you. Exactly, far, far away. And the thing is that here, as I mentioned earlier, this is only

41:25 sending SQL, right? And so this is

41:30 where the compute is happening. Yes, yes. So now, with dbt-duckdb,

41:38 basically DuckDB is in-memory, yep, so it's the same process as the dbt client, right, right alongside it. So it's not in the cloud or on a server anymore; it's running directly in the dbt client, in Python. And so the compute is also happening there. I feel like it's only a matter of time

41:59 before I give programmatic access to the dbt compiled graph of tasks to the code itself, so you can dynamically reprogram your DAG while it runs using DuckDB. Just, yeah, you

42:15 know, it's absolutely silly, but it's kind of like, anyway, let's do some metaprogramming in here. No, no, that's true. But so, basically, for example here, we are reading data from S3,

42:33 and what is actually happening is that, well, you don't actually need to load it into DuckDB; DuckDB can read data directly from S3. So it's DuckDB that is doing the magic and reading data from S3, which is a different setup from the one we just talked about. Basically, let me copy-paste the old way,

43:00 yeah, the classic way we have. So,

43:05 this is the static way: if you want to read data from S3, you typically create an external table that your database in the cloud points at, and that's how you're going to read it. That's what is happening there.
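Schematically, the contrast being drawn on screen is something like this; the table and bucket names are illustrative:

```sql
-- The classic warehouse pattern: register an external table object first,
-- then query it (Redshift Spectrum / BigQuery external tables, schematic):
--   CREATE EXTERNAL TABLE raw.downloads (...) LOCATION 's3://my-bucket/pypi/';
-- With DuckDB, the engine itself reads the files, no external table needed:
SELECT * FROM read_parquet('s3://my-bucket/pypi/*.parquet') LIMIT 10;
```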

43:31 But here we have DuckDB, so in that setup the compute is over there, in the warehouse, and in this one the compute is right here, with DuckDB. So, yeah, just in case this

43:43 is useful to someone, this is probably pretty handy. I might steal these images. I can make them better.

43:51 You make a good point, though: I should probably explain this on the dbt-duckdb website or something, because this is how every other dbt adapter works, and we're a little different. Yeah, exactly. So, the way I was explaining it when I was at Coalesce: people, you know, did the getting

44:08 started guide from dbt, which uses DuckDB, right, the Jaffle Shop. People were not familiar with DuckDB, and I'm like, did you do the getting started guide? They say yes. And then I'm like, did you have to connect to a warehouse? They say no. So then:

44:26 how is it working? And I was like, this is the magic: this is DuckDB, which is running in the process. So, I think that's why I wanted to take five minutes to display this, because it's really different. And the other thing here is that there's

44:47 something interesting: when you run dbt, even if you don't run it locally, the machine it runs on is still a compute resource, yes, that is probably underused, yes, because it's just running and sending SQL. Yep.

45:06 So sometimes, when you do a small query on a small data set, it's crazy how many resources you are using, because you have a client here which is just sending text, and you have the database, which is scaled-out computing resource. So this is also a bit related to what we

45:24 discussed earlier, the development loop, where you realize... yeah, you never really appreciate how good your laptop is until... exactly, it's hard to feel it viscerally, because otherwise it just feels like you're sitting around waiting for compute to happen someplace else, a little babysitter process that doesn't really need anything. Yeah, cool, 100%. And

45:46 so, basically, when you want to connect to an external source, you just name the external source and specify an external location. It can also work with local files, CSVs, JSON, and pretty much

46:01 anything you can think of. And this is something people proposed that I do, basically. This whole external location construct for sources and stuff like that is a feature of the dbt-duckdb adapter; it was requested by users who wanted a good way to manage a big list of fairly similar

46:17 file paths, essentially S3 keys or whatever, in kind of a coherent way. Again, coupling and cohesion all the way down, right? So we support that.
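The external-location pattern being shown looks roughly like this in dbt-duckdb; the source, bucket, and table names are illustrative:

```yaml
# models/sources.yml (sketch): map a dbt source onto files in S3.
version: 2

sources:
  - name: pypi
    meta:
      # {name} is templated with each table name below
      external_location: "s3://my-bucket/pypi/{name}/*.parquet"
    tables:
      - name: file_downloads  # referenced as {{ source('pypi', 'file_downloads') }}
```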

46:38 And so here I'm basically just reading the data. And in my model, what do I do? It might sound scary, but bear with me. So,

46:43 this is the main data set, and what I do here is split out just the minor version of Python, because there are patch versions and I want to regroup the data so that I don't have the patch version. I want to see, for example, who is using Python 3.9, and I

47:03 don't care about 3.9.1 or 3.9.2. Totally. So, basically, just getting the minor version, with a case expression for when the data is not there, depending on the download. And then, basically, passing the start date as the download date, as we discussed earlier. And what I do is basically

47:26 regroup all those fields into a daily download sum, because I don't need to keep single download rows. Absolutely. So this is an aggregation, but it aggregates on pretty granular things, because there is the system, the release, the version; but that's still fine, the size of the data is pretty small.
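A sketch of the shape of that model; the column names are illustrative rather than the episode's exact code:

```sql
-- Collapse patch versions ('3.9.1', '3.9.2') into a minor version ('3.9')
-- and aggregate to one row per day and dimension combination.
select
  regexp_extract(python_version, '^(\d+\.\d+)', 1) as python_minor,
  cast(download_timestamp as date)                 as download_date,
  system_name,
  count(*)                                         as daily_download_sum
from {{ source('pypi', 'file_downloads') }}
where python_version is not null  -- the field is sometimes missing
group by all
```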

47:49 And then I'm basically adding a load ID. Do you have any other strategy that you usually use to track what you load? No, I don't, but I want to call out that I added the md5 function to DuckDB. Sorry, that's

48:02 another one of mine, sorry, can I brag about that? Okay, cool. I needed it exactly for a dbt-duckdb thing, so I went and, well, they already had an MD5 library in there, so it wasn't a big deal, but that's one of mine too. Anyway, sorry, we just bragged: Josh basically wrote the

48:17 whole internet. It wasn't on purpose, I swear. No,

48:24 I mean, well, it's a useful function. Yeah, it's definitely something I'm often using, absolutely, as a technique to basically have

48:36 a unique ID that represents your row, totally, so that you can know what data you've been loading, and you can use the incremental feature from dbt.
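The load-ID idea sketched in DuckDB SQL; the column and model names are illustrative:

```sql
-- Derive a deterministic load_id by hashing the grouping columns.
-- md5 and concat_ws are DuckDB built-ins.
select
  md5(concat_ws('|', download_date, python_minor, system_name)) as load_id,
  *
from {{ ref('daily_downloads') }}
```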

48:58 Is there a better way? I mean, I'm trying to critique here: is there a better way, using the COLUMNS construct? Hamilton Ulmer turned me on to this whole COLUMNS construct. Because my thinking here is that you now again have a coupling between the exact columns you're grouping by and this MD5 hash, so you're kind of duplicating something, right? You know what I mean? I'm wondering if there's a way we could stagger that, so that you wouldn't

49:12 have to keep those things in sync, they would just be, by definition, in sync. But again, talking about the power of DuckDB: this whole COLUMNS construct is incredibly powerful for BI developers, UI developers, and the programmatic meta-access to your query while you're running it is just really cool.
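For context, a sketch of what the COLUMNS() expression does; the table and column names are illustrative:

```sql
-- Apply one expression to many columns without repeating the column list:
SELECT min(COLUMNS(* EXCLUDE (load_id))) FROM daily_downloads;
-- Or select columns by regular expression:
SELECT COLUMNS('python_.*') FROM daily_downloads LIMIT 5;
```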

49:31 Anyway. Yeah, that's a good callout. I wonder, if you have some fun with that, it'd be a really nice pattern for people, because this problem happens a lot. Indeed, yeah, I'll ping Hamilton, asking if we can do this better. I bet he has a way; he's really good

49:52 at this stuff. Yeah. So, basically, this is my data set: at the end I have the daily download sum, with some cleanup on the Python version, and I just keep the fields that are the most interesting; the rest is not really important. And I just convert the timestamp into a date, because I want

50:13 to have a daily view. Absolutely, that's right. Yeah, that's more than enough. It's often like that in data engineering: you often make compromises on aggregation, because there is no point in having such granularity, and if you do need it, you always still have the raw data. So it's not like it's irretrievably

50:37 lost forever, right? Go get it if you need it, and if you don't, then you don't, and all is well. Absolutely. So now it starts to get interesting: I have a macro. But first, I think the

50:52 place to look, actually, is the dbt project file. Okay, so, two things:

51:00 basically, we saw that, let me put the profile here, we saw that

51:10 either we run DuckDB, and when we run DuckDB without the MotherDuck dependency, we just read from S3 and

51:19 write to S3. Okay, so, if I bring it up again here, it was the

51:26 option to say: okay, I want to transform the data which is on S3 and write it back to S3, which is kind of a classic use case. You can also do it with local files; I think you can do it easily with local files. And so for MotherDuck,

51:45 basically, what we're going to do here is still load the source from S3,

51:51 because, unless you're an active MotherDuck user and already have your data there, some of your data will be on S3, and we're going to load it into MotherDuck. So there are a couple of things here. The dbt incremental feature is not supported for external models, right? It's not, it is

52:14 not something I can easily do. So, external writes: most dbt adapters, again, write tables and they write views, that's what they know how to create. DuckDB is special: we know how to create literal files, Parquet files, CSV files, on file systems, in S3 and stuff, and so we need a way to handle that, and we

52:35 do that with the external materialization type. Yeah, totally. But the consequence is that it can't do the incremental stuff, at least not yet; I haven't figured out a good way. If you have a good way to do it, that would actually be amazing. I mean, it's a workaround; there is a pattern that might not fully support it, but basically, what we do here

52:50 is a full table write, yeah, but we write in a partitioned way, and this is what's being called in a post-hook, you see, with a specific macro, which is export partition data. Yep. And what I do there, let me close this one, is simply use the COPY command from DuckDB, which is really magic,

53:15 really, and here I just basically overwrite per partition. Yeah. So my data would look like, just to be clear, I'm going to have s3://my-bucket...

53:31 Good job. Yeah, so here, basically,

53:36 the partitions will look like this, and basically I always rewrite an existing folder, so it's an overwrite, which makes it kind of like appending new partitions.
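The post-hook boils down to DuckDB's partitioned COPY; a sketch, with illustrative bucket and column names:

```sql
-- Write the model out as Hive-partitioned Parquet, overwriting existing
-- partition folders for the processed window.
COPY (SELECT * FROM daily_downloads)
TO 's3://my-bucket/pypi/daily_downloads'
(FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE true);
```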

54:06 I think there is a corner case: if, for example, you have partitions keyed by a date plus some specific dimension, then, because I don't think DuckDB is deleting anything behind the scenes, it's just assuming a clean path, gotcha, okay, you could still have orphan partitions that weren't

54:18 rewritten. But in our case there is always specific data in

54:26 each partition, so that shouldn't be a problem. And so, the other way around, coming back to the dbt

54:38 project: if we use MotherDuck, which is kind of the prod account, we read from S3, and then there we can use the incremental feature in MotherDuck, yes, using the unique key, the load ID, right?

55:05 So basically, that's the workaround: if you want to use S3, the incremental feature is not supported, but you can still work per time window and overwrite specific partitions; and if you use MotherDuck, you can use the incremental feature.
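A sketch of the incremental setup on the MotherDuck side; model and column names are illustrative:

```sql
-- models/daily_downloads.sql (sketch)
{{ config(materialized='incremental', unique_key='load_id') }}

select *
from {{ source('pypi', 'file_downloads') }}
{% if is_incremental() %}
  -- only pick up rows beyond what is already loaded
  where download_date > (select max(download_date) from {{ this }})
{% endif %}
```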

55:31 The other thing to call out is the credentials; that's the discussion we had before. Here I'm also using a specific function from DuckDB. I don't need to install anything: it's part of an extension, but DuckDB will install and load it automatically, and it will load my AWS credentials, which are typically in ~/.aws/credentials, right? So this is what it looks like with the default values. So, yeah, that's about it. So, how does it look? Let's run it. You

55:59 basically do this. So if I run, for example, the dev target, here it will basically

56:07 load from S3 and then write back

56:11 to the partitions. So that's it. You see it's a table model, it's not an incremental model, but it overwrites a specific partition, which would be year 2023 and month April.

56:29 Yep. And for MotherDuck, it's just basically this, and here it will just be an incremental load, so it will check if there is data already there and just

56:43 append the new rows. So here it's the same time window I took, and you see it's relatively fast. Yeah, not bad, I'm okay with that, I can tolerate that, that's about as long as I can tolerate. Yeah. So here, basically, we have I/O, right, because we download data from S3 and then write

57:04 it back. So it's still really, really fast, if you count that, to process this data set. And as you're doing this, are you running this in a container on your laptop, or where are you running this? Yeah, I'm running it in a container, a dev container, on the laptop. On your local machine, so it's not like you're running this

57:22 on an EC2 instance, where it would run that much faster because it's right there. Yeah, yeah. So, one thing I want to talk about, it's already time, do you have five minutes? I've got a few minutes; my wife may walk in randomly, but it's fine, nobody cares.

57:39 How do you do unit tests today? How do I do unit tests? I mean, I use pytest and stuff like that. I mean, I think this is the new dbt feature they're very excited about for 1.8, but yes. Yeah, okay, so I think we're on the same page, I'm waiting, I'm sure we

57:56 are, for that, for the feature, like everybody, for some time now. Yes, exactly. In the meantime, you have a nice package, which is this one, the dbt-unit-testing package. Yes. Yeah, it's pretty popular. It is pretty popular. It seems like a pretty good package then, that's awesome. Yeah. So, for unit testing, basically what you do here is mock the data. So I

58:24 mock the source, so I'm not depending on S3, I'm not using the real source in SQL. So that's

58:31 what the package is doing. I'm not a super fan of that, I like to just use a simple CSV, but that's also okay, because here I have a complex struct, so it's also good to highlight that it's not that hard to create complex mocks. It's funny you mention it, because I see more and more folks doing

58:49 this; there was a good post about some Go developers synthesizing JSON payloads in DuckDB queries and returning them as endpoints, and I'm like, oh, what a great idea, that's just so clever, I love that. Yeah, exactly. So here, basically, I'm just specifying two rows, right? That's my source data. This isn't

59:09 that much worse than JSON, you know what I mean? I'm thinking about CSV, it's denser than that, but I can clearly see what's going on. I mean, this isn't bad, I don't know, I kind of like this. Yeah. And then, after... that's the cat, right? I can hear her from here. Yeah, she's very, very opinionated

59:27 and stuff. So, we call the dbt-unit-testing

59:34 expect block, and basically this is the data frame that I'm expecting, and I'm just checking the data: it's a date, which is this date, right, April 2nd, the DuckDB project, and we have two as the daily download sum, because we have two rows, so there should be two loads.
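A sketch in the style of the dbt-unit-testing package shown on screen; the model, source, and column names are illustrative:

```sql
-- tests/unit/daily_downloads_test.sql (sketch)
{{ config(tags=['unit-test']) }}

{% call dbt_unit_testing.test('daily_downloads', 'sums two rows into one day') %}
  {% call dbt_unit_testing.mock_source('pypi', 'file_downloads') %}
    select '2023-04-02'::date as download_date, 'duckdb' as project, '3.9.1' as python_version
    union all
    select '2023-04-02'::date, 'duckdb', '3.9.2'
  {% endcall %}
  {% call dbt_unit_testing.expect() %}
    select '2023-04-02'::date as download_date, 'duckdb' as project, 2 as daily_download_sum
  {% endcall %}
{% endcall %}
```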

59:59 And to run those tests, I'm just doing, I think it's the PyPI transform test target, so I have an entry point and I'm just running dbt test. But the big magic, why I wanted to spend five minutes on this, is that this is a real unit test: there is no internet connection, there is no dependency on a third-party service. Yes. It did not cost you anything, you ran that for free;

60:22 it didn't cost any money beyond the development time. Yes. And so you see here that it is super fast: we run the model in DuckDB,

60:33 against two rows, right, but compared to what we ran initially, with the I/O of S3 and so on, or MotherDuck, where there is a network cost, here there is none. That's where DuckDB shines. So, just to show you, for example, you can really play around with the model; this is how you can basically

60:57 develop and adapt things and see, okay, what happens if I do something here. So here I just changed the test to say three, so the test

61:08 would fail, yep, and as you can see, it tells me exactly what it expected, this is great, it's really good, yeah, that's awesome, on a specific column. So it says: no, this is the data set that was generated with the model you created, and this is what I was expecting. So, yeah, to

61:31 close the note, basically, on where DuckDB shines, let me bring

61:39 this back, and fix my square. Oh, you're fixed, wow, okay. Exactly, thank you, sorry about that. But yeah, just to end on the note that I think you can include DuckDB today in your workflow just for improving your developer experience and testing. If you develop against a complex data source

62:03 which is nested, you don't want to do multiple rounds of trial and error against the warehouse. Absolutely, it's insanity. Yeah, precisely, exactly. That's great, man. So, what are you looking forward to? We actually have a question here. Yeah, I saw that, I saw someone asking about duckdbt, which I talked about at DuckCon last year.

62:24 So, honestly, I feel bad, but I've not done much with it, unfortunately. This is my Buena Vista library, and I think I have run into the same problem that most people who, you know, idiotically try to take on the Postgres monster have run into, which is the pg_catalog. The pg_catalog is just

62:44 an absolute killer, and it's a killer for DuckDB: DuckDB is not even done implementing the full pg_catalog, and they're a lot further along than I was. So I'm kind of torn about it. I still am empathetic to people who

63:02 are in this situation where they want to be iterating quickly, back and forth between dbt stuff and looking at things in a BI tool or something, so they want to hook up DBeaver or one of these kinds of things to work very quickly against their data. That's

63:16 still a problem, and it's still an issue with the way DuckDB works and stuff like that. I am kind of noodling a bit on, since dbt itself has become much more of a Python library than it was before, whereas it used to very much be a CLI tool which you really couldn't run as a

63:36 library; since it is more library-like now, I sort of wonder if I should pivot the Buena Vista idea to creating a Streamlit app or something like that, for letting you interactively run your dbt queries, your stuff, against a single Python process that's running everything, you know what I mean? That

63:58 kind of thing feels closer to right to me, given the way the technology has evolved and stuff like that. And again, keeping with the theme of: how do we make these developer cycles as absolutely fast as possible, without requiring an internet connection until you need it? Because, I mean, you saw,

64:16 when you were running dbt in the container, there's still that stupid startup overhead, right? You still have that time to load the graph, and this is all wasted time that could be done better if it was just a running process that has dbt-duckdb and a nice HTTP API for you to

64:31 query stuff, or look at things to your heart's content. I kind of am thinking that's the way to go. I haven't actually started working on that in any way; I have lots of other things keeping me busy in retirement, I'm researching World War II and stuff. I don't have a

64:47 good answer here. No, but I think you mentioned that there are interesting things to play around with in the future. I mean, DuckDB has Wasm; I know dbt is relying on Python, right, but, yeah, who knows. I think there

65:06 is something to do also for non-technical people, like pure SQL users, yes, you know, in the browser, processing things in the browser with minimal setup. I mean, I completely agree, it's just such an unlock to be able to have that direct access in the browser. Again, it's like the theme I was describing: run it locally, run

65:26 it in your browser, run it on a Raspberry Pi, run it in a GitHub Action, run it in a million computers on the cloud, run it in Lambda, run it wherever the hell you want. That's the point. And we've never had that before, like, ever, and we're still kind of wrapping our heads around the

65:43 possibilities from it. Me and everyone else. Yeah, no, that's true. But we are glad to have you in the community, and thank you for those contributions, and just the work you've been putting out there as a passionate person, because, yeah, you're just having fun.

66:01 Only really doing things that I find fun, I guess, is the really critical aspect of it, yes. Like doing the COVID work: I really am happy to write software for people for free; it's going to meetings and updating Jira tickets, those are things you actually have to pay me an enormous amount of money to do.

66:20 The software part is free, I'm happy to do that, that's no problem. Cool. So, yeah, I'm putting up the project again: please have a look at the dbt-duckdb adapter project. There is also the repo where we, you know, played around with the PyPI data set, that you can check out; there is a good example of

66:41 how to read from S3, connect to MotherDuck, and also a unit test example. The example is great, that's really good, that's really nice. So yeah, play around with that, give us some feedback, and see you at the next Quack and Code, so, every other week, because I don't have time for

67:03 every week. But I'm happy to have you every week if you want. Again, there's a lot of World War II research to be done here, man, I've got to get back to it, of course. But I would be happy to have you again further down the line, to talk about, you

67:18 know, a couple of data years from now, which means in like three months, right? Three months, exactly, precisely. But anyway, Quack and Code every other week; you can check the agenda on the MotherDuck events page and our socials, and see you around. Thank you for joining. Cool, thanks so much, man.
