DuckDB & dbt | End-To-End Data Engineering Project (2/3)
2024/03/01

TL;DR: A hands-on tutorial building a dbt + DuckDB data pipeline from scratch — reading PyPI download data from S3, transforming it with SQL, writing partitioned Parquet back to S3, syncing to MotherDuck, and writing real unit tests with zero cloud dependencies.
Setting Up dbt with DuckDB
The project starts from a blank slate using dbt init with the dbt-duckdb adapter:
- Install with poetry add dbt-duckdb[md] (the md option includes MotherDuck compatibility)
- The dbt profile defines a local DuckDB path for dev and a MotherDuck connection string (md:) for prod
- Extensions like httpfs and parquet are bundled automatically — no manual installation needed
The key architectural difference from other dbt adapters: DuckDB runs in the same Python process as dbt. There's no remote server — all compute happens locally on your machine.
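To make that concrete, a profiles.yml along the lines described might look like the following sketch. The profile, database, and path names here are illustrative assumptions, not the tutorial's exact values:

```yaml
# profiles.yml (sketch; profile and path names are placeholders)
pypi_transform:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb      # local file; all compute stays in-process
    prod:
      type: duckdb
      path: "md:my_db"      # MotherDuck connection string; picks up MOTHERDUCK_TOKEN
```

Switching environments is then just `dbt run --target dev` versus `--target prod`.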
Exploring Data with the DuckDB CLI
Before writing dbt models, the tutorial demonstrates interactive data exploration:
- Use DESCRIBE to inspect the schema of Parquet files on S3, including nested structs
- Query S3 data directly from the DuckDB CLI — no import step required
- A VS Code trick: bind a key (Cmd+K) to send selected SQL from the editor to the DuckDB terminal for rapid iteration
The source data contains PyPI download statistics with nested struct columns for installer details, Python version, and system information.
Building the Transformation Model
The dbt model performs several transformations on the raw PyPI data:
- Convert timestamps to dates for daily aggregation
- Extract Python minor version — strip patch versions (3.9.1 → 3.9) using a CASE WHEN expression
- Aggregate to daily download sums grouped by date, system, distribution, version, and country
- Generate a load ID using md5 hashing of all column values for deduplication
- Use GROUP BY ALL — a DuckDB feature that automatically groups by all non-aggregated columns
The source S3 path is defined in a sources.yml file using dbt's env_var function, keeping all configuration external to the model.
Writing Partitioned Parquet to S3
A custom dbt macro uses DuckDB's COPY command to export data as partitioned Parquet files:
- Partitions by year and month (e.g., s3://bucket/year=2023/month=4/)
- Uses OVERWRITE_OR_IGNORE to replace existing partitions without appending duplicates
- AWS credentials are loaded via a dbt pre-hook using DuckDB's load_aws_credentials() function
- The macro is triggered as a post-hook in dbt_project.yml, automatically running after each model build
Conditional Hooks for Dev vs Prod
The dbt_project.yml uses target-based conditions to handle different environments:
- Dev target (local DuckDB): Materializes as a table, loads AWS credentials in a pre-hook, exports to S3 in a post-hook
- Prod target (MotherDuck): Materializes as incremental (using the load ID as unique key), skips credential loading and S3 export since data goes directly to MotherDuck
This means the same dbt project works for both local development and cloud deployment with no code changes — just a different --target flag.
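A dbt_project.yml expressing those conditions might look roughly like this. The project, model, and macro names are assumptions, and the exact hook signatures may differ from the tutorial's; the point is the `target.name` conditionals:

```yaml
# dbt_project.yml (sketch; names are illustrative)
models:
  pypi_transform:
    pypi_daily_stats:
      +materialized: "{{ 'incremental' if target.name == 'prod' else 'table' }}"
      +unique_key: load_id
      +pre-hook: "{% if target.name == 'dev' %}CALL load_aws_credentials(){% endif %}"
      +post-hook: "{% if target.name == 'dev' %}{{ export_partition_data('download_date') }}{% endif %}"
```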
Unit Testing Without Cloud Dependencies
The tutorial uses the dbt-unit-testing package to write true unit tests:
- Mock source data with SQL — define test fixtures inline, including complex nested structs using DuckDB's struct_pack
- Assert expected output — specify the exact rows and column values the model should produce
- Run instantly — tests execute locally in DuckDB with no network calls, no cloud credentials, and no cost
When a test fails, the output shows a clear diff between expected and actual results at the column level, making debugging straightforward. Tests run in milliseconds and can be included in CI pipelines at pull request time.
Syncing to MotherDuck
Deploying to production requires only a MotherDuck token:
- Set MOTHERDUCK_TOKEN as an environment variable
- Run dbt run --target prod — dbt connects to MotherDuck and uses incremental loading
- AWS credentials for reading S3 are stored in MotherDuck's secrets manager
- Data lands directly in a MotherDuck database, ready for BI tools and dashboards in part 3 of the series
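Assuming the target is named prod and the time-window vars from earlier, the deploy step boils down to a couple of commands like these (token value and var names are placeholders, not the tutorial's exact ones):

```shell
# Illustrative commands; adapt names to your profiles.yml and Makefile
export MOTHERDUCK_TOKEN="********"   # copied from the MotherDuck UI
dbt run --target prod --vars '{"start_date": "2023-04-01", "end_date": "2023-04-07"}'
```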
Transcript
0:00 dbt is a great and easy tool to develop production-ready data pipelines using SQL. It's a toolbox that helps you package, test and deploy your SQL pipeline. But the thing that is sometimes misunderstood is that dbt does not process the data; at the very end it's just a client that connects to, typically, a cloud data warehouse where the actual
0:22 computing is happening. So you always rely on this cloud dependency, and the development loop can then be painful. In this video we will process data using DuckDB and dbt. We'll go beyond the hello world through practical use cases, analyzing Python library downloads. We will understand how DuckDB and dbt can simplify your architecture, speed up some pipelines and enable you to finally write
0:47 the real unit tests without any cloud dependency, improving the overall developer experience. We'll also see some best practices regarding authentication to AWS S3 and handling incremental loads. All the source code is available on GitHub, but don't check out the solution yet. So if you are just getting started on your dbt journey, or you want to learn more about DuckDB, or you want to
1:11 know how many caps I have, I'll answer all these questions in this video, except the latter one, or maybe you just have to count. And by the way, this video is part of a series forming an end-to-end data engineering project, so if you want to check part one about ingestion of the data, getting your own dataset like
1:29 a big boy, check out the link in the description. It's the same guy, same sweater, just a different cap. Anyway, cap off, let's get started. So, quick recap of part one of the series: we ingested data from PyPI to get download statistics of a given Python library (here it was the DuckDB library), and now we
1:50 are going to transform this data to get all the metrics ready that we need for part three, which will be building the dashboard. That part will be more fancy than looking at the dataset, but that's for the next video. And for the sake of this tutorial, if you haven't done part one of the project and
2:07 you're too lazy to do it now, no worries, I've got your back. We have some simple raw data that you can use as input for the transformation pipeline, and it's available on a public AWS S3 bucket. So let's talk a bit about the dbt and DuckDB integration. We saw in the intro that dbt is a Python framework
2:25 that generates SQL and sends it over to a cloud data warehouse for computing. Typically you have your profiles.yml file configured with the settings of your cloud data warehouse, and you can quickly switch between different environments. The magic here is that for our use case we'll use DuckDB to read and write from AWS S3, and all the
2:48 processing will happen locally, directly within the same Python process. So yeah, I kind of lied in my intro, because I said that dbt does not process data, where here you could say the dbt process is processing data, but it's actually DuckDB under the hood which is doing all the compute. The second option for our transformation pipeline will be to use
3:11 MotherDuck, and because MotherDuck is DuckDB in the cloud, you have a seamless transition from working locally to scaling it in the cloud. Plus, for the dashboard part we'll need a SQL engine, because typically BI tools rely on those to fetch the data and feed the dashboard, so MotherDuck will be useful there. So let's get
3:31 into the code. So here we go: the first thing you want to do is to git clone the project, which is in the description. You're going to see more files, because me from the future already coded the solution, so you can also decide to delete the transformation folder. You should have a transformation folder at the root here; just scratch that one and
3:52 start from scratch, don't cheat. For the setup, I'm using VS Code and there is a dev container definition, so if you have Docker installed you can just reopen in a container and you're ready to go, with the Python environment set up. Otherwise you can check out the readme with the information. Also regarding
4:10 the setup, it's basically Python 3.11, and we also use Poetry for Python package dependencies. So let's start first by creating a transform folder. It's kind of a monorepo setup: we have the ingestion folder, which was the other part of the series to ingest the data, and here we're basically going to do our dbt project within the transform folder.
4:33 So the first thing we want to do is install dbt and the DuckDB adapter, and you can actually install that in a single line. So you do poetry add
4:47 dbt-duckdb[md]. The md option stands for MotherDuck, because MotherDuck at the time of this video only supports one DuckDB version, so it will install the DuckDB version which is compatible with MotherDuck. All right, now that it's done, we can go to our transform folder and do a dbt init. This will guide us to
5:08 kickstart our dbt project. We just basically have to specify the name of the project, and then it asks which kind of database we'd like to connect to, and because we installed the dbt-duckdb package,
5:25 you see we already have the option of DuckDB, so we can just pick this one and it will set up a configuration for us already. All right, so now if we go to our folder we see that a couple of folders have been created, with some examples of models;
5:45 they're just basically templates to get you started. We have a dbt_project file, we'll come back to that, and it usually also generates a profiles file. You see that it's created here in the root, which is within the Docker container; we'll use the profile directly in the repo for simplicity, so that we can version it and you can adapt it
6:06 further afterwards if you need to. So I just copy-pasted the profile that was generated in my home directory directly into the repo, at the root of the dbt project repo. As you can see, you have basically here the type, which is duckdb, and we have a path which is pointing to a DuckDB database if you
6:29 want to persist it, and you can also have other options. You can actually check out the readme of the official dbt-duckdb project to have more information on what you can do within that profile, but don't worry, we'll cover the basics and what we actually need for our use case: to read the data from AWS S3 and to use MotherDuck.
6:49 So with dbt we write SQL, right? And usually it involves a step of exploration where you use a SQL client, or you go to the online UI of your cloud data warehouse, you do a couple of queries, and when you're happy with it you write it within dbt. So let's create a simple SQL file. I'm going to call it
7:09 pypi_daily_stats, because I know that the raw data, which is one row per download of the Python library, contains a lot of details, and I'm probably going to do some aggregation per day, because I don't need to have this level of granularity. In data engineering it's often a compromise between how much granularity you want at the cost
7:32 of, you know, the size and complexity of the dataset. Now that we have our SQL file ready, we can actually start to play with the DuckDB CLI. I have the DuckDB CLI installed; you can check (again, link in the description) how to install the DuckDB CLI. If you're on
7:51 macOS like me, you can use Homebrew and just install DuckDB like this. And what I like to do is basically have the DuckDB CLI running in the terminal below, and on top, in VS Code, I send SQL queries down to it. This is a simple trick where I assign a given key to
8:09 a specific VS Code command (the key here is Cmd+K for me, you can pick anything), and it's the command that runs the selected text in the terminal. The other thing you want to grab from the repository, actually it's in the readme, is a URL which contains a
8:32 sample of data for the DuckDB library for a given period, April 2023. So just grab this S3
8:42 URL from the actual readme. Now we can start to inspect this. For example, you can do a DESCRIBE TABLE (I just did a Cmd+K) and, let me zoom out a bit, as you can see, we can see what the schema of our source data is, and here we have a
9:03 complex struct. Let me change the line mode so that we can see the full struct definition. You see, details here
9:16 contains installer, and then it's a struct with a name and version, then the Python version and other things. Cool. You can also get just a sample of the data. Let me go back
9:36 to the duckbox mode. So if I do just a
9:40 FROM (I don't need to do a SELECT star whatsoever), it's going to give me some sample of the dataset. So we see we have the country code, URL, year, month, etc., because
9:53 it's actually partitioned by year and month. That's also another interesting thing: the data is partitioned by year and month, and we can actually select all the Parquet files. Here I select everything because it's just a sample of data; of course, if you want to inspect a sample of a bigger dataset you can do some partition pruning within the path
10:14 directly. All right, so let's start to construct our query. Let me just put
10:21 back the schema. We probably want the
10:26 timestamp, but as we're mostly interested in how things are evolving across days, we can convert this one to a date, and let's call it download_date. So here are the fields
10:44 I'm mostly interested in in the dataset. One other thing I realized is that the Python version contains the patch version, which might be problematic if I want to know who is using Python 3.9; let's say I don't really care about 3.9.1 or 3.9.2. So I'm going to do a simple CASE WHEN there to convert it and
11:06 just keep the major and minor parts of the
11:10 Python version. All right, so now I have my base query with all the necessary fields I'm interested in, so let's run it
11:20 again and see the results. As we can see, we actually parse the Python version correctly, so there is no patch version, it's only minor, and all the fields that we are interested in (you know, the version, the system, distribution and the country, CPU), we have
11:42 those fields. So you see it's pretty straightforward to actually query a complex type; you just use dot notation to navigate to those fields. So now what I want to do is basically aggregate those fields per day, to have a total sum of downloads per day, because I don't need that much granularity in my data, and I also want to have a load ID
12:06 which is going to identify my rows in a unique way, and for that I'm going to use a simple, common technique in data engineering, which is using a hash of all the column values. All right, so what I've done is basically create a subquery where we have our initial query, and then I'm
12:30 creating my hash based on the column values here, and I'm creating my count as daily_download_sum, and I'm grouping by all. That's a beautiful feature from DuckDB: when you want to group by all the columns, you don't need to specify each of them, you just write GROUP BY ALL. So let's run it again just to confirm
12:51 that this one is
12:55 working. And as you can see, we have the daily download sum grouped by all those fields. So here we have Windows and the correct Python version, with only the minor version that we keep, and the load ID which is generated. Now that we have our base model working, the only thing that we need to do is templating, so
13:17 we're going to have some filter conditions, removing hardcoded values here like the source, and we can also add some unit tests to double-check that the model is working as expected. To remove the hardcoded source, we're going to create a sources.yml file, and in this file we're going to basically specify that it's an external source,
13:38 because it's sitting on AWS S3, and we specify the external location. For that we're going to use the env_var feature from dbt, where we just load an environment variable and it picks it up from there. Finally, the table's name is the alias that we're going to use in our model. The way that we reference that in our model
14:00 is simple: we use the Jinja source() template, specifying the external source and pypi_downloads, which is the name. So if I bring up the sources.yml file, which is sitting in our models directory, we see here that I'm basically declaring an external source, and this is the name of the table, whatever you choose. And so for the
14:26 location itself, we already use a dotenv file (I'm going to come back to that), so we're going to put that as an environment variable. Now, the next thing I want to do is add some filtering. What I always want is to run my pipelines against a time window. This is really useful for a specific window here, which
14:51 would be our partition, the download date, and also for backfilling. So we give a start date and an end date whenever we run the pipeline, and that's so useful for incremental loads, because we don't need to read the full dataset when we only want to process the new data. So let's go to our Makefile
15:11 now and add an entry
15:16 point to run this pipeline. So how does it look? We go into the dbt folder, and then just proceed with a dbt run. We specify the target; the target is basically which environment, and here for us it's: do we want to use the local DuckDB, or do we want to use
15:39 MotherDuck in the cloud? So one is dev and one is prod; you can name them differently. And we're going to pass some vars, as I said, the start date and end date. And finally, because we already have this in our Makefile, which includes a dotenv file and loads it, we can just add to our env file
16:00 the value of the S3 source bucket. So you
16:04 have an env template here that you can just copy-paste, and actually most of it you don't need. And let's check again our variable, which
16:16 was in the source, so we pick
16:22 up this variable and we're going to pass again
16:28 the S3 path. So now the beauty of it is that I can go run it. Let's take
16:36 my Makefile again for reference: I can run the make pypi transform target, then the dbt
16:45 target, which is dev, and then the start date and end date. One important piece of information is that the sample of data only contains data from the 1st of April to the 7th of April
16:58 2023, so if you pick other dates outside this range you won't load any data, at least if you rely on the sample in the public bucket that we offer for this exercise. But before we run this command, there is one thing we need to do: we're going to head over to our dbt_project.yml file and we're going to materialize
17:20 our model as a table, so it's a full load. How does that work? We basically specify the name of our model, which is pypi_daily_stats, and we specify the materialization type, which here is a table. So now we can run it and see if it's processing
17:41 correctly. And we see that a dev DuckDB database has been created, and if I go and launch the DuckDB CLI and open this database (I just need to specify the path of the DuckDB database), I can see all the tables. So these are all my models, and for example if I
18:05 take this one, I should have some data here. So that's great: we just loaded data from S3 using a dbt model, and we
18:14 passed a specific variable, which is the time window, and also the S3 bucket,
18:22 which is given as an environment variable. Now let's add some unit tests to double-check that this data is actually the one that is expected. To do so, I'm going to use a package called dbt-unit-testing. dbt has a package mechanism where you can install plugins, and note that at this point of this video dbt is also
18:44 revamping the unit-testing approach as part of 1.8, but it's not released yet, so I'm not going to address it in this video. But if you are from the future, check out our YouTube channel, there might be a dedicated video about that. Anyway, what we need to do to install a package is to have a packages.yml file
19:03 where we specify this package. So at the root of my dbt project I'm going to add the
19:12 packages.yml, and now I can run dbt deps, which is going to install this specific package. All right, it's done. So how do we use the unit-testing feature? First, what we need to do is replace this source() call with the package's equivalent, which looks pretty familiar. Why we do this is to be able to mock the source data: by
19:37 passing this, the framework knows that it can mock this object with a given dataset. How this package works is that we define the mock with SQL and we assert the result with SQL. So I create a file in the test folder which is called test_ followed by the name of the model that I want to test, and I've
19:59 done a couple of things. So this is basically my test (I can have as many as I want), and this is where I'm mocking the source. You remember the abstraction in the model, this thing? This is how we're going to be able to mock this specific source. And as I said, I mock using,
20:21 basically, SQL. So this is my source data: if you remember when we were exploring the dataset, we have two different struct columns, the file column, which is not really important, and then the details column, and here you can see that I can define a complex struct pretty easily with DuckDB using struct_pack. So here we
20:42 have this struct, which is nested, with the dist struct; this is basically one row, and I'm just adding another row using UNION ALL. You do a copy-paste
20:56 and you can change a couple of values. In my case here I'm changing the version, which is here Python 3.8.2 and 3.8.1, and because we are removing the patch, we should expect 3.8 as the result, right? Here we
21:16 aggregate by the date, which is the
21:20 2nd of April, and we should have two as the daily download sum. So this is basically my expected data. To
21:31 recap: we mock the source data by calling
21:35 this, we declare it using SQL, and we then
21:39 declare the expected data, again using SQL. So let's try running this. The way that we run tests is just by using dbt test, so let's use our Makefile again and do a copy-paste of this one (I mean, almost a copy-paste), and we're going to add tests: here, instead of doing a dbt run, we do a dbt test and see if it
22:10 works. So we have a couple of things passing and an error, which is weird, but the reason is that we didn't remove the example models, which have validations within the schema.yml. So here you see we have a couple of tests that we can define in this file, like unique and not null tests. So let's just
22:34 remove all those things in the models folder, because we actually don't need them. So I just deleted the examples and the schema, and I recreated one at the root of models and just specified the model names here. So let's run it
22:51 again. And as you can see, it requires of course our start date, because that's a requirement. For the sake of simplicity, I'm just going to pass them to my dbt test command. And be careful, this depends on the fixture data, right? Because our fixture, our test data, is actually within that range of dates, and so
23:15 my model will filter based on the range,
23:20 and hopefully the test will pass. So let's try again.
23:32 So now it passed. Let's change just some value to check: for example, if I change .1, I would have one row which is on 3.9 and another which is on
23:49 3.8. So you see now it says it's failing, because the expected data doesn't match, and as you can see, this is what you receive from the model with the mocked data (one row of 3.9 and one row of 3.8, with one and one), and this is what was expected. So we have a really clear
24:12 feedback loop on your unit testing, and you can really quickly build complex models. Everything here is happening locally, I'm using DuckDB and it's really super fast, so no cloud dependency, and that's how we should do unit tests: without any third-party cloud service dependency. And because they run locally, they can run on your CI at pull
24:37 request time or wherever you have them in your process. So now we have our model ready, we have our tests ready, but the data is only staying locally in the dev DuckDB, and I would like to export it to S3 or sync it to MotherDuck. So let's go back to our profiles: here we can leave the dev as it is,
25:02 but to connect to MotherDuck, which will be the prod, instead of passing a file path we just pass a connection string, which is md colon. Let's take a moment to visualize what's happening when using the dev target versus the prod target. When using the dev target, basically your dbt client contains DuckDB and the compute
25:28 is happening locally. That means that you're reading data from S3, and then potentially afterwards also writing to S3; all the compute is happening locally. When you're using MotherDuck, you act like a client sending SQL, and MotherDuck is going to read your S3 bucket from there and compute on the cloud side, which means that you will
25:52 need to provide to MotherDuck an AWS credential that has the access needed to either read or write to S3. So, coming back to our profiles, that means that the requirement to sync to MotherDuck is a MotherDuck account. You can get it for free on motherduck.com, and once you create your account you can add your AWS secret right
26:12 in the UI, or you can also use the CLI: connect to MotherDuck and create the secret. And as you see here, you can also grab your token, which will be used to connect to MotherDuck from the dbt client. If you check the env template, we have basically MOTHERDUCK_TOKEN, which is the expected name of the variable, and
26:33 you just pass your MotherDuck token. All right, so to sync to MotherDuck, to recap, we need a MotherDuck token, and that's basically it to run this tutorial, because the S3 bucket is public, so you actually don't need to pass any AWS credential. But if you have a private bucket with your own data that you built in part one of the tutorial,
26:54 the ingestion, then you will need specific credentials to read this S3 bucket that you own. And for running DuckDB, where it's processed locally, it's kind of the same story: for reading, the data bucket is public for this tutorial, but we also want to write to S3 here, that was the goal of this tutorial, so let's fix that.
27:18 So what I just did is create a macro, export partition data, and in this macro I basically use the COPY command from DuckDB. You see that I'm using the COPY, selecting basically my table, and I also get an environment variable for the S3 path (again, no hardcoded S3
27:42 bucket for the output data), and I'm writing the data partitioned per year and month, so it's Hive partitioning. The data would look like something like this: my bucket, slash, year
27:59 =2023, and then month=4,
28:05 because we have data from April 2023. Also, an important thing is that we always overwrite a specific partition, so if you rerun against basically the same data
28:18 set, it would not append the data, we would just overwrite the existing dataset. Note that we use S3 here as part of this macro, but you can easily adapt this macro to write locally, as CSV or as
28:33 Parquet. So now the question is: how do we actually trigger this macro? That is happening in the dbt_project.yml. We're going to add a post-hook which is calling the macro. The this.name
28:48 here refers to the current model, so it
28:52 will basically pass pypi_daily_stats, which is going to be used here as the table name (the model equals the table name), so we don't need to hardcode it again; it's a dbt variable. And here I'm just passing, because that's part of the macro, the date column that is going
29:15 to be used for partitioning. If you remember, I'm creating a year column and a month column based on this date column, to be able to partition the data as we saw earlier. So, table: we always reprocess the full table, but because we only process the data based on a time window, it's kind of a
29:37 workaround strategy to support incremental loads, or appends, to your S3 bucket, because incremental load is not supported for external models. In that case we are still talking here about our dev, where we compute locally and export to S3. Those parameters at the moment are applied both when we run DuckDB locally and when we
30:05 run in the cloud. One thing we forgot: because we run the COPY command against our S3 bucket, we need to load the AWS credentials for DuckDB to have the authorization to write to this bucket, and for that we're just going to use a pre-hook. For that we don't really need to create a specific macro, we're
30:27 just going to directly call load_aws_credentials. This is a DuckDB command which uses the AWS extension, and when you run it, it's going to load your credentials from, typically, the .aws folder in your home directory. By default it just loads the default profile; you can specify a specific
30:52 AWS profile from which you want to load the credentials. So here we have the pre-hook, which is loading the credentials that will enable us to authorize the writing, and then we have a post-hook with the macro, which is using the COPY command to copy the current model, which is in dev.duckdb, to S3. All right, now if we
31:14 rerun everything, we're basically going to have the same results, but the data afterwards is going to be exported to S3. I have a small error; it's simple, I haven't set the transform S3 bucket, so
31:30 I'm going to go to my env file and provide this field. We rerun it, and it's working. By the way, I had a small syntax issue: as you can see here, I need to use the Jinja template to tell dbt that I'm actually calling a macro, which is different from here, where I'm just directly calling a SQL command to DuckDB.
31:54 Now the last remaining thing is that we don't need to have those hooks when we target prod. Why is that?
32:05 Because when we target prod, we target MotherDuck, if you remember, and we don't want to copy the data to S3, we want to actually load it directly into MotherDuck, as I explained in the intro, because that's going to be useful when we build our dashboard and we connect to MotherDuck. So you can actually put conditions here so that,
32:25 based on the target, you do certain pre- and certain post-hook actions. There we go. Let's inspect what I just did. First, regarding materialized: I explained that the incremental feature from dbt, to load only the new data, is not supported for external models, but for MotherDuck, for internal models, that would work. So basically I specify that the materialization
32:49 should be incremental when we connect to prod, which is MotherDuck, and otherwise we use table with the overwrite-partition strategy, and I specify the unique key, which is needed for the incremental feature. Finally, regarding the pre-hook, I'm just saying that if you work locally using the dev target, we need to load the AWS credentials to be able to write to S3,
33:15 but if you're working with MotherDuck, that's not needed, because the secret is stored on the MotherDuck side. Finally, for the post-hook, it's kind of the same story: I don't need to export if it's on prod. All right, let's run against MotherDuck this time. So I'm using the same make command, I specify the
33:35 start date and the end date, and then basically the prod target, which is MotherDuck. I have my MotherDuck token specified in an environment variable, and that's basically it, because the AWS credentials are already stored on the MotherDuck side, and the data source is public, right, it's the S3 bucket we saw earlier. So let's just run this one.
34:03and you can see here it's going to be an incremental load so it's going to only uh load data uh that is new and now we
34:11 can actually check. Instead of the CLI you can go to the UI and check whether the table is there; this is the UI we saw earlier. But I want to show you the CLI, because I like using the CLI too. If you want to connect to MotherDuck from the CLI, you do an
34:31 ATTACH 'md:'. If you don't provide your token, you'll be redirected to a web page that authenticates you. I have my MotherDuck token as an environment variable, so it connects directly. Now I can see all my cloud databases, and my table should be in my_db, which is the default database, under
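From the DuckDB CLI, that sequence looks roughly like this (the `my_db` database and the table name are placeholders):

```sql
ATTACH 'md:';   -- with motherduck_token set, connects directly;
                -- otherwise it opens a browser page to authenticate
SHOW DATABASES; -- lists your cloud databases
SELECT count(*) FROM my_db.pypi_daily_stats;  -- check the model's table
```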
34:55 the name of the dbt model. Let's check the data... and the data is there. That's basically it. So we have a couple of input files; let's go over them again. We have the model that we built through exploration. Then we abstracted the source: we have a WHERE condition that takes a
35:16 start date and an end date, so we always process within a time window. The sources file defines the actual path to the source data, using an environment variable. We use a specific package for dbt unit tests, and the tests
35:33 are pretty straightforward: we mock the source using SQL (it works nicely, even with complex types), then I define what is expected and which model is used, and evaluate. This all runs locally with no cloud dependencies, so it's fast and easy to iterate against your model. Finally, we have in our dbt_project.yml
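As a sketch of what such a test can look like, assuming the Equal Experts dbt-unit-testing package (the model, source, and column names here are illustrative):

```sql
{{ config(tags=['unit-test']) }}

{% call dbt_unit_testing.test('pypi_daily_stats', 'counts one day of downloads') %}
  {% call dbt_unit_testing.mock_source('pypi', 'file_downloads') %}
    -- mocked rows are plain SQL, so DuckDB structs work too
    select timestamp '2023-04-01 10:12:00' as timestamp,
           {'name': 'pip', 'version': '23.0'} as installer
  {% endcall %}
  {% call dbt_unit_testing.expect() %}
    select date '2023-04-01' as download_date, 1 as daily_download_sum
  {% endcall %}
{% endcall %}
```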
35:56 file a couple of things: the materialization is incremental when we run against MotherDuck, and otherwise table,
36:07 and we specify the pre-hook and post-hook when running locally, because we want to load the credentials and export the data to S3. This is done using a macro
36:19 and DuckDB's COPY command to write to S3 with a partition, which is our year
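The heart of that macro is DuckDB's partitioned COPY. A minimal sketch, assuming `year` and `month` are derived as columns and with a hypothetical bucket path:

```sql
COPY (
    SELECT *, year(download_date) AS year, month(download_date) AS month
    FROM pypi_daily_stats
)
TO 's3://my-bucket/pypi_daily_stats'
(FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE true);
```

`OVERWRITE_OR_IGNORE` lets a re-run replace existing partitions instead of failing or duplicating them.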
36:26 and month here. All right, what's next? Well, in this video we started a brand new dbt project with the DuckDB adapter. We created our first model reading data from S3 using DuckDB's capabilities, and we had the option to either push to S3 and compute locally, or use MotherDuck to leverage cloud compute and the incremental load
36:49 feature. We also saw how easy it is to create a unit test, and that we can create real ones without any third-party cloud dependency: we mock the data and we run DuckDB in the same dbt process, which can be local, in CI, or wherever you want. I mean, look at this, it's beautiful. Anyway, I'll see you in
37:10 part three, where we'll quack some dashboards using a BI-as-code tool. Bye!
FAQs
How do you set up a dbt project with DuckDB for local data transformation?
Install the dbt-duckdb adapter (which includes DuckDB) using a package manager like Poetry. Then run dbt init inside your project folder and select DuckDB as the database type. Configure your profiles.yml to point to a local DuckDB database file, and you can start writing SQL models that read from and write to sources like AWS S3, all processed locally without any cloud dependency. Learn more in our getting started guide.
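A minimal profiles.yml with both targets might look like this (the project name, file path, and database name are placeholders):

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ./dev.duckdb   # embedded DuckDB file on disk
    prod:
      type: duckdb
      path: "md:my_db"     # MotherDuck; reads motherduck_token from the environment
```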
Can you write unit tests for dbt models using DuckDB without cloud dependencies?
Yes, one of the biggest advantages of using DuckDB with dbt is that you can write real unit tests that run entirely locally. With a dbt unit testing package, you can mock source data with SQL (including complex nested structs), define expected output, and validate your model logic, all within the same local DuckDB process. Tests can run on your CI pipeline without any third-party cloud service dependency. For more on the dbt integration, check our ecosystem page.
How do you move data from a local DuckDB database to MotherDuck cloud?
In your dbt profiles.yml, set the production target to use a MotherDuck connection string (md:) and provide your MotherDuck token as an environment variable. When you run dbt run targeting prod, dbt sends SQL to MotherDuck, which reads from your S3 source and stores results in the cloud. MotherDuck also supports incremental loads, so you only process new data on subsequent runs.
What is the difference between running dbt with DuckDB locally versus with MotherDuck?
With the local dev target, dbt embeds DuckDB directly in the Python process. All compute happens on your laptop, reading from and writing to S3 or local files. With the MotherDuck prod target, dbt acts as a client sending SQL to MotherDuck's cloud compute, which reads from your S3 bucket using its own network bandwidth. Local mode is ideal for development and testing, while MotherDuck provides cloud-scale compute, incremental loading, and works as the SQL engine for downstream BI tools.
How do you export DuckDB data to S3 as partitioned Parquet files?
Use the COPY command with partitioning. In the dbt project shown in this video, a custom macro uses COPY ... TO with an S3 path, specifying partition columns like year and month. This runs as a post-hook in the dbt_project.yml file and uses DuckDB's AWS credential loading (CALL load_aws_credentials()) to authorize writes to private S3 buckets. The overwrite_or_ignore option ensures re-running the pipeline overwrites existing partitions without duplicating data.