DuckDB & Python | End-To-End Data Engineering Project (1/3)
2024/02/02 — Featuring: Mehdi tackles an end-to-end data engineering project: getting usage insights for a Python library using Python, SQL and DuckDB! This is the first part of the series.
Transcript
0:00 What's up everybody! In this video we are going to kill two birds with one stone — wait a minute, what? Indeed: we'll build an end-to-end data engineering project using Python and SQL, with some DuckDB magic, and even publish the result through a dashboard, in order to get valuable insights from PyPI — and yes, that's how it is pronounced. If you're interested in
0:22 learning about data engineering and you want to embrace best practices when building Python data pipelines, we'll take a step-by-step approach to guide you through all the useful libraries for schema validation, unit testing, logging and other neat tricks. It's beyond the "hello world" project — plus, you'll get a ready-to-use project to get insights on the
0:41 usage of a given Python library. This video is part of a series, and in this first part we'll talk architecture and dive into the first step of the project, which is the ingestion pipeline for the source data. That being said, if you are completely new to DuckDB, you may want to first check out our video "DuckDB for
0:59 beginners". Finally, all the code and resources are available in the description — but please, don't cheat. All right, let's talk about the architecture of our project. First, the data source: PyPI is where we need to get the data from. It is the repository where all the Python libraries live, and we can get a lot of statistics regarding each one of them — useful if you want to
1:20 monitor the adoption of your Python project, or understand how people are using it: for example, do you have more Linux users or Windows users? Since a couple of years, the PyPI team has made the download logs available directly in Google BigQuery, so we can get the raw data from there. This is great, but the problem is that this table is huge,
1:42 so be careful when you start running queries, as you may explode your credit card pretty quickly. In our pipeline, we are only going to fetch the data relevant for us, on a specific Python package and timeline, using the partition date. Then we will transform the data into one relevant table that contains all the metrics we need for our dashboard.
2:04 We'll keep the modeling simple: we have just one serving table, so we will have just one table to feed our dashboard at the end. The transformation part will be done using pure SQL, dbt and DuckDB, and finally
2:17 we'll use Evidence, which is a BI-as-code tool, to create our dashboard using SQL and Markdown. The fun thing with this stack is that you can run everything locally — a modern data stack in a box. But in the real world, you basically want to have remote storage somewhere for sharing and access control, and here I'll give you two
2:40 options: either AWS S3, or MotherDuck. This is actually a must-have, at least for the publishing part, as BI dashboarding tools rely on a query engine to fetch the data. Something I won't cover in this series is how to host your Python runtime and schedule these jobs; you can choose any Python runtime and orchestration tools
3:01 like Airflow, Dagster or whatever you have — and if you want a video dedicated to this, leave a comment, and while you're down there, hit the like button and subscribe; that's basically right next to your mouse. So now that we understand the high-level pieces, let's focus on the ingestion part, as this is the main topic of this video. As we saw, we
3:21 will need to get the data from BigQuery, so the first thing you want to do is create an account on GCP — just head to
3:30 console.cloud.google.com. There is a free tier, so don't worry. The PyPI data is public, but we still pay for any query that we run, so please be mindful about those — although all the queries that we're going to do will basically be covered by the free tier, so no worries about that. So let's head over to BigQuery, and we're going to
3:54 search for "file_downloads", which is the table that we're looking for, and click on "search all projects" so that it also searches the public datasets.
4:07 And as you see, I have the public datasets here, I have PyPI, and there are a couple of tables. The most interesting one is basically this one, `file_downloads`, which gives you one single row for each download of a Python
4:28 library. And as you can also see right
4:33 here, the table is kind of
4:39 big. So the most important thing to remember when you query this data is to use the partition column, and you
4:50 can see here that it's partitioned by day, and the partition field is the timestamp. So when you query by a specific timestamp, you're only going to fetch and compute the data relative to that
5:10 timestamp. The other filter we want to use, of course, is to filter per project, because for this specific project I would like to know how DuckDB usage has been going in the Python world — so we'll filter on the `duckdb` Python project. So let's get started: let's open a query tab, and you see that
5:34 by default it pre-fills a specific timestamp, and you see, if I
5:42 query everything from that
5:47 day, this is going to process 200 gigabytes, which is already quite a lot. We don't need that, so I'm going to put another condition here, to
6:05 filter per
6:12 project. All right, so when I filter on a specific project — here, `duckdb` — you see I'm down to 7.3 gigabytes, and I guess
6:22 for smaller projects you're going to have even less data; DuckDB is a pretty popular project, so there is still some data to retrieve. But that's basically it: if I run this query, I get the data with the schema that we saw. The data processed, as we saw, is around 7.3 gigabytes, but
6:42 the result itself is actually really small, because if I do a count on this, you only have that much, and if I save the result, it takes nine
6:58 bytes — not megabytes, not kilobytes: nine
7:02 bytes of space. It's a really small dataset. This is just to show you that the source data is pretty big, but at the end, the useful information you get is rather small — and that's still the raw data; we're going to do GROUP BYs and so on afterwards for other metrics, and that will be even smaller.
7:22 All right, now that we have our BigQuery query ready, let's build the Python pipeline. Here I'm just opening VS Code — there is actually a dev container definition in the repository I'll be referencing in the description. I'm using dev containers, and they are pretty handy: if you
7:44 open the command palette in VS Code, you can spin up pretty much any environment quickly, just by picking one of the distributions they have available. Here, of course, I already have a definition, so I'm just going to do "Reopen in Container", but you can also create your own Python setup. If you already have Python installed, that's all we'll
8:06 need — we work with Python 3.11. The reason we don't work with 3.12 is that sometimes libraries take some time to update, and 3.12 is still a bit new, so let's give the dependencies some time to move to 3.12 —
8:27 3.11 is pretty safe. All right, so now I have my Docker dev container running —
8:35 and of course I didn't mention it, but as a requirement you need Docker to be installed; you can use Docker Desktop. As you can see, I have a Python 3.11 definition ready, corresponding to the Docker image here. Another thing I can mention about the dev container is that we're also mounting some
8:59 credential paths, typically for Google Cloud and for AWS: Google Cloud will obviously be for fetching the data from BigQuery, and AWS would be for pushing the data to S3. There are already some post-create commands that I've put there just as a template, and as you can
9:22 see, I disabled the virtualenv, because we are working in a dev container, so there is already isolation within that — a virtualenv is a bit overkill in that sense — and we install the Python
9:38 packages. But we don't have any Python packages yet, so this is where we're going to start. We start with `poetry init` — name of the project, description, yada yada, you can just hit enter. Then we're going to need to add `duckdb`, for sure; we're going to need
10:06 `google-cloud-bigquery`; the Google Cloud auth Python package, which is used to authenticate against Google Cloud; and there is also a weird thing
10:22 when working with BigQuery: they have a new API endpoint which is way faster, and for some reason, if you don't install the
10:36 `google-cloud-bigquery-storage` Python package, your pipeline will be much slower, because it uses an older API endpoint. There is a GitHub issue around that somewhere — at least this is how I debugged it; I'll put it in the description or here on the screen if you want to know more about this thing — but it's pretty
10:57 handy to know that you need to install those to use the latest endpoint,
11:03 and also `pyarrow`, I think, which is a dependency. Next to that, we're also going to use `pandas`, because BigQuery can retrieve the data directly as a pandas DataFrame, and there is good interoperability with DuckDB. We're also going to use `fire`, which is a really interesting library to bootstrap a CLI — you'll understand that later. I
11:28 like to use `loguru` for logging — it's pretty handy and has a lot of options. Finally, one of my favorites is `pydantic`, which
11:38 is used to define models: specifically, here we are fetching the data from BigQuery and we want to validate the schema, so that if something goes fishy or sideways with the source data, we directly get a proper error. All right, so let's hit enter — this will add everything to our pyproject.toml, which was created just before, and it's going to
12:05 also create a lock file, and we're probably ready to go. Two things we'll need, at least from a dev point of view: `pytest` for unit testing, and also
12:19 `ruff`, which is basically a linter and a formatter — so if you use Black or pylint,
12:26 that's basically your replacement. It's super fast, because it's actually written in Rust with Python bindings, and it's become quite popular. So let's first create a folder called `ingestion` and add a file in
12:41 it, and call it `pipeline.py`. What I usually like to do when building a data pipeline is to have a main file that contains the main flow, with all the other components in separate files — that way, I can understand the pipeline just by looking at one file. I'm also going
13:02 to pass all the parameters in, to parameterize my pipeline directly at this point. It's important to be able to parameterize your pipeline, because here, for instance, you're going to be able to process a given time window — so we need a start date and an end date for a run. Remember, our query is using the timestamp column,
13:25 so this is the one we're going to use. There are other things you'll want to be able to inject into your pipeline, so that it's all driven from the CLI and you can run your pipeline in different ways without changing any hardcoded value. So let's start by writing just the main function, and we're going to
13:52 say "hello
13:57 pipeline" and run this function. The other thing I like to do is to have a Makefile with a couple of
14:07 commands, so that most of the time, the only thing I need to run this pipeline is `make` this or `make` that — for running, for testing, for formatting. So let's write a first target and call it
14:23 `pypi-ingest`: what we do here is basically a `poetry run` calling Python,
14:33 targeting our module.
14:37 So now, if we want to run this pipeline, we just have to do `make pypi-ingest`, and we have a "hello pipeline". Now that we have this main entry point, let's start by creating another file called `bigquery.py`, and in this file we're going to handle the BigQuery client, run the
15:00 query, and maybe other things. So here is the first function, to get the BigQuery client: basically, we just try to
15:09 read the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. For this, specifically, you're going to need to go to IAM and create a service
15:20 account that has this kind of roles — BigQuery Data Editor, BigQuery Job User, BigQuery User; you might not need all of them, but anyway, they don't have write access to this public dataset, so that's not a big deal. Then you create the credentials — it's a JSON file — and
15:42 you're going to put this JSON file in a specific location: I put the JSON file
15:49 into `.config/gcloud`, because when you're using SSO (single sign-on) or temporary credentials, those end up there too. But here, for this specific Python pipeline, we need a service-account JSON. You could also decide to use SSO, but it requires slightly different access, because here, you see, it's "from a service account
16:13 file" — and it's actually easier to just handle it this way, because that's how we're going to use it when we deploy to the cloud. So now that we have our client, we need a function to run the given query and fetch the result — that's this function. As you see, I'm returning a pandas DataFrame, and I'm just
16:31 keeping a small counter to measure how long my query is taking, within the logs, and I'm logging which query I'm sending — useful when debugging, to see whether the problem is in my pipeline or in my source query. And finally, we have a function to build the query that we will need, and you see, here
16:52 I'm basically writing the same query that we did earlier: I'm filtering on a specific project, querying the open public dataset, and specifying a start and an end timestamp. Then let's go back to our pipeline file and import the functions we just wrote.
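The query-building helper described here can be sketched as a pure function (the dataset path is the public PyPI table on BigQuery; the exact string formatting in the repo may differ):

```python
# Sketch of the build-the-query helper; dates and project are parameters,
# mirroring the two filters shown in the BigQuery console earlier.
def build_pypi_query(pypi_project: str, start_date: str, end_date: str) -> str:
    return f"""
        SELECT *
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE project = '{pypi_project}'
          AND timestamp >= TIMESTAMP('{start_date}')
          AND timestamp < TIMESTAMP('{end_date}')
    """

query = build_pypi_query("duckdb", "2023-04-01", "2023-04-03")
print(query)
```

Because it is a pure function of its inputs, this is also the piece that is easiest to unit-test later.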
17:15 So we import those functions, and I'm going to write the final step: the result is a DataFrame, and I want to get the BigQuery result — whoa, the autocomplete is already super smart. But this is not going to work yet: I need a specific Google project. As I told you, you need to create a Google account and
17:40 a Google project, and this is where the compute billing is going to be associated — so that's the project name I need. I also need to set up the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, so for the sake of testing, I'm just going to set this variable —
17:59 funny thing, it auto-completed with something from some Alexander; my name is not Alexander. All right, so now if we go back to our `make
18:15 pypi-ingest`, as we can see with the logs we put there using Loguru, we have the query being executed, the timing, and that's basically it. We could actually print the DataFrame just to see the result — so let's run it again, and we see we did retrieve the data. All right, so now we have a bunch
18:40 of hardcoded values, so let's take care of that. I'm going to create the
18:44 `models.py` file, where I'm going to use Pydantic to define the model of our
18:51 parameters. Here I create what is basically a data class, but as a Pydantic base model, with all those parameters: the start date and end date, which will be passed to our query; the project we
19:08 want to filter the data on in the query — so that's the WHERE condition; the table name, which is the output table name we're going to use later; the GCP project, which was hardcoded earlier; the timestamp column — it might change, but it's not really needed there, I just put it in. Then we have
19:28 the destination: once we ingest the data, we're going to be able to write it locally, or to the cloud — to S3 or MotherDuck — or all at the same time, so this is just a list that can take `local`, `s3`, `md`. Then the S3 path, if we
19:46 are putting the data on S3, and the AWS profile. I don't think I'm missing anything here; we'll see afterwards. Basically, the idea is that when I run the pipeline — which today is just `make pypi-ingest` — I want to be able to pass certain parameters to that pipeline.
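Put together, the parameters model described here looks something like this — the field names are my reading of the video, and the repo's exact names may differ:

```python
from typing import Optional
from pydantic import BaseModel

class PypiJobParameters(BaseModel):
    start_date: str                      # passed into the BigQuery WHERE clause
    end_date: str
    pypi_project: str = "duckdb"         # project to filter on
    table_name: str                      # output table name, used later
    gcp_project: str                     # GCP project billed for the query
    timestamp_column: str = "timestamp"
    destination: list = ["local"]        # any of: local, s3, md
    s3_path: Optional[str] = None
    aws_profile: Optional[str] = None

params = PypiJobParameters(
    start_date="2023-04-01",
    end_date="2023-04-03",
    table_name="pypi_file_downloads",
    gcp_project="my-gcp-project",        # placeholder
)
print(params.destination)
```

Because it is a Pydantic model, any malformed parameter (a missing start date, a non-list destination) fails loudly at startup instead of deep inside the run.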
20:09 That way, I can decide how to run it. So now that I've defined my model, the beauty of it is that if I come back here, I can pass this
20:20 model. Two things I want to do first: add some tests on this model, just so you get a grasp of how things work, and maybe add two Makefile targets, one for tests and one for formatting, as we haven't done any of that yet. The format target will just call Ruff, and so now if I
20:38 do `make format`, Ruff has formatted
20:42 three files. And another one now for tests, which is just going to call pytest on a `tests` folder that we need to create; then we add a
20:57 `test_models.py`. What I actually want to test is that I can use my model to pass the different parameters, and that the query string is correctly built. In other words, we have our `PypiJobParameters` model, and I can pass it as a type, which is pretty nice —
21:19 so those are my parameters, and you can see that if I hover over it, I can actually see the types: the PyPI project should be a string, this one is also a string, the start date, the end date. Those are basically the parameters I'm going to pass through my CLI. But let's first add a test
21:42 for this function. Here we're just creating the model, and we have an expected query: when we call the function, passing the model, we expect this query string to be
22:03 generated — and we're good. All right, so now let's come back to our pipeline. What we did so far: we added a model for the job parameters, and we have our first function that gets the data from BigQuery. So now let's refactor this to take the PyPI parameters model. All right, it's done — but now we kind of need to parse
22:23 what we enter in the CLI and convert it into this model, right? You have different options in Python to create a CLI and parse those things — Click or Typer, for example — but the cool thing here is that with Fire, you can generate a CLI automatically based on this Pydantic model. So let me show you
22:46 that trick. I just added Fire, and basically what we do here is take any key-value arguments and pass
22:56 them to the model — to convert them into a model — and then we invoke the function. So Fire is going to be able to detect anything that comes after `ingest`:
23:09 `--start_date`, yada yada. If we compare with our model here — yes, we have `start_date` — this is the format it's expecting, then the value, and Fire is going to parse it for us directly, in one line of code; we don't need to parse anything. And if we add any
23:29 variable, it's going to be directly available within the CLI — it's really a beautiful thing. Now that we can pass all this through the CLI, let's add it to our Makefile.
23:44 So now I've added all the values from my job-parameters model over there, so that I can customize the query directly — if I want to run the query against a specific timeline — and pass all the other needed variables, instead of hardcoding them in the project. A thing that I like to do as well is to work with
24:06 an environment file — please add it to your .gitignore; we'll do it right now: I'm adding it to the .gitignore, and the .env is ignored, so I can safely create the .env
24:20 and put any sensitive information in it, and it won't be committed. And coming back to our Makefile, what the `include` does is basically load all the variables from the .env and export them for the session — so we can start to fill our .env with all those variables. All right, so now I've pushed the needed variables back there — the start date and
24:44 end date, `duckdb` as the PyPI project — these are
24:49 the fields that are going to construct my BigQuery query, right? And then I have some information about the credentials and the S3 path: I'm already getting ready for pushing to S3. So now let's run our pipeline again and see if everything works as expected, meaning that we have
25:10 the environment variables passed to our command line here, and this is interpreted as a Pydantic model, which is passed to our code — and done, still working. So far we've done the hardest part, to be honest, because we introduced Pydantic to manage the job-parameters model, and we get the data as a pandas DataFrame; now there is
25:36 only the fun part left, which is using DuckDB. But before that, what I want to do again is validate the source data against
25:46 a Pydantic model. Here we just have the pandas DataFrame, and we don't check any types in the result that we're getting from the query — and who knows if the source data is going to change over there; we don't want our pipeline to be messed up because of that. Pydantic, again, to the rescue: here we're going
26:04 to create a model that defines the source data. A quick hack: go to BigQuery again, look at the schema source over here, copy it in an ugly way, paste it to your best friend, and say "here is my BigQuery table schema, please create a Pydantic model". Again, creating
26:32 a Pydantic model is pretty straightforward; the challenge here is that we have a couple of complex fields, and that's where creating the Pydantic model can get a bit complicated. So this is the `FileDownloads` model, and as you can see, I'm actually nesting other Pydantic models, specifically for the complex types.
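The nesting idea can be sketched with an illustrative subset of the schema — the real `file_downloads` table has more fields, and the names below should be checked against the public schema:

```python
from typing import Optional
from pydantic import BaseModel

class File(BaseModel):
    filename: Optional[str] = None
    project: Optional[str] = None
    version: Optional[str] = None
    type: Optional[str] = None

class FileDownloads(BaseModel):
    timestamp: Optional[str] = None
    country_code: Optional[str] = None
    url: Optional[str] = None
    project: Optional[str] = None
    file: Optional[File] = None  # nested model for the complex STRUCT column

row = FileDownloads(project="duckdb", file={"filename": "duckdb-0.10.0.tar.gz"})
print(row.file.filename)
```

Pydantic coerces the nested dict into a `File` instance automatically, which is exactly what makes the STRUCT columns manageable.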
26:59 Here, you see, this is another Pydantic model, the file details — those are the two fields that have complex nested types. So what can I do now? Basically two things: first, we can create a new test for this model, but we can also create a validation function that validates the DataFrame we're getting against that model. The way
27:22 it's going to work is that I'm validating a given DataFrame
27:29 against any Pydantic model and trying to match them — it's pretty straightforward. What I actually do is loop over each row of the DataFrame, pass it to the model, and see if it validates. So now, back to the tests — that's another interesting thing: ideally, you need fixtures for your data.
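The row-by-row validation described here can be sketched as follows — the function name and error handling are assumptions about the repo's actual helper:

```python
import pandas as pd
from pydantic import BaseModel, ValidationError

def validate_dataframe(df: pd.DataFrame, model: type) -> None:
    """Raise if any row of `df` does not match the Pydantic `model`."""
    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except ValidationError as e:
            raise ValueError(f"row {i} failed schema validation: {e}") from e

class Download(BaseModel):
    project: str
    country_code: str

frame = pd.DataFrame([{"project": "duckdb", "country_code": "US"}])
validate_dataframe(frame, Download)  # passes silently
```

Looping over rows is not the fastest approach for huge frames, but here the fetched result is small, so clarity wins.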
27:49 And here, the pandas DataFrame that we are validating is already in memory, right? It goes to BigQuery and then gets loaded in memory. You could try to create a pandas-specific schema, but it's actually pretty hard — it's not easy to get at how that DataFrame schema is represented internally. Anyway, there is an easy
28:09 way using DuckDB. What we're going to do here in our pipeline is directly copy the
28:22 result locally using DuckDB: I'm going to create the DuckDB connection, and now I can basically just
28:33 run a simple COPY command. So what we actually do is use the COPY command — this is the query we run — and we can reference the DataFrame object directly in it, no need to declare it, and then export it as a CSV. Let's run it.
28:56 We run again, querying the data from BigQuery, and I have a simple CSV now.
29:04 What I can do is just take, let's say, the first three lines, and call it `sample.csv` — so I now have a sample of the data,
29:21 and what I can do to validate the model I created is use DuckDB
29:30 for that. So I created a fixture — a fixture in pytest is basically a test dataset, here for our use case, that we're going to use across different tests. What I do here is create another in-memory DuckDB, and I create a table — this is
29:51 basically how DuckDB will interpret it when you load the DataFrame in memory, so you see here I have a complex struct. Once I create the table, I load the sample of data we just created into DuckDB, and I
30:08 convert it to a DataFrame — so I have a DataFrame ready with the right sample. Again, creating such a sample dataset with pandas directly is a bit harder, especially for nested fields, because pandas doesn't make it easy to declare a schema for nested fields. Here, you see, it's pretty straightforward to understand what's going on, but if you
30:32 read the CSV directly in pandas, it's harder, because you'd need to specify those nested fields, and that's a nightmare — I've tried. So we have our data now, and writing the test functions is pretty straightforward: I created two tests, one to validate the data.
30:55 So I'm calling the `validate_dataframe` function we created earlier, which expects a DataFrame and a Pydantic model — and this should be good, because that's directly our fixture, right? And this one is good. I'm also testing when the data is not correct: I'm changing the data with an invalid entry, you know, here
31:19 an integer, and that should raise an error — so the test should be positive when an error is raised. Let's do `make test` again, and we're good: all three tests pass. So now, coming back to our pipeline, we can actually call the `validate_dataframe` function, passing the `FileDownloads`
31:43 model, and we need to import the DataFrame as well. So now we have the step in place: our pipeline will stop if the schema of the source is not what we expect. All right, what we're going to do now is create another file called `duck.py`, where we're going to put our
32:06 DuckDB helpers. This function, for example, is to explore the data locally, so we can create a helper for that. We're going to need a couple of functions here: the first thing is to load the DataFrame into DuckDB, and from there we can export it locally, or to the cloud — to S3 or
32:25 other targets. So, first things first: loading the data from a pandas DataFrame. Then we can create another function to export to S3. To export to S3, what DuckDB is going to do is download a specific extension automatically, but what we do need to do is specify the AWS profile, and the way
32:49 you do it is to use this function, which is called `load_aws_credentials`, and specify your AWS profile — AWS has different profiles under
33:01 `.aws/config`, and you should have credentials over there. Note there is a current limitation as of this video — unless you're watching this from the future, a couple of months after this release, when it might be solved — it doesn't support SSO very well: you need a specific AWS key, secret and token. But you can actually generate
33:21 those from an SSO login, and for that I've put another Makefile entry point you can run: what it's going to do is export those temporary credentials and put them in a file at the specific location DuckDB expects. But that's only if you run with SSO; otherwise you can also fetch credentials from
33:43 an IAM user, like with GCP, and have those credentials over there. Another helper we can write is one to write to S3 from DuckDB, and again, that's pretty straightforward. The nice thing is that you can also use Hive partitioning: here you see that I'm using the same COPY command that we did for local export, but I'm going to
34:06 specify an S3 path, and within that I'm
34:10 also going to create two columns, which are going to be my partitions — the year and the month, based on the timestamp. That means that my data will be published under, let's say, my PyPI prefix, and then something
34:31 like year=2023, month=12, and then my data.
34:37 parquet, for example. That's pretty handy, because again, you can leverage partition filtering when querying over S3, instead of loading the whole dataset. Plus, Parquet is compressed and it's typed, so you know — fewer schema problems.
35:02 All right, two other helpers I'm going to write directly now are for connecting to MotherDuck. To connect to MotherDuck, you can head over to app.motherduck.com and create a free account, and once you're connected, you can go to settings and fetch your service token, which is right there — it's going to be used to authenticate to MotherDuck. Connecting to MotherDuck is like installing an extension and setting a parameter in DuckDB: we just load
35:24 and install the MotherDuck extension, as you can see here, we set the MotherDuck token, and then we can use the ATTACH command to authenticate to MotherDuck and push our data over there. Finally, the last function we're going to need is the one that writes to MotherDuck.
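That connection sequence can be sketched as plain DuckDB SQL — the token and database name below are placeholders, built here as strings so the sequence can be inspected without a MotherDuck account:

```python
def motherduck_statements(token: str, database: str = "my_db") -> list:
    """Statement sequence to authenticate DuckDB against MotherDuck."""
    return [
        "INSTALL md",                      # the MotherDuck extension
        "LOAD md",
        f"SET motherduck_token='{token}'",
        f"ATTACH 'md:{database}'",         # authenticates and attaches the remote db
    ]

for stmt in motherduck_statements("<service-token>"):
    print(stmt)
```

In the pipeline, each statement would simply be passed to the DuckDB connection's `execute` in order.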
35:43 What we're trying to do here is make sure that our pipeline is idempotent, meaning that if you run it multiple times, with different states on MotherDuck, it still works. That's why we have a CREATE DATABASE IF NOT EXISTS, a CREATE TABLE IF NOT EXISTS, and then, for inserting data — because we're always just going to
36:06 insert new data — the fastest and easiest way is to do a DELETE on the specific timestamp range that we're going to use, and then an INSERT of the new data for that specific timestamp range. All right, so we wrote all the DuckDB functions — that was pretty fast, and actually easier this way.
36:29 So now what I'm going to do is just refactor this so that, based on a specific parameter — if you remember, we had this one, `destination` — a specific condition says: write to CSV, write to S3, or write to MotherDuck, using the functions we wrote. So now we have a specific section checking the parameters — as a reminder, that's
36:51 the Pydantic parameters model, right? So `destination` is a list of strings which can be `local`, `s3` or `md`, and here, basically: if it's `local`, we copy to a local CSV; if it's `s3`, what we do is load the credentials — remember, we just call `load_aws_credentials` with the profile — and we can write to S3 with that;
37:18 and this is the one for writing to MotherDuck. It is checking the MotherDuck token, by the way, so you should definitely add a valid MotherDuck token to your .env file, because
37:34 this is what's going to load it and make it available to the Python process. All right, we're done, so let's play a bit with the CLI now. We can do `make pypi-ingest` and sync to all the different places — local, S3 and MotherDuck — and everything works as expected. And of course you can play with the different
37:54 parameters: here I'm taking a specific month — still working on the DuckDB library project, but you can pick any Python library project; just change the value in your .env file and it will be loaded directly by the make CLI commands, or you can run the command and specify those values manually. But that's the beauty of it: now we can
38:19 basically run this pipeline with different parameters. I'm running the BigQuery query — these are the parameters, the correct dates I've put in my .env file — and you see that I'm syncing to all three. Locally, I just have a file here that appeared, so the write to CSV was successful; the
38:41 write of the data to S3 also went fine, and the write to MotherDuck — and we're done. So let's check quickly on S3: this is the bucket, and under the PyPI file_downloads prefix we have this folder, and you see it's partitioned correctly by year and month — because, if you remember, I filtered on two days in
39:06 April 2023 — and we have the data right there. Let's check MotherDuck: on MotherDuck it should be this database, and this table, and again, I can query
39:22from there and there that there too so we spent quite some time actually on the first part of the the pipeline here uh and actually it's a best practice I encourage you to do for any other pipeline where you can define a model for your CLI and basically just fire so that you don't even need to parse uh
39:44those parameters. And get in the habit, over time, of refactoring any hardcoded value that you can see in your pipeline, thinking: I may want to run this pipeline in a different way. So that's pretty handy. We've seen here the power of DuckDB in two places. The first one was regarding tests: we created a fixture for our data source, and it's
40:12pretty straightforward to create a table defined with a schema there, basically load the CSV so it respects that schema, and have a DataFrame ready to be used in different places within our test functions. The second place we saw DuckDB is for exporting to different places, so here to S3 and here to MotherDuck,
40:37directly in the cloud. If you were not to use DuckDB (and I'm challenging myself here) but Pandas directly, you would have to install different libraries in Python, and it's not straightforward to get there. So once the data is loaded in DuckDB, it's really easy to push it forward. Be mindful that we didn't do any transformation here, because it's part of the
40:58learning path of the series; we'll do a transformation layer afterwards. But you could also directly do some SELECT statements, compute in memory, and write the result back. All right, let's wrap up. In this video we built a Python pipeline using interesting libraries like Pydantic for schema validation, Fire for the CLI,
41:17Loguru for logging, and of course DuckDB to easily ingest the data and store it any place we want, be it local, in the cloud on S3, or MotherDuck. Now we have the raw data ready to be queried, and in the next video we'll dive into the transformation layer using dbt, SQL, and DuckDB. Now get out of here and get quacky!
FAQs
How do you build a Python data pipeline with DuckDB and BigQuery?
The pipeline starts by querying PyPI download data from Google BigQuery using partition filters and project filters to control costs. The results are loaded into a Pandas DataFrame, validated with Pydantic models for schema correctness, and then ingested into DuckDB. From DuckDB, you can export data to local CSV, AWS S3 with Hive partitioning, or MotherDuck in the cloud using simple SQL copy commands.
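A minimal sketch of the cost-control part of that query; the public table name is real, but the helper function, its parameters, and the exact filter style are assumptions, not the project's code:

```python
def build_pypi_query(pypi_project: str, start_date: str, end_date: str) -> str:
    """Build a BigQuery query over the public PyPI downloads table.

    Filtering on the `timestamp` partition column and on `project`
    limits the bytes scanned, which is what keeps the bill down.
    """
    return f"""
        SELECT *
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE project = '{pypi_project}'
          AND timestamp BETWEEN TIMESTAMP('{start_date}')
                            AND TIMESTAMP('{end_date}')
    """


# Actually running it needs GCP credentials, e.g. (illustrative):
# df = pandas_gbq.read_gbq(build_pypi_query("duckdb", "2023-04-01", "2023-04-03"),
#                          project_id="your-gcp-project")
```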
What Python libraries are recommended for a DuckDB data engineering project?
The project uses Pydantic for schema validation of both CLI parameters and source data, Fire for automatically generating a CLI from Pydantic models without manual argument parsing, Loguru for structured logging, Pandas for BigQuery data retrieval, and pytest for unit testing. For development, Ruff is a fast linter and formatter written in Rust that replaces tools like Black and Pylint.
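Roughly how Pydantic and Fire fit together in such a pipeline; the model and field names below are illustrative, not the project's exact ones:

```python
from typing import Literal

from pydantic import BaseModel


class PypiJobParameters(BaseModel):
    """Typed pipeline parameters; bad values fail validation before anything runs."""
    pypi_project: str = "duckdb"
    start_date: str = "2023-04-01"
    end_date: str = "2023-04-03"
    destinations: list[Literal["local", "s3", "md"]] = ["local"]


def main(**kwargs):
    # Validation happens here, instead of in hand-written argument parsing.
    params = PypiJobParameters(**kwargs)
    print(f"ingesting {params.pypi_project} -> {params.destinations}")


if __name__ == "__main__":
    import fire  # e.g. python pipeline.py --pypi_project=polars --destinations=s3
    fire.Fire(main)
```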
How does DuckDB simplify exporting data to S3 and MotherDuck?
DuckDB makes multi-destination exports straightforward with its built-in extensions. For S3, you load AWS credentials, then use the COPY command with Hive partitioning to write Parquet files organized by year and month. For MotherDuck, you install the MotherDuck extension, set your token, and use the ATTACH command to authenticate. Once connected, you can create databases and insert data using standard SQL. Without DuckDB, achieving the same with Pandas would require installing separate libraries for each destination.
Why use Pydantic for schema validation in a data pipeline?
Pydantic gives you two things in data pipelines: it validates CLI parameters as typed models so your pipeline is fully parameterized without hardcoded values, and it validates the schema of source data fetched from BigQuery to catch upstream changes early. For complex nested fields common in BigQuery tables, Pydantic models can nest other models to handle struct types. This is much simpler than trying to define nested schemas in Pandas.
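The struct-handling idea can be sketched like this; the field names loosely mirror the public PyPI table's `file` struct but should be treated as illustrative:

```python
from typing import Optional

from pydantic import BaseModel


class File(BaseModel):
    """Nested model for a STRUCT-typed BigQuery column."""
    filename: Optional[str] = None
    project: Optional[str] = None
    version: Optional[str] = None


class FileDownload(BaseModel):
    """One row of the downloads table; the nested `file` struct is validated too."""
    timestamp: str
    country_code: Optional[str] = None
    file: File


# Pydantic coerces the nested dict into a File instance and validates it.
row = FileDownload(
    timestamp="2023-04-02 12:00:00",
    country_code="US",
    file={"filename": "duckdb-0.10.0.tar.gz", "project": "duckdb", "version": "0.10.0"},
)
```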