Bringing DuckDB to the Cloud: Dual Execution Explained

2024/06/28

Bringing DuckDB's analytical power to the cloud requires more than just running it on a server. While DuckDB excels at processing data on a local machine, the dynamics of data analysis change when dealing with cloud-scale data, collaboration, and shared resources. This shift introduces challenges around security, concurrent access, and performance, which is where a purpose-built cloud architecture becomes essential.

MotherDuck, a modern cloud data warehouse, is built on DuckDB to solve these challenges. It extends DuckDB’s capabilities without forking the open source project, creating a hybrid system that intelligently balances local and cloud computing. In a conversation on the Quack & Code live stream, MotherDuck founding engineer Stephanie explained the architecture and the innovative query model, known as dual execution, that makes this possible.


The MotherDuck Architecture: Beyond Hosted DuckDB

A common misconception is that MotherDuck is simply DuckDB hosted in the cloud. The reality is a more sophisticated system designed to overcome the limitations of using a locally optimized engine in a distributed environment. MotherDuck’s architecture is built on three key components and a core philosophy of not forking DuckDB.

Instead of maintaining a separate version, MotherDuck leverages DuckDB’s powerful extension system. This allows MotherDuck to add new capabilities at multiple layers, including the SQL parser, optimizer, and storage interface, while staying current with the latest open source DuckDB releases. This tight integration means users benefit from the rapid innovation of the DuckDB community almost immediately.

The architecture can be broken down into three main layers:

  1. The Client Extension: This is how MotherDuck integrates with the DuckDB ecosystem. Whether a user is working in the CLI, a Python script, or a JDBC connection, a simple ATTACH command connects their local DuckDB instance to the MotherDuck cloud. This extension is even used to run DuckDB in the browser via WASM for the MotherDuck UI, enabling client-side processing to reduce latency and cloud compute costs.
  2. The Compute Layer: In the cloud, queries are processed by containerized DuckDB instances, fondly called "ducklings." These compute resources are scaled based on user needs, providing the necessary CPU and memory to handle complex analytical queries on large datasets.
  3. The Storage Layer: DuckDB's native file format is optimized for a single writer on a local file system. This model is not well-suited for the cloud, where multiple users need to read and write to the same database concurrently. To solve this, MotherDuck implemented a differential storage system that maps a logical database file to append-only snapshot layers in cloud object storage. This design is cloud-friendly, enabling efficient in-place updates and forming the foundation for features like database sharing and time travel.
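To make the client extension in item 1 concrete, here is a hedged sketch of the connection flow. The `md:` prefix and the `motherduck_token` environment variable are the documented connection mechanism; the database name is illustrative:

```sql
-- In any DuckDB client (CLI, Python, JDBC, WASM), with the
-- motherduck_token environment variable set to a valid service token:
ATTACH 'md:';                 -- attach your MotherDuck databases
ATTACH 'md:my_analytics_db';  -- or attach one cloud database by name (illustrative)
SHOW DATABASES;               -- cloud databases now appear next to local ones
```

The same `md:` connection string works from every client, which is why no changes to DuckDB's client libraries were needed.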

Enabling Collaboration with Database Sharing and Secure Credentials

This unique architecture transforms DuckDB from a "single-player" tool into a collaborative "multiplayer" platform. One of the most significant advantages is database sharing. Instead of emailing SQL scripts or passing around large database files, team members within the same organization can grant query access to their databases. This streamlines collaboration, ensuring everyone works from a consistent and up-to-date version of the data.
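As a sketch of the sharing workflow (the `CREATE SHARE` statement follows MotherDuck's documented syntax; the database, share, and table names are illustrative, and the share URL is a placeholder):

```sql
-- Owner: publish a share from an existing cloud database
CREATE SHARE sales_share FROM sales_db;  -- returns a share URL

-- Teammate: attach the returned share URL and query it
ATTACH '<share-url>' AS sales;
SELECT count(*) FROM sales.orders;
```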

Security is another critical aspect of any cloud data platform. MotherDuck provides a centralized Secret Manager, co-designed with the DuckDB team. Users can create persistent, encrypted secrets for accessing external data sources like AWS S3, Google Cloud Storage, or Delta Lake. Once a secret is created in MotherDuck, it can be reused across different clients and sessions without needing to expose or reconfigure credentials on each local machine.
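A hedged example of that workflow: the `CREATE SECRET` syntax is DuckDB's, with `IN MOTHERDUCK` selecting MotherDuck as the secret storage; all key values and paths below are placeholders.

```sql
-- Store an encrypted, persistent S3 secret in MotherDuck's Secret Manager
CREATE SECRET my_s3 IN MOTHERDUCK (
    TYPE S3,
    KEY_ID 'AKIA...',   -- placeholder
    SECRET '...',       -- placeholder
    REGION 'us-east-1'
);

-- Any client or session connected to MotherDuck can now read the bucket
-- without reconfiguring credentials on that machine:
SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet');
```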


Understanding Dual Execution

The most innovative feature of MotherDuck's architecture is its dual execution model. This is a hybrid query execution strategy where the optimizer intelligently decides whether to run parts of a query locally on the client or remotely in the MotherDuck cloud. The primary goal is to minimize data movement and leverage compute where it makes the most sense.

The process was demonstrated with a query joining two tables. When both tables reside in a MotherDuck database, the query is executed entirely in the cloud. An EXPLAIN plan for this query reveals that all operators, from the table scans to the final join, are marked as remote. The only local operation is downloading the final result set to the client.

The real power of dual execution becomes apparent when a query involves both local and cloud data. In the demonstration, one of the remote tables was copied to a local Parquet file. The query was then modified to join this local file with the table that remained in the cloud.

The EXPLAIN plan for this new query showed a mix of local and remote operations. The scan of the Parquet file and its associated processing happened locally, leveraging the user's machine. The scan of the cloud table happened remotely on MotherDuck's compute. The system then efficiently transferred only the necessary intermediate data to complete the join. This hybrid approach avoids needlessly uploading the local file to the cloud or downloading the entire remote table, resulting in faster and more efficient queries.
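The plan inspection described above uses a plain `EXPLAIN`; the table and file names below are illustrative:

```sql
-- Hybrid query: a local Parquet file joined against a cloud table.
-- In the plan, the Parquet scan shows up under local operators while
-- the cloud table scan runs remotely on MotherDuck's compute.
EXPLAIN
SELECT c.region, sum(o.amount) AS total
FROM read_parquet('orders_local.parquet') AS o
JOIN my_cloud_db.customers AS c ON o.customer_id = c.id
GROUP BY c.region;
```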


Taking Control with Manual Execution Overrides

While MotherDuck's optimizer is designed to make the most efficient choice automatically, some situations require manual control. For this, users can use the md_run parameter within scan functions like read_parquet() or read_csv().

By setting md_run = 'local', a user can force the scan of a remote S3 file to be executed on their local client. This involves downloading the data, which might be desirable if the data needs to be processed near the user's location or if they prefer using local credentials. Conversely, setting md_run = 'remote' forces the operation to run in the MotherDuck cloud.
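Sketched usage, based on the parameter described above; the named-argument form and the bucket path are illustrative:

```sql
-- Force the S3 scan onto the local client (the file is downloaded first)
SELECT count(*)
FROM read_parquet('s3://my-bucket/big.parquet', md_run = 'local');

-- Force the same scan onto MotherDuck's compute, next to the data
SELECT count(*)
FROM read_parquet('s3://my-bucket/big.parquet', md_run = 'remote');
```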

This control is especially useful for performance tuning. A direct comparison showed that querying a 2GB Parquet file on S3 was significantly faster when executed remotely in MotherDuck (25 seconds) compared to a local DuckDB client that had to first download the data (36 seconds). By pushing the computation to the data's location, MotherDuck minimizes network I/O and delivers results more quickly.


A Symbiotic Future for Local and Cloud Analytics

MotherDuck's architecture is a thoughtful extension of DuckDB's core principles, adapting its local-first power for the demands of the cloud. The dual execution model provides a flexible and efficient bridge between a user's laptop and cloud data, optimizing workloads in a way that pure-cloud or pure-local systems cannot.

This relationship is symbiotic. As the largest production user of DuckDB, MotherDuck continuously pushes the engine to its limits, uncovering opportunities for improvement and contributing enhancements back to the open source project. This collaboration ensures that both DuckDB and MotherDuck will continue to evolve, offering a powerful and seamless analytical experience that spans from local exploration to cloud-scale production.


4:51 Hello everybody, hope you're doing well, and welcome to another episode of Quack & Code. Today I'm actually not the only MotherDucker going live; I'm going to be joined by a colleague, and we're going to talk specifically about MotherDuck today, and what makes it so special compared to DuckDB. We're

5:18 going to talk also about dual execution, how it works a bit behind the scenes, and what you can do when you work with the DuckDB client and MotherDuck. So we're going to have a hands-on part. But so far so good, and I want to welcome Stephanie to the live stream. Stephanie, how are you doing? Hello, hi Mehdi, I'm doing well,

5:41 thank you. Thank you for inviting me to your Quack & Code live stream, I feel very honored to be here. Nice. Where are you joining from, by the way? I'm joining from the MotherDuck New York office, so this is a small view of our little living room area, I guess, right here. Nice. I guess

6:04 there are some ducks around? Yeah, there are some ducks around for sure, and they're hard at work. Yep. So, can you tell us a bit... you are a founding engineer, so you've been there since the beginning, basically, right? Can you tell us a bit more about your

6:25 background, where you were working before, and how you came to MotherDuck as a founding engineer? Absolutely. I was at Google BigQuery before coming over to MotherDuck. I was primarily working on the BigQuery open source client libraries; my primary focus was the Java client library for BigQuery as well as the

6:50 JDBC and ODBC connectors. I was

6:55 there for about three years, and then Tino and Jordan approached me about MotherDuck and wanted to see if I was interested in working on maybe the client API surfaces of MotherDuck and the more ecosystem aspect of it. I was very excited to work on something different, so I joined

7:19 and started as a founding engineer. The interesting thing is that when I joined, DuckDB, as you know, already had amazing coverage for their client API surfaces, right? They already have JDBC, ODBC, the Python client, the Java client, all of these things, so there was not really a business need for us to change that.

7:43 So instead I started working on the query execution part of things, and therefore the dual execution that you have mentioned, because that is something we wanted to start to explore and build out. So I've actually been working on the database team as a result for the last two-something years. And so,

8:06 can you tell us a bit about what the database team actually does? Because we have DuckDB open source, so what's the main mission? Yeah, for sure. The database team is sort of the middle part of the whole stack at the company. Essentially we interface with the client API team on

8:28 the surface area, and also the infrastructure team, who built the foundation of MotherDuck: infrastructure, metadata management, etc. We're in the middle, and we work very closely with DuckDB to implement this query execution model that allows a query to be executed in MotherDuck with the power of DuckDB, essentially.

8:56 So we're kind of in the middle over there. That's great. So maybe we can dive a bit into the main feature differences between DuckDB, embedded and standalone, and MotherDuck itself, because MotherDuck is

9:19 running DuckDB behind the scenes, right? But there is this layer, basically,

9:25 that you mentioned. And maybe, I don't know what you think, I have the schema from the CIDR paper, so

9:38 maybe we could take that one to illustrate it. What do you think? Yeah, for sure, absolutely. All right, wait, let me... here. So basically we have

9:54 DuckDB here on the back end of MotherDuck, and maybe I can give quick context for the audience: this is actually the CIDR paper. Can you explain to us the conference and the context in which this paper was written? Yeah, absolutely. I would say there are probably the three most

10:18 prominent database conferences out there globally. One is SIGMOD, which just happened two weeks ago; that is a very large conference. I was there, actually, about two weeks ago, and there were about 800 database researchers and database AI researchers at the conference. Then another one is VLDB, which is also

10:43 quite large and very established. And CIDR is the third well-known, well-recognized conference within the database community, but it's much smaller: it has fewer than 100 people each time, and everyone participates in all of the tracks. This paper was presented at the CIDR conference, I think last year. Okay,

11:09 and so this basically means that...

11:13 so that was a paper that gives information about the MotherDuck architecture, a bit of what we're going to talk about today. It was from last January; it feels like a decade ago, right? I guess for you also. Yeah, for sure, things are moving fast. But coming back to the main

11:35 components that basically make MotherDuck a layer on top of DuckDB, can you walk us through that? Yeah, absolutely. MotherDuck is a pretty complicated system at this point; however, it can mostly be summarized as three key components. The first component is the client layer, on the left-hand side

11:59 over here in the diagram, as you can see. This client layer can be any type of client API; here's an example with Python, but we also have lots of other people connecting to MotherDuck through Java, for example (the JDBC connector is a popular one), and another popular client is the web browser. The browser

12:21 is actually really cool, because we essentially run DuckDB in the WASM build, so it leverages the memory in your browser, and that's really amazing. Yeah, just to pause for people: if you go to the MotherDuck UI and you run a couple of queries,

12:43 you're going to fetch some data from the cloud, and then if you do a pivot or other operations, that will use your DuckDB which is in the browser, in WASM, to avoid any network traffic. This is pretty new, because, as you said, it leverages your

13:06 browser memory, but it also avoids extra compute cost in the cloud, right? Yep, exactly. So, MotherDuck in the

13:19 web browser and the client. And another callout: if you go to our documentation, you can see documentation for the different drivers that we support. Yep, exactly. So essentially this client layer is also where MotherDuck plugs in. This is

13:48 also the part that we call the client extension. MotherDuck very strongly depends on DuckDB's extensibility. If you have some familiarity with DuckDB, I'm sure you've heard about DuckDB extensions; it's quite similar to Postgres, actually, where they have extensions, and you can use DuckDB as a library,

14:12 essentially. At MotherDuck we do not fork the open source DuckDB; instead we leverage this extension ecosystem. We have one extension running on the client side, our MotherDuck client extension, and then we have a server-side extension which runs on the compute layer, the middle part of this diagram,

14:37 essentially. And I'd just like to add some color, because I wanted to talk about this and I think it's nice that you bring it up. You mentioned that

14:49 we do not fork DuckDB; we rely on the same thing. I think that's a big difference from the traditional managed databases that we know, like Postgres, MySQL, Elasticsearch, which are available on different cloud providers but often rely on a fork,

15:13 which means that when there is an upgrade on the open source project, it usually takes weeks, months, or sometimes a year. For Apache Spark forks, for example, I know that following up with the different versions takes quite some time, because there is tight integration done for a specific cloud provider. But

15:37 here the power of not forking makes us able to upgrade much more quickly. Is that correct? Yeah, absolutely, I would say so. Yeah, so that means that for GA, we went GA the second day after DuckDB released their 1.0. So that's a pretty

16:06 nice differentiator. So, you talked about plugging in the extension, how we basically connect to MotherDuck. Actually, we can do a quick hands-on, just to explain it so people can picture it, rather than only at the end. Yeah, for sure. So, the connection:

16:29 if I'm running a DuckDB CLI (you can install it on macOS with brew install duckdb), basically

16:40 you have a table function which lists extensions, this one. Are you

16:52 using this table function sometimes? Yes, I do, actually, because it gives me a good idea of which extensions I have loaded on my system at my disposal. If an extension is installed and loaded, that means I can exercise the functionality in that extension. And I can also check some other parameters, the

17:15 extension versions and things like that, so it's helpful to inspect. Yeah, so

17:22 that's basically exactly that: you can see all the extensions, and a lot of them get installed automatically based on certain functions and hooks. DuckDB has certain hooks so that if you query, for example, an S3 file, so you do FROM 's3://...' and then

17:44 your file, for example, like this, what's going to happen is that it's going to load the httpfs extension to read remote files, and it's also going to load the Parquet extension, and you don't have to do anything. But

18:04 otherwise there is a mechanism to install and load an extension, right? Yes, absolutely. You could also explicitly say INSTALL extension name and then LOAD extension name; that will bring in whatever extension you're interested in. DuckDB is very smart about this, because it allows them to keep
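The commands discussed in this exchange are standard DuckDB SQL; the `spatial` extension below is just an example:

```sql
-- Inspect which extensions are installed and loaded
SELECT extension_name, installed, loaded
FROM duckdb_extensions();

-- Explicitly install and load one extension
INSTALL spatial;
LOAD spatial;
```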

18:28 their core binary very small, because not everyone, for example, is going to analyze geospatial data. The spatial extension that DuckDB offers is actually quite a large one; it has quite a few dependencies. So, to reduce the build time of their core project, they have

18:51 basically put all of these other functionalities in these out-of-tree extension repositories, so the core functionality is unimpacted: very small, very nimble, easy to build, portable, and easy to use. Yeah. Also a callout: DuckDB is working on community extensions that can be installed

19:16 directly. There's going to be a framework, because the current ones are the officially supported extensions from DuckDB Labs, and if you want to create and install your own, you usually need to add a URL or be able to download that extension locally, but it's going to be much easier

19:36 in the future; it's actually already there but not advertised yet. Anyway, coming back to MotherDuck: MotherDuck is an extension, right? Right, exactly. So we have it actually here,

19:53 yeah, we are listed as an officially supported extension, which means you don't need to install anything. To load MotherDuck you can run an ATTACH command, like this. The ATTACH command is used to attach a database, a .db file for example, but here, if I

20:17 do this, I'm going to attach and connect to MotherDuck, and because I have my

20:23 environment variable stored as the MotherDuck token, which is the only thing required to connect, I can now show the databases, which are the cloud databases that I showed earlier in the UI. And that's pretty much it to connect to the cloud. Yeah, and this is

20:47 also why we didn't need to work on top of the DuckDB client APIs: it's so easy for MotherDuck to come into the picture. No matter where you are using DuckDB, you can easily connect to MotherDuck just by using the command you've shown, and in other cases maybe by modifying a

21:07 connection string to prepend 'md:'.

21:11 So, super easy. Yeah, as you mentioned, whether it's Python, Go, or whatever, it's always the same process: you attach MD with the MotherDuck token and you're connected to the cloud. So, coming back to... not this one, this one. Yeah, exactly. Coming back to the

21:36 diagram here, I want to again summarize that our client extension in MotherDuck modifies DuckDB at many different layers. The life cycle of a query starts with parsing of the SQL, and then it moves on to binding, which does catalog lookups to figure out how to

22:00 construct a logical plan from the SQL string; then optimization comes in, and eventually we produce a physical plan with different operators so we can execute this query in MotherDuck. MotherDuck extends the parser, it extends the binding, it extends the catalog lookup, it even extends the

22:26 optimizer and the storage. So we really, really leverage the DuckDB extension system here. Okay, so

22:36 there is a lot extended. And then, can you talk about the main components that are here? Yeah, absolutely. So,

22:51 if we focus on the middle component over here: obviously, before coming to MotherDuck itself, we have all of this observability, monitoring, authentication, load balancing, users and secrets, all of these things. This also shows why it is difficult to just run DuckDB in the cloud on an EC2 instance. I mean,

23:12 sure, it is doable, but to do it properly, to do it with reliability, with security, with access control, with proper monitoring, with observability, requires quite some work, and that's where MotherDuck comes in. So essentially, on the server side,

23:34 in this box on the right-hand side over here, we basically have these containers that run DuckDB on the server side. These containers are of different sizes depending on the users' use cases, so they vary in terms of number of CPUs and amount

23:58 of memory, and of course we fondly call these containers "ducklings". We have these ducklings running on the back end for you that process your queries, essentially.

24:14 Yeah. And also, because we

24:19 have separation and decoupling of compute and storage, which is obviously typical for cloud systems nowadays, data is eventually managed in cloud storage using differential storage. That is how we manage our storage, and it allows in-place updates, which is very powerful and allows us to

24:45 do time travel, database sharing, and snapshotting very easily. That's another part of why running DuckDB in the cloud can be challenging: you can do it, but to do it efficiently and cost-effectively, you potentially need to introduce your own file system to manage your storage layer,

25:07 essentially. Yeah, that's already a lot. And one point we often see is that if you're using DuckDB standalone, then as soon as you need to share the data or work together, you're going to hit some limits. What are the main limits you are

25:27 thinking of, going from single-player mode of DuckDB to multiplayer mode? Absolutely. For example, if I am working on some analysis in my database and maybe I've produced something that I think is worth sharing with my colleague: if you only have your DuckDB on your local computer, you most likely will

25:52 have to share the SQL to regenerate the state you are currently in, and you also have to share the database file in some way, or some query to generate that database file, so that your coworker, who is somewhere else, can then be in the same state as you are. However,

26:11 with MotherDuck, database sharing is one of the key features that we offer, so people who are part of the same organization can share their databases very easily. Instead of having to write all these SQL scripts multiple times and figure out how to share my raw database files, I can just create a database share

26:34 from my database, and then my colleague can go ahead and just query on top of that database share, and that makes collaboration a lot more natural and a lot easier. That's one thing that we really love to highlight in MotherDuck. Yeah,

26:56 so the database share, for example: if you log in to MotherDuck today, you're going to have a share already attached by default for every user, which is the sample database. In this sample database you have a couple of example datasets: Hacker News, Kaggle movies,

27:20 New York City taxi data. You can query those, you can create a new database easily and then share it with others. And I think what's important to mention is that behind the scenes we are leveraging the DuckDB file format, right? Yes, absolutely. We are storing our data in the

27:42 native DuckDB file format, but I do want to mention that DuckDB storage is designed for local storage. A single database file means that dropping a table will leave a hole in the middle of the database file, and this file will then be reused and updated later. Even appending a new table,

28:06 using columnar storage, could generate I/O that leaves gaps, and writes at multiple places in your database file at the same time. This type of implementation, which is fine for local use cases, is not actually very cloud-friendly. Therefore MotherDuck has implemented a userspace file system, using FUSE,

28:31 so that it appears to be a randomly writable file system and can be mapped to S3.

28:39 Essentially, we store a logical database file in multiple snapshot layers (we have a blog post, by the way, on how differential storage is implemented and envisioned). Every single write that happens in the database goes to the active layer and is append-only, and it keeps a mapping of

29:06 which position in the layer corresponds to which logical position in your database file. Yeah, so that's important to mention: if you have multiple users writing, reading, or doing operations on the same

29:26 database file, you can run into some limits with the native format. That's where, when you move to the cloud and you need collaboration and multiple people working on the same dataset, you feel some limits. And this is the blog post, by the way; I'm going to quickly share it

29:46 in the chat. But

29:52 yeah, that's already a lot of points. So, I'm not sure if we have covered

30:02 the storage services and the duckling containers... maybe

30:09 maybe you can talk a bit about the user secrets? Yeah, absolutely. Because you worked on that, right? Yeah, I did work on that; it is my baby, and I'm very happy about where the secret manager is now. We co-designed and collaborated with DuckDB on the DuckDB and MotherDuck secret manager. Essentially, this

30:32 secret manager supports pluggable secret storage. DuckDB has the secret manager implementation, where it can oversee all of the secrets that you have plugged into DuckDB, and MotherDuck, for example, is one of the designated secret storages. Essentially, whatever DuckDB supports on the client side in terms of secret

30:57 types and parameters that you can store for a given secret, MotherDuck supports the same. And if you want to store a secret in the cloud, as opposed to only locally in memory, or maybe in a local file in an unencrypted format, you can simply say CREATE

31:17 SECRET IN MOTHERDUCK, or simply say CREATE PERSISTENT SECRET when you are already connected to MotherDuck, and that will create you a cloud-persisted, encrypted secret in MotherDuck cloud storage. The benefit is that you can then easily reuse that secret, and you can query whatever bucket you have in

31:42 the various cloud storages we support through DuckDB: S3, of course, GCP, Azure (ABFSS as well),

31:53 Hugging Face now, for those people who are running ML workloads, and many more. Even Unity Catalog is another new type of secret that DuckDB introduced with the support of the Delta Lake table format, which, by the way, we now support as well. I was waiting for you to put in that plug! Because Stephanie

32:16 also put a lot of effort into making Delta Lake work. Can you walk us through the challenges of making an extension

32:29 like that work in MotherDuck? Yeah, absolutely. DuckDB supported Delta Lake as a table format, I think, maybe about three weeks ago or so, and we wanted to also support that in MotherDuck. Straightforwardly, people would think: okay, wouldn't that just work? Since it works in DuckDB, why wouldn't

32:50 it work in MotherDuck? Part of it is because we are doing the dual execution, so that complicates things to some extent. Essentially, we have this bespoke mechanism where we have an optimizer extension that plugs into DuckDB's optimization logic, where we do our own optimization, and our dual execution

33:15 essentially identifies when it is a good time to execute something remotely in the cloud versus locally on your machine. With Delta Lake, for example, when you're scanning over S3 files, that would be a remote operation, but if you scan over a local file, that would be a local operation. So we needed to integrate this into our system in such a way that

33:41 it would work naturally with our existing dual execution framework, and it needed to also work in WASM, which is always a special snowflake, obviously. Also, and this ties back to the earlier point I made about why it is difficult to run DuckDB by yourself: we also build extensions for many different architectures.

34:10of course so we have a handful of architectures we have to build for uh to make things easier and and Delta Lake in this case uh actually uses the Delta kernel Library written in Rust as offered by data brakes which is New by the way right it is in experimental stage right now and gaining adoption as far as you know I saw when I worked on

34:33this so that means bringing in r as a new dependency into uh mother Duck's code base as well so of course you'll run into you know various problems as you you know link this library in various architectures um you know on our end so duct DB only builds Dela L extension for I think four architectures so far um I believe Linux um x86 OSX um

35:00is there as well I think Linux arm maybe not yet released but we actually build data Lake for all of the architecture supported by duct DB um so can you can use you can do a symol you know from Delta scan and then you pass in your uh data Lake path it could be a local path or it could be a remote path in S3 for

35:21example uh and then you can uh interact you can create a view on top of that create a table on top of that or directly you know interact with it um and before doing that you just have to create a uh S3 secret so that is the that is the requirement for Delta Lake and then you can you know uh work with

35:41it in a super like easy intuitive way yeah well we'll go to an example of uh creating secret and comparing uh local
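
Putting those steps together, a minimal sketch of the workflow (the bucket path and view name are placeholders):

```sql
-- Store an S3 secret so remote scans can authenticate
CREATE SECRET (TYPE S3, PROVIDER credential_chain);

-- Scan a Delta table on S3 and expose it as a view
CREATE VIEW deltas AS
SELECT * FROM delta_scan('s3://my-bucket/path/to/delta-table');

SELECT count(*) FROM deltas;
```

Whether the delta_scan runs locally or in the cloud is then decided by the dual execution optimizer, based on where the data lives.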

The secrets point deserves emphasis. If you run DuckDB yourself — on a Kubernetes cluster, say — you have to manage how credentials get passed to every instance. With MotherDuck, secrets can be stored in the service itself, so remote scans work out of the box.

Contributing Back to DuckDB

Since MotherDuck builds on DuckDB rather than forking it, contributions flow back upstream. Stephanie started working with DuckDB about two years ago, when she joined MotherDuck, and her first set of contributions was around serialization. Dual execution breaks the logical query plan into parts, which requires every logical operator and table function to be serializable: MotherDuck makes gRPC calls, so plan fragments have to go over the wire. DuckDB itself doesn't need this — it runs locally with everything in memory — but the DuckDB team happily agreed to add the serialization logic, and Stephanie wrote serialization code for many of the logical operators, at least all the SELECT-related ones. Mark at DuckDB later refactored that work, and the serialization code now uses a different code-generation mechanism, but in the early days it was quite manual.

Later contributions included the secret manager, designed together with Sam at DuckDB. Sam did much of the hard work of building the basic infrastructure and wrote a lot of the functionality, but as the number-one adopter and very first integrator of the secret manager, MotherDuck discovered missing spots and nice-to-have features and contributed those as well — along with many other fixes driven by what turns up in production.

Beta Testing DuckDB at Scale

MotherDuck puts DuckDB in a production setting, exercising it at a scale and pace it was not tested for in the past — effectively acting as a beta tester on many fronts. Beyond MotherDuck's own load testing, as the user base grows, real usage pushes DuckDB further toward its limits. One recent example: MotherDuck discovered and reported a connection error that appeared once a thousand connections were open concurrently reading Parquet files. MotherDuck hit it because it opens many parallel connections to read Parquet; a typical data scientist or data engineer running DuckDB locally never generates that kind of workload, so the issue had gone undetected.

MotherDuck is also among the very first to exercise the data-lake features such as iceberg_scan and delta_scan, which are still very new. Supporting data-lake use cases is a priority — the goal is for the data lake to be a first-class citizen at MotherDuck in no time — and getting that right requires collaboration with DuckDB: good performance on S3 Parquet reads, good feature coverage for Iceberg reads (and hopefully Iceberg writes soon), and optimizations beyond scanning the entire table, such as filter pushdown and projection pushdown. That work is critical for MotherDuck's success, and improving on these fronts also helps DuckDB's adoption — a win-win situation for both projects.

How to Start Contributing

(Delta Lake got only a brief treatment here because a dedicated session with a guest from Databricks is planned, where we'll dive into the details.) Circling back to the earlier question of how newcomers can get involved: Stephanie encourages viewers to consider it seriously, because contributing to DuckDB is genuinely easy. You don't have to work at MotherDuck, or at any big or small company, to contribute. DuckDB is an open source project and its issues are public: go to the DuckDB repository on GitHub and you'll find hundreds of open issues in the issues tab. The team is very good at labeling them — "reproduced," "needs triage," "under review," and so on — so as a new contributor you can pick something not too big that has already been reproduced. From there the workflow is to clone the project, build it, run the unit tests, write a test that initially fails, debug with your debugger set up, and open a pull request. The DuckDB team is really good at code review; you'll typically get your code reviewed within a day or two.

And it's not just the main repository. As mentioned earlier, DuckDB has a very extensive extension ecosystem — open source repositories such as duckdb-iceberg, duckdb-aws, and duckdb-azure — each with its own issues page listing problems specific to that extension. If something there interests you (many people want to improve the Iceberg functionality, for example), feel free to pick it up. And if you would rather propose or discuss something first, create a discussion topic; the team is very open-minded and willing to discuss at all times.

The host added two points. First, DuckDB is written in C++, and that blocks a lot of people — but you don't need to know C++, or barely how to code at all. If you only know SQL and you hit an issue as a DuckDB user, a reproducible report — the SQL query plus the dataset that triggers the problem — is genuinely valuable, and involves no coding: you have identified a problem that can be reproduced. Second, documentation is also a team effort: sometimes things are missing, incorrect, or could be added. If something in the documentation is not clear to you, it is probably not clear to someone else either, so it's worth opening a GitHub discussion — you can do that even if you're not sure it warrants an issue.

Hands-On: Watching Dual Execution Work

With that, the conversation moved to the terminal. The plan: create a secret, scan some Parquet, and show the md_run parameter — a feature most people aren't familiar with. First, though, a look at query plans using the tpch extension.

DuckDB ships a built-in tpch extension for TPC-H, a standard dataset used to compare database benchmarks, which makes it really easy to generate fake data to try things out. A nice trick: if you create a database while connected to MotherDuck — CREATE DATABASE cloud_db, then USE cloud_db — you are automatically dropped into a remote MotherDuck database, and calling dbgen then generates the TPC-H data directly in the cloud. The host attached to MotherDuck, created a database, and ran CALL dbgen with a small scale factor of 0.01; SHOW TABLES confirmed the TPC-H tables had been created. Next up: a simple join-and-aggregation query counting customers per nation.
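
The setup looks roughly like this (the database name is illustrative):

```sql
ATTACH 'md:';              -- connect this DuckDB session to MotherDuck
CREATE DATABASE cloud_db;  -- created in MotherDuck
USE cloud_db;              -- now operating on the remote database

INSTALL tpch; LOAD tpch;   -- DuckDB's built-in TPC-H data generator
CALL dbgen(sf = 0.01);     -- generate a small TPC-H dataset in the cloud
SHOW TABLES;

-- Customers per nation: joins two remote tables
SELECT n.n_name, count(*) AS customers
FROM customer c
JOIN nation n ON c.c_nationkey = n.n_nationkey
GROUP BY n.n_name
ORDER BY customers DESC;
```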

The query refers to the customer table and the nation table, both of which live in the remote MotherDuck database, so the query executes remotely. How do we know? Simply prepend EXPLAIN to the query to see the plan that was used to run it.

The explain output shows the physical plan, which determines exactly how the query will be executed. Before the physical planning stage come the logical planning and optimization stages, and it is there that MotherDuck determines that the tables being queried are remote and decides to execute the query remotely; the physical plan reflects that decision. On the logical side, MotherDuck inserts a bridge operator to facilitate the uploading and downloading of data between client and cloud. In the plan for this query, all of the operations — the projection, the hash join, the sequential scans — carry the R (remote) annotation, while the final operator is L (local): that is where the query downloads its result from remote to local over the bridge (bridge ID 1 here). The query runs remotely, but eventually the data has to come back to the client so you can see it in your terminal — that is where the local part comes in.

There is also a dual mode, which you can see by making part of the data local.
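
Schematically, the fully remote plan has a shape like the following. This is an illustrative sketch — operator names and layout are not verbatim MotherDuck output:

```sql
EXPLAIN
SELECT n.n_name, count(*) AS customers
FROM customer c
JOIN nation n ON c.c_nationkey = n.n_nationkey
GROUP BY n.n_name;

-- Rough shape of the resulting plan (R = remote, L = local):
--   DOWNLOAD SOURCE           (L)  -- client side of the bridge, receives results
--     BATCH DOWNLOAD SINK     (R)  -- cloud side of the bridge, uploads results
--       HASH GROUP BY         (R)
--         HASH JOIN           (R)
--           SEQ SCAN customer (R)
--           SEQ SCAN nation   (R)
```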

To demonstrate dual mode, copy the customer table to a local file: COPY customer TO 'customer.parquet'. That downloads the data from the cloud to a local Parquet file — in this case a really small one. Notice, incidentally, that you don't have to specify the file format anywhere: DuckDB infers from the .parquet extension that it should write Parquet.

Now modify the original query to select from 'customer.parquet' instead of the remote customer table, so the query joins a local Parquet file with the remote nation table. Running EXPLAIN again produces a much more interesting plan. Unlike before, where everything was remote, there is now a remote sequential scan feeding a batch download sink on the remote side, and every sink is matched with a source on the other side of the bridge — local or remote, sinks and sources always come in pairs. Further up the plan, the projections over customer.parquet happen locally against the local file, and the final download of results back to the client is a local operation as well. The query executed in dual mode rather than fully remotely: because one of the tables is local, there is no reason to migrate that data to the cloud — the query simply leverages local compute, right on your machine.
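
The dual-mode variant of the demo, as a sketch:

```sql
-- Materialize the remote table as a local Parquet file
-- (output format is inferred from the .parquet extension)
COPY customer TO 'customer.parquet';

-- Join the local file with the still-remote nation table;
-- the optimizer splits this plan between client and cloud
EXPLAIN
SELECT n.n_name, count(*) AS customers
FROM 'customer.parquet' c
JOIN nation n ON c.c_nationkey = n.n_nationkey
GROUP BY n.n_name;
```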

The effect may not be obvious here, because customer.parquet is a relatively small file, but with a much larger file the speedup would be much more apparent. It is especially prominent when rendering in the browser: if you are dragging a slider on a chart and want the graph to respond to the slider position immediately, without any latency, local-first execution is very appealing. But MotherDuck is not always local-first — it is a dual situation. Depending on how big your tables are, how large your data is, and where you are executing from, the optimizer decides how to split the plan so that it executes in the most efficient way.

That said, some users want to dictate the location explicitly: run this locally only, or remotely only — both cases come up. Someone might want purely local execution because they don't want to share credentials with MotherDuck: they execute locally with local, in-memory credentials that are never persisted in the cloud. Or with edge storage such as Cloudflare R2, you may want to run the query locally so it executes closer to where the data is stored. That is why MotherDuck introduced the md_run parameter, which works with all of the DuckDB scans — Parquet, CSV, Delta, Iceberg. You declaratively call read_parquet with the path and add md_run = 'local' to force local execution, or md_run = 'remote' to force remote execution.

In the demo the file was local, so local execution was the natural outcome anyway, but the syntax is the same: an extra named parameter, with the value in single quotes, passed to the scan function. With a remote path, md_run = 'local' overrides the default — normally MotherDuck would run the query remotely once it sees you are scanning a remote file. One thing to note: running an S3 scan completely locally involves downloading the data — under the hood, DuckDB's httpfs file system fetches it — so it is important to check the permissions on your bucket. If the bucket forbids the read, you may run into a permission error; if the permissions check out, it just runs without any issue.

To close, the host set up a concrete comparison: the same query over a relatively large S3 file, once with plain DuckDB and once through MotherDuck.
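
A sketch of the md_run override (the S3 path is a placeholder):

```sql
-- Force the scan to run on the client; httpfs downloads the file locally
SELECT count(*)
FROM read_parquet('s3://my-bucket/data/events.parquet', md_run = 'local');

-- Force the scan to run in the MotherDuck cloud instead
SELECT count(*)
FROM read_parquet('s3://my-bucket/data/events.parquet', md_run = 'remote');
```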

For the comparison, the host loaded AWS credentials and opened two sessions: on the right, plain DuckDB connected directly to S3; on the left, DuckDB attached to MotherDuck. Creating the secret in MotherDuck takes a single statement. For people used to AWS SSO, if you simply run aws sso login before creating the secret with the credential-chain provider, the secret is created from whatever login credentials you already have — no need to type in all your credential information. The host had done an SSO login earlier and already stored the secret, which is also visible under Secrets in the MotherDuck UI.

One caveat worth mentioning: in larger organizations, aws sso login often yields a temporary credential that expires after, say, 8 to 12 hours. DuckDB is looking at adding token refresh, so clients have a mechanism to know when a credential has expired; for now, if your temporary credential has expired, just run CREATE OR REPLACE SECRET with the same credential-chain definition and it will pull in your newest credentials.
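
The secret setup, sketched with DuckDB's secret manager syntax (the secret name is illustrative):

```sql
-- Run `aws sso login` in your shell first; the credential_chain
-- provider then picks up those credentials automatically.
CREATE SECRET my_s3 (
    TYPE S3,
    PROVIDER credential_chain
);

-- When a temporary SSO credential expires, refresh the stored secret:
CREATE OR REPLACE SECRET my_s3 (
    TYPE S3,
    PROVIDER credential_chain
);
```

Created while attached to MotherDuck, such a secret can be stored in the cloud for reuse across sessions.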

The test dataset was PyPI download data — each row representing one download of the duckdb package from PyPI — roughly 2 GB on S3, queried with an aggregation. With the CLI timer turned on, the host ran the query both ways: once through MotherDuck, once with plain DuckDB reading S3 directly. The direct run has to download the data it needs before it can aggregate — even after filtering, close to a gigabyte here — and the larger the remote dataset, the more you have to pull down: exactly the traffic and I/O cost discussed earlier. Geography matters too: the bucket was in us-east-1, where MotherDuck also happens to run, while the host was streaming from Europe — so the MotherDuck run kept the compute next to the data and shipped back only the aggregated result. Run the comparison yourself with the timer on and you will see the difference between MotherDuck and local DuckDB over S3.

Wrapping Up

The final numbers: about 36 seconds for plain DuckDB reading directly from S3 versus about 25 seconds through MotherDuck — and that was not even a big dataset. You can run your own tests against your own use case, including joining local data with remote tables as shown above.

The session covered the different components that make MotherDuck special, then a hands-on tour: reading EXPLAIN plans with their local and remote operators, forcing execution to one side with md_run, and measuring the performance difference. Stephanie's closing hope was that viewers now have a better understanding of how dual execution at MotherDuck works, and of the real challenges of deploying DuckDB in the cloud properly — securely, reliably, cost-effectively, and accessibly. Getting the product to work is one thing; much of the engineering goes into making it work well for users.

Quack & Code streams every other week on both LinkedIn and YouTube — follow the MotherDuck LinkedIn page and YouTube channel, and keep an eye on motherduck.com/events so you don't miss upcoming in-person and online events.

FAQs

How does MotherDuck's dual execution split queries between local and cloud?

MotherDuck's optimizer analyzes your query plan to determine which tables are local and which are remote, then splits the physical plan into sub-trees executed in the optimal location. For example, joining a local Parquet file with a remote MotherDuck table results in the remote scan running in the cloud, the local scan running on your machine, and data bridging happening automatically between them. You can inspect this using EXPLAIN statements, which show R (remote) and L (local) annotations on each operator.

How does MotherDuck extend DuckDB without forking it?

MotherDuck uses DuckDB's extension system rather than maintaining a fork. The MotherDuck client extension modifies DuckDB's parser, binding, catalog lookup, optimizer, and storage layers through the official extension API. This means MotherDuck can upgrade to new DuckDB versions quickly, for example going GA the day after DuckDB 1.0 was released. It also means anywhere you can use DuckDB (Python, Java, CLI, WASM), you can connect to MotherDuck by simply attaching with ATTACH 'md:'.

How do MotherDuck secrets work for accessing cloud storage?

MotherDuck provides a pluggable secret manager co-designed with DuckDB that supports encrypted, cloud-persistent secrets. You can create a secret with CREATE PERSISTENT SECRET while connected to MotherDuck, and it will be stored encrypted in the cloud for reuse across sessions. It supports AWS S3, GCP, Azure, Hugging Face, and Unity Catalog credential types. If you use AWS SSO, you can simply run aws sso login first and MotherDuck will chain those credentials automatically. Learn more in the MotherDuck documentation.

Can you force a query to run locally or remotely in MotherDuck?

Yes. MotherDuck provides the md_run parameter that can be added to any scan function (Parquet, CSV, Delta, Iceberg) to override the default execution location. Use md_run='local' to force local execution or md_run='remote' for cloud execution. This is useful when you want to keep credentials local, when you want to scan edge storage closer to your data, or when you want to explicitly control where computation happens for performance or privacy reasons.

What challenges does running DuckDB in the cloud create?

Running DuckDB in the cloud requires solving several problems that do not exist for local usage: multi-user authentication and access control, secret management, reliable shared storage (DuckDB's local file format is not cloud-friendly since writes can leave gaps), monitoring and observability, and efficient data transport. MotherDuck addresses storage through a custom differential storage layer that maps DuckDB's file system to append-only S3 snapshots, enabling time travel, database sharing, and in-place updates.
