Lessons from Building MotherDuck with MotherDuck
When I joined MotherDuck as a Founding Engineer, I took on an unusual role: becoming the company's first internal customer. "Dogfooding" – the practice of using your own product extensively before releasing it to customers – has long been a staple of software development. For us at MotherDuck, a company building a cloud service based on the popular DuckDB analytical database, this approach has proved invaluable in uncovering real-world issues and accelerating product development.
38 Seconds of Nothing
My journey with data systems began around 2010 when I inherited the management of what was then the largest East Coast Hadoop cluster, a 300-terabyte behemoth. This early experience revealed a fundamental inefficiency in distributed systems that would later influence MotherDuck's design philosophy.
I ran a MapReduce job that took 42 seconds. But when I ran the same computation locally on my machine, it completed in just 4 seconds. I wondered: where were the other 38 seconds going?
The answer lies in the overhead of distributed computing. When data is processed across multiple machines, it must be serialized, sent over the network, deserialized, and reassembled – a process called "shuffling." This creates significant latency that doesn't exist when processing data on a single machine.
In distributed systems, every shuffle operation causes this network storm where each node must communicate with every other node. You're exposed to long tail latency because the second part of your query plan cannot proceed until all partitions have been received.
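To make the overhead concrete, here is a rough, hedged illustration in Python (not MotherDuck code): even with no network in the picture, a single serialize/deserialize hop on a few hundred megabytes can cost more than the aggregation itself, and a real shuffle adds network transfer and long-tail stragglers on top of that.
```python
# Rough illustration only: compare summing an array in place with paying one
# serialize/deserialize hop first (what every shuffle stage does before the
# network transfer even starts). Sizes and timings are illustrative.
import pickle
import time

import numpy as np

values = np.random.rand(50_000_000)   # ~400 MB of float64

t0 = time.perf_counter()
in_process_sum = values.sum()         # operate on the data where it already sits
t1 = time.perf_counter()

blob = pickle.dumps(values)           # "serialize" for the wire
restored = pickle.loads(blob)         # "deserialize" on the receiving side
shuffled_sum = restored.sum()
t2 = time.perf_counter()

print(f"in-process sum:          {t1 - t0:.2f}s")
print(f"serialize + copy + sum:  {t2 - t1:.2f}s (and this skips the network entirely)")
```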
Meanwhile, hardware advancements have dramatically increased the processing capabilities of individual machines. Modern servers can have dozens of cores, hundreds of gigabytes of RAM, and terabytes of fast SSD storage. A single Graviton3 instance at $1.30 per hour delivers 300 gigabytes per second of aggregate memory bandwidth – comparable to an entire mid-sized Hadoop cluster from a decade ago.
We built these distributed systems to handle the top 1% of workloads, but we pay this distribution overhead for everything, even when it's unnecessary.
No Overkill, No Lag
The inefficiencies of traditional big data systems created an opportunity for a different approach. DuckDB emerged as a response to the need for efficient local analytical processing, particularly for data scientists working with tools like pandas.
DuckDB takes a fundamentally different approach from cloud data warehouses. As an embedded analytical database, it runs in-process, eliminating network overhead entirely. It offers columnar storage, vectorized execution, and aggressive pipelining – essentially implementing state-of-the-art analytical processing techniques in a library that can run anywhere.
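As a minimal sketch of what "in-process" means in practice, the snippet below (illustrative, not from our codebase) queries a pandas DataFrame directly with DuckDB's Python API; there is no server to start and no data leaves the process.
```python
# Minimal sketch: DuckDB runs inside the Python process and can scan a local
# DataFrame directly, with columnar, vectorized execution and no network hop.
import duckdb
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "amount": [9.99, 4.50, 20.00, 3.25],
})

result = duckdb.sql("""
    SELECT user_id, SUM(amount) AS total
    FROM events               -- resolved straight from the local Python variable
    GROUP BY user_id
    ORDER BY total DESC
""").df()

print(result)
```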
DuckDB became relevant because it addressed the inefficiency of traditional approaches. It applies cutting-edge research on pipelining data and parallelizing operators to a market segment that larger companies had ignored – local analytics on your own machine.
Unlike traditional data warehouses, DuckDB has no authentication system, no network interface, and no user management. It's designed for single-user, local processing. While this makes it incredibly efficient, it also limits its usefulness for collaborative work and larger datasets.
Tiny Queries Everywhere
Our hypothesis at MotherDuck, that most analytical workloads don't require massive distributed systems, was supported by real-world data. Amazon published a paper with aggregated statistics from Redshift users, revealing that approximately 95% of queries scan less than 10GB of data, and 90% scan less than 1GB.
Even more surprisingly, 30% of queries in production data warehouses scan just 1-100MB of data, and about 10% of Redshift customers are running entire warehouses on datasets between 1-10MB – data that would fit comfortably in a laptop's memory.
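This kind of analysis is easy to reproduce on your own query logs. The sketch below assumes a hypothetical query_log.parquet with a scanned_bytes column per query (the Redshift fleet data is not published in this exact shape); it simply computes scan-size percentiles with DuckDB.
```python
# Hedged sketch: percentile analysis of per-query scan sizes.
# 'query_log.parquet' and its 'scanned_bytes' column are hypothetical.
import duckdb

percentiles = duckdb.sql("""
    SELECT
        quantile_cont(scanned_bytes, 0.50) / 1e6 AS p50_mb,
        quantile_cont(scanned_bytes, 0.90) / 1e9 AS p90_gb,
        quantile_cont(scanned_bytes, 0.95) / 1e9 AS p95_gb
    FROM read_parquet('query_log.parquet')
""").df()

print(percentiles)
```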
People are over-provisioning relative to what one big machine could handle. There are benefits to having a shared, cloud-based system, but clearly many organizations are paying for more distributed processing power than they actually need.
We are Our Own First Customer
From day one, I positioned myself as MotherDuck's first customer, implementing our internal analytics infrastructure using our own product. It would be hypocritical to recommend a product to others that we don't use ourselves.
Our internal analytics stack at MotherDuck is relatively straightforward: Airflow as the scheduler, mostly Python for data processing, and a growing set of partner tools (dbt, Airbyte) for specific needs. The system now handles about 6,000 queries daily across 40 scheduled jobs, making it a substantial enough workload to thoroughly test the product.
By being deliberate about data types and optimizing for the vector execution engine, I keep our datasets efficiently sized – about 1 terabyte in total. This allows us to run the entire analytics operation at remarkably low cost, demonstrating the efficiency that careful engineering can achieve with DuckDB.
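What "deliberate about data types" looks like in practice is mostly narrowing columns at load time. The sketch below is illustrative (table, column, and file names are made up, not our actual schema): casting to the smallest types that fit keeps columnar storage compact and plays well with DuckDB's vectorized engine.
```python
# Hedged sketch of type discipline at load time; names are illustrative.
import duckdb

con = duckdb.connect("analytics.duckdb")
con.sql("""
    CREATE OR REPLACE TABLE events_typed AS
    SELECT
        CAST(event_id   AS BIGINT)        AS event_id,
        CAST(user_id    AS INTEGER)       AS user_id,     -- not BIGINT unless it must be
        CAST(event_type AS VARCHAR)       AS event_type,  -- consider ENUM for low cardinality
        CAST(amount     AS DECIMAL(9, 2)) AS amount,      -- not DOUBLE for money
        CAST(created_at AS TIMESTAMP)     AS created_at
    FROM read_json_auto('raw_events.json')                -- raw export, easy to reload and recast
""")
```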
Our billing is ridiculously tight: we charge only for the time a query is actively doing work, not for time spent blocked on I/O. Following good practices with DuckDB results in a really cost-effective solution.
Real World Discoveries
The real value of dogfooding emerged when I began encountering issues that might not have been caught through conventional testing:
File Descriptor Limits: We discovered a slow leak of connections when querying s3:// paths that would eventually hit the 1024 file descriptor limit after about a week of continuous operation. This would never appear in short unit tests but became apparent while running our analytics stack around the clock. Without dogfooding, customers might have hit this in production before we caught it.
Concurrent Schema Change Issues: We found that when schema changes occurred, our system would pessimistically abort running queries due to out-of-sync local catalogs. This seemed reasonable in theory, but with tools like dbt where every materialization is a DDL statement (CREATE TABLE AS), it made the system practically unusable for real data engineering work. We had to revise this approach for better usability.
Long SQL in UI vs Error Location: Our initial UI design placed SQL error messages below the query. This worked fine for simple examples, but I work with 1300+ line SQL queries that require scrolling back and forth between errors and the code that caused them. We shifted to inline error highlighting, similar to what dbt does, making it much more practical for real-world complex queries.
Validating Differential Storage: We implemented a feature called differential storage, essentially writing our own file system layer. This was an extremely risky change that affects the core of data persistence. By enabling it for my production workloads for a month and a half before rolling it out to customers, we were able to catch edge cases and ensure stability for this critical component.
Bind Timeout During Slow Source Scans: When connecting to external systems like Postgres, we initially expected at least one row of data every 15 seconds to keep connections fresh. But I had queries that scanned billions of rows yet returned only 40 rows after 10 minutes due to high selectivity. These timed out unnecessarily, teaching us that timeouts need to account for query selectivity, not just overall duration.
Deeply Nested JSON Memory Usage: While typical JSON test data might be 30 fields with one nesting level, I showed up with real documents from MongoDB that were 80 levels deep per record. DuckDB wasn't prepared for this and used about 500 times the memory of the raw string just to parse it. Real-world data is far messier than test data, and dogfooding helped us identify and fix these memory issues (a repro sketch follows this list).
Slow Extension Load Times: As our binary grew over time, we weren't closely monitoring its size. The security checksumming process (which is crucial and can't be skipped) was taking 7+ seconds on each extension load. This delay became very noticeable in my daily workflow and prompted us to optimize our binary size and loading performance.
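For the deeply nested JSON case above, here is a hedged repro sketch: it fabricates a document roughly 80 levels deep (the depth and file name are illustrative, not the original MongoDB export) and asks DuckDB to infer a schema for it, which is exactly the shape of input that used to blow up memory during parsing.
```python
# Hedged repro sketch for the nested-JSON issue; data is synthetic.
import json

import duckdb

doc = {"leaf": 1}
for level in range(80):              # wrap the payload 80 levels deep
    doc = {f"level_{level}": doc}

with open("deep.json", "w") as f:
    json.dump(doc, f)

# DuckDB infers a deeply nested STRUCT here; older builds paid a large memory
# multiple over the raw string size just to parse documents shaped like this.
print(duckdb.sql("SELECT * FROM read_json_auto('deep.json')"))
```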
Continuous Learning by Running Stuff
Our dogfooding approach created a tight feedback loop between product development and real-world usage. Over time, my role evolved from primarily testing basic functionality to actually using MotherDuck as the company's data engineer.
When we first started, I was just making sure that basic operations like loading data and running aggregations worked properly. Now, about two-thirds of my time is actually doing data work for the company because the product is no longer giving me trouble.
This transition happened because issues were identified and fixed early. By the time MotherDuck reached general availability, the product had already been battle-tested through months of internal use.
We also gained valuable insights about scalability and resource management on a single node. While distributed systems spread workloads across multiple machines, a scale-up approach requires careful attention to memory allocation, core utilization, and I/O management. These learnings informed improvements to DuckDB's resource handling capabilities.
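Much of that single-node tuning surfaces as ordinary DuckDB settings. The values below are illustrative, not MotherDuck's production configuration; memory_limit, threads, and temp_directory are standard DuckDB options for capping memory, controlling core usage, and spilling to local disk.
```python
# Hedged sketch: standard DuckDB resource settings; the values are examples.
import duckdb

con = duckdb.connect()
con.sql("SET memory_limit = '48GB'")                 # cap operator/buffer memory
con.sql("SET threads = 16")                          # control core utilization
con.sql("SET temp_directory = '/tmp/duckdb_spill'")  # spill to local disk instead of failing

print(con.sql("""
    SELECT current_setting('memory_limit') AS memory_limit,
           current_setting('threads')      AS threads
"""))
```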
Looking ahead, we continue to enhance our hybrid execution model, which intelligently processes queries partly on the client and partly in the cloud based on data location. This architecture provides the best of both worlds – local processing speed with cloud scalability when needed.
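From the client's point of view, hybrid execution looks like ordinary DuckDB. The sketch below is based on the workflow described in the talk transcript but is illustrative: the md: connection string, database, and table names are assumptions. It mixes a local pandas DataFrame with a table that lives in MotherDuck in a single statement and lets the planner decide what runs where.
```python
# Hedged sketch of hybrid execution from the client side; names are illustrative.
import duckdb
import pandas as pd

con = duckdb.connect("md:my_db")  # connects to MotherDuck; local DuckDB stays in-process

new_signups = pd.DataFrame({"email": ["a@example.com", "b@example.com"]})

# The DataFrame scan runs locally, the existing users table lives in the cloud,
# and the planner splits the plan; no manual staging or upload step is needed.
con.sql("""
    INSERT INTO my_db.main.users (email)
    SELECT s.email
    FROM new_signups AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM my_db.main.users AS u WHERE u.email = s.email
    )
""")
```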
If You Don’t Use It, Nobody Should
For us at MotherDuck, dogfooding has proven essential in building a product that truly meets user needs. By using our own service for actual data engineering work, we uncovered issues that synthetic testing would have missed and gained firsthand experience with the product's strengths and limitations.
This approach has yielded three main benefits: identifying bugs earlier in the development cycle, ensuring feature designs match real-world workflows, and building empathy for the customer experience. It has also reinforced our core thesis that most analytical workloads can be handled efficiently on a single node with modern hardware.
For engineers and data teams considering new analytical solutions, my experience suggests a few key takeaways. First, many workloads don't need the complexity of distributed systems. Second, the efficiency gains from eliminating network overhead and intermediate data materialization can be substantial. Finally, a product built by a team that uses it daily is likely to evolve in ways that address real pain points rather than theoretical concerns.
Whether you're building or buying a data solution, the principle remains the same: the best way to understand if something works in practice is to make it part of your daily workflow. I've essentially been the sacrificial data engineer, running into issues before our customers do, which has made the product better for everyone.
Transcript
0:05 What I want to do first is a quick series of questions, so I can tune what I'm going to say to who we have in the audience. How many people here have, at one point or another, whether it was your official job title or not, had to maintain data for an organization or a group of people? Hands up. Oh my God, okay.
0:23 That's great, because you'll be able to connect on the level of pain that comes with maintaining data. So who has officially had a data engineer job title? All right, so these are the people who probably had control of budgets. And what about the other side, the consumers: analysts, BI, stats people? So you had to deal with all the mess the first group of people were making. I've tried to be a data scientist several times in my career, and every single time I got there I thought, how can you work like this? And I just became a data engineer again. So I finally gave up and became a full-time data engineer something like ten years ago.
1:11 I started out in New York, and New York in 2009 was magical for data. I know the technology was developed globally, in Silicon Valley, on the West Coast, but in New York the data got used on the ground. We had finance with its pool of quants, we had a ton of startups up and down Broadway, there were good meetups every two weeks, and a bunch of new data products coming out. And honestly, thank you for coming, because I think that scene is coming back: there's a New York systems meetup, there's another database meetup, and I know Meetup has gotten really crowded with all kinds of events, but the data engineering meetups are there, and I'd love to see it happening again.
1:51 So in that vein, I want to talk to you about how I ended up at MotherDuck, because it's an interesting story, and I also want to show you what we're doing there to make sure that what we build is actually useful to our customers.
2:08 Basically, what happened is something like this.
2:24 It's maybe an old meme by now, but ex-people from these three database companies for some reason knew me as a useful person, and they said, come on, we're going to make something new. What they were proposing was really appealing because of an experience I'd had about ten years earlier, and this was it. So this is 2010, and I'm working at an ad-tech company. At the time, 300 terabytes was a lot of data; the fact that we don't consider that giant anymore is really something. We had the largest East Coast Hadoop cluster at the time, and the guy whose job it was left for Yahoo, the mother of all Hadoop operators, so I got the job.
2:58 I was learning MapReduce as I went. There was no SQL on Hadoop, there was nothing; we just wrote everything in raw Java. We didn't realize it at the time, but all the data engineers of that era, the first people in big data, were becoming very bad database software writers: we reinvented consistency poorly, we reinvented partitioning terribly, but we got it done. And the reason we had to do it is the same reason Google had to do it: the margins in ad tech are really slim. You're talking about a tenth of a penny per thousand impressions, and each one of those generates a log line, so where are your margins to pay Oracle $110,000 per core per month for whatever they're doing? We were all desperate, we all got commodity boxes, and that was the era of MapReduce, which was the forerunner of a lot of the big data technologies that came after.
3:48 So I was there, and I took over some summary jobs we had. Typically I would just run the new jar, my job jar; there was no Pig, no Hive, it was all just jobs. 42 seconds. And I thought, hold on, it's not that much data, right? I could fit it on my laptop. It took a few minutes, but I got the data over to my local machine and ran it.
4:21 I just piped the stages together: I ran my Java and put the intermediate output straight on the console, and the thing finished in 4 seconds. And I'm like, is Hadoop a scam? Where are the other 38 seconds going? What is it doing?
4:40 Well, distributed systems have a lot of overhead. You basically have to coordinate. First of all, you have to send data. If I want to send something to Stephanie across a network, I've got to serialize it before anything else, so it goes into a buffer; that gets streamed out by the operating system, packet by packet, across the network; on the other side it goes into another buffer, with nothing happening in the middle, and there can be delays; then she has to deserialize it. So the data gets copied something like four times on each side before she can see "hello", and that happens between all the stages of your job.
5:20 Where does this happen? Let's keep our computation to SQL. The operators that require this shuffling across the network are things like hash aggregates, anything that does a sort (window functions are basically sorts these days), any DISTINCT operation; basically anything useful has to shuffle.
5:44 So we have this shuffle operation, plus broadcasting, and these are the operators that use them. On a distributed system it looks something like this: you have partitions produced by each node, each node writes locally to a partition file, and then sends it across the network, so you get this all-to-all, crazy communication pattern. Have you ever looked at a network graph during the shuffle phase of a distributed system? Quiet, quiet, and then everyone talks to everyone at once. The second part of the plan cannot proceed until each node knows it has all of its sections, which could have originated anywhere on the other side, so you're exposed to long-tail latency.
6:24 Of course, if you have to do this, you have to do this; this is what enabled you to run a query over 300 terabytes of data for a few dollars. But my God, for smaller data sizes it is absolutely insane. I could send the entire file from here to there in a tenth of the time it took to do this shuffle dance.
6:44 So basically, in order to enable the top 1% of queries, we pay this distributed-systems overhead for everything else. And I won't even get into the social reasons. Everyone tried to say, we're Google, we need MapReduce. I was at companies where we easily could have run on smaller systems, but there was incentive from the top to make sure that we participated in Big Data; a lot of "we've got to have this technology too, everyone else does."
7:03 Now compare that to what happens when you just have something in-process. This is a process on your computer. You can do a lot of parallelism on one machine right now, not even time-slicing; you have that many cores, and I'll get to why in a second. Same query: we're going to group over here, but instead of going across the network we just write to buffers in memory and create little mini-partitions; these become what previously sat on the other side, and they just hold pointers to the relevant sections. There's no copying here: my output went straight to what my readers were going to read. So I've accomplished exactly the same thing as the shuffle, except I haven't copied a single thing; my output becomes input directly.
7:58 You couldn't do this in 2010: four-core machines were the standard and eight-core machines were just arriving around 2011. But for all the transistors TSMC manages to cram on, there's nowhere left to go except more cores, so this approach makes a lot of sense now. Broadcasts, where for example you're joining a small table to a large table and that small table needs to be everywhere, assuming you don't already have it collocated on the storage of those nodes: that's even worse. In the single-node case, everyone just points to the same block of data, whereas over there I have to copy it to all of those machines redundantly.
8:38 So why did this happen? Let me go back to some economics.
8:38 If you work in technology, your career is ultimately somewhat based on materials science. Stuff happens in labs, new chemical processes for etching show up, they take 10 or 15 years to go through the pipeline, and eventually they show up in the fab. The process has become so intense that only a few companies in the world can do the newest nodes, and not all of the qualities scale linearly anymore. This is "the free lunch is over", the end of computing progress for free.
9:22 Around the time I started getting into ad tech and doing all these computations, frequencies stopped climbing. I know you could buy a 3 GHz machine at one point, but we basically settled on something like 2 to 2.8 GHz as the sweet spot for cooling; beyond that the chips got too hot. Meanwhile transistor counts kept going up; they kept managing to shrink the process, miraculously they still are, and we'll see how much longer.
9:39 So where did all those extra transistors go? We couldn't run anything hotter, so the watts stayed flat. Single-thread performance didn't get much better, because those techniques had been perfected and the gains had ridden on frequency. It all went into more cores; that's the only line that keeps going up. So now we have machines with 64 cores where we used to have four.
10:15 So let's look at what your money gets you. I tried to find what roughly $1.25 an hour bought you in 2014, and this is what came up on Amazon. Eight cores, okay; 68 gigs of RAM is not bad; but look at the bandwidth: the aggregate memory bandwidth of that machine is 51 GB per second. A Graviton3 at $1.30 an hour has 300 GB per second of memory bandwidth. That's as fast as your high-end networks used to be, if you were lucky. So if I have 32 cores with 300 GB per second in between, and I add a dollar an hour, by skipping one coffee that day, to get a 3 TB SSD for caching, doesn't this look a lot like a mid-size Hadoop cluster from 2010? Suddenly all those shuffles go away, and we can start just handing people pointers.
11:19 If you've felt like the systems you're working with were getting really heavy, and you're going, technology has advanced, why am I not seeing it in my job: this is why. You can do much more on a single node than you ever could, but almost none of the software that was around has been taking advantage of it. Look at the incentives: the big data companies want to sell you more and more, the cloud companies want you to use more and more nodes. So who is incentivized to make something efficient that can do this?
11:36 Actually, let me tell you a little bit about how DuckDB came to be. Hannes will tell you a better version of this story than I can, but as I understand it, he was walking by a couple of people doing some R and pandas work, and they were doing data transformation steps, you know, .transform, .filter, whatever, and it materializes in memory after every step, which is insane to a database person. Why would you materialize your entire matrix, your whole data set, after every operation? You would pipeline; otherwise you're essentially going to double your data. Wes McKinney, I think, said that the recommended amount of memory is 5 to 8x your data set to properly run pandas. Well, nobody does that; people just make it work and spill.
12:44 The way DuckDB became a thing was the insanity of that wasteful copying as data sets got larger, and applying essentially cutting-edge research on pipelining data and parallelizing operators to a market segment that the largest companies had ignored: how much analysis can you do on your own computer? Your local machines these days, whatever platform you're on, are miracles of technology that you would have paid $50,000 for ten years ago, and you just have them sitting there. So DuckDB is basically state-of-the-art analytical processing stuck into a library.
13:10 What I've got up here actually comes from a blog post our CEO wrote, and I don't know if you saw it, but Redshift put out a paper including a data set. If you've ever used Redshift, you get these stats tables that tell you, okay, I ran this query, how much did it scan, and so on. They aggregated that across a bunch of customers, and one thing you can extract from it is people's data sizes and scan sizes. This is a three-month period, about 500 million queries in the sample. I can't walk through all of it, it's too complicated, but if I add it up: 95% are underneath 10 terabytes; if we lop off that top five, 90% are under 1 terabyte, and these are clusters, not individual machines. And how many people are actually small: 30% are at about a gigabyte, and still we're using the same large distributed systems for them.
14:11 So you might say, well, maybe they just have a lot of data elsewhere. The same thing shows up when you look at how many people have 1 to 10 GB of data on Redshift: the machine they used to upload it, their own computer, could handle that data set just fine, and yet something like 10% of Redshift customers are spinning up entire clusters for it. There's a reason for that: it's hard to share data from your computer. So there are benefits to having a shared state, but clearly people are over-buying relative to what one big machine can do, and I think people are waking up to this. I love interruptions, I need them to keep me on track, so: any questions? All right. So all of that was the motivation for why the opportunity for DuckDB showed up, and in turn why MotherDuck came to be.
15:19 I've always had this feeling that we've been over-buying, using overly complicated systems for what is mostly sorts and aggregations. But DuckDB is not a data warehouse. DuckDB is really a library, and it happens to come with a CLI; they actually took the shell from SQLite and started modifying it, and that's what you're running when you type duckdb. So what are the properties of this library? First of all, you open a database, it runs in your process, and it locks the file against any other writers; you can't even have other readers while you hold a write lock, though you can have multiple read locks. It's local only, on your computer: no network interface, no query API, no users. It works against local disk, although that has changed recently with S3 paths and so on. And if you put it on EC2, you're paying for the entire time the machine is up, whether you're running queries or not.
16:18 Meanwhile, the bar has been raised so high by Snowflake and Redshift, and especially BigQuery, in terms of what we expect from a data warehouse. So if you say you want a data warehouse, this is kind of the minimum.
16:38 I can do these computations, but unless I can share them with others, meaning other people can query them, they're not much good sitting on my laptop, and we're not going to expose my laptop to the world. So we need multiple users to be able to query the data, and since those queries don't all run on my computer, they go across a network, I need some notion of identity, so that not just anyone can do it. Then there's the SQL UI. When I started, some of the SQL UIs on the web were so good, Snowflake's and BigQuery's SQL editors, that you could sit there and work in them eight hours a day, so you need a good UI.
17:14 And these days you don't just need a database. There are a lot of data teams out there that don't build their ETL from scratch. There's been a labor shortage, you could not hire data engineers, and there are a lot of companies, or divisions of companies, frequently marketing divisions, that couldn't justify engineers; they'd say, we need to ingest this ads data, this analytics data, whatever, and a whole forest of SaaS companies showed up to do that data processing for you. With a credit card and some partners you can get your data loaded, transformed, cleaned, whatever. So that's table stakes now: you need good partners. Elena, sitting back there, raise your hand and don't heckle; she's in charge of this, so if you have a beloved partner you want us to talk to, she's the person to see.
18:06 You also need transport of query results, which is actually non-trivial. The Postgres wire protocol gets used a lot, but it's really inefficient for large data: it doesn't do any compression and it sends your data back row by row over the wire. Snowflake will check whether your result set is bigger than a certain size, and if it is, it actually writes it to S3 and your local client just fetches it from there automatically. People are used to the speedups these kinds of mechanisms provide. They're also used to either automatic scale-up or paying only for what they use; either way, they don't want to pay for things they're not doing. They also want storage that isn't going to die if their laptop gets trashed, and much easier uploads than pushing files to S3 and running a copy command.
19:02 So there's a big gap between what you end up with on your own and all of that. I really do encourage people to try building their data systems on DuckDB; the ecosystem and the ease of experimentation are wonderful. But we thought there was a big opportunity to solve these problems for a bunch of people while keeping the pricing aggressive, so that the difference between what you pay us and what your DIY setup costs is small enough that you go, well, let me just try it the easy way.
19:39 So if you're going to do this, this is what you end up with, and there's one kind of unique thing I want to point out. This is all bog standard for anybody who has run an open-source project on the web: containers, some form of shared storage, a load balancer, authentication; basically every SaaS looks like this. [Aside about the presentation clicker.] So you have some kind of shared storage and you're running some process, and in our case that process is basically DuckDB with a service layer in front of it; we may have started with something like gRPC, and we've elaborated since. But there is one really unique thing about this architecture diagram that you're not going to see in basically any other diagram out there: we do not have a standalone client that you import into whatever you're doing. We are an extension on top of DuckDB, and that means that for any DuckDB user, the thing we want to provide is the illusion that you have this giant, powerful backend on demand, locally. You're working locally, and there's a focus, even psychologically, on you always being in control of the situation; you're not begging some server to please complete your request in under ten seconds while you wait patiently. We want you to be able to intermix local and remote work very fluidly, so the experience we try to provide is that you're just working with DuckDB; it's just that these other tables happen to be remote catalogs.
21:17 [Audience question] Sorry, on the previous slide, I saw you had DuckDB-WASM in there for the web browser. Does that imply that hybrid querying works in the MotherDuck UI itself? Yes, it does work. There is a file system in WASM, and if you look at our add-files flow, it actually loads into the WASM file system, which by the way is a key-value store, not even a tree. The problem is that it has no access to your local files, and I don't think we're getting around that, because I don't think the browser vendors are going to accept the kind of security nightmare of a web application reaching directly into your drive. There are some attempts to make that happen, and I would love it to happen; it may be that, like everybody else, we end up having to go the Electron route, we'll see where that goes. But it's a security thing; it's definitely not a technical limitation.
22:58 Stephanie knows all about this because she works on the database team. Normal databases, DuckDB included, do something fairly standard; these are the phases a query goes through when it gets executed. You hand the database a string with a bunch of text; that gets parsed into tokens and turned into a syntax tree, and then into a logical plan. At that point you have all these names of tables, and those have to be turned into locations of data somewhere; that's called the binding phase, so it locates table T1 and says, that's actually stored in this location on disk. Then the optimizer basically says, okay, I'm a database, I'm really smart, I can figure out a much better way to run this query than you can, and rewrites your plan; optimizers are so good that we basically rely on them. And then it executes, which in some ways is the easiest part.
23:36 We took the hard route. It would have been much easier to build a client that does what every other database client does: send the text, get the result. But because we want to preserve this illusion of a local experience, we do this combined planning. What we do is cache your remote DuckDB catalog, so all of your parsing and binding actually happens locally; if you use a table name that doesn't exist, you catch that locally instead of at the server. If you make it past that, then we start collaboratively optimizing, and the reason is that we have to figure out which parts of your query run remotely and which run locally.
24:16 So check out this crazy query. This is the logical plan, and the physical plan is basically identical in this case. We have two remote tables which we're going to join together, then a local table which we're going to join against that, and then we're going to push the whole thing back to a remote table, so it's a double round trip. When our planner runs, it asks: what has to run where? We know remote one and remote two are remote, because the catalog says so, so that part has to be remote. What has to be local? The local table is a local file on disk, so that has to be local. At that point we have three nodes that have to run wherever they have to run, and then we can color everything else either local or remote based on what makes more sense for where those operators run. So we split the plan into three subtrees: there's this subtree, which we decided is remote, that gets joined to this local subtree, and the result gets pushed all the way back to the remote table, with bridge operators moving the tuples in between. This happens transparently: you just write the query without caring where the data lives, and we figure out all the mechanics of shuffling data back and forth wherever it needs to go. This is all there to extend the illusion that I can use everything locally, but when I need a big server or access to large data, I just add the prefix. There's no downloading into files, no pushing to S3; it all just happens as if MotherDuck were running locally.
26:06 So let me show you practically how I use this. This is one of my ETL jobs. Note this thing: it's a pandas DataFrame, which came from a bunch of JSON files I have to parse, so it's really a bunch of dicts sitting in a pandas DataFrame. Now look at this query. We're going to INSERT INTO, and this is a remote table, trust me, and SELECT * from the frame, but I'm going to replace one of my columns with a condition based on a remote table: this current-users table is not local, it's sitting in MotherDuck, while the DataFrame is local. So based on that, I alter the column and insert the rows remotely.
26:45 And did you notice the magic thing? I can SELECT directly from the DataFrame variable, and it's not a table. That's DuckDB Python magic: if the Python library can't find a table with that name, it probes the local variables in scope to see if there's a DataFrame with that name and binds to it. This is amazing. Notice I didn't have to push anything into a file, no creating Snowflake stages or anything; it's just an insert statement that could have run locally, but it joins the data wherever it needs to in order to make it happen. This anti-join would have been a three-stage job if I couldn't do hybrid execution.
27:30 Okay, so that's basically the picture of why MotherDuck came to be, the technical conditions that gave the single node an opening, and our architecture, where we try to preserve the local experience and all of DuckDB's excellent UX. And basically from day one, I have been trying to build against MotherDuck the way you would if you joined as a regular customer. It was not ready when I first got there; a lot of things that seemed basic did not work when I first started. But better that I run into it than you run into it. So there's this idea of pushing the boundaries six months ahead of where our customers are, to pave the way before people get there, because once you go beyond slapping gRPC in front of something and start doing all this chicanery with hybrid plans, there's a lot to be discovered.
28:28 So why dogfood? Well, it would be hypocritical not to use the product that we recommend you use, so we put ourselves in the same risk position; there's a lot of good in that. The other thing is testing: people don't have the imagination for how something will really be used, and they inevitably test the areas they're already comfortable with. There's an infinite number of things you could test, but testing the golden path that users are actually going to take, based on real workloads, ensures that at the very least the common things get covered.
29:06 Another thing is design. When you design a feature, it's not done when you release it; it's done three years later, after you've gotten all the feedback about what you got wrong in the first place. By putting dogfooding in the critical path on our early releases, we could actually see whether the things we wanted to build were as ergonomic as we expected them to be. And one more, think of dentists: not many of the people who design tools for dentists are dentists themselves, and not every dentist becomes a toolmaker. There's a difference between what puts you in a position to write databases and what comes from using databases, an experience gap, so having somebody with the customer profile on staff really shortens the iteration loop.
29:46 To give an example: a lot of engineers think queries look like INSERT INTO x, these four VALUES. I've worked with queries that are 1,300 lines long, that passed through four people, three of whom no longer work there, and everyone's afraid to touch them because SQL is hard to test; they have 17 CTEs and seven subqueries. You put that in front of a database team and it's: oh, that's what people are writing these days? I had no idea. But yeah, it shows up.
30:30 You can also feature-flag. We rolled out a feature called differential storage: DuckDB thinks it's using an actual file system, and sneakily we never delete anything, which lets us implement time travel by giving DuckDB the illusion of a file system while keeping what came before. That's a risky feature to roll out, replacing the file system from underneath, so they enabled it for me for about a month and a half, and I verified that I could run our stack on it; for the most part it went pretty smoothly.
31:07 Dogfooding is also a good place to do customer repros. Having an environment that's set up similarly to customers' means you can often recreate the issues they hit, except that unlike a customer, you can attach a debugger to me, and I will merely complain; you can abort my queries, and I will merely complain. Having that kind of high-fidelity environment that's always running is, I think, essential. I also produce more typical data shapes: it's not going to be 100 rows, it's going to be 2 billion rows; it's not going to be 30 columns, it's going to be 400 columns with 17 nested things in there, so you get to anticipate things that are much closer to reality. And of course partner integrations: because we run a stack we actually use, we can go to our partners and say, hey, we'd like to try your BI tool or your transform tool on our actual data, and the incentives are really aligned; if it's frustrating for us to work with a partner, it'll be frustrating for you.
32:05 So it's a fairly simple stack: Airflow as the scheduler, mostly just Python at the beginning; I've brought in dbt, and I'm expanding to use Airbyte and some of our partners. It's the kind of stack I'd set up at any midsize company that has digital exhaust to ingest. We're up to about 6,000 queries a day across roughly 40 scheduled jobs, so it's a significant enough workload that you can't call it some half-baked thing nobody uses; we're probably one of the larger users of MotherDuck, not from a data-volume perspective but from a variety-of-workloads perspective. And yes, I calculate what I would have paid if I had to pay for it.
32:41 On data volume: I type my data sets very tightly. I don't have unstructured JSON everywhere, which lets me use floats and ints a lot, and I keep my strings to a minimum, because I know the vectorized execution engine really loves fixed-width fields. So I'm at about half a terabyte, where in the old days that would have been something like 10 terabytes, just from a bit of type discipline, knowing that I can always reload from S3, because it's really easy for DuckDB to hit S3; I can always recast and put it back up.
33:20 Our billing is ridiculously tight. We basically bill the integral under top: if you run top and watch the bars go back and forth, we bill on the green stuff. That means you only pay for when your core is actively working on something; even time blocked on I/O doesn't count toward the pricing. So it's ridiculously cheap to run, because in some ways we have to compete against DuckDB itself; we have to make the DuckDB experience good enough that we can charge a premium on it while staying lower than other warehouses. If you follow the discipline, you will get a really cost-effective solution for your data.
34:19 And this is a rough profile of what dogfooding has looked like since I started. When I first started, there was a real question of whether this thing was going to work at all, and most of the time went into making sure the basic stuff, loading data, a few aggregations, worked; I'll call that golden-path testing. Over time we started getting customers, so a certain portion of the time became customer repros, and we had this environment where I could load a similar data set and watch the bugs happen in real time as I tried things. Notice that the green area, internal use cases, is minimal at this point, because there's no sense analyzing company data while your golden path still looks like a problem.
34:53 Then, maybe six months before beta, the product got significantly better, especially with a newer DuckDB release, and I started actually giving feedback on the usability of the products: they worked, but were they the right shape for how people would actually use them? It was still early enough in the product cycle that my data-engineering experience could inform how they turned out.
35:13 Then we get to beta, and by this time the golden path is actually really good; there were a few caveats we shipped as warnings, but basic analytic workloads ran reasonably well with a few things to work around. At this point I'm actually doing work for the company itself; I'm the data engineer responsible for analytics for the company, though that's still only about 50% of my time. Then we get to GA, which was just this year, and now maybe two-thirds of my time is actually doing data work for the company, because the thing isn't giving me trouble anymore. The reason is that all the work in the yellow area flushed those bugs out earlier, so you won't be as surprised as I was. Over time I'll just become a data engineer again instead of a QA engineer.
36:11 We also have this tight feedback loop: I run an Airflow job, and we can watch the logs and the bugs spin by. That matters, because with a customer we'd have to say, can you please run this thing while we watch, or can we have access to your data for a repro; we never repro on customer data without permission, your data is very private. Having this tight loop between repro and internal workloads lets us iterate quite a bit quicker. So these are examples of the kinds of issues that wouldn't have been caught in testing but got caught in dogfooding.
36:45 It turns out we had a slow leak of connections, of file descriptors: it would accumulate over about a week and only then hit the limit at 1024, and you're never going to catch that in a unit test, because a unit test doesn't run a slow leak over time. Concurrent schema changes: it was assumed that ALTERs were rare events. Nobody had accounted for dbt, where every materialization of a table is a CREATE TABLE AS, which is a DDL statement. We had this behavior where, if the schema changed, your query would abort; if alters and catalog changes really were rare events, fine, but this is not a rare event, so every time my dbt jobs ran, they would abort other queries. So we killed that behavior.
37:27 Oh yeah, long SQL. We have SQL notebooks, and errors would show up below your SQL. That's fine if your query is this big, but if your query is this big, you're scrolling up and down every time you hit an error, so we recently moved to highlighting errors inline, because with big SQL that's a game changer; it's what DataGrip does, the red squiggle. I also validated differential storage, as I mentioned.
37:57 Oh yeah, bind timeouts. You can attach DuckDB to so many things, so you can do this: I'm going to attach to my own Postgres database, run a query there, pipe it through my local client, join it with some remote data, and write the result to a new table in MotherDuck; notice we haven't attached MotherDuck to Postgres, your client is the bridge. The problem is that we expected at least a row every 15 seconds to keep that connection fresh. If you write a highly selective query, you're scanning a billion rows and 40 come out over 10 minutes, because it's a cold Postgres table scan, and that timed out. That's the kind of thing that's really hard to test for; it's a very specific circumstance, but when we tried it with Postgres, it happened.
38:45 Deeply nested JSON: DuckDB had an issue where the assumption was that a JSON document is maybe 30 fields with one nesting level, and I showed up with something that was 80 levels deep for each record, because it was dumped out of MongoDB. DuckDB was not expecting that and used something like 500 times the memory of the string just to parse it, so that got fixed. And then slow extension load times: we weren't watching the size of our binary, and that thing gets checksummed and verified against the signing key for security every time you load the extension, which is crucial, but our binary had grown over time because we weren't keeping track of what went into it, and the check was taking a second and a half or more. That's very noticeable when you run the whole stack and load the extension every time, so we got it down.
39:29 Basically, that's what I've been doing since day one: being essentially the sacrificial data engineer, your proxy internally, trying to run into issues before you do, which means it's ready for you to try. You can find us at the booth; I'm anxious to hear about your experiences. Any questions?