Big Data is Dead: An Insider's Story on the Rise of Small Data
2024/10/24 | Featuring: Jordan Tigani (CEO, MotherDuck)
TL;DR
- The "Big Data" Myth: 95% of companies do not operate at hyperscale. Most workloads involve less than a terabyte of data and focus on recent logs, making complex distributed systems unnecessary.
- The Big Data Tax: Traditional "scale-out" architectures impose high latency (approximately 400ms overhead), unpredictable costs (60-second billing minimums), and operational complexity that stifle innovation.
- Scale-Up vs. Scale-Out: Modern hardware allows for a "scale-up" architecture that processes data on single, powerful nodes, eliminating the network shuffle and coordination overhead of distributed systems.
- Per-User Tenancy: MotherDuck's unique "Duckling" architecture provides isolated compute for every user or tenant, solving the "noisy neighbor" problem and enabling granular cost attribution for SaaS vendors.
- Unified Development: By combining DuckDB with a serverless hybrid architecture, MotherDuck offers "Dual Execution," automatically optimizing queries to run locally or in the cloud for near-instant interactivity and a unified dev/prod workflow.
Vendors sell tools built for the 1% of companies operating at hyperscale to the other 95%, saddling users with architectures too complex, slow, and expensive for their actual needs. As a founding engineer on Google BigQuery, I watched the predicted "data cataclysm" fail to materialize. Most customers stored less than a terabyte. Frequent queries ran against even smaller subsets, typically just the last seven days of logs.
The industry needs a new approach built for the reality of modern data work.
Why Are Most Companies Paying a "Big Data Tax"?
Early AWS instances forced engineers to distribute loads across fleets. Modern hardware renders this approach inefficient for the vast majority of workloads.
The Latency Tax: Scale-Out vs. Scale-Up
Distributed architectures impose a mandatory latency floor on every query. Even simple queries must be broken up, sent to thousands of machines, and reassembled.
A scale-up architecture eliminates this shuffle overhead entirely.
MotherDuck runs queries on optimized single nodes (vertically scaling CPU and RAM), returning results in milliseconds. This makes truly interactive analysis possible. Distributed systems like BigQuery often incur a 400ms overhead before returning any result.
The Cost Tax: The "Surprise Bill"
Distributed systems enforce minimum billing increments that inflate costs for interactive workloads. Running a petabyte query to demonstrate scale might cost $5,000. The smaller, unpredictable bills from frequent interactive queries cause more damage over time.
Legacy cloud warehouses like Snowflake often enforce 60-second minimums for compute. MotherDuck uses a 1-second billing minimum, reducing costs for high-concurrency applications. For the interactive queries that drive modern BI and SaaS apps, this difference fundamentally changes the cost structure.
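A back-of-the-envelope sketch makes the arithmetic concrete. The per-second rate, query duration, and query volume below are hypothetical placeholders, not published vendor pricing:

```python
# Illustrative only: how a billing minimum inflates interactive-workload cost.
# All numbers are hypothetical assumptions, not actual vendor rates.
RATE_PER_SECOND = 0.001   # dollars of compute per billed second (assumed)
QUERY_SECONDS = 2         # a typical short interactive query
QUERIES_PER_DAY = 10_000  # a high-concurrency SaaS workload

def daily_cost(billing_minimum_s: int) -> float:
    """Each query is billed for at least `billing_minimum_s` seconds."""
    billed_seconds = max(QUERY_SECONDS, billing_minimum_s)
    return billed_seconds * RATE_PER_SECOND * QUERIES_PER_DAY

print(daily_cost(60))  # 60-second minimum: pay for 60s per 2s query
print(daily_cost(1))   # 1-second minimum: pay only for the 2s actually used
```

With these assumed numbers, the 60-second minimum bills 30x more compute than the workload actually consumed, which is the cost-structure difference described above.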
The Complexity Tax: Maintenance-Free Infrastructure
Complexity stifles innovation. Managing "big data" systems often requires dedicated platform engineering teams to handle partitioning, clustering, and warehouse sizing just to keep performance acceptable.
MotherDuck removes this overhead entirely. You do not manage servers, spin up clusters, or configure partitions. This maintenance-free experience allows developers to focus on insights, not infrastructure.
Table: Scale-Out Architectures vs. MotherDuck Scale-Up
| Architectural Impact | Distributed "Scale-Out" Systems (Traditional Cloud DW) | MotherDuck "Scale-Up" (Hybrid/Small Data) |
|---|---|---|
| Latency Overhead | High (~400ms): Requires coordination across thousands of nodes before processing begins. | Near Zero (Single-digit ms): Runs on optimized single nodes; eliminates network shuffle. |
| Billing Minimum | 60 Seconds: Users pay for a full minute even for sub-second queries. | 1 Second: Precise billing that drastically reduces TCO for interactive workloads. |
| Operational Velocity | Slow: Complex systems make simple optimizations and schema changes difficult to implement. | Fast: "Single-player" simplicity allows for rapid iteration and instant feedback loops. |
| Primary Use Case | Petabyte-scale batch processing (Top 1% of companies). | High-performance interactive analytics (95% of companies). |
Why Isn't Your Data "Big" Anymore?
The tech industry fixates on petabyte-scale problems, yet these reflect a tiny fraction of real-world workloads.
The Reality of Modern Workloads (The 95% Rule)
Most analysts do not query massive, historical datasets. They focus on small, recent subsets. Data just one week old is roughly 20 times less likely to be queried than data from the current day.
Modern architecture amplifies this pattern. Separation of storage and compute allows historical data to sit inexpensively in object storage.
Critical workloads remain small, evident in the Medallion Architecture, where final "gold" tables used for BI are highly curated and compact.
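A minimal sketch of that compaction, using made-up event rows: a bronze tier of raw events collapses into a gold tier with one row per day, the small table a dashboard actually queries:

```python
from collections import defaultdict

# Hypothetical bronze tier: one row per raw event (in practice, millions of rows).
bronze = [
    {"day": "2024-10-01", "user": "a", "amount": 10},
    {"day": "2024-10-01", "user": "b", "amount": 5},
    {"day": "2024-10-02", "user": "a", "amount": 7},
]

# Gold tier: one aggregated row per day, compact enough for interactive BI.
gold = defaultdict(lambda: {"events": 0, "revenue": 0})
for row in bronze:
    gold[row["day"]]["events"] += 1
    gold[row["day"]]["revenue"] += row["amount"]

print(dict(gold))
```

However large the bronze tier grows, the gold tier stays bounded by the number of days, which is why the critical workload remains small.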
Hardware Evolution: Modern Laptops Rival Legacy Clusters
The definition of "big data" moves with the limits of a single machine. Yesterday's cluster-sized problems are today's laptop-scale tasks.
A developer's laptop, like a MacBook Pro with an M2 chip, delivers performance comparable to the 3,000-node cluster used in the original Google Dremel paper. Local hardware capabilities have grown by orders of magnitude. Most workloads no longer require distributed system complexity.
From Single-Player to Multiplayer: Why MotherDuck?
DuckDB proved its production readiness by solving complex engineering problems like robust time zone support. But while open-source DuckDB provides high-performance local "single-player" analytics, it lacks the persistence, security, and sharing capabilities required for a production data warehouse.
MotherDuck wraps the speed of DuckDB in a managed, serverless platform that handles identity, durability, and collaboration. This combination has proven transformative for customers. FinQore migrated their financial data pipelines from Postgres to MotherDuck, reducing processing time from 8 hours to 8 minutes. This 60x speedup enabled real-time AI capabilities.
Per-User Tenancy: The "Duckling" Architecture
For B2B SaaS vendors and data teams, shared resources in traditional warehouses create "noisy neighbor" problems. One heavy query slows down everyone else.
MotherDuck introduces a unique architectural concept called "Per-User Tenancy" (or "Ducklings"). When you connect to MotherDuck, you get your own isolated compute instance. This architecture provides two critical advantages:
- Performance Isolation: One user's heavy workload never impacts another's dashboard performance.
- Granular Cost Attribution: SaaS vendors can track costs down to the individual tenant level.
This architecture is critical for customer-facing analytics. In the Layers case study, the company used this architecture to avoid a 100x projected cost increase they faced with a competitor, while delivering a dedicated "mini data warehouse" experience to each of their customers.
Unified Development: The Hybrid Advantage
We created a serverless data warehouse treating the local machine and the MotherDuck cloud as two nodes in a unified system. When a user connects, a full DuckDB engine runs on the client, inside the Python process or CLI, alongside the engine on the server.
Dual Execution: Unified Development & Production
The optimizer enables Dual Execution, analyzing SQL queries to push computation where data resides. If a query involves only local data, it runs entirely on the machine. If it involves only cloud data, it executes on the server.
This unifies the development and production environments. A data engineer can build a pipeline locally using dbt or the MotherDuck CLI, testing against local CSVs for near-instant feedback. Once ready, the exact same code runs in the cloud against production data.
For users currently stuck on Postgres, the pg_duckdb extension acts as an immediate on-ramp, bringing analytical speed to transactional databases.
MotherDuck integrates into the "Modern Duck Stack," serving as a drop-in replacement compatible with dbt, Fivetran, Airbyte, and Omni. This ensures you can adopt the speed of DuckDB without breaking your existing workflow.
Table: Traditional Cloud Data Warehouses vs. MotherDuck
| Platform | Architecture Type | Query Execution Method | Data Transfer Strategy |
|---|---|---|---|
| MotherDuck | Serverless Hybrid | Dual Execution: Intelligently splits processing between the local client and the cloud based on data location. | Smart Transfer: Only transfers essential, filtered result sets to the cloud; caches heavily on the client. |
| Legacy Cloud DWs (e.g., BigQuery, Snowflake) | Fully Distributed | Cloud-Only: Forces all queries, regardless of size, to be processed by remote clusters. | Full Upload: Requires uploading all local files/data to the cloud storage before analysis can begin. |
| Standard DuckDB | Local In-Process | Local-Only: Processes everything on the user's machine using local resources. | None: No cloud connectivity; limited to the storage capacity of the local device. |
Conclusion: The Developer-First Data Warehouse
The era of "big data" by default is over. For the vast majority of workloads, legacy distributed architecture is a trap of unnecessary cost, complexity, and latency. Engineering teams require an architecture that prioritizes developer experience and efficiency for the 95% of use cases that are not petabyte-scale.
MotherDuck delivers the speed of a scale-up architecture, the precision of 1-second billing, and a maintenance-free platform that lets you focus on building. Your data probably is not "big," so stop paying the big data tax. Sign up with MotherDuck for free to run your first query in minutes.
Transcript
0:00welcome back to the Mad podcast today my guest is Jordan Tani CEO of cloud analytics company mother duck mother duck and the duck DB open source project have become very Buzzy in the world of data infrastructure with a promise of delivering fast efficient analytics without the complexity of traditional Big Data Solutions we talked about why big data is dead and the rise of small
0:23data if you have smaller amounts of data you can move faster less expensively because the architecture is simpler we can focus more on building better experiences the mother duck product and it's focus on speed we can do queries in sort of single-digit milliseconds in big query we were very very happy when we got the overhead down to like 400
0:42milliseconds and Jordan's unlikely entrepreneurial journey I always figured that there's people out there that are like going to start companies and then there's like kind of normal people and I was one of the the normal people please enjoy this great conversation with Jordan uh it feels like the absolutely unescapable unavoidable way to start this conversation is to talk about small
1:04data just when we thought we had finally made it in the world of Big Data you came up with a very well um written and very noticed uh blog post in early 20123
1:17called Big Data is dead uh and like you built a whole thing uh around this and just last week in San Francisco you had the you ran the the small data conference so small data what is it all about um so I'm very glad you said that it was unavoidable like that I mean we we did we did a lot to try to get the
1:35message out and get people get people excited and um you know it feels like it's a little bit sort of counter counter the prevailing narrative that uh you know everything for 15 years has been about Big Data this big data that how big is your data uh how much can you scale and um kind of the Tipping Point
1:55for me was when um I saw the the sort of
2:01data bricks versus snowflake kind of they had this benchmarking war and um and everybody's focused on like the the the you know the war between data bricks and Snowflake and to me the biggest thing that I noticed was like well they're looking at the database Benchmark at the the databas this the the query sizes they were using was 100
2:20terabytes and I remembered back from my time at Big query um you know we had some of the largest customers in the world we had Walmart Home Depot Equifax HSBC you know like and and nobody was using anything you know running queries anywhere near that because it would have actually fallen over at the time and um so I I I knew that and people like were
2:43um you know were kind of not even really pushing up against against limits and so I thought like wow if if this is sort of the state-of-the-art that people are focusing on this size of data that nobody has um there's got to be an opportunity actually to sort of like look at this smaller you know smaller data sizes and I and I kind of was
3:01remembering back um to when I was doing a bunch of analysis uh also when I was at Big query on the you know the query sizes that people were using and most people actually were using you know had small data the amount of data they were actually using was even smaller than that and um and uh and so kind of was
3:20like you know hey I bet if you were going to design a system these days from scratch like you do it differently you know after Google came out with you know map produce and and GFS and big table kind of everybody's all of which was in like 2006 right 20 years 2006
3:39everybody's brain just sort of like broke and they're like wow in order to build systems that can handle the data sizes that we're seeing you kind of you have to just dramatically change how you're building them you have to you have to run on lots of machines lots of cheap and expensive machines versus you know these giant hyper expensive
3:58machines and to be fair that was a problem at the time um but you know nowadays like you know I've got a Mac M2 laptop it's 2 years old um it's like probably an order of magnitude to two orders of magnitude faster than the server machines were back when you know map produce came out and people started
4:16building building these systems you know let alone like nowadays the server systems you know have hundreds of cores and you know can have terabytes of ram um and uh and so really kind of like if you were going to build something now like why would you bother with all the complexity of scale out because the thing about the way we designed systems
4:34like I was one of the people that helped start Google big query uh and I worked on you know single store for a couple years so I um you know I have been in you know with my elbows deep in in you know building these these these complicated systems is that like there's just this huge tax that you pay to um to
4:54have to build a distributed system that scales out and can do like you know distributed transaction and shuffling data and um and you know if you were going to design something for kind of Modern Hardware you could make it much much faster um and you could make it much simpler so me meaning you could actually progress progress faster
5:16because that was one of the things is like you know there was a couple of like well-known joint optimizations that we added to Big query and they took like a year to to do
5:28and um they weren't that hard it's just to get everything right um was you know just took means means that it took took a long time um so sort of getting back getting back to small data kind of the idea is that um if you have massive amounts of data you have to build these complex systems if you have smaller
5:48amounts of data you actually don't you know you don't need such such complexity and you can move faster you know less expensively um and also now that networking speeds and you you know local machines you know laptops have gotten so much um so much more performant you know you can actually push workloads down to the end end user and and that opens up
6:10so many kind of new and different architectures and different ways of of um of handling data and building and building systems and I think we're starting to see that with you know some other you know some startups out out there did we invited to this um this small data small data conference um yeah the conference was sort of born it was
6:29the idea of um Bob the CEO of we8 and uh
6:33he's like hey what do you what do you think about like doing like a small data conference and I'm like that would be like that I just love the idea because you know we try not to take each take ourselves too seriously at mother duck and you sort of like you know you kind of have big data London and you know Big
6:46Data this and big data that and that we could just sort of like kind of poke a little bit of fun at those at those things and do things a little bit a little bit differently and that not I was like looking at the you know the manifesto that you have on the small data SF 2024 um uh website and the I love I mean
7:06clearly you having you're having a lot of fun just from a marketing uh perspective it's um have to like you know give you Kudos it's it's amazingly well done I mean there's so many it's so hard if you're technical company uh to break through the noise and come up with something that feels like a movement and feels like a Manifesto and like you guys
7:26have done a remarkable job uh doing that and it's um yeah it's it's really fun to watch uh thanks so much I think a couple times in my career I have built things that um that I thought were amazing technology and like didn't get the other pieces right and you know nobody saw them and and so um you know distribution
7:49matters as as he turns out right yeah you know kind of distribution excitement and you know so kind of when we started we started mother duck it was sort of a very um very deliberately wanted to make sure that we weren't just sort of you know writing a bunch of code and throwing it over the wall and saying hey
8:06if we build it they will come that we were also kind of TR trying to tie it to some some thought leadership you know take take advantage of you know we have several people in the company who kind of have a lot of experience building these kinds of systems and kind of had seen you know kind of the trend going in
8:23this direction and kind of the where we thought the world should go was in this other direction and so you know had something we could be passionate about and that we could that we could write about and you know and add a little bit of add a little bit of of fun and sense of humor to it uh you know I guess I
8:37guess always always helps so a couple of uh just uh questions which I'm sure you've gotten a thousand times uh if you already have you know big query in place or snowflake or data break or like the whole like big data uh kind of infrastructure modern data stack whatever you call it um if you can do the Big Data stuff can't you do the
8:58small data uh with it I mean isn't that shouldn't they be priced in a way that uh you know ultimately that should not make a difference to you yeah and so I think um you know I mentioned kind of before that there's like this this tax you pay with these these complex distributed systems and you pay the tax
9:14twice you pay the tax in terms of latency uh it's just when you if you think about it um latency is going to query the whole thing well you're going to query the whole thing but also when you send a query to Big query for example like your quer is going to get spread out over possibly thousand or
9:30thousands of machines um the all of that coordination takes T takes time um you know there's this you know schedule or allocating allocating the slots there's you know all the rpcs dealing with you know cancelling things that that had been running on them previously getting all the results aggregating the results um you know like there's just a limit to
9:52how fast you can how fast you can do that versus if it's just you know the first machine that you hit actually runs the query you know you can do things you know I mean um you know we can do queries in sort of single- digigit milliseconds and um you know in big query we were very very happy when we
10:10got kind of the overhead down to like 400 milliseconds so that's like two orders two orders of magnitude difference uh and then on the other side it's just the you know the cost that the cost tax you know you have to have all this Hardware um creates a lot of overhead and you know I think somebody that um you know a big big bank did some
10:30benchmarking against bigquery Snowflake and single store when I was at single store and the kind of they estimated that on a per core basis uh big query was 40 times less
10:41efficient than um than single single store which is kind of a more you know kind of you know dedicated performance optimized uh query engine but so you're giving up at least in order of magnitude and for for for Google it's like whatever we own the hardware we we'll just throw lots of cores at it um but at some point you know like you
11:02got to pay for those somebody's got to pay for those cores and somebody's got to pay for that inefficiency and so I think if you build these systems um simpler more simply then you know you can you can have uh kind of dramatic you know dramatically less expensive um and then the kind of there's there's another bit that people don't always recognize
11:21which is sort of the velocity of improvement and you know often when you choose a technology you know you want you want to stick with that technology for at least a few years and and so really what what you're buying is not just where it is now but where it's going to be a year from now two years
11:39from now five years from now and when you kind of have these complex distributed systems they get better very slowly and you know if you look at duct B for example which is you know what what you know my company is is is built on top of like incredibly rapid pace of improvement like in incorporating like brand new stuff coming from you know
12:00algorithms uh you know joint order optimizations coming from Academia you know The Graduate students that come up with those Implement them in duct EB and then they ship them and then they're out they're out like a month later and um so I think that the you know kind of the rate of improvement is also something that you have to uh have to take take
12:20into account that kind of some of these Big Data Systems are just you know going to have a hard time keeping up is some of it uh question of use case as well meaning uh if you use data as in you know modern data stack for purposes of bi effectively data analysis uh then uh
12:42you know maybe small data makes a lot of sense but if you want to I don't know train a big machine learning model then you know as abundantly documented you want as much data as possible is that is that is that uh is there some truth to that yeah absolutely there there are clearly some use cases and some
12:59workloads that you know that are not are not small data I think you know big big data may be dead but it's you know not uh it's not going away um and I I think
13:12the one one thing that I often hear though is um you know sometimes people like well I read your I read big data is dead and I I I you know I agreed with with most of it but um but I've got big data you know uh but I think actually
13:27there because there's a lot of people that that have a lot of data but what they actually use is a small section of data so you might have 10 years worth of logs it might be a petabyte worth of logs um but if you only actually query the last seven days or the last day um that's that's not really big data
13:44because of separation of storage and compute that other data just sort of sits there and is cold on you know on AWS S3 um on or on you know on Object
13:55Store somewhere the only the important part is the part that you're querying and and you know the vast vast majority of workloads actually just query that um that hot data because it's um because it's expensive to to query the whole thing I mean like I used to run this query um I used to I used to give talks
14:16on on big query and i' get up on stage and I'd say hey look at this data set it's a petabyte and I'm going to query this whole petabyte and I H enter and it would query the paby and I would like look isn't that amazing I cared a paby and the thing that I didn't tell you is
14:27that cost $5,000 to qu that pedy and um you know I think you know big data has been able to make it possible to do some of these giant things but if you still have to do all the work you still have to do all the work and it and that and they you they haven't been able to make
14:43that inexpensive and so I think part of the kind of small data is is actually sort of recognizing that okay in order to not have massively expensive um analytics bills you want to actually kind of slim down aggregate um you know the work you're doing and often there's like the you know people have The Medallion architecture you land data in
15:03the bronze tier and then you kind of transform it a little bit into the the silver tier uh and then finally you have your gold presentation tier very often that presentation tier that gold tier is pretty small and I think as you were mentioned the stuff that you run your bi off of you know if you have a human
15:18waiting there you want you want that to be fast uh and that's just kind of another another reason to to sort of use kind of a low latency small small data tool to you know to operate on the old tier data so I'd love to go into uh mother duck specifically so what is duck DB you know duck TB like like um like
15:39sqlite is just a library uh it's something that you know as you're building your code you link that in it's you know functions that you can call inside that library and you don't have to set up a separate server somewhere and call out to that server uh it just sort of runs everything inside inside your inside your process and from a you
15:58know from a latency perspective that's that's super nice from a complexity perspective it's also super nice you know if you're running in Python for example um you know the the way python works is kind of things in Python have access to the the variable name space and so actually in duct TB you can query against your python objects so like so
16:19if you just Link in you know if you just import duct DB actually in your in your python process you can just start querying querying data without even having to move it it or or prepare it or do anything so it's just it's just really really kind of very high developer uh experience uh for accessing
16:39your data uh it also works you know well as a sort of Standalone as a standalone database as well it's not just sort of in memory but it's uh um you know I think it's and and it's also because it's just a library it has no dependencies and it's just it's it's incredibly lightweight it can run in the
16:57browser so it can run in under under um you know under wasm in the uh in in the browser so you can have a full-blown uh query engine running in your running in your browser if you just go to shell.
17:09dub. org it's like it's like wow I can run a bunch of like you know SQL queries right right in my browser without installing anything um which is sort of I think one of the things that makes it makes it pretty pretty unique and uh historically that was a a project developed at the University uh at a at a research
17:28institution in am Dam called uh called CWI it's also where um python was created and so who created it who are the creators of it so yeah it was uh hanis M heisen and Mark Ron Ros Mark roselt um Mark or Hest was a um a professor Mark was one of his graduate students and hanis had just gotten
17:48tenure and so like kind of nobody could tell him what to do for a little while and uh Mark had finished his PhD papers early and so nobody could really tell him what to do for a while and they're like hey we've been using mon ADB and kind of there's a bunch of limitations uh let's let's write our own
18:06and to solve some like some problems that they'd seen in kind of the data science world and that you know they they're like data scientists hate databases and partly it's because they had to install databases and configure them and load data into them and they they said hey you know there's a there's a there's a there's a better way and it
18:23turns out that they're amazing you know they're amazing database researchers but also you know great and they were able to build a a super useful system that just sorted getting becoming more and more more more and more popular uh when was this what roughly what year uh I think it was 200 I think they've been going for five
18:43or six years I think it's been uh like five and a half years now so probably 2018 tell us about how where mother duck fits in the puzzle I saw a tweet of uh of somebody um I was I was still at single store and it was was doing some bench somebody was doing some benchmarking of single store against
19:04uh uh you know big query and red shift and um and this thing that I'd never heard of called ducky B and I'm like wow that's really fast like like where did this come from how did how is that possibly that fast and I started digging into it and I'm like it's like oh it's a research prototype and then I realized I
19:21actually kind of looked at it and it's like I saw their blog post on on time zones and in big query we didn't Implement time zones for I think like six or seven years because it's just really hard to get right and like you know there's all kinds of Bizarro things in uh in when you when you deal with uh
19:41you know with with time zones and we're like well we don't have to worry about that everybody can just convert from you know from UTC from you know from Greenwich meantime and um the fact that they had actually had the attention to detail to do this it was like hey there's something going on here that the this is not just a kind of something
19:58that you that you write to to that you build to sort of do a couple of papers on to get your PhD this is like this is this is real and so I'm like you know somebody like it scales down so small um it's so lightweight um you could build an amazing serverless system using this and somebody should put it in the cloud
20:17and um you know and build a service around it and I'm like hey you know I helped you know start start big query i' I've done one of these before and I helped um you know with the the single store sass service and like I I've done a couple of these I kind of know know how it works uh you know maybe that
20:33should be me and I had never really thought of starting a company before and um but the idea it was just it just was so compelling and uh the technology was amazing and you know I met Hest and Mark I was actually on a um so you just read like you just email them and say hey uh
20:51 Yeah, I got an intro from a mutual acquaintance, actually Lloyd Tabb, the founder of Looker; in the world of common acquaintances, that's a good one. I knew that he was working on Malloy, this metrics-layer thing, and I knew that would work with DuckDB, so I asked him for an intro to those guys. He also said, while you're at it, you should talk to my friend Tom, he invested in Looker, he'll give you some feedback on the idea. That was Tom Tunguz.
Who we did a great episode of the MAD Podcast with, which we'll add to the show notes, as real YouTubers say.
Yeah, and he did our seed for MotherDuck. At first I was asking for a job. I'm like, hey, what you guys are doing is really cool; if you're going to build a cloud service, I'd love to come work on it. Clearly you're going to do that, right? Why wouldn't you? And they said no. They said, we just really want to focus on building the core database, and we're not really interested in that; but if you were going to do something like that, we'd love to partner with you. And that sort of got me thinking. My first
22:12 vacation post-COVID was in Portugal, and I rerouted the return through Amsterdam instead of Paris to meet Hannes and Mark. We had scheduled something like four hours in the afternoon, and I'm thinking, we're a bunch of nerds, what are we going to talk about for four hours? We'll talk for 45 minutes and then start getting nervous and awkward. But we just started talking, geeking out about this database and that database, and the next thing we knew it was dinner time. We went to dinner; my wife was there, Hannes's wife was there. It was this really cool meeting of the minds, and we realized we could work together. And that's how we started working together. I'm hoping that if this works out, and obviously I'm hoping it works out, it will be a model for how to do an open-source-plus-corporate, VC-backed
23:19 arrangement. They have a foundation that owns the DuckDB IP; it's MIT-licensed, and it's always going to stay open source. But we also gave them a chunk of the company when we started, essentially a co-founder share, so they're incentivized for us to be successful. At the same time, we have no direct say in anything they do; Hannes may decide to go build something weird, something we don't like, and that's just how it works. On the other hand, he did agree not to work with anybody who's doing something similar to what we're doing, and he's also incentivized not to drift too far off from it.
What about the community? I guess the community does what the community does, but
24:13 it's sort of common wisdom in startup and venture circles that the commercial company building on top of the open source should also own the community, to the extent that any community can be owned. How does that work for you guys?
Right now we have somewhat disjoint communities. We have our MotherDuck community and a Slack; but there's also a pretty vibrant DuckDB community, and we have not jumped in and tried to own or run anything there. I don't think that would have gone over well. On the other hand, we try to be helpful where we can: we contribute a lot back to the DuckDB code, and we have great relationships with the founders and with the community. I think it's actually a happy way of doing things. DuckDB is super popular, and we don't want to horn in on that, as long as we can be successful building our managed
25:23 service.
Tell us about the company today. I think you've raised about $100 million across three rounds; is that the right number?
Yeah. We raised our seed round from Redpoint, Madrona, and Amplify; then we got preempted a few months later by Andreessen; and then we got preempted for our B about six months later by Felicis. Building a database as a service is expensive. DuckDB is an amazing piece of software, but it's not a data warehouse, and turning it into a data warehouse is hard. There are a lot of places where you poke it in the wrong direction and it falls over, because no one had ever poked it in that direction before. So there's a lot to work through, a lot to build. We're building a pretty rich database as a service: a serverless backend, highly multi-tenant, with a hybrid execution system (dual execution, really) where we can push workloads down to the end user and split query plans. So I think we're doing some interesting, non-trivial
26:42 stuff. Partly because I've seen this model before in open source: somebody builds a great open-source project, and that's what they completely focus on, because that's what gets adoption, excitement, and funding. Then they say, well, we'll monetize with SaaS, we'll put it in the cloud, and it's an afterthought; a couple of junior engineers work on it, and you're basically just running the thing in a Kubernetes container. It ends up being trivially clonable by AWS or Google or somebody else, and you end up without much of a moat. To me that's disappointing, not because the hard work gets cloned elsewhere, but because there are a lot of interesting architectures you can build if you really focus on the SaaS service. So for MotherDuck, we are focusing on building
27:55 this unique way of delivering the service to users, and DuckDB Labs gets to focus on building this great, amazing database. Our API is not a typical web API where you send a query and get a result back; it's a partial-query-plan API, where you send a piece of a query plan and you get a piece of a query plan back. It's complex, but it's complex because it can actually deliver value: you can join data from your Postgres server against data that's living in the cloud, and you can build UIs that let you query data at 60 frames per second and fly through your data as if it were a video game. The reason we can do that is that we do part of the work locally and part of the work in the cloud.
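The split described here can be sketched in miniature. This is a toy illustration only, not MotherDuck's actual partial-query-plan API; every name, the plan shape, and the routing heuristic are invented for the example:

```python
# Toy dual execution: a "plan" is a set of scans, each tagged with where
# its data lives. The planner routes cloud-resident scans to the server
# and runs the rest locally, then combines results on the client.
# All names here are hypothetical, not MotherDuck's real API.

def split_plan(plan):
    """Partition a plan's scans into local and remote work."""
    local, remote = [], []
    for scan in plan["scans"]:
        (local if scan["location"] == "local" else remote).append(scan)
    return local, remote

def execute(plan, run_local, run_remote):
    """Run each half where its data lives, then combine locally."""
    local, remote = split_plan(plan)
    parts = [run_local(s) for s in local] + [run_remote(s) for s in remote]
    return [row for part in parts for row in part]  # naive combine

# Example: one table cached on the client, one living in the cloud.
plan = {"scans": [{"table": "orders", "location": "local"},
                  {"table": "customers", "location": "cloud"}]}
fake_local = lambda s: [f"{s['table']}:local_row"]
fake_remote = lambda s: [f"{s['table']}:cloud_row"]
print(execute(plan, fake_local, fake_remote))
```

The real system splits a single query's operator tree rather than whole scans, but the routing-by-data-location idea is the same.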
28:54 So, a long-winded answer, but we are trying to spend a lot of our innovation tokens on how the SaaS infrastructure delivers the service to users.
You mentioned, right before we started recording, that you have four offices: Seattle, New York, Amsterdam, and SF.
29:14 Yes. We have about 50 employees. We started during COVID and everybody was explicitly remote, but like a lot of people we recognized there's value to having people in person, collaborating, feeling like you're part of a real team. And we realized almost everybody was in one of four cities. Seattle: I'm in Seattle, and that's where our headquarters is. San Francisco, which is just a huge concentration of talent. New York, where a couple of co-founders were, plus it's a nice balance between the West Coast and Europe. And finally Amsterdam, because we wanted to be close to the DuckDB team, so we can go over and ask how things are coming along, or take them to dinner, and actively manage the relationship. We're pretty evenly distributed so far, and I think it's working pretty well. Each office is developing its own flavor: our New York office is probably the youngest and the most fun, of course, and Amsterdam is the most academic.
30:46 We've raised $100 million, and we have several thousand users; it's growing super fast. We've deliberately priced it pretty low: we want you to be able to get in and get a data warehouse that works for you for $25 a month. Enterprise prices will of course end up being more than that, but we think you shouldn't have to spend thousands of dollars a month to do basic data warehousing and analytics.
Do you want to explain, then, what in-memory means?
In a transactional database you're
31:29 typically operating on one thing at a time. You have an order: you create the order, maybe you update its state, and there's a bunch of consistency checks to make sure that order matches a real customer, a real product, real line items, and so on. Analytical databases, on the other hand, tend to operate across data, so you can ask questions like: how many orders did I have in the last week? Who is my biggest customer by amount spent,
32:07 broken out by region? Those types of questions. The data tends to be stored differently in these types of databases: column store versus row store. Then there's also a subclass, in-memory databases, which means that if you turn your computer off, you lose all the data; but on the other hand, memory is orders of magnitude faster than disk (depending on the kind of disk, but generally much, much faster), so you can do things blindingly fast.
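The row-versus-column distinction is easy to sketch. This is a toy contrast only; real engines store compressed, typed column vectors, not Python lists:

```python
# Toy contrast between a row store (transactional layout) and a column
# store (analytical layout). Illustrative only: real engines use
# compressed, typed column vectors rather than Python lists.

rows = [                       # row store: each order kept together
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 80.0},
    {"id": 3, "region": "EU", "amount": 200.0},
]

columns = {                    # column store: each field kept together
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 200.0],
}

# "What's my total spend?" only needs `amount`. The row layout drags
# every field past the CPU; the columnar layout scans one dense array.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])
assert total_from_rows == total_from_columns == 400.0
```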
32:48 There was a wave of in-memory databases that started about 10 or 15 years ago; MemSQL, which became SingleStore, is where I worked previously. But it turns out people really want persistence: if they take all the energy to load data into their database, they want it to be there, and they want it transactionally updated.
What makes DuckDB and MotherDuck so fast, then, and so appropriate for those use cases?
One of the things that makes it fast is that it's
33:27 a brand-new database, built from scratch, applying the latest and greatest best practices. There's a paper Michael Stonebraker wrote about 15 years ago (Stonebraker is a Turing Award winner, the Nobel Prize of computer science, for databases, and the creator of many database companies; he created Postgres and Ingres). I think the paper was called "No Free Lunch," and it was really about how the technology has changed dramatically, so why haven't databases changed? Think about SSDs, like you might have in your laptop: they don't have the spinning platter anymore, and because of that they're much, much faster at finding things. Some things they do are dramatically better, some aren't as good, but if you were building a new system, you would just build it differently. People ended up saying, well, the old one works kind of well. And it's been a decade or more since that paper; even more has changed in hardware, in sizes and speeds. You get these almost-memory-speed disks, and they totally upend the way people think about the trade-offs in software. It used to be that if you had to hit disk, your program screeched to a halt; nowadays it's not so bad, because you can amortize the cost of disk access. So DuckDB is modern, built with modern techniques, built really well to simply scale up, by some really brilliant people, and that made it very fast.
35:32 I think one of the other things they did is avoid the trap of premature optimization. One of the things people talk about for analytical databases is vectorized execution (not to be confused with vector databases; those are something different). In vectorized execution you take a whole vector of inputs and process them all at once, which is very nice for how computers use caches, and it turns out to be very efficient to write code that way. But typically people get so excited about it that they say, OK, I'm going to write special machine instructions that do this fast. We did this in BigQuery, ClickHouse does it; a lot of these engines build hand-coded vectorized execution engines, because there are CPU instructions that specifically deal with lots of values at once. DuckDB wrote it in such a way that they let the compiler do that optimization for them. They probably leave a little bit of performance on the table, but it means you don't have to maintain all this crufty, complex, nasty, hairy assembly code; you just have these careful, elegantly written operators. It also meant that when Apple silicon came out, it took them something like two hours to make it work on the new Macs, versus having to redo all that complex hand-coding. So there are a bunch of reasons why DuckDB is fast. MotherDuck is fast because we let DuckDB do its thing.
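The batch-at-a-time idea can be sketched even in Python, though the real payoff comes in compiled code, where that tight inner loop is exactly what a compiler can auto-vectorize into SIMD. A toy contrast with invented function names:

```python
# Toy contrast between tuple-at-a-time and vectorized (batch-at-a-time)
# execution. In a compiled engine, the vectorized inner loop is what the
# compiler can auto-vectorize; the function names here are invented.

def filter_tuple_at_a_time(values, threshold):
    out = []
    for v in values:            # operator overhead paid once per value
        if v > threshold:
            out.append(v)
    return out

def filter_vectorized(batches, threshold):
    out = []
    for batch in batches:       # operator overhead paid once per batch
        # tight loop over a contiguous vector: cache- and SIMD-friendly
        out.extend(v for v in batch if v > threshold)
    return out

values = list(range(10))
batches = [values[i:i + 4] for i in range(0, len(values), 4)]
assert filter_tuple_at_a_time(values, 5) == filter_vectorized(batches, 5) == [6, 7, 8, 9]
```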
37:21 We also always have a DuckDB running on your client. That could be in the web browser (there's a DuckDB running in your web browser as well as on the server), or, if you're running a Python process or writing some code that connects to MotherDuck, there's a DuckDB inside that process. In that DuckDB we can cache data, so queries that can hit that cache don't have to hit the server at all. We have some UI apps we've built that start out running queries against the server, but then they build up this cache, and as you navigate the UI, everything happens locally. This is how you get 60-frames-per-second visualization speeds that are literally impossible in a more traditional architecture: if I'm talking to a data center on the other side of the country, that's 100 milliseconds, so the best I could possibly do, even if the query itself were infinitely fast, is 10 frames per second. You end up limited by the cloud architecture. By pushing work down to the client, the only limits are local execution speed, and local execution is much faster than it used to be, and getting faster. The other thing is that more and more people have fiber to the home or fiber to the workplace, so pulling some of these datasets down, which used to be prohibitive, is no longer a problem.
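The arithmetic and the cache behavior go together. A toy client cache, with names invented for illustration (this is not MotherDuck's client API):

```python
# Toy client-side cache plus the frame-rate arithmetic from above.
# A 100 ms round trip caps a remote-only UI at 1000 / 100 = 10 frames
# per second even if the query itself is free; a cache hit is local.
# All names are invented for illustration, not MotherDuck's client API.

ROUND_TRIP_MS = 100.0
max_fps_remote = 1000.0 / ROUND_TRIP_MS   # 10.0 frames per second

cache = {}
server_calls = []

def fake_server(sql):
    server_calls.append(sql)              # pretend this costs 100 ms
    return f"result of {sql}"

def query(sql):
    if sql not in cache:                  # miss: pay the round trip once
        cache[sql] = fake_server(sql)
    return cache[sql]                     # hit: answered locally

query("SELECT region, sum(amount) FROM orders GROUP BY region")
query("SELECT region, sum(amount) FROM orders GROUP BY region")
assert max_fps_remote == 10.0 and len(server_calls) == 1
```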
39:06 You can do it in seconds, or less.
Maybe to close on product: what are MotherDuck and DuckDB not good at yet, and what's on the roadmap?
Scale is certainly one limitation. If you're going to push past working sets of 10 terabytes or larger, MotherDuck doesn't work well yet. We're working with larger machine sizes and on things that will scale better as you get bigger, but that's certainly something that can come up.
You mentioned vector databases, so that's another inevitable question: where does MotherDuck fit in the nascent AI infrastructure
40:01 stack? I saw on the Small Data SF website that you're talking about small data but also small models; maybe walk us through that.
There's a real analog between pushing data workloads down to the client and being able to push models down to the client. Everybody who's worked on AI things has probably been frustrated: you type something into ChatGPT and you sit and wait and twiddle your thumbs. Yes, it's incredibly powerful, and you get these magical responses back, and streaming the results makes it feel faster, but it's actually quite slow. If you can do a lot of the work, certain types of questions or certain pieces of questions, via a local model, you can get much more interactive AI applications. Especially if you're doing RAG, retrieval-augmented generation, where you're pulling data from somewhere: you're basically using a database to store state, and then using an LLM to understand something about the world. If you do it that way, you can actually use smaller models, because they don't have to encode all the state themselves. So you can do local RAG, where you're doing local lookups and local inference, and then perhaps for the harder things you call out to the server: you start with an answer and then refine it with something remote.
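That local-first pattern might look roughly like this. A toy sketch only: every function here is a stand-in for a real retriever or model, and the confidence heuristic is invented for illustration:

```python
# Toy local-first RAG: retrieve context from a local store, try a small
# local model, and only call out to a remote model when the local one
# isn't confident. Every function here is a stand-in, not a real API.

local_docs = {"ducks": "Ducks are waterfowl in the family Anatidae."}

def retrieve(question):
    """Local lookup: any stored document whose key appears in the question."""
    return [text for key, text in local_docs.items() if key in question.lower()]

def answer(question, local_model, remote_model, min_confidence=0.8):
    context = retrieve(question)
    text, confidence = local_model(question, context)
    if confidence >= min_confidence:
        return text                         # answered with no round trip
    return remote_model(question, context)  # refine remotely if needed

# Stand-ins: the small model is confident only when it found context.
small = lambda q, ctx: (ctx[0], 0.9) if ctx else ("don't know", 0.1)
big = lambda q, ctx: "remote answer"

assert answer("what are ducks?", small, big).startswith("Ducks")
assert answer("what is a quasar?", small, big) == "remote answer"
```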
42:01 I see some really nice parallels between the architectural things we're doing in MotherDuck and things like Ollama, where you can operate on models locally. And there are really interesting things coming out with hardware, and with computing environments that let you get access to the GPU and do more AI work on your local client.
I saw that you seem to be good friends and partners with George at Fivetran, and I would have assumed that small data is not a good thing for the Fivetrans of the world, not to condemn them, but for any company in that stack,
42:51 because don't you need a lot of data, and a lot of complexity, for this modern data stack, this suite of vendors, to thrive?
It's interesting you mention George and Fivetran, because he was a speaker at our Small Data SF conference. He and I did a town-hall conversation, and one of the things he said was, yeah, it's shocking how little data people use. Generally, if people are pushing a lot of data through Fivetran, it's because they're doing something really inefficient: basically re-copying all their data every day. Typically, the dataset sizes they see are much smaller
43:36 than people would expect. But I do think the ideas behind the modern data stack are important. You have, I would say, three, maybe four pieces: data ingestion, your query engine, your visualization layer, and then maybe the fourth would be the orchestrator. Fivetran would be ingestion, Snowflake the query engine, Looker the visualization layer, and dbt the orchestration. Obviously you can swap out those pieces, and there are lots of competitors in those spaces, but I think the world is still arranged in those buckets, so I think
44:31 that is still valid. MotherDuck is playing in the "hey, we have a great query engine" space. You can hook us up to your BI tool (we connect to Tableau, Power BI, Omni, Preset, etc.), you can hook us up to your ingestion engine, whether it's Fivetran or Airbyte, and we work great with dbt. So we're playing well in the ecosystem.
Fast-forward five years from now: does all that still look the same?
45:04 Does the advent of things like these open data formats start to change how some of that work gets done? Does it impact you at all, in a world of Tabular (post its acquisition by Databricks) and the rise of precisely those open data formats? Does that change anything for you?
It does sort of open up some doors for us. We're hearing from a lot of people who are using Snowflake and moving their data into, usually, Iceberg, as part of a
45:47 cost-reduction effort, a way to avoid lock-in, a way to have more flexible access to their data. And that's great news for us, because if the data is locked in Snowflake and we want somebody to try MotherDuck, we have to convince them to export the data, or write it to two places and keep multiple copies; it's a mess, it's a migration. If the data is in Iceberg, we have the same access to it that Snowflake does. So I think that's going to be hard for the incumbents, and a net benefit to the people coming in with new tools and new ways of doing things. It's going to put pressure on margins, which again tends to favor the people coming in afterwards, especially if they have a simpler architecture and can deliver things less expensively.
There's batch processing, which has one stack; there's real-time processing, which has a different stack; and then there's big data and small data, and, well, what do I do? It feels like things are getting more
47:04 complex rather than less, but is it part of the small-data message that things are actually getting simpler, and you just need to get rid of some of the pieces of the past?
I think one of the things with small data is that, because the architecture is simpler, we can focus more on building better experiences. So yes, there might be a plurality of tools involved, or a plethora of tools involved, but if those tools are simpler to use, the net cognitive load can go down. I'll give an example: a lot of data is in CSV files, and
47:57 as much as that makes your head explode as a database person (why would you put all this in a CSV file?), it's just the way the world works: it's simple, it's easy to write a CSV parser, it's easy to write CSV. But there's so much broken CSV out there, because it's actually really hard to write a totally unambiguous, correct CSV file, and everybody does it a little differently. Different null characters; is two empty quotes a null, or just an empty string? There are all sorts of weird things that happen. One of the things DuckDB did is say: OK, we're going to really solve this problem; we're going to make it so you can just do SELECT * FROM 'file.csv' and it will do the right thing. They wrote a research paper on it, they put a full-time PhD on it, it keeps getting better, and they're writing more research papers on it. Versus other database companies: when I was at BigQuery, we put a college new grad on CSV, they worked on it for three months, and we said, great, you shipped it, go work on something else. So there are all sorts of corner cases where things don't work, and then we'd basically say, that's not our problem; our problem starts once you're in the database, and it's your problem to go fix your CSV. If you think about it from the perspective of somebody trying to get work done, how much time do you spend wrangling broken CSV files? This isn't working, that isn't working. DuckDB can often make that just magical: it just works, you don't have to think about it. That's a small example.
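A small taste of the ambiguity, using Python's standard csv module: an empty quoted field and a missing field come back indistinguishable, which is exactly the null-versus-empty-string question a serious CSV reader has to resolve.

```python
# Why CSV is hard: with the stdlib csv module, an explicitly quoted
# empty field ("") and a missing field parse identically, so the caller
# cannot tell an empty string from a null. Careful readers (DuckDB's,
# for instance) have to sniff dialects and null conventions to cope.
import csv
import io

raw = 'id,name\n1,""\n2,\n'   # row 1: explicitly empty; row 2: null?
rows = list(csv.reader(io.StringIO(raw)))
assert rows[1] == ["1", ""]   # quoted empty string...
assert rows[2] == ["2", ""]   # ...and missing value look the same
```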
49:55 But it's a way that, even if the number of tools is maybe increasing, we can hopefully still make life simpler.
To close, I'd love to go in a completely different direction and talk about your entrepreneurial experience. You mentioned you hadn't started a company before, and you were most recently Chief Product Officer at SingleStore. What was the transition like? For any technical person out there who may be thinking about starting a company: what was surprising, in a bad way but also in a good way?
I always figured that there are people out there who are going to start companies, and then there are normal people, and I was one of the normal people. I never sold computers out of my dorm room or had a lawn-mowing business employing my siblings. But I just
50:55 sort of fell into this. And I think one of the lessons is that you might not think you're an entrepreneur, but you too can do it. There have been a number of surprises along the way, but one is that there's no one right way to do it. I kept expecting somebody brilliant, Lloyd for example, to be able to tell me the answers: how should I think about this, how should I do this, how much money should I raise, who should I raise it from, how should I hire my first engineers or my first salesperson? You can listen to people who have been really successful, and that's one of the things that I love: founders tend to be so giving with their time to other founders. Otherwise I don't think I could have figured anything out, because I had literally no idea what I was doing when I got started. But you listen to them, and this person did X and tells you, do exactly what I did, of course; and then another person, who's equally successful, did exactly the opposite. And you're like, wait, how can both of these things be? Then you realize: that was the right thing for that person, and this was the right thing for this other person, and I have to figure out what's right for my company, what's right for MotherDuck. That puts more pressure on your shoulders, but it also makes you realize that there's not just one way. And one of the
52:34 things that has been suggested is that you have to rely on your intuition, on what feels right and seems right for your company. An example from the early stages: I asked one founder, how much should I raise? He said, raise as little as possible, because you can always raise more; if you raised from somebody good, they're not going to let you run out of money. Then I asked somebody else, and they said, raise as much as you can, because the only time you're ever at risk as a founder is when you're running out of money; if you control the board and you have plenty of money, you don't have to listen to anybody else. And both of those people are right, maybe more so in some cases and less in others. If you raise too much money, there are certainly problems you can get into; if you raise too little, there are certainly problems you can get into. This was in 2022, the tail end of the ZIRP era, and then people started finding out that maybe you can't always just hold your hand out and get more. So there were positives and negatives on both sides, and you just have to figure out, OK, what's my decision framework,
what works for me, and what works for my company.
What about the commercial side of things? Coming from a very technical background, how did you learn there? What was surprising? Again, with the caveat mentioned up front that you seem to be doing an extraordinary job at marketing and marketing positioning.
So I was an engineer for twenty years, and I bounced back and forth between engineer and engineering manager. I worked on a couple of projects that I thought were beautiful, that I was just so proud of, that are now dead, because the commercial value wasn't there: the company killed them, or the company is no longer around. The lesson that taught me was that you've got to understand customers, you've got to understand the market. I was part of the team that helped start Google BigQuery, and at one point I ended up moving into product. Part of the reason I did that is that I was being asked to help hire the director of product for BigQuery, and I kept interviewing people and thinking, wow, this person really doesn't get what's special about BigQuery; it would be terrible if that person led the product. And then I thought, well, maybe I could do it, which felt totally weird, going from engineering to product. But it opened my eyes in so many ways. In engineering you're designing with a certain palette: the data structures you can move around, the systems you're building that connect in certain ways. On the product side it's the same idea, except your palette is customers and pricing and packaging and your marketing team and your go-to-market, and you have to design something coherent out of all of it. If you don't make it coherent, it doesn't work. You don't get the same positive feedback every day that you get when you submit code and see it running. On the other hand, I think you can build something more real and more enduring if you take a product focus. So being the product manager for BigQuery gave me exposure to some of these other parts of the world
that I don't think I would have had otherwise. And then I jumped into the deep end at SingleStore as Chief Product Officer. Being in the C-suite, I got to be part of a lot of discussions with the CRO and the CMO and the CEO and the CFO, and to hear how they were thinking about fundraising and sales comp plans and marketing. I realized that marketing is war. When you're at Google, you think, oh, you don't need to do marketing; who needs marketing? But outside of Google, nobody cares about what you're doing. You have to win their eyeballs; you have to convince them to care, and you have to give them a good reason to do so.
Anyway, it was a really good introduction to what a successful, much later-stage startup company looks like, with 100 million in ARR, and all of its pieces, and from there you can extrapolate back down. It feels like a whole different thing when you're just starting a company and it's just you and a couple of people and you don't even have a GitHub repository. But over time you can start to see the lineage that gets you to that later stage, and then you look at actually successful public companies and you see what gets you there. So I highly recommend being part of a big tech company, because you learn how to do things right; but I also recommend being part of a startup, because you learn how to get things done quickly, and you get exposure to many more pieces of the puzzle than you
would otherwise.
Thank you so much for sharing the story and the lessons; we appreciate you being here today.
Thanks, I appreciated the conversation.
Hi, it's Matt Turck again. Thanks for listening to this episode of The MAD Podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing, if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening from. This really helps us build the podcast and get great guests. Thanks, and see you at the next episode.
FAQs
Why does MotherDuck's CEO say big data is dead?
Jordan Tigani, MotherDuck's CEO and former Google BigQuery team member, argues that while companies may store large volumes of data, the data actually touched by analytics queries is typically small, often just the last week or month. He observed that even Google's largest BigQuery customers were not running queries anywhere near the 100-terabyte benchmarks that Databricks and Snowflake were competing on. Modern laptops with hundreds of cores and terabytes of RAM can handle most real analytics workloads without the complexity of distributed systems.
How is MotherDuck different from Snowflake or BigQuery?
MotherDuck achieves single-digit millisecond query latency compared to BigQuery's roughly 400ms minimum overhead, because it skips the coordination tax of distributed systems. It uses a dual execution model where a DuckDB instance runs on the client (browser or local process), caching data locally so queries can execute without hitting the server at all. This makes 60 frames-per-second data visualization possible, something that is physically impossible with traditional cloud architectures limited by network round-trip times.
What is DuckDB and how does it relate to MotherDuck?
DuckDB is an open-source, in-process OLAP database created by Hannes Mühleisen and Mark Raasveldt at the CWI research institute in Amsterdam. It runs as a library with no dependencies. You simply import duckdb in Python and can query data immediately, even against Python objects. MotherDuck takes DuckDB to the cloud as a managed data warehouse, adding collaboration, data sharing, and hybrid local/cloud query execution while keeping DuckDB's simplicity.
How does the small data approach make analytics cheaper and faster?
Distributed systems pay a double tax: latency from coordinating thousands of machines, and cost from the hardware overhead. Jordan estimates BigQuery is roughly 40x less efficient per core than optimized single-node systems. By building on DuckDB's simpler architecture, MotherDuck can deliver a data warehouse starting at $25/month. The simpler codebase also means faster improvement: DuckDB ships new optimizations from academic research within weeks, while complex distributed systems take years to implement the same features.
Can MotherDuck and DuckDB handle terabytes of data?
Yes, MotherDuck handles multi-terabyte data warehouses in production. The "big data is dead" argument is not that large datasets do not exist, but that analytics queries typically only touch a small hot subset of data. With separation of storage and compute, historical data sits cheaply on object storage while only the actively queried data needs fast processing. MotherDuck also supports local AI inference with smaller models for interactive use cases, complementing cloud-based processing for heavier workloads.
Related Videos
2025-10-31
Beyond the Benchmarks: A BigQuery Co-Founder's Guide to Evaluating Data Warehouse Performance
Big Data is dead. Learn to evaluate data warehouse performance via Time-to-Insight and real costs, ignoring misleading petabyte-scale vendor benchmarks.
Stream
Interview

60:00
2025-10-23
Can DuckDB replace your data stack?
MotherDuck co-founder Ryan Boyd joins the Super Data Brothers show to talk about all things DuckDB, MotherDuck, AI agents/LLMs, hypertenancy and more.
YouTube
BI & Visualization
AI, ML and LLMs
Interview

50:21
2024-10-09
Big data is dead, analytics is alive.
Till and Adithya from MotherDuck discuss DuckDB’s impact on analytics and AI, showcasing its fast, versatile in-process SQL and AI features like text-to-SQL, vector search, and query correction—powerful analytics even on your laptop.
AI, ML and LLMs
Interview


