Interview · YouTube

What's New in Data: Small Data, Big Impact

2024/09/19 · Featuring: Jacob Matson

Editor's note: What follows is an AI-generated summary of the video transcript

In an era where data is dubbed the new oil, navigating the complex and ever-evolving landscape of data management and analysis can be a daunting endeavor. Yet, amidst these challenges lies an untold story of adaptation and innovation, exemplified by the career trajectory of Jacob Matson. Once a hands-on operator renowned for his mastery of SQL Server, dbt, and Excel, Matson has taken a bold leap into the realm of Developer Advocacy with MotherDuck. This transition is not merely a career shift but a testament to the transformative power of DuckDB in addressing intricate data problems. Through Matson's journey, we unravel the significance of DuckDB, a tool heralded for its adaptability across both local and cloud environments, showcasing its potential to redefine our approach to data analysis. As we delve into the reasons behind Matson's move and the broader industry trend towards roles that demand adaptability and a thirst for continuous learning, we set the stage for a deeper exploration of DuckDB's impact on the data landscape. Are you ready to explore how DuckDB and the transition of one individual can signal shifts in the broader technological ecosystem?

Introduction to Jacob Matson and MotherDuck

Jacob Matson's career evolution from an experienced operator in data management tools like SQL Server, dbt, and Excel to a Developer Advocate at MotherDuck marks a significant shift in the data technology landscape. This transition is not just a personal career move but a reflection of a broader industry trend where professionals increasingly align their careers with emerging technologies that promise to solve complex data problems more effectively. Matson's pivot towards MotherDuck, a company at the forefront of enhancing DuckDB's capabilities, underscores his dedication to addressing specific data challenges that modern organizations face.

DuckDB emerges as a critical player in this narrative, offering unique solutions that stand out for their adaptability in both local and cloud environments. The tool's design and functionality cater to a growing need for flexible, efficient, and scalable data analysis tools. Matson's journey epitomizes the growing importance of adaptability and continuous learning in the tech industry, highlighting the necessity for professionals to evolve alongside technological advancements.

As we explore Matson's transition and the reasons behind his move to MotherDuck, we gain insights into the significance of DuckDB in revolutionizing data management and analysis. This evolution from operator to Developer Advocate not only reflects Matson's career growth but also serves as a catalyst for a broader conversation about the technological advancements shaping the future of data analytics.

DuckDB: Revolutionizing Data Analysis - Understanding the Core of DuckDB's Design and Functionality

At the heart of DuckDB lies an architecture that sets it apart from conventional databases. Often likened to SQLite for its simplicity and ease of use in analytics, DuckDB's design philosophy hinges on an embeddable, in-process database model. This foundational choice simplifies a myriad of traditional database concerns, effectively removing the burden of complex setup and management overhead from developers and analysts alike. Unlike systems that require dedicated servers or complex configurations, DuckDB integrates directly into applications, offering a seamless data processing experience.
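
To make the in-process idea concrete, here is a minimal sketch in Python (file and table names are hypothetical): DuckDB runs inside the host process itself, so there is no server to provision and nothing to connect to beyond a local file path.

```python
import duckdb

# Connecting starts DuckDB inside this Python process; the optional
# file path only controls where the database is persisted.
con = duckdb.connect("analytics.duckdb")

# Query a local CSV directly; no separate load or ETL step is required.
con.sql("""
    CREATE TABLE orders AS
    SELECT * FROM read_csv_auto('orders.csv')
""")

print(con.sql("SELECT count(*) AS order_count FROM orders").fetchone())
```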

One of DuckDB's standout features is its design tailored for multi-core machines. In an era where computing hardware is no longer constrained by single-core limitations but instead boasts multiple cores, DuckDB's architecture leverages this shift to its advantage. The database is built from the ground up to maximize the computing power of modern hardware. By efficiently distributing workloads across all available cores, DuckDB ensures that analytical queries are processed at lightning speed, making it an ideal choice for data-intensive applications.
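
As a rough sketch of that behavior (file name hypothetical), DuckDB parallelizes scans and aggregations across cores automatically; the worker-thread count can also be set explicitly when you want to cap it.

```python
import duckdb

con = duckdb.connect()

# DuckDB uses all available cores by default; this caps the pool.
con.execute("SET threads TO 8")

# A scan-heavy aggregation like this is split into chunks that are
# processed on separate threads and merged into the final result.
con.sql("""
    SELECT carrier, avg(dep_delay) AS avg_delay
    FROM read_parquet('flights.parquet')
    GROUP BY carrier
    ORDER BY avg_delay DESC
""").show()
```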

The simplified security model of DuckDB offers a fresh perspective on data management. In traditional databases, securing data involves complex role-based access controls and various levels of permissions. DuckDB's approach simplifies this by focusing on local data processing. This model assumes that if you have access to the machine running DuckDB, you are authorized to access the data within. While this may seem unconventional, it streamlines data access for analytical workloads, where the primary concern is processing efficiency rather than multi-user access control.

A critical aspect of DuckDB that reassures users of its reliability is its compliance with ACID properties. Despite its streamlined security model and focus on local processing, DuckDB does not compromise on transaction reliability. By adhering to the principles of Atomicity, Consistency, Isolation, and Durability, DuckDB ensures that even in a simplified environment, data integrity and transactional reliability are upheld. This makes DuckDB a dependable choice for applications that demand both analytical performance and data consistency.
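
In practice that means ordinary transactional semantics are available even in this embedded setting; a small sketch (table and values hypothetical):

```python
import duckdb

con = duckdb.connect("ledger.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS balances (account VARCHAR, amount DECIMAL(18, 2))")

try:
    con.execute("BEGIN TRANSACTION")
    con.execute("INSERT INTO balances VALUES ('alice', -100.00)")
    con.execute("INSERT INTO balances VALUES ('bob', 100.00)")
    con.execute("COMMIT")    # both rows become visible atomically
except Exception:
    con.execute("ROLLBACK")  # on any failure, neither row is persisted
    raise
```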

Looking towards the future, the strategic implications of DuckDB's design are profound for the landscape of analytical workloads. By prioritizing efficiency, simplicity, and the effective use of modern hardware, DuckDB presents a compelling alternative to more cumbersome, traditional analytical databases. Its emphasis on leveraging local resources for data analysis not only reduces dependency on cloud-based solutions but also offers a cost-effective and high-performance option for data analysts and scientists. As the volume of data continues to grow exponentially, the ability to perform fast, reliable analytics on local machines becomes increasingly valuable. DuckDB, with its innovative design and strategic advantages, positions itself as a key player in the future of data analysis, challenging the status quo and paving the way for a new era of data-driven decision-making.

Hybrid Execution and Cloud Integration with MotherDuck - Advancing DuckDB's Capabilities into Cloud Services

MotherDuck ushers in a new era of data processing with its innovative approach to hybrid execution, effectively bridging the gap between local and cloud environments. This model enables queries to intelligently assess and execute across the most suitable environment, leveraging both the power of local processing units and the scalability of cloud resources. The seamless integration of DuckDB with cloud services through MotherDuck represents a significant leap forward, making data analytics more flexible and efficient.
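
A hedged sketch of what this looks like from the client side, assuming a MotherDuck account and a cloud-resident table (the database, table, and file names below are hypothetical): the `md:` prefix addresses MotherDuck, and the planner decides where each part of the query runs.

```python
import duckdb

# The md: prefix opens a session that spans the local DuckDB process and
# MotherDuck (the auth token is typically read from the environment).
con = duckdb.connect("md:my_database")

# One query can mix a local Parquet file with a table living in the cloud;
# hybrid execution pushes each piece of work to the side that holds the data.
con.sql("""
    SELECT s.region, sum(e.amount) AS total
    FROM read_parquet('local_events.parquet') AS e
    JOIN cloud_sales AS s USING (customer_id)
    GROUP BY s.region
""").show()
```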

Addressing the challenges of adapting DuckDB for cloud environments, MotherDuck has had to innovate extensively, particularly in areas such as security model creation and resource allocation. These challenges stem from DuckDB's original design for local execution, which did not account for the complexities of cloud-based data management. MotherDuck's solution involves the creation of robust security frameworks and intelligent resource allocation algorithms to ensure that DuckDB's transition to the cloud does not compromise its performance or the security of the data being processed.

A cornerstone of MotherDuck's cloud strategy is its innovative tenancy model. Each user receives isolated compute resources, affectionately dubbed 'ducklings,' which ensure that queries from one user do not interfere with another's. This model not only optimizes performance by preventing resource contention but also enhances security by isolating users' computational processes. Such isolation is critical in a cloud environment, where multiple users often share underlying infrastructure.

WebAssembly (WASM) plays a pivotal role in MotherDuck's strategy, enabling DuckDB to run directly in browsers. This capability opens up new avenues for data interaction and visualization, allowing users to perform complex data analysis without the need for server round trips. The use of WASM significantly enhances the user experience by reducing latency and making it possible to leverage DuckDB's powerful analytics capabilities in web applications, dashboards, and interactive tools.

The broader implications of MotherDuck's enhancements on DuckDB are profound, particularly in terms of making advanced data analysis more accessible and cost-effective. By extending DuckDB's capabilities into the cloud while retaining its efficiency and simplicity, MotherDuck democratizes data analytics. Small to medium-sized enterprises, individual researchers, and educational institutions stand to benefit immensely from this development, as it lowers the barriers to entry for sophisticated data analysis.

MotherDuck's contributions to DuckDB highlight a future where data analytics is not bound by the constraints of hardware or the complexities of cloud integration. This vision aligns with the evolving needs of the data industry, prioritizing accessibility, efficiency, and the democratization of data tools. As DuckDB continues to gain traction across various sectors, MotherDuck's innovations ensure that its journey into the cloud is both impactful and aligned with the needs of a diverse user base.

Real-World Applications and Future Directions - Leveraging DuckDB and MotherDuck in Operational Workloads

In the realm of data analytics, DuckDB and MotherDuck are emerging as game-changers, particularly in how they are applied to solve real-world data problems. From simplifying intricate data extraction processes to facilitating efficient local development workflows, these technologies are proving their worth across various industries. For instance, companies are leveraging DuckDB for rapid, on-the-fly analysis of large datasets without the overhead of moving data to a conventional data warehouse. This capability is invaluable for businesses that require immediate insights from their data, such as real-time financial analysis or just-in-time inventory management.
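
As one hedged example of that pattern (bucket, path, and column names hypothetical), DuckDB's httpfs extension lets an analyst scan Parquet files sitting in object storage without first loading them into a warehouse:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs")

# Credentials are assumed to come from the usual AWS environment config;
# the query itself reads the files where they already live.
con.sql("""
    SELECT order_date, sum(revenue) AS revenue
    FROM read_parquet('s3://example-bucket/sales/*.parquet')
    GROUP BY order_date
    ORDER BY order_date
""").show()
```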

MotherDuck's partnership with other cutting-edge technologies like Hydra for Postgres integration further amplifies DuckDB's utility. This collaboration enables seamless data movement between operational databases and analytical workloads, allowing DuckDB to complement existing data management systems rather than replace them. Such integrations highlight DuckDB's flexibility and its potential to enhance the data infrastructure of organizations without necessitating a complete overhaul of their existing setups.
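
pg_duckdb itself runs DuckDB inside the Postgres server; as a complementary illustration of bridging operational and analytical engines (connection details and table names below are hypothetical), DuckDB's own postgres extension can attach a live Postgres database and scan its tables directly:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres")

# Attach an operational Postgres database read-only; analytical scans then
# run in DuckDB's engine instead of competing with transactional traffic.
con.execute("""
    ATTACH 'dbname=shop host=localhost user=analyst' AS pg (TYPE postgres, READ_ONLY)
""")

con.sql("""
    SELECT status, count(*) AS orders
    FROM pg.public.orders
    GROUP BY status
""").show()
```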

The democratization of data analytics is perhaps one of the most significant contributions of DuckDB and MotherDuck. By making powerful data analysis tools accessible to companies and individuals without requiring extensive infrastructure, these technologies level the playing field. Small startups, independent researchers, and educational institutions can now harness the same analytical power that was once the exclusive domain of large corporations with deep pockets.

Looking to the future, the evolving data landscape appears ripe for DuckDB and MotherDuck to make an even more significant impact. Speculations about new features, integrations, and the potential influence on big data and cloud computing paradigms are abundant. Possible advancements include enhanced machine learning capabilities directly within DuckDB, tighter integration with cloud storage solutions for seamless data access, and expanded support for complex data types to cater to a broader range of analytical needs.

For data professionals and organizations contemplating the adoption of DuckDB and MotherDuck within their data stacks, the message is clear: stay adaptable. The technological environment, especially in the data sector, is in constant flux. Tools and platforms that offer flexibility, efficiency, and the ability to integrate with existing systems while preparing for future demands are invaluable. DuckDB and MotherDuck epitomize these qualities, promising a robust foundation for data analytics now and in the years to come.

Small Data SF Conference - Spotlighting the Small Data Movement

The tech industry's relentless pursuit of bigger data sets has overshadowed a powerful undercurrent: the small data movement. It's a shift that's gaining momentum, and nowhere is this more evident than at the upcoming Small Data SF conference. This gathering is set to illuminate the potential of small data and DuckDB's technology in solving complex problems that don't necessarily require vast data lakes to navigate. Here's what participants can look forward to:

  • Practical AI Applications and Data Analytics: The conference will shed light on how small data powers AI applications in ways previously dominated by big data paradigms. Attendees will explore methodologies for extracting meaningful insights from smaller, more manageable datasets, showcasing that quality often trumps quantity in data analysis.

  • A Rich Tapestry of Speakers and Topics: The diversity of speakers lined up for Small Data SF is a testament to the wide-ranging impact of small data across industries. From healthcare to retail, finance to entertainment, experts will share how DuckDB's technology has revolutionized their approach to data analysis, often simplifying processes and reducing costs without compromising on analytical depth or accuracy.

  • Challenging the Big Data Paradigm: The core mission of Small Data SF is to question the inevitability of big data as the sole solution for technological advancement. By presenting scalable, efficient alternatives for data analysis, the conference aims to broaden the industry's perspective, showcasing that small data can often fulfill the same needs as big data, but with greater agility and less overhead.

  • Networking with Leading Experts: Beyond the educational opportunities, Small Data SF represents a prime networking venue. Attendees will rub elbows with some of the brightest minds in data science, AI, and technology innovation. It's a chance to form collaborations, exchange ideas, and perhaps even lay the groundwork for future breakthroughs in the field.

  • A Call to Reconsider Data Strategies: Perhaps most importantly, Small Data SF encourages participants to reassess their own data strategies. Whether you're a startup founder, a data analyst, or a product manager, the insights garnered from the conference could inspire a shift towards more efficient, scalable solutions for data analysis within your own projects or organizations.

As the conference approaches, it's clear that Small Data SF is not just an event; it's a burgeoning movement. It challenges the status quo, offering a fresh perspective on how we collect, analyze, and leverage data. In an era where the size of your data set has been seen as a measure of potential, Small Data SF stands as a beacon for those who believe in the power of precision, efficiency, and accessibility in data analytics. This conference is poised to redefine what success looks like in the tech industry, proving that when it comes to data, bigger isn't always better.

0:00 Hi everybody, thank you for tuning in to What's New in Data. In this episode we have Jacob Matson, a well-known SQL Server, dbt, and Excel practitioner who has joined MotherDuck as a Developer Advocate. Let's get right into it. Jacob, how are you doing today? Hey, I'm doing great, thanks for having me on. Yeah Jacob, this is your second time

0:23 on the pod. Last time we spoke you were

0:27 in a very cool hands-on operator role. You're very

0:32 well known on Twitter and LinkedIn as being one of the leading experts on being operational with things like SQL Server, dbt, Excel, ERPs, CRMs, you name it,

0:48 and now you're in a new role at MotherDuck. Tell us why, what drew you in that direction? Yeah, I think a lot of it came

0:59 down to the shape of the problems that I was facing, and I kept just going back to the well for DuckDB, so it kind of turned into

1:11 a thing, where it was like, oh, I can use DuckDB for this. Like, I have a problem that is in the shape of data that I can run on my local machine, but

1:22 also not so big that I need to use something like Snowflake or Spark. And so that just kept happening, and then the timing just worked out from a standpoint of what they were looking for and what I was looking for, and I'm here, and I get to share about all the really cool stuff that MotherDuck is

1:44 enabling with the dual execution

1:49 model and running DuckDB both locally and in the cloud. Super exciting stuff. I'm 100% with you, DuckDB seems

1:59 like it's foundational technology for the next generation of analytical and all types of workloads, honestly. We're going to dive into that in this podcast, and it's also really exciting to learn about the things MotherDuck is doing to support it. But before we get into that, I want to

2:18 understand, and help the listeners understand as well, why DuckDB exists and why it's special. Can you explain just what DuckDB is and how it works? Yeah, sure. So I think about DuckDB as SQLite for analytics. So it is

2:39 an embeddable, kind of in-process database. So that means a lot of the things that you need to worry about when running a database in a distributed system, or multi-tenancy, you can forget about. You don't need to worry about security models, for example; it's all running locally. When you're running DuckDB it's in process, it has access to what it has

3:04 access to, and there's no such thing as row-level security or role-based access control or anything like that. So what that means is, broadly, it's a simpler way to kind of reason about the data model, because you're not thinking about things like, you're really concerned in Postgres about MVCC, right, multi-version concurrency control. There's one person reading a DuckDB database, so who

3:31 cares. Those types of things are a little bit different. Obviously there is the ability to write and do updates, and it is ACID compliant, so there are some controls there, but it's not the same level as something like Postgres. The other thing is it's built for multi-core machines from the ground up. So one of the really cool things about

3:55 what Hannes and the team at DuckDB Labs have built with DuckDB is they recognized that the next way that

4:06 computers were scaling out from the compute standpoint is adding more cores, and that's different than the world, let's say pre-2010, where we would maybe have two

4:18 cores or four cores on our machines, and when you're sending a Spark job to the cloud or to your cluster at that time, it's analogous to one core. Now we have machines, the machine in front of me right now has 14 cores, right? So if you just think about the math over the last 10

4:41 years, the math is that, well, now you have 14 times as much power as you did 10 years ago, assuming that your clock speed stayed the same, and it didn't, of course, it's faster too. So we have that happening. Meanwhile, it requires a different way of thinking about your software to build it to run in multiple

5:03 threads. That's something, for example, something like R does really well, is run in a vectorized way. DuckDB has done the same, right? So when it's running and executing queries it's breaking them into parts that can be executed in parallel and then combined back together, and because of that you get all of the advantages of maxing out all of

5:22 your local cores, because it's vectorized, and that is something that's unique to DuckDB and is an intentional design feature, unlike transaction engines, right, where they're typically thread-bound. That doesn't mean they're always thread-bound, but what that means is that basically when you write a query it's running on a single thread. DuckDB

5:43 says, since I'm in process I can use all the cores available to me, I'm going to just take all that data, suck it up, and just jam it through. So that's what's unique about what DuckDB is doing, is really that it's built from the ground up to execute on machines with 10-plus cores, and that

6:07 architectural advantage has served it really well from an analytical workload perspective. Absolutely, and this is what we were talking about, the fact that DuckDB is so

6:22 optimized for the type of hardware and the metal that most people have now in their laptops, or even very easy to deploy cloud instances. It has so many built-in advantages, and the best part is it's so practical. I've used DuckDB myself and

6:43 you were able to bake it into your operational workloads. Sure, so all those things, and I think MotherDuck brands it as multiplayer analytics, right? Yeah, so

6:56 yeah, and I can talk a little bit about what the difference is between MotherDuck and DuckDB, if that's where you want to go. Yeah, let's get into that, because I know MotherDuck, uh, Jordan Tigani and team are doing some excellent work there to make it more scalable and better adopted by

7:17 a community of people who can get value from it, so I'd love to hear what MotherDuck does. Yeah, sure. So we are a service provider sitting on top, basically operationalizing DuckDB for the cloud. So DuckDB is built and maintained by DuckDB Labs out of the Netherlands, and there's an open source foundation that DuckDB is a part of.

7:42 MotherDuck obviously is a close partner with them, but DuckDB is not our project. DuckDB lives in the foundation and is stewarded by Hannes and Mark and others. So what we're doing is we're saying, hey, there's this really powerful kind of single node database engine available, how do we make that work in

8:05 the cloud? And so all the things I was talking about earlier that are benefits of DuckDB become a challenge when you work in the cloud, right? It uses all the resources available to it. It doesn't have a security model, so that means we have to build one. You have access to either the entire database or you don't, for example, that's just how DuckDB works,

8:24 which is very different from a traditional database, where users will have specific tables they'll have rights to, different actions on different tables. DuckDB has no consideration for those. Um, so how do you make that work? How do you make things like incremental storage work? How do you extend all the capabilities that exist in DuckDB and just make it easier and faster to use with

8:49 other cloud services? And so that's what we're thinking about. The second part is also, your local machine is pretty powerful, but there are even bigger machines in the cloud, and so we have this notion we call hybrid execution, where the query planner, when it runs, actually assesses where the data is, meaning is it local or is it in the

9:09 cloud, and then can assess where to execute different parts of the query. Some of it can execute in the cloud, some of it can execute locally, and so really what that means is you're maximizing your existing hardware, and that means it's also cheaper to run, right? Instead of using EC2 it is using a local instance, for

9:33 example, your local compute. And so there's a whole bunch of stuff around that that's really cool. And the other piece that's semi-related to this notion of dual execution is we have a WASM engine for MotherDuck

9:47 too, so it can run in your browser. Actually the MotherDuck app is powered by kind of our WASM connector, which is really cool. That means that, again, as you're interacting with it locally, it can run a bunch of queries locally and then go to the cloud when it needs to get more RAM or more

10:04 data that's not available locally. Um, okay, that's excellent. That's actually one thing I want to drill into a bit, and there are actually a few things there that sound very exciting. So with MotherDuck specifically, like you mentioned, there's a hybrid execution engine and it can tell, okay, is the data local to this machine or is it in the cloud, are

10:26 there parts we can execute locally, are there parts we can execute in the cloud. Are you saying I can do a join between data locally and in the cloud, is that how that works? Yes, that is correct. Wow, okay,

10:40 that's pretty powerful. And it really does seem like for a lot of the analytical workloads you don't always need a Spark cluster, but that's because of the way the data is laid out and organized, and okay, I have my data warehouse vendor or I have my lakehouse, and that thing by default is going to run these big expensive queries. That

11:03 seems to be what we standardized on as an industry, not because it's the best way to do it, but because it's convenient. Now it seems like there's an even more convenient and more efficient way to do this. Yeah, that's fair, that's certainly our angle, is we want to be the easiest to use but also the most cost-effective. It's a hard line

11:28 to walk, for sure. And I think there's a lot of ground

11:35 yet to be explored in terms of how you can scale out DuckDB. One thing that's interesting, if you compare it to other cloud vendors, is you don't need

11:48 to operate a lot of those with kind of the concern of optimization in mind, and you don't need to necessarily with DuckDB either, but what's interesting is that users tend to think about that more because they're using DuckDB, which is a really cool kind of synergy and offering too, because you end up with users that love to think about

12:08 optimization, and so it's a really nice feedback loop. Where it's like, hey, I had someone the other day, we were talking in the dbt Slack, a really cool ops thing where they were actually running their pipeline using only local compute, and then their last step in their dbt pipeline was cloning that database into MotherDuck, right, just

12:29 copy database, the copy database function. So that's a really cool way that you can think about using it that I hadn't even thought about, just because our users tend to be optimizers. There's a lot of really cool stuff where it's like, oh, we can run our pipeline all locally and then the last step is we

12:48 publish the file basically into MotherDuck. So if we think about what that pattern looks like for, I don't know, if you're a data app vendor and you want to be creating analytics that are really fast for your customers, you need to run this pipeline on an hourly or daily basis, how do you achieve

13:04 that at scale? One option is you can just run a bunch of stuff locally and then just ship that into the cloud for execution, with our WASM engine, which is really cool too. So the WASM engine, WASM stands for WebAssembly, that means it runs within the browser. So is the browser itself instantiating DuckDB,

13:25 the entire database? Yeah, it does.

13:30 It does, like in my MDS-in-a-box project, find it at mdsinabox.com, that is using just regular DuckDB WASM, but you can also - MotherDuck has its own kind of fork of that that supports the additional MotherDuck functionality. But yeah, so I don't think it can use - I think it's limited to half the RAM and half the cores,

13:56 or it's single threaded maybe, so it doesn't have all the advantages of DuckDB kind of running on bare metal, but even then it's really fast and can do some really cool stuff. Especially, my favorite example of it is the Mosaic project, which is a visualization library built by some of the guys from Streamlit and Vega and

14:22 others, and behind the scenes it is DuckDB WASM. And so what that means is you can load, let's say for example, a 20 million row data set and get millisecond level latency interacting with it. I think what I've heard from them is their goal is 60 FPS as a bare minimum in terms of

14:45 interactivity with the visuals they're building, and when we think about how you make that actually happen, you can't have a round trip to a database, right, meaning a database in the cloud. It needs to all be local, so how do you bring that power to a local experience? The answer is WASM. WASM is one path, and I'm

15:08 very excited about what type of scenarios that makes available. For sure, I remember when I was a software engineer working on business intelligence, specifically dashboarding, the biggest problem wasn't just building a nice chart, it was actually managing the data from the database, right? Because if you let a user go in and put in some SQL, they're gonna - the browser is just going to try to

15:35 get this big JSON response of all the data points, and it can throttle and choke the browser with not that much data, honestly, because there are other inefficiencies there. Now you're saying, and it sounds like the Mosaic project is leading this, there's a more efficient kind of caching layer for

15:57 rendering purposes? Yeah, definitely. So I think there are two parts there, and you actually hit on something that I want to talk about, MotherDuck's tenancy model, which is unique, and I think in terms of database pressure is one of the challenges to the analytical workloads, but I'll come back to that in a second.

16:15 So if you think about, if you're using, I don't know, let's say D3 for visualization, right, you're probably pulling JSON over the wire and then populating charts with that, right? And that's fine, JSON is a great format, right? But if you can use the DuckDB format, which is in WASM, you get like 10x compression. So that means that what is

16:37 100 megs of JSON can be like 10 megs of DuckDB, or more likely what it means is, instead of being limited to 100 megs, hey, I can only put 10,000 rows in this visualization, I can now get 10x the rows just because of one dimension, which is compression. Then you have a faster query engine, right? Instead of

16:59 using JSON, or instead of using JavaScript primitives to interact with the data, you can use SQL primitives, which are going to be faster because they're based on DuckDB, which is optimized for query speed. And so you add up all these little advantages and all of a sudden you can actually start to get to interactivity that is,

17:16 for all intents and purposes, real time. And I think that is some of the advantage there. Like I said, one of my favorite data sets in the Mosaic kind of demo is a 20 million row flight data set, of like flights and whether they were on time and that kind of stuff, and you

17:35 can just drag around, interact with it, and you see the data come back in real time. It's crazy. So I'm gonna ask you a very pointed question and you're not allowed to say it depends, okay? So everyone just be aware there's that qualification here. Can I replace my BI tool with this? Can you replace your BI tool, like

17:56 with the Mosaic stuff? Um, probably not yet. Okay. So there are a couple,

18:04 a couple of vendors who - so the Evidence (evidence.dev) folks are using DuckDB WASM today with their BI tools for interactivity, and I believe Observable is as well, and there might be others. I know it's in Hex and Mode, like the DuckDB pieces of it are doing some various SQL bits. I think the main use case for the

18:27 WASM stuff is actually more like, if you're building a data application, instead of having to, let's say, embed a BI tool like Looker, you can roll your own and build a best-in-class experience using MotherDuck WASM. Right, that's incredible. So let's say I have a data stack, right, a modern data stack, and you know this

18:53 very well because you built MDS-in-a-box, but let's say a data team, we ingest, we bring it all into a data warehouse like BigQuery or Snowflake, and we have Looker reports sitting on top of this. Where does MotherDuck come into the equation here? Is it adding to this, is it replacing parts of this?

19:16 That's what I want to hear. Yeah, I think the way we're

19:23 thinking about this is, you don't actually have that much data, and so for folks that really need

19:32 BigQuery and Snowflake, they're always going to need it, but we think that the majority of the market actually doesn't have that much, and that you could use MotherDuck as the data warehouse. And I think what that looks like in the kind of short to medium term is, I think it's probably an 'and' question: hey, we're

19:53 going to use BigQuery for X, and maybe our marketing team is going to use MotherDuck. But I think notionally where we want to go to is, hey, whatever your data workloads are, we want you to be able to bring those into MotherDuck. Yeah, absolutely. And your point about most people don't have really big data seems to be corroborated with

20:17 multiple benchmarks and publications on this. A really well-known one was some of the data released by AWS Redshift,

20:28 where the majority of workloads are not at

20:34 petabyte scale, right, they're in the low terabytes. And it is worth mentioning that, yeah, maintaining a data warehouse, there's that built-in floor cost, and that comes from the scale that that warehouse is supposed to offer, but most people may not have that. So now let's talk about the actual - to make this a bit real, let's say I'm

21:02 storing a few terabytes of data in BigQuery. Do I actually need big data processing to use that data? At the scale of a few terabytes, I don't think so. The answer of course depends on requirements, right? How much of that are you really going to process at any given time, what does your partition strategy look

21:30 like? The nice thing about tools like Snowflake and BigQuery is you can ignore that stuff, like, yeah, I'm just gonna put a bunch of crap in there and then maybe I'll use it later, and if I query it, it can spin up a huge instance to handle it rapidly. But on the flip side, is that

21:49 really what good data stewardship looks like? Do you want to have that much data lying around, shouldn't you be able to query it that way? I don't know, I would say no, actually. I'm not going to say I don't know, no, like, you should be more intentional about it. Yeah, also a few terabytes, yeah, no problem, no problem for

22:06 MotherDuck, but you do have to think a little bit more about your partitioning strategy, you just can't jam the 10 terabytes in there. But overall I've been very pleased with kind of the ground we've been taking in terms of executing against potentially large loads. And the other thing you have to think about too is

22:25 DuckDB has best-in-class compression. One terabyte in a Parquet file is not the same as one terabyte inside of DuckDB, and so that's something to think about too. I was talking about 10-to-1 compression earlier, right, that's probably not exactly right, but most JSON probably compresses around that ratio. So just think about if you're dealing with,

22:49 I don't know, JSON blobs in S3 or something, right, you could have, I don't know, a few terabytes of those, and you can just compress them and now all of a sudden it's a few gigs. I've seen that quite a few times with JSON and CSVs, less

23:12 compressed formats. So DuckDB has its own native format in addition to Parquet is what you're saying? Yeah, so the DuckDB storage format is incredible. Yeah, okay, it has its own serialization format, and is the way to get data into that format just inserting directly into DuckDB via the API, like the actual DML API that

23:39 it has? Yeah, so yes, that's right. So with MotherDuck it's as simple as putting md: and then your database name to connect to it, and then you can just, yeah, use regular DML operations from Parquet files, CSV, JSON, Arrow tables, data frames. It has a whole kind of set of APIs out of the box that

24:04 are all pretty fast to load data, and you very rapidly just start running into networking constraints on just moving the data from A to B, which is really where you want to be on that stuff; if you can hit the networking threshold then you've pretty much saturated what you can expect to do. Yeah, there's lots of cool stuff

24:27 there. We have a Fivetran connector for example as well, one for Airbyte, and some others. dlt Hub has a good one too. Getting data in is easy, but it could be easier, um, and

24:42 we're hoping to continue to push the envelope on what it looks like to have best-in-class performance in that area. Yeah, absolutely. The other thing that's very interesting is, yeah, you could have your warehouse, right, where you store all your data long term, but there are a lot of ad hoc analytics jobs that you can potentially do with DuckDB, so how do

25:08 you see companies balancing what's in their warehouse versus what they do in DuckDB? Yeah, I think ideally they're thinking about those as one and the same with MotherDuck, but one thing that's really interesting that we have built out - so we use MotherDuck for our own data warehouse at MotherDuck - one thing we built out

25:31 which is really cool is we actually have a sample-to-local script we can use for development, right? Um, so what that means is you can run a script and you can get a thin copy of the entire data set, which means you can do true local development for very low cost. Obviously you're just paying for egress at that point, which is really

25:52 cool, and that's something you can't really do with other databases, at least not in an easy way. We've all been in the state of, oh, we need to restore a backup so that our dev environment can run and not get a failure on whatever this test is. That's really hard, but obviously DuckDB opens up some new

26:11 paradigms where you can sample to local, which is really cool. In fact, there's a really cool open source package that I just saw the other day, which someone actually built for Snowflake, so that you can basically proxy your Snowflake with DuckDB. I think I saw that actually, yeah, that's pretty cool. And this also comes

26:32 back to your point about MVCC, right? So why does MVCC matter? Because you don't want a lot of analytical users going and running OLAP-style queries on Postgres while it's also serving transactional workloads, because ultimately they're both gonna be competing for resources and the analytic queries aren't super efficient. Yeah, so now DuckDB is its own

26:54 super efficient compute layer, so I would love to hear how DuckDB adds value there too. Yeah, so one thing that's really cool that we're doing at MotherDuck is every user gets their own duckling. Okay, so that duckling is our kind of notion of VM

27:14 compute. So what that means is if I have a data warehouse and then I'm connected to it as Jacob and you're connected to it as John, our queries are sandboxed completely away from each other. So if I run some really dumb query doing a bunch of window queries or something super inefficient, your performance is not affected, and so that's something that's

27:40 unique about our tenancy model. Where that really matters is if you think about, um, you're serving a data app use case, you could potentially have hundreds of thousands of concurrent users. That could be a pretty big Snowflake bill if you let them have an interactive session with

27:58 their data. So if you think about what that looks like in the context of MotherDuck WASM, you can feed a data set, I don't know, a few million rows, to a user and now they're completely sandboxed off from other users, so you can break the dependency there and let them run whatever crazy SQL they want to run. Give them the real

28:20 power of the platform you're building without also the risk of, oh, someone's running a query that's made my whole application slow, right? My previous company was in telco and I had definitely got my hand slapped a few times by some pretty big vendors when we ran some expensive queries on them. It's definitely something

28:45 to think about from a data app standpoint, where how do you let users have the power they need without, what's the right word here, impacting the performance of the system as a whole. That's part of why we ended up with REST APIs and, you know, you can't web scrape and all this stuff, just with

29:08 how the front end works, the back end. Ultimately a lot of it is just to prevent users from being able to just DDoS their own server, and so I think something where it's, we welcome that type of workload, actually, bring it into WASM and the only person you impact is yourself. Yeah, when you look at a lot of the

29:31 classic architectures that are proven to work for balancing transactions and analytics, really it usually comes down to a couple of patterns. One is change data capture from the operational database to the data lake, you're offloading data for reporting purposes, and then you also have concepts like read replicas, which is essentially another form of change data capture, just to a copy of the

30:02 database. Now I think one of the things that's super interesting is you still might have this notion of change data capture, but you could have HTAP where the majority of the compute is taken from this super efficient embeddable OLAP engine that's reading from the database or a copy of the database. So I do

30:26 think that there are some fundamental patterns that it's building on top of, which could be very cool and exciting for people who want to roll their own HTAP. Yeah, I think definitely HTAP is notionally a really

30:45 cool thing to attempt to achieve. I still think the reality is you have to replicate the data to somewhere else to achieve the performance you want, whether that's with CDC, or just selecting data into another table, or columnstore indexes on a rowstore table, that's how SQL Server does it for example, yeah, or Oracle as well. Yeah, some way the data being, you

31:13 know, great, we've duplicated all of the data into a columnstore index. Yeah,

31:20 okay, you're gonna pay for it one way or another, it's about the most convenient and efficient way to do it. Yeah, I think, and this leads into something we were going to talk about anyway, but might as well bring it here, obviously we're partnering with Hydra and others to bring pg_duckdb

31:38 to life, which is a Postgres extension to let you run DuckDB inside your Postgres server, and then obviously extend it to MotherDuck if you need to. And so that's been in development, now the project is public, you can take a look at it. I think that is the closest thing that I've seen to open source

32:03 HTAP, but what's - I don't know what the technical marketing is that we're going with here yet, but it's definitely fast analytics, whether you want to call it HTAP or not, would be my guess. Yeah,

32:18 it's certainly very indicative of the fact that this is the next generation of analytics and compute that can be solved with DuckDB as a foundational element, and then on top of that Apache Arrow and these popular serialization formats. I know ADBC is also something that DuckDB

32:43 is aligning itself to, which is also super cool and becoming a standard in its own way, like what you're seeing with dlt Hub and DataFusion and others, a lot of really, what I think is, groundbreaking technology coming through this. And it really did pique everyone's interest when Jacob, you as an operator, who has actually kept the lights on in a

33:09 real-world environment with data, spreadsheets, SQL Server, dbt, what have

33:17 you, looked at DuckDB, looked at MotherDuck, and said, oh wow, this is actually something that can be useful to everyone.

33:28 Yeah, that's the goal, right? I think again my personal experience was I just kept running into problems that were DuckDB-sized, right? In the range of,

33:40 I've got - one of my favorites was at a previous company, we would get Excel files that were partitioned by tab, for example. So how do you deal with that? And the answer is, yeah, people had data analysts that were just running their head against the wall trying to handle

34:00 it with pivot tables or Power Query, and all that stuff kind of works okay, but with DuckDB it's very trivial to rip that into a tabular format that is then super fast to query. We're talking about going from, I don't know, probably a couple hours screwing around in Excel to four SQL statements that we were able to

34:26 write when we were trying to get data out of these files, and it took 30 seconds. Yeah, and the fact that you can express your transformations very concisely in SQL on top of fairly complex file formats. Obviously, one of my favorite kind of minutiae of DuckDB is that the Excel reader is

34:50 actually part of the spatial package, so you have to install the spatial package for it to work, and there's a separate Excel plugin that does something completely different, I can't even remember what it does. But yeah, so use the spatial package, you can do lots of fun things with Excel, handle files that are

35:10 partitioned, um, yeah, by tab. Yeah, it's - anytime you can make SQL

35:21 really portable and embeddable, there's

35:25 going to be tons of value you can add in. Now the reality is tons of analytics and

35:32 operational work is still done in spreadsheets, and that'll never change. Yep, the question is how do you

35:40 scale that and make it more - rather than everyone having their own ad hoc copy version of a spreadsheet, how do we actually apply some of the fundamentals we know about data management to scaling that? I think, I don't know if the story is super clear yet, but I think there is a real path now with Duck-

36:00 DB to making that possible. Yeah, I think that, for better or

36:07 for worse, the lingua franca of data has settled on SQL, much to the chagrin of

36:14 data frame enthusiasts everywhere. But I think that it's a real power to be able to think about, especially as companies are more and more spread out and they're smaller and smaller, like, collaboration is critical and being aligned is critical and having tools that enable us to do those things is critical. And so if we can just make it easy to have one source

36:44 of truth, and make that the easiest thing, make that as easy as a spreadsheet, I think that's some of the real power of DuckDB, is that it takes that

36:55 same - it runs locally just like Excel, it can read all the same files as Excel, and it gives you really powerful primitives to operate on it. So I think for me that's consumed a lot of workloads where I previously was screwing around in Excel or using Power Query. It's like, now it's just easier to DuckDB it, because

37:15 it's SQL, then I can run it later if I need to, and I think that there's real power there, but a lot of that is organizational. But I wanted to ask you about Small Data SF, the event you have coming up on September 23rd, like I said, in San Francisco. Yeah, so we are hosting the

37:41 first Small Data conference, very excited, a lot of really

37:48 cool speakers talking there. I think my favorite, one of the overarching themes that we've been able to think about now is, what does it mean to make some of this AI stuff usable in practice, right? So folks like Ollama are gonna be talking about that. We just released our own kind of

38:12 embedding feature, which I'm really excited about, that's enabling some kind of fun use cases, but I think it actually can turn real as well. And we're excited to spread the

38:28 word there about, okay, you don't have to have big data to solve big problems. That's the overarching goal of what we're doing with Small

38:42 Data. Great, we'll have that link to sign up to Small Data, the Small Data conference in San Francisco, in the show notes with Jacob's code, be sure to use that, so that'll be down in the show notes description with my code. Indeed, we'll put that in the show notes. Yeah, definitely looks like a great event. I did see the speaker

39:06 lineup, certainly gonna be very exciting for people who want to - the way I really think about DuckDB is that if you want to do analytics and you don't have to worry about getting throttled by your warehouse or adding unnecessary compute, it's really amazing for just ad hoc analytics. Right, now it does sound like MotherDuck has some really interesting

39:31 value to standardize and centralize that

39:35 within an organization, even if you already have a data warehouse. Because yeah, sure, a data warehouse is great for storing tons and tons of data, and if one day your CEO asks, hey, I want to know the age of every single customer we've ever had, then sure, go run that against the warehouse and scan all the petabytes of data you

39:55 have. For most workloads, like that report that sales is asking for, or that marketing information that your revops team is asking for, it can fit on a single machine and can be much more efficient if you just do that with DuckDB. Yep, yeah, that's the goal, and I think

40:21 that the number that we're going to be able to see processed by a single machine is only going to increase over time, and it's crazy to reflect on where things were 10 years ago in terms of what it meant to have big data versus what it means today. Yeah, I think if I had known where we would be in

40:41 2024 back in 2014, when I was working with my finance team on procuring more SQL Server core licenses, I think it would have totally blown my mind. It still does, but yeah, it's crazy.

40:58 Jacob Matson, Developer Advocate at MotherDuck, and you can see him in person at Small Data SF on September 23rd. Jacob, thanks so much for joining today's episode of What's New in Data, and we'll be hearing from you soon, and thank you to everyone who tuned in today. Thank you. All right, thanks John, chat later.


Related Videos

  • Lies, Damn Lies, and Benchmarks (Interview · Stream, 2025-10-31): Why do database benchmarks so often mislead? MotherDuck CEO Jordan Tigani discusses the pitfalls of performance benchmarking, lessons from BigQuery, and why your own workload is the only benchmark that truly matters.

  • Can DuckDB replace your data stack? (Interview · YouTube, 2025-10-23, 60:00): MotherDuck co-founder Ryan Boyd joins the Super Data Brothers show to talk about all things DuckDB, MotherDuck, AI agents/LLMs, hypertenancy and more.

  • The Death of Big Data and Why It’s Time To Think Small | Jordan Tigani, CEO, MotherDuck (Interview · YouTube, 2024-10-24, 59:07): A founding engineer on Google BigQuery and now at the helm of MotherDuck, Jordan Tigani challenges the decade-long dominance of Big Data and introduces a compelling alternative that could change how companies handle data.