Local Dev, Cloud Prod with Dagster and MotherDuck
2024/04/22

Have you ever wondered how a seamless transition from a local development environment to production can boost efficiency and innovation in data engineering? Many data engineers and developers grapple with this challenge, hindered by the complexity of ensuring consistency, scalability, and reliability across the stages of their data pipelines. This article sheds light on a strategy that leverages the synergy between MotherDuck and Dagster, two platforms that are reshaping the data engineering landscape.
By exploring the integration of MotherDuck and Dagster, readers will gain practical insights into streamlining their data pipelines from development to production. Witness firsthand the journey of Colton Padden, who shifted from Airflow to Dagster, driven by the platform's ease of use and efficiency. Moreover, delve into Alex's transition from industrial engineer to data scientist at Intel, where his fascination with DuckDB eventually led to his current role at MotherDuck. This exploration not only highlights the personal and professional transformations brought about by these platforms but also addresses the common hurdles faced in data engineering, particularly the daunting task of moving from local development to production environments.
How do MotherDuck and Dagster collectively propose to navigate these challenges, offering a beacon of hope for data engineers and developers seeking to elevate their workflow? Engage with this comprehensive guide to uncover the strategies that could revolutionize your data engineering projects, empowering you with the tools to thrive in the ever-evolving digital landscape.
Introduction to MotherDuck and Dagster - A Comprehensive Guide to Streamlining Data Pipelines from Development to Production
In the realm of data engineering, the leap from a local development environment to a fully-fledged production system often presents a daunting array of challenges. From ensuring data integrity and consistency to optimizing performance and scalability, data engineers and developers are constantly in search of more efficient, reliable solutions. Enter MotherDuck and Dagster, two innovative platforms that have emerged as game-changers in the way data pipelines are managed and executed.
MotherDuck, building on the prowess of DuckDB, reimagines cloud data warehousing by prioritizing developer experience and efficiency, while Dagster presents itself as a modern data orchestrator focused on enhancing developer workflow and productivity. The synergy between these platforms is not just about technology; it's about transforming the approach to data engineering from the ground up.
Colton Padden's journey from being an avid Airflow user to becoming an advocate for Dagster encapsulates the transformative impact of embracing new technologies in data engineering. His experience highlights not just the ease of use but the efficiency gains that come with adopting Dagster. Similarly, Alex's path from industrial engineering to data science, propelled by his interest in DuckDB, underscores the importance of innovative tools in career evolution and the execution of data projects.
The integration of MotherDuck and Dagster offers a compelling solution to the common problems faced by data engineers, especially the intricate process of transitioning from local development to production. This guide aims to explore the intricacies of this integration, providing insights into how data engineers and developers can leverage these platforms to streamline their data pipelines, enhance productivity, and ultimately, transform their data engineering practices for the better.
What specific challenges do these platforms address, and how do they pave the way for a smoother, more efficient transition from development to production?
The Problem Statement and Proposed Solution: Navigating the Challenges of Data Engineering with Innovative Tools
In the intricate world of data engineering, professionals often encounter a myriad of obstacles that can hinder the development and deployment of efficient data pipelines. These challenges range from mocking data sources for testing environments, writing unit tests for data pipelines to ensure reliability and accuracy, to the complexities involved in integrating with external systems—each presenting a unique set of difficulties in the transition from development to production. Colton Padden's insights into these common hindrances underscore the necessity for tools that not only address these issues but do so in a manner that augments developer productivity.
The introduction of Dagster, MotherDuck, and Evidence marks a significant leap forward in the quest for solutions that embody the principles of software engineering within the realm of data engineering. These tools collectively offer a paradigm shift in how data pipelines are constructed, tested, and deployed:
- Dagster emerges as a beacon of modern data orchestration, emphasizing a workflow-centric approach that enhances visibility and control over data pipeline operations. Its asset-centric model facilitates a clear visualization of data lineage and dependencies, ensuring an organized and maintainable codebase.
- MotherDuck takes the stage as a revolutionary cloud data warehouse solution, leveraging DuckDB's prowess to offer unparalleled consistency between local development and cloud deployment. Its serverless architecture and Git-like operations for databases pave the way for efficient resource utilization and effortless version control, respectively.
- The integration with Evidence introduces an innovative method for building data dashboards, wherein SQL queries can be embedded directly within markdown files. This simplicity in dashboard creation democratizes data visualization, allowing developers and analysts alike to craft dynamic data stories without the need for extensive technical expertise in data science.
The significance of these developments cannot be overstated. By applying software engineering principles to data engineering, these tools collectively enhance efficiency and developer experience across the board. One of the most groundbreaking aspects of this integration is the seamless transition it facilitates from local development to production environments. This transition, characterized by a lack of code changes when moving from using DuckDB locally to leveraging MotherDuck in the cloud, epitomizes the efficiency and ease of scalability that modern data projects require.
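To make that idea concrete, here is a minimal sketch (not taken from the talk) of what the environment-driven swap can look like with plain DuckDB. The database names, environment variable, and table are invented for illustration, and the MotherDuck case additionally assumes a `MOTHERDUCK_TOKEN` is available in the environment.

```python
import os

import duckdb

# DUCKDB_DATABASE might be "local_dev.duckdb" on a laptop and "md:prod_db"
# (a MotherDuck database) in production; the names here are illustrative.
database = os.getenv("DUCKDB_DATABASE", "local_dev.duckdb")

con = duckdb.connect(database)
con.execute(
    "CREATE TABLE IF NOT EXISTS duck_sightings AS "
    "SELECT 'mallard' AS species, 42 AS sightings"
)
print(con.execute("SELECT species, SUM(sightings) FROM duck_sightings GROUP BY 1").fetchall())
```

The same script runs unchanged in both environments; only the value of the environment variable differs.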
Consider the specific use case of building dashboards with Evidence. The ability to embed SQL queries in markdown for dynamic data visualization not only simplifies the process but also accelerates the development cycle, enabling rapid iteration and deployment of insightful data visualizations. This approach not only saves time but also ensures that data insights are accessible and actionable.
Through the lens of these innovative solutions, it becomes clear that the future of data engineering lies in embracing tools and methodologies that streamline the pipeline from development to production. By fostering an environment where efficiency and developer productivity are paramount, Dagster, MotherDuck, and Evidence are setting a new standard for how data engineering challenges are addressed. As these tools continue to evolve and gain traction, the data engineering landscape is poised for a transformation that prioritizes agility, reliability, and accessibility in data operations.
Deep Dive into Dagster: A Modern Data Orchestration Framework
At the heart of effective data engineering lies the orchestration of complex data pipelines, a task that requires precision, foresight, and the right set of tools. Dagster, as introduced by Colton Padden, emerges not just as a tool but as a comprehensive framework designed to refine and enhance the way data engineers and developers manage their pipelines. What sets Dagster apart is its foundational approach to orchestrating data pipelines, emphasizing developer workflow and productivity above all.
Unlike traditional orchestrators that focus on tasks as discrete units of work, Dagster introduces a paradigm shift towards assets. This shift is more than a mere change in terminology; it represents a fundamental rethinking of how data pipelines are constructed and visualized. Assets, in the Dagster universe, are tangible elements—be it a table, a report, or a machine learning model—that provide a clearer visualization of data lineage and dependencies. This approach not only simplifies the understanding of complex data flows but also enhances the manageability of dependencies within the pipeline.
One of the standout features of Dagster is its auto-materialization policies. In traditional setups, pipelines are often triggered on a schedule, without regard to whether the upstream data has changed. Dagster, by contrast, employs a more reactive model: these policies let pipelines trigger based on changes in upstream data, ensuring that data flows are not just efficient but also relevant. This responsiveness to data changes underscores Dagster's commitment to efficiency and resource optimization.
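As a rough sketch of what that reactive model looks like in code, using Dagster's `AutoMaterializePolicy` API as it existed around the time of this talk (the asset names and data are invented for illustration):

```python
from dagster import AutoMaterializePolicy, Definitions, asset

@asset
def raw_duck_sightings() -> list[dict]:
    # Stand-in for an ingestion step (e.g. downloading a CSV).
    return [{"species": "mallard", "count": 3}]

# Eager policy: re-materialize whenever upstream data changes,
# rather than on a fixed schedule.
@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def duck_summary(raw_duck_sightings: list[dict]) -> int:
    return sum(row["count"] for row in raw_duck_sightings)

defs = Definitions(assets=[raw_duck_sightings, duck_summary])
```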
Dagster's prowess is not limited to orchestrating workflows and managing assets. Its extensive support for integrations with a wide array of tools and platforms makes it a versatile player in the modern data stack. Whether it's integrating with data warehouses like Snowflake, computation platforms like Dask, or visualization tools like Evidence, Dagster serves as the linchpin that unites various components of the data stack, facilitating a seamless flow of data across tools and teams.
Understanding how assets are defined and managed in Dagster offers insights into its structural and operational efficiency. Assets in Dagster are not just static entities but are defined with rich context, including metadata that describes their lineage, parameters, and dependencies. This structured approach to handling data assets within pipelines not only enhances transparency but also bolsters the reliability of the entire data ecosystem.
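A hypothetical example of that structure, with an explicit upstream dependency and metadata attached at materialization time; the asset names and values are illustrative and not from the demo project:

```python
from dagster import MaterializeResult, asset

@asset
def raw_species_csv() -> None:
    ...  # e.g. download the source CSV to local or cloud storage

@asset(deps=[raw_species_csv])  # lineage: species_table consumes raw_species_csv
def species_table() -> MaterializeResult:
    row_count = 412  # in practice, computed from the loaded data
    return MaterializeResult(metadata={"row_count": row_count, "source": "raw_species_csv"})
```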
As developers and data engineers delve into Dagster, they discover a framework that is not just about executing data tasks but about creating a cohesive and efficient data operation environment. Dagster's emphasis on assets, coupled with its reactive triggering mechanisms and robust integration capabilities, positions it as a critical tool in the arsenal of modern data professionals seeking to navigate the complexities of data pipeline orchestration with ease and efficiency.
Introduction to MotherDuck: Rethinking the Cloud Data Warehouse
In a landscape dominated by scale-centric cloud data warehouses, the inception of MotherDuck signals a pivotal shift, emphasizing the developer experience as the cornerstone of modern data warehousing. Alex, a forward-deployed software engineer at MotherDuck, articulates the necessity for this paradigm shift, driven by the limitations of traditional data warehousing architectures that often sideline the agility and productivity of developers. MotherDuck emerges as a beacon of innovation, seamlessly blending the robustness of cloud warehousing with the nimbleness required for agile development.
Git-like operations for databases stand out as one of MotherDuck's most groundbreaking features. This functionality ushers in a new era for database versioning and deployment, offering:
- Zero-copy cloning: Instantly clone databases for development or testing without the data movement overhead.
- Branching and merging: Manage database changes with the same flexibility as code changes, facilitating smoother transitions from development to production environments.
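A hedged sketch of how a development run might use that cloning; the `CREATE DATABASE ... FROM ...` clone syntax and the database names are assumptions based on MotherDuck's documented clone feature, so check the current docs before relying on them:

```python
import duckdb

# Connect to MotherDuck; a MOTHERDUCK_TOKEN is expected in the environment.
con = duckdb.connect("md:")

# Zero-copy clone of production, then point the dev pipeline at the clone.
con.execute("CREATE DATABASE dev_feederwatch FROM prod_feederwatch")
con.execute("USE dev_feederwatch")
```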
At the core of MotherDuck's philosophy is its seamless integration with DuckDB, ensuring a uniform experience from local development to cloud deployment. This integration eliminates the common friction points encountered when moving workloads to the cloud, fostering an environment where developers can focus on innovation rather than infrastructure nuances. The synergy between MotherDuck and DuckDB ensures:
- Consistent SQL dialects and functions, regardless of the environment.
- A streamlined path from prototype to production, without the need to rewrite or adjust code for cloud deployment.
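One way this plays out in practice, sketched here with Dagster's `dagster-duckdb` resource and an invented environment variable (mirroring the pattern the project walkthrough later in this post describes), is a single resource definition that serves both environments:

```python
from dagster import Definitions, EnvVar, asset
from dagster_duckdb import DuckDBResource

@asset
def species(duckdb: DuckDBResource) -> None:
    with duckdb.get_connection() as con:
        con.execute(
            "CREATE OR REPLACE TABLE species AS "
            "SELECT * FROM read_csv_auto('data/species.csv')"
        )

defs = Definitions(
    assets=[species],
    resources={
        # DUCKDB_DATABASE = "local_dev.duckdb" in development, "md:prod_db" in production.
        "duckdb": DuckDBResource(database=EnvVar("DUCKDB_DATABASE")),
    },
)
```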
The serverless architecture of MotherDuck introduces a dynamic approach to resource allocation, where compute resources are tailored to the workload's demands, significantly reducing operational costs. This elasticity allows organizations to scale their data operations without the burden of over-provisioning or managing complex scaling policies.
Furthermore, the dual-engine execution model represents a breakthrough in query processing, elegantly balancing the decision of where to process queries—locally or in the cloud—based on:
- Data locality: Optimize query performance by processing data closest to its source, reducing latency and transmission costs.
- Query complexity: Intelligent routing of queries to the most appropriate execution environment, ensuring optimal use of resources.
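A hedged illustration of what that routing enables: a single query that mixes a local, in-memory DataFrame with a table that lives in MotherDuck. The database and table names are invented, and the exact planning behavior is MotherDuck's to decide; this is a sketch of the idea rather than a guaranteed execution plan.

```python
import duckdb
import pandas as pd

# Local data, zero milliseconds away.
local_sightings = pd.DataFrame({"airport": ["PDX", "SEA"], "ducks_seen": [12, 7]})

# Connect to MotherDuck; a MOTHERDUCK_TOKEN is expected in the environment.
con = duckdb.connect("md:analytics")

result = con.execute(
    """
    SELECT l.airport, l.ducks_seen, r.region
    FROM local_sightings AS l               -- local DataFrame, scanned in-process
    JOIN analytics.main.airport_regions r   -- cloud table, scanned by MotherDuck
      ON l.airport = r.code
    """
).df()
print(result)
```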
MotherDuck's innovative approach not only redefines the cloud data warehousing landscape but also aligns with the evolving needs of data engineers and developers. By prioritizing developer experience and operational efficiency, MotherDuck stands as a testament to the belief that the future of data warehousing lies in flexibility, scalability, and, most importantly, empowering those who harness data to drive insights and innovation.
Evidence: Revolutionizing Data Dashboards with Markdown and DuckDB
Evidence emerges as a transformative platform in the realm of data applications and dashboards, offering an unparalleled mix of simplicity and power. At the heart of its innovation is the integration of DuckDB WebAssembly, which propels Evidence into the forefront of responsive and interactive user experiences. This integration is not just about leveraging DuckDB's capabilities in a new environment; it's about redefining how developers and data analysts approach the creation and dissemination of data visualizations.
Colton Padden's firsthand experience with building data applications using Evidence showcases the platform's unique approach to dashboard creation: writing in markdown. This methodology isn't just about simplicity; it's about accessibility. By allowing SQL queries to be embedded directly into markdown files, Evidence lowers the barrier to dynamic data visualization, making it a powerful tool for those without extensive coding skills. The implications of this approach are significant, offering:
- Ease of Use: Users can create complex, interactive dashboards with minimal coding, focusing on the storytelling aspect of data rather than the technicalities of implementation.
- Flexibility: The markdown format is universally recognized and can be easily edited and versioned, facilitating collaboration and continuous improvement.
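For a feel of the format, a small Evidence-style page might look roughly like this; the source table, query name, and component options are illustrative assumptions and should be checked against Evidence's documentation:

````markdown
```sql ducks_by_state
select state, count(*) as observations
from birds.top_ducks_by_state
group by state
order by observations desc
```

Duck observations by state, drawn straight from the query above:

<BarChart data={ducks_by_state} x=state y=observations title="Duck observations by state" />
````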
The seamless integration of Evidence with MotherDuck and DuckDB ensures a coherent and efficient workflow from data processing to visualization. This trio of technologies harmonizes to create a unified environment where data can be effortlessly processed, analyzed, and visualized without the need for transitioning between different platforms or languages. This integration demonstrates the platforms' commitment to developer and analyst productivity, underscoring the potential of these tools to streamline the data analysis process.
Practical examples of dashboards created during the presentation illuminate Evidence's effectiveness in delivering insightful data visualizations. These examples not only illustrate the platform's capability to render complex data sets into comprehensible and visually appealing formats but also highlight the ease with which these visualizations can be customized and enriched with interactive elements.
The democratization of data visualization represents one of Evidence's most compelling contributions to the field of data science. By making sophisticated data analysis and visualization accessible to a broader audience, Evidence empowers organizations to harness the full potential of their data. This accessibility is pivotal for:
- Enhancing Decision-Making: Empowering more team members with the ability to analyze and visualize data fosters a data-driven culture.
- Promoting Innovation: When barriers to data visualization are removed, it opens the door for innovative solutions to emerge from across an organization.
In essence, Evidence's innovative platform, built on the robust capabilities of DuckDB WebAssembly and seamlessly integrated with MotherDuck, signifies a leap forward in data visualization and application development. Its approach not only simplifies the creation process but also extends the power of data visualization to those without specialized expertise in data science. This paradigm shift has the potential to transform how organizations leverage data, making informed decisions more accessible than ever before.
The Future of Data Engineering with Dagster, MotherDuck, and Evidence
As we stand on the brink of a new era in data engineering, the integration of Dagster, MotherDuck, and Evidence presents an unprecedented opportunity to reshape how data pipelines are developed, managed, and scaled. This trio of tools ushers in a transformative approach to handling data, from local development environments to production, emphasizing efficiency, scalability, and an enhanced developer experience.
The implications for data engineering are profound:
- Efficiency: By streamlining the development process and minimizing the need for code changes when transitioning between environments, these tools significantly reduce the time and effort required to deploy data pipelines.
- Scalability: The seamless integration of MotherDuck and DuckDB, coupled with Dagster's orchestration capabilities, allows for dynamic scaling of data pipelines, ensuring that they can handle increasing loads without compromising performance.
- Developer Experience: The focus on developer productivity tools, such as the ease of writing unit tests for data pipelines and the application of software engineering principles to data engineering, enhances the overall experience and productivity of data professionals.
Looking ahead, the future developments and enhancements of these platforms hold great promise. With the continuous evolution of technology and the increasing complexity of data workflows, the adaptability and innovation demonstrated by Dagster, MotherDuck, and Evidence will be critical. As hinted by the speakers, we can anticipate further advancements in these platforms that will push the boundaries of what is possible in data engineering.
For data professionals seeking to tackle the challenges of data engineering, exploring these tools further is not just an option—it's a necessity. The resources provided offer a starting point for diving into the capabilities and potential applications of these innovative solutions. Whether it's Dagster's Python data orchestrator that enhances workflow and productivity, MotherDuck's rethinking of cloud data warehousing, or Evidence's revolutionizing of data dashboards with markdown and DuckDB, each tool offers unique advantages that can be leveraged to meet and exceed data engineering objectives.
A call to action for data professionals: Embrace the innovative solutions offered by Dagster, MotherDuck, and Evidence. By integrating these tools into your data pipelines, you can build more efficient, scalable, and developer-friendly workflows. The transformative potential of these technologies is not just in their individual capabilities but in their combined power to revolutionize data engineering. As you embark on this journey, remember that the future of data engineering is not just about managing data—it's about unlocking its potential to drive innovation and success.
Transcript
0:00 well let's go ahead and get started um I do think this is an exciting presentation we have in store for you um this is a MotherDuck and Dagster presentation on how you can go from a local development environment to production and have confidence in doing so we'll also be demonstrating how to use Evidence for building dashboards my name is Colton Padden I'm a
0:23 data engineer and developer advocate at Dagster Labs uh I work there working on dogfooding our product um building integrations um making learning material I've been in the data space for eight or so years now um and I used to use Airflow like every day for like six years and then after I learned about Dagster I
0:43 liked it so much I ended up working there so I think that kind of goes to show um how much I like Dagster and the power of it that's awesome well well hi I'm Alex I'm a forward deployed software engineer here at MotherDuck and I also do some blogging for DuckDB Labs so um also you know became a fan
1:04 and then got to work on what I'm a fan of so I spent nine years at Intel going from industrial engineer to data analyst to data scientist and along the way helped build a self-service analytical platform and DuckDB became a key ingredient of that I became a huge fan uh tweeted about it a lot and actually
1:19 got recruited to uh do some part-time work for DuckDB Labs based on Twitter so very Millennial of me being recruited by Twitter but uh really excited and it's been uh diving deeper and deeper into duck themed databases uh ever since so great to meet everybody awesome so let's start by talking about the problem statement so in this presentation we're going to
1:41 share an example stack of using Dagster MotherDuck and Evidence but first we want to outline what we're trying to solve here so um it's kind of like a known problem in data engineering and building data pipelines that working locally can be pretty difficult um and then going into production with like what you developed and having confidence
2:01 is pretty tough so some of those complexities that you might have is mocking your data sources writing unit tests for your data pipelines um the scale of your data can be difficult to work with locally and having like a good representation of what might be in production is a little tricky then integrating with external systems can be
2:19tough especially if you're in an Enterprise that's really locked down uh like you wouldn't have access to production snowflake from your local machine so having a way to have uh copies of your data set is really beneficial so the proposed solution here is to use tools that really emphasize developer productivity and some of these tools include dagster mother duck and
2:42evidence and all of these have kind of been taking a Cutting Edge approach and taking learning taking the learnings from software engineering and applying those to data engineering so that you can really have an efficient way to work with things on your local machine so this is the proposed stack um as you may already know dagster is used
3:01 for orchestration MotherDuck and DuckDB will be used as our data warehouse we're going to be showing how to use DBT for transformations and then finally Evidence will be used for um building a report with really nice
3:16 visualizations so first I'm going to talk a little bit about what is Dagster I recognize that some of you may have not used it um but Dagster is a Python data orchestrator um that has an emphasis on developer workflow and productivity and it has a lot of support for different integrations including MotherDuck and DBT which really allows
3:36 you to like hit the ground running um it has some key differentiators from other orchestrators and one of the biggest ones is the paradigm shift of working with assets versus tasks so in Airflow you might define these black boxes or tasks that run on a periodic basis and if a task is a verb an asset
3:56 is a noun so in Dagster you typically define like a noun like if you're collecting data about ducks you would have a duck asset which can like correlate to a database table but this paradigm shift really changes how you can visualize this and you can have like a full asset lineage so you can see how
4:14 everything relates together and then you can also um use some really beneficial features in like not scheduling things on um an hourly basis but having auto-materialization policies where things are more reactive and you have sensors and things just trigger when upstream data changes so Dagster is kind of like a modern data orchestrator that has a lot
4:33 of really cool features um that allow you to um really integrate with the modern data stack I'm going to the next slide I also wanted to talk about Evidence so admittedly um for this demo this was my first time working with Evidence but I was really blown away by how easy it was to work with um Evidence is a platform
4:53 for building data applications and dashboards and funny enough it's powered by DuckDB WebAssembly um so it's right there in line with the MotherDuck folks um you write your dashboards in markdown and you can embed SQL queries and leverage components for these visualizations in that markdown file and we'll be showing some snippets of code on how that's done later in the
5:16 slides and I'll give a very brief overview of MotherDuck here but then I'll hand it over to Alex um but you all are probably familiar with MotherDuck but it's a serverless data warehouse powered by DuckDB um and it really helps with like the local development flow in that it has a hybrid query
5:32 engine and one of the key things in this demo is that it has feature parity with DuckDB so you can use DuckDB locally and then MotherDuck in the cloud and it's all the same code which is really really powerful okay now I'll hand it over to Alex fantastic well thanks so much yeah really excited you
5:49know dagster is pretty neat and that you can kind of specify your end result that you want as an asset a lot like SQL you kind of this is the shape of data that I want and that's how my brain works so I I really like it let's get into it so wanted to talk for a couple minutes about mother duck
6:06 just because we're we're very new much much uh more recent uh than than Dagster uh and wanted to give you an idea of kind of who we are and uh what we're up to so first question that I want you to kind of be thinking about is really why are we building MotherDuck why why do we exist why are we so excited about it
6:23 and it's really about rethinking the cloud data warehouse so back in 2012 when the current iteration of the cloud data warehouse was being designed things were a little different so what's changed since 2012 and what would we do differently if we start today uh secondly we'll talk about MotherDuck some of our secret sauce what we're excited about and especially
6:44 look out for ways that it aligns really well with the Dagster ethos and working with Dagster so we'll we'll tie it together along the way and then based on that once you have those ingredients of MotherDuck at the end of the day it's about solving problems so what experiences become ducking awesome and uh watch out for the puns uh they'll get you we're
7:03we're all about the duck puns uh you can post your your favorite duck pun in the uh in the comments we'll read them live on the air all right so why are we rethinking the cloud data warehouse well we have a couple things that we believe and based on those and based on the data we've seen we think that there's some serious
7:21shifts that have gone on and the first one is that we believe that big data was mostly marketing big data is dead uh and if big big data is mostly marketing what does that really mean it means that the way you design a system shouldn't be focused on the scale of your data it should be focused on your developer
7:38experience and if developer experience sounds a lot like Daxter it should it's important right your productivity you're the one using the tool and the more productive you can be the more value you can add for your stakeholders and for your company we also believe that since 2012 laptops have gotten a whole lot more powerful um how can we take advantage of
7:57 that compute that you've already paid for that is zero milliseconds away from you no network latency likewise laptops are powerful well so are servers I can go on AWS I can rent a server with a terabyte of RAM today and you could not do that in 2012 so how would we design a system knowing that a single node is so much more
8:18 powerful now than it used to be so we've got a cool architecture for that and then DuckDB is designed to be a friendly database uh we we believe everybody loves DuckDB and we want to build on top of that and and um add to that with MotherDuck so big data is just marketing well well how do we know this well uh our
8:38 experienced team has has lived it so our CEO Jordan was one of the founding engineers of Google BigQuery and helped helped build it and scale it and at BigQuery 95% of customers have less than one terabyte of data in BigQuery that's not very big data right I can fit that on my laptop uh it's hard to call one terabyte big data these days
9:02 and so really the popularity of BigQuery maybe it's not about the size of data maybe it's about something else maybe it's about developer experience likewise at SingleStore 80% of users of customers of SingleStore used the smallest instance size and they kept asking can we have a smaller one please we just don't need the instance size we
9:22 already have Gartner a16z Andreessen Horowitz they agree as well big data was more marketing than reality so what would we do differently if big data is dead which is a very clickbaity way to say it but we we believe it we would make data easy DuckDB does this very well the analogy we like to use is a hamburger
9:43traditionally the database research Community focused on making an amazing hamburger patty which is a database engine that returns your query very very quickly it does the calculation quickly um but they tended to ignore the full experience it was hard to get data in and out so Import and Export were very slow um and and what that feels like in
10:04 a burger is you know a really great beef patty with a soggy bun you're not going to have a great experience it all needs to work together so how does DuckDB do this DuckDB is really easy to install in Python it's pip install duckdb it's pre-compiled for all the various operating systems and architectures and
10:21 it has no dependencies at all so you can add it to any workflow seamlessly it's like 20 megabytes of a binary so really easy to slide it into every Dagster workflow you've got and um it does work very nicely with things like pandas Arrow and Polars but they're not required uh we can work with them without depending uh on them it also runs
10:43 everywhere runs on your laptop runs in the cloud runs anywhere you'd like to run Dagster and it even runs in your web browser as we'll see with Evidence not only that but you can query data frames directly DuckDB is an in-process database meaning it's running in the same memory space as your host application so all the variables that
11:02 Dagster sees DuckDB sees so if you like data frames but occasionally you want to use some SQL or vice versa you can mix and match throughout your Dagster pipelines so it's really nice to have that seamless interoperability and when you're using SQL DuckDB is truly the friendliest SQL syntax in the world um it's uh they're pushing the dialect in ways that
11:24 simplify it and yet they're also you can use kind of the Postgres dialect standard that's been there for a long time so if you like the standard use the standard if you want to simplify they've got some great enhancements to SQL so how do we build on that at MotherDuck making data warehouses personal we are offering Git-like
11:44 operations for your database so zero-copy clone branching merging and I'll talk more about that on the next slide we're also serverless which means that you only pay for exactly the CPU and storage that you use not the whole time that our instance is running literally just the CPU cycles that you use and this is great if say you have a a big
12:06 data job in Dagster and you have a very sporadic BI workload maybe powered by Evidence uh throughout the day you don't want to keep a large instance running all day you don't want to size your instance as really large for your BI workload this is going to dynamically adjust to exactly what you need for the workload that you
12:25 have we can also even pull from Postgres or MySQL directly with DBT alone this is a really great on-ramp as you're getting started it's also great if you have a simple data environment um fundamentally at MotherDuck we believe that there's a large swath of people that um that aren't really well served by the existing cloud data warehouses um and a
12:46lot of those folks have a simpler environment and they just want to get up and running uh and have things be easy as things get more complicated bringing in dagster bringing in other um ETL or El tools make sense but we want to keep it easy to to start fourth we're very focused on local development powered by dctb we'll see
13:06 that we're even pushing the envelope here with the what we call our dual engine uh setup but local development by itself just makes developers so much more productive you know I love being able to do things in my local environment uh not having to spin up a cloud instance and having full control over that so how about those Git operations
13:27 so what does that look like well today uh we have the ability to do a zero-copy clone of any existing database so if you want to run a development pipeline first step copy production it happens instantly and then you can run your entire pipeline in an isolated environment which really helps we've also laid the foundation though for
13:49having full branch merge uh semantics where I can actually Branch a database make some changes and then merge it back and the nice thing about that is I don't have to rerun my job when I deployed a production my deployment step is literally just replace production with Dev uh for for instant deployment to production um we're also going to
14:08 support full time travel where you can go back to a specific point in time and recreate your database uh no more oh shoot where'd my data go right just wind back the clock so this is really about the database experience this is not about that burger patty of how fast can I get it done it's how easy is my job when I
14:27 use this database how much value can I add for my company so our other tenet is that if local compute is powerful well let's move our work locally so we're going to peel back the covers we're going to look under the hood a little bit of MotherDuck and we call this dual engine execution and we'll look at kind of
14:46 three ways the first way is kind of a traditional if you're processing your data all on the cloud so whenever you connect to MotherDuck you use the DuckDB driver all you need to do is change your connection string and we will auto download the MotherDuck extension and you'll connect to the MotherDuck cloud what that means is though whenever you
15:02 connect you're using that DuckDB driver and what's neat about DuckDB is the entire engine is inside the driver your ODBC your JDBC your Python client they have the whole engine right in there which means we now have two two places where DuckDB is running wherever you connect it from and our server and we want to use both so the first step
15:22whenever you connect we'll go ahead and download our catalog so where are your tables where are your columns and when you run your SQL state we can figure out where does this need to run and if it needs to run in the cloud we'll send that query plan over to the cloud it'll run and bring you your data
15:38back however by having that catalog local by having that whole engine local some queries can run entirely locally and so you can see this in evidence for example when you're you're filtering and slicing and dicing your visuals that can be entirely local we don't have to wait for that Network latency to go all the way back and forth to the server so you
15:56can get instant um results it's all also works really well in a pipeline so let's say your Daxter pipeline is to pull in some new data that you want to then upload to your central warehouse well you're probably going to want to check on that data first you're probably going to want to test it and if your tests have to be run
16:14in your cloud data warehouse you're going to really watch the costs there right you be very careful about how many tests you do on your data if your tests are entirely local and they run instantly maybe you'll test the heck out of that data before you put it in the cloud that's another case where that local horsepower is really helpful
16:29it also works great for caching let's say I have a result set and I want to play with it I want to do a lot of data science work I want to have a dagster pipeline to take my raw data and turn it into intelligence right you can do that postprocessing locally rather than using Cloud resources and even nicer than that you
16:46can actually run one SQL query partially in both places so let's say you care about a data frame in your local environment and you want to combine it with a lookup table in the cloud what we'll do is we'll check which of those tables is bigger and we'll push it to the right node and do the join so you can combine local and
17:04remote data even in the same query uh seamlessly so this is really exciting we want to lean into this and make it be really intelligent a lot of cool potential here all right servers are also enormous
17:18so what should we do differently well we should scale up because scaling up uh works much better now than in the past so let's take a brief look back to the back to 2010 one of the most influential papers about Data Systems design was the Dremel paper from Google and it used 3,000 nodes in concert to answer one uh one analysis fast forward
17:42 to 2024 you can rent a single node on AWS of a pretty beefy size let's compare it let's see how how that compares to 3,000 2010 nodes well turns out that one node has much more CPU power much more RAM power more SSD cache space more RAM bandwidth more SSD bandwidth the only spot where one node isn't better than
18:04 3,000 is network bandwidth and that's about to change with 800 gigabit per second ethernet so one node can be equivalent to 3,000 except for that last bar there the cost where it's much lower so scaling up suddenly becomes incredibly viable and this Dremel paper helped design a database uh sorry a decade's worth of database systems so maybe
18:27it's time to turn the page another area where if the cloud can scale up how would we behave differently in this case the status quo is that you kind of have to manage how you bin pack your workloads so you'll probably start out putting everybody on one instance and then you'll turn the knob to make that instance as big as you can and at
18:47some point one customer is going to get much bigger and they'll be um you know too big to fit on that instance and then you're going to have to separate them out into a separate instance and that might happen maybe maybe through top five customers maybe you're top 10 suddenly you're managing a fleet of instances um how can you rebalance how
19:04do you resize it gets complicated with mother duck we're going to do that on your behalf by being a serverless um platform we will create uh exactly the runtime that you need and we will do the bin packing on our side and we'll even do bin packing with other customers so that'll give us uh the scale that we
19:22need the efficienc efficiency that we need to really provide you a lot of value um plus we'll give you statistics by user how expensive is that one user that you know does the full refresh how about that one pipeline maybe maybe you you take a look there and you use you know some incremental features in in your
19:39 pipeline so um having that individual user attribution is very helpful all right what is it that people love about DuckDB and and how can we enhance that first of all this is a bit of a vanity metric but we want to kind of make the case that we think DuckDB is is well-loved right so this is our our
19:58 GitHub star history of DuckDB uh and we love the shape of that trajectory we're above 16,000 stars now and this is how DuckDB pitches uh the DuckDB open source library uh from its website um it's simple to install and deploy it's portable and runs everywhere it's feature-rich it's got an amazing SQL dialect you can even read and write Parquet
20:20 and JSON even back and forth to S3 and then all the way down here at bullet point number four is that it's fast that's aligned with how we think right developer experience is more than just performance um but don't be mistaken uh DuckDB is in first place in several different industry standard benchmarks so it has the speed but it
20:40 also has the developer experience it is extensible as well that's how we work at MotherDuck we're an extension and that allows us to keep uh riding the great improvements that the DuckDB team uh over at DuckDB Labs and in the community are making and uh stay uh on board as uh DuckDB rides that wave and of course DuckDB is open
21:00 source you're welcome to use it uh wherever you'd like so any Dagster pipeline you can you can have DuckDB right in there it's MIT licensed how do we build on that with MotherDuck um one of the fundamental limitations of DuckDB is that it's primarily a single player database you can't access it from multiple processes um so in Dagster it's very
21:21 common you want to spin up multiple processes while you're orchestrating with uh open source DuckDB you can't have those all communicating together at the same time however with MotherDuck you absolutely can and um that really allows it to to scale um in complexity you can also share it with your team that way we have some sharing features that we've added
21:42 in MotherDuck allows you to really easily share your data and because we're in the cloud you can scale with the cloud you're no longer limited to your to your uh wherever you are currently running DuckDB and then DuckDB runs everywhere which means you can run it throughout your stack we like to jokingly call it the modern duck stack
21:59 duck puns will never end don't worry uh we've got a lot more where that came from so your ELT for example can can use DuckDB in your Dagster pipelines you could even use those Postgres or MySQL scanners if you'd like um and there's other industry tools that do use DuckDB under the hood for ELT as well we've got
22:16 a DBT adapter that can be DuckDB and even your BI can be DuckDB like with Evidence so what are the benefits there right let's say you've got a dashboard and someone asks you to change something and you change it and then and five other people look at it and say wow I like that I want that too well you could
22:33take that same SQL code put it in your DBT model copy paste and then hey wow
22:39that that got to be a really big query maybe we should process it when we ingest it rather than ingest all the raw data and then process it well copy paste back to your elt so same dialect throughout your entire stack uh again all about the developer experience okay what becomes ducking awesome well you get to explore your data at the
23:00speed of thought if you have your data in that local engine it's a zero turnaround time to to keep digging into your data um it means that you can quickly iterate through um your pipelines and then your as a data team you can develop locally and push to the cloud especially combined with um daxter's Branch deployments it's really powerful one two
23:21combination to make that really seamless and then at your data apps if you have a data app use case you get incredible speed and interactivity this is getting back to that in browser idea so when we're in browser you have a database in the cloud and a database in your browser so what does that mean it means that once your data is in the
23:41browser you can have 60 frames per second interactive experience but you don't have to send all your data to the browser you can do some processing on the server before you send it so it's a really great combination there as well uh it's going to be faster cheaper better than what's available today so let's take a quick look at that
23:59let me show you a demo I will share my
24:06screen all right so this is a mother duck demo this is looking at 10 million rows of data uh of flight data so the top chart is looking at how delayed your flight is uh and then the the middle chart is what time it takes off in the 24-hour day and then how far that flight was uh and the interactivity here is
24:25quite quick so if I want to look at the the flights that were delayed the most those took off later if I want to instantly move and look at flights that were early those are are not at the end of the day so but you can seamlessly move and you can explore your data faster than your ey can discern right
24:42that's not possible if you have a server round trip it's not possible if one query takes 5 Seconds um so these types of experiences that you can provide for your customers for your stakeholders um are made possible by having a a database on the server and a database in in the browser let me jump back to the
25:06slides all right well thank you so much for for diving into mother duck with me feel free to ask your questions in the chat I will get to those in the Q&A and I'd like to turn it back over to to Colton to to do a project walkthrough awesome thanks so much Alex those features really are pretty incredible um
25:22 as I mentioned earlier in the presentation we're providing a full stack example of how you might use these tools and I wanted to briefly go through some code snippets of that project um you can clone a repo with the code by going to this bitly URL or scanning the QR code you're welcome to open an issue or a pull request on that repo or if you
25:44have any questions um feel free to message me on the dagster slack or on GitHub directly um but we hope you find it interesting so here's an overview of the pipeline that this project builds what we're looking at here is the global asset lineage that you would see in dagster and this is just an overview of what like the development pipeline would
26:06look like so we're looking at um Cornell's project feeder watch data set we're uh pulling in that raw CSV data we're preparing it with duck DB and doing some Transformations with DBT and you can see we have assets for those DBT models as well and then finally we're building a report with evidence as the last step and through with that as you
26:29 define these assets and integrate with DBT you can get a full lineage of how everything pieces together when we go to production funny enough the code is identical um thanks to the feature parity of DuckDB and MotherDuck it's the same exact code you just have to change an environment variable and we'll show you how that's
26:46 done as well but now you can see that now that we load the CSV file it's being loaded into MotherDuck and those transformations are happening with MotherDuck as well and you can get that full performance of the cloud uh this data set is I think around 12 gigabytes so it could work just fine locally um but you can imagine how if
27:06 something were to scale up or you start to get hundreds of gigabytes you wouldn't want to run that entirely locally and you would be better off leveraging something like MotherDuck and their query engine this is what we'll be producing um this is a snippet of some of the dashboards that we've created with Evidence and you can see just how slick
27:26these uh visualizations are so we we're looking at a calendar heat map of duck observations and then a state heat map of duck observations um you can see there's a bunch of white space there in the middle and that's because this Cornell's project feeder watch is seasonal where it goes from Winter to Spring um and then you can see we get a lot of a lot
27:48of observations of ducks down in Florida I don't know if that's because there are more Ducks down there or just people who are interested in observing Ducks but that's kind of funny um um and this is the project structure um it's the text might be a little bit small but you can see it's comprised primarily of three
28:04 things on line seven we have our DBT project and that's where all of our transformations are defined on line 16 we have a folder and that's our Dagster project so we have some files in there like assets.py resources.py and I'll talk a little bit more about what those are in the following slides and then finally on line 23 you can see we have
28:24 our Evidence project where our reports are defined we also included a Makefile in this project with some directives for easily installing dependencies and building uh your project and uh dashboards cool so now let's go over the code a little bit uh this is going to be a high level overview I don't expect people to understand everything but I
28:45think it's good to get this overview before you look at the code in the repo itself so you know what to look for awesome so first things first uh I'm going to talk about a dagster concept here called resources and resources really are the fundamental building blocks of building a pipeline with dagster and it happens to be how we
29:04connect to mother duck and duck DB so we Define this duck DB resource um and we pass it in an environment variable and this is a wrapper around the duck DB client and because we're defining it as a resource we can then use it throughout all of our assets in our codebase and this is a really nice way to build
29:22 reusable code that you can share um in the in the snippet below you can see our environment file where we specify the DuckDB database and this really is where the magic happens in this project demo so this DuckDB database can either be a local DuckDB file or it could be a MotherDuck connection string and that
29:42 really is the only thing you have to change and then your Dagster project will be either running on a local DuckDB file or MotherDuck same with your um DBT project and Evidence now I kind of want to talk about how we build assets and the key takeaway here I don't expect you to fully understand all the code but it's
30:02 just Python code um we're defining a function that takes in a couple parameters and we throw an asset decorator on top of it and that's how you uh register an asset within Dagster and as you can see we're passing in a parameter to this function called duckdb and that is that DuckDB resource we defined earlier um you register your
30:23 resources and your assets as a definition and then you will have these resources globally available um they're they're injected into all the assets but other than that when you start to look at the uh code within the body of the function it's just like any other Python code you would uh expect where we get a DuckDB connection we're loading that
30:42 CSV file using the read_csv_auto uh function and then we're querying DuckDB again to get some metrics that are then returned from the function um one thing that I want you to take away from this is that the pythonic nature of Dagster pipelines really lends them to that local development flow and writing unit tests you could imagine that you could
31:05mock this duck DB resource and then write very thorough unit tests for all of your assets in your pipeline um and then finally you see that we're returning a lot of metadata related to this asset as well and that eventually gets propagated into the dagster UI which I'll show you on the next slide yeah so so we're looking at the
31:24global asset lineage again and I've selected that species asset and the text is pretty small but on the right you can see we get to see the number of rows um and along with other metadata associated with this asset and also we've uh built some visualizations of that that metadata as well so you can kind of see
31:41how things change over time and that's all out of the box with these assets um dagster plus actually just launched yesterday and we we've offered even more features around this metadata including a full data cataloging solution within dagster so that you can easily go and uh find what where your data is through this metadata it's really really cool
32:02and Powerful cool so that was our assets that we Define just using python code but I also wanted to briefly talk about how we're integrating with our DBT project uh we have a really great DBT integration that's able to extract all of the models from your project and visualize those alongside your other assets um within your Global asset
32:23lineage so you can see um all we're defining here above is using this DBT assets decorator that was a part of our integration we pass our manifest file for our DBT project and that's it um you see on the right our models are just as you would Define any other model using DBT but then on the left you can see how
32:42 it all pieces together and you can define upstream and downstream dependencies on these DBT models which is really powerful uh for these DBT models we have like an all birds model that is then referenced by these um more derived top ducks by region top ducks by state and you can see that lineage is actually represented in the Dagster
33:06 UI uh I just wanted to show you a quick snippet of the SQL for one of these DBT models just showing you that it really is just a normal DBT model we're not um requiring any front matter or some kind of metadata definition to determine this lineage of the the DBT model it's just a normal reference to
33:23 like the upstream all birds model cool so that was a very fast overview of Dagster and how you would define an asset and use resources um I encourage you to check out the quickstart for Dagster and the Dagster documentation for more details there you're also welcome to message me with any questions um but now I wanted to talk a little bit
33:42 about Evidence um so Evidence powered by DuckDB is this BI visualization tool and what it does is it has connectors for all sorts of databases and warehouses Snowflake MotherDuck DuckDB Google Sheets I feel like if you name it they have connectors um and what it does is it caches the data that's relevant to your visualizations as parquet files and
34:07then uses the duck DB web assembly library to then visualize that here's an example of how you would Define a visualization using evidence and it's just markdown this was one of the coolest things I think of this uh demonstration project is how easy it was to get these very slick visualizations using evidence so you can see we have on
34:30line 16 through 28 this code block where we're defining SQL and then on line eight we're using this calendar heat map component and we're able to reference the SQL that's defined below and then you can use the columns from that query in the component itself and then on line four you can even see you can embed like
34:50 values or smaller components directly in your markdown as well and this is all it takes to produce a really elegant um dashboard with markdown I also wanted to demonstrate that we are able to build this dashboard from Dagster itself um Dagster isn't a CI/CD tool so I don't know if I entirely condone this practice but I thought it
35:13 was really cool just to show the flexibility of Dagster assets being Python code so you can really do anything um but here we we're calling the npm command that Evidence requires for installing dependencies um pulling those sources locally into parquet files for your dashboards and then building the static HTML and JavaScript that you can then deploy as a website so we were able
35:36to U Define this evidence dashboard asset at the very end of our lineage to then produce this dashboard which I thought was really slick and you can also see in this Asset definition um we specify the dependencies being this DBT Birds um DBT
35:54 asset so that's kind of how we get that lineage going um one thing that I looked at while exploring this feeder watch data set was the least observed ducks and these are some pretty cool ducks uh the Muscovy duck on the far right is absolutely wild looking I've never seen anything like that before in my life so I thought you
36:13guys would like to see it too ducks can be beautiful not just goofy they're also beautiful yeah sorry these are all beautiful Ducks um cool so I wanted to go back to the pipeline just as a review um we didn't cover all of the assets here but we did kind of dive in a little bit in how you would Define them um but
36:35here you can see the overview of the pipeline we're downloading CSV files from Project feederwatch from Cornell University we're uh using some python assets to take that data and load it into either a duck DB or mother duck database we're using DBT to transform it and then we're building an Evidence dashboard and that's it I've included some additional links and
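To make the local-versus-cloud part of that loading step concrete, here is a minimal sketch using the dagster-duckdb resource; the file paths, table name, and environment variable are illustrative. Swapping the database value for an md: connection string is what points the same code at MotherDuck in production:

```python
import os

from dagster import Definitions, asset
from dagster_duckdb import DuckDBResource

@asset
def birds_raw(duckdb: DuckDBResource) -> None:
    with duckdb.get_connection() as con:
        # DuckDB can scan the CSV directly; this is an illustrative local path,
        # not the real Project FeederWatch download location.
        con.execute(
            "CREATE OR REPLACE TABLE birds_raw AS "
            "SELECT * FROM read_csv_auto('data/feederwatch.csv')"
        )

defs = Definitions(
    assets=[birds_raw],
    resources={
        # "local.duckdb" during development; an "md:" MotherDuck string in production.
        "duckdb": DuckDBResource(database=os.getenv("DUCKDB_DATABASE", "local.duckdb")),
    },
)
```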
I've included some additional links and resources for you all: there's a DuckDB integration for Dagster, there's an Evidence integration from MotherDuck, and you can see they all fit together very nicely and all have integrations for one another. I also included a link to the Evidence Core Concepts page, as well as Project FeederWatch if you want to start observing some ducks yourself, and a link to the MotherDuck docs themselves. Awesome, thank you so much, and now I'm going to open it up to some questions.
Thanks for the existing great questions in the chat as well, but yeah, happy to answer any other ones you have. Interesting, we see this question; oh, you already answered Tim's question about the timing out. I don't know if you want to reiterate anything, Alex?

Sure, yeah, I'll wind it back a little bit. A couple of great questions, thank you. We had some questions around the number of ducklings in MotherDuck and how you specify the number of ducklings. The way to think about it is that each user gets their own duckling, their own environment, and you can size that however you'd like. Today we're in beta, and we're excited to be going to general availability before long, but while we're in beta that's something where you send me an email and I'll bump up your instance size. Starting in just a couple of weeks you'll be able to turn a knob to bump up the size for each individual user, so you can control the horsepower they're allowed to have, and each user can have a separate amount of horsepower. That works really well if you want one user for your bulk data load job and a separate user for your BI workload, and each user of your BI tool can have their own user. So it's really a different model that we think scales well: as you add users, they just get their own duckling, and it scales as your usage scales.
Let's see, I do see this one question from Ali about an example of using a partitioned asset in Dagster. This demo project doesn't do any partitioning, but Dagster does have a very featureful partitioning concept where you can have time-based partitions or even categorical partitions. We have some examples of that in our documentation that I believe you could find pretty easily through our search, but this project doesn't do that.
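For anyone looking for a starting point, here is a minimal sketch of a daily-partitioned asset; the asset name and start date are made up and this is not part of the demo project:

```python
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

daily = DailyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=daily)
def daily_observations(context: AssetExecutionContext) -> None:
    # Each materialization handles exactly one day's slice of the data.
    day = context.partition_key  # e.g. "2024-01-15"
    context.log.info(f"Loading observations for {day}")
```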
Awesome, well, Gabriel has a great question; could be Gabrielle, but thank you, Gabriel or Gabrielle. How would you go about connecting a MongoDB hosted in Atlas to MotherDuck through Dagster? In terms of the getting-it-into-MotherDuck part, once it's in a format that DuckDB can read, you can upload it directly to MotherDuck through DuckDB with just an insert statement. So basically, if you can read it into DuckDB, you can insert it that way. The step I know less about is how to get it out of MongoDB into that form, but if you can convert it into a data frame of really any variety, Apache Arrow or Polars or pandas for example, that is a good way to do it.
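As a rough sketch of that hand-off, assuming the documents have already been fetched from Atlas into Python (the sample records, table name, and MotherDuck database name below are all hypothetical, and MotherDuck authentication is assumed to be configured):

```python
import duckdb
import pandas as pd

# Stand-in for documents fetched from MongoDB Atlas, e.g. via pymongo.
documents = [
    {"species": "Muscovy Duck", "count": 2},
    {"species": "Wood Duck", "count": 5},
]
df = pd.DataFrame(documents)

con = duckdb.connect("md:birds")  # "md:" prefix connects to a MotherDuck database
# DuckDB's Python client can scan local DataFrames by name, so a plain
# CREATE TABLE ... AS SELECT uploads the frame into MotherDuck.
con.execute("CREATE OR REPLACE TABLE mongo_birds AS SELECT * FROM df")
```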
What would you add? Yeah, that's a good point. I'm not sure whether it's supported by our Sling integration, but we do have an integration with Sling that's meant for database replication; I'm not sure if MongoDB is supported, I think it might be, but I'd advise you to check that out. Then you would just have to define some YAML for the sources and sinks, and it would handle incremental loads and full loads for you. Yeah, I suspect there are some existing tools out there you could use that have integrations with this ecosystem, and Sling does, so right on. Colton tossed the link in there to the connectors they have; there's a MongoDB one, a DuckDB one, and a MotherDuck one. Oh, that's awesome, yeah, you should check that out. And I've tested the Sling connector, that was me, I know it works, I've done it, so give it a shot and let us know.
Thomas asked: so one decorator dynamically composes the lineage from the DAG; what happens if you want to customize things, like whether tests are blocking downstream models? I presume this is specifically about our dbt integration. We have this concept of a dbt translator, a class that you implement where you can have conditional behavior depending on a specific model, and you could use that to customize things like the tests that run. You could also use your dbt selectors to break things down into different subsections, so you could have, say, a tag on your dbt models that runs as one group of assets, all blocked by some downstream test. You should be able to leverage the upstream and downstream dependencies of your assets in your dbt model selections so that they block before anything happens further downstream. You're totally welcome to hop on the dbt integration Slack channel and we can assist you more if you're curious about that.
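A hedged sketch of what that splitting could look like with dbt selectors in dagster-dbt; the tag names and manifest path are illustrative, and a custom DagsterDbtTranslator (not shown) is the hook for per-model customization:

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

DBT_MANIFEST = Path("dbt_project", "target", "manifest.json")  # illustrative

# Models tagged "staging" in dbt become one group of assets ...
@dbt_assets(manifest=DBT_MANIFEST, select="tag:staging")
def staging_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

# ... and models tagged "reporting" become a separate group; their ordering
# relative to the staging models still comes from the refs in the dbt manifest.
@dbt_assets(manifest=DBT_MANIFEST, select="tag:reporting")
def reporting_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
```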
Cool, well, we can just trade back and forth. Ed Ruiz had a great question about how MotherDuck decides to use local versus cloud compute: does it depend on specific queries, and does it require data to be in the cloud? That's a really great question. Today we have all the infrastructure set up to make some very interesting decisions, and we've got a couple of heuristics in place, so the future is very exciting. What we have today looks mostly at data locality: if your data is already local, we do our best to keep it local; if your data is already in the cloud, we'll do our best to keep it in the cloud, unless you deliberately bring it down to cache it locally. So today it's mostly that kind of heuristic, plus giving you the capability to decide whether you want to push to the cloud or pull local, so you have some leverage there. What we're working on, and doing some research into, is having this be part of the full cost model of the database: basically, the same way a database decides what type of join to do or the order of operations within your SQL query, that same process will decide where the query should run. So it really is fully automatic, fully behind the scenes, where you won't have to worry about it, and we're looking at things like how fast your network is, how many cores you have on your machine, how busy your machine already is, and how much RAM you have, and balancing the workload based on that. So there are a lot of exciting possibilities moving forward; for now, we put a lot of that power in your hands.
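To illustrate that push-and-pull control, here is a minimal sketch of a hybrid query, assuming a local parquet file and a MotherDuck table with these made-up names; the local file is scanned on your machine while the cloud table is scanned in MotherDuck:

```python
import duckdb

con = duckdb.connect("md:birds")  # "birds" is an illustrative MotherDuck database
con.execute("""
    -- Local file joined with a cloud table in a single SQL statement.
    SELECT r.region_name, count(*) AS sightings
    FROM 'local_observations.parquet' AS o
    JOIN birds.regions AS r USING (region_id)
    GROUP BY 1
    ORDER BY sightings DESC
""")
```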
Awesome, I see one more Dagster question. Ali asked: shouldn't the best practice be to use the DuckDB resource in an IO manager? It's actually interesting you ask that; we're making an effort to move away from IO managers. We found that the IO manager paradigm didn't fit as well as we'd like into the overall narrative of how we want to define assets, so the best practice right now is actually to use the resource from your asset. That being said, IO managers are also totally a way to do this, but right now our documentation leans toward using the resource directly in your asset. That's a good point, though, and it's good that you brought it up. All right, a question from Thomas O'Neal:
What things are DuckDB and MotherDuck still bad at? Easy questions, thank you, very good one. And what should I not be planning to use MotherDuck for in a year; so what are the long-term challenges with our architecture? That's a very good question. Right off the bat, we are a SQL-oriented relational database for analytics, so if you have a very transactional workload, we're not a perfect fit. Likewise, we have great JSON support and nested structure support, so we think we handle most of the NoSQL use cases, but it depends on how far down the NoSQL path you are, so that's another consideration. We're also taking DuckDB, which is fundamentally a single-player experience, and stretching it into multiplayer. We're very good if you have separate users accessing your BI dashboard, that works amazingly, but if you want to do sub-second ingestion from ten different streams at the same time, that's a stretch for our architecture today. We can do some simultaneous reads in that kind of setup, but we tend to like ingestion as micro-batches, so every couple of seconds you insert a batch; true sub-one-second streaming latencies are not what we're targeting either. If you have a lot of data streaming in, we recommend micro-batching. And for transactional workloads, we like Postgres; we're big fans of Postgres around here. Thank you for the easy question.
And Archie, thank you for chiming in with an answer; you're exactly right, it is not an operational database. I would also say, and that's also a great point, Archie, that we target somewhere in the 1 to 10 terabyte range of data scale in terms of the data you want to query within a single query. It's okay if you have a whole bunch of data stored; we're storing it in the cloud and we can store as much data as you'd like, but if you ever want to run a query on 100 terabytes of data at once, that's not what we're built for. So if you really have petabyte scale, if you're really "web scale" as the old joke goes, that will be a stretch for us. We do think the classic disruptor model is at play here, where we're able to scale up more and more over time and start to eat into larger and larger workloads. But our belief, fundamentally, based on the data we saw at BigQuery and SingleStore, is that there aren't that many petabyte workloads, and even if you have a petabyte workload, you don't want to run your dashboard off a petabyte of data; you're going to pre-process it in chunks using something like Dagster partitioning. That's the way to do it, and that would work pretty well with something like MotherDuck. There's no free lunch: there's no fast and cheap way to process a petabyte of data, you have to do it incrementally. So if your problem doesn't split up very well and you do in fact have a petabyte, that would be hard.
All right, Peter had a question about DuckDB to Power BI. Well, we're just now launching our Power BI connector. It's compatible with DuckDB 0.10.2; MotherDuck right now is in the process of updating to DuckDB 0.10, so the Power BI connector works great with open source DuckDB today, and very soon it'll work great with MotherDuck as well. We're excited about the Power BI capability. It's a Power Query connector, so anything Power Query will work: that's Power BI, Excel, Power Pivot; all should work. Please let us know how that goes; we're actively contributing to that connector, we put that one together,
so we would love your feedback. All right, Tim had a question on micro-batching; that's a great question. I'm not sure we have a perfect set of documentation on micro-batching, but in general there are two sizes to think about: DuckDB's vector size is about 2,000 rows, and a row group size is about 100,000 rows. A vector is the unit we use throughout the engine; for every calculation, say column one times column two, we do that about 2,000 values at a time. The row group size, about 100,000 rows, is how we compress and store things. So if you're doing calculations and processing, somewhere around 2,000 rows or more is going to be efficient, and if you're storing data, it helps to store it in batches of about 100,000. Those are some rules of thumb.
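As a loose illustration of those rules of thumb (the events table, the producer function, and the database name here are all hypothetical), a micro-batching loop might look something like this:

```python
import time

import duckdb
import pyarrow as pa

con = duckdb.connect("md:birds")  # illustrative MotherDuck database

while True:
    # collect_events() is a hypothetical producer returning a list of dicts;
    # batching rows up (ideally toward ~100k per insert) beats row-at-a-time writes.
    batch = pa.Table.from_pylist(collect_events())
    # DuckDB's Python client can scan the local Arrow table by name.
    con.execute("INSERT INTO events SELECT * FROM batch")
    time.sleep(2)  # insert a batch every couple of seconds instead of per event
```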
All right, other great questions. Paul Simmerling asks: do you plan to make MotherDuck available with storage and compute in the EU? Yes, we absolutely do. We recognize that's important, and data sovereignty is important. Today we are in us-east-1 on AWS, so that's the best place to put your S3 buckets for loading into MotherDuck, but we do have it on our roadmap to scale across the world to various regions, and we're excited about it. Thank you for the very good questions. Any other questions?
All right, last chance. Yeah, we need the Jeopardy theme song, always need it; we'll get it with quacks, it won't be annoying at all. Cool, well, that might be it. Well, yeah, thank you. Oh, one more question, a GPU question; I did miss the GPU question, I promise I'm not avoiding it. Will DuckDB support GPU processing? That's a good question. It's not on our near-term roadmap, or really our long-term roadmap. The reason is that the types of workloads databases have really stretch what GPUs are capable of. GPUs want repetitive calculations of the same type that are predictable and don't have a lot of branching, whereas a lot of things in databases are very heterogeneous and very variable, and that tends not to work well on a GPU. There are also bandwidth limitations in getting data over to the GPU, and the memory available on a GPU is much less than what you have on a two-terabytes-of-RAM server. So we keep our eye on that space; a huge fraction of the company are big database nerds, that's me, I love being a database nerd. I'm not writing the database code, but I'm a fan. We watch that space very carefully, and if there is some kind of switch that flips where there's a huge differentiator, we'll look at it, but today there isn't really a big benefit of GPUs for databases that we can see. I've asked that exact question of the founders of DuckDB before, and they gave me a similar answer, so it's a very valid question, thank you for asking.
And the reminder, really appreciate that. Yeah, if anyone else has any questions that we missed, ping us again, please; any more buzzer beaters? And if you have any questions that you think of later, you're totally welcome to hop on either of our respective Slacks or GitHub discussions, and we'll be very responsive there. Yes, I think the MotherDuck Community Slack is our best bet, and you can also send us an email at support@motherduck.com, and we're happy to chat. Thank you for the great questions, folks; it's really great to have such great questions, and great to meet you all virtually. Stay in touch. Yeah, thanks everyone, hope you learned something new and interesting. Welcome to the flock.
FAQs
How do you use Dagster with MotherDuck for local development and cloud production?
Dagster integrates with MotherDuck through a DuckDB resource that accepts a connection string via an environment variable. For local development, point it to a local .db file; for production, change the environment variable to a MotherDuck connection string (md:). The code is identical in both environments because DuckDB and MotherDuck share the same SQL engine and have feature parity. Dagster's asset-based paradigm lets you define your data pipeline as nouns (tables, models) rather than tasks, giving you full lineage visualization.
Why is MotherDuck better than traditional cloud data warehouses for development workflows?
MotherDuck's founders observed that 95% of BigQuery customers have less than 1 terabyte of data, and most Snowflake users run the smallest instance size. MotherDuck is built for this reality with a developer-experience-first approach: DuckDB is a 20MB binary that installs with pip install duckdb, requires no dependencies, and runs everywhere. Its serverless pricing means you only pay for CPU cycles used, and its dual execution engine lets you test data locally for free before pushing to the cloud.
How does MotherDuck's dual execution engine work in data pipelines?
When you connect to MotherDuck, the full DuckDB engine runs both on your local client and in the cloud. MotherDuck downloads the catalog locally and determines where each query should execute based on data locality. Queries on local data run instantly without network latency, cloud queries use server compute, and you can even join local and remote data in a single SQL statement. This is especially useful for data testing: running tests locally is free and instant, which encourages more thorough validation before pushing to production.
How do you build dashboards with Evidence and MotherDuck in a Dagster pipeline?
Evidence is a BI tool powered by DuckDB WebAssembly that lets you build dashboards in markdown with embedded SQL and visualization components. In a Dagster pipeline, you can define an Evidence dashboard as a downstream asset that depends on your dbt models. Dagster executes the Evidence build steps (npm install, source pulling, and static HTML generation) as part of the asset materialization. The result is a deployable static website with interactive visualizations like calendar heatmaps and geographic choropleth maps.
What is MotherDuck's approach to scaling and multi-tenancy?
MotherDuck uses a scale-up architecture rather than scale-out, based on the observation that a single modern cloud node can match or exceed the power of 3,000 nodes from 2010 (when systems like Google Dremel were designed). Each user gets their own dedicated DuckDB instance (called a "duckling") that can be independently sized. This eliminates resource contention between users and gives per-user cost attribution. MotherDuck handles bin-packing and resource allocation serverlessly, so you do not need to manage instance fleets.