Why and How we integrated DuckDB & MotherDuck with GoodData
2024/06/07
TL;DR: GoodData integrated DuckDB and MotherDuck into its analytics platform to deliver sub-second query performance, eliminate spinning wheels in BI tools, and enable data federation across multiple sources.
The Problem: Spinning Wheels in BI Tools
Jan Soubusta from GoodData opens with a bold claim: stop accepting spinning wheels in your BI tools. The traditional architecture has fundamental problems:
- Performance: Every user action triggers a SQL query to your data warehouse
- Cost: Hundreds of users querying Snowflake 24/7 means it never hibernates, which gets expensive
- Developer experience: Most BI tools lack APIs, SDKs, and proper CI/CD support
GoodData's Evolution
- GoodData 1.0: Legacy end-to-end solution with proprietary data pipelines (customers had to move data to GoodData)
- GoodData 2.0: Runs on top of your data warehouse (Snowflake, MotherDuck) with modern developer experience
But even with GoodData 2.0 on Snowflake, latency and concurrency remained challenging.
The Analytics Lake Architecture
GoodData built two components:
1. Semantic Layer
- Describes data with business context (not just table/column names)
- Enables better SQL generation from user requests
- Provides consistent definitions across all consumers (JavaScript, Python, APIs)
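As a rough illustration of what this means in practice, a semantic layer entry attaches business names and descriptions to a messy physical table so the query generator has more to work with. The sketch below is hypothetical: the structure and field names are made up for illustration and are not GoodData's actual declarative format.

# Physical reality: table "t1" with columns "rev_1" and "c_reg"
# Semantic layer: business names, descriptions, and roles for query generation
semantic_dataset = {
    "id": "orders",
    "source_table": "t1",
    "description": "One row per customer order",
    "facts": {
        "revenue": {"column": "rev_1", "description": "Order amount in USD, net of returns"},
    },
    "attributes": {
        "region": {"column": "c_reg", "description": "Sales region of the billing address"},
    },
}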
2. Flex Query Engine
A high-performance execution layer built on:
- Apache Arrow: Efficient in-memory data format
- Apache Arrow Flight: Fast data exchange over the network
- DuckDB: Query execution engine
- Custom caching: Results cached as Arrow tables, reused across queries
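The DuckDB-on-Arrow part is easy to see in plain Python. This is not the Flex Query engine itself, just a minimal sketch of the zero-copy idea it relies on: DuckDB can scan an in-memory Arrow table directly, without importing it first.

import duckdb
import pyarrow as pa

# A cached result represented as an in-memory Arrow table
orders = pa.table({"region": ["EU", "US", "EU"], "amount": [120.0, 80.0, 200.0]})

con = duckdb.connect()
# DuckDB's Python client resolves "orders" to the Arrow table in scope and
# scans it in place, without copying it into DuckDB storage
result = con.execute(
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region"
).arrow()
print(result)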
How It Works
- User requests data through semantic layer
- Flex Query fetches data from sources (MotherDuck, Snowflake, S3)
- Data cached as Arrow tables
- DuckDB processes queries on Arrow tables (zero-copy)
- Results returned in milliseconds
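A consumer talking to such an engine looks roughly like a plain Arrow Flight client. This is only a sketch under assumptions: the endpoint URL and ticket payload below are invented for illustration and are not the actual Flex Query interface.

import pyarrow.flight as flight

# Hypothetical Flight endpoint standing in for the Flex Query service
client = flight.connect("grpc://flex-query.example.com:17001")

# A ticket identifies a cached or derivable result; this payload is made up
reader = client.do_get(flight.Ticket(b"cache/orders_by_region"))
arrow_table = reader.read_all()  # Arrow record batches stream over gRPC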
Demo: Data Federation
Jan demonstrates querying across multiple sources simultaneously:
- Orders table: Stored in Snowflake (EU region) or local PostgreSQL
- Line items table: Stored in AWS S3
- Flex Query: Pulls from both, caches in Arrow, executes with DuckDB
Results:
- First query (cold cache): ~6 seconds for gigabytes across two sources
- Subsequent queries (warm cache): Sub-second
- Exact cache hit: 10-20 milliseconds
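You can reproduce the flavor of this federation with plain DuckDB, outside GoodData. The sketch below is loosely in the spirit of TPC-H Q4; the bucket path, Postgres connection string, and table names are placeholders, and it assumes S3 credentials are already configured.

import duckdb

con = duckdb.connect()
# httpfs reads Parquet straight from S3; the postgres extension attaches a live database
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=tpch host=localhost user=demo' AS pg (TYPE postgres)")

# Join a Postgres table against Parquet files in S3 in a single query
df = con.execute("""
    SELECT o.o_orderpriority, count(*) AS order_count
    FROM pg.public.orders AS o
    JOIN read_parquet('s3://my-bucket/lineitem/*.parquet') AS l
      ON l.l_orderkey = o.o_orderkey
    WHERE l.l_commitdate < l.l_receiptdate
    GROUP BY o.o_orderpriority
    ORDER BY o.o_orderpriority
""").df()
print(df)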
Developer Experience
GoodData provides multiple integration options:
React SDK
import { Dashboard, Visualization } from "@gooddata/sdk-ui";

// Embed a full dashboard by its identifier
<Dashboard dashboard="my-dashboard" />

// Or build custom visualizations: MyCustomChart is your own component,
// fed by the result of a GoodData execution
<MyCustomChart data={executionResult} />
Python SDK (Jupyter Notebooks)
from gooddata_sdk import GoodDataSdk

# Connect with the host URL and an API token, then compute a stored
# visualization and load the result as a pandas DataFrame
sdk = GoodDataSdk.create(host, token)
df = sdk.compute.for_insight("my_visualization").as_dataframe()
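Because the result is an ordinary pandas DataFrame, it drops straight into the Python ML ecosystem; the talk mentions computing predictions and writing them back behind the semantic layer. A tiny sketch of that loop, with a hypothetical column name and a deliberately naive model:

import numpy as np
from sklearn.linear_model import LinearRegression

# df is the DataFrame computed above; "commit_count" is an illustrative column
X = np.arange(len(df)).reshape(-1, 1)   # time index as the only feature
y = df["commit_count"].to_numpy()
model = LinearRegression().fit(X, y)
forecast = model.predict([[len(df)]])   # naive one-step-ahead prediction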
VR Data Exploration
GoodData even built a VR demo where you can walk through 3D data visualizations using a headset; the same semantic layer and Flex Query power it.
End-to-End Pipeline as Code
Jan shares an open-source data pipeline blueprint demonstrating:
- Meltano: Extract/load from GitHub, Jira, S3
- dbt: Transform data (works with MotherDuck plugin)
- GoodData SDK: Deploy semantic layer and dashboards
- GitHub Actions: CI/CD across dev/staging/prod environments
One pull request can update the entire stack: API changes, column renames, semantic layer, and dashboards.
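Locally, the same stages can be chained with a few commands. This is a hedged sketch only: the extractor/loader names and the GoodData deployment step are illustrative, not the blueprint repository's exact configuration.

import subprocess

def run(cmd: list[str]) -> None:
    # Fail fast so a broken stage stops the pipeline, as CI would
    subprocess.run(cmd, check=True)

run(["meltano", "run", "tap-github", "target-duckdb"])  # extract + load
run(["dbt", "build", "--target", "dev"])                # transform + test
run(["gooddata-dbt", "deploy_models"])                  # hypothetical semantic-layer deploy step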
What's Next
- AI with Semantic Layer: Using RAG to provide LLMs with relevant context for text-to-SQL (more accurate than raw schema)
- Multi-tenancy: End-to-end multi-tenant data pipelines leveraging MotherDuck's upcoming features
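The RAG idea is simple to sketch: embed the semantic layer's descriptions, retrieve only the objects relevant to a question, and send just those to the LLM. Everything below, from the example descriptions to the embedding model and scoring, is illustrative rather than GoodData's implementation.

import numpy as np
from sentence_transformers import SentenceTransformer

# Semantic-layer objects with business descriptions (made-up examples)
objects = [
    "metric revenue: sum of order line amounts, net of returns",
    "attribute region: sales region derived from the billing address",
    "dataset orders: one row per customer order, joined to line items",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
index = model.encode(objects)  # one embedding per description

def relevant_context(question: str, k: int = 2) -> list[str]:
    # Cosine similarity between the question and every description
    q = model.encode([question])[0]
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [objects[i] for i in np.argsort(-scores)[:k]]

# Only these top-k descriptions go into the LLM prompt, instead of the raw schema
print(relevant_context("What was revenue by region last quarter?"))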
Transcript
0:04 And actually, Maddie forgot to mention one very important thing, and that's the MotherDuck and DuckDB community.
0:15 I joined their community Slack more than half a year ago, and the support there is basically incredible. Some of my questions were answered within tens of minutes by the CEO of MotherDuck. That's really
0:36 unusual. ("You underutilized him.") Yeah, I can imagine. So I really recommend joining the community. They will help you, just like they helped me build a concrete integration of not
0:59 only MotherDuck but also DuckDB with GoodData. And that's what my
1:07 presentation will be about. But before we
1:12 start: recently OpenAI released the GPT-4o
1:19 model, so I thought I would create a funny picture combining GoodData, MotherDuck, and DuckDB. It actually didn't work; as you can see, there is no MotherDuck, because GPT-4o's training data sets do not contain MotherDuck yet. That's a pity. So my entry point in this
1:48 presentation is this claim:
1:52 you should wake up from Stockholm
1:57 syndrome. You should no longer accept spinning wheels in your BI
2:05 tools. It's very common in Tableau, in Power BI, in many platforms; actually it's even common in old GoodData, let's be honest. We should stop accepting it,
2:22 and we should upgrade our stacks so
2:26 this problem no longer exists. And it's not only about performance, it's also about money, as always. So once
2:37 you realize that you pay thousands of dollars for your Snowflake just because a BI tool is used by hundreds of people, and they are doing quite simple things there, but still every action in the BI tool means a SQL query in Snowflake, and they are using it 24/7, so Snowflake cannot hibernate, at the end of the month you realize that
3:02 you don't have enough money to pay your Snowflake bills. So this is how we fail. By the way, please raise your hands: who is satisfied with the performance of his or her BI
3:24 tool? Okay, just GoodData hands raised,
3:31 that's cheating. So this is exactly what I already described: we all failed, and not only in the area of performance and cost but also in the area of developer experience. Imagine that you maintain the whole data stack as code, through APIs or SDKs, in the world of Microsoft or in the world of Tableau. No way.
3:59 So what we expect here are APIs and SDKs,
4:05 an as-code approach, and some kind of CI/CD, to be able to test everything and version everything, as Maddie already mentioned, and so on. So here is just a short story of GoodData; let's make it brief. There are
4:24 basically two GoodDatas. The first one: we still have many customers running GoodData 1.0. This is an end-to-end solution including a data pipeline, very legacy; we developed everything in house and customers have to move their data to our premises. GoodData 2.0 we started building four or five years ago, a new GoodData on
4:54 a green field. The main difference: you can run it on top of your data warehouse, like MotherDuck or Snowflake, you do not have to move your data, and the developer experience
5:09 is much better, and the stack is basically more modern and more
5:16 open. But is it enough? Is it enough to run GoodData on top of, let's say, Snowflake or MotherDuck? The problem is that the traditional data warehouse technologies do not provide low enough latency and high enough concurrency for a reasonable cost. So even with the new GoodData running on top of Snowflake, let's say, it's still not good enough. Maybe much better
5:46 with MotherDuck pricing, but still, if you need very low latency and you have to connect to MotherDuck running even in a different AWS region,
6:00 I think it can be better. And that's why we developed something we call the analytics lake. That's just a buzzword, so please don't take
6:15 it that seriously. But what is the
6:22 concept, what is it about? The analytics lake consists of two parts. The first part is responsible for all the physical executions, while the other part is responsible for something we call the semantic layer. Why do we need a semantic layer? Because the physical data model is not enough; there is not enough semantic information about the data. You
6:53 don't know... so first of all, most data models are, how to say it politely, a
7:05 mess. There are tables like T1, T2, and T3, there are columns like
7:13 revenue1, revenue2, and revenue3. That's quite common. So that's why you need to describe the data with more properties, you need to provide
7:26 a good description of what is stored
7:30 in the database. And once you have it, you can generate much better requests for the physical execution, which is typically SQL but can also be something else. In the physical part of the analytics lake we built something we call the Flex Query engine, which is actually GoodData-agnostic, so we are even thinking about
8:00 open-sourcing it. But you know, it's quite expensive to open source something and maintain the community, but I would like to do it right now. And we
8:12 use open technologies in this platform; there is nothing proprietary there. Specifically, we use Apache Arrow and the Apache Arrow Flight framework, which is very efficient for in-memory processing and for data exchange across the network. As an execution engine we use, what a coincidence, DuckDB. And with this framework you can
8:39 very easily, and that's about developer velocity and developer experience, you can very easily build new data services on top of this stack. So you can get the data, process the data with DuckDB, and then you can do whatever you want: you can use pandas or Polars or whatever else, and you can implement use cases like pivoting or
9:01 machine learning or whatever. You implement just the logic you want to implement, and the rest is provided by the
9:15 framework. Okay, so I like architecture diagrams. This is the architecture diagram of the analytics lake, or respectively the whole GoodData analytical platform, and you can see the analytics lake
9:34 responsible for the semantic layer and SQL generation, and then Flex Query responsible for the physical
9:42 executions, connected to data sources like Snowflake or MotherDuck, caching the data in a specialized cache, and post-processing the data in any way. And in the future we would like to allow our customers even to embed their own custom modules; that should definitely be feasible and easy because they can utilize the framework. Okay, I'm quite fast, so we will have more
10:12 time for something real. So let me present the real user, or respectively
10:20 developer, experience, actually powered by DuckDB running on top of MotherDuck.
10:30 First of all, I have a lot of demos here, so I will context-switch between the presentation and the demos, and please stop me if I'm running out of time, Jessica. So first of all you need to get the data into your
10:48 BI tool. You need to crawl the data from multiple sources, you need to transform the data, and you usually need to create something like a star or snowflake schema, clean up the data so there are valid connections and valid primary keys, so that you can then allow
11:10 your business users to work with this clean data safely. So for that
11:19 I prepared something I call the data pipeline blueprint. It's actually an open source repository, and the link will be at the end of this presentation in the form of a QR code. In this open source repository I'm demonstrating several approaches; first of all, let's say, following best
11:42 engineering practices. That's really important if you want to build something that will survive for years and be maintainable. So every definition of extract, load, transform, and the analytics is as code, stored in a version control system like git, and everything is running in multiple environments like dev, staging, and production, and everything in my case is orchestrated by GitHub Actions,
12:16 but you can use any other orchestration tool like Airflow, Dagster, whatever. I would like to extend it with something like that, but you know, there are different priorities. Currently this end-to-end data pipeline, well, until two weeks ago
12:39 it was running only on top of Snowflake, but now it's also running on top of MotherDuck, especially because of one of the latest releases of MotherDuck, when they started supporting backward compatibility for drivers. So now I can keep the same driver in the pipeline and run it forever; I don't need to release a new driver every two
13:03 weeks. That's really important. I'm crawling data from the GitHub API, from AWS S3, and from the Jira
13:14 API. I'm storing this data to Snowflake or MotherDuck, one to one, to tables with JSON columns, and I'm using the Meltano tool for that because it's extremely versatile, meaning you can easily exchange any extractor and any loader, you can combine them freely, because there is a framework in between. So you can change Snowflake to MotherDuck with one line
13:44 of code in the YAML. That's all you need to do. It's really good. Then I
13:50 transform data with dbt; again, there is a dbt plugin for
13:56 MotherDuck and a dbt plugin for Meltano, everything is already prepared by the community. And once the clean output stage model is
14:07 created, I generate the GoodData semantic layer from it, because we have a GoodData dbt plugin, and deliver all analytical
14:17 artifacts with our SDK to GoodData, into multiple environments, meaning in
14:25 GitHub, before merging any changes to
14:29 this pipeline, everything is running against the dev environment; after merge, against staging, where business users can test it; and once it's merged into the production branch, it's delivered to production. And there are also many applications running on top of GoodData, and they are stored in the same repository, so you can deliver a change; like, imagine that GitHub changes its API and
14:59 I need to change some column. So first of all I need to change how I call the GitHub API, then I need to change column names in many tables, then I need to change the semantic layer in GoodData, and then I need to change all dashboards, basically everything. I can do it in one pull request,
15:22 consistently, so everything will work after I merge it. That's really powerful.
15:30 There is a link, so let's try it, if my internet works. So this is the repository; there is a quite comprehensive README with the same picture, actually. And here, I have to look here,
15:48 because... so here is the last, actually
15:52 scheduled, run of the pipeline, because it's not only running when I'm merging something into the repository, it's also scheduled, every day or every hour, whatever, to run on top of staging as well as on top of production. And you can see five Meltano jobs for five different data sets, the dbt and GoodData parts, and all
16:23 these parts are running against Snowflake with dbt Core, against Snowflake with dbt Cloud, and against MotherDuck. Okay, so let's get
16:39 back. So the next user experience I
16:43 would like to show you is standard GoodData. That's boring, so let's be brief. Okay, this panel is
17:02 okay. So this is actually the development
17:06 environment. You can see multiple workspaces, we call them workspaces, for different data products, and I am also splitting workspaces based on which database they run on top of. So let's use the MotherDuck workspace, which is running on top of MotherDuck. This is a GitHub overview; the data are actually crawled from
17:30 our open source repositories, multiple open source repositories. I am not a top contributor, actually; unfortunately I am not developing enough, you know. And then there is a second dashboard, which is called Jocy; we really like this term in GoodData, because Jira is evil,
17:54 right? And because it's running on top of
18:00 the semantic layer, with the clean semantic model, we don't have to be that afraid
18:08 to let business users play with it. So business users can edit the dashboard, they can create a new visualization, like here, and they can, I don't know what, calculate the number of Jira tickets by
18:30 when they were created, and they can filter it, I don't know, to the last year; last 12 months is
18:47 better, and by
18:52 month, and create a column chart, and compare it with the previous year, and name it something like YATC, and save it,
19:07 and save the
19:16 dashboard. Did I do it? Yeah, I did it. Actually this visualization is now different. Why? Because there is a dashboard filter. Dashboarding is really complicated, but I could override it and let the whole history be seen no matter what dashboard filter is set. This is creepy; I mean, dashboarding is really
19:39 creepy. And as you could see, it was really fast. I created a new visualization, meaning I had to contact MotherDuck and calculate the query, and it was almost instant, even though it's running on the other side of the world, in us-east. Okay, so let's continue with something more custom, because many companies
20:07 do not want our UI, because, I don't know why, maybe it's ugly, I don't know. But more often they need something custom because they need to embed the data into their business applications. Let's say you are in a bank and you are responsible for loans, and you
20:32 need data to decide whether you can approve the loan or not, and there is some custom application written in JavaScript 20 years ago and you need to embed this data as easily as possible. That's actually possible with our SDK, the JavaScript SDK, specifically the React SDK, but we also support other frameworks,
20:59 and you can embed... so let me show the application, if it still works. I don't know if it still works, but it seems that yes. Okay, so first of all you can embed,
21:18 and it's slow because my computer is slow, not because the backend is slow, so first of all you can embed the whole dashboard into any application you want. That's easy. But you can also embed a single visualization, or you can even embed raw data like a table, or you can easily build your
21:43 own visualization. This visualization actually doesn't exist in GoodData; I built it alone even though I am
21:53 not a JavaScript developer. This is quite small, actually, so you most likely can't see it, but this part of the code is everything you
22:09 need to do to build a custom visualization. Let me just show you. These data are from the Federal Aviation Administration of the United States, about all flights in the United States. So let me change the from and
22:34 to, and store it, and we should see a different visualization in a while, after it's compiled. Yes, it works. So instead of
22:48 FAA regions, we now analyze by which
22:54 aircraft manufacturer,
23:01 so we are analyzing the number of
23:06 flights by manufacturer, like Boeing,
23:11 and by the region in the United States,
23:15 like Southwest. So you can see that most flights are operated by Boeing and
23:23 in the Southwest.
23:29 Okay, that was still quite standard, so let's make it even more creepy. What can you do with our JavaScript SDK? Circa two years ago we had
23:43 a hackathon here in GoodData, and because I am not a JavaScript developer, as I have already mentioned, I invited my colleague Dan Homola, who is an expert JavaScript developer, and together we tried to build a data visualization in virtual reality. So this is a picture; it's actually running, it's delivered
24:11 from the same pipeline to render.com's free
24:19 offering. And as you can see, hopefully, yeah, it still works. As you can see, you can explore data in three dimensions with a virtual reality headset. The code is open source, the application is public, feel free to open it in your virtual reality headset. It's very different from when you see it here; I mean, you can walk
24:48 physically through the data, you can even fly with the thumbsticks. It's quite interesting and a completely different way to view the
25:03 data. Okay, but as I mentioned, I'm not
25:08 a JavaScript fan, I'm more of a Python fan,
25:13 actually, because we build a lot of things in Python. That's the second language in our platform, and it's a language which is very often used for data analysis. But doing something like that locally with CSV files, well, it
25:32 doesn't perform that well, and you don't have the semantic layer, the same source of truth as your JavaScript developers or any other consumers. So we provide a Python SDK connected to our APIs, so you use the same semantic layer, same caching, same everything, but instead of JavaScript you use Python, for example here in a Jupyter notebook. This
26:01 is the notebook, and hopefully it will do something. So first of all you can just list objects in our semantic layer, like metrics, but you can also calculate a report like this, and you can calculate the report from an already stored
26:32 visualization, or you can build a custom definition of the report on the fly, in this case calculating
26:43 the count of commits and the count of pull requests per month and per
26:50 repository. And then once you have the result in a DataFrame, because the result is a DataFrame, you can use, for example,
27:00 some machine learning libraries, calculate something like predictions, store them back to the database, map them to the semantic layer, and the whole circle is closed and you can iterate as you want, on top of the same source of truth. And the last live demo: so far we were talking about running something on top of MotherDuck,
27:26 now we are going to talk about running something on top of pure DuckDB. Flex Query, as I have already mentioned, the engine responsible for physical execution, is partly powered by
27:40 DuckDB. I decided to build a demo of
27:47 data federation. What you can see here is a Streamlit application connected with the Python SDK to our GoodData server... sorry, sorry, you can see a Streamlit application connected directly to the Flex Query engine, not GoodData. The Flex Query engine exposes Arrow Flight RPC and provides libraries for
28:15 connecting to Flex Query and executing anything you want, and one such use case which I demonstrate here is the federation. So in my case... and yeah, I forgot to describe what is here. Does anyone know what is here, what this query is exactly, and
28:44 its number? I think it's four. So to demonstrate the power of Flex Query: there are two tables here, orders and lineitem; these are the biggest tables in the TPC-H benchmark, actually. So to make it complicated I decided to put the smaller table in AWS S3 and the big table in Snowflake,
29:08 respectively in Postgres. Postgres is running locally, Snowflake is in the eu-
29:15 central-1 region, I think. And Flex Query is responsible
29:22 for crawling data from Snowflake or Postgres, crawling data from AWS S3, storing them as Arrow tables in the Flex Query cache, and then opening an in-process
29:37 DuckDB instance and executing exactly this query, not on top of DuckDB tables but on top of the Arrow tables, because you can map Arrow tables into a DuckDB process seamlessly, absolutely seamlessly. And this is the Streamlit application, and to make it more interesting I made it a little bit
30:02 interactive, so what you can do here is change the cache key and bypass all the caching. Now it's running against local Postgres and AWS S3: in 6 seconds you crunch gigabytes of data from two different sources with DuckDB, including a remote
30:26 connection to Postgres and a remote connection to AWS S3. But what if I change the aggregation column? So what happened now? No Postgres was contacted, no AWS S3 was contacted, and instead the caches were reused, because all the columns are already there, and everything was calculated only by DuckDB, and it was much, much faster. I can also change the
31:00 filter, and it's again running much faster than before. And then I can run the same query, and as you can see,
31:11 this is an exact cache hit, so no DuckDB is involved at all, just the GoodData caching is utilized, and the response time is like 10-20 milliseconds max.
31:24 And yeah, I forgot to change it to Snowflake. Let's make it interesting: will it work? Hey, it's still cached... no, it's not cached. So how long will it take, what do you think? The same time, actually, that's interesting. And that's the difference between a row-based database, Postgres, and a column-based database, Snowflake. So
31:57 the duration is the same, but the difference is that Snowflake is
32:04 not on my laptop but in AWS.
32:09 Okay, so am I running out of time?
32:16 Cool, so last slide, just marketing, but developer
32:23 marketing. Coming soon: what
32:28 I am actually currently working on is AI,
32:33 obviously, but we are trying to be careful here. We are going to utilize our semantic layer, directly competing with all the text-
32:45 to-SQL solutions which are
32:50 currently on the market. There are many, many such solutions; even Snowflake invests
32:56 enormous effort into building text-to-SQL. We think it's not a good way; the good way is to utilize the semantic layer, which contains much more context. And to do it in a serious way,
33:10 in production quality, we would like to introduce specific new services inside our platform, and they will basically be responsible for the RAG architecture. So there will be a vector database, we will search the semantic layer, we get only relevant context, context relevant to the user question, and we send it to the LLM, so the answer is more accurate, it's
33:36 faster, and it's much cheaper, because you pay per token, in OpenAI for example. So this is currently what we are working on, and we should release it in Q3, actually; at least our chief of product says so. Okay, so this is just a quick recap:
34:00 truly responsive analytics means the semantic layer, DuckDB, MotherDuck, and Flex Query. And why does it work? Because we use state-of-the-art technologies like DuckDB, which is really state of the art; we are laser-focused on analytics; we don't use one database for everything; we don't use DB clusters, because it's no longer necessary, like Maddie described in his
34:26 presentation; and also because of the competitive pricing, actually. I really recommend that you go to the MotherDuck pricing and compare it with the Snowflake pricing. And what's next in our partnership with MotherDuck and DuckDB? I would like to explore the area of multi-tenancy, because that's our differentiator, GoodData's differentiator: we can serve customers like Visa with
34:57 thousands of workspaces, and delivering something like that at large scale is really complicated, and MotherDuck is going to release features related to multi-tenancy as well. So my idea is to build end-to-end multi-tenancy in the whole data pipeline. And that's all.