DuckDB Experiments: Peeking into the Future of Analytics ft. Christophe Blefari

2023/12/01 Featuring: Christophe Blefari

Christophe Blefari, freelance Data Engineer, discusses his experiments with DuckDB, focusing particularly on DuckDB Wasm and its potential implications for the future of browser-based analytics.

Transcript

0:05 Thank you, everyone, for being here today to listen to a few of the things I have to say.

0:12 So it's going to be a raw presentation and a raw demonstration, because I'm a freelancer and I have no time to do stuff, so I just put blank slides with text on them. The idea of this presentation is just to give you a demonstration

0:30 of a few experiments I've done with DuckDB. People who have been using DuckDB may be saying, "Oh, I already know it," but for the newcomers it's going to be a good overview of a few things that I find magical about DuckDB. Just to put a small spoiler

0:53 and to give you a taste: I'm going to talk about dbt and DuckDB, and about

0:59 WebAssembly, and how with WebAssembly you can do stuff that, to me, is the future of data. So first I will just introduce myself. My name is Christophe Blefari. I'm French, as you might notice from my accent. I'm a staff data engineer, but mainly what I do is build stuff with data. I do software

1:22 engineering, and I build systems that use data for people who want to use data. I'm a freelancer; I mainly freelance with clients in France, even though I've lived in Berlin since last

1:39 year. The reason is just that I have clients there, but if people want to work with me, you can chat with me. Among the clients I've already worked

1:51 with, you might know some: I worked with BlaBlaCar, which is a carpooling company. I also worked with Qonto,

1:59 which is a bank, B2B, a bank for companies.

2:06 And for the last year I've been working with the French Ministry of Education, where I'm trying to rework the way they manage datasets. At the moment the datasets are managed as CSVs on every disk in the

2:20 ministry, and the idea is to put in on-premise blob storage with Parquet everywhere. DuckDB can enter this space because we want people to be able to read Parquet from everywhere, so it can be a good solution for us. I also do content creation: I have a weekly

2:40 newsletter, and I also post on my own blog, blef.fr. You can subscribe to the newsletter or just read stuff there. And personally I like running and biking, and I also play video games, so if you want to play Valorant, ping me. So the agenda of this presentation will be, first, a small recap of

3:00 what DuckDB is and why you should, or could, use DuckDB. Then I will explain what DuckDB Wasm is, and then I will do the demonstration of the stuff I have to show. So, in three points, what is DuckDB? It's an OLAP in-memory database, which means it runs on a single node. To install DuckDB you only need 25

3:24 megabytes, and it doesn't have any

3:29 external dependencies. This is the "what" but also the "why", actually: having no external dependencies is a good advantage, and you might understand why. The idea of DuckDB is that you write SQL to interact with it: you load data and you just write SQL, and you can do it locally, in the browser, you can do

3:49 it everywhere. Here are just a few small SQL queries taken from the documentation. For instance, in the first line you read a CSV from SQL in DuckDB, so it's a

4:03 way to do a SELECT * on a file. You can, for instance, create a table from this CSV automatically, and then you have a table that is created in the in-memory database that you have loaded, and you can query it if you want. That's just the tip of the iceberg, but that's the magic of

4:25 it. So, for instance, if you want to use DuckDB in a modern data stack, as we all say, this is a diagram from a project called the Fancy Data Stack that I've been working on. The idea was to use the fanciest tools to create a data stack. So if I just put DuckDB in

4:46 situation: in this case DuckDB, or DuckDB/MotherDuck actually, could be used as the warehouse at the center of the stack. I mean more MotherDuck than DuckDB at the center, because if you want to have a warehouse, DuckDB is not really the tool right now to do it, because it

5:06 has locks; concurrency is not as good as it could be on the cloud warehouses we know. So if you want to apply the warehouse concept, it's more MotherDuck that could do it; if you want to be more in a data lake fashion, DuckDB could do it. So

5:25 yeah, that's where you put it. I have a circle, if you don't see it. So why could you, or should you, use DuckDB? To save costs. I think this is

5:38 one of the main arguments. This is a LinkedIn post from someone who is working at Okta in the security team, and mainly what he says is that they had a big data warehouse pipeline, they did stuff with DuckDB, and they saved hundreds of thousands of dollars. That's the point: if you

5:59 want to save money, read this. So that's one reason to use DuckDB, but there are more. The first one might be to embed SQL capabilities everywhere: locally, or remotely on a server, where you just run DuckDB and you can run SQL on the server, or on a phone, I guess, since you can embed it in a

6:22 phone as well, in the browser or natively. You can read data files: Parquet, CSV, JSON, and maybe more, I guess, in the future. There are no external dependencies. I say that with a wave to pandas, because when you install pandas, sometimes you are still waiting for the dependencies to resolve. So that's a good point. The benchmarks look great. Actually, I

6:47 don't care about benchmarks, because at the moment all the tools are quite the same for the use case I have. I do analytics, like reporting for companies, so I don't care if the dashboard, or not the dashboard but the queries, run in 10 minutes instead of 15 minutes. So yeah, benchmarks look great;

7:07 if you have a specific workload, you have to look deeper into it. And DuckDB is already used in, not a lot, but a few data products, either with Wasm, as [unclear] does for instance, or in the core of the product, like [unclear].

7:27 Why do I put an Apple slide in my presentation? Not because I'm Tim Cook, but just to have a realization, which is that right now I guess we all work in startups or [unclear], so we all have Macs, and our Macs' specifications are

7:51 huge. When I say huge: the new M3s have something like eight-core CPUs and at least 24 gigabytes of memory. When I remember

8:01 the talks I had with sysadmins at some companies, they start servers at 2

8:08 gigabytes of memory. So in a lot of use cases my computer is more powerful than the server I have. So why not do stuff locally? Why not do stuff in the browser that I use every day to work with data? And if we

8:26 just compare it with Snowflake, with the Snowflake warehouse sizes: this is not me saying this but select.dev, which is a company that tries to reduce your Snowflake costs. They approximate that an X-Small warehouse in Snowflake is about 8 cores and 16 gigabytes. If I compare

8:51 it to my MacBook, my MacBook has 64 gigabytes of RAM and 26 CPUs, something like this.

8:58 So why should I run stuff on Snowflake when I can run it on my computer? Yeah, I take a lot of shortcuts, but you get it.

9:18 Now I just jump to DuckDB Wasm. This is

9:23 a tweet I posted about two weeks ago when I was trying stuff with DuckDB Wasm. It was the first time in a few years that I had this kind of magic happen to me. It's the sentence in the middle: I said that I'm able to read a public 500-megabyte Parquet

9:43 file and do a GROUP BY in less than one second. To me, at first glimpse, it was just magical to be able to read a huge, well, not huge, but hundreds of

9:55 megabytes file, and to do a GROUP BY on it.

9:59 When I say in less than one second, it was more like in milliseconds, to be honest. And when you deep dive you understand why it's working, but when you just do the first SQL query, you're like, OK, it's magic. I will maybe explain a bit why it works like this. So, just for people who don't

10:19 know, Wasm means WebAssembly, and Wasm is

10:24 designed as a portable compilation target for programming languages. So the idea with Wasm is to compile your C program, your binaries, put it in a browser, and run stuff in the browser, like Doom. So if you want, you can play Doom in the browser, or do other stuff, actually. For this presentation I've

10:47 not finished it, but I also tried to put dbt, SQL and stuff in the browser to have a complete data stack in my browser. That's for another time, I guess. So what it means with DuckDB is that you can run SQL code in the browser without any server. And without any server matters because, for a

11:08 lot of use cases, you don't want a server; you just want a browser and to do stuff in the browser. You can send an HTML file and JavaScript files to people, and they do stuff locally. So let's jump to a few ideas, now

11:23 that I've introduced my stuff. I have three things that I can show today. The first one is being able to run DuckDB locally or in the CI, to replace your data warehouse. I will do it only locally because I don't have a CI system here. I mean, in production you have your data warehouse, like Snowflake,

11:44 BigQuery, or another one like Azure Synapse, if people are using Azure Synapse, which I

11:52 don't. So you have a production warehouse, but you want to test the SQL queries of your data warehouse. One solution, for instance with BigQuery, is to create a CI project and run the BigQuery queries in the CI. If you have already done it, you know that when you have hundreds of SQL queries it will

12:16 take time, and when I say time, it will take a lot of time, because you have a cold start between each SQL run, even if you do a dry run with BigQuery. It will take minutes and slow down your CI. So if you're able to translate on the fly your SQL, your

12:34 BigQuery SQL queries, to DuckDB and run them within DuckDB, you win, because

12:41 you don't spend a lot of time doing HTTP calls; you just make calls in memory. I will also show you a web extension I developed just to get the Parquet schema of your data files, your data lake files, in the browser. For instance, when you use S3 or GCS, it's

13:02 very frustrating when you navigate in your S3 browser: you have Parquet files, but it's awful to open Parquet. You have to download it, open a Jupyter notebook, pandas, read_parquet, and then you are able to see what's in the file. I developed a small extension to do it. It's more like a POC rather

13:24 than something you should put in production. And the last one is adding SQL capabilities to a React app, to any React app. So I will just switch to my

13:38 PyCharm. I will maybe enter presentation mode.

13:46 OK, how do I open the left... I

13:54 don't... Project, OK. So I have... no, not

13:59 React for the moment, no. OK, so I have a dbt project. I don't know if you are familiar with dbt, but I will not do an introduction to dbt, I'm sorry. I have a small dbt project in which I have one source, which is raw order. This table is already loaded in my BigQuery

14:22 project. If I just show you here, I have a project in which I have raw order, and raw order is a dataset generated by ChatGPT about someone who is ordering fruits and vegetables. Yeah, it's because it started like the previous [unclear] stuff. Yeah, it's apples, bananas; yeah, we have bananas here.

14:50 Then, where is my PyCharm? Here. So I

14:55 have one source, and I have one model that is querying this source in BigQuery SQL, with BigQuery SQL functions. And I have one profile, and in my profile I have two

15:12 outputs: the first one is BigQuery and the second one is test. Test is using a DuckDB connection while BigQuery is using the BigQuery

15:22 connection. In the profile for DuckDB

15:27 I have this key which says "emulate BigQuery". This is to tell the DuckDB

15:35 adapter to translate on the fly, to transpile on the fly, the SQL query,

15:43 using BigQuery as the source dialect, and to transpile it into the DuckDB syntax. OK, so if I just go to my

15:53 terminal, I'll just zoom in a bit. If I just do a dbt run, it will run on BigQuery. You have to trust me, because I've already run it, but it will run on BigQuery; you see it here, target BigQuery. So it created the table. And if I do a run with target

16:14 test, it will run, but on DuckDB, locally. So

16:18 right now it worked, but the reason it worked was that I already had a database with the source loaded on my computer, because if I don't have it, it will not work. If I remove the database that I have locally, it fails, obviously. So in order to do it, I just developed a small script that...

16:41 sorry, I have a small script that loads into the local DuckDB a fixture of the order table that is in production, but smaller, because we are local, so we don't put all the stuff in. So I have here in CI the source, raw order, and I have only two orders in it. And so I run,

17:04 I load the data, and I do the stuff. Yeah, it's magical. I will go fast, because the idea is to move on to the other presentations as well. But when I do the run here, the script that I run is

17:20 a CI script. It's something that loads the raw order file that is here, with two orders, and compares the output of the revenue model, which is this one, to another file here, which says that for

17:46 order one you should have 8.6, and for

17:51 order two you should have 1. Actually, there is one issue, because it expected 8.6 but the real result was

18:02 8.5. So I change my expected output to 8.5.

18:08 Yeah, because I set the expected output to fail; it's not the query that is wrong. That's how we do unit tests,

18:19 right? And right now it works. So the idea is to put this in the CI. In the near future [unclear] will introduce

18:28 tests, but the idea is not only related to dbt; the idea is to put this on all your pipelines. For instance, as was presented with dlt, the idea is to be able to have a platform that is the same in production as locally, that runs in memory, and that is very easy to install. Yeah, it's

18:50 magical. So I'll move to the next thing I

18:55 want to show you, which is a web extension that gives you the schema of your Parquet files in the storage. So

19:06 here I have the extension already installed in my Firefox. I go to my cloud storage, I go to the bucket that is set up for this, and I have to activate the extension, and I guess I have to reload the page. And so if it works, when I...

19:31 Let me see

19:36 why. OK, cool. So what I've done is: you just hover over

19:43 the Parquet file and then you have the schema of the Parquet file directly in your browser. It saves you a lot of configuration and stuff. This is just a POC, once again. The idea is to show you that it's powered by DuckDB, though actually it's more the Arrow interface that does it, but DuckDB is a nice

20:04 way to be able to query a Parquet file in the browser. Querying the schema is one thing you can do, but what you can imagine is that my extension, which is this one, which is empty (you cannot see it because I put nothing in the panel), could for instance have a text area here, so you'd be

20:26 able to run SQL directly in the browser page on your Parquet files or your CSV files that are in the data lake. Yeah, good idea. The code for this extension is not yet pushed anywhere on the internet, but I will try to publish it after the meetup, or even

20:48 write a blog post about it. If I just show you quickly the code behind it: my extension is called Parquet Info. If you don't know how extensions work, an extension works in two parts. There is a part that runs in the background, which is "background" here, so you define a script that runs in the

21:11 background, and you have the content script, and the content script is something that is, I would say, appended to your main HTML page. So in order to run some stuff, you have to put it in the background and do some kind of message passing between the background and the foreground to make it work. So

21:32 my DuckDB stuff is in the background.

21:36 So here, this is the way you instantiate... is it? No, this is the way you instantiate a DuckDB asynchronous database in vanilla JavaScript. I say vanilla because in React it's a bit different, since the imports are a bit different. But you have to get the bundles that are hosted out on the internet, then you

22:02 create a database. There are some credentials here that I will change after this, because just to query my GCS I have to put the access key and the secret key somewhere. And then what I do, mainly, is just run a SQL query, and the SQL query that I run is not on this

22:28 side but on the content script side. Here, the SQL query that I run is... not here. OK, I don't remember where it is. Oh yeah, it's here. I was blind. So what I do is just query parquet_metadata; I put my file here. A lot of stuff is hardcoded because it's a demo, and

22:52 then I get the path, so it's the column name, and the type. For the type, actually, you might have seen that the type for strings is byte array, which might look wrong; it would be better if it were string or object, but it still works. And the last small thing I want to show is that you

23:15 can integrate Wasm within your React app. So for instance, here I just have a small Create React App project that is connected to a DuckDB that is in my browser. So what I can do is run SQL. I will maybe zoom in a bit. Here I just do SELECT 1 AS

23:33 two, and then I have a table.

23:38 What I can do is redo the demonstration I was speaking about in my

23:45 tweet. So I go to data.gouv.fr, which is the open data portal in France. Here on the side we can see that among the

23:56 50 thousand datasets that we

24:01 have, there are six files in

24:05 Parquet. Yeah, things are moving fast. So I used this one, which is

24:13 all the offices where you can vote, the places where we vote in France when we vote. And I can, for instance, do a SELECT... So I will first do SELECT * FROM

24:28 parquet_metadata on my file, which

24:34 is this one. I run this.

24:39 Oh, OK, OK. Yeah, if I just do it... bucket

24:44 and [unclear]. Yeah, it's real. Yeah.

24:55 Maybe the extension... no, actually it's working without the extension as well. I think the SELECT * was not a good idea, but I don't remember the name of the column I want to select. I think it's something like this:

25:16 [unclear]. Is it working? OK, so yeah, it's the web; sometimes we

25:24 get some [unclear]. OK, yeah. One last

25:33 try. One last try. I could

25:38 [unclear] from

25:42 [unclear] this. Oh yeah, I have to put a limit, because I guess it will

25:49 fail. Yeah, actually it was working fine; it's the demo effect, I would say. So I will just try something

26:00 else. So this one is not working, but I

26:04 have another thing I can show you, and then I will quickly wrap up. In my React app, just let me uncomment something: for instance, I can load default tables. I see one use case for a SQL interpreter in the browser, which is for educational purposes.

26:26 Learning SQL is, for all data people, something that is very important and required, and the idea is to enable more and more people to write SQL. The issue with SQL is that if you want to run SQL on your local computer, at the moment you have to have Docker and Postgres and

26:43 stuff like this. This is awful. I guess with this in the browser you can provide a good experience to everyone who wants to learn SQL. It means learning the specific DuckDB syntax, but I think if you put, for instance, SQLGlot in the browser, you can translate DuckDB to

27:04 Postgres, to BigQuery, to any engine out there, and create courses for each of them. So if my demo is working now, I should be able to see some tables. Yeah, so I have two tables that are loaded. These two tables are just two files that I've put in a public bucket and that I registered just

27:28 before. So here I can do SELECT * FROM [unclear], for instance, and the data is just the European countries with the land area and the population, and the idea is to do a join and stuff like this. So for instance you can prepopulate your app, ask people some questions, and then validate the answers

27:53 directly in the browser, without an internet connection. And it's working; that's mainly the point. So I will wrap up, because I guess I'm short on time. Now I see this everywhere. For instance, on the open data portal we could put a SQL button here and just query the file directly. For instance, in my

28:17 blog I can put a SQL text area at the top just to query my metrics and stuff like this. It can be done in, I guess, every back office of every app that exists in any company, actually. We can put it in education. I think we can also put it in JavaScript apps. I think

28:37 it's better to write SQL rather than map, forEach and stuff, and having arrays, objects and stuff to recreate SQL, actually, to do a GROUP BY.
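
The contrast between hand-written loops and a GROUP BY can be sketched as follows. The talk is about JavaScript (map, forEach) and DuckDB Wasm; this sketch instead uses Python with the standard-library `sqlite3` module as a stand-in for an in-process SQL engine, so that it runs without extra packages. The `orders` rows and the function names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Made-up rows for illustration: (product, quantity).
orders = [("apple", 3), ("banana", 4), ("apple", 2), ("carrot", 1)]


def totals_with_loop(rows):
    """Aggregate by hand, the way loop-based application code usually does."""
    totals = defaultdict(int)
    for product, quantity in rows:
        totals[product] += quantity
    return sorted(totals.items())


def totals_with_sql(rows):
    """Let an in-memory SQL engine do the same aggregation with GROUP BY."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE orders (product TEXT, quantity INTEGER)")
        con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return con.execute(
            "SELECT product, SUM(quantity) FROM orders "
            "GROUP BY product ORDER BY product"
        ).fetchall()
    finally:
        con.close()


# Both approaches produce the same per-product totals.
assert totals_with_loop(orders) == totals_with_sql(orders)
```

The loop version has to be rewritten for every new aggregation, while the SQL version only needs a different query string, which is the simplification the speaker is pointing at.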

28:50 So that may be something to do to simplify the way the computation is done in JavaScript apps. In every [unclear] website, [unclear], stuff like this where data is displayed, you can put DuckDB Wasm there to just run SQL, actually everywhere. So yeah, that's mainly it.

29:13 Thank you for listening, and if you have any questions, I guess I can take them. [Applause] [Music]

FAQs

How can DuckDB replace a cloud data warehouse for local testing and CI pipelines?

SQL written for a cloud warehouse such as BigQuery can be transpiled to DuckDB's dialect and executed locally. In the talk this is done through a key in the dbt profile that tells the DuckDB adapter to emulate BigQuery. You can then run your dbt project against DuckDB in a CI pipeline instead of sending every query to the cloud data warehouse, which avoids the cold start between queries and the HTTP round trips. Load fixture data locally, run your SQL transformations, and validate the outputs, all in memory on a single machine.
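
The load-fixture, run-model, compare-output loop can be sketched as follows. This is a minimal illustration, not the speaker's script: it uses Python's standard-library `sqlite3` as a stand-in for an in-memory DuckDB database so that it runs without extra packages, and the table name `raw_orders`, its columns, the fixture rows, and the revenue query are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical fixture: a tiny stand-in for the production orders table.
# Columns: order_id, product, quantity, unit_price.
FIXTURE_ROWS = [
    (1, "apple", 3, 1.5),
    (1, "banana", 4, 1.0),
    (2, "carrot", 2, 0.5),
]

# Hypothetical "revenue" model: total revenue per order.
REVENUE_MODEL = """
    SELECT order_id, ROUND(SUM(quantity * unit_price), 2) AS revenue
    FROM raw_orders
    GROUP BY order_id
    ORDER BY order_id
"""

# Expected output of the model for the fixture above.
EXPECTED = [(1, 8.5), (2, 1.0)]


def run_model(rows):
    """Load the fixture into a fresh in-memory database and run the model."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE raw_orders ("
            "order_id INTEGER, product TEXT, quantity INTEGER, unit_price REAL)"
        )
        con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)
        return con.execute(REVENUE_MODEL).fetchall()
    finally:
        con.close()


# The CI check: fail the build if the model output drifts from the expectation.
assert run_model(FIXTURE_ROWS) == EXPECTED
```

Because the fixture, the model, and the assertion all run in-process, a CI job built this way needs no warehouse credentials and makes no network calls.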

What is DuckDB WASM and why does it matter for the future of analytics?

DuckDB WASM (WebAssembly) lets you run DuckDB entirely in a web browser without any server. Users can execute SQL queries on Parquet files, CSV files, or other data formats directly in their browser with no backend infrastructure. The speaker reported reading a public 500 MB Parquet file and running a GROUP BY in less than one second. This opens up browser-based SQL education tools, data exploration, and the ability to embed SQL capabilities in any React application.

Can DuckDB be used to add SQL capabilities to web applications?

Yes. Using DuckDB compiled to WebAssembly, you can embed a full SQL engine inside a React or JavaScript application. The speaker demonstrated a React app that loads tables from files in a public bucket and lets users write and execute SQL queries directly in the browser. This enables SQL education platforms, interactive data portals, and in-browser analytics tools, all without a backend server, and without an internet connection after the initial data load.
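
For the education use case, validating a learner's answer can be as simple as comparing the rows their query returns with the rows of a reference query. A minimal sketch, assuming made-up tables, country names, and values, and using `sqlite3` from Python's standard library as a stand-in for DuckDB Wasm running in the browser:

```python
import sqlite3

# Reference solution for the hypothetical exercise
# "compute the population density of each country".
REFERENCE_SQL = """
    SELECT l.country, p.inhabitants / l.area_km2 AS density
    FROM land_area AS l
    JOIN population AS p ON p.country = l.country
"""


def make_db():
    """Create a fresh in-memory database preloaded with illustrative tables."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE land_area (country TEXT, area_km2 REAL);
        CREATE TABLE population (country TEXT, inhabitants INTEGER);
        INSERT INTO land_area VALUES ('Northland', 100.0), ('Southland', 400.0);
        INSERT INTO population VALUES ('Northland', 5000), ('Southland', 8000);
        """
    )
    return con


def check_answer(learner_sql):
    """Return True if the learner's query yields the same rows as the reference."""
    con = make_db()
    try:
        expected = sorted(con.execute(REFERENCE_SQL).fetchall(), key=repr)
        try:
            actual = sorted(con.execute(learner_sql).fetchall(), key=repr)
        except (sqlite3.Error, sqlite3.Warning):
            # Invalid SQL counts as a wrong answer rather than a crash.
            return False
        return actual == expected
    finally:
        con.close()
```

Each check runs against a freshly built database, so one learner's query cannot affect the next check; row order is ignored by sorting both result sets before comparing.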
