Unleashing the Power of DuckDB for Interactive SQL Notebooks
2023/10/10. Featuring: DataCamp's SQL notebooks, powered by JupyterLab and DuckDB. A talk from Rik Bauwens of DataCamp at one of the MotherDuck/DuckDB meetups held in September 2023 in Leuven, Belgium.
Transcript
0:04 So, I'm Rik, I'm a software engineer at DataCamp. I'm not a data engineer, so I'm going to bring a bit of a different perspective than Niels's talk. I'm going to show you how you can also use DuckDB embedded in your applications, if you build
0:22 applications. Before I can share all that, I have to give a quick intro to what DataCamp is, if you don't know us yet; it's not a sales pitch. If you know DataCamp, you probably know us as a data science learning provider: people and organizations come to us to upskill their employees in data science and data
0:42 engineering, and nowadays of course also people who want to switch careers and learn data science. That's what we're well known for, but of course after learning data science comes doing data science, and
1:00 as you probably know, a lot of data science is done inside these things: Jupyter notebooks. I see you're all pretty familiar with them. So at DataCamp in 2021 we decided to introduce our own notebook tool, called DataCamp Workspace, and this is
1:21 actually the tool that I mainly work on. It's a modern, cloud-based data science notebook. So it's basically like the Jupyter notebook that you all know, but something more modern. Let me quickly show you what it looks
1:41 like. This is a really simple demo. As you can see, it should look pretty familiar, with cells. We have a special cell type called a SQL cell, where you can query directly, in this case from Redshift. As you can see, I'm querying some ticket sales data in one cell, and you see the result
2:01 here: some events, some prices per ticket, and some quantities of tickets sold. Then in the next cell I have some code to create a Plotly plot, I think, to plot this data. I'm going to hide the code real quick, and then you can visualize the data. A very simple demo.
2:25 Let me make it full screen again. OK, it's fine. So that's Workspace, the tool I work on, and now we can go to the interesting part, which is how we use DuckDB at DataCamp. We mainly use it in this Workspace tool, and there are three main ways we use DuckDB at DataCamp. The first
2:51 thing is a feature called DataFrame/CSV SQL. I'm going to show you really quickly what that looks like; let me zoom in a little bit. Here I'm doing something
3:08 really simple: reading a Parquet file into a pandas DataFrame, and it's been loading for some time now, so I'm going to refresh real
3:22 quick. I'll blame it on the internet. Anyway, this reads data into a DataFrame, and then if you want to process that data, you usually use pandas to do it. For example, here you filter on the passenger count values. But like Niels shared, we don't like using
3:47 pandas, or Python. What you can
3:52 also do with this DataFrame/CSV SQL tool is use a special type of SQL cell where you write SQL against this DataFrame directly, and it can do the same filtering: you select from the DataFrame name as if it's a table. Normally, if the internet were better, this would show you the same results.
4:13 But what you can then also do is query directly from the Parquet files. So you don't have to read them into a DataFrame first; you can just put the Parquet file
4:28 there. So that's our first tool. Why did we build this? Well, it's super useful for beginner users. As you know, we're teaching people data science, and they don't have a Postgres database lying around if they want to practice SQL. For us it's also way easier to organize live
4:49 events like this if you want to teach people SQL: it's way easier for us to find a Parquet file or a CSV file somewhere, so we can very easily teach people SQL without spinning up a database. So that's one reason: it's useful for beginner users. But it's actually also super useful
5:09 for power users. If you do a lot of queries against your data warehouse and you always pull large amounts of data into your notebook, you have to download that same data again and again when you're running similar queries. That's obviously slow, but it's also sometimes costly, because some warehouses just charge you for egressed
5:32 data. In that case it's often better to keep the data local: download it once into a DataFrame, and then keep writing SQL on top of it with the DataFrame/CSV SQL
5:49 feature. Now you might be a bit suspicious: automatically reading CSV and Parquet files and DataFrames looks a lot like DuckDB. Exactly. And you might also be suspicious because the talk is called DuckDB at DataCamp. That's because it is exactly just DuckDB powering this. So how did we implement
6:12 this? First I have to show you a little bit of how JupyterLab works. JupyterLab is a really traditional client-server application. If you run JupyterLab locally, for example, you also have that server component; it just runs on your own machine. The client here is the notebook you see,
6:31 and every time you execute some code, that code just goes to the server, the server executes it and gives you back a result. This result can be anything: if you type 1 + 1, it gives back 2 as a string and sends it back, but it can also be HTML,
6:50 CSS, and JavaScript, to produce very interactive results. That's how Jupyter gives you such an interactive compute environment.
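The round trip described above can be sketched in a few lines of Python. This is only a toy illustration of the idea, not Jupyter's actual protocol (real kernels exchange ZeroMQ messages); `server_execute` is a made-up name.

```python
# Toy sketch of the Jupyter-style round trip: the notebook client sends a
# code string, the server executes it and returns something displayable.
def server_execute(code: str) -> str:
    # Evaluate the expression so "1 + 1" comes back as "2", as in the talk.
    result = eval(code, {})
    return repr(result)

reply = server_execute("1 + 1")
```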
7:01 Then, specifically for our DataFrame/CSV SQL feature, the code that we send from the client, from the notebook, to the server is the SQL statement plus some metadata. On the server we just run DuckDB, let it execute that query, and we get back a table as JSON. So we get all the data
7:21 back and we can render it nicely in the front end, if it
7:28 works. And this is pretty much what it looks like on the server side. It's really not much more than these three lines: we import DuckDB, you can specify the variable name of your DataFrame, and we just execute the query with DuckDB. It's a little bit more involved in the
7:47 production code, but it basically comes down to this. For me that really highlights the usability and the feature richness of DuckDB. This is what's mentioned on the DuckDB website if you visit it: it's really super simple. If you're already running Python, you have to set up no dependencies whatsoever to also run DuckDB. It's also
8:09 super feature rich: we didn't have to do anything to make the direct Parquet import work, the direct CSV import work, or the reading from DataFrames work. It all comes out of the box, with no installs, which is pretty awesome. So that was the first feature, DataFrame/CSV SQL. The second way we use DuckDB is for powering SQL on top of Google
8:33 Sheets. As you can see here, I have a very simple query selecting something from a Google Sheet, and how that
8:44 works is a little bit more complex than DataFrame/CSV SQL, but it's pretty simple too. We check which sheets you are querying, and then we download that sheet from Google Sheets. We also allow specifying ranges: for example, if you give a sheet name, an exclamation mark, and
9:05 a range, we only get that specific range from your sheet. So we download that, and then we transform it into an Arrow table in memory. If you don't know Arrow, it's a columnar in-memory format, and it's language agnostic, so
9:26 Parquet is like an on-disk format that different languages can read, and Arrow is kind of the same for in-memory data. We create that Arrow table representation in Node, and then we just register it with DuckDB, and DuckDB handles all of the SQL syntax and all of the
9:51 rest. The highlights here are that DuckDB plays really well with the data ecosystem: Arrow is becoming a bit the de facto way to represent DataFrames in memory, and DuckDB can do zero-copy reads from it, which is pretty nice. The other highlight is that DuckDB also runs on a
10:11 Node server. Previously I ran it in Python, but it also runs perfectly on Node, so it's super
10:21 portable. Those two features I showed were still using SQL, but
10:30 this is a different kind of feature, where we don't actually show SQL in the user interface. This is not something that is live yet; it's still in the works. This is what you would have seen if the tables had loaded in the demo: a nice, interactive table
10:49 that you can click through, with the result of your SQL statements. As you can see here, you can filter it, you can sort it, you can paginate it, all in the front end, in the client. How this works now is: the client sends SQL, the server executes it and gives back a table in JSON format, and
11:12 then the client sorts and filters and paginates, all in JavaScript. There are two main problems with this approach as we have it now. One is the JSON: JSON is
11:29 a pretty bad format for representing a table, or anything in general, actually. It's super space inefficient, and you still have to transform it into something else to do analysis on top of it. A better way to send that data would be in Parquet format. Parquet is columnar, so your similar values are close together, which means it compresses really
11:51 well, and it also
11:55 supports reading only a part of the file, so you don't have to process all of it at the same time. So that's one improvement we can make. But if we still have to process that Parquet in JavaScript, I have to find a Parquet library and handle all those details myself. You can already see
12:13 where I'm going with this. What we can do instead is just run DuckDB inside the browser, on the client, to process that Parquet. Because DuckDB supports Parquet out of the box, there's nothing we really need to do except translate the user's actions, like sorting, into some SQL, and that's pretty simple. Then you might be
12:36 wondering about downloading a lot of Parquet data every time you execute a SQL statement. If your SQL statement returns 10
12:47 or 100 megabytes, or gigabytes, of data, you don't want us to download all of that every time. This works pretty well for smallish tables, and then it's super nice and really interactive: the sorting is super fast because it all happens locally on your client. Of course, for larger results, like a gigabyte, you wouldn't want us to
13:09 download it all. What we can then do instead is run DuckDB on the server, perform the specific operations on the server, and send the results to the client. The nice thing is that this code is exactly the same: because DuckDB then just runs on the server instead, we have to write
13:29 the same code only once for the client and for the server. The question is: does DuckDB actually also run inside a browser? Because it's C++. And these days the answer is yes, thanks to WebAssembly. If you don't know WebAssembly, it's a safe, fast, portable binary instruction format. It's a compilation target for C++, Rust, and
13:54 other languages that can compile to it. Normally you wouldn't run bytecode that you downloaded from somewhere directly in your browser, but with WebAssembly that becomes possible: it's safe, since it can't run arbitrary code on your machine, but it's still fast. And there's this package called duckdb-
14:17 wasm that makes it possible to run DuckDB inside your browser. It's also super easy to run DuckDB inside a web worker, so that if the processing is a little bit slow, the page stays interactive. So we use DuckDB from Python, from Node, and within the browser, and
14:42 this really speaks to the portability of
14:46 DuckDB. So those are the takeaways of this section: DuckDB is really everything it promises to be. It's super easy to use, it's fast, and it's feature rich. You don't need to set up anything to read Parquet files or CSV files, and we also found it to be super fast:
15:07 the DataFrame/CSV SQL feature is way faster than running pandas code, of course. The other takeaway is that you can also use DuckDB as an application developer. For normal applications that are a bit simpler and don't deal with a lot of data, of course, you won't use DuckDB. But when you're
15:28 dealing with larger amounts of tabular data, let's say 10 megabytes or more, you can consider using DuckDB inside your application, because the barrier to entry is super low. And then there's a third section: our integration with MotherDuck. I'm happy to share that pretty recently, a few weeks ago, we built our own native
15:54 integration with MotherDuck into Workspace, and that looks a bit like this. Instead of having to use Python to query MotherDuck from our notebook, you can directly write the SQL inside SQL cells, with nice SQL syntax highlighting. For example, here, I don't know if you can see this, I'm making this query
16:16 available as a table here, and I'm querying from it directly here. That enables some powerful workflows where you can use DuckDB for processing your SQL and still use Python for the edge cases. And that's it for me. Thank you for your
16:47 attention.
FAQs
How does DataCamp use DuckDB to power SQL notebooks?
DataCamp uses DuckDB in three ways inside their Workspace notebook tool. DataFrame/CSV SQL lets users write SQL directly against Pandas DataFrames and Parquet files without setting up a database. SQL on Google Sheets downloads sheet data, converts it to Apache Arrow format, and registers it with DuckDB for querying. DuckDB WASM runs in the browser via WebAssembly to handle interactive, client-side sorting and filtering of query results.
Can DuckDB run inside a web browser?
Yes. DuckDB compiles to WebAssembly (WASM) and runs directly inside a web browser. The duckdb-wasm package makes this possible, allowing client-side SQL execution without any server round trips. DataCamp uses this for interactive result tables where sorting, filtering, and pagination all happen locally in JavaScript. Because DuckDB runs in a web worker, the page stays responsive even during processing.
Why is DuckDB a good choice for embedding SQL capabilities in applications?
DuckDB has zero dependencies and requires no server setup. You just import it in Python, Node.js, or the browser. It natively reads Parquet files, CSVs, and DataFrames without additional configuration. The same SQL code works across Python, Node, and browser environments, making it very portable. For applications dealing with 10+ megabytes of tabular data, the barrier to entry is much lower than setting up a traditional database.
How does DuckDB integrate with Pandas DataFrames for SQL queries?
DuckDB can query Pandas DataFrames directly by referencing the variable name in SQL, for example SELECT * FROM my_df. It automatically detects Python local variables, checks if they are a recognized DataFrame type (Pandas, Polars, or Apache Arrow), and performs zero-copy reads since DuckDB runs in-process. This makes it easy to alternate between DataFrame operations and SQL queries with no performance overhead.