0:04 Thanks everyone for coming. It's great to be here. My name is Gábor Szárnyas and I'm a developer relations advocate at DuckDB Labs. Tonight I will talk about DuckDB and show some short examples with DuckDB and dbt. Briefly about my background: I spent a decade in academia. I studied here at the Budapest University of Technology. Then I
0:24 moved to Amsterdam, to a research institute, and that research institute happened to be where DuckDB was created. So that got me close to DuckDB. During my research I worked on benchmarks for graph database systems: I designed benchmarks that measure the performance of analytical graph databases. And now I'm sort of the chronicler of the DuckDB project: I
0:45 edit the technical documentation of DuckDB, and I contribute to and edit the DuckDB blog. So what is DuckDB? In a single sentence, DuckDB is a universal data wrangling tool. It speaks SQL, you can run analytical queries on your tabular data, and it runs in-process. How many of you have heard about DuckDB
1:06 before? Okay, so it's a great audience, about 80%. So I chopped all the
1:13 slides that give a high-level overview of DuckDB, and I want to talk more about the deep-dive, in-depth questions of DuckDB. First of all, why did we choose an in-process architecture? The in-process architecture stems from the observation that the traditional client-server divide, which works really well, has a few inherent drawbacks. One is that someone
1:35 has to operate the server, which already incurs time and money costs. Second, someone has to set up the server and the connection, which is cumbersome:
1:44 you need to authenticate, you need to copy tokens around, and that's just not nice. And the third is that the arrows in between, the connections between client and server, actually form a huge bottleneck. If you want to fetch a table with a billion rows and the server is sitting in AWS EC2, you have to download it: it's going to be expensive and it's going to be slow. So DuckDB takes the
2:02 database and brings it into the client application. And when it does so, it's very easy to connect to: you just say import duckdb and you have your DuckDB instance. Most people run DuckDB in an in-memory setup, so it's very comfortable and very easy to interact with. This is what I call the truly
2:22 serverless architecture, because, you know, in Snowflake, even if you choose serverless, there is always a server.
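As a minimal sketch of what that looks like from the Python client (the database file name is hypothetical), covering both the in-memory default and the single-file mode that comes up in a moment:

```python
import duckdb

# In-memory: no server to operate, no connection string, no tokens to copy.
duckdb.sql("SELECT 42 AS answer").show()

# Or a persistent, SQLite-style single-file database.
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE OR REPLACE TABLE t AS SELECT i FROM range(10) t(i)")
print(con.sql("SELECT sum(i) FROM t").fetchall())
```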
2:28 Here, DuckDB is just sitting on your laptop. People think DuckDB is an in-memory database, and while it can work in memory, it's primarily a disk-based database. In fact, it has a format where we use a single-file database, so all your data is in one file. Can you hazard a guess where we got this idea
2:48 from? SQLite, correct. SQLite, which with its trillion deployments is the most popular database project, or maybe the most popular software project, was the inspiration for DuckDB. And the way we like to think about where DuckDB belongs in the landscape of database systems is that DuckDB has the form factor of SQLite, so it's in-process and it has a
3:10 single-file format, but it has the workload style of Snowflake, so it's geared toward analytical queries. Let's move on to the internals, mainly the storage, execution, and indexing of DuckDB. DuckDB has column-based storage, which means that unlike Postgres, MySQL, and SQLite, which use row-based storage where every row is stored physically together on disk, DuckDB
3:35 uses columnar storage. This has a bunch of advantages. For example, we have this railway data set with some dates, some train stations, and the delays that the trains have accumulated, and we have organized it in a columnar layout. What you can see on the slide is that this data, like quite a few data sets, is actually sorted
3:57 along the date column. Because it's ordered along the date column, it's very easy to compress: we can use something like delta encoding and turn the column into little more than a starting value. Similarly, sometimes the delay is zero, because we had a lucky day and the trains were on time, and then we can compress away the entire column
4:19 into a single constant value. So this can save a lot of disk storage.
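A hedged sketch of how you can see this in practice; the file name is hypothetical, and compression is chosen per column segment when an on-disk database is checkpointed:

```python
import duckdb

con = duckdb.connect("trains.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE services AS
    SELECT DATE '2024-01-01' + i AS date,  -- sorted dates: compress very well
           0 AS delay                      -- all-zero column: constant encoding
    FROM range(100000) t(i)
""")
con.execute("CHECKPOINT")  # compression is applied when blocks are written
# storage_info reports the chosen compression per column segment
con.sql("""
    SELECT DISTINCT column_name, compression
    FROM pragma_storage_info('services')
    WHERE segment_type <> 'VALIDITY'
""").show()
```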
4:40 Execution-wise, there is also a big difference. Two main ideas come to mind. The first is tuple-at-a-time execution: you read the first row, do your processing, then read the second row, the third, and the fourth. But this is difficult to parallelize, it's prone to cache misses, and it's usually not how modern CPUs like to execute code. Column-at-a-time operations, on the other hand, are great: you can just put all your dates in, aggregate on them, and then do the same on the delays. But this is prone to running out of memory. This is
4:58 the reason why Pandas runs out of memory: Pandas uses column-at-a-time execution, and if you have too much data, it will simply crash. So vectorized execution is a kind of in-between approach: we try to reap the benefits of columnar execution by taking chunks of the data. These chunks are the vectors
5:19 in DuckDB. Each vector holds 2048 entries, and we iterate along those vectors.
5:28 And these vectors are actually quite great, because they allow us to chunk the data and distribute the row groups of the data across the different CPU threads.
5:40 And modern CPUs have quite big L1 caches, meaning that with rightly chosen vector sizes, all the vectors for a given operation on a given chunk fit into the L1 cache of the CPU. And of course, the L1 cache is much faster than the L2 cache, the L3 cache, and main memory. So 32 to 128 kilobytes is
6:03 actually quite a large amount of memory if the data has this compressed columnar layout with vectorized processing: a vector of 2048 64-bit values, for example, is only 16 kilobytes.
6:11 One of the things that people like to ask is: why didn't DuckDB exist 15 years earlier? One of the main answers is that you need a modern compiler to auto-vectorize your code. We make sure that our code uses tight loops and branchless code so that the compiler can do its magic
6:28 and turn it into SIMD (single instruction, multiple data) instructions. This only became available in compilers roughly 10 years ago.
6:39 15 years ago, you would have had to write SIMD intrinsics by hand, which is not much easier than programming in assembly. If you run your workloads on a laptop, I have found that you can usually read a CSV file from the laptop's disk into DuckDB's native database format at more than 1 gigabyte per second. In one
7:01 hour, you can complete something like the TPC-H benchmark's queries. This is a well-known and rather rigorous benchmark that is designed to torture databases and targets analytical workloads.
7:13 And this can run on a modern MacBook on the scale factor 3,000 data set, which is three terabytes of CSV files. This is something that 10-15 years ago was the territory of big-iron workstations costing hundreds of thousands of dollars, and now it works on a MacBook.
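A hedged sketch of the ingest path behind that figure ('measurements.csv' is a hypothetical file):

```python
import duckdb

con = duckdb.connect("bench.duckdb")
# read_csv auto-detects the schema and loads into DuckDB's native format
con.execute("CREATE TABLE measurements AS SELECT * FROM read_csv('measurements.csv')")
print(con.sql("SELECT count(*) FROM measurements").fetchone())
```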
7:31 It also works in other places. DuckDB is really proud of its portability. DuckDB has a bunch of clients for languages and environments like Python, R, the command line, npm, Java, and so on. DuckDB also runs on macOS, Linux, and Windows. DuckDB is so portable that you can compile it to WebAssembly, and then it runs in your
7:51 browser, and it even runs in the browser on your phone: you can just type shell.duckdb.org into your phone's browser and run SQL queries, as shown in the example. Now for a bunch of neat DuckDB features. I promised that I would talk about indexing. The counterintuitive thing about indexing is that people learn in
8:13 graduate school that you should put indexes on your data if you want to make queries fast. This is very true for transactional systems, where you know the queries in advance. But in analytical systems, it's usually not a good idea to over-index your data: it will make the loading phase very slow, it will make updates slow, and
8:32 sometimes it doesn't even help the queries. What does help the queries is some sort of lightweight indexing that helps with pruning. And this is quite similar to the micro-partitioning figure that Tibor showed earlier. Here we have a date column, an ID, a station, and a delay column. And for every single one of those columns, for every
8:49 single row group, we store the minimum value and the maximum value. And why is this great? This is great because if you ask for something about the trains that ran during the summer, we can just look at the minimum and maximum values and conclude that only row group two, on the
9:08 right, has trains in the summer. Then we can further prune along the column storage, which means we can basically read just 1/8 of the data (one of the eight column chunks) compared to reading the two row groups in full.
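As a small sketch of what a pruning-friendly query looks like (table and column names mirror the slide, the data is made up):

```python
import duckdb

duckdb.sql("""
    CREATE TABLE services AS
    SELECT DATE '2024-01-01' + i AS date, i AS id,
           'Utrecht' AS station, i % 10 AS delay
    FROM range(365) t(i)
""")
# At realistic sizes, each ~120k-row row group stores min/max values per
# column, so this range predicate lets DuckDB skip row groups whose date
# range cannot match, and read only the needed columns of the survivors.
duckdb.sql("""
    SELECT avg(delay)
    FROM services
    WHERE date BETWEEN DATE '2024-06-01' AND DATE '2024-08-31'
""").show()
```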
9:31 DuckDB supports quite a few formats and protocols. We do CSV, Parquet, and JSON, which are the standard textual and binary formats, and we do Delta and Iceberg, the lakehouse formats that are getting increasingly popular. We can read data through HTTP, HTTPS, and S3, and from Azure Blob Storage. We can also connect directly to databases and run analytical queries using DuckDB's engine while reading from the storage of those systems. And finally, we integrate quite tightly with the data science world: you can run DuckDB
9:57 queries on Pandas and NumPy data, or use DuckDB from R with
10:02 dplyr. And the nice thing about all these integrations is that DuckDB is kind of the fabric in the middle that weaves these together, which means that the integrations usually work in synergy.
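For example, here is a minimal sketch of the Pandas integration: thanks to DuckDB's replacement scans, SQL can refer to a DataFrame by its Python variable name and query it in place:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"station": ["Amsterdam", "Utrecht", "Rotterdam"],
                   "delay": [3, 0, 7]})

duckdb.sql("SELECT station, delay FROM df WHERE delay > 0 ORDER BY delay").show()
```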
10:23 Another example: you can read a remote Parquet file stored on S3 or behind HTTPS, and the same tricks that I showed for DuckDB's native format, pruning along the columnar layout and the min-max indexes, work even when you query a Parquet file on a remote endpoint. This is thanks to some quite obscure features of the HTTP standard: you can formulate your HTTP requests as range requests that seek into the
10:43 file and read only the given bytes from the HTTP endpoint.
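A hedged sketch of such a remote query (the URL is made up): DuckDB's httpfs support fetches the Parquet footer first and then issues range requests for only the row groups and columns that survive projection and min-max pruning:

```python
import duckdb

duckdb.sql("""
    SELECT station, avg(delay) AS avg_delay
    FROM 'https://example.org/railway/services.parquet'
    WHERE date >= DATE '2024-06-01'
    GROUP BY station
""").show()
```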
10:58 DuckDB also has something called friendly SQL. DuckDB speaks a dialect based on PostgreSQL's SQL, and that has actually helped us a lot with integrating with other tools. But of course, the Postgres dialect is not perfect. If you have something like this matrix of distances between train stations in the Netherlands, and you try to turn it from a wide table into a long table, which is a very popular and common operation in data science, you would need to write UNPIVOT ... ON and then
11:18 list all the train stations, 400 or so. DuckDB has something called COLUMNS: the COLUMNS(*) expression is substituted with the list of columns in the table, and you can exclude given columns from it. So this is a really neat and short way of expressing the unpivot.
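A small sketch of that friendly SQL, with a made-up three-station matrix standing in for the 400-column one:

```python
import duckdb

duckdb.sql("""
    CREATE TABLE distances AS
    SELECT 'Amsterdam' AS station, 0 AS "Amsterdam", 35 AS "Utrecht", 57 AS "Rotterdam"
    UNION ALL
    SELECT 'Utrecht', 35, 0, 55
""")
# Unpivot every column except 'station', without listing them one by one.
duckdb.sql("""
    UNPIVOT distances
    ON COLUMNS(* EXCLUDE (station))
    INTO NAME other_station VALUE distance
""").show()
```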
11:41 DuckDB also has CREATE OR REPLACE, which is increasingly common in modern SQL dialects but not yet supported by systems like Postgres. CREATE OR REPLACE is neat because you can write idempotent scripts with it: you can just keep running your script and it will always replace the tables. If it crashes halfway, you don't have to throw away the temp tables; you just write over them in the next run.
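A minimal sketch of such an idempotent script (the file name is hypothetical); rerunning it, even after a crash halfway through, converges to the same state:

```python
import duckdb

con = duckdb.connect("pipeline.duckdb")
con.execute("CREATE OR REPLACE TABLE raw AS SELECT i FROM range(100) t(i)")
con.execute("CREATE OR REPLACE TABLE doubled AS SELECT i, i * 2 AS i2 FROM raw")
```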
12:01 And something that we announced yesterday is the DuckDB local UI. Most people so far have interacted with DuckDB using either the command line client, Python, or a Python notebook. But now you can just launch DuckDB with the duckdb -ui flag, and it will run a neat browser-based user interface where you have notebooks, you
12:23 have a table analyzer, and you have your SQL editor in your browser, without doing cumbersome magic in the command line.
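Since the UI ships as a DuckDB extension, you can also start it from a client API rather than the CLI flag; this sketch assumes the ui extension's start_ui() call and a hypothetical database file:

```python
import duckdb

con = duckdb.connect("dev.duckdb")
con.execute("CALL start_ui()")  # serves the browser-based UI for this database
```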
12:47 This is of course a dbt meetup, so let's talk about DuckDB's dbt integration. This is a use case that we like to think of as a pipeline. The three major DuckDB use cases are interactive analysis for data science; embedded architectures, where people run DuckDB on Raspberry Pis, phones, smartwatches, and smart TVs; and, in the middle, pipeline components, which is where dbt falls. There it's not an interactive setup: it's just DuckDB that is spun up quickly, does something, usually in in-memory mode, and then shuts down. dbt-duckdb was originally developed by Josh Wills in Silicon Valley,
13:14 who used it for a pet project and ended up using it for some real use cases too. DuckDB fits dbt really well because it's lightweight, it can be started quickly, and it is rather fast.
13:27 When I tried to come up with an example of how I would use DuckDB with dbt, I was reminded of a use case of mine where we had large graph data sets, and I inherited a code base that generated them with Hadoop. That was rather difficult to work with and very slow, so we migrated it to
13:48 Spark, and it was still rather difficult to work with and very expensive to run.
13:54 We spent tens of thousands of euros running Spark jobs. So ultimately we said, okay, this is obviously no way to live. Users will not jump through the hoops of getting all these AWS quota increases and running the Spark jobs on AWS. So we said, let's just pre-generate the data. But what happened then is that we had a combinatorial problem:
14:16 let's generate all sorts of different layouts with all sorts of different date formats. We ended up distributing eight somewhat similar but different layouts of the data, and this grew to a size of tens of terabytes. It was so expensive that I had to ask the research institute to store it on tape. So it is sitting somewhere
14:37 in Amsterdam, and if you want to download it, you have to fetch it from tape to disk and then download it from disk, which is obviously not nice. So I have been wanting to reduce its size for some time. One thing that, in retrospect, we obviously should have done is move from CSV and compressed CSV files to a nicer
15:01 compressed binary format like Parquet. So I did some experiments, and it turns out that plain CSV is of course very easy to beat with Parquet, but compressed CSV, which here came to around 3 megabytes, is quite difficult to reach. So what I ended up doing was to use DuckDB and switch to the new Parquet format
15:20 version, which is supported as of the latest DuckDB release, and to turn on zstd compression, which then managed to get under 3 megabytes. So this is an overall 4x saving.
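A sketch of that conversion, assuming DuckDB's COPY options for the Parquet version and compression codec; the file names are hypothetical:

```python
import duckdb

duckdb.sql("""
    COPY (SELECT * FROM read_csv('graph_data.csv.gz'))
    TO 'graph_data.parquet'
    (FORMAT parquet, COMPRESSION zstd, PARQUET_VERSION v2)
""")
```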
15:43 Then I created a simple dbt project: an input model that just reads the Parquet file, and then every single layout is just one SQL query that takes the input, creates an externally materialized table, and writes it out as CSV. And this is basically the whole dbt model. You can run dbt build on it, and it takes less than a second.
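This is not the actual dbt code, but a plain-DuckDB sketch of what one layout model boils down to; dbt-duckdb's external materialization essentially wraps a COPY like this, and the column and file names are hypothetical:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE VIEW input_model AS SELECT * FROM 'graph_data.parquet'")
con.execute("""
    COPY (
        SELECT src, dst, strftime(created, '%Y-%m-%d') AS created
        FROM input_model
    ) TO 'layout_dashed_dates.csv' (FORMAT csv, HEADER)
""")
```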
15:58 And this is where the lightweight, dependency-free nature of DuckDB really comes into play, because obviously, with these small data sets, the processing time would be fast both in DuckDB and in other systems like Spark, but with Spark you have the big overhead of setting up the system, starting it, waiting for it, and so on. DuckDB does
16:20 everything, combined, in less than half a second. And the neat thing about the DuckDB UI, which I found out during debugging, is that you can just find the dev.duckdb file that dbt-duckdb leaves behind, connect to it with duckdb -ui, and then peek into the import and export tables and troubleshoot your queries, which
16:42 is again much nicer than working with Spark and something that sits on a remote endpoint with a big delay. So what's the business model behind DuckDB? This is something that people are usually curious about. DuckDB is developed by a company called DuckDB Labs. This is where I work. It's based in Amsterdam, and we are a small team of roughly 15
17:03 people. The interesting thing about our business model is that we are not funded by venture capital. We are plainly funded by revenue, and that revenue comes from other companies that we provide consulting and support services for, and many of those companies have venture capital. So we work with MotherDuck, we work with Databricks, and so on.
17:25 And if you have followed the database world, in the last five years in particular, it has been very common for companies to do a license change. MongoDB changed licenses; so did Redis, Elastic, Neo4j, and so on. And this is something that users are quite concerned about. So to alleviate this concern, we actually moved all the intellectual property of
17:45 DuckDB to a Netherlands-based foundation. This is called the DuckDB Foundation, and it owns the IP of the project. So we cannot relicense it, because it's not part of the DuckDB Labs organization; it's in the foundation. Even if Labs gets acquired by a big tech player, the IP stays there, and it stays MIT-licensed forever. The foundation is
18:07 funded by donations from contributors: companies like MotherDuck; Posit, who develop RStudio and Positron; and Voltron Data, who are very active in the Arrow ecosystem. And speaking of MotherDuck: MotherDuck is our close partner company, a venture-capital-funded American company based in Seattle. MotherDuck does this interesting trick. We said, okay, let's not do
18:32 the client-server architecture, because it has all these inherent problems, and DuckDB is this simple embedded system.
18:39 MotherDuck takes that system and says: well, actually, those things are rather useful if you want to collaborate with people, if you want to work in the cloud, if you want to share your data. And it builds many of those components back in the cloud. So MotherDuck is a DuckDB-based cloud data warehouse with quite a
18:54 few cool ideas, like hybrid query execution, where one part of the query runs locally, maybe having access to your sensitive personal health data, and the other part runs in the cloud and has access to some open data set. So to sum up, I like to think about DuckDB as a kind of good old-school
19:14 system. DuckDB would feel at home on an old CentOS machine in a research institute 20 years ago, because it is open source, it runs from the command line, and it is built on open standards. And if you open the DuckDB website, we go to great lengths to make sure you have a smooth experience: we don't have cookies, we don't have telemetry, and you can download
19:35 our documentation and it works offline. And if you think about the spectrum of data processing systems, we like to say that DuckDB is positioned around the middle. Obviously, for small data sets, Excel and Pandas are great.
19:49 And for large data sets, Spark is also great. But with DuckDB becoming increasingly popular and gaining more and more integrations, you can now read Excel files with DuckDB and you can read Pandas data frames with DuckDB. And with DuckDB becoming more and more scalable, I think DuckDB is pushing the limits of what is doable on a single machine, leaving Spark only for the 10
20:11 terabyte plus big jobs. And both Spark and DuckDB integrate really well with dbt. Thank you. And if you're interested, follow our socials, and I'm happy to answer any [Applause]
20:27 questions. Q&A. Question one: DuckDB runs on the CPU; any plans on supporting GPU-based computation? So this comes up quite frequently, because you have pretty amazing GPUs in computers, mostly thanks to the AI boom, but GPUs would hinder the portability of DuckDB. So we have thought about this, and there are currently no plans for doing GPU
20:52 computation, because of the portability issue and also because GPUs are only really good at fixed-size data. As soon as you hit something like strings, it starts to become more and more cumbersome to program them, and the benefits start vanishing. Second question: full Delta support, read and write, Unity Catalog, when? When someone steps up and funds it.
21:16 That's part of the roadmap, but our roadmap has a big section that says 'open for funding', and full Delta support is one of those items. Next question: your new local UI is awesome, do you plan to open-source it, and will you add a SQL formatter for lazy people? So, the UI is shipped as a DuckDB
21:38 extension, and that's open source. The component that runs in the browser, the JavaScript, is not open source: obviously you get the source, but it's an obfuscated piece of JavaScript.
21:51 I talked to the developers yesterday, and we are considering open-sourcing it. It's going to be a bit difficult to do, but it is under consideration. Will we add a SQL formatter? I saw a SQL formatter a couple of days ago; I think it's in the awesome-duckdb repository. And someone has also developed a formatter as a DuckDB
22:14 community extension, where you just write a query and it prints out a nicely formatted one, but this is not part of the UI yet.
22:22 And the last question: I read some interesting news lately about distributed DuckDB, smallpond. Do you think it will get traction? So this got a lot of attention because it was done by the Chinese company DeepSeek, who got famous for their R1 model.
22:39 It's a bit early to tell whether it will get traction. The thing about it is that they built this stack with their own distributed file system and their own distributed orchestrator for DuckDB, which uses Ray internally. And
22:57 they got quite a bit of attention. What I read from the comments is that people are a bit skeptical about this becoming an industry standard. So even if their distributed file system, for example, has some inherent benefits, you wouldn't really move over from something like AWS S3, because S3 is the industry standard and this is a shiny new idea.
23:18 So I think this is something that hackers will experiment with. It's not necessarily something that you should start using in production just yet. But it is very interesting, and if you program it meticulously and make sure that smallpond can execute your queries, it will alleviate a lot of the scalability
23:40 concerns. People say that if you have a petabyte of data, then you shouldn't use DuckDB alone; maybe with smallpond, you can actually make sure that your queries finish. All right, any more questions? Seeing none, I hand the microphone back to Alan.
23:58 Thanks, [Applause]