Talk from Mihai Bojin at the DuckDB meetup in Dublin on 23 January 2024! It provides an overview of the big data landscape, data warehouses, DuckDB's features, and why simplicity always wins!
Transcript
0:04 So, out of curiosity, how many people here are familiar with DuckDB? Okay, so we have a few people that know it a little bit, and then people that are being exposed to it for the first time. That's great, because my talk, from a very selfish point of view at least, is for people that are learning about DuckDB for the first time.

0:25 Today we have endless data processing technologies, but we're all busy people, and I'm here to tell you why I think DuckDB is worth our time. Really: why should we invest our brain cycles to learn about it and use it?
0:58 First, a disclaimer: I'm here in my own personal capacity; I am not here on behalf of my employer, Google, or Alphabet. So, the big data landscape over the past ten years: if you look at 2012, we had quite a few technologies, but not that many of them. Fun fact, and shameless plug I guess: 10gen is actually the MongoDB company; I used to work for MongoDB before Google, so they were part of the data landscape even back then.

1:45 In 2016 the space got a bit more crowded. You can still see the logos and read the company names; there are a lot more of them, but it's still somewhat easy to grasp. And then in 2023 you can no longer understand what's happening; there are just too many tools. Somewhere out there we can spot a wild DuckDB.

2:06 Basically, if you think of big data today, we've got a staggering amount of tools. I did a very finger-in-the-air estimation and I think we have about nine times more than we had in 2012, but that's probably a low estimate; there are probably a lot more than that, and a lot of startups in the space as well.

2:26 We've got data warehouses, we've got data lakes, we've got lakehouses, we've got extract-transform-load (ETL), we've got ELT, then we have EL... honestly, it all starts to feel a little bit ridiculous. The explosion of options
2:46 is both a blessing and a curse. It's a blessing because we have a lot of tools to choose from, so we can pick the thing that seems to work best for our use case. The problem is that we all have the same number of hours in the day, and the

3:04 question arises: what is a good investment of our time? We can't possibly learn everything; there's just too much diversity in the space. At the same time, I think SQL has established itself as the de facto language of choice for data processing: the T in ETL, if you will.
3:27 It's a universal language that many analysts speak, and it has vast support across many platforms, especially the big data ones like Snowflake, BigQuery, or Redshift; again, there are too many tools to list. And that brings me to DuckDB, a database engine that uses the SQL query language.

3:51 So, if databases held a popularity contest, they would all be looking at the DB-Engines ranking. It's a good proxy for how database technologies fare against one another; it uses several ranking factors, like website mentions and search engine traffic, so it basically measures people's interest in database technologies.

4:10 I'll make a small segue here to point you to the AI-generated image, which I thought was very funny because it's almost correct, only with little errors, my personal favorite being "Mongr Me", which I imagine is a mash-up of MongoDB and something else. I thought it was funny, so I kept it.

4:34 Basically, the ranking looks like this: you have a number of databases, MySQL obviously being a popular one, Postgres being a popular open-source database, MongoDB being in the top 10. But then you have DuckDB, and DuckDB is ranked 86th; a year ago it was 153rd, so it's growing fast.

5:19 I think if we compare it to Snowflake, which is obviously a very successful platform of choice for data analysis (and maybe you could argue that it's not a fair one-to-one comparison, because Snowflake is a platform and DuckDB is a database engine, but they still operate in the same space), I thought it would be relevant to compare the two. You can see that the growth DuckDB has experienced, in about two years, is similar to Snowflake's early days.

5:37 Now, Snowflake did taper off around that point; as it reached its initial user base, it stayed somewhat flat for the next year and a half to two years, whereas DuckDB is continuing to grow, although the curve seems to be flattening a bit. But my read of this chart is that DuckDB has reached enough people already; it's certainly generating a lot of noise in the market, and after a few more years it's very well placed to explode, to become mainstream, a de facto tool in a data engineer's tool set. Also notice that the scale is logarithmic, so every gridline is actually a 10x increase.
6:23 Snowflake increased 10x, DuckDB increased 10x, and it's set to reach another 10x; it's probably at about six at the moment. In any case, I think this makes DuckDB a very useful skill to have, certainly a useful skill to invest in at this point in time.

6:46 I can show you some other nonsensical stats, GitHub stars for example; they don't really show much, since anybody can go and star a project. But what I find interesting here is that, compared to Postgres, DuckDB reached the same number of stars in about half the time. Obviously, Postgres has been around for longer than 2012; that's just how much data is shown here. But DuckDB is growing fast, and that's really the key takeaway, in my opinion.

7:15 Its Python library is also seeing about 1.7 million downloads on PyPI, so it's competing for attention with many of the popular open-source databases out there, at least the transactional ones. And that brings me to features: what's so cool about it, anyway? It's an in-process database; it runs anywhere. It's as simple as doing

7:41 a brew install duckdb (or apt or yum, whatever); you install it, you run it, and you can process data. It can be run as a Python library, which obviously also means you can run it in any Jupyter-compatible notebook. It's super easy, and of course, because of that, you can run it in CI, in any cloud environment, and so on.

8:01 It has the advantage of being universally compatible; it even runs in your browser. In fact, if you scan this QR code with your phone in a Chromium-based browser, you'll end up in the DuckDB shell and you can run queries on your phone, which I think is pretty cool as far as technology goes.

8:20 In terms of extensions, it has a very flexible extension mechanism; extensions are dynamically loaded at runtime. It can read JSON, it can read Parquet, it can actually read many other formats, and it can also read directly from S3. That means you could, for example, point DuckDB at a CSV hosted in S3 and process it straight away: you can treat a CSV file as a database without needing to import it first. It's very straightforward and very fast. The TL;DR is that you've got endless integration possibilities.
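As a sketch of what that direct file access looks like (the bucket and file names here are hypothetical, invented for illustration):

```sql
-- Load the httpfs extension so DuckDB can read from S3/HTTP
-- (extensions are downloaded and loaded dynamically at runtime).
INSTALL httpfs;
LOAD httpfs;

-- Query a CSV straight out of S3, no import step required;
-- DuckDB infers the schema automatically.
SELECT user_id, count(*) AS events
FROM read_csv_auto('s3://my-bucket/events.csv')
GROUP BY user_id;

-- The same works for local files and other formats:
SELECT * FROM 'events.parquet' LIMIT 10;
```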
9:03 Especially because you can bring your own extension: if you want, you can develop your own extension, load it into DuckDB, and integrate it with something else. It's just a binary at the end of the day, so it's very low complexity: you don't have to deal with credentials, you don't have to deal with ACLs, you don't have to deal with firewalls. It's just a very simple tool, and that is a major advantage. It also has no dependencies, so it has a small footprint, and you don't end up in that scenario where you install one thing and it installs a million other things; it's efficient to run as well.

9:46 It integrates with data frames, which obviously means you can query and store results in Pandas data frames. So even if you run across a data set that maybe isn't directly queryable from DuckDB, you can use Pandas, or data frames in general, as the glue in between. It's also stable and efficient: DuckDB works very hard to

10:11 avoid out-of-memory exceptions, spilling to disk as needed. And it's very fast: the team has worked over the past year to improve its speed and has achieved some pretty significant improvements.
10:33 So, in my opinion, simplicity always wins. Traditionally, processing data in a big data setup has hidden complexity costs, and sometimes they may not be necessary. If you look at one machine, one machine will run: it's generally rare for your laptop to crash, and even if it does, the probability of a second laptop failing right after is lower. And you're a single user; it's not like you're running thousands of machines, like a big distributed data system would, where failure is not only expected but actually happens.

11:12 If you have any sort of long-running, non-retriable pipeline, which tends to be affected by a single failure towards the end, you end up rerunning the whole thing, and in some cases obviously paying for it, because the cloud isn't free. DuckDB, in this sense, allows you to iterate faster and cheaper, especially as you shift left and start figuring out your data sets and what to do with them before you productionize them.

11:35 Now, obviously, any productionized pipeline will be run in the cloud, but the fact that you can process and iterate locally on your laptop is a big advantage. You can also save your company some money, because you already have a powerful MacBook sitting on your desk, and it can actually do a lot on its own.

11:57 In closing, I'd like to show you three syntactic features that I really like about DuckDB's SQL syntax.
12:18 The first one is GROUP BY ALL. Traditionally, when you write an aggregation query, you aggregate by a number of fields, and you have to repeat those same fields in the GROUP BY clause. In DuckDB you can just say GROUP BY ALL, which makes the query much easier to read, but also much easier to edit: usually your query is not going to be this short; you're probably looking at a whole page of formatted query, scrolling up and down. Having GROUP BY ALL at the end means you can change the list of selected fields at any time.
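A minimal sketch of the difference (table and column names are made up):

```sql
-- Classic SQL: every non-aggregated column must be repeated.
SELECT region, product, year, sum(revenue) AS total
FROM sales
GROUP BY region, product, year;

-- DuckDB: GROUP BY ALL groups by every non-aggregated column,
-- so adding or removing a selected column needs only one edit.
SELECT region, product, year, sum(revenue) AS total
FROM sales
GROUP BY ALL;
```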
12:49 You don't have to repeat yourself: the DRY principle, if you will. Then there's the SELECT ... EXCLUDE syntax. Instead of, let's say, selecting 20 fields by listing 19 of them just to leave one out, you can select star and exclude only the one. Again, much easier to write, and much easier for someone to read and understand.
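For instance (hypothetical table and column names), dropping a single sensitive column from a wide table:

```sql
-- Select every column of users except the one you don't want,
-- without spelling out all the remaining columns by hand.
SELECT * EXCLUDE (password_hash)
FROM users;
```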
13:09 And finally, if you have something like timestamps that you're trying to join, say almost-equal timestamps that are never quite equal because there are always millisecond differences, you could try to bucket them in plain SQL. It's possible, but the syntax isn't nice. In DuckDB, you can write an ASOF JOIN instead.
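A sketch of such a nearest-timestamp join (table and column names invented for illustration):

```sql
-- For each trade, pick the most recent quote at or before its
-- timestamp, instead of bucketing timestamps by hand.
SELECT t.symbol, t.ts, t.qty, q.price
FROM trades t
ASOF JOIN quotes q
  ON t.symbol = q.symbol
 AND t.ts >= q.ts;
```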
13:32 You just specify the fields, let DuckDB figure it out for you, and it optimizes it as well. So that's pretty much what I had; I hope I've sparked your interest at least a little bit. If you'd like to learn more about DuckDB, there is a free book; there are some QR codes

13:56 there that you can scan to get early access to the DuckDB book. There's a conference happening next week in Amsterdam, DuckCon #4. There's also a poster I found, which is pretty cool: it lists a lot of the tools and

14:14 integrations that DuckDB has, basically capturing the DuckDB ecosystem. And of course, you can join one of our future events. Thank you so much! And if you have any questions...