Taming file zoos: Data science with DuckDB database files

2025/06/02

Problem statement

Data scientists working in Python often spend the majority of their time cleaning input data, frequently from files. These files have many formats, can be located anywhere, and sometimes have names like ‘data_final_final_v3.csv’. Data scientists often produce similar files! We call these “file zoos”.

Taming file zoos with DuckDB

DuckDB fits perfectly with Python

The MIT-licensed DuckDB database management system was designed to fit perfectly into data scientists’ workflows. Install DuckDB’s pre-compiled, dependency-free binary from pip. It can read and write dataframes (Pandas, Polars, and Apache Arrow) for interoperability. It also has an advanced persistent file format.

Read and write files with confidence

DuckDB can read and write CSV, Parquet, and JSON files, and even XLSX and Google Sheets. The CSV reader in DuckDB is world-class, quickly querying even messy CSVs. DuckDB interoperates with object stores across clouds and reads lakehouse formats like Delta and Iceberg.

Organize using the DuckDB format

Use DuckDB’s highly compressed columnar file format to persist many large tables all in the same file. Store processing logic in views and functions and even update just portions of the file. DuckDB serves as a catalog when files should remain in place.

Beyond the format itself, DuckDB provides ACID transactional safety and parallel processing. Its files can be read from 15+ languages and are guaranteed to be readable for years to come. It unlocks larger-than-memory analyses to solve 2TB problems, not 16GB ones!

Extensions

Community extensions enable DuckDB to read additional formats and are provided through a pip-like package repository.

Takeaways

Attendees will learn how to install and use DuckDB locally, how to integrate it seamlessly in their existing Python scripts or Jupyter Notebooks, and how to smoothly manage the deluge of files in their workflow.

Transcript

0:00Good afternoon. For our next session here, Alex Monahan will be talking about taming file zoos. Uh, Alex says he probably won't have time to take questions in here, but he is willing to take questions out in the hall afterwards. So, let's give Alex a big hand. Thank you. All right. Welcome everybody.

0:26Thanks so much for being here towards the end of the conference. Really appreciate you. I'd love to talk to you today about taming file zoos: doing data science with DuckDB database files. First, I'll say howdy. Howdy. I'm Alex Monahan. My background is industrial and systems engineering from Virginia Tech, and then I spent nine years at Intel breaking into the data

0:47world, starting as an industrial engineer, becoming a data analyst and then a data scientist. In 2020, I discovered DuckDB and it was a perfect fit for what I was working on at Intel. So, I tweeted about it a ton and I actually got recruited to do some documentation and blogging by DuckDB Labs. DuckDB Labs is the company

1:09that's founded by the creators of DuckDB, Hannes Mühleisen and Mark Raasveldt. It's a group of 20 database researchers based in Amsterdam. And so, I continue to work there part-time.

1:21And then about two years ago, I decided to do all DuckDB all the time and I joined a company called MotherDuck.

1:27MotherDuck is building a cloud data warehouse with DuckDB at its core. And we think having DuckDB as our core means that we can be fundamentally more efficient and more performant as long as you don't have Google levels of scale.

1:40We think there's so many companies out there, so many people where, uh, we can do things more efficiently as long as you don't have petabytes or exabytes. And part of our secret sauce is that we actually can use a combination of server-side and local resources even in the same SQL query. So you can run one query and part

1:58of it will run on our server, part of it will run on your laptop. And the part that's on your laptop is free and with no network lag. So a lot of cool opportunities for advanced data visualization, really fast interactivity. But I'm here today on behalf of DuckDB Labs talking all about the open source project DuckDB.

2:19So in data science, if all of our inputs were perfectly clean, all in one file, our lives would be easy. But we know that one file is easy, but files in the aggregate are really hard. You could have all different kinds of file formats. You've got flat files of all different shapes. You've got binary files like Parquet, table lakehouse

2:41formats like Iceberg and Delta Lake, and then a huge long tail of other places that data can come from: spreadsheets, statistical files, and if you're in the geospatial industry, you've got quite a lot of files that could be in use. Not only that, some of these files can be huge or you can have thousands of them and you really need the ability to

3:01handle all of those edge cases to get the insights out of the data to help the business. Not only is there a huge variety of files, they can live everywhere. They're probably not all already on your laptop when you start the project. They're scattered everywhere, across all the different clouds. And sometimes when you get a CSV

3:21from a stakeholder, you might say, "That's no CSV. That's an abomination." And you'd be right. Sometimes data is messy. And we need to be prepared for that and be able to handle it. And sometimes the problem is actually past me. Sometimes I am the one creating the file zoo because I'm iterating rapidly through my data science workflow. So, I could have

3:44final. I could have final final. I could have final final final V3. Any other final finals? Anybody raise your hand also if you've had at least a V4. Okay, that counts. All right. All right. Yep.

3:56We've all been there. We build this ourselves because this is an iterative process. And this is the sequel to that.

4:03Sometimes you have to fix just a portion of your analysis. And so now one of your files stays the same, but you actually should use this other file over here.

4:11It gets to be tougher and tougher to keep track of as the project grows and grows. Has anyone ever crashed Python?

4:19Accidentally hit control C when you didn't want to. Maybe your laptop battery died. Well, if you were in the middle of writing out a file, good luck reading that file. That file is going to be corrupted. It's going to be broken halfway through and it's going to be difficult to recover any useful information out of it.

4:35And if you're lucky enough to be using Parquet files, one does not simply edit a Parquet file. You must rewrite the entire file or create another one and use both. So with Parquet, how do you output just a tiny bit more data? Well, yeah, I'm going to need you to work the weekend. I mean, I'm going to need you

4:52to replace the entire file or just add another one and another one and another one. You can solve some of those problems with a flat file. Sure, just add another row. But what data type was that?

5:05Who knows? And I hope you didn't want it to go fast. It's not very fast to write out a fully flat text file, especially with large data sets that we're working with. You could use a table lakehouse format. Sure, they'd be happy to add another parquet file under the hood, but then you got a lot of metadata you've

5:21got to write in a lot of different places. You got to talk to your catalog.

5:24It's not necessarily a fast or easy process either. So let's talk a little bit about DuckDB and then we'll talk specifically about how it can help with these file zoos. So DuckDB is an open-source analytical database as a library. The analogy we like to use is we like to call ourselves the SQLite for analytics. But let's unpack that

5:47phrase a little bit. It's open source. It's MIT licensed. You can use it for anything, commercial or otherwise. And it's an analytical database. So it's focused on large bulk transactions, the things you're doing in data science, processing a billion rows, processing terabytes of data, and it's built on cutting edge research from the database research community. But being cutting edge is not

6:10enough. You don't want to be the one being cut by the cutting edge. It has to be simple. It has to be robust. It's got to be production ready. And DuckDB went 1.0 almost a year ago. And we also take ease of use very, very seriously. That's why we're packaged as a library. You don't have to have a separate server

6:27running your DuckDB. You just import duckdb at the top of your Python file and you use it like you would any dataframe library or SQLite. And that means it's an in-process database. Another word for that is an embedded database. So it lives inside the same Python process as your Python interpreter, which gives it some extra superpowers we'll talk

6:46about. So to use DuckDB you pip install duckdb. It's about a 20 megabyte file.

6:51It's pre-compiled for almost every platform, except there's always one that's a little different, but pre-compiled for all the ones, um, that we know about, and it has zero dependencies. So, it's very easy to add to your existing project. We won't cause any dependency conflicts. Then to use it, you just import duckdb. You can

7:10create a connection. You can connect in in-memory mode if you'd like. But in this case, we're going to connect just to a path to a file. And if the file doesn't exist, we'll go ahead and create it just like with SQLite. And then that will be your persistent DuckDB file that you get to work with. And at

7:24that point then you can start analyzing data. You can run SQL directly pulling from CSV files or use our relational API. So just a few lines to get up and running with DuckDB. Not only can you use it in any .py script that you'd like, we also have a number of different ways you can work with DuckDB in notebooks. So this is a

7:46look at a SQL IDE. This comes bundled with the DuckDB command line tool that you can download separately. And this is the greatest way to author SQL that I've ever seen. It is so fast. We sample your data. When you go into this instant SQL mode, we sample your data and we can run the SQL statement on

8:06every keystroke. We can parse it. We can bind it. We can look for error messages.

8:11And we also will even allow you to inspect all of your CTEs with a click of a button. So you can actually see all of the intermediate steps that produce your final output. This to me solves a huge problem with debugging SQL. So if you're a SQL fan, give it a try. It's free to use. Just check out the DuckDB

8:31CLI. But we're at PyCon. So if you want to mix and match SQL and Python, Jupyter is a great place to do that. You can download, um, another library called JupySQL and you can turn a cell into a SQL cell with just %%sql, and then you get nice syntax highlighting and you can have that return a data

8:49frame out of it and you can just alternate back and forth, one cell of Python, one cell of SQL. Similarly, there's a modern notebook on the market now, marimo. It's open source and this actually has DuckDB as a first class citizen where there are DuckDB-typed blocks, uh, in marimo. Uh, they're here around the conference. I

9:09encourage you to find them. Another superpower of DuckDB that really makes it helpful for data science is that we don't believe that you should have to wholesale switch to DuckDB. We want to interoperate with the ecosystem, with the community. So this is an example of interoperating with pandas. You can create a pandas dataframe and then you

9:29can say duckdb.sql, select the average of a from my_df. And you might say, Alex, what's happening there? I didn't pass in that variable. How did it figure out where to read this data from? That's just a string. Well, what we actually do is the first thing we check is, is my_df a table in your database? And if

9:48it's not, we don't give up. We check your Python local variables and we look for a variable named my_df. And then we check the class. Is it a class that we know how to read? Is it pandas? Is it Polars? Is it Apache Arrow? If it's any of those, we can both read and write those formats.

10:06And it's zero copy because we are an embedded in-process database. We already live right where they are. We don't have to copy it first. Don't have to send it over a socket. So this is very very fast. So it's truly a great design pattern actually to alternate between a dataframe line of code, SQL line of code, dataframe again. There's really

10:25no overhead. As you can see at the very end of this I can get things right back in a pandas dataframe with .df().

10:34So that's a little bit about DuckDB. How can it help us as we're fighting through and taming these file zoos? It's not a coincidence that DuckDB can read all the files on this slide because I built this slide. But an important thing to note here is that this is not just developed by the core, uh, DuckDB developers. Uh, these

10:56also come from community extensions. So all of the statistical package readers came from a community extension. The Google Sheets reader and writer also from a community extension. So DuckDB is very extensible. So if there's a file format that you don't see up there that is important for you, there's a path. So once we can read these files,

11:15it's not enough to just be able to read it. We don't want this to be a fight or a struggle. We want this to be seamless.

11:21So DuckDB has SQL as we'll see, but we also have a full relational API. So it can feel much more like Python. So in this case, I import duckdb. I connect and then I can use read_csv. Looks a lot like the pandas function we were inspired by. And not only can you point to one file, you can point to a whole

11:40glob of files. So a whole directory of files. And we will look at those files.

11:44We will sniff them. We will look at the column names and the data types. We'll deduce them automatically and pull them in for you to interact with. DuckDB operates lazily. So we don't read in all those files all at once. So you can do things like just show me the top 100 rows and we'll only read 100 rows. Um, or

12:03like in this last step here, we're going to go and create a DuckDB table based on that data. But we're not going to read everything into memory and then write it out to DuckDB. We're going to stream it straight from one file to the other so you can work with arbitrarily large files and convert them into the

12:17compressed DuckDB format so you can work with them over and over. Another alternative to the relational API is the friendliest SQL in the world.

12:27Um, and I think that sounds like a bold statement, but I truly believe it. You can see an example here at the beginning. Select star is optional in DuckDB. Why do we need select star? Means you want everything. Okay. Well, we'll just assume you want everything and if you want less, you can tell us. So we

12:44start with from and then instead of having a special copy statement to then you know exactly specify how you want to read these particular files you can just point us at a path and we'll treat it like a table and we'll deduce all the column names and all the column types and just read it. Once again this is lazy so you can

13:03just preview the files if you'd like or you can fully analyze them. And then you can see this last SQL statement is creating a table based on this and it's doing the same thing as before.

13:11Streaming the data from the CSVs straight into a DuckDB format file. And these files live everywhere.

13:19DuckDB can pull from all the major clouds. You can pull from all the major clouds in the same query if you'd like, mixing and matching, as well as some of the other places like MinIO or Cloudflare R2. And that same friendly SQL also works in the cloud. So it's not any more complicated to work with data scattered

13:38everywhere that we have to deal with as data scientists. We can create a secret.

13:43This is an S3 type secret. And we just tell it to use our credential chain, which means we already did single sign-on with AWS, aws sso login. How many times have we typed that? And instantly we'll use those credentials to access AWS for you. And then the same SQL works. All you change is the path: point

14:01to the S3 path. And once again, you can glob. So you can look at an entire bucket or the files all in one statement. I also want to make the case to you that DuckDB is the number one way to read CSV files. And I want to show you a benchmark of it. And you might be thinking, oh, does that mean

14:18that uh it's the fastest? Uh I'm not going to try and make the case that we're the absolute fastest, but we care deeply about performance. And I'm not going to try and make the case to you that we're going to be the lowest memory. Although we are streaming off of disk and streaming back to disk. We are

14:32deliberately very wise with our memory management. I'm going to try and make the case to you that we're leading in the metric of most likely to actually read the file. And you might say to me, Alex, how hard could it be? It's a CSV file. It's got commas. It's got new lines. How hard could it be? And if you've seen this

14:51show, Top Gear, whenever Jeremy Clarkson says this, it means he's really in for it. Okay? It's never as easy as you think.

14:59So there's actually a benchmark for this, the Pollock benchmark. And they looked at 245,000 CSV files. What a job.

15:07245,000 CSV files from government entities across six continents. So this is not just CSV files uploaded by, you know, random people on the internet like me. This is from our governments of the world. And they found some messes. So this is one example. Some rows are just missing the last column. And it might be because these train stations, you know,

15:28had no delay. So they saved like three whole characters by not including a delay of zero. Like, thank you. You broke half the CSV readers out there.

15:36Really appreciate that. My favorite thing whenever I export things from my bank, they've got like 10 header rows that tell me nothing. Why is that? This is not helpful. I just want one header row and then I want my data. Not only that, maybe you're editing this file iteratively back and forth, different operating systems. You have different newline characters. What

15:56a frustrating tiny detail that can also still break things. Or maybe you just don't bother escaping quotes, so we have no idea where the end is to any of these strings. There is no escape. And some of these CSV files don't even use commas to separate the values. What does what does CSV even mean? What do words mean? What do

16:21acronyms mean? Do we believe in anything? But maybe maybe we'll allow it. Maybe we'll allow it as long as it's ducks. So this is a look at the benchmark results. DuckDB is actually the most likely to read your files. But this is not competing against formats you've never heard of. This is the crème de la crème of dealing with messy data.

16:44These are the tools that we reach for when we see these kinds of problems because they're great. You got SQLite.

16:49You've got LibreOffice. You have a mysteriously named SpreadDesktop, an industry-leading desktop spreadsheet. You have a mysteriously named SpreadWeb, an industry-leading web-based spreadsheet that we shall not disclose to protect the nameless, the innocent, excuse me. Or Python, either directly with the native CSV reader in Python or with pandas. And we were inspired by the way that pandas parses CSVs, but we

17:12recognized that that is a huge thing people love pandas for. And we also wanted to really be investing in the things that are actually practically useful in data science, which is dealing with messy data. So now that we've read a variety of files, they live in a variety of clouds and we've read the dirtiest, grungiest CSVs that our governments can

17:34provide us. Now, how do we not create a file zoo ourselves? And we can do that by using the all-in-one DuckDB file format. So let's take a look at what that format looks like under the hood. So this is looking at one database file. I like to use the extension .db, but you can really use any extension

17:51you'd like. So get creative. And within that, you can store as many tables of data as you'd like.

17:58And in addition to just storing tables, we can also store a lot of metadata. We can store SQL views, SQL functions to actually store your processing logic on top of those tables. You can store primary and foreign key relationships.

18:09So you can define how things are actually meant to map together between those tables.

18:15But if we look specifically at tables like table one, you can see it's broken up into chunks. Chunks of rows. And so let's look deeper at row number one. So row group number one, this is about 120,000 rows in a row group. And what we do is we store things column by column within this row group. So DuckDB is a

18:33columnar file format. And what that allows us to do is compress the data really well. In a row-based format like SQLite, you might have a column that is a date next to an integer next to a string next to a float and those don't have a lot in common, which means they're really hard to compress. But with DuckDB being a columnar format,

18:53integers all look pretty similar. Dates tend to all be very close together, so they compress very, very well. So you'll typically see 5 to 10x compression compared with a row oriented format. But not only that, we also don't want to just store data. We also want to store metadata. So you can see that we store the minimum and maximum of each um

19:13column in this row group. What that allows us to do is if we're running a query where I want to look at table one where column one is greater than 9,000.

19:22So over 9,000, I can skip this entire chunk because I already know everything here is 9,000 or lower. That allows us to only read off of disk what we actually need for the individual query that we have.
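To make the skipping idea concrete, here is a toy model in plain Python. It is not DuckDB's implementation: the `RowGroup` class and `scan_greater_than` function are hypothetical names, and real row groups hold far more rows.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """One chunk of a single column plus its min/max statistics."""

    values: list
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        # Statistics are computed once, when the chunk is written.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(row_groups, threshold):
    """Return matching values and how many row groups were actually read."""
    matches = []
    groups_read = 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # nothing in this group can match, so skip it entirely
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [
    RowGroup([10, 500, 9000]),
    RowGroup([8000, 9500, 12000]),
    RowGroup([1, 2, 3]),
]
matches, groups_read = scan_greater_than(row_groups, 9000)
```

With a filter of "greater than 9,000", only the second group is read; the other two are ruled out from their maximum alone.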

19:36There's a lot of other things to like beyond just the format itself. It's MIT licensed, which means it's totally open source. You can see the full implementation. Uh, and it's backwards compatible. So since DuckDB went 1.0 last year, we can actually read all those files with the latest DuckDB version, even back before 1.0. And we're committed to

19:55maintaining that for several years. We think that's really really important if you want to store files for a long period of time.

20:02It's also a cross language format. You can read it in 15 different languages. Python, R, Java, JavaScript, Rust, Go, C++, C, really across the board. Makes it really useful to pass to any of your colleagues. As we saw, it's columnar and it's compressed. It's compressed using techniques that are called lightweight compression, which allows it to be very fast to read and

20:24write as well for your CPU. And as we saw, it's a single file. You can store tables, many tables, and relationships between them. And as a part of that, we haven't really talked about this part yet, is that it is actually editable.

20:37So, because we want to be able to store multiple tables, we need to be able to write to this file multiple times. But not only can we append to this file, you can delete individual rows, you can update rows, you can add columns, you can really fully and completely modify this file, which makes it very, very flexible, especially if you're working

20:54locally. And DuckDB is really designed... Part of the reason DuckDB exists is because Hannes and Mark went and met actual data scientists. You know, computer scientists went and talked to people.

21:10It's not always common. And the goal was to understand why they didn't love databases. Hey, we're database people.

21:18You're data science people. Don't you love databases? Aren't they so cool? And they looked at and said, well, no, not really. We don't really love databases.

21:24They add a lot of overhead. They're very painful. And so that's why we have um

21:30such an easy to install, uh, format. That's why we operate in process instead of a separate server. So we've tried to learn a lot from the data science community in the core design of DuckDB. Uh, we also want to bring some of that database research over into data science and some of these hard-won kind

21:46of principles and properties of databases, and some of those are the ACID principles.

21:52So ACID is an acronym and it's all about just leveraging some of the hard-won database wisdom of just error cases that are very tough. And so the first one, A, stands for atomicity and this means that change is all or nothing. So you can set up a group of changes where you want to insert into several

22:10different tables. Maybe you want to delete from a table and insert and you want to make sure that you don't do just one of those things and not the other.

22:18It's either all or nothing. That's really helpful if you want to do incremental file import where you don't want to have duplicates by accident. You want to delete and insert atomically.

22:28It's very very helpful. How else can it help in data science? Well, there's also the ability to undo changes. You can begin a transaction, you can insert your data, and then you can test it. And if your tests fail, you can roll it back and your data is as if it was never inserted. So it's very powerful to

22:47have that kind of undo primitive all the way down to storage on disk. C is for consistency. This helps prevent data quality problems before they happen. And this is largely based around primary and foreign keys. Primary keys allow us to detect duplicates and let you know when things are being duplicated. And then foreign keys allow us to let you know when tables are

23:08missing data that should be there in other tables. So, if I go and insert a set of order data that happen to be ordered by certain customer IDs, I might want to check if I have that customer ID in my customers table or if maybe there's some kind of issue in my data.

23:22And this allows it to be an upfront error rather than silent data corruption where I just have missing data. Isolation is all about allowing multiple queries to run at the same time without interfering with one another. So that means that I can be inserting data with one query and reading data with another and I won't read that data

23:41that's being inserted until it's committed. So there's no cross talk between queries. DuckDB has a full multi-version concurrency control system.

23:50So as a data scientist, I can do things like be updating in the background and visualizing at the same time and I won't get partially updated data while I do that.

24:00Durability means that once you commit that data, once you finish that insert, once you commit that transaction, your data is safe. We're going to write it all the way to disk to where it's ones and zeros on some sort of storage medium and not in some operating system cache.

24:14And that means that we can prevent file corruption. And because we have this kind of snapshot commit process as well, if your battery dies midway through writing data, we have the ability to roll it back and we can, uh, allow that file to continue to be used without being corrupted. So in summary, DuckDB can really help bring

24:35order to the chaos that we dig through to add value as data scientists.

24:40We can read a huge variety of data formats. We can read them all across the clouds and we can tame some of these really wild real world CSVs that are out there. Not only that, once we're done, we can output all of our results all in one file. So that way it's easy to transfer to our stakeholders, easy for

25:00future me. It's also built and designed for speed for the bulk transactions we do in data science because it's columnar and it's compressed.

25:14And we can still edit it. So we can work incrementally. We can add columns. We can add rows. Excuse me. And finally, we can depend on the hard-won database wisdom of

25:33ACID. And so with that, I'd like to say thank you very much for your time and your attention and personally welcome you to the flock.

25:40[Applause]

25:49All right. I'm happy to answer questions now if we have time. Otherwise, outside. Yeah, we actually have about four or five minutes. If there are any questions, we can take a few here and then more outside.

26:03I'm curious if you could briefly speak to some of the GIS features you mentioned kind of in the intro. Hm, maybe come talk to me afterwards and I can at least find some resources for you. It's a deep field. It's an entire field.

26:16Um, we have a spatial extension. We integrate with GDAL and so we are able to read all the formats that it can, and then we have some geospatial data types that allow you to do some, uh, efficient processing. But come find me. I am actually using DuckDB to a small degree because of a library that I

26:34use and love. Um, my biggest hesitancy is actually having to write SQL in quotes. You ever thought about creating like a lambda style syntax for this language, you know, like LINQ, lambda, something along that line? Probably too many years of looking at vulnerable PHP code.

26:53I can't stand SQL and quotes. Sure. Yeah, we do have the ability to do prepared statements. So, you can parameterize and the quotes will be escaped to prevent SQL injection.

27:04Um, we also have our relational API. We have an experimental PySpark API as well. Um, and I think the SQL notebooks are also a really great way around that as well. You can have those SQL

27:17cells. Excuse me.

27:25All right. Other questions?

27:31Hey. Hello. I have one question. Can you use this tool to write back into a SQL database or just kind of analyze data pulling out of and second question I don't know if I remember if you show can we read Excel files or just CSVs?

27:47Great question. Yes, we can actually read and write to multiple different SQL databases. Excuse me. Um, we can read and write to MySQL, Postgres and SQLite. We can also read from BigQuery.

28:02Um, we can also read Apache Arrow, which means that you can pull from Snowflake with ADBC and you can read that very, uh, nicely with DuckDB as well. Um, we can also read and write Excel files. Uh, so we do have a full, um, Excel reader and writer. Great questions. And I didn't plant him, but I didn't mean to talk about that on our

28:23slide. So, thank you. All right. Well, thank you all so much. I'll be out in the hallway.

28:29Cheers. [Applause]

FAQS

Why is DuckDB the best tool for reading messy CSV files?

In the Pollock benchmark testing 245,000 CSV files from government entities across six continents, DuckDB was the most likely to successfully read the files, outperforming SQLite, LibreOffice, Pandas, Python's native CSV reader, and leading spreadsheet applications. DuckDB handles common real-world issues like missing columns, multiple header rows, mixed newline characters, unescaped quotes, and non-comma delimiters.

What are the advantages of the DuckDB database file format for data science?

The DuckDB file format stores multiple tables, views, SQL functions, and primary/foreign key relationships in a single file. It uses columnar storage with lightweight compression, achieving 5-10x compression compared to row-oriented formats. Unlike Parquet, DuckDB files are fully editable: you can insert, update, delete rows, and add columns. The format also stores min/max metadata per row group, so DuckDB can automatically skip irrelevant data during queries.

How does DuckDB provide ACID guarantees for data science workflows?

DuckDB has full ACID transactions that protect against common data science problems. Atomicity means changes are all-or-nothing, preventing duplicate data during incremental imports. Consistency uses primary and foreign keys to detect duplicates and missing relationships. Isolation via multi-version concurrency control lets you update data in the background while visualizing at the same time. Durability writes committed data to disk, preventing file corruption if your laptop battery dies mid-write.

How do you read files from S3 and multiple cloud providers with DuckDB?

DuckDB can read files from all major cloud providers (AWS S3, GCS, Azure, Cloudflare R2, MinIO) using the same SQL syntax. You create a secret with your credentials, for example using your existing AWS SSO login, and then reference S3 paths directly in queries as if they were local tables. You can glob entire directories, mix files from different clouds in the same query, and DuckDB reads lazily so you can preview large datasets without loading everything into memory.
