Why CSVs Still Matter: The Indispensable File Format
2024/05/07. Featuring Mehdi Ouazza and Pedro Holanda from DuckDB. In this discussion they talk about all things CSV and dive into a pragmatic example of how the CSV parser works in DuckDB. Enjoy!
Transcript
0:00 Why are CSVs still around even though we have more efficient file formats like Parquet, and where do CSVs still play a crucial role in our data stack? Well, that's what we are going to answer today with Pedro from DuckDB Labs, and we'll even show you a pragmatic example, so you can see that you can basically throw anything at it
0:19 and it will parse it nicely for you. But should you throw such a bad CSV at it in the first place? Hey guys, how is everyone doing? Pleasure to be here. I'm very thankful for Mehdi's invitation. Unfortunately, I love to talk about CSV files and parsing, so... yeah. I bet you've
0:43 been rocking at parties when you talk about that. Yeah, man, what can I say: people just start showing their weird files to me. But people actually get super passionate, because sometimes I start showing what I believe are weird files and then someone in the room says, "once I was at
1:05 this company and I saw this and that, even weirder." I guess... can you please give us a few lines of intro about yourself and your background? So you are at DuckDB Labs. Maybe I should give some context, because we often get the confusion (I get it, do you get it too?) about
1:24 MotherDuck and DuckDB. Absolutely, it's very common that people think it's the same company. Yes, so DuckDB Labs is actually a different company from MotherDuck. MotherDuck is the US-based company which ships DuckDB in the cloud, and DuckDB Labs, well, Pedro is going to tell us all about it and what he's
1:45 doing at DuckDB Labs. Yeah, absolutely. So DuckDB itself started as a project at the research center called CWI, which stands for something like the center of mathematics and computer science. It was actually started by one of my best friends, Mark Raasveldt; we were PhD students together, so I did some
2:13 things in and out at the time. But then after a few years Mark and his supervisor Hannes realized it made sense to actually create a company, as it was starting to have some commercial interest: people were using this product that was still somewhat new and very academic. So they started DuckDB Labs, basically to provide more official commercial
2:38 support to the DuckDB software. And I was
2:43 still around at that point, so I basically joined the labs from the get-go, from day one. I think there were already like four or five people. So yeah, really from the start. And you're working with Mark, who is acting now more as a CTO, right? Yeah, Mark
3:06 is absolutely the CTO, the person that approves anything that gets in. I can
3:13 every now and then pressure him a little bit to merge my stuff when I want it, but no, he's the guy, he is the CTO of the company today. Basically that means he vets the technology side: a lot of the things we do and implement go through him. Yeah. And so when did the story of
3:32 CSVs with DuckDB actually begin, was it from the start? So this is actually kind of a funny question to me specifically, because DuckDB was maybe three weeks in, and I kind of joke that three weeks of Mark programming basically means like three years in a real company, so
3:53 although it was very much at an early stage, it could already do things like aggregations and groupings, you know, projections. It already had support for SQL to a certain degree, it had all these types of logical plan and physical plan optimizations, and so on. At that point I was like, oh, this is starting to be a
4:13 pretty cool project, I'm going to contribute something. But I also wanted to contribute something that was fairly easy, so I decided to go with a CSV reader, and not only because I wanted to do something easy, but also because at that point, if you think about it, with aggregations and groupings you
4:32 can already run a few TPC-H queries. TPC-H is this famous, important benchmark for data analysis, and because we were researchers we really wanted to know whether these operators were going in the right direction performance-wise. And at the time the data generator would produce TSV files, TSV because
4:53 they are tab-separated, and we just needed a way of ingesting that. So I actually created the very first COPY FROM and COPY TO of DuckDB, and if I think about the actual implementation, maybe it took me three days; most of it was building the logical plan, physical plan, transformer
5:15 kind of stuff. The CSV parsing itself took me 20 minutes, so of course it sucked: it was very bare-bones, it could only really handle super well-constructed CSV files, with no newlines in quotes, no quoted newlines. So yeah,
5:35 for our purpose, which was loading that data, it was perfect. So actually my implementation stayed around for like a year and a bit.
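(For readers who want to try the same kind of ingestion today, here is a minimal sketch of a COPY FROM in DuckDB. The table and file names are illustrative, not the ones used back then:)

    -- load a tab-separated file into a pre-created table
    CREATE TABLE items (id BIGINT, quantity DOUBLE, ship_date DATE);
    COPY items FROM 'items.tsv' (DELIMITER E'\t');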
5:48 That's impressive, how a small critical feature can stay in a product: you bootstrap it quickly and it stays there for a pretty long time, right? Because that was the entry point for people. Since I
6:10 was not there, what time frame are we talking about here? We're talking 2018, right. So at that point
6:18 we were basically the only ones using it. It was also kind of a free-for-all; the dev infrastructure was as academic as you can imagine, just pushing stuff to main and hoping that everything goes correctly. So this is pre any kind of release, there was nothing there yet. And because the
6:36 only purpose was to load this specific type of data, I basically didn't care about performance; I literally wrote the simplest reader you could imagine. And that was fine, because it was basically used in our CI and maybe some benchmarking, and what we were actually interested in benchmarking was not the loader. Only when, I
6:57 think, it was either the first or the second release, somewhere around there, it was like: okay, maybe we need to rewrite it to be a bit more careful, have a bit more focus on performance, a bit more focus on what we are actually bringing from the file into memory, these kinds of smaller
7:15 details. So that's why I say it took a year, but it's also a year where we were basically the only ones using it. Yeah. Okay, so you mentioned unit tests, right, and CI, and I
7:29 think this is... I'm curious where the audience is on this, whether people are still using CSV, maybe we can dive into that, but this is one use case I still see a lot: typically
7:44 small unit testing and CI. Do you still use CSV within the DuckDB repo for that? So, in a way... I'm not sure if you or the audience have ever seen the way we test things in DuckDB: we basically have what's called SQL tests, where you have a query and you validate that the result of
8:06 the query is correct, and the expected result is effectively written out in a CSV-like way. So it's used kind of all the time there. And you could think, "oh, these guys are data engineers and they're using CSV, why are they not doing Parquet result checking or whatever?" Well, the reason is that if you look at these
8:24 test files, you'll see that all these results are handwritten, and handwriting a CSV-like file is super easy, but how do you handwrite a Parquet file? You don't, right? A system generates it, it becomes a binary file you cannot interpret with a text editor; another system can read it, but as a human you can't really read it.
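(For context, not shown on stream: the DuckDB test suite uses a sqllogictest-style format, and a test file looks roughly like this sketch. The file name, query, and values are illustrative; the point Pedro makes is that the expected block under the dashes is plain, hand-writable, tab-separated text:)

    # name: test/sql/csv/people_example.test
    # group: [csv]

    query TI
    SELECT name, age FROM read_csv('data/csv/people.csv') ORDER BY age;
    ----
    Jane	28
    John	30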
8:46 Yeah, and that's very true. I've been doing that, and I know a lot of data engineers are still doing this for that specific reason. Especially if you're writing unit tests you want a quick feedback loop, right? So even if you have a tool for writing to Parquet or whatever, you want to be able to write a
9:05 specific case and then assert it quickly as you code. So this is still a common standard. At least in the Python world, I know people are going more toward the data frame approach, so they basically code the data structure over there without
9:28 using CSV. But it has pros and cons and it depends on your use case, because sometimes you can have side effects with Arrow changing certain types, whereas if you specifically want to test something against source data in CSV, then yeah,
9:50 there is nothing better than handwriting it. You mentioned the benchmark, TPC-H. Can we
9:59 quickly talk a bit about this? Because I'm a bit familiar with it, but just in terms of research there are different kinds of data sets in TPC, right? So can you elaborate a bit on that, like why you picked specifically TPC-H? And maybe I can
10:23 share: there is an official website, which is
10:30 just tpc.org, where you have basically all the information and, I believe, the download of the data sets. But can you tell us a bit more about the different kinds of data sets we have here? Absolutely. So I think for data analysis specifically there are two benchmarks that are very much used in research:
10:52 TPC-H and TPC-DS. Basically there's a committee that defines exactly the queries of these benchmarks, what the data looks like, and all sorts of distributions of the data. And the reason TPC-H is very popular, for example, is that you have a limited number of queries, I think you have 21
11:13 queries, but the queries, although limited in quantity, are very diverse in terms of the operators and the things they test. So for example you have query six, where
11:29 you basically only have aggregations: you're doing sums and averages and whatnot over one table, so how fast your system is on that query defines how fast you can do aggregations. But then you have query one, which is aggregations with groupings, so you now have one extra operator that you
11:48 also need to check. And then you have all sorts of queries that will test things like join ordering or correlated subqueries, basically all these operations that are important for analytics. And then TPC-DS is kind of like TPC-H on steroids: you have, I don't know, maybe around a hundred queries, and you get more information about them. And
12:12 of course these are benchmarks, right? The thing that makes them special to academia is that there was a whole committee defining them, and that's why you don't really see that many papers using internally designed benchmarks or company-designed benchmarks: they usually don't get really popular in academia, because it's mostly one company that thought
12:31 it was interesting, but it was never vetted by a group of people. Yeah, that's a good point. It's actually a way for people to benchmark
12:44 against their own tool and the competition, right? So we've seen it all over the place, where we
12:50 compare apples to apples. But what's the downside of those benchmarks, do you have an opinion on that? Yeah, the reality is that, especially when you're doing system comparisons, that's super difficult: systems are optimized in different ways, and performing these benchmarks correctly is really not easy. That's
13:13 also why, the way this works, if you want your TPC results to be properly vetted you have to pay them a lot of money to get their stamp saying these are your actual benchmark numbers, because it's super complex. So anyone can run it however they want, right, but getting the actual vetted
13:33 results is quite complex. Can you tell us a bit... you mentioned when you started the first CSV loader for DuckDB... what is, in a nutshell,
13:47 the storyline, you know, from 2018 when you started to now? How do things look different? Absolutely. So in 2018 I did this first implementation, and then I basically didn't touch the CSV reader again until, like, last year; I was doing some other stuff. But in
14:10 2019 Mark reimplemented parts of it:
14:15 he basically added buffers to the
14:19 file reading, so now you read the file into buffers in memory, which gives you a lot of control over which pieces of data are actually in memory, and he implemented a bunch of different optimizations to make the parsing more efficient. I think it got like 5x faster, or something like that, compared to my original naive
14:38 implementation. Then, I think in 2020, someone who was a master's student at the time (and is a developer now)
14:48 actually implemented the CSV sniffer,
14:53 and the CSV sniffer I think is super interesting: it makes CSV files much easier to read and the whole process much easier, because it's a piece of code that's able to detect column names and column types, to detect whether certain rows are dirty, and to detect the CSV options, the delimiter, the quotes, what the escape character is, so the user doesn't have
15:16 to do that manually. I kind of joke that this happened in the Woodstock era of DuckDB, because we started getting all these CSV files that don't conform to any norms, all kinds of crazy stuff, and then it's like, okay, you need to survive the Woodstock era, so you kind of need to go in this direction. I
15:38 also joke that this is the CSV option explosion era, because I think we used to have maybe five options for CSV reading, basically the delimiter, quote, escape, whether there's a header, and the column types were actually inferred from the table that was already created. And then from these five options we ended up with like 30 options, right? So if you open
16:00 the docs of DuckDB today you'll see there's basically a huge list. Since I'm currently maintaining it, I'm a bit more reluctant about adding new options, so every now and then I get a request for a new option and I'm like, I don't know if we really need this one. But that's a
16:20 good point, because I feel like, as a data engineer, I sometimes... and say,
16:27 say hi in the chat if you've felt this too... end up in a situation where you build a lot of rules to basically be able to parse bad data sets, but at the very end the best way to fix it is to fix it upstream, right, to have a nice data set. So of course when you build a database you
16:48 need to have such a feature, but within a company you'd rather go talk to the producer, if you can, of course that's not always possible, and say, hey, can you actually do something better than that? There are two things that scare data engineers, by the way: FTP servers and CSVs. And those two combined
17:11 with an Excel sheet, I would say that's the bingo jackpot. But I totally agree; I think maybe you're confronted with such a decision, where you're building really deep custom rules and you think, is it really worth it, or should the effort go to other kinds of features?
17:32 So for CSV options specifically, of course it really helps that there has been an excellent tool for CSV parsing for many years, right: Pandas. Everyone coming from Python has used Pandas for the past few years, so usually if someone is asking me for something, I'm like: okay, what does Pandas do, do they have
17:53 this option or not? So it ends up being a great API reference in that sense. And I think the other two questions I ask myself before deciding to implement something are: do I already have this option in a different way, can I already get the same result by setting a different
18:14 option or a different combination of options? And also, how many people are asking for it? If it's one random person with a very specific use case, that's probably not something we'll look at immediately. Yeah, I see. Another fun story that
18:35 I saw online is this tweet
18:40 from Hannes (hello Hannes, if you're watching the stream, by the way), who is the co-creator of DuckDB with Mark, as we just mentioned. He mentioned that Rabobank, which is a Dutch bank,
18:57 exports CSVs that do not escape double quotes, which makes it impossible to correctly read those files. I think there are a lot of things to unpack there. The first thing is that I'm guessing I'm not the only one who wants their financial data in a clean format, so if you've been
19:18 struggling with exporting your financial data, even from big banks, it's actually pretty common. But I think the scariest part is the second one, the reply, right? Can we
19:36 talk about that? Why do you think they replied this way, and why did they mention Excel? I mean, I think there's a reason for that. Of course, I'm going to go with a couple of assumptions here, but I imagine that Saskia is a person who is not really into IT or
19:57 programming, right? And I imagine that if you're not a person who is ingesting these files through a system and trying to really reach peak performance, opening them in Excel is perfectly valid. And the reason you can open even extremely broken CSVs in Excel is that there are no restrictions
20:20 on your schema at all: you can have columns with multiple data types, you can have rows that miss columns or have extra columns or whatever, you can even have CSVs with multiple tables, and you're still going to be able to open them in Excel, right? So I imagine that for someone who's completely outside this space, it's like:
20:40 how do you have problems opening this? Just use Excel. It sounds funny to us, and it's also
20:50 interesting that you see, for example, that this is kind of a basic error: are you not quoting,
20:59 yeah, escaping your quotes, right? How can this happen? But I'm pretty sure Rabobank has a CSV exporter but
21:08 not an importer (you can't import new transactions and give yourself more money), so they only had to write a writer, not a reader. These things are easy to miss if you're not super familiar with the file format and if you've never actually tried to write a reader. So I understand
21:28 them; I thought it was hilarious when I saw this as well. I think you mentioned... I think there are two really interesting points here. One is that Excel really allows a lot of freedom, for better or worse. At the end,
21:47 as you mentioned, this person was probably not technical; your hypothesis is that it's someone managing the support on the social account. But the answer, in the end, is still correct, right? It sounds silly, but it's also easy: yeah, why would you not load it in Excel? And probably, I think,
22:11 companies, and maybe people like Saskia, should have a lower technical barrier to entry with other tools, right? Because DuckDB is really awesome, and I would use it any time of the day to analyze my financial data, but it's not as approachable as Excel, right? Have you
22:36 thought about that a bit, around DuckDB itself: how people are using CSVs, what kind of users are using it, and whether we need to make a simple visual interface for DuckDB? I don't know, I'm just brainstorming here. Because I think the thing
22:58 that makes Excel
23:02 Excel is that they have this no-schema thing, right, which DuckDB can't have: a relational database is very much against this Woodstock kind of vibe, you need things very well defined, they all need to fit in squares. So you have to give up this freedom to reach performance; it's simply the way it works,
23:23 you cannot have both. So that's one thing. The other one, of course, is that Excel is an excellent visualization tool, it's really focused on that. We do have, I think, people who are building visualization tools on top of DuckDB (I don't have a name in my head, but I've seen them around) that could maybe fill
23:47 that need as a whole. But I think it's also that Excel just allows you to do whatever you want: you can change cells to whatever you want, you can, again, have more than one table in the same sheet. And I feel like, as a user, this is of course extremely powerful if you don't really care about performance:
24:09 if your file has a couple thousand lines, do you really care about crunching these numbers as fast as possible? Probably not. You care that you manage to do it; I think that becomes the more important part there. Yeah, I kind of have a love-and-hate relationship with Excel and CSV in general. I hate them but I
24:33 love them; we always end up getting them around the data ecosystem, and there are still a lot of government and public data sets using Excel and CSV
24:47 heavily. And I think, again, it's just a matter of usage: it's easy, you can just drop a file in. So one big takeaway of this session is probably that CSVs are also kind of linked to Excel, which has a huge
25:06 base of users, and the way people sometimes communicate with the IT department from an Excel sheet is often via a CSV. Unless you've seen something else? No, no, I fully agree. I think especially when you don't have... oh, it might be a good time
25:27 to introduce him. So this is my coding buddy, this is The Barking Analyst; he's the reason I work, actually, to pay his bills. But no, absolutely, I think a lot of this data ends up being generated by Excel, or by hand, or by systems whose only way of exporting data is as
25:51 CSV files. I think another big point of CSV files is that if they are well constructed, it's pretty easy to write a parser for them, right? As I said, I did it in like half an hour, not a great one, but if you're starting a system you can kind of trust that any
26:11 system will read CSV files. Nowadays it's also becoming true that you can expect most systems to read Parquet files, but even a few years back that would not have been the case. Yeah, it's just this guarantee that anyone can operate on them. Indeed. And I'm wondering, I mean, maybe we're
26:34 going really far with the discussion here, whether there will one day be a better file format that
26:42 has all the benefits of CSV without the downsides. I guess a file format is always a compromise, right? But I can tell you my beliefs. Oh yeah, go ahead if you have some thoughts on that. So I think CSV files will stay here forever; there will always be this mess, and
27:02 you just kind of have to deal with it. You know, if my grandma can open a CSV file and change it, that thing is not going to die. And if you go to something like a Parquet file, that's a performance-driven file format. Why is Parquet interesting? Because data is stored in a columnar format, so if you have an
27:22 analytical engine you can access it faster, and it does a bunch of different smart combinations of compression algorithms to make it smaller, and you also have partition information, right? But the thing is, all these performance-driven techniques can change: next year maybe someone at CWI, for example, develops a new compression algorithm for floating points (actually they did,
27:46 this year or last year), and now this becomes interesting for the file format because you can compress it even further. So I don't think Parquet will necessarily still be here in 10 years, for example; it might be something else. But I think CSV files will always be here, more or less in the format they are now, because
28:07 they already completely fulfill that spectrum of being very visual, very easy to use, you can open them anywhere, it's just a text file. Yeah, okay, that's a good hot take. Let's talk again in 10 years, if you're still maybe working on this, Pedro. But I think it's an interesting take that Parquet may not be there in 10
28:31 years but CSV will remain, and I think it's fair, given just how
28:37 certain governments and entities use data today and will probably not adopt those kinds of performance-oriented formats. So before we dive a bit into some CSV examples, the last thing I want to talk about is this tweet, another tweet;
28:58 I've been spying on you, Pedro. I think it was interesting because you were talking about a lecture you were giving, and basically someone asked: "I don't understand why you don't integrate an existing tool rather than building your own parser from scratch." Can
29:22 you... you answered here, but can you go a bit deeper on that?
29:29 I think, first off, this is a totally valid question, right? Why reinvent the wheel if there's something good enough out there? And I gave a couple of reasons. So first off, DuckDB, by philosophy or design, is supposed to be this very lightweight, dependency-free project,
29:48 so we are very careful: we do have a few dependencies here and there, but they get inlined, and inlining a dependency can be a very difficult process. It also becomes very difficult to update it, so you kind of end up with a static dependency, which is a bit problematic. That's why,
30:09 before we decide, one of the questions is whether we really need that piece of library: if it's really going to save us time and we're never going to modify it, then it becomes a pretty good deal. For example, for the extension we have for time zones, there's a pretty good library that it uses, and we're not going to
30:26 modify it. The other reason is that in DuckDB you of course have a binding process, and this binding process is where you define, at the very least, the column names and types, and only later on do you start actually reading the file. How can you extract this binding information from existing tools? It becomes
30:49 very problematic: do they even do something like that, do they do type inference? I'm not sure. The other reason I gave is vectorized query execution: in DuckDB the execution is vectorized, which basically means you go through small chunks of the table at a time, and that's very important because it enables cache locality, and it
31:10 ends up eliminating a huge bottleneck of data analysis. And then you need to be able to also produce your data in chunks. And then memory management, again: DuckDB runs in these very limited memory scenarios, right, where maybe you can only spare a couple hundred megabytes of memory
31:32 or whatever, so you need pretty fine-grained control of what you bring into memory, what you're keeping, what you're removing. How do you do this if there's another library that actually has full control of the file access and of what gets in there? And I could probably come up with 15 more reasons, but I think these
31:50 are the main ones. No, those are some valid ones, and I think the low-dependency approach has been appreciated in general in the data ecosystem, because the general data ecosystem, at the tooling or software level, is already deeply integrated, sometimes for the worse, because when something doesn't work with
32:15 other things after an update... And as you go lower into that stack, from, I would say, vendors to programming languages and databases, it's really nice that you can just install DuckDB as a single binary and it has everything it needs. So we have about
32:37 15 minutes. What about diving into some examples of what
32:46 DuckDB is capable of parsing? So, I'm
32:51 not sure, should we build the CSV example
32:55 together? Do you see my screen correctly? Yeah, zoom out a bit. So I have a DuckDB shell installed and I have a CSV file here. What do you suggest, do you actually have an example to share? And one reference we can give to people trying
33:22 out DuckDB: there is the documentation page for CSV import, there is also a page for (let me open that) CSV
33:34 export, which is just done through a COPY statement, and then there is
33:42 also a blog post that you shared. This one was back in
33:53 October, where you talk about the DuckDB CSV sniffer.
33:58 That's on the DuckDB website, you go to the blog. I'll share it directly in the comment section if you want to know more.
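(As a quick illustration of the export page just mentioned, a minimal sketch with illustrative table and file names: CSV export in DuckDB is indeed just a COPY statement.)

    -- write a query result (or a whole table) out as a CSV file
    COPY (SELECT * FROM people) TO 'people_export.csv' (HEADER, DELIMITER ',');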
34:04 But let's go a bit into the coding part here. So where do you suggest we start? I mean, again, just start with a very simple file, right? Maybe just create three columns, let's say name, age, and
34:29 height, or whatever, it doesn't really matter, and then we can just create a couple of rows for that. Oh, perfect. Oh wow, that's magic, man, what is that? That's Copilot. So you have the dog and I have the, you know, Copilot; it's just the VS Code Copilot. Oh, beautiful. Which is really fun, because you can try to
34:53 see, when you write some comments or whatever, whether it pops up some secrets. Just a small parenthesis here on what GitHub Copilot has been trained on, since there are a lot of leaked keys, for example, in public repos. Exactly. I've tried this kind
35:15 of thing before, but you can play around if you want. For example, let's see,
35:21 something like a key...
35:31 Yeah, so you see it's going to... and then, okay, here it suggests more.
35:39 The thing is, it already just generated a data set with keys in it; whether they are real or not... But yeah, that was just a small parenthesis. Let's come back to this. So we had two rows, and then, as I
36:00 explained, you have the sniffer for all of this, so you really have to set up nothing, hopefully. So if we do a SELECT * FROM and then the path between quotes, we should be able to get our result. Yeah.
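(For readers following along, a minimal sketch of the step just shown; the file name and contents are illustrative, roughly what gets typed on stream:)

    -- people.csv (illustrative contents)
    --   name,age,height
    --   "John",30,1.80
    --   "Jane",28,1.65

    -- no options needed: DuckDB sniffs the delimiter, header, and column types
    SELECT * FROM 'people.csv';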
36:21 And can you tell us a bit about what's happening behind the scenes here? So I just did this, this is my SQL, but I didn't specify any function, so what's happening behind the scenes? Yeah, absolutely. So basically what's happening here is that during the binding phase, when we're constructing the query plan, we already figure out what the table looks like.
36:42 In this case, the sniffer will go over... well, our CSV is very small, but it
36:50 goes over about the first 20,000 rows and tries to figure out the types. So, for example: okay, the first column is "John" and "Jane", so of course it's a string; the second column is numbers, so it's an int64; and the third column has a dot in it, so it's obviously a double. And
37:10 basically we go over those first 20,000 lines trying these types all the time: it starts with a boolean value and thinks, okay, can "John" fit in a boolean? No. Can "John" fit in an integer? No. Can "John" fit in a double? No. And so on and so forth, until
37:26 you end up with varchar. Also during this process it identifies things like what is being used as the delimiter: in our example we have a comma, but we could have, I don't know, a tab, a pipe, whatever, and that would also be identified. It also identifies what the quote character is:
37:46 here we have double quotes, but if we had single quotes that would also be detected. And one way you can see all of this is that I've implemented this function called sniff_csv. So if you just type sniff_csv before the path... no, just the path... yeah, s-n-
38:11 i-f-f, s-n-
38:15 i-f-f underscore csv, and then the path in parentheses. This
38:25 function basically runs only the sniffer part, the part that detects all the CSV options, and returns all the values to us. For example, it returns: this is the delimiter we detected, this is the quote we detected, and it has these columns, all these kinds of options that the sniffer goes over. And my favorite part of
38:47 it is that it also returns a prompt that you can use to disable the sniffer. Oh, okay. One reason that might be interesting: if you have a CSV file where the sniffing is very expensive, just grab the prompt, keep the sniffer turned off, and things go faster.
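(A minimal sketch of what this looks like; the file name is illustrative. sniff_csv is the table function Pedro describes, and, as he says, its Prompt column holds a ready-made read_csv call:)

    -- run only the dialect/type detection and inspect what was found
    FROM sniff_csv('people.csv');

    -- grab just the ready-made read_csv call, so later reads can skip sniffing
    SELECT Prompt FROM sniff_csv('people.csv');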
39:07 Okay, I learned something new. Something that's maybe also worth mentioning, because I've seen it improving over the years: there is a hook directly within DuckDB to detect that it's a CSV file. What's happening there? That one's a bit easy, actually: it just looks at the
39:31 file extension, right. Okay. But it also loads the
39:39 necessary extension, if you need any, depending on the file format. I think that's something... when I show it to people I say, oh, this is magic, but what if I want to pass some option when I'm reading a Parquet file or whatever? No, that's kind of a generic entry point where
40:01 you don't need to specify anything and the magic happens; but we can also just go to read_csv, and, going back to the documentation, we can specify the delimiter, header, columns, and so forth here. Yeah, and that's also exactly what that prompt will return to you.
40:27 We couldn't see it because it gets collapsed, but it basically returns the column definitions and everything. Yeah. So if I do SELECT Prompt like this, and maybe I can switch the output to one value per line so
40:43 that we can see it, this is what is being returned, right? Yes. So all these options that you see here: the delimiter, quote, escape, what the newline is; and this skip is how many rows you skip from the beginning of the file. For example, if our first rows just had
41:00 random comments, it will detect that they're probably dirty, so it skips them, plus the header, and the column names and types. Yeah, indeed.
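(A hedged sketch of what such a fully-specified call looks like; the exact values are illustrative, roughly what a prompt for the demo file would contain. With auto_detect turned off, the sniffer is skipped entirely:)

    FROM read_csv('people.csv',
        delim=',', quote='"', escape='"',
        new_line='\n', skip=0, header=true,
        columns={'name': 'VARCHAR', 'age': 'BIGINT', 'height': 'DOUBLE'},
        auto_detect=false);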
41:13 So let's break some stuff and see what else it's able to handle. We started with a really simple use case, so now let's see
41:28 what else is possible. Yeah, so one thing that's common, for example, is a row that's missing a column, so you just have the name and the age. That of course should throw an error at us, because then you're missing a value. What do you do, right?
41:50 So something like that? No, just an age instead of the height; 30, whatever.
41:58 Sorry, can you repeat that? So you remove the 1.65, the last column. Yeah, okay. And the comma too. Yeah. So this is
42:11 basically a file where the first row of data has three columns and the second row of data has two columns. Now, because our sniffer is also built in a way that tries not to break, this time when you run the FROM on the CSV it basically thinks everything is just one
42:31 column. So the first row has three values and the second row has two, but it just looks like it's all one column. So to actually force it to break, you need to define the delimiter in our query. To do that, yeah, so read_csv,
42:51 perfect, and then I think it's "delim" or... this is all stuff I don't
43:04 use from memory... it's just d-e-l-i-m, yeah, and
43:08 then the value,
43:12 exactly. Sorry, I'm zooming in a lot, that's why I'm just doing it like this. Let me double check.
43:24 Yeah, it's just delim, like
43:33 this. And now I have an error: the CSV
43:38 sniffer failed. Yeah, so it's basically
43:45 saying that with this delimiter we're giving it, it cannot auto-detect the other options, because with this delimiter we're going to have this issue, right? Okay, but if we add, for example, the option ignore_errors=true, hopefully our broken line will be skipped. So this is the option. Let's hope it works.
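(A minimal sketch of the call at this point in the demo; the file name is illustrative. With ignore_errors enabled, rows that don't fit the detected schema are skipped instead of failing the query:)

    FROM read_csv('people.csv', delim=',', ignore_errors=true);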
44:11 And yeah, so now we just got the first row, because the second row of data of course breaks it. What's also very common with this type of CSV file that's missing a column is that
44:33 people want them to just be null-padded, and null padding basically means that you replace the missing values of the columns with NULL. So if instead of ignore_errors you use null_padding, you get that effect as well. Basically what I'm trying to say here is that we also have a lot of options to try
44:52 to allow you to ingest all these different files.
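(A sketch of the alternative just described, again with an illustrative file name; null_padding fills the missing trailing columns with NULL instead of dropping the row:)

    FROM read_csv('people.csv', delim=',', null_padding=true);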
44:57 I'm curious what happens if I have another row like this... wait, with
45:06 both rows, or just the first one? ... Oh, well, that's beautiful, because this is something pretty common in CSVs: you have one row out of, like, a thousand which is broken for some weird reason, and it's breaking your full pipeline. So that's
45:30 interesting. And can you... sorry, is there a way to know which line is problematic? Because earlier it was saying... Yeah, so the reason we were not getting the line number there is that this error specifically
45:54 happens during sniffing, and the reason it happens during sniffing is that our CSV file is too small. In reality these CSV files are thousands of lines, and the sniffer only runs on a sample: okay, this is what we detected, these options look good here. And only later do you end
46:13 up hitting the bad stuff. So, for example, if you get the prompt on the first line and we use that, with the sniffer disabled, we can see exactly that behavior. And I think it's maybe also worth mentioning that I refactored all our error messages recently; it's going to
46:35 be out in the next release, and now all these error messages include not only the line number but the original CSV line, and also tips on how to fix it. So basically, if you got this error, it would say: hey, okay, it's missing a column, but maybe you can try
46:53 null_padding or ignore_errors. So it also tries to help the user. Yeah. I think if you get the row number, a lot of the work is already done, if the problem is isolated, but if you get a quick suggestion, that's even better, because I've been there: when
47:13 reading CSVs with different tools it's sometimes difficult to notice on which row there actually is a problem. So here I have
47:25 the prompt. You want to try this one? Yeah. So if we copy and paste that and just change the
47:35 delimiter to the comma, because since we ran the sniffer on the broken file it gave us something that isn't right, and the quotes basically to double quotes, I guess. Yes. Actually, now that I think about it, the sniffer of course gave us a lot of wrong things; we also have to redefine
47:56 the columns, because it tried to read everything as one column, so we need to also create this struct with name VARCHAR, age INTEGER, and height DOUBLE.
48:10 Yeah. And so that would look
48:15 like this... and then, yeah:
48:22 age INTEGER,
48:28 and height DOUBLE. Yeah, like this.
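(A hedged sketch of roughly what the corrected call ends up looking like at this point; the values are illustrative. The columns struct overrides the single-column guess the sniffer made on the broken file:)

    FROM read_csv('people.csv',
        delim=',', quote='"', header=true, auto_detect=false,
        columns={'name': 'VARCHAR', 'age': 'INTEGER', 'height': 'DOUBLE'});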
48:37 Exactly. Alright, so now we get a different error message. It first shows the options that were used... oh, it's already giving possible fixes. Are you using the latest version? Okay, it shows it there, 0.10.1. Yeah. So I extended this even more in the
48:55 version that's going to come out, but here, for example, it gives us the input error: now a CSV error on line three, and it is indeed our line three because we have the header, and it gives us a more precise error there, like: okay, we expected a different number of columns. Wow, okay, that's
49:14 super nice. We're already running out of time, so maybe let's do one more, I would say, dirty CSV that still works, and maybe explain what's happening behind the scenes. Can you give another example? Yeah, so let's maybe make Jane have a height again,
49:40 I don't know, just a comma and 1-point-something, whatever, and then let's maybe add a couple of dirty lines at the top of the file, just write some random junk. Before the
49:55 header? Yeah. So this is also something that's pretty common in CSV files: you have a couple of rows that basically give some kind of description of what the file is going to be about. It's
50:16 basically some kind of junk, right, and the sniffer can automatically detect that these are broken. So if we try to run our sniffer, hopefully this time, even with this small file, it works nicely. Wait, you want... let's run, sorry, yeah, let's run it and
50:46 see. Wow, wow. Yeah, so it's basically completely ignoring the "you're awesome" line. Isn't that hurtful? I think I've been teaching it
51:01 weirdly. No, so basically it can detect, to a certain degree, that you have these kinds of wrong rows.
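(A sketch of the same check on a file with junk lines before the header; the file name and contents are illustrative. The sniffer reports how many leading rows it decided to skip, and the plain read then handles them automatically:)

    -- people_with_notes.csv (illustrative):
    --   this file lists some people
    --   you're awesome
    --   name,age,height
    --   ...
    FROM sniff_csv('people_with_notes.csv');   -- the skip value should reflect the two junk lines
    FROM 'people_with_notes.csv';              -- junk lines and header are handled automatically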
51:12 And there are all kinds of other errors you can have in CSV files. Specifically for this case where you're missing a column, we have a way of amending it with null_padding, but if you have extra columns, what do you do, right? What does that mean, which column should
51:32 you ignore? Or if you have unquoted values... So there's a whole plethora of errors that, with ignore_errors, will hopefully get ignored, and you can still read your CSV file. So if you wanted to ruin a CSV file you could even add an extra column, right. So the bottom line
51:54 is, I think these two options, ignore_errors and null_padding, really help people
52:01 read all sorts of weird files. Yeah, that is cool. So let's just recap what we covered here. For people watching this who are not familiar with DuckDB: I'm just running VS Code, I've installed DuckDB here on my terminal (you can also use DuckDB within a Jupyter notebook with, you know,
52:22 pip install duckdb), and I'm sending those SQL commands, and we had the CSV file here to play with. Basically what we did: a simple CSV, seeing, without specifying anything, what kind of things DuckDB does behind the scenes; we used sniff_csv, which I was actually not aware of, so I did learn
52:46 something, because usually the magic happening doesn't require me to go to that function, but now I will definitely look a bit more; and we tried to break the CSV in different ways and saw what kinds of errors get surfaced, which was pretty nice. There was specifically the line-number error if you have one line
53:10 that's just messed up in your CSV, and here we just did some stuff on the headers and it's still able to parse it, but be aware that you sometimes also have to pass ignore_errors, depending on what kind of things would break
53:29 the parser. That's probably what we covered. And yeah, it was a fun
53:35 discussion, I did learn a few things with you, Pedro. What's next on the agenda for your work, maybe first on the CSV side and then your work in general? Yeah, that's a good question. So I
53:59 did recently do quite a huge rework of the way the CSV reader operates, basically with a performance focus, and it was
54:09 very interesting. But I think by now we have certain cases where we're kind of getting bottlenecked by IO. So for example, if you take the ClickBench CSV file and just run queries directly on the CSV file, you get bottlenecked by IO, which basically means there's nothing really you can do, because you're
54:28 waiting for the thing to be read into memory. So performance-wise, for the general case, I'm pretty happy. There are of course optimizations you can do, like when you're reading CSV files from the internet (I already have ideas of how to make that a bit faster), and when you read the same file multiple
54:50 times there are caching ideas you can implement. So there are still things to be done, but for the general case I'm very happy. What I think can still be done better is our testing of CSV files. They come in all shapes and forms, with all different types of errors, and one thing that I
55:10 think is nice about them is that you know what kinds of errors can happen, right? There's a certain limit to the craziness, so you can write a fuzzer that, with a certain number of rules, can just generate infinitely many broken CSV files, and because you generated them, you know where all the errors are. So you can
55:31 validate that your system is properly detecting all of these and properly running. So in my opinion the next big thing for the CSV reader will be robustness. I think the dream is that it never crashes, that it's always able to read the file; even if you get your whole file as one column, at least you as the user
55:56 get something: I can see how the data looks, I can use this sniff function and see what's going wrong, and then amend it myself. So that's kind of the next big goal there. Yeah. And
56:09 for your work itself, are you going to still focus on this specific thing, or what would you like to work on? So, we are currently at a
56:20 stage where we're very much focused on bugs at DuckDB, so we're not really producing that many new features; I've mostly been doing CSV bug fixing, unfortunately. But for me,
56:35 one of the things that I'm highly interested in is async IO. This is something that I noticed with CSV files, but it actually touches any kind of file that we read from our file system. It basically means that the access to disk for that file and the actual operations are interleaved, so you can
56:52 have scenarios where you're not bottlenecked by IO but your CPU is not at max usage, basically because your threads are waiting for data unnecessarily. So you could do that asynchronously, and for all the file formats. That I think is pretty cool because, I mean, you won't get a 10x performance gain, but it will highly increase parallelism,
57:15 and across the whole stack. So that's probably the next big thing I'll be working on. Cool, that sounds exciting, that sounds like the kind of performance work we definitely need on IO. And I think on the parser side there's already a lot of good work; I've been
57:33 using DuckDB on the CLI just to do, you know, querying of CSVs in one line, and the DuckDB CLI has been really refreshing for me. Thank you very much, Pedro, and I guess see you around. I'll see you around in Amsterdam, maybe. Of course, see you. [Music]