Why CSV Files Won’t Die and How DuckDB Conquers Them
2025/02/04 - 8 min read
I've been working in the data field for a decade, across various companies, and one constant, almost unavoidable challenge is dealing with CSV files.
Yes, there are far more efficient formats, such as Parquet, which avoid schema nightmares thanks to their typing, but CSV files persist for many reasons:
- They’re easy to edit and read, requiring no dependencies—just open the file.
- They’re universal: many services still exchange data in CSV format.
- Want to download data from social media or your CRM? CSV.
- Need transaction history from your bank? CSV.
However, this simplicity comes with its own set of challenges, especially if you want to process CSVs without breaking pipelines or pulling your hair out.
Fortunately, DuckDB has an exceptional CSV parser. The team behind it invested heavily in building their own, and in this post, I’ll show you a real-world example where I had to parse multiple CSV files. I’ll also share some SQL tricks and demonstrate how smoothly everything worked using DuckDB and MotherDuck, resulting in a ready-to-query database.
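To get a feel for that parser before we dive in, here is a minimal sketch (the file name `survey.csv` is just a placeholder): point `read_csv_auto` at a file and DuckDB sniffs the delimiter, quoting, header, and column types on its own.

```sql
-- Auto-detect dialect and column types for a single file (placeholder path)
SELECT * FROM read_csv_auto('survey.csv') LIMIT 5;

-- Inspect the schema DuckDB inferred
DESCRIBE SELECT * FROM read_csv_auto('survey.csv');

-- Recent DuckDB versions also expose the sniffer's verdict directly
FROM sniff_csv('survey.csv');
```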
The cherry on top? The final output is a database containing all Stack Overflow survey responses from the past seven years. Stick around if you’re curious about extracting insights or querying the data yourself!
The biggest challenges when reading CSVs
In my opinion, there are three significant challenges when working with CSV files:
- Schema Management
- Row-Level Errors
- Encoding Issues
These challenges become even more complex when handling multiple CSVs that need to be read or joined to each other.
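Each of these pain points maps to an option on DuckDB's CSV reader, as the rest of this post shows. Here is a hedged sketch of the main levers (the path and the `user_id` column are made up for illustration):

```sql
SELECT *
FROM read_csv(
    'exports/*.csv',
    union_by_name=true,             -- schema drift across files: align columns by name
    types={'user_id': 'VARCHAR'},   -- schema management: pin a type the sniffer guesses wrong
    ignore_errors=true,             -- row-level errors: skip rows that fail to parse
    sample_size=-1                  -- let type detection scan every row, not just a sample
);
```

Recent DuckDB releases also expose an `encoding` option for non-UTF-8 files; check the CSV reader documentation for your version.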
Let’s see how we address these issues with Stack Overflow survey data.
About the Dataset
Each year, Stack Overflow publishes the results of their developer survey, including raw data in—you guessed it—CSV format. These files are available on their website: https://survey.stackoverflow.co/.
Here’s an example of how the dataset is organized:
```text
├── raw
│   ├── 2011 Stack Overflow Survey Results.csv
│   ├── 2012 Stack Overflow Survey Results.csv
│   ├── 2013 Stack Overflow Survey Responses.csv
│   ├── 2014 Stack Overflow Survey Responses.csv
│   ├── 2015 Stack Overflow Developer Survey Responses.csv
│   ├── 2016 Stack Overflow Survey Results
│   │   ├── 2016 Stack Overflow Survey Responses.csv
│   │   └── READ_ME_-_The_Public_2016_Stack_Overflow_Developer_Survey_Results.txt
│   ├── stack-overflow-developer-survey-2017
│   │   ├── DeveloperSurvey2017QuestionaireCleaned.pdf
│   │   ├── README_2017.txt
│   │   ├── survey_results_public.csv
│   │   └── survey_results_schema.csv
│   ├── stack-overflow-developer-survey-2018
│   │   ├── Developer_Survey_Instrument_2018.pdf
│   │   ├── README_2018.txt
│   │   ├── survey_results_public.csv
│   │   └── survey_results_schema.csv
│   ├── stack-overflow-developer-survey-2019
│   │   ├── README_2019.txt
│   │   ├── so_survey_2019.pdf
│   │   ├── survey_results_public.csv
│   │   └── survey_results_schema.csv
[..]
```
Key observations:
- Schema Changes Over the Years: Some questions and their formats evolve annually, making it difficult to standardize across years.
- Pre-2016 Format: Each column represents a question, with names like What Country or Region do you live in?, How old are you?, How many years of IT/Programming experience do you have?, ...
Additional challenges include:
• Column names with unusual characters.
• Querying such column names can be tedious.
From 2017 onward, Stack Overflow improved the exports by separating:
• A file containing the answers (columns with clean names for each question).
• A schema file (.csv) that maps question codes to full question text.
To keep things manageable, I focused on datasets from 2017 onward.
Manual cleaning over automation
We’ve all wasted hours trying to automate tasks that could have been done manually in minutes. This is a common trap for data engineers. Sometimes, quick manual cleanup is the most efficient approach.
Here’s what I did:
• Placed all CSVs in a single folder.
• Renamed files by adding the corresponding year as a prefix (e.g., `<year>_<file_name>`).
• Ensured column names in schema files were consistent (e.g., renamed `name` to `qname` where needed).
These steps took less than five minutes and saved me headaches later. Not everything needs to be automated!
Loading the CSVs
Now for the exciting part: loading the data. DuckDB supports glob patterns for loading multiple files. For complex structures like Hive partitions, it works seamlessly too.
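For reference, a quick sketch of both cases (the nested layout and its paths are hypothetical; our survey folder is flat):

```sql
-- All CSVs sitting in one folder
SELECT * FROM read_csv_auto('data_2017_2024/*.csv', union_by_name=true);

-- Hive-style layout (year=2017/, year=2018/, ...): recurse into subfolders and
-- derive the partition column from the directory names
SELECT * FROM read_csv_auto('warehouse/**/*.csv', hive_partitioning=true);
```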
Here’s the core query for loading survey results:
```sql
CREATE OR REPLACE TABLE stackoverflow_survey.survey_results AS
SELECT
    * EXCLUDE (filename),
    substring(parse_filename(filename), 1, 4) as year,
FROM read_csv_auto(
    'data_2017_2024/*survey_results*.csv',
    union_by_name=true,
    filename=true)
```
Breakdown:
- We `CREATE` a table based on a `SELECT` statement.
- We select all columns but `EXCLUDE` the `filename`. This column holds the path of the file each row came from; we get it by enabling `filename=true`.
- We parse the `filename` to keep only the year. Since our naming convention prefixes each file with `<year>`, we take the first four characters and create a `year` column (see the quick sketch just below).
- We use the glob pattern to load only `*survey_results*` files as a single table (we'll run another query for the `survey_schemas`).
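To make that filename-to-year step concrete, here is roughly what the two functions return for one of our renamed files (a sketch using a sample path from this dataset):

```sql
SELECT
    parse_filename('data_2017_2024/2020_survey_results_public.csv') AS fname,
    substring(parse_filename('data_2017_2024/2020_survey_results_public.csv'), 1, 4) AS year;
-- fname = '2020_survey_results_public.csv', year = '2020'
```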
Alright, let's run this one... 🙏
```text
duckdb.duckdb.ConversionException: Conversion Error: CSV Error on Line: 35365
Original Line: 35499,I am a developer by profession,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
Error when converting column "Hobbyist". Could not convert string "NA" to 'BOOLEAN'
Column Hobbyist is being converted as type BOOLEAN
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'Hobbyist': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
file = ./2017_2024_schema/2020_survey_results_public.csv
delimiter = , (Auto-Detected)
quote = " (Auto-Detected)
escape = " (Auto-Detected)
new_line = \n (Auto-Detected)
header = true (Auto-Detected)
skip_rows = 0 (Auto-Detected)
comment = \0 (Auto-Detected)
date_format = (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding = 0
sample_size = 20480
ignore_errors = false
all_varchar = 0
```
Bad news: it didn't successfully parse the CSVs. But the GREAT news is that we get an excellent error log!
We know:
- On which line we have an issue
- A proper error message: Could not convert string "NA" to 'BOOLEAN'
- Possible solutions
This saves so much time! Sometimes, just one row can mess up the whole process, and if the error message isn’t clear, you’re stuck guessing what went wrong. You might even end up throwing out your CSV or trying random fixes over and over.
For us, increasing the sample size (`sample_size=-1`, so the sniffer scans every row before picking types) fixed the problem right away. 👍
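For completeness, here is the same load with that fix applied, and the alternative type override suggested by the error message left as a commented-out option (a sketch):

```sql
CREATE OR REPLACE TABLE stackoverflow_survey.survey_results AS
SELECT
    * EXCLUDE (filename),
    substring(parse_filename(filename), 1, 4) AS year,
FROM read_csv_auto(
    'data_2017_2024/*survey_results*.csv',
    union_by_name=true,
    filename=true,
    sample_size=-1                       -- scan the whole file before picking column types
    -- or: types={'Hobbyist': 'VARCHAR'} -- pin just the offending column instead
);
```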
Wrapping up and automating the rest
With the initial query successful, the next steps were to:
- Repeat the process for schema files.
- Add row count checks to ensure no data was lost while merging the CSVs.
Here's a generic function that wraps the query we saw and runs it once per file-name pattern (one pass for the results files, one for the schemas).
```python
import logging
import os

import duckdb

CSV_DIR = './data_2017_2024'

# Global configuration
FILE_CONFIGS = [
    {'pattern': 'schema', 'table': 'survey_schemas'},
    {'pattern': 'public', 'table': 'survey_results'}
]


def process_survey_files(csv_dir: str) -> None:
    """
    Process Stack Overflow survey CSV files and load them into DuckDB tables
    """
    con = duckdb.connect('stackoverflow_survey.db')
    for config in FILE_CONFIGS:
        logging.info(f"Processing {config['pattern']} files...")
        con.execute(f"""
            CREATE OR REPLACE TABLE stackoverflow_survey.{config['table']} AS
            SELECT
                * EXCLUDE (filename),
                substring(parse_filename(filename), 1, 4) as year,
            FROM read_csv_auto(
                '{csv_dir}/*{config['pattern']}*.csv',
                union_by_name=true,
                filename=true,
                sample_size=-1
            )
        """)

        # Log row count
        count = con.execute(
            f"SELECT COUNT(*) FROM stackoverflow_survey.{config['table']}"
        ).fetchone()[0]
        logging.info(f"Loaded {count} rows into {config['table']}")

        # Log unique years
        years = con.execute(
            f"SELECT DISTINCT year FROM stackoverflow_survey.{config['table']} ORDER BY year"
        ).fetchall()
        logging.info(f"{config['table']} years: {[year[0] for year in years]}")

    con.close()
```
Finally, we added another function to check row counts and make sure we didn't lose any rows in the process:
```python
def verify_row_counts(csv_dir: str) -> None:
    """
    Verify that the sum of individual file counts matches the merged table counts
    """
    con = duckdb.connect('stackoverflow_survey.db')
    for config in FILE_CONFIGS:
        pattern = config['pattern']
        table = config['table']
        logging.info(f"\nVerifying {pattern} files counts...")

        individual_counts = 0
        for filename in os.listdir(csv_dir):
            if pattern in filename and filename.endswith('.csv'):
                file_path = os.path.join(csv_dir, filename)
                count = con.execute(
                    f"SELECT COUNT(*) FROM read_csv_auto('{file_path}')"
                ).fetchone()[0]
                logging.info(f"{filename}: {count} rows")
                individual_counts += count

        merged_count = con.execute(
            f"SELECT COUNT(*) FROM stackoverflow_survey.{table}"
        ).fetchone()[0]
        logging.info(f"Individual {pattern} files total: {individual_counts}")
        logging.info(f"Merged {table} total: {merged_count}")

        assert individual_counts == merged_count, \
            f"{pattern} row count mismatch: {individual_counts} != {merged_count}"

    con.close()
    logging.info("✅ All row counts verified successfully!")
```
Sharing the dataset
Now that I have a DuckDB database containing both tables (results and schemas), the only thing left is to share it! Let's see how that works with MotherDuck.
I’m using the DuckDB CLI, but this could also be part of a Python script. It’s just four simple commands:
```text
duckdb
D ATTACH 'stackoverflow_survey.db';
D ATTACH 'md:';
D CREATE DATABASE cloud_stackoverflow_survey FROM stackoverflow_survey;
D CREATE SHARE FROM cloud_stackoverflow_survey;
┌────────────────────────────────────────────────────────────┐
│                          share_url                         │
│                           varchar                          │
├────────────────────────────────────────────────────────────┤
│ md:_share/sample_data/23b0d623-1361-421d-ae77-125701d471e6 │
└────────────────────────────────────────────────────────────┘
```
- We attach the local DuckDB database with the `ATTACH` command.
- We connect to MotherDuck using `ATTACH 'md:'`. Note that I have my `motherduck_token` stored as an environment variable.
- We upload the database to MotherDuck using `CREATE DATABASE x FROM x`.
- We create a public share so that anyone can start querying!
To make it even easier for MotherDuck users, I put this one in the existing demo database `sample_data`, which is attached by default for all users.
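If you're not relying on the pre-attached `sample_data` database, a share URL like the one printed above can typically be attached directly from any MotherDuck-connected DuckDB session (the alias `so_survey` is just my choice):

```sql
-- Attach the public share under an alias, then query it like any database
ATTACH 'md:_share/sample_data/23b0d623-1361-421d-ae77-125701d471e6' AS so_survey;
SELECT count(*) FROM so_survey.survey_results;
```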
Querying the dataset
This dataset offers plenty of opportunities to uncover insights, but I'll wrap up this blog with a simple query that wasn't included in the original Stack Overflow study.
I wanted to explore the average job satisfaction score of respondents based on their work location (remote, in-person, or hybrid).
```sql
SELECT RemoteWork,
       AVG(CAST(JobSat AS DOUBLE)) AS AvgJobSatisfaction,
       COUNT(*) AS RespondentCount
FROM sample_data.stackoverflow_survey.survey_results
WHERE JobSat NOT IN ('NA')
  AND RemoteWork NOT IN ('NA')
  AND year = '2024'
GROUP BY ALL;
```
And the results:
```text
┌──────────────────────────────────────┬────────────────────┬─────────────────┐
│              RemoteWork              │ AvgJobSatisfaction │ RespondentCount │
│               varchar                │       double       │      int64      │
├──────────────────────────────────────┼────────────────────┼─────────────────┤
│ In-person                            │  6.628152818991098 │            5392 │
│ Remote                               │  7.072592992884806 │           11103 │
│ Hybrid (some remote, some in-person) │  6.944303596894311 │           12622 │
└──────────────────────────────────────┴────────────────────┴─────────────────┘
```
Two interesting takeaways: remote and hybrid workers make up the majority of survey responses, and on average, they seem to be happier with their jobs too!
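And since we loaded the schema files alongside the answers, the column codes used above are self-documenting. A sketch of looking up the full question text (the `question` column name is assumed from the 2017+ schema files):

```sql
SELECT qname, question
FROM sample_data.stackoverflow_survey.survey_schemas
WHERE qname IN ('RemoteWork', 'JobSat')
  AND year = '2024';
```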
Check out our documentation if you want to explore this dataset further.
In the meantime, get ready to tackle future CSV challenges with ease—DuckDB and MotherDuck (start for free!) have got you covered!