
2025/01/16 - Jacob Matson
Local dev and cloud prod for faster dbt development
Spark the Joy of beautiful local development workflows with MotherDuck & dbt
I've been working in the data field for a decade, across various companies, and one constant challenge that’s almost unavoidable is dealing with CSV files.
Yes, there are far more efficient formats, such as Parquet, which avoid schema nightmares thanks to their typing, but CSV files persist for many reasons:
However, this simplicity comes with its own set of challenges, especially if you want to process CSVs without breaking pipelines or pulling your hair out.
Fortunately, DuckDB has an exceptional CSV parser. The team behind it invested heavily in building their own, and in this post, I’ll show you a real-world example where I had to parse multiple CSV files. I’ll also share some SQL tricks and demonstrate how smoothly everything worked using DuckDB and MotherDuck, resulting in a ready-to-query database.
The cherry on top? The final output is a database containing all Stack Overflow survey responses from the past seven years. Stick around if you’re curious about extracting insights or querying the data yourself!
In my opinion, there are four significant challenges when working with CSV files:
These challenges become even more complex when handling multiple CSVs that need to be read or joined to each other.
Let’s see how we address these issues with Stack Overflow survey data.
Each year, Stack Overflow publishes the results of their developer survey, including raw data in—you guessed it—CSV format. These files are available on their website: https://survey.stackoverflow.co/.
Here’s an example of how the dataset is organized:
Copy code
├── raw
│ ├── 2011 Stack Overflow Survey Results.csv
│ ├── 2012 Stack Overflow Survey Results.csv
│ ├── 2013 Stack Overflow Survey Responses.csv
│ ├── 2014 Stack Overflow Survey Responses.csv
│ ├── 2015 Stack Overflow Developer Survey Responses.csv
│ ├── 2016 Stack Overflow Survey Results
│ │ ├── 2016 Stack Overflow Survey Responses.csv
│ │ └── READ_ME_-_The_Public_2016_Stack_Overflow_Developer_Survey_Results.txt
│ ├── stack-overflow-developer-survey-2017
│ │ ├── DeveloperSurvey2017QuestionaireCleaned.pdf
│ │ ├── README_2017.txt
│ │ ├── survey_results_public.csv
│ │ └── survey_results_schema.csv
│ ├── stack-overflow-developer-survey-2018
│ │ ├── Developer_Survey_Instrument_2018.pdf
│ │ ├── README_2018.txt
│ │ ├── survey_results_public.csv
│ │ └── survey_results_schema.csv
│ ├── stack-overflow-developer-survey-2019
│ │ ├── README_2019.txt
│ │ ├── so_survey_2019.pdf
│ │ ├── survey_results_public.csv
│ │ └── survey_results_schema.csv
[..]
Key observations:
What Country or Region do you live in?, How old are you?, How many years of IT/Programming experience do you have?, ...Additional challenges include:
• Column names with unusual characters.
• Querying such column names can be tedious.
From 2017 onward, Stack Overflow improved the exports by separating:
• A file containing the answers (columns with clean names for each question).
• A schema file (.csv) that maps question codes to full question text.
To keep things manageable, I focused on datasets from 2017 onward.
We’ve all wasted hours trying to automate tasks that could have been done manually in minutes. This is a common trap for data engineers. Sometimes, quick manual cleanup is the most efficient approach.
Here’s what I did:
• Placed all CSVs in a single folder.
• Renamed files by adding the corresponding year as a prefix (e.g., <year>_<file_name>).
• Ensured column names in schema files were consistent (e.g., renamed name to qname where needed).
These steps took less than five minutes and saved me headaches later. Not everything needs to be automated!
Now for the exciting part: loading the data. DuckDB supports glob patterns for loading multiple files. For complex structures like Hive partitions, it works seamlessly too.
Here’s the core query for loading survey results:
Copy code
CREATE OR REPLACE TABLE stackoverflow_survey.survey_results AS
SELECT
* EXCLUDE (filename),
substring(parse_filename(filename), 1, 4) as year,
FROM read_csv_auto(
'data_2017_2024/*survey_results*.csv',
union_by_name=true,
filename=true)
Breakdown:
CREATE a table based on a SELECT statement.EXCLUDE the filename. This is a path of the containing file; we get this one by enabling filename=true.filename to get only the year. As we have a convention on the file name to prefix by <year>, we take the first four chars and create a year column*survey_results* as a single table (we'll do another query for the survey_schemas)Alright, let's run this one... 🙏
Copy code
duckdb.duckdb.ConversionException: Conversion Error: CSV Error on Line: 35365
Original Line: 35499,I am a developer by profession,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
Error when converting column "Hobbyist". Could not convert string "NA" to 'BOOLEAN'
Column Hobbyist is being converted as type BOOLEAN
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'Hobbyist': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
file = ./2017_2024_schema/2020_survey_results_public.csv
delimiter = , (Auto-Detected)
quote = " (Auto-Detected)
escape = " (Auto-Detected)
new_line = \n (Auto-Detected)
header = true (Auto-Detected)
skip_rows = 0 (Auto-Detected)
comment = \0 (Auto-Detected)
date_format = (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding = 0
sample_size = 20480
ignore_errors = false
all_varchar = 0
Bad news, it didn't successfully parse the CSVs. But the GREAT news is that we have an excellent log error!
We know :
Could not convert string "NA" to 'BOOLEAN'This saves so much time! Sometimes, just one row can mess up the whole process, and if the error message isn’t clear, you’re stuck guessing what went wrong. You might even end up throwing out your CSV or trying random fixes over and over.
For us, increasing the sample_size fixed the problem right away. 👍
With the initial query successful, the next steps were to:
Here's a generic function to wrap the query we saw and run them depending on the pattern name of the files (either for results or schemas).
Copy code
CSV_DIR = './data_2017_2024'
# Global configuration
FILE_CONFIGS = [
{'pattern': 'schema', 'table': 'survey_schemas'},
{'pattern': 'public', 'table': 'survey_results'}
]
def process_survey_files(csv_dir: str) -> None:
"""
Process Stack Overflow survey CSV files and load them into DuckDB tables
"""
con = duckdb.connect('stackoverflow_survey.db')
for config in FILE_CONFIGS:
logging.info(f"Processing {config['pattern']} files...")
con.execute(f"""
CREATE OR REPLACE TABLE stackoverflow_survey.{config['table']} AS
SELECT
* EXCLUDE (filename),
substring(parse_filename(filename), 1, 4) as year,
FROM read_csv_auto(
'{csv_dir}/*{config['pattern']}*.csv',
union_by_name=true,
filename=true,
sample_size=-1
)
""")
# Log row count
count = con.execute(f"SELECT COUNT(*) FROM stackoverflow_survey.{config['table']}").fetchone()[0]
logging.info(f"Loaded {count} rows into {config['table']}")
# Log unique years
years = con.execute(f"SELECT DISTINCT year FROM stackoverflow_survey.{config['table']} ORDER BY year").fetchall()
logging.info(f"{config['table']} years: {[year[0] for year in years]}")
con.close()
Finally, we added another function to check row count and make sure we didn't lose any rows during the process :
Copy code
def verify_row_counts(csv_dir: str) -> None:
"""
Verify that the sum of individual file counts matches the merged table counts
"""
con = duckdb.connect('stackoverflow_survey.db')
for config in FILE_CONFIGS:
pattern = config['pattern']
table = config['table']
logging.info(f"\nVerifying {pattern} files counts...")
individual_counts = 0
for filename in os.listdir(csv_dir):
if pattern in filename and filename.endswith('.csv'):
file_path = os.path.join(csv_dir, filename)
count = con.execute(f"SELECT COUNT(*) FROM read_csv_auto('{file_path}')").fetchone()[0]
logging.info(f"{filename}: {count} rows")
individual_counts += count
merged_count = con.execute(f"SELECT COUNT(*) FROM stackoverflow_survey.{table}").fetchone()[0]
logging.info(f"Individual {pattern} files total: {individual_counts}")
logging.info(f"Merged {table} total: {merged_count}")
assert individual_counts merged_count, f"{pattern} row count mismatch: {individual_counts} != {merged_count}"
con.close()
logging.info("✅ All row counts verified successfully!")
Now that I have a DuckDB database containing both tables (results and schemas), the only thing left is to share it! Let's see how that works with MotherDuck.
I’m using the DuckDB CLI, but this could also be part of a Python script. It’s just four simple commands:
Copy code
duckdb
D ATTACH 'stackoverflow_survey.db'
D ATTACH 'md:'
D CREATE DATABASE cloud_stackoverflow_survey FROM stackoverflow_survey;
D CREATE SHARE FROM cloud_stackoverflow_survey;
┌─────────────────────────────────────────────────────────────────┐
│ share_url │
│ varchar │
├─────────────────────────────────────────────────────────────────┤
│ md:_share/sample_data/23b0d623-1361-421d-ae77-125701d471e6 │
└─────────────────────────────────────────────────────────────────┘
ATTACH command.ATTACH 'md';. Note that I have my motherduck_token stored in an ENV.CREATE DATABASE x FROM xTo make it even easier for MotherDuck users, I put this one in the existing demo database sample_data, which is attached by default for any users.
This dataset offers plenty of opportunities to uncover insights, but I’ll wrap up this blog with a simple query that wasn’t included in the original StackOverflow study.
I wanted to explore the average happiness score of people based on their work location (remote, in-person, or hybrid).
Copy code
SELECT RemoteWork,
AVG(CAST(JobSat AS DOUBLE)) AS AvgJobSatisfaction,
COUNT(*) AS RespondentCount
FROM sample_data.stackoverflow_survey.survey_results
WHERE JobSat NOT IN ('NA')
AND RemoteWork NOT IN ('NA')
AND YEAR='2024'
GROUP BY ALL;
and the results :
Copy code
┌──────────────────────────────────────┬────────────────────┬─────────────────┐
│ RemoteWork │ AvgJobSatisfaction │ RespondentCount │
│ varchar │ double │ int64 │
├──────────────────────────────────────┼────────────────────┼─────────────────┤
│ In-person │ 6.628152818991098 │ 5392 │
│ Remote │ 7.072592992884806 │ 11103 │
│ Hybrid (some remote, some in-person) │ 6.944303596894311 │ 12622 │
└──────────────────────────────────────┴────────────────────┴─────────────────┘
Two interesting takeaways: remote and ybrid workers make up the majority of survey responses, and on average, they seem to be happier too!
Check out our documentation if you want to explore this dataset further.
In the meantime, get ready to tackle future CSV challenges with ease—DuckDB and MotherDuck (start for free!) have got you covered!

2025/01/16 - Jacob Matson
Spark the Joy of beautiful local development workflows with MotherDuck & dbt

2025/01/22 - Simon Späti
Master the essential data engineering toolkit—Linux commands, Docker, Python, SQL, and developer tools. A practical guide to the tools every DE needs.