
A simple way to convert between CSV and Parquet files

2024/01/17

As developers, we live in the terminal. It’s our command center for everything from Git commits to running applications. But we all have those small, annoying tasks that break our flow—the little papercuts of the development cycle. One of the most common? File format conversion.

You have a Parquet file, but a legacy tool or a colleague needs it in CSV format. Or maybe you have a massive CSV that you want to compress into the more efficient Parquet format. What's your go-to move? Do you spin up a Jupyter notebook and import pandas? Do you write a quick, one-off Python script? Or do you resort to a sketchy online file converter?

These interruptions, while small, add up. But what if the perfect tool for the job was already on your machine, ready to go?

Enter DuckDB. While you might know it as a powerful embedded analytical database, it's also a versatile Swiss army knife for your data. Because it's a lightweight, serverless tool that speaks fluent SQL and natively understands formats like Parquet, CSV, and JSON, it's the perfect utility for lightning-fast file conversions directly from your command line.

Let's build a powerful, reusable conversion utility in just a few minutes.

Instant File Conversion with a DuckDB One-Liner

First things first, you need the duckdb CLI. If you're on macOS, installation is a single command with Homebrew.

brew install duckdb

For other operating systems, check out the official installation documentation.
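
To confirm the CLI is installed and on your PATH, you can print its version:

duckdb --version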

Once installed, you have everything you need. Let's say you have a file named data.parquet and you want to convert it to data.csv. The magic is a single command that leverages DuckDB's powerful COPY statement.

duckdb -c "COPY (SELECT * FROM 'data.parquet') TO 'data.csv' (HEADER, DELIMITER ',');"

Let's break down this command to see what's happening:

  • duckdb -c "...": This is the key to using DuckDB as a scripting tool. The -c flag tells DuckDB to execute the SQL command that follows and then immediately exit. No interactive shell, no fuss—just pure, scriptable execution.
  • COPY (...) TO 'data.csv': This is the workhorse. The COPY command is incredibly efficient at moving data into and out of DuckDB.
  • (SELECT * FROM 'data.parquet'): Instead of copying from a table, we're telling DuckDB to copy the result of a query. The magic here is that DuckDB can query files like Parquet or CSV directly, as if they were database tables, inferring the file type and schema automatically from the file extension (a quick way to inspect that inferred schema is shown right after this list).
  • (HEADER, DELIMITER ','): These are options for the output format. Here, we're telling DuckDB to write a header row and use a comma as the delimiter in the final CSV file.
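
If you want to see exactly what schema DuckDB inferred for a file, the same -c pattern works with a DESCRIBE query (data.parquet is just the example file from above):

duckdb -c "DESCRIBE SELECT * FROM 'data.parquet';"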

And that's it. In the time it would have taken you to open a new editor tab, you've converted your file.
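
The same one-liner works in the other direction, too. Assuming you have a data.csv sitting in the current directory, this writes it out as Parquet (the FORMAT option is optional here, since DuckDB also infers it from the .parquet extension):

duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET);"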

Building a Reusable Conversion Script

That one-liner is great, but we can make it even better. The real power of the command line comes from creating reusable, generic tools.

Let's wrap this logic into a simple Bash script. Create a file named file-converter.sh somewhere convenient, like ~/scripts/.

#!/bin/bash
# file-converter.sh

# Check if two arguments are provided
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <input_file> <output_file>"
  exit 1
fi

INPUT_FILE=$1
OUTPUT_FILE=$2

duckdb -c "COPY (SELECT * FROM '${INPUT_FILE}') TO '${OUTPUT_FILE}';"

echo "Successfully converted ${INPUT_FILE} to ${OUTPUT_FILE}"

Make the script executable:

chmod +x ~/scripts/file-converter.sh
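
At this point the script already works on its own; using the example files from earlier:

~/scripts/file-converter.sh data.parquet data.csv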

Now, you have a generic script that takes an input file and an output file as arguments. The final step is to create a shell alias for ultimate convenience. Open your .zshrc, .bashrc, or equivalent shell configuration file and add this line:

# Add to your .zshrc or .bashrc
alias dconvert='~/scripts/file-converter.sh'

Restart your terminal or run source ~/.zshrc to apply the changes. Now, witness your new superpower. You can convert files back and forth with a simple, memorable command.

Convert Parquet to CSV:

dconvert data.parquet data.csv
# Successfully converted data.parquet to data.csv

Convert CSV to Parquet:

dconvert data.csv data.parquet
# Successfully converted data.csv to data.parquet

It doesn't get simpler than that. You've just built a universal file conversion utility that is faster than spinning up a notebook or writing a one-off script, and safer than any online converter.

Conclusion: More Than Just Conversion

We started with a simple problem and ended with an elegant, reusable solution. With one small script, you've added a powerful tool to your developer toolkit, powered by DuckDB.

But don't forget what's happening inside that command. You aren't just copying bytes; you're running a full-fledged SQL query. This opens up a world of possibilities that go far beyond the simple 1-to-1 conversion our dconvert alias handles.

What if you only wanted a subset of the data? For more complex tasks, you can bypass the alias and use the duckdb -c command directly to run a more powerful query.

# Filter for specific rows before converting
duckdb -c "COPY (SELECT * FROM 'data.parquet' WHERE category = 'A') TO 'filtered_data.csv' (HEADER, DELIMITER ',');"

What if you only needed a few columns?

# Select specific columns
duckdb -c "COPY (SELECT user_id, event_timestamp FROM 'logs.parquet') TO 'events.csv' (HEADER, DELIMITER ',');"

The SELECT statement is your playground. You can perform filtering, transformations, and even simple aggregations as part of your conversion pipeline, all within that single command.
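
For instance, a small aggregation can ride along with the conversion; the category column below is hypothetical, reused from the filter example above:

duckdb -c "COPY (SELECT category, COUNT(*) AS row_count FROM 'data.parquet' GROUP BY category) TO 'category_counts.csv' (HEADER, DELIMITER ',');"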


Get Started Today

  • Try this out and share your favorite DuckDB one-liners with us on Twitter!
  • For more details on the options available, check out the official DuckDB documentation for the COPY command.
  • When your challenges go beyond local files, see how MotherDuck brings the power of DuckDB to the cloud for serverless, collaborative analytics.

Related Videos

  • Is DuckDB the Secret to Unlocking Your GIS Potential? (2024-08-29, 14:49): In this video, Mehdi walks you through the basics of working with geospatial data and introduces the DuckDB spatial extension. By the end, you will create your own heatmap using DuckDB, Python, and MotherDuck for sharing and scalability.
  • DuckDB & dataviz | End-To-End Data Engineering Project (3/3) (2024-06-27, 21:46): In this part 3 of the project, @mehdio explores how to build a dashboard with Evidence using MotherDuck/DuckDB as a data source.
  • One data tool with all its dependencies: DuckDB and extensions (2024-06-21, 0:55): Learn about DuckDB extensions, including the ability to query data in your AWS S3-powered data lake.