Breaking the Excel-SQL Barrier: Leveraging DuckDB's Excel Extension

2025/05/27 - 5 min read

One of the underrated features that snuck into DuckDB 1.2.0 was a major upgrade to the excel extension. Previously, it was used merely for formatting text in Excel number formats (important for a very specific use case, I suppose), but now it can read and write XLSX files!

I am excited about this as someone who spent a good chunk of my career working in and with finance teams that kept key datasets in Excel files. Integrating them into our data warehouse for downstream reporting was a painful, manual process. It was so painful that at one company we wrote a custom Excel plugin to let end users import their Excel files into tables in our SQL Server-based data warehouse! (I think about that plugin more than I care to admit.)

Now, with this upgraded extension, I don't need to think about that plugin anymore - we have something frankly way better and easier to integrate into workflows.

Getting Started with the Excel Extension

Installation is similar to other DuckDB Extensions:

-- Install the extension (needed only once per DuckDB installation)
INSTALL excel;

-- Load the extension into the current database session
LOAD excel;

Once it's installed, it works similarly to the CSV or JSON readers: we can query .xlsx files directly without calling any function, as the use of the extension is implied.

FROM 'my_excel_file.xlsx';

Of course, there are a few config knobs available in this extension, which can be set via the read_xlsx() function, again similar to CSV or JSON. Where this comes in handy most often when reading Excel files is (1) choosing a sheet other than the first one (reading the first sheet is the default behavior), and (2) handling datatype issues with the all_varchar and ignore_errors flags.

For example, reading the second tab of an Excel workbook and casting all the data to varchar looks like this:

FROM read_xlsx(
    'my_excel_file.xlsx',
    all_varchar = true,
    sheet = 'sheet2'
);
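The ignore_errors flag mentioned above takes a different approach: rather than stringifying everything, it keeps the inferred column types and replaces any cell that fails to convert with NULL. A minimal sketch, reusing the same illustrative file name:

```sql
-- Keep the inferred column types, but NULL-out cells that fail to convert
FROM read_xlsx(
    'my_excel_file.xlsx',
    ignore_errors = true
);
```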

Handling Excel files with MotherDuck

It should be noted that, as of this writing, the MotherDuck UI does not allow importing Excel files, so you need to use the DuckDB CLI to accomplish this integration. While this is fine for data pipeline work, it is fairly annoying for ad-hoc data exploration; we are aware of this and working on it.

Now that we've established how to use the Excel extension for reading, let's cover some hygiene as it relates to loading Excel-based data into MotherDuck. In general, when handling adversarial data sources like Excel files, I like to use the all_varchar flag when reading and loading the data, and then handle typing as a second stage.

In the CLI, that two-stage pattern looks something like this:

-- attach MotherDuck so you can see your cloud databases
ATTACH 'md:';

-- load the data into MotherDuck
CREATE OR REPLACE TABLE my_db.my_table AS
FROM read_xlsx(
    'my_excel_file.xlsx',
    all_varchar = true,
    sheet = 'sheet2'
);

-- enforce types
CREATE OR REPLACE TABLE my_db.my_cleaned_table AS
SELECT
    col1::int,
    col2::numeric
FROM my_db.my_table;

By separating these steps, we can ensure the data is loaded and potentially add some try/catch logic in our pipeline for when our adversarial users inevitably introduce typing issues in the source data.
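One lightweight way to build that tolerance directly into the typing stage is DuckDB's TRY_CAST, which returns NULL instead of raising an error when a value can't be converted. A sketch, reusing the illustrative table and column names from above:

```sql
-- Tolerant typing stage: unconvertible values become NULL instead of failing the load
CREATE OR REPLACE TABLE my_db.my_cleaned_table AS
SELECT
    TRY_CAST(col1 AS INTEGER) AS col1,
    TRY_CAST(col2 AS NUMERIC) AS col2
FROM my_db.my_table;

-- Inspect the rows where conversion failed, for follow-up with the source owners
SELECT *
FROM my_db.my_table
WHERE col1 IS NOT NULL
  AND TRY_CAST(col1 AS INTEGER) IS NULL;
```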

Additionally, you can load ad-hoc data sets into MotherDuck from Excel files and join them to your core data warehouse data. This is especially helpful in classification exercises where you may have a list of products or customers with additional dimensions for aggregation; traditional warehouses would force you through a formal data pipeline to make those columns available. With MotherDuck, you are empowered as an analyst to enrich the data in an ad-hoc manner to answer pressing business questions, without dependencies on your data engineering team. This is illustrated in the ad-hoc query below:

SELECT
    e.category,
    SUM(d.sales) AS tot_sales
FROM dwh.sales d
LEFT JOIN (FROM 'my_excel_file.xlsx') e
    ON e.product_id = d.product_id
GROUP BY ALL;

Of course, we aren't limited to merely reading Excel files; we can also write them out. This is especially helpful when dealing with finance stakeholders who may need the data in Excel so they can fold it into a larger process, or who are simply more familiar with working in Excel.

Again, for this exercise of writing files, it's best to use the CLI so you can interact with your local file system to produce the file. This can also be done in your data pipelines, e.g., writing the files out to object storage.

We can see an example of Excel writes here:

COPY report_data TO 'products.xlsx'
WITH (
    FORMAT xlsx,
    HEADER true,
    SHEET 'SalesData'
);

This will save the file in the directory we are running DuckDB in, although you can also specify a full path in the TO clause.
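For example, to land the workbook somewhere specific rather than the current working directory (the path below is purely illustrative):

```sql
-- Write the workbook to an explicit path instead of the working directory
COPY report_data TO '/home/me/reports/products.xlsx'
WITH (FORMAT xlsx, HEADER true, SHEET 'SalesData');
```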

Take-aways

With the Excel extension and MotherDuck, you have all you need to build a robust reporting pipeline and also handle ad-hoc requests from users based on Excel data. Or, if you so desire, you can even treat Excel files as sources within your data pipeline itself. This type of flexibility is core to MotherDuck and is critical to making sure that business value is never blocked by IT frameworks. Keep Quacking!


