Replicating Spreadsheets to MotherDuck

Key bits of data and side schedules often exist in spreadsheets. It is nice to be able to easily add that data to your data warehouse and query it. This guide aims to show you how to perform this workflow using the DuckDB CLI.

tip

In order use these extensions, you will need to first install the DuckDB CLI. Instructions can be found here.

Microsoft Excel

note

The purpose of this guide is to show you how to load data from Excel into MotherDuck. Detailed documentation on loading xlsx files can be found on DuckDB.org.

To read from an Excel spreadsheet, first install and load the spatial extension. Do not use the excel extension, which serves an entirely different purpose. The SQL to do this is below:

INSTALL spatial;
LOAD spatial;

This installs the st_read function, so then you can query an Excel file with the path, for example:

SELECT * 
FROM st_read('myfile.xlsx', layer = 'Sheet1');

More typically, you will pass the fully qualified file name to the st_read function. Getting the qualified name depends on your operating system:

Windows: hold Shift and right-click the file or folder, then select "Copy as Path" from the context menu.
Mac OS: right-click the file or folder, press and hold "Option" and then select "Copy ... as Pathname" from the context menu. You can then paste that into your SQL query, as per below:

SELECT * 
FROM st_read("C:\users\sql_user\documents\myfile.xlsx", layer = 'Sheet1');

The previous query simply returns the data set to the terminal, but the query can be modified to write the data into MotherDuck with "Create Table As Select" (CTAS).

CREATE OR REPLACE TABLE my_db.main.my_table AS -- use fully qualified table name
SELECT * 
FROM st_read("C:\users\documents\myfile.xlsx", layer = 'Sheet1');

Of course, sometimes there is data in multiple tabs. In that case, you can use the layer parameter to pass the tab names, and depending on the context, even union multiple tabs into a single table.

CREATE OR REPLACE TABLE my_db.main.my_table AS -- use fully qualified table name
SELECT * 
FROM st_read("C:\users\documents\myfile.xlsx", layer = 'Sheet1')
UNION ALL
SELECT * 
FROM st_read("C:\users\documents\myfile.xlsx", layer = 'Sheet2');

Google Sheets

The first step to handle Google Sheets is to install the duckdb-gsheets extension. That is done with these commands from the DuckDB CLI:

INSTALL gsheets FROM community;
LOAD gsheets;

Since Google Sheets is a hosted application, we need to use DuckDB Secrets to handle authentication.

This is as simple as:

CREATE SECRET (TYPE gsheet);

note

Using this workflow will require interactivity with a browser, so if you need to run it from a job (i.e. Airflow or similar), consider setting up a Google API access token.

In order to read from a Google Sheet, we need at minimum the sheet id, which is found in the URL, for example https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit. The string between d/ and /edit represents the spreadsheet id. It can therefore be queried with:

SELECT *
FROM read_gsheet('https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit');

The previous query simply returns the data set to the terminal, but the query can be modified to write the data into MotherDuck with "Create Table As Select" (CTAS).

CREATE OR REPLACE TABLE my_db.main.my_table AS -- use fully qualified table name
SELECT * 
FROM read_gsheet('https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit');

For convenience, the spreadsheet id itself can be queried as well.

SELECT * 
FROM read_gsheet('11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8');

To query data from multiple tabs, the tab name can be passed as parameter using sheet to select the preferred tab.

SELECT * FROM read_gsheet('11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8', sheet='Sheet2');

For more detailed documentation, including writing to Google Sheets, review the duckdb-gsheets documentation.

Handling More Complex Workflows

Production use cases tend to be much more complex and include things like incremental builds & state management. In those scenarios, please take a look at our ingestion partners, which includes many options including some that offer native python. An overview of the MotherDuck Ecosystem is shown below.

Diagram

Microsoft Excel​

Google Sheets​

Handling More Complex Workflows​

Microsoft Excel

Google Sheets

Handling More Complex Workflows