Introducing Flights: agent-native data pipelines in MotherDuckJoin the livestream

Skip to main content

Using Excel and Google Sheets data in MotherDuck

Key bits of data and side schedules often exist in spreadsheets like Excel and Google Sheets. It is useful to add that data to your data warehouse and query it. This guide shows how to perform this workflow using the DuckDB CLI for both Excel and Google Sheets.

tip

To use these extensions, you will need to first install the DuckDB CLI. Instructions can be found here..

Microsoft Excel

note

The purpose of this guide is to show you how to load data from Excel into MotherDuck. If you'd like to retrieve MotherDuck data in Excel, you can follow this guide.

To read from an Excel spreadsheet, open the DuckDB CLI by typing duckdb 'md:' in your terminal. This will ask you for access to your MotherDuck account if you haven't already provided it.

You can read Excel files directly with SELECT * FROM 'movies.xlsx', which will automatically load the DuckDB Excel extension. If you want to get more control you can use the read_xlsx function directly.

SELECT * FROM read_xlsx('movies.xlsx', sheet = 'Action Movies');

The previous query returns the data set to the terminal, but the query can be modified to write the data into MotherDuck with "Create Table As Select" (CTAS).

CREATE OR REPLACE TABLE my_db.main.my_movies AS -- use fully qualified table name
SELECT *
FROM 'C:\users\documents\movies.xlsx';

Sometimes there is data in multiple tabs. In that case, you can use the sheet parameter to pass the tab names, and depending on the context, even union multiple tabs into a single table.

CREATE OR REPLACE TABLE my_db.main.my_movies AS -- use fully qualified table name
SELECT *
FROM read_xlsx('C:\users\documents\movies.xlsx', sheet = 'Action Movies')
UNION ALL
SELECT *
FROM read_xlsx('C:\users\documents\movies.xlsx', sheet = 'Romance Movies');

Google Sheets

Query Google Sheets as CSV over HTTPS

If a Google Sheet is publicly accessible, or can be accessed with HTTP authentication, query it from MotherDuck with DuckDB's read_csv() function and the Google Sheets CSV export URL:

SELECT *
FROM read_csv(
'https://docs.google.com/spreadsheets/d/<sheet_id>/export?format=csv&gid=<tab_id>',
MD_RUN = REMOTE
);

The sheet_id is the value between /d/ and /edit in the Google Sheet URL. The gid identifies the worksheet tab. When you run this while connected to MotherDuck, the HTTPS read can execute server side in MotherDuck.

To keep the spreadsheet queryable as live source data, create a view:

CREATE OR REPLACE VIEW my_db.main.sheet_source AS
SELECT *
FROM read_csv(
'https://docs.google.com/spreadsheets/d/<sheet_id>/export?format=csv&gid=<tab_id>',
MD_RUN = REMOTE
);

To snapshot the current spreadsheet data into MotherDuck, create a table instead:

CREATE OR REPLACE TABLE my_db.main.sheet_snapshot AS
SELECT *
FROM read_csv(
'https://docs.google.com/spreadsheets/d/<sheet_id>/export?format=csv&gid=<tab_id>',
MD_RUN = REMOTE
);

For private sheets, create an HTTP secret with an OAuth bearer token and scope it to Google Sheets:

CREATE SECRET google_sheets_http IN MOTHERDUCK (
TYPE HTTP,
SCOPE 'https://docs.google.com',
EXTRA_HTTP_HEADERS MAP {
'Authorization': 'Bearer <google_oauth_access_token>'
}
);

See the DuckDB HTTP authentication documentation for more httpfs authentication options. For more detail on this Google Sheets URL pattern, see Swimming in Google Sheets with MotherDuck.

Query with the Google Sheets extension

info

While the Excel extension is a core DuckDB extension, the Google Sheets extension is a community extension maintained by Evidence.

The first step to handle Google Sheets is to install the duckdb-gsheets extension. That is done with these commands after starting the DuckDB CLI with duckdb 'md:'

INSTALL gsheets FROM community;
LOAD gsheets;

Since Google Sheets is a hosted application, we need to use DuckDB Secrets to handle authentication. This is as simple as:

CREATE SECRET (TYPE gsheet);
note

Using this workflow will require interactivity with a browser, so if you need to run it from a job (i.e. Airflow or similar), consider setting up a Google API access token.

To read from a Google Sheet, we need at minimum the sheet id, which is found in the URL, for example https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit. The string between d/ and /edit represents the spreadsheet id. It can therefore be queried with:

SELECT *
FROM read_gsheet('https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit');

The previous query returns the data set to the terminal, but the query can be modified to write the data into MotherDuck with "Create Table As Select" (CTAS).

CREATE OR REPLACE TABLE my_db.main.my_table AS -- use fully qualified table name
SELECT *
FROM read_gsheet('https://docs.google.com/spreadsheets/d/11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8/edit');

For convenience, the spreadsheet id itself can be queried as well.

SELECT * 
FROM read_gsheet('11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8');

To query data from multiple tabs, the tab name can be passed as parameter using sheet to select the preferred tab.

SELECT * FROM read_gsheet('11QdEasMWbETbFVxry-SsD8jVcdYIT1zBQszcF84MdE8', sheet='Sheet2');

For more detailed documentation, including writing to Google Sheets, review the duckdb-gsheets documentation.

Handling more complex workflows

Production use cases tend to be much more complex and include things like incremental builds & state management. In those scenarios, please take a look at our ingestion partners, which includes many options including some that offer native python. An overview of the MotherDuck Ecosystem is shown below.

Diagram