Swimming in Google Sheets with MotherDuck

2024/09/04 - 4 min read

Quack Notes

Often you will have spreadsheets that you want to mash up with other spreadsheets, or data in your database, or some random files on your desktop. With MotherDuck, you can easily handle all of these scenarios. In this series of post, you will learn how to read from Google Sheets in two ways: (1) with publicly-shared sheets and (2) with private sheets.

Publicly-Shared Sheets

For Google Sheets that are shared with a public link, extracting the sheet data is as simple as using the read_csv function and passing the URL. There are two things to note here - you will want to make sure to set the format as ‘csv’ and the gid as the tab that you want to load.

Copy code
FROM read_csv('https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={tab_id}')

As a practical example, you can extract the sheet id and tab id from the URL, as seen in the screenshot below.

I have loaded some F1 data from kaggle into a Google Sheet and made the link public. This Google Sheet has id '1unpDUkTx8UVhuO0bo2yyC4RrAHNhxGnzJziLu5jeXvw' with the following tabs:

Depending on the use case, we can use either views or tables. If you want to keep things in sync with the spreadsheet, a view will work best. If you want to do more complex analysis, materializing as a table (or a temp table for this session) are great ideas for better performance.

The code example below creates the destination schema and then loads the data into MotherDuck:

Copy code
CREATE SCHEMA IF NOT EXISTS f1;

CREATE OR REPLACE TABLE f1.races AS
FROM read_csv('https://docs.google.com/spreadsheets/d/1unpDUkTx8UVhuO0bo2yyC4RrAHNhxGnzJziLu5jeXvw/export?format=csv&gid=2031195234');

CREATE OR REPLACE TABLE f1.constructors AS
FROM read_csv('https://docs.google.com/spreadsheets/d/1unpDUkTx8UVhuO0bo2yyC4RrAHNhxGnzJziLu5jeXvw/export?format=csv&gid=0');

CREATE OR REPLACE TABLE f1.constructor_results AS
FROM read_csv('https://docs.google.com/spreadsheets/d/1unpDUkTx8UVhuO0bo2yyC4RrAHNhxGnzJziLu5jeXvw/export?format=csv&gid=1549360536');

This allows easy subsequent analysis, for example, identifying the top scoring teams in the constructors championship each year.

Copy code
SELECT
    c."name" as constructor_name,
    r.year::text as year,
    sum(cr.points) as points_scored,
    count(*) as races
FROM f1.constructor_results cr
LEFT JOIN f1.races r on r.raceid = cr.raceid
LEFT JOIN f1.constructors c on c.constructorid = cr.constructorid
GROUP BY ALL
HAVING points_scored > 0
ORDER BY points_scored desc

Screenshot 2024-08-30 at 10.02.52 AM.png

Private Sheets

In order to load private sheets into MotherDuck, we need to handle Google Authentication. This is a complex topic, so I'll leave the details to this tutorial by Saturn Cloud.

The high-level overview is that you need to do the following:

Create a Service Account in Google Cloud
Create an Access Token for that Service Account
Add the Service Account as user with access to your Google Sheet
Create an Access Token for MotherDuck

That being said, importing a table into MotherDuck is simple as this bit of code. It should be noted this assumes that you store your Tokens in your .env file.

Copy code
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build
import duckdb
import os
import json

# create & load creds
creds_dict = json.loads(os.getenv('GOOGLE_CREDENTIALS_JSON'))
creds = service_account.Credentials.from_service_account_info(
    creds_dict,
    scopes=['https://www.googleapis.com/auth/spreadsheets.readonly']
)

# create the service
service = build('sheets', 'v4', credentials=creds)
sheet = service.spreadsheets()
# note that we use literal tab name instead of gid
result = sheet.values().get(spreadsheetId='1unpDUkTx8UVhuO0bo2yyC4RrAHNhxGnzJziLu5jeXvw', range = 'Constructors').execute() 

# create the df with the column headers based on the values in the first row of the sheet
df = pd.DataFrame(result.get('values', [])[1:], columns=result.get('values', [])[0])

# create a duck connection
con = duckdb.connect(database='md:my_db?motherduck_token=' + os.getenv('MOTHERDUCK_TOKEN'))

# create a table
con.query("create or replace table main.google_sheets as select * from df")

You will note that in this case that tables, not views, are used - because python runtime is outside of MotherDuck, views are not possible, as they will reference objects that the user's scope will not have access to.

Getting started with MotherDuck

Try out MotherDuck for free, explore our integrations like the one with Google Sheets, and keep coding and quacking!

CONTENT

Quack Notes

Publicly-Shared Sheets

Private Sheets

Getting started with MotherDuck

Start using MotherDuck now!

Try 21 Days Free

2024/08/29 - Sheila Sitaram

Small Data SF: The Agenda is now live…with NEW hands-on workshops

We had such an awesome response to Small Data SF after launch: It was so great that we decided to add an additional day of hands-on workshops! Learn more about the full lineup on 9/23 - 9/24 and grab a ticket before it's too late.

2024/10/31 - Mehdi Ouazza, Ryan Boyd

DuckDB Tutorial For Beginners

Get up to speed quickly with DuckDB, including installation, VSCode workflow integration and your first SQL analytics project.

View all

Swimming in Google Sheets with MotherDuck

Quack Notes

Publicly-Shared Sheets

Private Sheets

Getting started with MotherDuck

CONTENT

Subscribe to motherduck blog

PREVIOUS POSTS

Small Data SF: The Agenda is now live…with NEW hands-on workshops

DuckDB Tutorial For Beginners

Swimming in Google Sheets with MotherDuck

Quack Notes

Publicly-Shared Sheets

Private Sheets

Getting started with MotherDuck

CONTENT

Subscribe to motherduck blog

PREVIOUS POSTS

Small Data SF: The Agenda is now live…with *NEW* hands-on workshops

DuckDB Tutorial For Beginners

Small Data SF: The Agenda is now live…with NEW hands-on workshops