From BigQuery to DuckDB and MotherDuck: Efficient Local and Cloud Data Pipelines
2025/05/30 - 7 min read
BigQuery has been a cornerstone OLAP database for over a decade. However, today we have more options, especially for local development, that offer a smoother and more flexible experience.
DuckDB stands out for local workflows, but it can also interoperate with BigQuery during the development phase and offload some of the compute to MotherDuck, DuckDB's cloud backend.
In addition, BigQuery hosts several well-maintained public datasets like PyPI download statistics and Hacker News activity.
In this blog post, we'll explore two great options for seamlessly loading data from BigQuery into DuckDB and MotherDuck.
We'll use the DuckDB CLI for demonstration, but any client (e.g., Python) will work:
import duckdb
# Create an in-memory DuckDB connection
conn = duckdb.connect()
# Run SQL queries
conn.sql('SELECT * FROM my_table;')
DuckDB BigQuery community extension
The duckdb-bigquery community extension is one of the most downloaded DuckDB extensions!
You can inspect the download stats from the last week (e.g., May 19, 2025) using:
UNPIVOT (
    SELECT 'community' AS repository, *
    FROM 'https://community-extensions.duckdb.org/downloads-last-week.json'
)
ON COLUMNS(* EXCLUDE (_last_update, repository))
INTO NAME extension VALUE downloads_last_week
ORDER BY downloads_last_week DESC;
┌────────────┬─────────────────────┬───────────┬─────────────────────┐
│ repository │    _last_update     │ extension │ downloads_last_week │
│  varchar   │      timestamp      │  varchar  │        int64        │
├────────────┼─────────────────────┼───────────┼─────────────────────┤
│ community  │ 2025-05-21 07:28:50 │ arrow     │              163603 │
│ community  │ 2025-05-21 07:28:50 │ shellfs   │               71496 │
│ community  │ 2025-05-21 07:28:50 │ h3        │               26729 │
│ community  │ 2025-05-21 07:28:50 │ zipfs     │               22344 │
│ community  │ 2025-05-21 07:28:50 │ bigquery  │               21678 │
└────────────┴─────────────────────┴───────────┴─────────────────────┘
The BigQuery extension is in the top 5, with over 21k downloads last week.
INFO: Extension Types
Core extensions are officially maintained by DuckDB Labs and are usually auto-installed/loaded. Community extensions are built and published by the community. To use one, install it with INSTALL <community_extension> FROM community; and then LOAD <community_extension>;
Prerequisites and Installation
To use the BigQuery extension, you'll need valid Google Cloud credentials. You can either:
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to a service account file.
- Or run gcloud auth application-default login to generate credentials stored at $HOME/.config/gcloud/application_default_credentials.json.
In terms of permissions, the user or service account needs at least the BigQuery Data Editor and BigQuery Job User roles.
After launching a DuckDB session with the CLI:
$ duckdb
You can then install and load the community extension:
INSTALL bigquery FROM community;
LOAD bigquery;
Now ATTACH a BigQuery project like any other database:
ATTACH 'project=my-gcp-project' AS bq (TYPE bigquery, READ_ONLY);
INFO: How to find your Project ID?
A Project ID is a globally unique identifier for your GCP project. It's visible in the Google Cloud Console project picker or by running: gcloud projects list
Once attached, querying your dataset is simple:
SELECT * FROM bq.<dataset_name>.<table_name> LIMIT 5;
Example: querying the PyPI public dataset
Let's query the PyPI public dataset, which logs Python package downloads. Since it's a public dataset, you must set a billing project (your own GCP project with billing enabled):
ATTACH 'project=bigquery-public-data dataset=pypi billing_project=my-gcp-project' AS bigquery_public_data (TYPE bigquery, READ_ONLY);
Then query:
SELECT
    timestamp,
    country_code,
    url,
    project,
    file,
    details,
    tls_protocol,
    tls_cipher
FROM bigquery_public_data.pypi.file_downloads
WHERE project = 'duckdb'
  AND "timestamp" = TIMESTAMP '2025-05-26 00:00:00'
LIMIT 100;
WARNING: Large Table Warning
This table is very large. Filter by both "project" and "timestamp" (it's partitioned on timestamp) to avoid high costs.
Behind the scenes, this ATTACH-based query performs a table scan. The extension also exposes two explicit functions for querying BigQuery:
- bigquery_scan(): best for reading a single table efficiently with simple projections:
SELECT * FROM bigquery_scan('my_gcp_project.quacking_dataset.duck_tbl');
- bigquery_query(): runs custom GoogleSQL read queries within your BigQuery project. Recommended for large tables with filter pushdown:
SELECT * FROM bigquery_query('my_gcp_project', 'SELECT * FROM `my_gcp_project.quacking_dataset.duck_tbl`');
Load data into MotherDuck
Now, if you want to load your data into MotherDuck, simply connect with another attach command, ATTACH 'md:', assuming you have a motherduck_token set as an environment variable.
ATTACH 'md:';
Let's create a cloud database to store our data:
CREATE DATABASE IF NOT EXISTS pypi_playground;
Now you can copy data to MotherDuck with a simple CREATE TABLE ... AS statement, or with INSERT INTO ... SELECT if you want to insert the data into an existing table (see the sketch after the next block):
CREATE TABLE IF NOT EXISTS pypi_playground.duckdb_sample AS
SELECT
    timestamp,
    country_code,
    url,
    project,
    file,
    details,
    tls_protocol,
    tls_cipher
FROM bigquery_public_data.pypi.file_downloads
WHERE project = 'duckdb'
  AND "timestamp" = TIMESTAMP '2025-05-26 00:00:00'
LIMIT 100;
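If the target table already exists, the equivalent load uses INSERT INTO ... SELECT. Here is a minimal sketch, reusing the pypi_playground.duckdb_sample table created above; the '2025-05-27' partition is just a hypothetical example of another day to append:

INSERT INTO pypi_playground.duckdb_sample
SELECT
    timestamp,
    country_code,
    url,
    project,
    file,
    details,
    tls_protocol,
    tls_cipher
FROM bigquery_public_data.pypi.file_downloads
WHERE project = 'duckdb'
  -- hypothetical partition: any day not yet loaded
  AND "timestamp" = TIMESTAMP '2025-05-27 00:00:00'
LIMIT 100;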
Using Google's Python SDK for BigQuery
Google has a Python SDK for BigQuery which supports fast data transfer into Arrow tables. If you want to optimize performance for your ETL pipelines, especially when working with large tables and filter pushdown, using Arrow results can be significantly faster, as they enable zero-copy interaction with DuckDB.
Here are the high-level steps when using the Python SDK: BigQuery -> PyArrow table -> DuckDB and/or MotherDuck.
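Before we wire everything up, here is a minimal sketch of that flow, assuming the library installed below, application default credentials, and a placeholder my-gcp-project billing project; the full pipeline with proper helpers follows:

import duckdb
from google.cloud import bigquery

# BigQuery -> PyArrow: run a query and fetch the result as an Arrow table
client = bigquery.Client(project="my-gcp-project")   # placeholder project ID
arrow_tbl = client.query("SELECT 1 AS n").to_arrow()

# PyArrow -> DuckDB: the Arrow table is scanned by name, without copying the data
con = duckdb.connect()   # in-memory here; use duckdb.connect("md:") for MotherDuck
con.sql("CREATE TABLE sample AS SELECT * FROM arrow_tbl")
con.sql("SELECT * FROM sample").show()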
You can install the Python library with:
$ pip install 'google-cloud-bigquery[bqstorage]'
The "extras" option [bqstorage]
install google-cloud-bigquery-storage
. By default, the google-cloud-bigquery
client uses the standard BigQuery API to read query results. This is fine for small results, but much slower and less efficient for large datasets.
When you install the bqstorage extra, you're enabling use of the BigQuery Storage API (see the sketch after this list), which:
- Streams large query results in parallel.
- Uses Apache Arrow (via the pyarrow package) for fast in-memory columnar data access.
- Supports high-throughput data transfers directly into Pandas or NumPy structures.
Let's start by creating two helper functions: get_bigquery_client() to build the BigQuery client, and get_bigquery_result() to run a given SQL query and return an Arrow table.
import os
from google.cloud import bigquery
from google.oauth2 import service_account
from google.auth.exceptions import DefaultCredentialsError
import logging
import time
import pyarrow as pa
import duckdb
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)
def get_bigquery_client(project_name: str) -> bigquery.Client:
    """Get a BigQuery client, preferring an explicit service account file."""
    try:
        service_account_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
        if service_account_path:
            credentials = service_account.Credentials.from_service_account_file(
                service_account_path
            )
            return bigquery.Client(
                project=project_name, credentials=credentials
            )
        # Fall back to application default credentials
        # (e.g. created by `gcloud auth application-default login`)
        return bigquery.Client(project=project_name)
    except DefaultCredentialsError as creds_error:
        raise EnvironmentError(
            "No valid credentials found for BigQuery authentication."
        ) from creds_error


def get_bigquery_result(
    query_str: str, bigquery_client: bigquery.Client
) -> pa.Table:
    """Run a query on BigQuery and return the result as a PyArrow table."""
    try:
        # Start measuring time
        start_time = time.time()

        # Run the query and load the result directly into an Arrow table
        logging.info(f"Running query: {query_str}")
        pa_tbl = bigquery_client.query(query_str).to_arrow()

        # Log the time taken for query execution and data loading
        elapsed_time = time.time() - start_time
        logging.info(
            f"BigQuery query executed and data loaded in {elapsed_time:.2f} seconds")

        return pa_tbl
    except Exception as e:
        logging.error(f"Error running query: {e}")
        raise
Once we get a PyArrow table, loading data into DuckDB and/or MotherDuck is similar to what we did above with the duckdb-bigquery extension. We connect to MotherDuck (duckdb.connect("md:")), then use either a CREATE TABLE ... AS or an INSERT INTO ... SELECT statement to load the data. The PyArrow table object can be queried directly as if it were a DuckDB table.
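For example, here is a quick sketch (not part of the pipeline below) that aggregates the Arrow table in DuckDB without creating any table first, assuming pa_table is the object returned by get_bigquery_result():

import duckdb

# pa_table is resolved by name through DuckDB's replacement scan of Python variables
duckdb.sql("""
    SELECT country_code, count(*) AS downloads
    FROM pa_table
    GROUP BY country_code
    ORDER BY downloads DESC
    LIMIT 5
""").show()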
def create_duckdb_table_from_arrow(
    pa_table: pa.Table,
    table_name: str,
    database_name: str = "bigquery_playground",
    db_path: str = ":memory:"
) -> None:
    """
    Create a DuckDB table from PyArrow table data.

    Args:
        pa_table: PyArrow table containing the data
        table_name: Name of the table to create in DuckDB
        database_name: Name of the database to create/use (default: bigquery_playground)
        db_path: Database path - use the 'md:' prefix for MotherDuck, a file path for a
            local database, or ':memory:' (the default) for an in-memory database
    """
    try:
        # Connect to DuckDB
        if db_path.startswith("md:"):
            # MotherDuck requires a motherduck_token in the environment
            if not os.environ.get("motherduck_token"):
                raise EnvironmentError(
                    "motherduck_token environment variable is not set")
        conn = duckdb.connect(db_path)

        if db_path.startswith("md:"):
            # CREATE DATABASE / USE target a MotherDuck database; a local or
            # in-memory connection simply uses its default database
            conn.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")
            conn.sql(f"USE {database_name}")

        # Create the table from the PyArrow table (referenced by name via replacement scan)
        conn.sql(
            f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM pa_table")
        logging.info(
            f"Successfully created table '{table_name}' in database '{database_name}' with {len(pa_table)} rows to {db_path}")
    except Exception as e:
        logging.error(f"Error creating DuckDB table: {e}")
        raise
We can now create the pipeline by calling the above functions:
if __name__ == "__main__":
    bigquery_client = get_bigquery_client("my-gcp-project")
    pa_table = get_bigquery_result("""SELECT *
FROM
`bigquery-public-data.pypi.file_downloads`
WHERE
project = 'duckdb'
AND timestamp >= TIMESTAMP("2025-05-19")
AND timestamp < TIMESTAMP("2025-05-20")""", bigquery_client)
    create_duckdb_table_from_arrow(
        pa_table, "pypi_file_downloads", db_path="md:")
Running the full pipeline with python ingest_bigquery_data.py, we loaded ~837k rows from BigQuery into MotherDuck in less than 20 seconds!
2025-05-27 09:45:52 - INFO - Running query: SELECT *
FROM
`bigquery-public-data.pypi.file_downloads`
WHERE
project = 'duckdb'
AND timestamp >= TIMESTAMP("2025-05-19")
AND timestamp < TIMESTAMP("2025-05-20")
2025-05-27 09:46:03 - INFO - BigQuery query executed and data loaded in 7.20 seconds
2025-05-27 09:46:11 - INFO - Successfully created table 'pypi_file_downloads' in database 'bigquery_playground' with 837122 rows to md:
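As a quick sanity check (a sketch, not from the original run), you can count the loaded rows from any DuckDB client attached to MotherDuck:

ATTACH 'md:';
SELECT count(*) AS row_count
FROM bigquery_playground.pypi_file_downloads;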
Check the full Python gist here.
BigQuery loves ducks
Both the duckdb-bigquery extension and Google's Python SDK make it incredibly easy to move data from BigQuery into DuckDB or MotherDuck.
Also check out the https://duckdbstats.com/ project and its source code for another example of how to ingest, transform, and serve data in MotherDuck from a BigQuery source dataset.
Keep coding, and keep quacking! 🦆