From BigQuery to DuckDB and MotherDuck: Efficient Local and Cloud Data Pipelines

2025/05/30 - 7 min read


BigQuery has been a cornerstone OLAP database for over a decade. However, today we have more options, especially for local development, that offer a smoother and more flexible experience.

DuckDB stands out for local workflows, but it can also interoperate with BigQuery during the development phase and offload some of the compute to MotherDuck, DuckDB's cloud backend.

In addition, BigQuery hosts several well-maintained public datasets like PyPI download statistics and Hacker News activity.

In this blog post, we’ll explore two great options for seamlessly loading data from BigQuery into DuckDB and MotherDuck.

We'll use the DuckDB CLI for demonstration, but any client (e.g., Python) will work:

import duckdb

# Create an in-memory DuckDB connection
conn = duckdb.connect()

# Run SQL queries
conn.sql('SELECT * FROM my_table;')

DuckDB BigQuery community extension

The duckdb-bigquery community extension is one of the most downloaded DuckDB extensions!

You can inspect the download stats for the last week (here, the week of May 19, 2025) using:

UNPIVOT (
    SELECT 'community' AS repository, *
    FROM 'https://community-extensions.duckdb.org/downloads-last-week.json'
)
ON COLUMNS(* EXCLUDE (_last_update, repository))
INTO NAME extension VALUE downloads_last_week
ORDER BY downloads_last_week DESC;

┌────────────┬─────────────────────┬───────────┬─────────────────────┐
│ repository │    _last_update     │ extension │ downloads_last_week │
│  varchar   │      timestamp      │  varchar  │        int64        │
├────────────┼─────────────────────┼───────────┼─────────────────────┤
│ community  │ 2025-05-21 07:28:50 │ arrow     │              163603 │
│ community  │ 2025-05-21 07:28:50 │ shellfs   │               71496 │
│ community  │ 2025-05-21 07:28:50 │ h3        │               26729 │
│ community  │ 2025-05-21 07:28:50 │ zipfs     │               22344 │
│ community  │ 2025-05-21 07:28:50 │ bigquery  │               21678 │
└────────────┴─────────────────────┴───────────┴─────────────────────┘

The BigQuery extension is in the top 5, with over 21k downloads last week.

INFO: Extension Types
Core extensions are officially maintained by DuckDB Labs and are usually auto-installed/loaded. Community extensions are built and published by the community. To install them, use INSTALL <community_extension> FROM community; and then LOAD <community_extension>;

Prerequisites and Installation

To use the BigQuery extension, you'll need valid Google Cloud credentials. You can either:

  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to a service account file.
  • Or run gcloud auth application-default login to generate credentials stored at $HOME/.config/gcloud/application_default_credentials.json

In terms of permissions, the user or service account should have at least the BigQuery Data Editor and BigQuery Job User roles.
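If you need to set this up from scratch, the gcloud CLI can handle both the credentials and the role grants. A minimal sketch (the key path, project ID, and service-account email below are placeholders, not values from this post):

# Option 1: point DuckDB at a service account key file (placeholder path)
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/bigquery-reader.json"

# Option 2: use your own user credentials
gcloud auth application-default login

# Grant the two roles to the service account (placeholder project and account)
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:bigquery-reader@my-gcp-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:bigquery-reader@my-gcp-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.jobUser"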

Launch a DuckDB session with the CLI:

$ duckdb

You can then install and load the BigQuery community extension:

INSTALL bigquery FROM community;
LOAD bigquery;

Now ATTACH a BigQuery project like any other database:

ATTACH 'project=my-gcp-project' AS bq (TYPE bigquery, READ_ONLY);

INFO: How to find your Project ID?
A Project ID is a globally unique identifier for your GCP project. It's visible in the Google Cloud Console project picker or by running gcloud projects list.

Once attached, querying your dataset is simple:

SELECT * FROM bq.<dataset_name>.<table_name> LIMIT 5;
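Tip: once the BigQuery project is attached, DuckDB's catalog commands should also see it, so you can list what's available before writing queries. A quick sketch (behavior depends on the extension's catalog support):

-- List datasets and tables exposed by the attached BigQuery project
SHOW ALL TABLES;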

Example: querying the PyPI public dataset

Let's query the PyPI public dataset, which logs Python package downloads. Since it's a public dataset, you must set a billing project (your own GCP project with billing enabled):

ATTACH 'project=bigquery-public-data dataset=pypi billing_project=my-gcp-project' AS bigquery_public_data (TYPE bigquery, READ_ONLY);

Then query:

SELECT timestamp, country_code, url, project, file, details, tls_protocol, tls_cipher
FROM bigquery_public_data.pypi.file_downloads
WHERE project = 'duckdb'
  AND "timestamp" = TIMESTAMP '2025-05-26 00:00:00'
LIMIT 100;

WARNING: Large Table
This table is very large. Filter on both project and timestamp (the table is partitioned on timestamp) to avoid high costs.

Behind the scenes, this performs a table scan. The extension actually exposes two functions for querying BigQuery explicitly:

  1. bigquery_scan() – Best for reading a single table efficiently with simple projections:

SELECT * FROM bigquery_scan('my_gcp_project.quacking_dataset.duck_tbl');

  2. bigquery_query() – Runs a custom GoogleSQL read query within your BigQuery project. Recommended for large tables, since filters are pushed down to BigQuery:

SELECT * FROM bigquery_query('my_gcp_project', 'SELECT * FROM `my_gcp_project.quacking_dataset.duck_tbl`');
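Because bigquery_query() hands the GoogleSQL string to BigQuery, any filters or aggregations inside it run there before results reach DuckDB. A hedged sketch against the public PyPI dataset (assuming my-gcp-project is your billing-enabled project):

SELECT *
FROM bigquery_query(
    'my-gcp-project',
    'SELECT country_code, COUNT(*) AS downloads
     FROM `bigquery-public-data.pypi.file_downloads`
     WHERE project = "duckdb"
       AND timestamp >= TIMESTAMP("2025-05-26") AND timestamp < TIMESTAMP("2025-05-27")
     GROUP BY country_code'
);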

Load data into MotherDuck

Now, if you want to load your data into MotherDuck, simply connect to MotherDuck with another attach command using ATTACH 'md:', assuming you have a motherduck_token set as an environment variable.

ATTACH 'md:';

Let's create a cloud database to store our data:

CREATE DATABASE IF NOT EXISTS pypi_playground;

Now you can copy data to MotherDuck with a simple CREATE TABLE ... AS statement, or INSERT INTO ... SELECT if you want to insert data into an existing table:

CREATE TABLE IF NOT EXISTS pypi_playground.duckdb_sample AS
SELECT timestamp, country_code, url, project, file, details, tls_protocol, tls_cipher
FROM bigquery_public_data.pypi.file_downloads
WHERE project = 'duckdb'
  AND "timestamp" = TIMESTAMP '2025-05-26 00:00:00'
LIMIT 100;
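If the target table already exists, an INSERT INTO ... SELECT appends instead of recreating it. For example, loading another day of downloads into the same table could look roughly like this (the date is only illustrative):

INSERT INTO pypi_playground.duckdb_sample
SELECT timestamp, country_code, url, project, file, details, tls_protocol, tls_cipher
FROM bigquery_public_data.pypi.file_downloads
WHERE project = 'duckdb'
  AND "timestamp" = TIMESTAMP '2025-05-27 00:00:00'
LIMIT 100;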

Using Google's Python SDK for BigQuery

Google has a Python SDK for BigQuery which supports fast data transfer into Arrow tables. If you want to optimize performance for your ETL pipelines, especially when working with large tables and filter pushdown, using Arrow results can be significantly faster, as they enable zero-copy interaction with DuckDB.

Here are the high-level steps when using the Python SDK: BigQuery -> PyArrow table -> DuckDB and/or MotherDuck.

You can install the Python library with:

$ pip install google-cloud-bigquery[bqstorage]

The "extras" option [bqstorage] installs google-cloud-bigquery-storage. By default, the google-cloud-bigquery client uses the standard BigQuery API to read query results. This is fine for small results, but much slower and less efficient for large datasets.

When you install the bqstorage extra, you're enabling use of the BigQuery Storage API, which:

  • Streams large query results in parallel.
  • Uses Apache Arrow (via pyarrow package) for fast in-memory columnar data access.
  • Supports high-throughput data transfers directly into Pandas or NumPy structures.
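In practice this means a query result can be pulled straight into a PyArrow table. A minimal sketch (the create_bqstorage_client flag is made explicit here for clarity; it defaults to True and only takes effect when the storage package is installed):

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
job = client.query(
    """SELECT * FROM `bigquery-public-data.pypi.file_downloads`
       WHERE project = 'duckdb'
         AND timestamp >= TIMESTAMP('2025-05-26') AND timestamp < TIMESTAMP('2025-05-27')"""
)
# Fetch the result as a PyArrow table, streaming via the BigQuery Storage API when available
arrow_table = job.to_arrow(create_bqstorage_client=True)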

Let's start by creating two helper functions: get_bigquery_client(), which returns a BigQuery client, and get_bigquery_result(), which runs a given SQL query and returns an Arrow table.

import os
from google.cloud import bigquery
from google.oauth2 import service_account
from google.auth.exceptions import DefaultCredentialsError
import logging
import time
import pyarrow as pa
import duckdb

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)


def get_bigquery_client(project_name: str) -> bigquery.Client:
    """Get a BigQuery client."""
    try:
        service_account_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
        if service_account_path:
            credentials = service_account.Credentials.from_service_account_file(
                service_account_path
            )
            bigquery_client = bigquery.Client(
                project=project_name, credentials=credentials
            )
            return bigquery_client
        raise EnvironmentError(
            "No valid credentials found for BigQuery authentication."
        )
    except DefaultCredentialsError as creds_error:
        raise creds_error


def get_bigquery_result(
    query_str: str, bigquery_client: bigquery.Client
) -> pa.Table:
    """Run a query on BigQuery and return the result as a PyArrow table."""
    try:
        # Start measuring time
        start_time = time.time()

        # Run the query and load the result directly into a PyArrow table
        logging.info(f"Running query: {query_str}")
        pa_tbl = bigquery_client.query(query_str).to_arrow()

        # Log the time taken for query execution and data loading
        elapsed_time = time.time() - start_time
        logging.info(
            f"BigQuery query executed and data loaded in {elapsed_time:.2f} seconds")

        return pa_tbl
    except Exception as e:
        logging.error(f"Error running query: {e}")
        raise

Once we have a PyArrow table, loading data into DuckDB and/or MotherDuck is similar to what we did above with the duckdb-bigquery extension. We connect to MotherDuck (duckdb.connect("md:")), then use a CREATE TABLE ... AS or INSERT INTO ... SELECT statement to load the data. The PyArrow table object can be queried directly, as if it were a DuckDB table.

def create_duckdb_table_from_arrow(
    pa_table: pa.Table,
    table_name: str,
    database_name: str = "bigquery_playground",
    db_path: str = None
) -> None:
    """
    Create a DuckDB table from PyArrow table data.

    Args:
        pa_table: PyArrow table containing the data
        table_name: Name of the table to create in DuckDB
        database_name: Name of the database to create/use (default: bigquery_playground)
        db_path: Database path - use the 'md:' prefix for MotherDuck, a file path for
                 a local database, or just ':memory:' for in-memory
    """
    try:
        # Connect to DuckDB; when targeting MotherDuck, check the token first
        if db_path.startswith("md:"):
            if not os.environ.get("motherduck_token"):
                raise EnvironmentError(
                    "motherduck_token environment variable is not set")
        conn = duckdb.connect(db_path)

        # Create the database if it does not exist and switch to it
        conn.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")
        conn.sql(f"USE {database_name}")

        # Create the table from the PyArrow table (pa_table is read via replacement scan)
        conn.sql(
            f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM pa_table")
        logging.info(
            f"Successfully created table '{table_name}' in database '{database_name}' "
            f"with {len(pa_table)} rows to {db_path}")
    except Exception as e:
        logging.error(f"Error creating DuckDB table: {e}")
        raise
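Note the SELECT * FROM pa_table in the function above: DuckDB's Python client can scan an in-scope PyArrow object directly (a replacement scan). So a quick local check is as simple as this sketch (assuming pa_table holds the Arrow result):

import duckdb

# Query the in-memory Arrow table directly, without copying it into DuckDB first
duckdb.sql("SELECT count(*) AS row_count FROM pa_table").show()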

We can now create the pipeline by calling the functions above:

if __name__ == "__main__":
    bigquery_client = get_bigquery_client("my-gcp-project")
    pa_table = get_bigquery_result(
        """SELECT * FROM `bigquery-public-data.pypi.file_downloads`
           WHERE project = 'duckdb'
           AND timestamp >= TIMESTAMP("2025-05-19")
           AND timestamp < TIMESTAMP("2025-05-20")""",
        bigquery_client)
    create_duckdb_table_from_arrow(
        pa_table, "pypi_file_downloads", db_path="md:")

Running the full pipeline with python ingest_bigquery_data.py, we loaded about 837k rows from BigQuery to MotherDuck in less than 20 seconds!

2025-05-27 09:45:52 - INFO - Running query: SELECT * FROM `bigquery-public-data.pypi.file_downloads` WHERE project = 'duckdb' AND timestamp >= TIMESTAMP("2025-05-19") AND timestamp < TIMESTAMP("2025-05-20")
2025-05-27 09:46:03 - INFO - BigQuery query executed and data loaded in 7.20 seconds
2025-05-27 09:46:11 - INFO - Successfully created table 'pypi_file_downloads' in database 'bigquery_playground' with 837122 rows to md:
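From there you can sanity-check the load from any DuckDB client connected to MotherDuck, for example (assuming the database and table names used above):

SELECT count(*) FROM bigquery_playground.pypi_file_downloads;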

Check the full Python gist here.

BigQuery loves ducks

Both the duckdb-bigquery extension and Google's Python SDK make it incredibly easy to move data from BigQuery into DuckDB or MotherDuck.

Also check out the https://duckdbstats.com/ project and its source code for another example of how to ingest, transform, and serve data in MotherDuck from a BigQuery source dataset.

Keep coding, and keep quacking! 🦆

