Data-based: Going Beyond the Dataframe
2025/11/20

TL;DR: Feeling the limits of Pandas with large datasets? This guide shows how DuckDB acts as a turbocharger for your existing workflow, not a replacement. Learn to query DataFrames with SQL, achieve 10-40x faster performance with 90% less memory, and scale from your laptop to the cloud with MotherDuck in one line of code.
If you work with data in Python, you almost certainly use and love Pandas. Its DataFrame API is expressive, and it has become the de facto standard for data manipulation.
But as your datasets grow, you've probably started feeling the pain. Your laptop fans spin up, your Jupyter kernel crashes with an OutOfMemoryError, and simple groupby operations take forever. You've hit the "DataFrame wall."
Many developers assume the next step is a massive leap to a complex distributed system like Spark. But you can break through that wall without leaving your favorite tools behind.
DuckDB and MotherDuck solve this problem by augmenting your existing workflow. In this post, we'll walk through a practical guide, based on a recent live demo, showing you how to:
- Query Pandas DataFrames and local files with the full power of SQL.
- Achieve massive performance gains—often over 10x faster and with 90% less memory.
- Build a hybrid ML pipeline that uses the best of DuckDB, Pandas, and Scikit-learn.
- Scale from local to the cloud with a single line of code.
How to Query Pandas DataFrames Directly with DuckDB SQL
DuckDB's zero-copy integration with Pandas is built on Apache Arrow, so you can query a Pandas DataFrame with SQL without any costly data conversion.
We'll create a sample 500,000-row DataFrame and run a standard aggregation in both Pandas and DuckDB.
import pandas as pd
import duckdb
import numpy as np
# Create sample sales data - 500k rows
np.random.seed(42)
n_rows = 500_000
sales_data = {
    'product': np.random.choice(['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Phone', 'Tablet', 'Headphones', 'Cable'], n_rows),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_rows),
    'amount': np.random.uniform(10, 2000, n_rows).round(2),
    'customer_id': np.random.randint(0, 20000, n_rows),
    'quantity': np.random.randint(1, 10, n_rows)
}
df = pd.DataFrame(sales_data)
The standard Pandas approach is familiar:
# Pandas approach: Standard groupby aggregation
result_pandas = df.groupby('product').agg({
    'amount': ['sum', 'mean', 'count']
}).round(2)
# Time: ~0.02 seconds
Now, let's do the exact same thing with DuckDB, querying the df object directly:
# DuckDB approach: SQL query on the same DataFrame
result_duckdb = duckdb.sql("""
    SELECT
        product,
        SUM(amount) as total_sales,
        ROUND(AVG(amount), 2) as avg_sales,
        COUNT(*) as num_transactions
    FROM df  -- Notice we are querying the DataFrame 'df' directly
    GROUP BY product
    ORDER BY total_sales DESC
""").df()
# Time: ~0.01 seconds
DuckDB automatically discovers the df object in your Python environment and lets you query it as if it were a database table. For SQL fans, this exposes the full declarative power of SQL on your DataFrames.
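If you prefer to be explicit, for example when working across functions or scripts, you can also register a DataFrame under a name of your choosing instead of relying on variable discovery. A minimal sketch using the df from above:
import duckdb

con = duckdb.connect()           # in-memory database
con.register('sales', df)        # expose the DataFrame as a view named 'sales'

top_products = con.sql("""
    SELECT product, SUM(amount) AS total_sales
    FROM sales
    GROUP BY product
    ORDER BY total_sales DESC
    LIMIT 3
""").df()

con.unregister('sales')          # drop the view when you're done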
Querying Files Directly from Disk
DuckDB can query files like CSVs and Parquet directly from disk, streaming the data instead of loading it all into memory. This works well with datasets that don't fit in RAM.
# Save to CSV for demonstration
df.to_csv('sales_data.csv', index=False)
# Query CSV directly without loading into pandas
result = duckdb.sql("""
    SELECT
        region,
        COUNT(DISTINCT customer_id) as unique_customers,
        SUM(amount) as total_revenue
    FROM 'sales_data.csv'  -- Query the file path
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()
# Note: File was streamed from disk, never fully loaded into memory
DuckDB supports wildcards, so if you have a folder full of log files or monthly sales reports, you can query them all at once:
SELECT * FROM 'sales_data_*.csv'
This saves you from writing tedious Python loops to read and concatenate files.
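In Python, the same pattern works through read_csv options. A hedged sketch (the file names are hypothetical): union_by_name lines up files whose columns differ slightly, and filename adds a column recording which file each row came from.
import duckdb

per_file = duckdb.sql("""
    SELECT filename, SUM(amount) AS total_revenue
    FROM read_csv('sales_data_*.csv', union_by_name = true, filename = true)
    GROUP BY filename
    ORDER BY total_revenue DESC
""").df()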
The Performance Showdown: DuckDB vs. Pandas
Let's scale things up and see where DuckDB truly shines. We'll generate a 5-million-row dataset of e-commerce transactions and perform a few common analytical queries.
First, we generate the data and store it in a local DuckDB database file (my_analysis.db).
con = duckdb.connect('my_analysis.db')
con.sql("""
    CREATE OR REPLACE TABLE transactions AS
    SELECT
        i AS transaction_id,
        DATE '2024-01-01' + INTERVAL (CAST(RANDOM() * 365 AS INTEGER)) DAY AS transaction_date,
        -- ... additional columns
    FROM range(5000000) t(i)
""")
# Generated 5M rows and wrote them to disk in 1.56 seconds
# DuckDB database size: 115.3 MB
Notice the compression: 5 million rows with 7 columns take up only 115 MB on disk.
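The demo elides the remaining column definitions, but the pattern for synthesizing them is plain SQL over range(). A hypothetical sketch, with column names and value ranges chosen for illustration rather than matching the demo's exact schema:
con.sql("""
    CREATE OR REPLACE TABLE transactions_demo AS
    SELECT
        i AS transaction_id,
        DATE '2024-01-01' + INTERVAL (CAST(RANDOM() * 365 AS INTEGER)) DAY AS transaction_date,
        ['Electronics', 'Clothing', 'Home', 'Sports'][CAST(FLOOR(RANDOM() * 4) AS INTEGER) + 1] AS category,
        ['North', 'South', 'East', 'West'][CAST(FLOOR(RANDOM() * 4) AS INTEGER) + 1] AS region,
        ROUND(RANDOM() * 500 + 5, 2) AS amount,
        CAST(FLOOR(RANDOM() * 200000) AS INTEGER) AS customer_id
    FROM range(5000000) t(i)
""")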
The Pandas Approach
To analyze this data with Pandas, we first load the entire dataset into memory.
# Load into pandas from DuckDB
df = con.sql("SELECT * FROM transactions").df()
# Load time: 0.54 seconds
# Memory usage: ~1035 MB (due to Pandas' in-memory object overhead)
Now, let's run three queries: a complex group-by, a top-N aggregation, and a moving average calculation. These operations take around 1 second in total, including the initial load, and consume over 1 GB of RAM.
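The demo doesn't print all three Pandas snippets, but they look roughly like this (an illustrative sketch; column names follow the transactions table above, not the demo's exact code):
# Query 1: sales by category and region
q1_pandas = df.groupby(['category', 'region']).agg(
    total_sales=('amount', 'sum'),
    avg_sale=('amount', 'mean'),
    num_transactions=('amount', 'count'),
    unique_customers=('customer_id', 'nunique'),
)

# Query 2: top 10 customers by total spend
q2_pandas = df.groupby('customer_id')['amount'].sum().nlargest(10)

# Query 3: 7-day moving average of daily revenue
daily_revenue = df.groupby('transaction_date')['amount'].sum().sort_index()
q3_pandas = daily_revenue.rolling(7).mean()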
The DuckDB Approach
With DuckDB, we don't load the data. We query the 115 MB file on disk directly.
# Query 1: Sales by category and region
q1_duckdb = con.sql("""
    SELECT
        category, region,
        SUM(amount) as total_sales,
        AVG(amount) as avg_sale,
        COUNT(*) as num_transactions,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM transactions
    GROUP BY category, region
""").df()
# ... (similar SQL for Query 2 and 3)
The total time for the same three queries is just 0.07 seconds, with minimal RAM usage.
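For reference, queries 2 and 3 look roughly like this in SQL (a hedged sketch; the demo's exact statements aren't reproduced here):
# Query 2: top 10 customers by total spend
q2_duckdb = con.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM transactions
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").df()

# Query 3: 7-day moving average of daily revenue, via a window function
q3_duckdb = con.sql("""
    SELECT
        transaction_date,
        SUM(amount) AS daily_revenue,
        AVG(SUM(amount)) OVER (
            ORDER BY transaction_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS revenue_7d_avg
    FROM transactions
    GROUP BY transaction_date
    ORDER BY transaction_date
""").df()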
The Results
| Approach | Load Time (s) | Total Query Time (s) | Total (s) | Peak RAM Usage |
|---|---|---|---|---|
| Pandas | 0.54 | 0.45 | 0.99 | ~1035 MB |
| DuckDB | N/A (streamed) | 0.07 | 0.07 | Minimal (data streamed from 115 MB file) |
DuckDB is 14x faster and uses over 90% less memory. This difference becomes even more dramatic as datasets grow.
In a test with a simulated 50-million-row dataset (by reading a Parquet file 10 times), DuckDB was over 40x faster and used 99% less memory than the Pandas equivalent, which required over 10 GB of RAM.
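The exact benchmark code isn't reproduced here, but the trick of simulating a larger dataset by scanning the same Parquet file several times looks roughly like this (the file name is hypothetical; DuckDB's read_parquet accepts a list of paths):
# Export the 5M-row table to Parquet once...
con.sql("COPY transactions TO 'transactions.parquet' (FORMAT PARQUET)")

# ...then scan it 10 times to simulate ~50M rows
files = ['transactions.parquet'] * 10
rel = duckdb.read_parquet(files)

big = duckdb.sql("""
    SELECT category, SUM(amount) AS total_sales
    FROM rel
    GROUP BY category
""").df()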
This isn't just about speed. DuckDB's memory efficiency lets you analyze datasets on your laptop that would be impossible with an in-memory tool like Pandas.
A Real-World Scenario: An ML Feature Engineering Pipeline
Performance benchmarks are useful, but let's see how this hybrid approach works in a practical, end-to-end machine learning pipeline. Our goal: predict customer churn based on 1 million transactions from 50,000 customers.
The strategy is to use the right tool for the right job.
Step 1: Heavy Lifting with DuckDB SQL
We'll use DuckDB for what it does best: fast, complex aggregations over large datasets. We can write a single SQL query to generate a rich set of features for each customer, including recency, frequency, and monetary value.
# This single query engineers 20+ features from 1M raw transactions
features = duckdb.sql("""
    WITH customer_stats AS (
        SELECT
            customer_id,
            MAX(transaction_date) AS last_purchase_date,
            DATE '2024-11-28' - MAX(transaction_date) AS days_since_last_purchase,
            COUNT(*) AS total_transactions,
            SUM(amount) AS total_spend,
            AVG(amount) AS avg_transaction_value,
            -- ... and many more aggregations
        FROM customer_transactions
        GROUP BY customer_id
    )
    SELECT
        *,
        -- ... additional calculations on the aggregated data
    FROM customer_stats
""").df()
# Engineered 20 features for 50,000 customers
# Time: 0.05s
In just 50 milliseconds, we've crunched 1 million raw transactions down to a clean, 50,000-row feature set with one row per customer.
Step 2 & 3: Preprocessing and Training with Pandas & Scikit-learn
Now that we have a smaller, manageable DataFrame (features), we can hand it off to the rich ecosystems of Pandas and Scikit-learn for the final steps.
# Step 2: Use Pandas for final type conversions and target variable creation
features['last_purchase_date'] = pd.to_datetime(features['last_purchase_date'])
# ... create the 'will_purchase_next_30d' target variable
# Step 3: Use Scikit-learn to train a Random Forest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X = features[feature_columns]
y = features['will_purchase_next_30d']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)
model.fit(X_train, y_train)
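From here, evaluation is standard Scikit-learn. A minimal sketch, assuming the train/test split above and a binary target:
from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"ROC AUC:  {roc_auc_score(y_test, y_prob):.3f}")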
The Power of the Hybrid Approach
The entire pipeline, from generating 1 million rows of data to training the churn model, takes less than a second.
| Stage | Time (seconds) | Tool Used |
|---|---|---|
| 1. Data Generation | 0.07 | DuckDB |
| 2. Feature Engineering | 0.05 | DuckDB (SQL) |
| 3. Preprocessing | 0.01 | Pandas |
| 4. Model Training | 0.34 | Scikit-learn |
| 5. TOTAL PIPELINE | 0.47 | Hybrid Stack |
We used DuckDB for the heavy data crunching, then passed a smaller, aggregated DataFrame to Pandas and Scikit-learn for their specialized APIs. This approach lets you iterate faster and, as developer advocate Jacob likes to say, "be wrong faster," which is critical for effective data science.
Scaling from Local to Cloud: How MotherDuck Works
Everything we've done so far has been on a local machine. But what happens when your data grows too large even for DuckDB's out-of-core processing, or you need to share your analysis with a team?
MotherDuck provides a path to the cloud. Connecting to a local, persistent DuckDB database looks like this:
# Connect to a local database file
local_con = duckdb.connect('my_analysis.db')
To run the exact same code against MotherDuck's serverless cloud platform, you just change the connection string:
# Connect to MotherDuck - just change the connection string!
cloud_con = duckdb.connect('md:my_db?motherduck_token=YOUR_TOKEN')
That's it. By adding md: and your token, you get serverless compute that can scale up to instances with over a terabyte of RAM, persistent shared storage, and a full data warehouse feature set.
You can develop and test your pipelines locally with DuckDB, then deploy them to MotherDuck for production-scale jobs with no code changes.
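A common pattern is to develop against local tables, then push results up once they're ready to share. A hedged sketch (the database and table names are placeholders, and it assumes the features DataFrame from the ML example is still in scope):
# Upload a local DataFrame as a cloud table your team can query
cloud_con.sql("CREATE OR REPLACE TABLE churn_features AS SELECT * FROM features")

# The same SQL now runs against MotherDuck's serverless compute
cloud_con.sql("SELECT COUNT(*) AS n_rows FROM churn_features").show()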
Conclusion: A Faster, Simpler Data Workflow
By integrating DuckDB and MotherDuck into your Python workflow, you don't have to abandon the tools you love. You augment them, creating a hybrid stack that gives you the best of all worlds: the interactivity of Pandas, the raw power of SQL, and a path to the cloud.
You get speed, simplicity, and scale. You can iterate on your analyses faster and tackle problems that were previously out of reach for your laptop.
Ready to try it yourself? Sign up for a free MotherDuck account, grab the Jupyter Notebook from this tutorial, and see how much faster your own workflows can be.
Frequently Asked Questions
What is the difference between DuckDB and Polars?
Polars is another high-performance DataFrame library with its own powerful expression API. Like DuckDB, Polars is vectorized and very fast. The key difference is philosophy: DuckDB is an embedded analytical database that brings a powerful SQL engine to your local files and DataFrames. Polars is a Rust-powered DataFrame library designed to be a faster, more parallel alternative to the Pandas API. DuckDB also has mature out-of-core capabilities, which allow it to process datasets larger than your available RAM by spilling to disk. Many choose DuckDB for its robust SQL capabilities and out-of-core processing, while Polars is excellent for those who prefer a DataFrame-style API.
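The two also combine nicely: DuckDB's replacement scans work on Polars DataFrames, and query results can come back as Polars via .pl(). A small illustrative sketch:
import duckdb
import polars as pl

pl_df = pl.DataFrame({
    "product": ["Laptop", "Mouse", "Laptop"],
    "amount": [999.0, 25.0, 1249.0],
})

# Query the Polars DataFrame with SQL and get a Polars DataFrame back
out = duckdb.sql("""
    SELECT product, SUM(amount) AS total
    FROM pl_df
    GROUP BY product
""").pl()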
How does MotherDuck handle scaling for large datasets?
MotherDuck scales both vertically and horizontally. Vertically, you can choose from different instance sizes for your workload, up to massive machines with over a terabyte of RAM. Horizontally, you can run multiple, isolated jobs in parallel using service accounts, all sharing the same underlying storage.
What are the data size limits for DuckDB?
Locally, the main limiting factor for DuckDB isn't RAM; it's your available disk space. We've seen users process hundreds of billions of rows on a single node. RAM primarily affects speed, while disk handles the scale.
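If you want to see the out-of-core behavior explicitly, you can cap DuckDB's memory and point it at a spill directory. A minimal sketch (the limit and path are illustrative):
import duckdb

con = duckdb.connect('my_analysis.db')
con.sql("SET memory_limit = '4GB'")                    # cap RAM usage
con.sql("SET temp_directory = '/tmp/duckdb_spill'")    # where larger-than-RAM operators spill to disk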