Webinar · Python · AI, ML and LLMs

Data-based: Going Beyond the Dataframe

2025/11/20

TL;DR: Feeling the limits of Pandas with large datasets? This guide shows how DuckDB acts as a turbocharger for your existing workflow, not a replacement. Learn to query DataFrames with SQL, achieve 10-40x faster performance with 90% less memory, and scale from your laptop to the cloud with MotherDuck in one line of code.

If you work with data in Python, you almost certainly use and love Pandas. Its expressive DataFrame API has become the de facto standard for data manipulation.

But as your datasets grow, you've probably started feeling the pain. Your laptop fans spin up, your Jupyter kernel crashes with an OutOfMemoryError, and simple groupby operations take forever. You've hit the "DataFrame wall."

Many developers assume the next step is a massive leap to a complex distributed system like Spark. But you can break through that wall without leaving your favorite tools behind.

DuckDB and MotherDuck solve this problem by augmenting your existing workflow. In this post, we'll walk through a practical guide, based on a recent live demo, showing you how to:

  1. Query Pandas DataFrames and local files with the full power of SQL.
  2. Achieve massive performance gains—often over 10x faster and with 90% less memory.
  3. Build a hybrid ML pipeline that uses the best of DuckDB, Pandas, and Scikit-learn.
  4. Scale from local to the cloud with a single line of code.

How to Query Pandas DataFrames Directly with DuckDB SQL

DuckDB's zero-copy integration with Pandas runs on Apache Arrow. You can query a Pandas DataFrame with SQL without any costly data conversion.

We'll create a sample 500,000-row DataFrame and run a standard aggregation in both Pandas and DuckDB.

import pandas as pd
import duckdb
import numpy as np

# Create sample sales data - 500k rows
np.random.seed(42)
n_rows = 500_000
sales_data = {
    'product': np.random.choice(['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Phone', 'Tablet', 'Headphones', 'Cable'], n_rows),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_rows),
    'amount': np.random.uniform(10, 2000, n_rows).round(2),
    'customer_id': np.random.randint(0, 20000, n_rows),
    'quantity': np.random.randint(1, 10, n_rows)
}
df = pd.DataFrame(sales_data)

The standard Pandas approach is familiar:

# Pandas approach: Standard groupby aggregation
result_pandas = df.groupby('product').agg({
    'amount': ['sum', 'mean', 'count']
}).round(2)
# Time: ~0.02 seconds

Now, let's do the exact same thing with DuckDB, querying the df object directly:

# DuckDB approach: SQL query on the same DataFrame
result_duckdb = duckdb.sql("""
    SELECT
        product,
        SUM(amount) as total_sales,
        ROUND(AVG(amount), 2) as avg_sales,
        COUNT(*) as num_transactions
    FROM df  -- Notice we are querying the DataFrame 'df' directly
    GROUP BY product
    ORDER BY total_sales DESC
""").df()
# Time: ~0.01 seconds

DuckDB automatically discovers the df object in your Python environment and lets you query it as if it were a database table. For SQL fans, this exposes the full declarative power of SQL on your DataFrames.
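
If you prefer to be explicit rather than rely on automatic discovery, you can also register a DataFrame as a named view on a connection. Here is a minimal sketch using DuckDB's register() method and the df created above:

import duckdb

con = duckdb.connect()       # in-memory connection
con.register('sales', df)    # expose the DataFrame as the view 'sales'
top_products = con.sql("""
    SELECT product, SUM(amount) as total_sales
    FROM sales
    GROUP BY product
    ORDER BY total_sales DESC
""").df()

Explicit registration is handy when the DataFrame lives outside the query's local scope, or when you want to give it a more descriptive name.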

Querying Files Directly from Disk

DuckDB can query files like CSVs and Parquet directly from disk, streaming the data instead of loading it all into memory. This works well with datasets that don't fit in RAM.

# Save to CSV for demonstration
df.to_csv('sales_data.csv', index=False)

# Query CSV directly without loading into pandas
result = duckdb.sql("""
    SELECT
        region,
        COUNT(DISTINCT customer_id) as unique_customers,
        SUM(amount) as total_revenue
    FROM 'sales_data.csv'  -- Query the file path
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()
# Note: File was streamed from disk, never fully loaded into memory

DuckDB supports wildcards, so if you have a folder full of log files or monthly sales reports, you can query them all at once:

SELECT * FROM 'sales_data_*.csv'

This saves you from writing tedious Python loops to read and concatenate files.
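
For comparison, the manual version in plain Python and Pandas looks roughly like this (a sketch, assuming the files follow the sales_data_*.csv naming pattern):

import glob
import pandas as pd

# Read every matching file and concatenate the results into one DataFrame
frames = [pd.read_csv(path) for path in glob.glob('sales_data_*.csv')]
combined = pd.concat(frames, ignore_index=True)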

The Performance Showdown: DuckDB vs. Pandas

Let's scale things up and see where DuckDB truly shines. We'll generate a 5-million-row dataset of e-commerce transactions and perform a few common analytical queries.

First, we generate the data and store it in a local DuckDB database file (my_analysis.db).

con = duckdb.connect('my_analysis.db')
con.sql("""
    CREATE OR REPLACE TABLE transactions AS
    SELECT
        i AS transaction_id,
        DATE '2024-01-01' + INTERVAL (CAST(RANDOM() * 365 AS INTEGER)) DAY AS transaction_date,
        -- ... additional columns
    FROM range(5000000) t(i)
""")
# Generated and wrote to disk 5M rows in 1.56 seconds
# DuckDB database size: 115.3 MB

Notice the compression: 5 million rows with 7 columns take up only 115 MB on disk.

The Pandas Approach

To analyze this data with Pandas, we first load the entire dataset into memory.

# Load into pandas from DuckDB
df = con.sql("SELECT * FROM transactions").df()
# Load time: 0.54 seconds
# Memory usage: ~1035 MB (due to Pandas' in-memory object overhead)

Now, let's run three queries: a complex group-by, a top-N aggregation, and a moving average calculation. These operations take around 1 second in total, including the initial load, and consume over 1 GB of RAM.
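
The demo's exact Pandas code isn't reproduced here, but the three operations look roughly like this (a sketch, assuming the column names from the transactions table above):

# Query 1: sales by category and region
q1_pandas = df.groupby(['category', 'region']).agg(
    total_sales=('amount', 'sum'),
    avg_sale=('amount', 'mean'),
    num_transactions=('amount', 'count'),
    unique_customers=('customer_id', 'nunique'),
).reset_index()

# Query 2: top-10 customers by total spend
q2_pandas = (
    df.groupby('customer_id')['amount'].sum()
      .nlargest(10)
      .reset_index(name='total_spend')
)

# Query 3: 7-day moving average of daily revenue
daily_revenue = df.groupby('transaction_date')['amount'].sum().sort_index()
q3_pandas = daily_revenue.rolling(window=7).mean()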

The DuckDB Approach

With DuckDB, we don't load the data. We query the 115 MB file on disk directly.

# Query 1: Sales by category and region
q1_duckdb = con.sql("""
    SELECT
        category,
        region,
        SUM(amount) as total_sales,
        AVG(amount) as avg_sale,
        COUNT(*) as num_transactions,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM transactions
    GROUP BY category, region
""").df()
# ... (similar SQL for Query 2 and 3)
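
Sketched out, Queries 2 and 3 follow the same pattern (representative SQL, not the demo's exact queries):

# Query 2 (sketch): top-10 customers by total spend
q2_duckdb = con.sql("""
    SELECT customer_id, SUM(amount) as total_spend
    FROM transactions
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").df()

# Query 3 (sketch): 7-day moving average of daily revenue via a window function
q3_duckdb = con.sql("""
    SELECT
        transaction_date,
        SUM(amount) AS daily_revenue,
        AVG(SUM(amount)) OVER (
            ORDER BY transaction_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS revenue_7d_moving_avg
    FROM transactions
    GROUP BY transaction_date
    ORDER BY transaction_date
""").df()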

The total time for the same three queries is just 0.07 seconds, with minimal RAM usage.

The Results

Approach | Load Time (s)  | Total Query Time (s) | Total (s) | Peak RAM Usage
---------|----------------|----------------------|-----------|----------------
Pandas   | 0.54           | 0.45                 | 0.99      | ~1035 MB
DuckDB   | N/A (streamed) | 0.07                 | 0.07      | Minimal (data streamed from 115 MB file)

DuckDB is 14x faster and uses over 90% less memory. This difference becomes even more dramatic as datasets grow.

In a test with a simulated 50-million-row dataset (by reading a Parquet file 10 times), DuckDB was over 40x faster and used 99% less memory than the Pandas equivalent, which required over 10 GB of RAM.

This isn't just about speed. DuckDB's memory efficiency lets you analyze datasets on your laptop that would be impossible with an in-memory tool like Pandas.

A Real-World Scenario: An ML Feature Engineering Pipeline

Performance benchmarks are useful, but let's see how this hybrid approach works in a practical, end-to-end machine learning pipeline. Our goal: predict customer churn based on 1 million transactions from 50,000 customers.

The strategy is to use the right tool for the right job.

Step 1: Heavy Lifting with DuckDB SQL

We'll use DuckDB for what it does best: fast, complex aggregations over large datasets. We can write a single SQL query to generate a rich set of features for each customer, including recency, frequency, and monetary value.

# This single query engineers 20+ features from 1M raw transactions
features = duckdb.sql("""
    WITH customer_stats AS (
        SELECT
            customer_id,
            MAX(transaction_date) AS last_purchase_date,
            DATE '2024-11-28' - MAX(transaction_date) AS days_since_last_purchase,
            COUNT(*) AS total_transactions,
            SUM(amount) AS total_spend,
            AVG(amount) AS avg_transaction_value,
            -- ... and many more aggregations
        FROM customer_transactions
        GROUP BY customer_id
    )
    SELECT
        *,
        -- ... additional calculations on the aggregated data
    FROM customer_stats
""").df()
# Engineered 20 features for 50,000 customers
# Time: 0.05s

In just 50 milliseconds, we've crunched 1 million raw transactions down to a clean, 50,000-row feature set with one row per customer.

Steps 2 & 3: Preprocessing and Training with Pandas & Scikit-learn

Now that we have a smaller, manageable DataFrame (features), we can hand it off to the rich ecosystems of Pandas and Scikit-learn for the final steps.

# Step 2: Use Pandas for final type conversions and target variable creation
features['last_purchase_date'] = pd.to_datetime(features['last_purchase_date'])
# ... create the 'will_purchase_next_30d' target variable

# Step 3: Use Scikit-learn to train a Random Forest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = features[feature_columns]
y = features['will_purchase_next_30d']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)
model.fit(X_train, y_train)
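
The demo stops at training, but a quick evaluation on the held-out split is the natural next step (a minimal sketch):

# Evaluate on the held-out test split
from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"ROC AUC:  {roc_auc_score(y_test, y_proba):.3f}")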

The Power of the Hybrid Approach

The entire pipeline, from generating 1 million rows of data to training the model, takes less than half a second.

Stage                  | Time (seconds) | Tool Used
-----------------------|----------------|--------------
1. Data Generation     | 0.07           | DuckDB
2. Feature Engineering | 0.05           | DuckDB (SQL)
3. Preprocessing       | 0.01           | Pandas
4. Model Training      | 0.34           | Scikit-learn
TOTAL PIPELINE         | 0.47           | Hybrid Stack

We used DuckDB for the heavy data crunching, then passed a smaller, aggregated DataFrame to Pandas and Scikit-learn for their specialized APIs. This approach lets you iterate faster and, as developer advocate Jacob likes to say, "be wrong faster," which is critical for effective data science.

Scaling from Local to Cloud: How MotherDuck Works

Everything we've done so far has been on a local machine. But what happens when your data grows too large even for DuckDB's out-of-core processing, or you need to share your analysis with a team?

MotherDuck provides a path to the cloud. Connecting to a local, persistent DuckDB database looks like this:

# Connect to a local database file
local_con = duckdb.connect('my_analysis.db')

To run the exact same code against MotherDuck's serverless cloud platform, you just change the connection string:

# Connect to MotherDuck - just change the connection string!
cloud_con = duckdb.connect('md:my_db?motherduck_token=YOUR_TOKEN')

That's it. By adding md: and your token, you get serverless compute that can scale up to instances with over a terabyte of RAM, persistent shared storage, and a full data warehouse feature set.

You can develop and test your pipelines locally with DuckDB, then deploy them to MotherDuck for production-scale jobs with no code changes.
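
For example, the same query pattern we ran locally executes unchanged against the cloud connection (a sketch, assuming a transactions table already exists in your MotherDuck database):

# Identical SQL, now executed by MotherDuck's serverless compute
result = cloud_con.sql("""
    SELECT category, SUM(amount) as total_sales
    FROM transactions
    GROUP BY category
    ORDER BY total_sales DESC
""").df()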

Conclusion: A Faster, Simpler Data Workflow

By integrating DuckDB and MotherDuck into your Python workflow, you don't have to abandon the tools you love. You augment them, creating a hybrid stack that gives you the best of all worlds: the interactivity of Pandas, the raw power of SQL, and a path to the cloud.

You get speed, simplicity, and scale. You can iterate on your analyses faster and tackle problems that were previously out of reach for your laptop.

Ready to try it yourself? Sign up for a free MotherDuck account, grab the Jupyter Notebook from this tutorial, and see how much faster your own workflows can be.

Frequently Asked Questions

What is the difference between DuckDB and Polars?

Polars is another high-performance DataFrame library with its own powerful expression API. Like DuckDB, Polars is vectorized and very fast. The key difference is philosophy: DuckDB is an embedded analytical database that brings a powerful SQL engine to your local files and DataFrames. Polars is a Rust-powered DataFrame library designed to be a faster, more parallel alternative to the Pandas API. DuckDB also has mature out-of-core capabilities, which allow it to process datasets larger than your available RAM by spilling to disk. Many choose DuckDB for its robust SQL capabilities and out-of-core processing, while Polars is excellent for those who prefer a DataFrame-style API.

How does MotherDuck handle scaling for large datasets?

MotherDuck scales both vertically and horizontally. Vertically, you can choose from different instance sizes for your workload, up to massive machines with over a terabyte of RAM. Horizontally, you can run multiple, isolated jobs in parallel using service accounts, all sharing the same underlying storage.

What are the data size limits for DuckDB?

Locally, the main limiting factor for DuckDB isn't RAM but available disk space. We've seen users process hundreds of billions of rows on a single node. RAM primarily affects speed; disk determines scale.
