DataFrame

Overview

A DataFrame is a two-dimensional data structure that organizes data into rows and columns, similar to a spreadsheet or database table. DataFrames have become the standard way to work with structured data in Python, R, and other data analysis languages, with pandas being the most popular DataFrame implementation in Python.

Key Characteristics

DataFrames store data in labeled columns, and each column can hold a different data type (such as integers, text, or dates). Beyond storing tabular data, DataFrames provide built-in methods for manipulation, filtering, grouping, and analysis. Column names give intuitive access to the data, while index labels identify specific rows.
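
To make these characteristics concrete, here is a minimal pandas sketch. The sample data, the index labels, and the variable names are invented for illustration and are not part of any DuckDB API.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different data type.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [25, 30, 35],
        "joined": pd.to_datetime(["2021-01-15", "2022-06-01", "2023-03-20"]),
    },
    index=["a", "b", "c"],  # index labels identify rows
)

# Column names give direct access to a column (a pandas Series).
ages = df["age"]

# Index labels select specific rows.
bob = df.loc["b"]

# Built-in methods cover filtering and aggregation.
over_25 = df[df["age"] > 25]
mean_age = df["age"].mean()
```

Here `df.dtypes` would report a separate type for each column, which is what distinguishes a DataFrame from a plain two-dimensional array of a single type.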

DuckDB Integration

DuckDB integrates with pandas DataFrames through its Python API. You can query a DataFrame directly with SQL by referencing its Python variable name in duckdb.sql(), and you can convert a DuckDB query result to a DataFrame by calling .df() on it. This lets you combine the performance benefits of DuckDB's query engine with the familiar pandas interface.

For example, you can query a pandas DataFrame directly:

import duckdb
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
# DuckDB resolves the table name "df" to the pandas DataFrame defined above
result = duckdb.sql("SELECT * FROM df WHERE age > 25")

Or convert DuckDB results to a DataFrame:

import duckdb

# Assumes a table named my_table already exists in the default DuckDB database
duckdb_result = duckdb.sql("SELECT * FROM my_table")
pandas_df = duckdb_result.df()

Common Implementations

Beyond pandas, other popular DataFrame implementations include:

polars - A fast DataFrame library written in Rust
Apache Arrow - A cross-language development platform for in-memory analytics, built on a columnar memory format; its tables play a role similar to DataFrames
R data.frame - The built-in DataFrame type in R
Spark DataFrame - Distributed DataFrames for big data processing

DuckDB can interact with most of these DataFrame implementations, making it a versatile tool in the modern data stack.