DataFrame
Overview
A DataFrame is a two-dimensional data structure that organizes data into rows and columns, similar to a spreadsheet or database table. DataFrames have become the standard way to work with structured data in Python, R, and other data analysis languages, with pandas being the most popular DataFrame implementation in Python.
Key Characteristics
DataFrames store data in labeled columns, and each column can hold a different data type (such as integers, text, or dates). Unlike simple tables, DataFrames provide built-in methods for data manipulation, filtering, grouping, and analysis. Column names allow for intuitive access to data, while index labels identify specific rows.
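As a minimal sketch of these characteristics in pandas, the snippet below builds a small DataFrame with mixed column types and then accesses, filters, and groups it. The sample data and the variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data; each column holds a different data type.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [25, 30, 35],
        "joined": pd.to_datetime(["2021-01-15", "2022-06-01", "2023-03-20"]),
        "team": ["data", "data", "ops"],
    },
    index=["a", "b", "c"],  # index labels identify rows
)

ages = df["age"]              # access a column by its name
row_b = df.loc["b"]           # access a row by its index label
over_25 = df[df["age"] > 25]  # filter rows with a boolean mask
mean_age_by_team = df.groupby("team")["age"].mean()  # group and aggregate
```

Here `over_25` keeps only the rows for Bob and Carol, and `mean_age_by_team` holds one average age per team.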
DuckDB Integration
DuckDB seamlessly integrates with pandas DataFrames through its Python API. You can query DataFrames directly using SQL with duckdb.sql(), or convert DuckDB query results to DataFrames using .df(). This allows you to combine the performance benefits of DuckDB's query engine with the familiar pandas interface.
For example, you can query a pandas DataFrame directly:
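import duckdb
import pandas as pd
# Build a small pandas DataFrame
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
# DuckDB resolves the table name `df` to the pandas variable of the same name
result = duckdb.sql("SELECT * FROM df WHERE age > 25")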
Or convert DuckDB results to a DataFrame:
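# Assumes a table named my_table already exists in the default in-memory database
duckdb_result = duckdb.sql("SELECT * FROM my_table")
pandas_df = duckdb_result.df()  # materialize the result as a pandas DataFrame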
Common Implementations
Beyond pandas, other popular DataFrame implementations include:
- polars - A fast DataFrame library written in Rust
- Apache Arrow - A cross-language columnar in-memory format for analytics, whose tables act as a DataFrame-like structure
- R data.frame - The built-in data frame type in R, which inspired pandas
- Spark DataFrame - Distributed DataFrames for big data processing
DuckDB can interact with most of these DataFrame implementations, making it a versatile tool in the modern data stack.