DuckDB Tutorial For Beginners

2023/04/04

Imagine a world where the efficiency of database management meets the simplicity of in-process analytics. Sounds like a data analyst's dream, doesn't it? In the fast-paced era of big data, professionals often grapple with the trade-off between powerful analytics and ease of use. A staggering 80% of data scientists report spending a significant portion of their time on data preparation rather than analysis—highlighting a critical need for more efficient tools.

Enter DuckDB, an in-process SQL OLAP database engineered for analytical tasks. This article promises a deep dive into DuckDB, shedding light on its unique capability to blend database efficiency with straightforward management. Readers will embark on a journey through DuckDB's installation process, explore its core features, and apply its functionalities to real-world data analytics scenarios. Expect to uncover how DuckDB stands out in the world of data analytics, making it a compelling choice for developers and data analysts alike.

Are you ready to explore how DuckDB can revolutionize your data analytics workflow? Let's delve into the intricacies of this powerful database system.

Introduction to DuckDB - A Deep Dive into the In-Process SQL OLAP Database

DuckDB emerges as a beacon of innovation in the database management landscape. As an in-process SQL OLAP database, it is specifically tailored for analytical tasks, offering a unique blend of performance and simplicity. Unlike conventional databases that demand extensive setup and management, DuckDB simplifies data analysis by running directly within user applications. This in-process nature eliminates the need for separate server processes, dramatically reducing overhead and streamlining data operations.

The allure of DuckDB lies in its seamless integration and ease of management. Developers and data analysts, often burdened by the complexity of database configuration and maintenance, will find solace in DuckDB's straightforward approach. Without sacrificing database efficiency, DuckDB enables immediate access to powerful analytics, directly within the application environment.

This unparalleled combination of efficiency and ease positions DuckDB as a valuable tool in the arsenal of any data professional. Whether conducting complex queries, performing large-scale aggregations, or managing data with precision, DuckDB offers a robust solution that caters to the dynamic needs of modern data analytics.

In setting the stage for a comprehensive exploration of DuckDB, we highlight its potential to redefine data management and analysis. For developers seeking to harness the power of SQL analytics without the overhead of traditional databases, and for data analysts aiming to expedite data preparation and dive deeper into analysis, DuckDB presents an enticing proposition.

As we venture further into the capabilities and practical applications of DuckDB, keep in mind this singular objective: to empower you with a tool that simplifies your data tasks while amplifying your analytical capabilities.

Setting Up DuckDB - Step-by-Step Installation Guide

DuckDB's installation process is a testament to its design philosophy: making powerful data analytics accessible to everyone. Whether you're navigating through the setup on Windows, macOS, or Linux, DuckDB ensures a smooth and efficient installation process. This guide will walk you through the steps required to get DuckDB up and running on your machine, including the use of Homebrew for macOS users, highlighting the ease with which you can start leveraging DuckDB for your data analysis needs.

Windows:

  1. Download the DuckDB binary suitable for Windows from the official DuckDB releases page.
  2. Extract the downloaded archive to your desired location.
  3. To make the DuckDB Command Line Interface (CLI) easily accessible, add the binary's folder to your system's PATH environment variable. This step enables you to launch DuckDB from the command prompt without having to navigate to the binary's directory.

macOS:

  • For macOS users, the process is simplified through the use of Homebrew, a popular package manager. If you have Homebrew installed, setting up DuckDB is as simple as opening a terminal and running:

    brew install duckdb

    This command not only downloads DuckDB but also ensures that the CLI is directly accessible from the terminal. Homebrew handles the PATH configuration automatically, streamlining the installation process.

Linux:

  1. Similar to Windows, start by downloading the DuckDB binary for Linux from the official releases page.
  2. Extract the archive to a directory of your choosing.
  3. Add the binary's directory to your PATH environment variable to make the DuckDB CLI globally accessible from any terminal session.

For users across all operating systems, ensuring that the DuckDB CLI is directly accessible from the terminal or command prompt is a crucial step. It not only simplifies the workflow but also allows for quick experimentation and exploration of DuckDB's features.

After installation, you can verify that DuckDB is correctly set up by opening a terminal or command prompt and entering:

duckdb

If the installation was successful, this command will launch the DuckDB CLI, presenting you with a prompt where you can start executing SQL commands directly.

This guide serves as a practical walkthrough for setting up DuckDB, designed to get you from installation to experimentation in minimal time. With DuckDB installed, you’re now poised to explore its robust features, from in-memory processing to persistent storage, and begin your journey into efficient data analysis.

DuckDB Workflow and Features - An Overview

DuckDB emerges as a potent SQL OLAP database, engineered with a keen focus on analytical tasks. Its architecture harmoniously blends the speed of in-memory processing with the resilience and durability of persistent storage, establishing a versatile environment for data analysis.

In-Memory Processing: At its core, DuckDB thrives on an in-memory processing model. This framework ensures rapid query execution, pivotal for interactive analytics where response time is critical. Unlike traditional databases that may rely solely on disk-based operations, DuckDB's in-memory prowess allows for swift data manipulation and exploration, significantly enhancing the user experience.

Switching to Persistent Storage: DuckDB distinguishes itself with the seamless transition from in-memory to persistent storage. This capability is not just about data durability; it's about offering flexibility. Users can start with in-memory datasets for quick, ad-hoc analysis and then, with minimal effort, migrate these datasets to a persistent format. This dual-mode operation caters to a broad spectrum of analytical needs, from transient exploratory tasks to long-term data storage requirements.
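This dual-mode operation can be sketched in a short CLI session; the file name mydata.db and the demo table below are illustrative, not part of the original example:

```sql
-- Launching the CLI with a path persists the database to disk:
--   $ duckdb mydata.db
-- Inside an already-running in-memory session, the CLI's .open
-- dot command switches to a file-backed database instead:
.open mydata.db

-- Tables created from here on survive a restart of the process.
CREATE TABLE demo AS SELECT 42 AS answer;
```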

Custom File Format: The foundation of DuckDB's efficient data storage and retrieval is its custom file format. This format employs a compressed columnar storage mechanism, which is a game-changer for large-scale aggregations. By storing data in columns rather than rows, DuckDB can achieve higher compression rates, reduce I/O operations, and expedite analytical queries, especially those involving aggregations across vast datasets.

SQL Capabilities and the FROM-First Syntax: DuckDB not only excels in data storage and processing but also shines in its SQL capabilities. A notable feature is support for the FROM-first syntax, which lets users write FROM my_table in place of the more verbose SELECT * FROM my_table when they simply want every column. Such enhancements in query simplicity empower both novice and experienced users to interact with data more efficiently, making DuckDB an attractive tool for a wide array of analytical projects.
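A minimal illustration of the two equivalent forms (the table and column names are placeholders):

```sql
-- Conventional form:
SELECT * FROM netflix_data;

-- Equivalent FROM-first form supported by DuckDB:
FROM netflix_data;

-- The SELECT clause can still follow when you want specific columns:
FROM netflix_data SELECT show_name;
```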

In essence, DuckDB's architecture and features are meticulously designed to cater to the modern data analyst's needs. By balancing in-memory speed with the reliability of persistent storage and enhancing SQL interaction, DuckDB stands out as a robust solution for complex data analytics tasks.

Interacting with Data in DuckDB - From Import to Analysis

The journey from data import to insightful analysis in DuckDB unfolds with remarkable simplicity and efficiency, a testament to its design catered towards empowering analytics. This section delves into the steps involved in importing data from CSV and Parquet files, creating tables, and exporting data. We will navigate through these processes using a real-world dataset from Kaggle, examining DuckDB's analytical capabilities in action.

Step 1: Importing Data

Importing data into DuckDB is straightforward, whether from CSV or Parquet files. For CSV files, DuckDB offers the READ_CSV_AUTO command, which automatically infers the schema of the file. This functionality eliminates the need for manual schema definition, accelerating the data import process. For Parquet files, known for their efficiency in storing large, columnar data, DuckDB provides seamless integration, allowing direct queries on Parquet-stored data without the need for explicit loading or conversion.

Example:

  • CSV Import: READ_CSV_AUTO('path/to/your/file.csv') automatically detects and applies the correct data types for each column.
  • Parquet Import: Directly query a Parquet file using SELECT * FROM 'path/to/your/file.parquet'.
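When automatic inference guesses wrong, read_csv accepts explicit options. A hedged sketch, with the file path and column definitions as placeholders:

```sql
SELECT *
FROM read_csv('path/to/your/file.csv',
              delim = '|',           -- custom field delimiter
              header = true,         -- first line holds column names
              columns = {'show_name': 'VARCHAR',
                         'view_date': 'DATE'});
```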

Step 2: Creating Tables

After importing data, creating a table is essential for performing further operations like queries and aggregations. DuckDB simplifies this process with commands that allow for the creation of tables directly from imported data. This step is crucial for structuring the data in a way that optimizes query performance and enables more complex analytical tasks.

Example:

  • CREATE TABLE your_table AS SELECT * FROM READ_CSV_AUTO('path/to/your/file.csv')
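The same pattern works for Parquet sources, and the inferred schema can be checked afterwards; the table and file names here are illustrative:

```sql
CREATE TABLE your_table AS
    SELECT * FROM 'path/to/your/file.parquet';

-- Inspect the column names and types DuckDB inferred:
DESCRIBE your_table;
```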

Step 3: Querying and Analyzing Data

With the data imported and tables created, the next step is to unleash DuckDB's analytical power. DuckDB's SQL engine supports a wide array of functions and operators, making it adept at handling complex analytical queries. From simple aggregations to sophisticated analytical functions, DuckDB provides the tools necessary for deep data analysis.

Example:

Consider a dataset from Kaggle detailing Netflix viewing trends. A query to find the most popular show during a specific period might look like this:

  • SELECT show_name, COUNT(*) AS view_count FROM netflix_data WHERE view_date BETWEEN '2020-03-01' AND '2020-04-30' GROUP BY show_name ORDER BY view_count DESC LIMIT 1;

This query would return the most-watched show on Netflix during the first lockdown phase of the COVID-19 pandemic, showcasing DuckDB's ability to handle real-world analytical questions efficiently.

Step 4: Exporting Data

After analysis, exporting data for further use or reporting is often necessary. DuckDB supports exporting data to various formats, including CSV and Parquet, allowing for easy integration with other tools and platforms.

Example:

  • CSV Export: COPY (SELECT * FROM your_table) TO 'path/to/export/file.csv' WITH (FORMAT CSV)
  • Parquet Export: COPY (SELECT * FROM your_table) TO 'path/to/export/file.parquet' (FORMAT PARQUET)

Through these steps, DuckDB demonstrates its suitability for a wide range of data analytics projects. From importing to analyzing and finally exporting data, DuckDB streamlines the workflow, making it accessible for users with varying levels of expertise. The hands-on example with the Kaggle dataset not only illustrates DuckDB's analytical prowess but also highlights its practical application in answering real-world questions. DuckDB's ability to handle diverse data types and complex queries, all while maintaining high performance, underscores its value as a tool in the modern data analyst's toolkit.

Extensions and Customization in DuckDB - Enhancing Functionality

DuckDB's architecture not only excels in delivering high-performance analytics but also shines in its extensibility and customization capabilities. A pivotal feature of DuckDB is its robust support for extensions, which significantly broadens its functionality and adaptability to specific use cases. This flexibility allows users to enhance the core capabilities of DuckDB, making it a versatile tool for a wide array of data tasks.

Core Extensions: An Overview

Among the many extensions available, the httpfs extension stands out for its utility in reading and writing remote files over HTTPS. This extension exemplifies DuckDB's commitment to addressing modern data engineering challenges, enabling seamless access to data stored across various cloud platforms. Such functionality is crucial for organizations leveraging cloud storage solutions like AWS S3 or Google Cloud Storage, facilitating direct interaction with data without intermediary steps.

Installing Extensions

The process of enriching DuckDB with additional capabilities through extensions is straightforward:

  1. Identify the Extension: Start by exploring the available extensions. For instance, the httpfs extension caters to the need for handling remote files.
  2. Installation Command: Use the INSTALL command followed by the name of the extension, such as INSTALL httpfs; This command fetches and installs the extension, integrating it with DuckDB.

Loading Extensions

Once installed, loading the extension into your current DuckDB session is the next step:

  • Load Command: Execute the LOAD command with the extension name, like LOAD httpfs; This action makes the functionalities of the extension available for use.

Practical Application

Consider the scenario where a data analyst needs to access a dataset stored on AWS S3. With the httpfs extension, DuckDB can query this remote data directly, streamlining the workflow significantly. This capability eliminates the need to download large datasets to local storage before analysis, optimizing both time and resource utilization.
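Putting the pieces together, a hedged end-to-end sketch; the bucket name and object key are placeholders, and s3_region, s3_access_key_id, and s3_secret_access_key are DuckDB configuration settings:

```sql
INSTALL httpfs;
LOAD httpfs;

-- Public bucket: the region is usually enough.
SET s3_region = 'us-east-1';
-- Private bucket: credentials are also required.
-- SET s3_access_key_id = '...';
-- SET s3_secret_access_key = '...';

-- Note the s3:// scheme; the path below is a placeholder.
SELECT *
FROM read_parquet('s3://your-bucket/netflix_top10.parquet')
LIMIT 10;
```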

Customization for Project Requirements

The true power of DuckDB's extensions lies in their potential for customization. Users can tailor the database to their specific project needs, whether it involves working with unique data formats, integrating with cloud services, or enhancing analytical functions. This level of customization ensures that DuckDB remains a relevant and powerful tool, regardless of the evolving data landscape.

  • Explore and Utilize: Users are encouraged to explore the vast library of DuckDB extensions. Experimentation and integration of these extensions can lead to improved efficiency, novel analytical capabilities, and solutions perfectly adapted to unique project requirements.

DuckDB's support for extensions is a testament to its design philosophy—combining the efficiency and power of a traditional database with the flexibility and user-friendliness of modern data tools. By leveraging these extensions, users can significantly enhance DuckDB's functionality, adapting it to meet and exceed project demands. Whether it's through the integration of remote file systems or the addition of custom analytical functions, DuckDB's extensibility ensures it remains a potent tool in the data analyst's arsenal.

Practical Use Case - Analyzing Netflix Viewing Trends During Lockdown

The COVID-19 pandemic, a period marked by lockdowns and social distancing, saw a significant shift in entertainment consumption patterns. With millions confined to their homes, streaming platforms, especially Netflix, experienced unprecedented viewership. DuckDB, with its agile data processing capabilities, presents an ideal solution for analyzing these shifting trends. This section delves into a practical application of DuckDB by examining the top-viewed Netflix shows during the lockdown, illustrating the database's prowess in handling real-world data analytics.

Step 1: Data Import

The initial phase involves importing the Netflix viewership dataset into DuckDB. The dataset, sourced from Kaggle, encompasses the United States' daily top 10 TV shows and movies on Netflix from March 2020 to March 2022. Utilizing DuckDB's READ_CSV_AUTO command makes the import process seamless, automatically inferring the schema of the dataset. This simplicity accelerates the setup, allowing analysts to focus on the analysis rather than the preliminaries of data preparation.

Step 2: Query Execution

With the data imported, the next step involves crafting SQL queries to dissect the viewership patterns. DuckDB's efficient in-memory processing shines here, enabling rapid execution of complex queries. Analysts might seek to identify the longest-standing shows in the top 10, necessitating aggregation and sorting operations. DuckDB's SQL capabilities, particularly its support for the FROM-first syntax, simplify these tasks, making the data more accessible for analysis.
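Such a "longest-standing shows" query might look like the sketch below; the table and column names are assumptions about the Kaggle dataset's schema, not confirmed by the source:

```sql
SELECT show_title,
       MAX(days_in_top_10) AS max_days   -- daily snapshots, so take the max
FROM netflix_top10
GROUP BY show_title
ORDER BY max_days DESC
LIMIT 5;
```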

Step 3: Unveiling Insights

The execution of these queries yields intriguing insights into the viewing habits during the lockdown. For example, findings might reveal a dominance of children's programming in the top ranks, suggesting a high demand for family-friendly content as parents sought to entertain their children during the lockdown. Such insights not only inform content strategy but also offer a glimpse into societal behaviors in unprecedented times.

Step 4: Exporting Data

After gleaning insights, DuckDB facilitates the export of results for further analysis or presentation. Whether the preference is for CSV, Parquet, or another format, DuckDB's export capabilities ensure that the data is easily sharable and ready for any next steps, from reporting to more in-depth analysis with other tools.

Demonstrating DuckDB's Efficacy

This practical use case underscores DuckDB's suitability for data analytics projects, from the ease of importing and querying data to the efficient extraction of insights. The analysis of Netflix viewing trends during the COVID-19 lockdown not only showcases the database's technical prowess but also highlights its potential to drive valuable business decisions. Through this example, DuckDB proves itself as a powerful tool in the data analyst's toolkit, capable of turning vast datasets into actionable knowledge.

Video Transcript

0:00 Right, you've been hearing all the facts about DuckDB and want to get up to speed. We'll cover everything from installation and workflow to understanding the DuckDB file format and doing our first analytics. So grab your favorite duck and a hot beverage, and let's get into it.

0:19 If you're not familiar with DuckDB in the first place, it's an in-process SQL OLAP database. Roughly speaking, it has all the benefits of a database with none of the hassle of managing one. But understanding a software concept without practice is like trying to swim without jumping into the water, so let's dive into the code.

0:42 Installing DuckDB is pretty straightforward: you only need the binary for your operating system, and you can start the process. On macOS you can also use the Homebrew package manager, which makes the DuckDB CLI directly available on your PATH and makes upgrades easier to manage. DuckDB can then be launched by just calling duckdb. You can follow along by checking the GitHub repository; the link is in the description.

1:06 Before we start printing a bunch of commands, a nice and light setup is to work with an editor, a SQL file, and a terminal, and to just send the commands from the editor to the terminal. That way you have visibility on all the commands you're running, you can safely version them in git, and you can enjoy all the features of your editor or IDE, like formatting and linting. And yes, you can even have Copilot if you're missing an AI friend (gosh, they're everywhere these days). I'm using VS Code, and I just configured a simple shortcut to send the command from the editor to the terminal. You can do so by opening the keyboard shortcuts JSON file and adding any key binding you'd like for this command. Of course, you can probably achieve the same workflow using any other IDE or editor; yes, I'm looking at you, Vim and tmux nerds. Now that we have our gear ready, let's play around with the CLI.

2:01 As DuckDB runs in-memory by default, it will not persist any data. So if I create a simple table based on a SELECT statement as-is (and by the way, this mandarin duck literally looks like a Pokémon), I can now see my table using the .tables command, and I can query it. And by the way, this is a neat feature from DuckDB: it supports the FROM-first syntax, which lets you forget about the painful SELECT * if you just want to query all the columns. So now, if I exit the process and launch DuckDB again: bye-bye table, it's gone.

2:40 There are two ways of persisting the data. First, simply pass a path for where you want the database to live when you launch the duckdb process. Any file extension will do, but commonly you'll use .db, .duckdb, or .ddb (or .mandarin if you like). If the database doesn't exist, it will be created; if it already exists, launching the process will attach it so that you can read from it. You can also pass a parameter for read-only mode if it's a critical database you don't want to mess with.

3:15 The second way of persisting data... hold on, hold on, what is this file format all about? Actually, I'm glad you asked. This is the custom DuckDB file format. It's a single-file database, meaning all tables are included, supporting updates in a transactional, ACID-compliant fashion. It stores the data in a compressed columnar format, which plays very well with large-scale aggregation. This contrasts, for example, with a transactional database, which is optimized for high-frequency writes and typically stores the data as rows or tuples to support that. Anyway, it's a great custom file format, and if you want to read more about it I'll put a few blog posts in the description. Does that make sense? I guess you can go now, right?

4:01 So, the second way to persist the data, if your duckdb process is already launched, is to use the .open command. Now that we have our database persisted, let's run a few commands and play with reading and writing CSV and Parquet files locally. I have a small dataset I took from Kaggle which contains the daily Netflix top 10 TV shows and movies. The data covers only the United States, over the period from 2020 to March 2022, so we are going to be able to answer the most existential show question, which is: what was the US binge-watching during the COVID lockdown?

4:40 Let's first load our CSV using the read_csv_auto command. This will infer the schema automatically, but of course you can pass the schema and the delimiter yourself by using the read_csv command with a few parameters. When we use this command we just read the dataset; we don't actually create a table. To do so, we need a CREATE TABLE statement, and you can nest it with the SELECT statement we just used, or with the FROM command, since we just want all the columns anyway. To write to CSV you can use the COPY command with a few parameters: for CSV you pass the delimiter and the path where you want to output the file, while for Parquet you just specify the file path. And to read data back from Parquet, you guessed it, you can use the read_parquet command.

5:29 Let's discuss displaying the data and how we can tweak it. The .mode command can be used to change how tables are rendered in the terminal output, and beyond customizing the appearance, these modes have additional benefits. There are multiple ways you can display the data; one convenient option, for example, is .mode line if you're reading a JSON file and want to display and inspect a long nested schema as a whole.

5:59 You can pipe the duckdb output from the terminal to a file. One interesting way to output the result is to a markdown file: I can use .mode markdown and combine this display mode with .output or .once. .output writes all subsequent output to the file I give it, and .once, you guessed it, does it just once and then reverts to the normal behavior. So let's say we want to output the result as markdown: we do .mode markdown, then .output myfile.md, for example. And here it goes: nothing, because I haven't run any query yet. So let's run the query... and as you can see, I have some results in my markdown file.

6:45 Last but not least, I want to show you that you can run a command and then exit the duckdb process directly by using the -c flag when you run duckdb. That's pretty neat, because then you can do a SELECT on any CSV or Parquet file directly; it's lightweight and easy to use, and you could even wrap it in a shell function, for example one that converts CSV to Parquet. Finally, duckdb has a set of configuration flags that you can use to fine-tune it.

7:20 On to extensions: DuckDB has a number of extensions available and ready to use; however, not all of them are packaged by default. DuckDB has a mechanism that allows remote extension installation. Let's see what core extensions are available by using one of the metadata table functions, duckdb_extensions(). Here we can see which ones are installed by default and which ones we can directly install because they are listed in the core set.

7:52 Let's install a popular one: httpfs, which enables us to read and write remote files over HTTPS. It also works with AWS S3 and Google Cloud Storage. To install an extension that's available in the core list, just run INSTALL and the name of the extension. When this command is run, duckdb downloads the extension and puts it in your home directory under the .duckdb folder, in extensions. You can change the default destination folder with the extension_directory configuration flag. Now that our extension is downloaded, we need to load it, and that's where we use the LOAD command. Note that if you are using a third-party extension which is not listed there, or you are building your own extension, you're going to need to pass the -unsigned flag when you launch the duckdb process. Extensions are awesome, and you can also build your own; the DuckDB Labs team has a template to help you get started, link in the description.

8:53 I mean, if by this point you haven't checked the description, I don't know why you're here. So now that you have gone through the basics of DuckDB, let's cover a simple use case end to end. I have a remote file sitting on AWS S3, still the same Netflix data, and I want to do some analytics (remember the existential question) in SQL and store the result locally. As we saw, I'm going to need to install the httpfs extension and load it. For a public S3 bucket I only need to specify the region, but in most cases your bucket will be private and you will need to pass AWS credentials, meaning the key ID and the secret access key. While we do use the HTTPS protocol under the hood, you need to pass the S3 URI, which means that it has to start with s3://.

9:41 Now that I have my data loaded, let's have a look using the awesome FROM command. We can also check the schema using the .schema command. OK, so we have the number of days a show stays in the top 10, so let's find which show stayed the longest. Of course, these are daily snapshots, so I'll need a GROUP BY on the name and a MAX on the number of days. All right, DuckDB is really fast, so there's really no long suspense; here is the result: CoComelon. Wow, a kids' show ranking number one. I guess the parents had a hard time during COVID; plus, kids can just watch the same thing over and over without getting bored. Let's now look at the movies... and wow, no surprise, again ranking number one: a kids' movie.

10:35 All right, now that I have my data, I can export it to CSV using the COPY command. I can either create a table first or just nest the COPY command with the SELECT statement I just used. And voilà!

10:49 All right, hope you enjoyed this video. We got to cover the CLI, the extensions, and our first analytics; plus, now you know what to watch next on Netflix if you have any kids. And if you don't, just take ranking number two or three, and I guess you'll be fine. If you enjoyed this video, be sure to subscribe; there are going to be a lot more productivity-related videos, and probably more ducks too. And if you're curious about how DuckDB was created, check out this video where we interview Hannes, the co-creator of DuckDB, and made a quack BBQ.

FAQs

How do you install DuckDB and set up a development workflow?

On macOS, the easiest way is brew install duckdb, which puts the CLI on your PATH and makes upgrades simple. On Windows and Linux, download the binary from the releases page and add it to your PATH. A good workflow: open a SQL file in your editor (VS Code, Vim, etc.) alongside a terminal, and set up a keyboard shortcut to send commands from the editor to the terminal. You get version control, formatting, and autocomplete while still running queries interactively. Check our getting started guide for more setup details.

What is the DuckDB database file format and how does data persistence work?

DuckDB runs in-memory by default, meaning everything disappears when you exit. To persist data, pass a file path when launching DuckDB (e.g., duckdb mydata.db). The DuckDB file format is a single-file database that stores all tables in a compressed columnar format, which works well for analytical queries. It supports ACID transactions, updates, and deletes. If your process is already running, you can also use the .open command to attach a database file.

How do you read and write CSV and Parquet files with DuckDB?

Use read_csv_auto('file.csv') to read a CSV with automatic schema inference, or read_csv with explicit parameters for delimiters and types. For Parquet files, use read_parquet('file.parquet'). These commands read data without creating a table. Wrap them in CREATE TABLE ... AS SELECT * FROM read_csv_auto(...) to persist. To export, use the COPY command: COPY (SELECT ...) TO 'output.csv' (DELIMITER ',') for CSV, or just specify a .parquet path for Parquet output.

How do DuckDB extensions work and how do you install them?

Some extensions come bundled with DuckDB, and others you install on demand. Run SELECT * FROM duckdb_extensions() to see what's available. Install one with INSTALL httpfs and load it with LOAD httpfs. The httpfs extension, for example, lets you read and write files from S3 and Google Cloud Storage. Extensions are downloaded to your ~/.duckdb/extensions/ directory. You can also build custom extensions using the template from DuckDB Labs.

Can DuckDB query files directly from S3 without downloading them first?

Yes. After installing and loading the httpfs extension, you can query remote files directly with SELECT * FROM read_parquet('s3://bucket/file.parquet'). For public buckets, you only need to set the AWS region. For private buckets, pass your AWS access key ID and secret. The URL must use the s3:// protocol. You can run SQL on remote data without downloading anything first.
