Webinar · dbt

From Raw Data to Insights with Datacoves & MotherDuck

2025/09/25

Building a complete, end-to-end data pipeline, from raw files to polished insights, has traditionally been a complex and costly endeavor. It often requires stitching together numerous services, managing intricate infrastructure, and navigating the high costs associated with traditional cloud data warehouses. However, a modern stack centered around DuckDB, dbt, and new open-source technologies is changing the landscape, making powerful analytics more accessible and affordable.

In this webinar, Noel Gomez of Datacoves and Jacob Matson of MotherDuck explore how to construct a full data pipeline using an integrated set of tools. They demonstrate how to ingest, transform, orchestrate, and visualize data, all while leveraging the power of a data lake architecture without the typical complexity. The key to this simplified approach is DuckLake, an open table format that brings the scalability of a data lake to the speed and simplicity of DuckDB.


Understanding the Core Components

Before building the pipeline, it's important to understand the role of each technology. While many are familiar with dbt and DuckDB, DuckLake is the component that unlocks a new architectural pattern.

  • dbt (data build tool): This is the transformation workhorse. It allows data teams to apply software engineering best practices like version control, testing, and documentation to their SQL and Python data models. Instead of managing a chaotic collection of scripts, dbt provides a structured framework for building reliable and maintainable transformation workflows.
  • DuckDB: An incredibly fast, in-process analytical database. Because it's a columnar database, it's highly optimized for OLAP (Online Analytical Processing) workloads, which involve complex queries over large datasets. Its lightweight nature means it can run anywhere from a laptop to a large server.
  • MotherDuck: A modern cloud data warehouse that provides a serverless, managed version of DuckDB. It simplifies infrastructure management, enables collaboration, and enhances DuckDB with features built for the cloud, such as easy secret management and massively parallel data loading from object storage.
  • DuckLake: This is the game-changer for this architecture. DuckLake is an open table format that decouples storage from compute. While a standard DuckDB database stores both its data and metadata in a single file, DuckLake stores data as efficient Parquet files in object storage (like Amazon S3) and keeps the metadata in a separate database, which can be DuckDB, PostgreSQL, or others. This simple but powerful design effectively creates a data lake, allowing datasets to scale to billions of rows without being constrained by a single database file.

This separation is what allows for features like time travel and massive scalability, which were previously the domain of expensive, proprietary data platforms.
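To make that separation concrete, here is a minimal sketch using DuckDB's Python client. The catalog file name, bucket path, and credentials are illustrative placeholders rather than values from the webinar.

```python
import duckdb

con = duckdb.connect()

# DuckLake (the open table format) and httpfs (S3 access) ship as DuckDB extensions.
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Credentials for the bucket that will hold the data files (placeholder values).
con.sql("""
    CREATE SECRET lake_s3 (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    )
""")

# The metadata goes into a small catalog database (here a local DuckLake catalog
# file), while table data is written as Parquet files under DATA_PATH in S3.
con.sql("""
    ATTACH 'ducklake:metadata.ducklake' AS lake
    (DATA_PATH 's3://my-company-datalake/ducklake/')
""")

# From here on, tables created in 'lake' behave like ordinary DuckDB tables.
con.sql("CREATE TABLE lake.main.example AS SELECT 42 AS answer")
```

Swapping the metadata store, for example to PostgreSQL or a MotherDuck-hosted database, only changes the ATTACH target; the Parquet files in S3 stay exactly where they are.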


Building an End-to-End Data Pipeline

The demonstration, hosted on the Datacoves platform, shows just how seamlessly these tools integrate to form a complete solution. Datacoves provides a managed environment with VS Code in the browser, Airflow for orchestration, and all the necessary dependencies, eliminating setup friction.

Step 1: Ingesting Raw Data with dlt

The first step in any pipeline is getting the data in. The process uses dlt (data load tool), a simple and efficient Python library for data ingestion. The video demonstrates loading a raw CSV file into a MotherDuck database. The configuration is minimal, requiring only a MotherDuck token to authenticate and direct the data to the correct destination. This simplicity allows developers to focus on the data itself, not on complex connection and credential management.
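As a rough sketch of that step, the snippet below uses dlt to load a local CSV into MotherDuck. The file, pipeline, dataset, and table names are illustrative assumptions, and the MotherDuck token is expected to come from dlt's secrets configuration (for example .dlt/secrets.toml) rather than the code itself.

```python
import dlt
import pandas as pd

# Illustrative source file; the webinar's actual dataset differs.
orders = pd.read_csv("raw_orders.csv")

# The MotherDuck token is picked up from dlt's secrets, not hard-coded here.
pipeline = dlt.pipeline(
    pipeline_name="raw_ingestion",
    destination="motherduck",
    dataset_name="raw",
)

load_info = pipeline.run(orders, table_name="orders", write_disposition="replace")
print(load_info)
```

Switching write_disposition to append or merge changes how repeated runs behave, which becomes relevant once the pipeline is scheduled.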

Step 2: Transforming Data with dbt and DuckLake

With raw data available, the next step is transformation using dbt. The real power here is how dbt interacts with DuckLake to build a structured, multi-layered data lake on S3.

The configuration involves a single, simple change in the dbt profiles file to enable the DuckLake integration. Once set, dbt can perform powerful operations:

  1. Reading from Object Storage: The dbt project defines its sources to read raw CSV files directly from an S3 bucket. DuckDB's ability to query files in place means the data doesn't need to be loaded into a database before transformation can begin.
  2. Materializing Models as Parquet: As dbt build is executed, it runs the staging, intermediate, and final mart models. Instead of writing these to a traditional database, it materializes them as Parquet files within the S3 bucket, neatly organized into folders corresponding to the different layers of the data model.
  3. Updating the Metadata Store: Simultaneously, the metadata (table schemas, file locations, and version information) is written to the DuckLake database managed by MotherDuck.

The result is a fully-formed data lake built with a single dbt command. The data lives in an open format (Parquet) on S3, while the metadata provides the structure needed to query it efficiently.
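Under the hood, those three steps boil down to statements like the ones in this sketch, written directly against DuckDB rather than through dbt, and reusing the illustrative catalog and bucket names from the earlier example:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Same illustrative DuckLake catalog and S3 paths as before
# (an S3 secret is assumed to be configured already).
con.sql("""
    ATTACH 'ducklake:metadata.ducklake' AS lake
    (DATA_PATH 's3://my-company-datalake/ducklake/')
""")

# Step 1: query the raw CSV in place on S3 -- no separate load step.
# Steps 2 and 3: writing the result into the DuckLake catalog materializes it as
# Parquet under DATA_PATH and records the schema, file list, and snapshot in the
# metadata store, all from one statement.
con.sql("""
    CREATE OR REPLACE TABLE lake.main.stg_orders AS
    SELECT *
    FROM read_csv_auto('s3://my-company-datalake/raw/orders.csv')
""")
```

In the dbt project, the same idea is expressed declaratively: sources point at the files in S3, models are materialized into the DuckLake-attached database, and dbt generates and runs the equivalent SQL.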


Unlocking Advanced Data Lake Capabilities

With the pipeline built, the data is ready for analysis and orchestration. This is where the benefits of the DuckLake architecture become clear.

Querying and Visualizing the Data Lake

Even though the data resides in S3, it can be queried directly through MotherDuck or any DuckDB-compatible tool. The video shows connecting a Streamlit dashboard to the DuckLake database to create visualizations. To the end-user or the BI tool, it looks and feels like a standard database, but it has the scalability of a data lake behind it.
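A minimal Streamlit sketch of such a dashboard might look like the following. The MotherDuck database name, table, and columns are assumptions, and the motherduck_token is expected to be available in the environment.

```python
import duckdb
import streamlit as st

# Connect to MotherDuck via the "md:" scheme; the database name is illustrative.
con = duckdb.connect("md:my_ducklake")

st.title("Orders by month")

# To the dashboard, the DuckLake-backed mart looks like any other table.
df = con.sql("""
    SELECT date_trunc('month', order_date) AS month,
           sum(amount)                     AS revenue
    FROM fct_orders
    GROUP BY 1
    ORDER BY 1
""").df()

st.bar_chart(df, x="month", y="revenue")
```

Running it with streamlit run dashboard.py serves the chart locally, with the heavy lifting pushed down to MotherDuck.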

Time Travel for Data Auditing and Recovery

One of the most powerful features demonstrated is time travel. A DELETE statement is run on one of the final tables. A subsequent query shows the rows are gone. However, by using the AS OF syntax in the SQL query, it's possible to query the table as it existed before the deletion, and all the original records reappear.
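A hedged sketch of that kind of recovery query is shown below, reusing the illustrative catalog from the earlier examples. The table name and snapshot id are assumptions, and the snapshot clause is written in the AT (VERSION => …) form; depending on the engine and version, the exact time-travel syntax (such as the AS OF form mentioned above) may differ.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")
# httpfs and an S3 secret are assumed to be set up as in the earlier sketches.
con.sql("""
    ATTACH 'ducklake:metadata.ducklake' AS lake
    (DATA_PATH 's3://my-company-datalake/ducklake/')
""")

# A delete creates a new snapshot; the previous Parquet files and metadata
# versions remain behind it.
con.sql("DELETE FROM lake.main.fct_orders WHERE order_status = 'returned'")

# Query the table as it existed at an earlier snapshot (illustrative id).
recovered = con.sql(
    "SELECT count(*) FROM lake.main.fct_orders AT (VERSION => 1)"
).fetchone()
print(recovered)
```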

This is possible because DuckLake is version-aware. When the deletion occurred, it didn't overwrite the existing Parquet file. Instead, it created a new snapshot of the data and updated the metadata to point to it. The old versions remain in S3, making it easy to audit changes or recover from accidental modifications.

Orchestration and Automation with Airflow

To complete the picture, the entire workflow is automated using Airflow. A simple DAG (Directed Acyclic Graph) is configured with two steps: one to run the dlt ingestion script and a second to execute dbt build. This orchestrates the full pipeline, ensuring that data is consistently ingested and transformed on a schedule, creating a reliable, production-ready system.
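A minimal Airflow sketch of that two-step DAG could look like the following; the schedule, file paths, and commands are assumptions, and Datacoves has its own conventions for wiring dlt and dbt into Airflow.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="raw_to_marts",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:

    # Step 1: run the dlt ingestion script (illustrative path).
    ingest = BashOperator(
        task_id="dlt_ingest",
        bash_command="python load/ingest_orders.py",
    )

    # Step 2: transform everything with dbt once ingestion has finished.
    transform = BashOperator(
        task_id="dbt_build",
        bash_command="cd transform && dbt build",
    )

    ingest >> transform
```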


Best Practices and Architectural Considerations

While this stack is powerful, it's important to use it correctly. The discussion highlighted several key best practices:

  • When to Choose DuckLake: For smaller tables (under roughly 1 terabyte), native DuckDB tables might offer slightly faster read performance. However, for larger tables where scalability is a concern, DuckLake is the clear choice. Its ability to partition data and leverage object storage makes it far more suitable for petabyte-scale analytics.
  • Use Incremental Models: When using dbt with DuckLake, it is highly recommended to use incremental materializations for large tables. The default table materialization in dbt drops and recreates the entire table on every run, which would create a full new copy of the data in S3 each time. Incremental models, in contrast, only process new or changed records, which is far more efficient for both performance and storage (see the sketch after this list).
  • Design for OLAP, Not OLTP: DuckDB is an OLAP engine designed for analytical queries. It is not a replacement for transactional (OLTP) databases like PostgreSQL or SQLite. Architectural patterns from OLTP systems, such as persistent primary keys and indexes, often slow down writes in DuckDB and should be avoided in favor of its natural columnar performance.
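To illustrate the incremental point above: in dbt this is done with the incremental materialization and the is_incremental() macro, and the underlying pattern it applies is to append only new or changed records. The sketch below shows that pattern directly in DuckDB against the illustrative catalog from the earlier examples, rather than dbt's own syntax.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")
# httpfs and an S3 secret are assumed to be configured as in the earlier sketches.
con.sql("""
    ATTACH 'ducklake:metadata.ducklake' AS lake
    (DATA_PATH 's3://my-company-datalake/ducklake/')
""")

# Append only records newer than what the target table already holds, instead
# of rewriting every Parquet file on every run.
con.sql("""
    INSERT INTO lake.main.fct_orders
    SELECT *
    FROM lake.main.stg_orders AS s
    WHERE s.updated_at > (
        SELECT coalesce(max(updated_at), TIMESTAMP '1970-01-01')
        FROM lake.main.fct_orders
    )
""")
```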

A Simplified Future for Data Analytics

The combination of dlt, dbt, DuckDB, and DuckLake, all unified on platforms like MotherDuck and Datacoves, represents a significant step forward. It dramatically lowers the barrier to entry for building a sophisticated data lakehouse, providing features like time travel and massive scalability without the vendor lock-in or prohibitive costs of traditional systems.

This modern, open-source stack proves that data analytics can be both exceptionally powerful and refreshingly simple. By decoupling storage from compute and leveraging best-in-class tools for each part of the process, organizations can build flexible, cost-effective data platforms that can scale with their ambitions.