
LLMs Meet Data Warehouses: Reliable AI Agents for Business Analytics

2025/11/19

Keynote at AI By the Bay

Transcript

I ordinarily wouldn't wear a hat during a talk, but (a) this is an awesome hat that I love, and (b) there are very bright spotlights up here. But let's talk about agents. Agents can use LLMs along with analytics databases to analyze business data, find patterns, speed up reporting, and more. The title of the talk mentions business data, and I give some examples using business data, but customers and users of DuckDB and MotherDuck use the same technologies for all sorts of different types of data. So just replace some of the tables in my examples with game data or package delivery data, logistics, etc.

First of all, I want to just give a shout out. Obviously, LLMs are magical. They've advanced so much in such a short period of time. Amazing at improving our writing, amazing at understanding human language and sometimes even at generating human language as long as you're not trying to generate any sort of novel ideas. They're generally awesome at composing pieces of knowledge together. And of course at generating images of ducks. They're very good at generating images of ducks.

There are things, though, that make the duck sad. There are things where the duck magician can't get the LLMs to do what it wants them to do—things that sometimes they aren't even designed to do. LLMs are kind of designed to understand facts, but they sometimes fail. And when they fail, they're often failing in the same way that we humans fail. They make things up. They don't have current knowledge, and they don't have your personal information or your personal context to work with. How many of you as humans have these same problems at times? I certainly do. That's where LLMs are like humans: they're fallible. We're working at making them less fallible as an industry. Hundreds of billions of dollars have flowed into this industry.

We've been able to augment LLMs with capabilities to make them even more magical, to make our ducks here happier. We can feed them information on recent facts through RAG. We started doing repeated calling of LLMs to emulate human chain-of-thought reasoning, and now that is built into most LLMs. But they're still bad at some things that humans are decent at. I might not be able to count exactly the number of stars here because some are overlapping, but I'm not going to go from five, six to eight and skip the number seven. I think it's an unlucky number for some people—or maybe it's lucky, I don't know. But LLMs are bad at counting, and they're really bad at data aggregation. Of course, if you're bad at facts, you're probably not going to be very good at aggregating facts. It's kind of an exponential badness that happens here.

People originally were saying, "Well, RAG is a solution for everything." This was a while ago, right? But RAG only helps with encoding facts. LLMs still can't scan, aggregate, and compute data.

What if you want your LLM to answer business questions, and you want those LLMs to be quasi accurate? If you have questions like, "How many of my customers are spending greater than $1,000 a month?" or "Which of my customers are most likely to churn and why?" or "What is my company's annualized revenue?"—you're going to want some degree of accuracy on those. You're not going to want to just trust the LLM to pull the data off of its training data set because you might get someone else's annualized revenue or someone else's most likely to churn customers.

For that, you need an analytics database. Analytics databases can be very happy too—at least when you're using ducks as your analytics database. You need something that can store large amounts of analytical information and efficiently compute aggregations to answer these questions.

Now, how many of you in the room would classify yourselves as data people, primarily focused on data? All right. How many would classify yourselves as software engineers—you're building things, but maybe with data at times? Okay, so pretty much every software engineer is probably like, "Well, what about Python? What about our elephant here?" It's funny that the elephant-in-the-room joke did not come up as I was preparing this, but the elephant in the room is that Postgres does everything. Why don't we just use Postgres?

There are different types of databases. I literally have one slide on this to catch folks up, but the different types are basically transactional and analytics databases, and the distinction is largely based on how the underlying data is stored and retrieved. If your underlying data is stored as rows, it's really easy to insert a whole row at a time, and it's really easy to do point lookups. But when you try to compute, say, an average age and your data is stored as rows, you end up reading all the data of every single row to do it, which is very inefficient. In a columnar storage database—an analytical database—you would just read the age column, and that's much more efficient. That's why maybe you don't want to use Postgres as your analytical database. It will work, it just might not be the most efficient. We have tons of customers who come to MotherDuck after using Postgres as their analytical database and realizing they hit scaling limits.
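To make that concrete, here is a minimal sketch in Python using DuckDB; the customers table and age column are made up for illustration. The point is that the aggregation only needs one column, which is exactly the access pattern columnar storage is built for.

```python
# Minimal sketch: the kind of query that favors columnar storage.
# The `customers` table and `age` column are illustrative, not from the talk.
import duckdb

con = duckdb.connect()  # in-memory database

# Build a toy table with a million rows.
con.execute("CREATE TABLE customers AS SELECT i AS id, 20 + i % 60 AS age FROM range(1000000) t(i)")

# A columnar engine answers this by scanning only the `age` column,
# instead of reading every full row the way a row store would.
print(con.execute("SELECT avg(age) FROM customers").fetchall())
```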

How do LLMs interact with data warehouses, databases, data analytics tools? RAG was sort of the version zero for LLMs interacting with external data, but like I said you could just basically feed in facts, not aggregations. Now we have various different ways of interacting with our databases.

First of all, there are built-in tools in LLMs nowadays. LLMs now actually do have calculators. You can feed basic math questions to an LLM and it will use the built-in tool, which is a calculator. They have web search. They have tools for generating files and diagrams, things like that. The first thing you can do if you want to interact with a database is build a custom tool that defines what to do with the database, and you do that using the agent SDKs. In my examples here I'm using the OpenAI SDK, but there are Claude ones, and I'm sure there are ones for other LLMs.

And then, of course, many of these analytics databases do have MCP servers. MCP servers are essentially the same tools, but exposed over a wire protocol instead of running directly in your local process. This gives you basically the ultimate in flexibility: you can ask the LLM anything, and the MCP server just executes the SQL—that's really all it's going to do. Whereas if you want more control, you build your own custom tools.

Regardless, if you are working with an analytics database—nowadays, and for the last four decades or so—it's likely to use SQL as the language for interacting with it. SQL is one of the technologies with the most staying power out there, I think; it has just remained for decades.

Your biggest question comes down to who is going to generate the SQL for interacting with your database. Is it going to be generated by the LLM? This is a tool example here, just a function in a Python file saying, "Hey, I'm going to execute SQL, and I'm going to define how that SQL execution happens." I can define it at that level and then let the LLM generate the SQL. Or, if I do want more control over it, I can define a function and have that function do one very specific thing—getting the sales by region, for instance.
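As a rough sketch of those two styles—not the exact code from the talk—here is what they might look like with the OpenAI Agents SDK and DuckDB. The `sales` table, its columns, and the agent wiring are assumptions made for illustration.

```python
# Two tool styles, sketched with the OpenAI Agents SDK (`pip install openai-agents duckdb`).
# The `sales` table and its columns are hypothetical, not MotherDuck's real schema.
import duckdb
from agents import Agent, Runner, function_tool

con = duckdb.connect("analytics.duckdb")

@function_tool
def execute_sql(query: str) -> str:
    """Run arbitrary SQL generated by the LLM and return the rows as text."""
    return str(con.execute(query).fetchall())

@function_tool
def get_sales_by_region(start_date: str, end_date: str) -> str:
    """Return total sales by region for a date range; the SQL is fixed, so the LLM can't get it wrong."""
    return str(con.execute(
        """
        SELECT region, sum(amount) AS total_sales
        FROM sales
        WHERE order_date BETWEEN ? AND ?
        GROUP BY region
        ORDER BY total_sales DESC
        """,
        [start_date, end_date],
    ).fetchall())

agent = Agent(
    name="analytics-agent",
    instructions="Answer business questions using the provided tools.",
    tools=[execute_sql, get_sales_by_region],
)

result = Runner.run_sync(agent, "What were sales by region in October?")
print(result.final_output)
```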

Now, you might think that you really want to just let the LLM use SQL. Why are you manually encoding the SQL that the LLM is able to generate? That would be a reasonable conclusion. But the state of the art of SQL generation from LLMs is still pretty poor. My vision's not amazing, but I think the number here is in the high 70s percent for the best LLM—which actually uses multiple LLMs together to get to that score. So 78%, I think, is the accuracy, versus 92% for a human. Now, how many of you would be comfortable producing your board reports, or maybe your public analyses for your company, your quarterly earnings reports, with 78% accuracy? Not me. Heck, I wouldn't even be comfortable with the 92% that a human gets. But that is the nature of humans. We are fallible.

We want to talk about different ways we can improve the accuracy, to supplement what these things do out of the box. We can add additional information to the context window—as you're typing a question to an LLM, or sending a question programmatically, you can add additional information to that window. You can use a semantic modeling tool to define the details of what your models look like, or you can basically build a very budget semantic layer yourself, and we're going to show each of these.

But first, I want to talk about the sample data sets that I'm using. First of all, you're going to see revenue numbers. Those are not real revenue numbers; they have nothing to do with MotherDuck as a company. However, I did try to use a very artificial data set when I created this presentation. I tried to use Northwind data. Northwind is a canonical data set from Microsoft that SQL Server used back in the day, and it's amazing. It is very well laid out—basically customers and their orders, what items were in those orders, and what products the company sells. The problem is that data set is super well documented on the internet. I think Michael from Shadow Traffic emulated that data set in a few hours for us, and he was very kind to do it. So I do want to give a shout-out to Shadow Traffic—awesome at generating data sets.

However, our actual internal data warehouse at MotherDuck has, I think, 217 different tables. We're a three-year-old company; we have 217 tables in our data warehouse and, I think, 41,000 columns across those 217 tables. It's a lot more complex, and it's not well understood because it's not documented on the internet. So that is the data set I'm using, but I'm fudging all the numbers. So don't take any meaning from the numbers.

But one of the biggest things we do at MotherDuck is ask: what is our annualized weekly revenue? This is similar to the ARR you might have heard of. It's called AWR. A lot of us come from engineering backgrounds, so we don't really like to use a term that's not accurate, and annualized weekly revenue is a fairly accurate term for us. If you ask OpenAI what it is, it basically says it would be one week's worth of data repeated for all 52 weeks, and that's annualized weekly revenue. Pretty good. It's the right definition, of course.

Then I went to the LLM and said, "What is our annualized weekly revenue for today?" And the LLM said, well, the formula is: annualized weekly revenue equals a single day's revenue times 365, divided by 7. Anyone know what the problem with that is? What is this divided-by-seven thing? It's basically taking a day's revenue, figuring out what the whole year's revenue would be based off of that day, and then saying, "Oh, we'll just take a month and a half of that," or something like that. This is a completely ridiculous definition. But this is the definition it used, and then it computed a result.

Now, what we are happy about is that it explained what it did. So as a smart human, you can then say, "Oh wait, this seems a little bit wrong." But we don't want it to get it wrong. So we add additional information to the context window. In this case, we basically just wrote markdown files with all of our go-to-market definitions, saying annualized weekly revenue is defined as this: the current date's revenue summed with the last six days of revenue, multiplied by 52. You'll notice that my definition isn't very precise. It could be more precise. But it's additional helper data for the LLM.
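A minimal sketch of that wiring, building on the agent sketch above: the definition text paraphrases the talk, and how exactly the markdown gets attached to the agent is an assumption, not the talk's actual code.

```python
# Budget "semantic layer": go-to-market definitions as markdown, appended to the
# agent's instructions so every generated query sees them. The Agent class and
# tools come from the earlier sketch; the wiring here is an assumption.
GTM_DEFINITIONS = """
# Go-to-market metric definitions
- Annualized Weekly Revenue (AWR): the current date's revenue plus the prior
  six days of revenue (a full seven-day window), multiplied by 52.
"""

agent = Agent(
    name="analytics-agent",
    instructions="Answer business questions using the provided tools.\n" + GTM_DEFINITIONS,
    tools=[execute_sql, get_sales_by_region],
)
```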

So then I say, what is our annualized weekly revenue for each day in November, and I get this: at the top there, 1.5 million; at the bottom, 11.962 million—the original numbers before I fudged them. I looked at the bottom number and I was like, "This is right." All right. But why are the other numbers off—why does it grow so rapidly at the beginning?

What happened is the LLM understood what annualized weekly revenue was. It generated a window function in SQL, but it said, "Oh, I'm talking about November," so it excluded all data from before November in its calculation. Even when it was computing, say, November 1st—which obviously needs to include the six prior days, which fall in a different month, October—it didn't include that data. It just assumed zeros, and we got our annualized weekly revenue: a really bad calculation that someone may actually miss, depending on how much data they're looking at.
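Here is a hedged sketch of the query the LLM should have produced, assuming a hypothetical daily_revenue(day, revenue) table with one row per day: compute the rolling seven-day window over all of the data first, and only then filter the output down to November.

```python
# Correct pattern (sketch): window over all days, filter to November afterwards.
# `daily_revenue(day, revenue)` is a hypothetical table with one row per day.
awr_sql = """
WITH rolling AS (
    SELECT
        day,
        sum(revenue) OVER (
            ORDER BY day
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- always a full seven-day window
        ) * 52 AS awr
    FROM daily_revenue
)
SELECT day, awr
FROM rolling
WHERE day BETWEEN DATE '2025-11-01' AND DATE '2025-11-30'  -- filter AFTER windowing
ORDER BY day
"""
print(con.execute(awr_sql).fetchall())
```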

Again we say, let's update the context window and say: always use a full seven days of data when calculating AWR. I did that. I ran it. It worked well. I patted myself on the back. Life was good. The next day, as I continued preparing for this talk, I ran it again. It again got the first six days wrong. Feeding in that additional context worked in one run but not in another. Very intermittent. Then I ran it again and all the days were wrong. It generated really crappy SQL. I actually have no idea why; I didn't even dive into what it did. But I ran it again and it worked fine. A very frustrating experience, especially if you're expecting to have accurate business data.

How many of you have kids? I think working with an LLM is like trying to teach calculus to your 2-year-old. It can be very frustrating at times. You don't know actually what is sinking in and what is not sinking in. But we do like our ducks to teach all the LLMs how to do accurate SQL. It just won't always happen if you don't have all of the right context. And the context that I provided obviously was not enough.

What can you do to provide additional context? You can use semantic modeling tools like Cube, or the semantic modeling capabilities of BI and analysis tools like Omni and Malloy, and feed them additional data in a lot of detail. You can define what your measures are—the types of things you want to compute. What are the dimensions—how do you want to categorize those measures? What are the filters—do you want to be able to filter by date range or by geography, in these examples? And what are the calculated measures—the measures based off of other measures? You can go and spend a heck of a lot of time out there defining all of these things, and life will be glorious. Maybe. But there's a lot of time and energy involved in doing that.

Another way that you can do it is you can take these 200-some odd tables and you can say, "I'm going to make a very simple view of this data. I'm going to make a view that looks more like that Northwind data set so that the system can understand something that's well documented." And luckily, it's really easy to create views in most databases.
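As a sketch, that simplified view might look something like this; the underlying schema and column names are invented for illustration, not MotherDuck's real warehouse.

```python
# Simplified, Northwind-style view over a messier warehouse (sketch).
# The underlying schema (`raw.*`) and column names are invented.
con.execute("""
    CREATE OR REPLACE VIEW orders_simple AS
    SELECT
        o.order_id,
        c.customer_name,
        o.order_date,
        li.product_name,
        li.quantity,
        li.unit_price
    FROM raw.orders o
    JOIN raw.customers c  ON c.customer_id = o.customer_id
    JOIN raw.line_items li ON li.order_id = o.order_id
""")
```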

The other thing you can do is start adding comments to the database, basically describing your tables and columns. This is sort of the poor person's semantic layer. I actually generated these comments with the LLM to start with, and then you can go in, modify them, and provide additional details—for instance, saying that this column is the join ID, the foreign key between this table and another table, and describing how they join and what the name of the join key is. You can do all sorts of things like that to make the LLM slightly better.
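In DuckDB that looks roughly like the following; the table names and comment text are made up, and the idea is that the comments can then be read back out of the catalog and fed into the LLM's context.

```python
# Poor person's semantic layer (sketch): attach comments to schema objects.
# Table and column names here are illustrative.
con.execute("""
    COMMENT ON TABLE orders IS
        'One row per order; order_date is in UTC.'
""")
con.execute("""
    COMMENT ON COLUMN orders.customer_id IS
        'Foreign key joining to customers.customer_id.'
""")

# The comments should show up in the catalog (e.g., duckdb_columns()), so you
# can include them in the LLM's context window when it asks about the schema.
print(con.execute(
    "SELECT column_name, comment FROM duckdb_columns() WHERE table_name = 'orders'"
).fetchall())
```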

The LLMs are still generating SQL without necessarily knowing what SQL syntax they should generate, though. So even if you get it accurate, you might end up with a lot of queries. We run a database company based on a consumption-based billing model, so I personally love LLMs, because they'll write 20 queries where a human would only have to write one—the LLM is able to rack up the bills faster. But here I'm going to give you some guidance on how to prevent that: teach the model about your SQL dialect, provide performance optimizations, and tell the model what not to do.

We actually do publish a full markdown file describing DuckDB SQL and the MotherDuck editions that you can feed to your LLM to teach your LLM in the context window how DuckDB works and how to do some types of operations which may be a little bit more esoteric in a particular variant of SQL.

You can also feed it guidance on efficiency. You can show it an example window function and show how to use the QUALIFY clause to limit the computation, along with other efficient patterns, for when you're trying to do things beyond the ordinary SQL it might recognize.
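For instance, a dialect hint like this one—using the hypothetical daily_revenue table again—shows the model DuckDB's QUALIFY clause, which filters on a window function without a wrapping subquery.

```python
# Dialect hint example: DuckDB's QUALIFY clause filters on a window function
# directly, avoiding an extra subquery. `daily_revenue` is hypothetical.
top_day_per_month = con.execute("""
    SELECT
        date_trunc('month', day) AS month,
        day,
        revenue,
        rank() OVER (PARTITION BY date_trunc('month', day) ORDER BY revenue DESC) AS rnk
    FROM daily_revenue
    QUALIFY rnk = 1   -- keep only the top revenue day in each month
    ORDER BY month
""").fetchall()
```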

Lastly, I mentioned you could tell it to avoid certain things. What I started to realize is that this looks pretty beautiful: I know the LLM can generate SQL, so I could use the LLM to generate my get-sales-by-region SQL. But I, as the person who understands how our data works, can then say, okay, that is the SQL that represents get sales by region, and essentially create a set of tools for each of our major business metrics. That set of tools can then define how to interact with the analytics database.

It still comes down to: what are you doing as an agent interacting with the analytics database?

The capabilities of an analytics database are things like transformation—ingesting data from a huge variety of different sources, everything from CSV and Parquet files to JSON and random things off of S3; you can ingest all of that into the database and then transform it. It can act as your memory if you're doing a lot of analytical work: if you're building an agent that needs memory and you're dealing with a lot of facts, use a vector database to store those facts, but if you're trying to store data that you're going to analyze as numbers, an analytical database is really great for that memory. And then, obviously, the computation—the ability to scan across a bunch of data and do analytics.
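As a sketch of the ingestion side, DuckDB can pull all of those formats directly into tables; the bucket, file names, and table names below are placeholders.

```python
# Ingestion sketch: Parquet on S3, CSV, and JSON straight into tables.
# Bucket, file, and table names are placeholders.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

con.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet')
""")
con.execute("""
    CREATE OR REPLACE TABLE signups AS
    SELECT * FROM read_csv_auto('signups.csv')
""")
con.execute("""
    CREATE OR REPLACE TABLE webhooks AS
    SELECT * FROM read_json_auto('webhooks/*.json')
""")
```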

To give you an example: if we ask something like, "Which of my customers currently spending over $10,000 are likely to churn?"—you'd want an analytics database that can understand different data sources, because you'd bring in data from HubSpot, from Salesforce, from all of your various business systems, and product data if you have a product. You want to be able to store temporary state and derived data, because you want some consistency between different runs of the LLM. And you want it to scale up and down as needed—you can easily have runaway costs—while being able to run a bunch of things at once and keep costs under control.

What would a great analytical database look like for that? First of all, what would it not look like? This is what a lot of data warehouses are: a monolithic, very large engine that is scaled for peak traffic and peak performance. A lot of businesses overspend on idle resources for their data warehouses, and you don't really want to do that. There's autoscaling and things like that that you may have used—a bunch of people actually came up to our booth and talked about how autoscaling on some of the major databases didn't really do what they need. You really don't have your cost scaling with the amount of usage. And these are users, but let's say there are agents spinning up and connecting to your database—they can execute tons of queries.

This is where we talk about open-source DuckDB. How many of you have actually used DuckDB? Okay, maybe a third. It's super easy to install. It runs locally as a library on your computer. It does analytics much the same way SQLite might be used for transactional data. It is very easy to use, nowadays occupies maybe 100 megabytes of RAM, and has full SQL analytical capabilities. It is very popular, and overall DuckDB gets high scores from everyone in the community for the experience it provides: an improved SQL dialect, a great CSV parser, an awesome library to use within your applications. It's in-process. It's pretty fast—go to ClickBench, the benchmark from the ClickHouse folks, and DuckDB oftentimes performs much better than ClickHouse on its own benchmark. And it's a tiny little library. It's easy to use, it scales, and it's free—and that's probably the best thing for a lot of people.
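A quick taste of what "runs locally as a library" means in practice; the Parquet file name is a placeholder.

```python
# DuckDB really is just a library: `pip install duckdb`, import, and query.
import duckdb

duckdb.sql("SELECT 42 AS answer").show()                      # in-memory, no server
duckdb.sql("SELECT count(*) FROM 'events.parquet'").show()    # query a file directly
```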

Of course, analytics often requires heavy resources. It's kind of difficult to predict how much resources you need. You're on your own to scale it if you're using the library for DuckDB. And do you really want to manage a fleet? And that's really where MotherDuck comes in as a cloud data warehouse. I don't want to pitch you on all the stuff about MotherDuck, but what I do want to talk about is kind of the differences in design versus what you might be used to for an analytical database.

With MotherDuck, we give every user their own duckling. Every user that goes and attaches to a data warehouse has their own compute. That's what the duckling is. It's essentially its own database instance. They're sharing data but have access to their own compute. They can scale vertically as needed. And this extends to applications.

A lot of MotherDuck usage is customer-facing applications, where the end user is another company that's a customer of ours, or a customer of a customer. And in many of these use cases, each of those people also gets their own database. What really helps there is that this is highly partitioned data—these folks really want to make sure each of their customers' data stays completely separate, and this is great for that. But it also helps prevent collisions on resources. You get predictable performance when each user has their own compute that spins up and spins down really quickly.

Now, another thing that MotherDuck does which is really awesome for these types of things is something we call dual execution. Basically, a user comes in and sends a SQL query to their local DuckDB—the local, open-source DuckDB is the client for MotherDuck. They send a SQL query, and the query planner decides which parts of the data are local and which parts are remote. So you write essentially a normal SQL statement, but it knows that, hey, our super top-secret data is only here on this local computer and the other data is in the cloud, so it sends that part of the SQL query for the join over to the cloud and pulls the data from there.

You can think of this architecture as being really useful for agents, especially if the agent wants to do a lot of low-cost, easy data analytics on small amounts of data. You can just have the agent use DuckDB locally, with zero cost associated with that, but when the agent needs the power of the cloud and scaling up, it is literally just changing one line to say I'm going to connect to "md:" instead of connecting to your local file system. DuckDB is really versatile that way. You can use your local file system, you can even use a local in-memory database, and then with one line you're actually using the cloud with this dual execution.
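That one-line change looks roughly like this; the database and table names are placeholders, and the authentication setup is omitted.

```python
# The "one line" difference between local DuckDB and MotherDuck.
# Database and table names are placeholders; authentication setup is omitted.
import duckdb

con = duckdb.connect("analytics.duckdb")   # local file (or duckdb.connect() for in-memory)
con = duckdb.connect("md:my_warehouse")    # same API, now backed by MotherDuck

# Either way, the rest of the agent's code stays the same.
print(con.execute("SELECT count(*) FROM orders").fetchall())
```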

There are also ways, if you have an agent that you really want to have its own playground—its own thing to work with—to do that with zero-copy clones. You can have this shared data set that everyone, all the different agents, has access to, and you can very easily say, "Okay, give me a copy of that for my local agent to work with," and it's basically a free operation. Then your local agent can say, "I'm going to add some additional metadata, I'm going to add some additional data, and then I'm going to eventually push it back so that other folks have access to it." That cloning capability allows you to give your agent full write access to your underlying data if it needs that.
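A rough sketch of that workflow is below; the CREATE DATABASE ... FROM clone syntax and all of the names are assumptions of this sketch, so check MotherDuck's current documentation for the exact form.

```python
# Zero-copy clone sketch. The exact `CREATE DATABASE ... FROM ...` syntax is an
# assumption here; verify against MotherDuck's docs. Names are placeholders.
con = duckdb.connect("md:")

# Give this agent its own writable playground cloned from the shared warehouse.
con.execute("CREATE DATABASE churn_scratch FROM my_warehouse")
con.execute("USE churn_scratch")

# The agent can add derived tables without touching the shared data, and later
# share the results back out.
con.execute("""
    CREATE TABLE customer_churn_scores AS
    SELECT customer_id, 0.0 AS churn_score
    FROM orders
    GROUP BY customer_id
""")
```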

And of course you can scale out. If you have, for instance, an agent that needs access to a database but needs a lot of power, you can use these read-replica ducklings and say, hey, I'm going to have four other instances, eight other instances, twelve other instances—I think we're up to 16 in the UI right now—and all of these other instances are out there to do my analytical workload, and only this agent has access to them. The other agents do not.

And you choose all of that in the configuration. Basically, I have three agents on the bottom and then the core data warehouse service. The core data warehouse service would be writing all the data that the agents need to access, and then the three different agents can be configured differently. The core data warehouse has a jumbo instance. We configure one of the agents here that really needs a lot of power to have jumbos, and our churn-predictor agent maybe needs to look at a lot of things at once, so we give it many ducklings—many instances—to be able to do that.

All the cool kids are building agents nowadays. You want an analytical database that can help answer different types of questions. You want hyper-tenancy that gives the AI applications a sandboxed amount of compute and prevents runaway costs. DuckDB and MotherDuck can really help there. That's all I have for today—thank you very much.
