Preparing Your Data Warehouse for AI: Let Your Agents Cook
2026/01/27
The key insight: AI models are smart, but they need context. We walk through practical techniques including database profiling with DuckDB's summarize function, query history analysis to understand actual usage patterns, and storing metadata directly in SQL comments. We also share what doesn't work—from overly complex prompts to expensive multi-candidate generation approaches.
The webinar includes a live demo of the MotherDuck MCP (Model Context Protocol) with Claude, showing how properly enriched databases enable AI agents to write accurate queries, self-correct mistakes, and generate insights with minimal human intervention.
Transcript
0:00Hey there everyone. Welcome. I think we got the timing a little off on this event that I have, but it looks like we have people trickling in. Okay, good.
0:10Had me worried for a second. I think I accidentally set the time in our uh video hosting to 9:30 p.m. But looks like you're all here, which is good.
0:20 >> Amazing. >> Uh it wasn't too much of a mistake. Uh um we'll give it a little bit for people to trickle in. Uh but yeah, I'll say hello. I'll introduce myself. My name is Gerald. I'm on the marketing team here at MotherDuck. And Jacob, you want to do a quick intro for yourself?
0:36 >> Hey everybody, I'm Jacob. Uh I work in DevRel here at MotherDuck. Really, really excited to show you what we've been up to uh with this webinar today.
0:46 >> Yeah. >> Awesome. Hey, Nigel. Yeah. You want to say hi in the chat over there and tell us where you're calling in from, or not calling, tuning in from. Uh, and you know, love to see you. Jacob and I are both up in the currently cloudy, drizzly Seattle area. Uh, but looks like we have Gabriel from Brazil.
1:05Awesome. Excited for you. We have great webinar today on how to prepare your data warehouse for AI. Uh, you know, we we ran some tests. We did some evals.
1:15Uh, we're trying things out. I think everyone's trying things out. Who knows how AI will change in the future, but we'll hopefully we'll have some good uh tips and tricks for you all today.
1:24Uh, all right. And Amy, hey there from Canada. All right, let me get going. Actually, I'll start with a little bit of housekeeping. Uh, so yeah, we'll be uh recording this and uh if you registered for it, we'll send out a link afterwards. So, you know, if you can't catch all of it, feel free,
1:41you know, drop off do what you need to do. We'll send out a link afterwards.
1:43We'll also probably post this uh on our YouTube or our website later on as well. Um we'll have time for a Q&A afterwards. So I guess just a little bit of high-level agenda. I'll do a quick intro into what MotherDuck is if you haven't really heard of or tried out MotherDuck. Uh Jacob has a few slides
2:01well actually a bunch of slides on kind of his research on how to prepare your data warehouse for AI, and then we'll dive into a demo as well after that. So uh and then after the demo like I said uh we'll have Q&A. If you have questions uh throughout, feel free to put them in the chat uh if
2:18something is you know applicable in the moment we'll get to it. Otherwise afterwards I'll scan through uh and we can get to the questions. So awesome. Let me try to maximize this and yeah. So with that let me just give you a quick introduction into what MotherDuck is if you are not familiar with
2:36MotherDuck. Um what is MotherDuck? Uh really to explain what MotherDuck is we actually had to take a little bit of a step back in time, maybe I don't know 15 years, to the beginning of uh the big data uh kind of era when we had you know Hadoop
2:56and Spark and then we had BigQuery and Snowflake all of these really large distributed uh um you know data warehouses query engines uh were created to say hey like hey we need to be able to process humongous amounts of data uh and to do that really what we need to do is we actually we need to take all of your
3:14queries, we need to split them out uh into smaller queries and we need to send those out to multiple nodes uh so that we can kind of take all that and do something massively in parallel, bring them all back together and then give you the result. Um but the problem with that is that you know
3:31introduces a significant amount of you know overhead and and latency as well. you know, you have multiple round trips to the cloud, you have to shuffle things, you know, it brings things back together, you have to coordinate all of that and it it really causes you to build a system kind of built on some of these um restrictions. Um but you know
3:50fast forward to where we're at today and we've kind of reached what we're calling, I forget who called it first, but you know, the data singularity, which is basically the point at which uh you know 99% of data sets, or at least the
4:02size of data sets that you're scanning on a typical query, can fit on a single node. Uh which is very different from what it was again in the past when these you know these MPP systems were created, like, hey, they were created as a necessity of, hey, you can't fit most of that data on a
4:19single node uh but we're I mean we're probably even farther in some ways past this data singularity um today I mean we have these super huge beefy nodes in like in EC2 instance uh that has you know you know over 100 cores and terabytes of data and even on you know my laptop for example I have uh you know
4:40I don't know, 36 gigs of RAM and uh you know dozens of cores, and I can do a lot of processing just on my laptop as well. So um and so with that we've built MotherDuck in this new post-big-data era where there really isn't you know an idea of
4:57what is big data and what is small data, really it's just all data, right, because most of it can fit on a single node. So we are building a data warehouse um that uh really eliminates the need for all that distributed overhead that you would normally have in
5:12these you know traditional data warehouses. Uh we can focus really on how long it takes you to get from having a question to getting an answer. Uh and that isn't even just the time that it takes to run the query. A lot of that is, hey, how long does it take you to iterate on
5:27a query? How long does it take to do some maybe even exploratory data analysis on a data set? uh you know iterate on that query, try things over and over again and then get an insight and an answer back. Uh and then we're building on you know the best what we have. So you know again a lot of the
5:42infrastructure today is uh extremely advanced. You have you know lots of cores, it's extremely fast, and they're also very large. And then we're built on top of DuckDB, which is itself an open source OLAP uh query engine.
5:59uh it's managed out of a team in Amsterdam, uh same place where they created Python, and all they care about is building an extremely fast uh and usable uh database. Um and some of the kind of architecture uh distinctions that MotherDuck has compared to these traditional data warehouses is um you know we are what
6:23we call effortless and serverless, so you only pay for what you're using. Uh because we're using DuckDB and DuckDB can spin up and down uh so quickly, you know you have these sub-100-millisecond cold start responses, uh we can spin uh your resources down uh basically you know down to zero uh when you're not using it
6:43and so you're only really paying you know for what you're using. Similarly you can basically just pay for you know as many nodes as you need, so you can scale out uh with dedicated instances per user or per service account. You can also in some ways kind of scale
6:59up, where you say, hey, now I just want uh you know a beefier node because maybe I need to resort an entire table, and so we have you know larger instances to do these really data-intensive workloads. Um and you know really because of that
7:15because you can separate out these individual nodes or instances per user or service account, uh it allows you to eliminate you know resource contention and noisy-neighbor problems, uh so you're not fighting for resources with other queries or other users. Um and lastly we have what we call the dual query, the dual
7:33execution query engine, uh which allows you really to combine the power of DuckDB locally uh as well as the DuckDB we're managing in the cloud, and you can you know basically run a query across data sets locally and in the cloud at the same time. Uh you can query just you know on your local machine if you
7:51need to and really just save uh on that cloud compute time. Um you also have a local environment that can exactly match the cloud, so you can do development locally on your computer and push it to the cloud when you're ready um for production. Um and we've built in a lot
8:09of you know query plans that will kind of be adaptive to understanding where the data lives and what is most efficient, um and it will execute kind of you know wherever it is best um automatically. Uh you know, how does MotherDuck compare with some of these other um data warehouses that are out there? Uh I think this
8:28is a slightly old screenshot now from ClickBench. Uh but we're kind of just using it to show, hey, you know, MotherDuck is, uh, one, the shorter the bar the faster the query, and also you can compare, like MotherDuck here again it's a really short bar, but also you know
8:45if you compare it to maybe say a Snowflake 3XL, uh our cost is you know significantly lower, so you know from a price-performance perspective uh we are you know cheaper as well as just more performant. So um and then lastly you know MotherDuck uh fits into the normal modern data stack as you'd think. So
9:04again we're kind of in our view in the middle as as at the data warehouse and we connect with all the normal ingestion orchestration transformation tools that you uh would normally uh you know use and love today. So with that let me how do I minimize this part? I'm done with my part and I will hand it over to Jacob
9:24for his >> All right. Thanks J. I'm going to add mine to the stage. Wait, remove you. Oh, there you go. You got it. Okay. And I will add to the stage. Perfect. Got some slides here. We're going to talk about slides and we'll jump into a demo. Um [clears throat] um let's see where we're at. I do see a
9:43comment uh breaking for anyone. I'm not sure what that means. How do we sound?
9:47Do we sound okay in the chat here? >> We're fine. Sometimes the Wi-Fi in the office, my Wi-Fi is a little >> spotty, but it looks good right now. So, but yeah, let us know uh you know if if we're cutting in or out. So, >> okay, great. All right. Um so, let's start with uh a little bit of a spicy
10:05take here, which is um
10:09uh I think text-to-SQL is basically solved. Um taking natural language and writing SQL with it is a solved problem and uh the benchmarks don't show it. Um I'm going to explain today why the gap is fake and how you make it uh
10:28how you make it work. Awesome by five. Okay, I have no idea what that means. Anyways, I should stop reading the comments. All right, let's talk about the benchmark. I spent some time with a very famous text-to-SQL benchmark called um BIRD-bench and uh
10:47it's a massive massive benchmark, uh 12,000-plus questions across many data sets, lots of gigs of data. Um so I broke that down. There's a mini dev set that you can get, it's really easy to use, um and uh I started
11:04just using, hey, what does it look like to use the MotherDuck MCP against these data sets? Um and you know what kind of results can I get with just a little bit of tuning on my side? And I rapidly got to about 70% in terms of
11:20accuracy, um which was not where I wanted to get. The kind of you know uh state-of-the-art is around 82% and the human eval set is about 90%. Um so I
11:33really couldn't figure out how they were getting better in terms of this benchmark. It didn't really make sense to me. So I got really frustrated and I said, "Okay, I'm going to actually look at the data here. Like, what the heck's going on?"
11:45Like, why am I getting so many wrong and, you know, these other folks are not? Um just to kind of show you where we were at, um this was early in the process. It looks different now, but uh you know, this is what I got correct, wrong, and partial. I have a partial in there
12:03because the benchmark fails you if you return an extra column or if you return extra rows. Um I uh modified it slightly
12:12to check different states, um to allow it to be a little bit looser there so we could check against partials. Um honestly just to eliminate, hey, when we have the right SQL query but maybe an extra column, which was really the usual suspect there. Um, so I started looking at the data. Uh, I
12:33built a little harness that I'm going to show you in a little bit to look at the data. Um, and straight up, the gold SQL is wrong. The question asks a specific thing, the SQL written in the gold uh answer set is incorrect, so the benchmark's answer is incorrect and my predicted result is right. So uh that happened
12:56quite a bit. In fact, since since I started working on this, there have been some papers published about how this benchmark has a problem on it, which is fine. Um, I'll share those in the chat later. Um, so let me give you an example what this looks like, right? Um, so this is a uh a question in the benchmark set
13:12uh question 1243. Um, and it's asking us
13:16about abnormal uh abnormal measures uh for patients. And it uses this SQL statement. Um, and this is fine. Uh,
13:28except there's a bug in this, which is if you use COUNT over a CASE expression that returns a zero for
13:36non-matching rows, the zero still counts as a non-NULL value, so it still gets counted. Um, we need to return NULL in that case. And so actually this needs to be a SUM, or a COUNT without the ELSE.
13:45This doesn't actually return the right value for the question. Um, so the gold set gives us an answer of 1.2%.
13:53Uh, the actual answer to this question is 78.4%. Literally the answer is just straight up wrong given the question.
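The COUNT-versus-SUM distinction he's pointing at is easy to demonstrate; here's a minimal sketch using Python's built-in sqlite3 as a stand-in engine, with a made-up `labs` table rather than the actual benchmark schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE labs (value REAL)")
con.executemany("INSERT INTO labs VALUES (?)", [(50,), (150,), (200,)])

# Buggy pattern: COUNT counts every non-NULL value, so ELSE 0 rows count too.
buggy = con.execute(
    "SELECT COUNT(CASE WHEN value > 100 THEN 1 ELSE 0 END) FROM labs"
).fetchone()[0]

# Correct: SUM the 0/1 flags, or drop the ELSE so non-matches become NULL.
fixed_sum = con.execute(
    "SELECT SUM(CASE WHEN value > 100 THEN 1 ELSE 0 END) FROM labs"
).fetchone()[0]
fixed_count = con.execute(
    "SELECT COUNT(CASE WHEN value > 100 THEN 1 END) FROM labs"
).fetchone()[0]

print(buggy, fixed_sum, fixed_count)  # 3 2 2
```

Only two of the three rows exceed the threshold, but the buggy form counts all three.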
14:02Um, and uh, that is a problem that I found repeated. Um, keeps happening. Um,
14:11you know, I kept looking at the data. The closer I looked at it, the more I just got frustrated and said, I'm going to build a platinum answer set that is corrections to the gold. I have about a hundred of them. That's about 20% of the questions out of the 500. So, let me just
14:29show you what I did to do a little bit of analysis on this. Uh, let me see if I can share this tab.
14:36You don't see that? Why do you not see that? Oh, wait. You do see that. Okay. All right. So this is a little um eval system that I built so I can actually look. I'm going to show you one. It's a little small, I know, because there's a lot of data in here. Um okay.
14:53So this is an example of a question. We're going to zoom in here so you can see it. So this is the information that we give to the model to answer the question. Um so we ask this question.
15:02Name the card and artist with the most ruling information, also state if the card is a promotional printing. And it gives us some information about this, right? Um, and then here's the schema of the data set, right? So, very nice. It gives us this information so we can write a SQL query. The gold SQL is right here. And I want to show you one thing
15:18that it has an issue with, which is the predicate has WHERE promo = 1. If we go back and look at the question, we can see that it says state if the card is a promotional printing. Well, this is already filtering to only show us uh data where promo is one.
15:34Okay. Um there's actually some other issues in this query too, [clears throat] but that is an example of the types of issues that are in this.
15:41So as a result, the gold query's result looks like this, and my query's result looks like this, which is incorrect as per the benchmark. I actually never added this one to the platinum set, so it shows as um incorrect in this case, but that's fine. Um and then we can actually see our chain-of-thought
15:59trace here, too. So, um, here you can kind of see here's the prompts I'm giving it. Here is the tool calls it's making. Uh, we can scroll down and actually see all of the logic here. And I've been looking at these kind of um question by question to make sure that we have uh we can be certain in our in
16:20the conclusions that we're drawing uh as it relates to this. So, let's go back.
16:25 >> Hey, hey, Jacob. Uh, question here in the chat. Sundu was wondering, do you know how they created the benchmark questions? >> I am 98% sure they used AI to create this.
16:38 >> Yeah. Yeah. >> And this benchmark is two years old. The models were much worse. Um uh there's a subtle conclusion here which is like we probably need a better benchmark for for this stuff.
16:49 >> Yeah. But like uh yeah um there's
16:53definitely I mean if you're writing 13,000 questions as a grad student like you're not looking at all of them.
16:59That's just the truth. Um all right, so what's happening? Why am I so far off from state-of-the-art?
17:07 >> Um well uh basically the teams that that are scoring well trained on broken answers and give bad answers to questions.
17:17um and the humans do the same thing. Everyone's just overfitting to bugs in the data set. Um which is a very funny uh observation, and basically my kind of back-of-the-napkin number is the benchmark saturates somewhere around 75%.
17:32And everything after that is uh straight up, you know, uh overfitting. But who
17:40knows? Um we I I'll have more more publishing on that too soon. nothing kind of like academic per se but pretty rigorous. Um anyways, good news for all of us here, right? Um that benchmark is saturated and that means AI is ready for SQL but you do need evals. Okay, so we're going to talk about a little bit
18:01about what that means and um what I learned through the process of of running this with our MCP and latest models. So let's talk about what actually helps.
18:13Um so what's helpful about BIRD-bench
18:17is that the teams that did submit also wrote papers, and uh the top-scoring paper has some really awesome techniques. They just used GPT-4o with no fine-tuning. Um and the insight that I just want to call out here is: the difficult part of query development relies on understanding what's actually in the database. So how do we do that?
18:38Um so I tested this assumption, right? Does it work? Um when I spent a bunch of time uh prompt engineering, I didn't get much in the way of gains. Uh I spent some time giving the model column stats and then all of a sudden my accuracy doubled. So let me just show you what that looks like. Here
18:57we go. Um I ran this on 500 questions. If I just said here's the prompt to the MCP, this is what I got. I got 30%.
19:05Okay. If I add uh database profiling, I go to 64%. If I add comments and prompt tuning, I go to 74%, which is basically the saturation point of this benchmark. So, um very very interesting um kind of where we're at today with adding these techniques, and I'm going to talk a little bit more about where we go there. So, um oh also
19:31I guess one thing I would call out is um we also got cheaper as our metadata improved. It got cheaper to run this. Um, these numbers seem really expensive, by the way. Uh, and they kind of are. That's because I ran this across three models. I ran this across GPT-5.2 and across um
19:51uh Opus 4.5 and across Gemini 3 Flash.
19:56And uh Opus is significantly more expensive when you use the API than uh those other models. And so that's weighting a lot of the cost here. Um much cheaper in reality, but just so you know, that's where that artifact comes from.
20:12Um okay, so how do we actually get to this 74%. Um how do we get this metadata and give it to the model? So I have three pieces that I'm thinking about. Uh database profiling, query history, and SQL comments. So these three things go together to kind of make this all work.
20:30um database profiling. DuckDB has this great one-liner called SUMMARIZE um that returns a bunch of awesome data for us.
20:38Um and you just run that once for every table and you know persist the results. Um query history, right? So there's something subtle here. Um the schema tells you what's possible. The query history tells you what people are actually using, right? Um uh so if your schema has 50 tables in it, uh but analysts only ever use 10, right? that's
20:59valuable information. You should figure out how to apply that to your AI model. Um, and then what we can do is take those two pieces of information and pack that data into our uh SQL comments, right? Um, uh, one note here is obviously, like, uh, some of these have row counts in them. I don't see row counts in these ones, but
21:19um, from time to time we do want to add row counts. Uh, maybe we add that to the table comments, not the column comments. Um, and that might be dynamic.
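The SUMMARIZE one-liner mentioned above is a single DuckDB statement; a minimal sketch (the `orders` table and the profile table name are hypothetical):

```sql
-- One-shot profile: per-column type, min/max, approx_unique,
-- avg/std/quartiles for numerics, count, and null percentage.
SUMMARIZE orders;

-- Persist the result so it can be fed to the model later:
CREATE OR REPLACE TABLE orders_profile AS
    SELECT * FROM (SUMMARIZE orders);
```

Running this once per table and storing the output is the "database profiling" step he scores at 64%.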
21:27Um, so we can tell we we can kind of improve our instructions there too. And you know, obviously this is something you probably actually want to regenerate on like maybe a nightly basis or something. Um, to kind of keep the latest knowledge, you know, in in your model. Um, so this is kind of what it looks like. So,
21:45uh, we tried a bunch of different things here. Um, I'm actually going to skip this slide. Um, but really, comments are the best. And the reason why I love comments on columns and tables is it's a SQL standard. Anyone can read it. For any of you who've looked at, let's say, the um uh you know SQL Server test data
22:06sets that they give you, um uh Contoso or whatever,
22:13they include comments, and now they're useful. I don't think they were ever useful before; they're very useful now.
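Packing profile stats and usage notes into standard comments looks roughly like this (the table, columns, and numbers here are illustrative, not from the demo database):

```sql
-- Stats and usage notes travel with the schema itself, so any tool
-- (or model) that reads the catalog sees them.
COMMENT ON TABLE orders IS
  'Customer orders. ~1.2M rows. Most-joined partner: order_details (306 joins in query history).';

COMMENT ON COLUMN orders.status IS
  'Order state. 4 distinct values: pending, shipped, delivered, cancelled. 0% NULL. Frequent WHERE predicate.';
```

Because COMMENT ON is standard SQL, the enrichment survives whichever client, BI tool, or agent reads the schema.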
22:18Um so let me talk a little bit about what I tried that didn't work. Um this is very interesting. Um I read a bunch of the papers related to this, um and ran some experiments and techniques and kind of, you know, uh Clauded my way through uh ways to apply them and see what works, what doesn't work. Um you know
22:37look at the outcomes. Um one of the common patterns is self-consistency. Uh so generate five SQL candidates, run them all, pick the most common result.
22:46This is a great way to just 5x your cost. Um, if your metadata is bad, it's going to get bad results. If it's good, you're going to get good results. Uh, there really is no advantage to doing this, unless, you know, maybe if you're using really cheap small models that could work, but in
23:03my experience, it didn't work. Um, LLM-generated descriptions. So, if you just pass the schema to it to write uh descriptions based on table definitions, I found that not very helpful. Um, it hallucinates a bit, right? Uh it will make up what the customer status means. Um giving it the stats and the
23:23query history leads to much tighter, more accurate uh descriptions of those columns. That being said, uh it is important to make sure your column descriptions are good. Um way more important than when writing, you know, SQL that's getting consumed by a BI tool. Um another thing we tried is really long system prompts. I set up an
23:41optimizer using a framework called uh DSPy to uh run a bunch of optimization on the prompt, and we ended up with a really long and verbose prompt. Um and it tried to handle all these different edge cases, you know, hey, like when I say this, I actually mean this, um type stuff that was specifically in the
24:00test set and uh all that did was make it more expensive to run and less accurate.
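For reference, the self-consistency pattern dismissed a moment ago is simple to sketch. Assuming a `generate_sql` function wrapping your model call and an `execute` function for your warehouse (both hypothetical names), it's a majority vote over candidate results:

```python
from collections import Counter

def self_consistency(question, generate_sql, execute, n=5):
    """Generate n SQL candidates and return the most common result.

    generate_sql(question) -> str   (an LLM call; hypothetical)
    execute(sql) -> result rows     (your warehouse client)
    Costs roughly n times a single-shot query.
    """
    results = []
    for _ in range(n):
        try:
            rows = execute(generate_sql(question))
            # Normalize to something hashable so identical results group together.
            results.append(tuple(map(tuple, rows)))
        except Exception:
            continue  # a candidate that fails to run casts no vote
    if not results:
        raise RuntimeError("no candidate produced a result")
    winner, _ = Counter(results).most_common(1)[0]
    return [list(r) for r in winner]
```

As he notes, the vote can only amplify whatever the metadata already makes likely: five samples from the same under-informed model mostly agree on the same wrong answer.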
24:08So uh you know simple simple worked better. Um the last thing was you know hey like what about prompt injection?
24:34the benchmark. So moving things to comments really simplified everything. So you know, kind of the summary here, I would say this is the pattern, right? Every failed approach added complexity; successful approaches were just adding information about the data. The models are pretty smart. They just need to know what they're working with. Simple, right?
24:52Profile your tables, store the stats, let the model read them. Um it works. So I'm going to jump into a little bit of a demo next um and show you um some code that I've written to build this. Um so we'll see uh a query history analyzer, generate and apply comments, and then we'll do some querying with uh Claude.
25:13So uh yeah I'm going to stop sharing and we will share now my VS code window.
25:22Um let me see here I can find the window. Here we go. Oh, that looks pretty good, right Gerald?
25:35 >> Um, yeah. Yeah, I think that size size is good. You know, in the chat, let us know if you need us to zoom in a tad, but >> terminal is a little small, but that's okay. Um, so I'm going to run this first query, and then I'm going to show you while this uh what I have here. I will
25:49make this code available to everybody. Um, it's not open source yet. There's still some work to do. Um, but I'm happy to kind of, you know, share it uh in an email post this call. Um, so I'm going to run this. And I'm just going to explain what's happening here. So we are um using uv to run a
26:09Python metadata generator library. Uh, I have this -v here. This tells me verbose; it's just going to print more to my console. Um, and then I'm choosing this esake database. This is a custom data set we built that we've been running um for internal kind of evals. And then uh I'm running my history function on schema main, right?
26:28Um, and so that is what is happening here. And so what it's actually doing is, oh, I'm in the wrong, I'll just do this source in the wrong account. Let's try that again. Um,
26:42there we go. Okay, great. So, it's running queries. It's running on a thousand queries. You can actually filter this. There are functions to filter this on user and to set how far back you want to go and what your query limit is. You can look at more than a thousand. Um you would have to kind of look at your evals and
26:56maybe make a judgment call on what makes the most sense there. But um all right, so while that's in there, I'm just going to show you what it generates. Um and we'll we'll actually look at it in the output in a in a minute once it's done.
27:07Um okay, good. It finished. Great. Um so we generate some
27:14tables, actually it's just JSON for now, which contains join patterns, field usage, predicates, derived metrics, samples, and query use cases. And so what we're doing here is really important: we basically take all the queries and query history that match our criteria and then we uh break them into their components. Actually I'm
27:34using SQLGlot in this case, right, because we want to exclude things like CTEs, actually we want to use those to kind of figure out what tables we maybe need to create. That's a separate analysis we could do based on this. Uh but we kind of have to link to the actual raw tables. We don't want to
27:48pollute it with kind of fake joins and things like that. So we have to do a bunch of processing on those queries. And thankfully SQLGlot is really really good at query parsing. And so I'm just going to jump into history and I'll show you what this looks like.
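He uses SQLGlot for the real parsing; as a toy illustration of the join-pattern analysis (a regex stand-in that is far less robust than an actual SQL parser, and not his library's code), counting join pairs across a query log might look like:

```python
import re
from collections import Counter

# Matches "FROM table [alias] [LEFT/RIGHT/INNER/OUTER] JOIN table".
# A real parser (like SQLGlot) handles subqueries, CTEs, quoting, etc.
JOIN_RE = re.compile(
    r"\bFROM\s+(\w+)(?:\s+\w+)?\s+(?:LEFT\s+|RIGHT\s+|INNER\s+|OUTER\s+)*JOIN\s+(\w+)",
    re.IGNORECASE,
)

def join_counts(query_log):
    """Count which table pairs are joined across a list of SQL strings."""
    counts = Counter()
    for sql in query_log:
        for left, right in JOIN_RE.findall(sql):
            # Sort the pair so (a, b) and (b, a) count as the same join.
            counts[tuple(sorted((left.lower(), right.lower())))] += 1
    return counts
```

The output is exactly the kind of "products joined to categories 306 times" signal he shows next.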
28:01So you can see it looked in my main schema, queried a thousand queries, and in that we had um, here's our most common join, products and categories, that's joined 306 times in this data set, right. Um so then we have uh 35 you
28:21know joins on this table, 17 joins on this, and this is obviously biased by the types of questions we're asking. I mean, not probably, we're definitely asking a lot of questions to this data set that are about products, products and categories, right? I can definitely tell that right now. Um, and so, uh, we
28:39can see that. Then we also have fields. Um, so there's a there's field usage, right? Uh, how often is this order ID used? What where is it used? It's used in select, where, join, you know, and then an important score. This actually comes from the paper. Um, it's all linked in the repo. Uh, to kind of evaluate, you know, what how important
28:57is this field? And then we look at, you know, a bunch of fields. Um there's some more data in here. Let's see. We also look at predicates, right?
29:05Um so here, our most common predicate is today. So we're looking at time series data when we're doing this. Um and then you can see, someone, this is extremely funny, I did this, I did a bunch of research on this limit ginger tea product and uh now you can see it shows up here as a common uh common
29:26predicate. That is super funny. Um uh
29:31and then it looks like we also have um uh you know the first day of uh 2025 and the last day of 2025, right, so it's looking at you know ranges of dates, you know, hey, this year, previous year, previous year data. Um we've got some other stuff in here
29:52let's see um so we've got derived metrics this is really cool so here you can see here's the expression Um, and then here's what we're calling it. We call it all these things, right?
30:04Um, so it's used a bunch. It's used 436 times. Um, it's on the order details table and it calculates uh with this, right? Um, this is our revenue calculation, which is great. Um, we know what this means. Um, and then here's our next next common metric is quantity, right? Quantity sold. And we can see what table is involved, how often it
30:25occurred, and so on, right? Um I'm going to jump to the bottom here uh
30:32to queries at the very bottom. Um so we also have query samples. So this has a sampling method to look at basically 30 diverse queries based on tables and columns that are used. And we're actually going to use these to hydrate um our prompts, right? Or sorry, our comments. So, um, the way that this
30:56works is it basically just says, all right, we can look at these and we can see what the common relationships are based on this diverse set of queries that we're sampling.
31:05Basically, it's weighted based on how diverse the data is and then how commonly those tables are used, so that we get a good set that we can use to uh add comments.
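A rough sketch of what diversity-weighted sampling like this could look like (a simplified stand-in, not his actual library code): score each candidate query by the popularity of its tables, boost queries that touch tables we haven't picked yet, and take the top k greedily:

```python
def sample_diverse(queries, tables_of, table_freq, k=30):
    """Pick k queries that balance table popularity against novelty.

    queries   : list of SQL strings
    tables_of : dict mapping query -> set of table names it touches
    table_freq: dict mapping table name -> how often it appears in history
    """
    picked, seen_tables = [], set()
    remaining = list(queries)
    while remaining and len(picked) < k:
        def score(q):
            tables = tables_of[q]
            popularity = sum(table_freq.get(t, 0) for t in tables)
            novelty = len(tables - seen_tables) + 1  # favor unseen tables
            return popularity * novelty
        best = max(remaining, key=score)
        picked.append(best)
        seen_tables |= tables_of[best]
        remaining.remove(best)
    return picked
```

The greedy loop keeps popular tables well represented while the novelty factor stops the sample from being thirty near-identical queries.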
31:17Um, all right. I'm going to run the next step here, which is my
31:23um generating my comments. And so, this is actually going to do my profiling and some other stuff. Um, we want to look at one more, which is our translations. I'm just going to show you. So, this translations, um, this takes those 30 queries that we're looking at, and, um, it gives us a a natural language description, it gives
31:42us a short question, and it gives us a long question, right? And why this is important is, um, at least for this phase, it's used just for figuring out uh what relationships are important and overloading our comments. But my plan is, and this is why I haven't released it yet, my plan is to actually also um uh
31:59do a little bit of prompt injection with this where when a question is asked we check against the this natural language short and long question with an embedding and then we we do like a little bit of semantic search and say hey this this is not the answer to your question but this is a similar query about a similar question. So start with
32:16this as a base in addition to all the other information you have. And again, what we want to do is we want to maximize our probability of success here, right? Um, all right. So you can see that's in there too. Uh, we've now generated profiles, which is really cool. So we're going to go look at the profiles. That
32:32should be Where is that? Output profiles. Here it is. Um, so you can see we've profiled the data. Um, so here is a uh this is the categories table. Here's category ID. It tells us what the type is, minimum, max values, number of uniques, uh sample values in the table, right? Uh min length, max length, etc. Looks
32:56like these aren't getting filled. I'm not sure. Maybe they're not used. I don't remember. Um and so you can see we have, you know, here's our category name, right? Here's some sample values of categories. Um, and again, we can use all of this to populate information back into um our uh data set there. So, let
33:19me go um what else we got? What else we got going on here? All right. Now, we're generating LLM descriptions. By the way, I'm using um Gemini 3 Flash um in all of this with OpenRouter. Um it seems to be the fastest and cheapest way to uh use an LLM for this type of enrichment activity where you have something that's
33:38very narrowly scoped. Um and so I've been using that. Um, you know, your mileage may vary. You may have a different relationship. I'm sure those models work fine. Uh but, you know, most importantly is building some sort of way to evaluate, you know, um uh is this actually improving the quality of your queries or not, which uh I'm writing
33:59about and will publish soon.
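The planned retrieval step described a moment ago, embed the natural-language translations of sampled queries, then semantic-search them when a new question arrives, can be sketched minimally. Everything below is invented for illustration: the questions, the placeholder SQL, and the 3-d "embeddings" (which would really come from an embedding model):

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical store: each translated query keeps its natural-language
# question, its SQL, and a precomputed embedding.
question_store = [
    {"question": "What were net sales by month in 2025?",
     "sql": "SELECT ... FROM order_details ...",
     "embedding": [0.9, 0.1, 0.0]},
    {"question": "Which employees report to which manager?",
     "sql": "SELECT ... FROM employees e JOIN employees m ...",
     "embedding": [0.0, 0.2, 0.9]},
]

def nearest_question(query_embedding, store):
    """Return the stored question most similar to the incoming one,
    to offer the model as 'a similar query', not as the answer."""
    return max(store, key=lambda row: cosine(query_embedding, row["embedding"]))

hit = nearest_question([0.8, 0.2, 0.1], question_store)
```

The key framing from the talk is preserved in the docstring: the retrieved query is a starting point for the model, not a cached answer.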
34:03cool. So, we have more. Again, it's just the same. Uh, we've run this across all of the tables in our data set. And right now, it's all just in JSON so that we can easily read it and consume it with Python to um do something. Oh, yes. We also have in here, this is really cool.
34:15I love this. Um, we are looking at tables and keys and basically figuring out what the relationships are, right? So we're doing Jaccard similarity between columns to determine: is this an actual joinable key, right? Um which is really great and uh helps us to understand, you know, can we actually join on these things? And we can use that in our
34:40comments to say hey, we think table A column 1 joins with table B column 2. And again, this is all just, you know, uh being built on the fly here for us. Okay, so let's see where we are in our process here. Lots of things happening. We are generating comment table descriptions.
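The Jaccard-similarity check for joinable keys can be sketched in a few lines. The column samples and the 0.5 threshold below are invented for illustration:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of sampled column values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def looks_joinable(col_a, col_b, threshold=0.5):
    """Flag a column pair as a candidate join key when value overlap
    is high. The threshold is this sketch's arbitrary choice."""
    return jaccard(col_a, col_b) >= threshold

# Invented column samples: product IDs overlap heavily across tables,
# while a quantity column does not.
products_product_id = [1, 2, 3, 4, 5]
order_details_product_id = [1, 2, 3, 4, 5, 6]
order_details_quantity = [10, 24, 5, 12, 3, 40]
```

A high score between two ID columns is the signal behind "we think table A column 1 joins with table B column 2".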
34:59It's great. Let's keep rolling here. You can do it. Um, let's take a look at what these look like actually.
35:08Um, let's not go there. Use cases. I can't remember. Did we use this one?
35:15I don't think so. Um, all right. So, these descriptions are generating. I actually already have a copy of it. So, while it's running, let's just look at it.
35:24Um, I'm going to turn on the wrap. Wrap wrap wrap. Oh, that's not what I want to do. Sorry, everyone. Just blinded you.
35:30Uh, wrap. There we go. Um, okay. So, here's an example of what we are getting added to these um descriptions.
35:42This is actually not the right description. Did I run it wrong? Oh, no. Okay, it looks like it's running right now. So this is a previous version um that I have in here, which for some reason doesn't have all the information, but that's okay. Um you can see what it does include, though, is join hints in the description for the table.
36:00Hey this is the primary key join for product grouping and inventory segmentation right. Um so this [clears throat] is really great. Um so we can see what this looks like.
36:10[snorts] Um and then uh let's see, this is still building. We're on table 10 out of 11. Oh, I remember, these are the descriptions. Excuse me. These are kind of like the text descriptions coming from the LLM, excuse me.
36:24Um and so we can, you know, we're building these on a table. There we go.
36:28Just just updated uh on a table by table basis. There is a little prompt engineering you need to do here to make these really tight. Um and then the last thing is of course we get comments. So let's wrap this one. We can just literally copy paste run this. I'm not going to do it right now, but um let's
36:45do this. Okay. So, uh, here's the example of the comments on the categories table, right? So, here's that comment, and then here's the data profiling information, right?
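Comments like these can be applied with standard `COMMENT ON` statements. Here is a hedged sketch of generating them from a profile dict; the profile keys (description, samples, join_hint) are this example's own convention, not the actual schema of the presenter's script:

```python
def comment_statements(table, profile):
    """Build COMMENT ON statements from per-column profiled metadata."""
    stmts = []
    for col, meta in profile.items():
        parts = [meta.get("description", "")]
        if meta.get("samples"):
            parts.append("samples: " + ", ".join(map(str, meta["samples"])))
        if meta.get("join_hint"):
            parts.append("join: " + meta["join_hint"])
        # Escape single quotes so the comment stays valid SQL.
        text = " | ".join(p for p in parts if p).replace("'", "''")
        stmts.append(f"COMMENT ON COLUMN {table}.{col} IS '{text}';")
    return stmts

stmts = comment_statements("categories", {
    "category_id": {"description": "Primary key",
                    "samples": [1, 2, 3],
                    "join_hint": "products.category_id"},
})
```

The generated statements can then be copy-pasted and run against the database, exactly as shown in the demo.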
36:56Um, and then we have primary key um, uh,
37:01flagged as a comment. Um, and again, this is helpful because these things may change over time, right? As your model changes, you want to maybe update the way your joins work, right? They're it's not a static thing that only exists at one point in time. Um, so that's very very important distinction. Um, I want to call out one really fun thing that I
37:19identified in here. Let's see if it figured it out. Okay, let's go to Oh, no, it isn't there. Okay, self-referencing. Perfect. So, it recognized that the employee table is self-referenced via reports_to, right? So, this includes both the list of employees and it tells us who reports to whom. So, this is really cool. Um, we've added these. So I'm going to jump
37:41into a demo and we'll just do a little bit of querying in Claude. All right. So let's do this.
37:50Let me stop sharing this one and let's start sharing this one. Okay, I'm going to just start a new chat. Okay, cool. All right, so I'm using Opus 4.5. Sonnet works okay too, but I find Opus is just smarter in terms of working with uh working with data. So, I'm telling it to look at the Eastlake data v2 data set. Um, we have
38:15the Mother Duck MCP installed, so it has access to that. And let's just see what it does. Let's see how smart it is, I guess. Let's see. Let's see if it actually knows to use the MCP based on this. It does. Okay, I need to connect.
38:25That makes sense. Um, let's go. I'm authenticating in another tab. Accept. Close. Great.
38:35All right. So, it is doing this list tables action. And what's really cool in this and then I Okay, it's already authenticated, but it's a bug in Claude. Interesting. You'll see that these comments are all returned as part of this list function. So, it's listing the columns uh and the tables, and it's getting all of the context it needs to
38:53actually write the SQL queries as part of this, right? So, it's getting all this information back. So, it's learning getting context really, really rapidly.
39:01We're shooting a ton of context really quickly into this, right? Which is really great. Okay, so we can see that um we can see gross and net sales. Um let's
39:14let's look at gross and net sales in
39:192025 versus 2024. Oops. Stop that.
39:25Edit. There we go. Okay, so we can ask questions. Uh, these are indeed the right numbers, which is great. Um, and so it's just building these queries on the fly. Actually, let's go look at the query that it wrote.
39:40Let's see. Yep, unit price times quantity is gross sales. Net sales is this full uh full formula here. And uh
39:52discount. Yeah, this makes sense. Um then you can see oh it's just using where year order date is this. That makes sense. That makes sense. Um and then it's returning data as JSON, right?
40:07And so then we can see it here. Um all right. So here's a year-over-year comparison here. This is really great.
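The gross and net expressions the model wrote correspond to the classic order-details calculation. A minimal sketch of that arithmetic, assuming the discount column is stored as a fraction (0.1 meaning 10% off), which is an inference about the demo's schema rather than something confirmed on screen:

```python
def gross_sales(rows):
    """Gross revenue: unit price times quantity."""
    return sum(r["unit_price"] * r["quantity"] for r in rows)

def net_sales(rows):
    """Net revenue: gross minus the discount. Assumes discount is a
    fraction (0.1 == 10% off); this is an assumption, not confirmed."""
    return sum(r["unit_price"] * r["quantity"] * (1 - r["discount"]) for r in rows)

# Invented order-detail rows.
order_details = [
    {"unit_price": 10.0, "quantity": 3, "discount": 0.0},
    {"unit_price": 20.0, "quantity": 2, "discount": 0.1},
]
```

Spelling the formula out in a comment is exactly the kind of derived-metric metadata the query-history analysis surfaced earlier.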
40:15Um and then we can do something like um what was our, what was our tea? Let's look at the tea category, the tea products.
40:26 >> Ginger lemon tea or something. I forget what it was. >> Yeah, [laughter] ginger lemon tea.
40:30[clears throat] Yeah, let's see what it's doing here. Is it smart enough? Um, let's look at our join here. Where product name like tea? Hey, that seems good. Picked up some steaks, too. Let me filter the actual tea products. Hey, that's great. Thank you. Let's say, and not steak. I wonder if there was Oh, that makes sense, because it's just
40:48looking at both ends of this. This would definitely match steak, right? Um, cool. So, it fixed its own query. I'm curious how it did it. Let's go look.
40:58How did it fix the predicate here where like t or
41:05Okay, so it's doing a lower to do an exact match. That makes sense. Okay. So, there's a space before it. So, we don't have any generic 'tea'. Hopefully it's smart enough to know that. That's a really interesting question, and um something that we can ask if we want.
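The self-correction here is worth spelling out: a bare `LIKE '%tea%'` substring match also hits "Steak", which is why the model tightened the match. A small Python illustration of the difference (the product names are invented):

```python
def naive_match(product, term):
    """Mimics SQL LIKE '%tea%': a bare substring test."""
    return term.lower() in product.lower()

def word_match(product, term):
    """Match whole words only, so 'Steak' no longer counts as tea."""
    return term.lower() in product.lower().split()

products = ["Ginger Lemon Tea", "Sirloin Steak", "Green Tea Sampler"]
naive_hits = [p for p in products if naive_match(p, "tea")]
word_hits = [p for p in products if word_match(p, "tea")]
```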
41:19Um, let's do that. Uh, are there any products starting with T?
41:31Right? Because we want to like audit this, too. We're not just like blindly going to accept some of this information, right? Great.
41:39And how did it do this? Let's just look. Okay, this is exactly what I would want it to do, right? So, what I like to do is like start to um treat this like an analyst, you know? Um, so I'm going to straight up just say uh let's make a uh
41:54report on tea in 2024 versus 2025 and then
42:00uh use MVIS to show it. And so I've built a little library as well that is public already called MVIS which we can use to plug into Claude as a skill and it will create nice looking charts for us um right inside of our model. So we can kind of really quickly get from looking at the data to visualizing it.
42:19Um of course we can also use the model to say like what should we do about this? How do we remove randomness from our sales process etc etc. So those are the types of things that we can do once we actually have this you know in our data set which is really great.
42:33We can also do uh it's also actually fairly good at auditing itself. So when you do see things that are unusual you can ask you can say hey like you know check check to make sure this is right.
42:41Um, and it will show you the exact logic in a way that is very straightforward.
42:46Um, and you can really, you know, use your powers as an analyst to um to get the right answers and really draw really cool insights out of it. All right, come on, Claude, let's go faster.
42:58Apparently, they just put me on the slow they put me on the slow train here. All right. Well, while this runs, um, let's start looking at questions.
43:06 >> Yeah. So, there was one, I think it was from back earlier, asking to uh go back over the DuckDB summarize.
43:14 >> Yeah. >> Um just kind of can go back over that. >> Was was there a specific >> I'm trying to find I think they just missed what was on the slide. Where did it go? Uh can you summarize how how it
43:29works? I guess. Well, clarify it. Clarify how it works. >> Summarize. Oh yeah. Okay. Okay. Okay.
43:36Sure. Um so summarize is just a function. Uh you just say summarize table name. It's basically a macro built into DuckDB. And so uh we can use summarize to um build all the metadata.
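In DuckDB the call is literally `SUMMARIZE table_name`. As a dependency-free illustration of the kind of per-column metadata it returns, here is a hand-rolled version of a few of those stats in plain Python; the real SUMMARIZE output has more fields (e.g. quantiles), so treat this as a subset only:

```python
def profile_column(name, values):
    """Compute a few of the per-column stats DuckDB's SUMMARIZE
    reports: type, min, max, distinct count, null percentage."""
    non_null = [v for v in values if v is not None]
    return {
        "column_name": name,
        "column_type": type(non_null[0]).__name__ if non_null else "unknown",
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "approx_unique": len(set(non_null)),
        "null_percentage": 100.0 * (len(values) - len(non_null)) / len(values),
    }

stats = profile_column("quantity", [10, 24, 5, 12, None, 40])
```

Persisting these profiles as JSON, as the demo does, makes them easy to merge into comment text later.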
43:50Hey, that's a nice looking chart. A little little tight down here. That's okay though. Um anyway, sorry distracted by my own my own charting here. Um
44:01uh yeah, so summarize, it's just a function. So you just summarize the table name and it returns all the metadata. It is fairly data intensive. It does pull things like quantiles, and so um if you have a particularly large data set, you may need to be more targeted
44:16with how you use it. >> Yeah. [clears throat] Yeah. Um, I think you covered this a little bit, but, you know, where could it be useful to look at the history of queries and the stats? Um, I mean you get basically all Yeah, >> I mean I think like you know um
44:40there's two ways I'm just going to drop a comment in here with summarize by the way. Um there's two ways to think about getting feedback around your data model, right?
44:50One of them is talking to customers, talking to your users. Um and but if you have query history, you can you can have a much more refined conversation, right?
44:58You understand how users are actually using your data sets, right? Um, and I think that is a key a key thing to think about.
45:06 >> Yeah. [sighs and gasps] Um, oh, they're saying, could you I'm not sure what these are, but could uh could you timestamp some of these to effectively give some sort of temporality to the scenarios?
45:23 >> Um, okay. >> I think that was on your analysis, your earlier analysis when you're running your script.
45:28 >> Yes, you absolutely could. Um, so this is actually a really interesting question. It's a little bit of a feature gap, right? So SQL comments are atemporal, right? Um, I think one thing I'm contemplating is I may do something where we add a metadata table to DuckDB or MotherDuck and we [clears throat] inject comments in there
45:47and then and we so that we have comment history over time, right? Um because the other thing where this matters is you're not always going to probably be using the same model to generate comments. So at some point you need to basically make sure hey when do I when does it make sense to go to the next model right um
46:02or I guess you you know what what happens in a lot of cases is like you know what we deployed this we deployed this workflow on Gemini 2.5 it works uh
46:12we don't have an eval that really works to tell us if we should go to the next one, so we're just going to run 2.5 until Google turns it off, um which could be at any time, right? I know OpenAI just deprecated a bunch of models, or they announced deprecating models, which breaks those workflows. So having some sort of
46:29eval loop uh makes a lot of sense um to to think about how to how to um tie those things together. But it's a good question. Um definitely something I'm thinking about and I will probably add to the project so that you have comment history in a table um and can kind of see what that how that's evolving over
46:44time, because I suspect you will also see your customers, um, meaning your users,
46:51asking different questions and different queries over time as well. That's a really good question.
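The comment-history metadata table contemplated above could be sketched as follows, using stdlib sqlite3 purely as a stand-in for DuckDB/MotherDuck; the table name and columns are this sketch's own invention, not anything the project has shipped:

```python
import sqlite3
from datetime import datetime, timezone

# A versioned comment store, so comment text survives model changes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comment_history (
        table_name  TEXT,
        column_name TEXT,
        comment     TEXT,
        model       TEXT,  -- which LLM generated the comment
        recorded_at TEXT
    )
""")

def record_comment(table, column, comment, model):
    """Append a new comment version instead of overwriting."""
    conn.execute(
        "INSERT INTO comment_history VALUES (?, ?, ?, ?, ?)",
        (table, column, comment, model, datetime.now(timezone.utc).isoformat()),
    )

record_comment("categories", "category_id", "Primary key", "model-a")
record_comment("categories", "category_id", "Primary key; joins to products", "model-b")

rows = conn.execute(
    "SELECT comment FROM comment_history WHERE table_name = 'categories'"
).fetchall()
```

Keeping the generating model alongside each version is what makes "when do we move to the next model?" answerable later.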
46:56Cool. Uh, Victor's asking, I forgot to show this. Uh, how do you think about where to add context in a comment versus in a Claude skill?
47:05 >> Oh, that's such a good question. So, I think like um uh I'm thinking about if it's about the data, I'm putting it in a comment. But a lot of times there's information that is not really about the data, but is like about the company and the context.
47:21That's a great thing to inject into like a project in Claude where you can have like, you know, some log of maybe like specific words that your company uses that are not the same as uh what they mean elsewhere, right? So, that's a good place to put that or like maybe some sort of event log that says,
47:37hey, you know, in the middle of 2025, we had
47:41a hit on revenue because of some tariff changes. Um, you know, make the model aware of that, so that when it runs an analysis on 2025 revenue it doesn't say, hey, your revenue dropped in July by 18%, every single time you run the analysis, right? Um, just kind of prime it with information like that. I think that's how
48:00I kind of think about it: information that doesn't really fit in the database and is more contextual to your company and the questions that are being asked, definitely put into like a skill or even a project as markdown. >> Yeah, exactly. Yeah, exactly, Victor, you know, data context versus business context. >> Yeah, yeah, yeah. Great, great, great,
48:15great framing. Yes, love that. >> Cool. Uh, Juan's asking, I know we've kind of gone over a lot. Can you just do a quick recap of kind of what your, you know, top to-dos and not-to-dos are?
48:26 >> Oh, sure. Um,
48:31so top to-dos. So basically um so the
48:36number one thing that I found is like SQL comments have led to the biggest improvement the most quickly. So whatever information you have, uh SQL comments seem to be like a great way to uh improve whatever you're doing with AI. Um, and then, uh, what
48:59I guess like what I'm doing here from a big picture standpoint here. Let me stop sharing. We don't need to see this although it is a nice looking report. Um
49:07uh big big picture basically we're profiling the data you know using that summarize function. We're then looking at query history breaking it down into its components summarizing that information uh and then mashing that up together and adding that into a comment. There's a bunch of other techniques mentioned in these papers um that are more like runtime techniques. So, I haven't
49:30covered those because I don't really have access to Claude's runtime, right? And so, if you're building an agent harness, those techniques are interesting, but they're not really interesting for uh using like a chat app, which I think is like where I'm seeing things going in the medium term. Um although I expect, you
49:47know, people build really cool agentic workflows, too. Um, and then in terms of what didn't work, you know, uh, really long prompts like just like every time you run into an issue, like adding another line to your prompt, like, uh, at some point they just stopped working.
50:01[snorts] Not sure exactly what the break point is. It probably massively depends on the model as well. Uh, at some point it just stops working. Um, and, uh, you know, doing really sophisticated prompt injection also seemed to just make it really hard to debug, right? As I showed you earlier in BirdBench, like I'm looking at those logs uh when I'm
50:21building these types of systems, right? I'm looking at all of the requests and responses made to the AI, and like having it be transparent to you as a user is really important, because, you know, if you can't debug it, it's really hard. Well, I would actually say it a different way. Uh I
50:40like to build systems that are only as smart as I can debug, right? And so that means that maybe I'm making trade-offs for simplicity, but also it means I can understand and reason about how it works. And that has served me well. Um will it serve you well? I would hope so too. But um uh yeah, and then the last
50:56thing, which is my funniest technique, is some of these papers are just like, what if we just ran a bunch of stuff and then picked the one that looked the best? Um those techniques don't work and they're really expensive. Um don't don't don't do that. Uh maybe if you're like a top AI lab, you can do that and it'll
51:11work. Um certainly my experience was that was a great way to make uh you know an eval run that cost $100 cost $500. So
51:21that's kind of the summary there. Cool. Awesome. Uh any any other questions? You know I'd love to hear you know from the chat like how are you guys uh using AI?
51:31Are you guys having AI agents query your databases data warehouses? What issues are you running into? Um yeah any other questions?
51:41scroll through the chat and see if I missed anything.
51:50So, Jacob, you will share, um, you know, kind of tidy up some of this, uh, your script here, and we'll share it on GitHub afterwards. Is that right?
52:00 >> Uh we're not going to share it on Well, it might be like a gist. I I might just like give you like a gist or something.
52:05Um uh the full repo will be shared when it's ready and it's just not ready yet.
52:11Um so um yeah uh we will we will
52:16definitely share it though. I I'll probably just pack it into like a single a single long Python function and just share it and you all can use it. Um but just know that I will uh I will be open sourcing it as soon as it's ready.
52:27 >> Awesome. Uh this one's kind of long. I'm trying to read through it. Um yeah, it looks good. We'll go through. Yeah. Um, a bit off topic, but how do we see the opportunity of data lakes storing companies' questions for internal AI interfaces, and their chunks and embeddings?
52:43Um, thinking about a future necessity to rebuild, um, why these were the top documents, why these
52:55top documents were answered from our internal docs over time. Yeah, I think so. That's a great question, and yes. Um, uh, I think there's a curation element to this too. Like one thing that I did in the query history is I spent a lot of time thinking about how to curate the query history so that it
53:17shows me the things that matter the most. There's a long tail of things where it's like, oh, this join was used once, I don't really care about that one, um uh to be honest with you, for what I'm trying to build here. Um and so I think the same thing applies to storing the company's questions. It's like
53:34um you don't need to probably store all questions. Like you should probably do some cosine similarity on them or whatever, and, you know, if there's already a similar question, you know, do some evaluation on whether you replace this question with that other question, or uh, you know, you don't need all of them. There's just way too much
53:52information. You want this to be I think this goes back to my thought about like keeping the system simple. You could definitely have um a huge collection of questions, you know, thousands or millions of questions. Um but I think that that's probably uh overkill.
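The curation idea, only storing a question if nothing similar already exists, can be sketched with a cheap bag-of-words cosine check standing in for real embeddings; the 0.8 threshold and sample questions are invented:

```python
import math
from collections import Counter

def bow_cosine(q1, q2):
    """Cosine similarity on bag-of-words counts, a cheap stand-in
    for real embeddings when curating stored questions."""
    a, b = Counter(q1.lower().split()), Counter(q2.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def should_store(question, stored, threshold=0.8):
    """Keep a question only if nothing similar is already stored.
    The threshold is arbitrary for this sketch."""
    return all(bow_cosine(question, s) < threshold for s in stored)

stored = ["what were net sales in 2025"]
```

In practice you would swap the bag-of-words vectors for embedding-model vectors, but the store-or-skip decision is the same shape.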
54:06There's probably, like, maybe single-digit thousands, uh, you know, maybe 2,000 questions that actually matter, and it's probably less than that. It's probably like there's 50 that matter and, you know, 950 that are good
54:21but don't matter as much. And so if you can figure out what those are and then you know rebuild that. The other thing I would say is like one thing I'm thinking about here and this is a little bit more theoretical and thought leadership is um uh I would um I would think like you actually want to
54:42save, it's important to save those prompts, because if you have those prompts from users about what questions they're asking, you can then have like a gold answer set, and you can run evals on how effective your data model is at answering that, right? So, like, the contract shifts
54:59from being hey I always have this set of columns and tables to being I always have this set of answers and I think that is a key thing that lets us keep evolving in the AI space is we think about it more like answers and less about like queries and tables. Say how do we make sure that this question
55:15always gets the right answer and how do we do that? Um and you know that's how I'm thinking about that. Um I do see this question from Scott here. Do you want to throw this one up here too? How would you No, go ahead.
55:26 >> How would I approach that without query history? Great question. Um, the script that I have runs both with and without history.
55:31Originally, I built it without history. Um, history adds a lot of, uh, good information on joins um, and a little bit more context around how things are used together. Um, so it still works.
55:43Um, it's just not quite as well. So, I'm actually working on an eval that I will publish, which will run this um in three modes. So it'll run the BirdBench data um kind of raw, just prompts only, and then it will run it um with comments that did not include history, and then it will run it on
56:03comments that include history. So you can kind of see hey what does it look like to um [snorts] uh what does it look like to ask questions with and without these different modalities and and you can kind of use that as a a guide to say okay how how much more effective can I get um if I don't have query history.
56:20Yeah. Cool. Thank you. Awesome. I think we have just, you know, one or one or two minutes left. So, time for one last question if you want to squeeze it in.
56:30Um, we'll give it just a little bit. Let me scroll back through here. Yeah, it looks like, you know. All right. Oh, yeah. Juan, one's asking, "Can we share the MVIS library?" Yes, we will share I I'll share a link out to that. I think you have a gist of that, Jacob. Um, >> yeah, we'll share it
56:49 >> we'll share that out uh in the follow-up email um afterwards. So, >> yep. Uh we'll share that um uh and
57:04um I'm just putting my social links in here. Feel free to continue the conversation. Um >> uh we'd love to connect with you all and um see what we can chat about and do some fun stuff together. Uh we will share links on MVIS, share links on uh this metadata enrichment. We've got some other really cool
57:20stuff uh in the hopper uh in this vein uh for everybody as well. >> Yep. Awesome. With that, you know, like I said, you know, if if you you know, you can always sign up for Mother Duck.
57:29It's free. Try it out. You can use our MCP with uh Claude, uh you know, Cursor, ChatGPT, uh uh you know, whatever.
57:38Play around with it. Uh you know, it's fun. We've found a lot of use out of the MCP internally. Uh I was just doing some analysis with the AI and seeing, like, hey, more and more of our queries are being written by AI compared to humans. Uh I thought you
57:55know >> not taking over yet but but soon it will be uh and you know our data engineer is uh you know having a having a good time preparing. So thank you all. Uh take care. Thanks for tuning in.
58:06 >> Thanks everyone. Chat later.
FAQs
What is the most effective way to improve AI-generated SQL accuracy?
Adding SQL comments with column-level metadata, including data profiles, sample values, join hints, and common query patterns, produced the biggest accuracy improvement. In testing against the BirdBench benchmark, adding database profiling via DuckDB's SUMMARIZE function and enriched comments boosted accuracy from 30% to 74%. Long system prompts, self-consistency techniques (running multiple candidates), and LLM-generated descriptions without grounding data were far less effective.
How do you use DuckDB's SUMMARIZE function to profile data for AI?
DuckDB's built-in SUMMARIZE function returns detailed metadata for any table in a single command: column types, min/max values, unique counts, sample values, and quantiles. You run it once per table, persist the results as JSON, and then use that profiled data to enrich your SQL comments. This information helps AI models understand what the database actually contains, which is the most important factor for generating correct SQL queries.
How does query history analysis improve text-to-SQL performance?
Analyzing query history reveals which tables and joins analysts actually use, common predicates and date ranges, derived metric calculations, and real-world query patterns. This information is parsed using SQLGlot, weighted by frequency, and injected into column and table comments. When an AI model reads these enriched comments through the MotherDuck MCP, it understands not just the schema but how the data is used in practice, which leads to significantly more accurate SQL generation.
Is text-to-SQL a solved problem for data warehouses?
According to the presenter's analysis, text-to-SQL is effectively solved when the data warehouse metadata is properly prepared. Testing against the BirdBench benchmark revealed that about 20% of the benchmark's "gold" SQL answers are actually incorrect, and the benchmark saturates around 75% accuracy. Teams scoring higher are likely overfitting to bugs in the dataset. The real challenge is not the AI model's capability but making sure it has enough context about the data through profiling, comments, and query history.
Related Videos

9:09
2026-02-13
MCP: Understand It, Set It Up, Use It
Learn what MCP (Model Context Protocol) is, how its three building blocks work, and how to set up remote and local MCP servers. Includes a real demo chaining MotherDuck and Notion MCP servers in a single prompt.
YouTube
MCP
AI, ML and LLMs

0:09:18
2026-01-21
No More Writing SQL for Quick Analysis
Learn how to use the MotherDuck MCP server with Claude to analyze data using natural language—no SQL required. This text-to-SQL tutorial shows how AI data analysis works with the Model Context Protocol (MCP), letting you query databases, Parquet files on S3, and even public APIs just by asking questions in plain English.
YouTube
Tutorial
AI, ML and LLMs
2026-01-21
The MCP Sessions - Vol 2: Supply Chain Analytics
Jacob and Alex from MotherDuck query data using the MotherDuck MCP. Watch as they analyze 180,000 rows of shipment data through conversational AI, uncovering late delivery patterns, profitability insights, and operational trends with no SQL required!
Stream
AI, ML and LLMs
MotherDuck Features
SQL
BI & Visualization
Tutorial


