The Surprising Truth About AI-Native Semantic Layers

- 14 min read

BY

The Surprising Truth About AI-Native Semantic Layers

- 14 min read

BY

I got DABStep to 100%. Every question in this benchmark was built to be difficult: 445 questions in total, after I set aside 5 with broken answer keys (more on this later). But the number of questions isn't the point. Here's what I learned: the context that got me to 100% isn't a static description of my data, but a fit-for-LLM description of how one specific model reads my data. Said another way - the LLM itself is tightly coupled to the semantic layer. This is a new world of data definitions, and I don't think that's a flaw. I think it's what a semantic layer becomes once you build it with AI.

Approach / agentModelEasyHard
Hierarchical semantic layer (this post)gemini-3-flash100%100%
NVIDIA KGMON Data Explorer — verified LB #1Claude Haiku 4.587.50%89.95%
DataPilot — verified LB #2Qwen386.11%87.57%

I didn't expect to land here. I've spent two posts arguing the semantic layer is smaller than you think — maybe you don't need one, maybe your clean data model already is one. At the end of that second post I promised a harder test. This is it, and it brought the semantic layer back in a different form.

But that form was never really free or open. LookML married you to Looker, DAX to Power BI; dbt metrics, Malloy, MDX — each one is a language plus the engine that runs it, and you were bound to both. I propose that we are still tightly coupled, but in a different way (Model to Semantic Layer vs Semantic layer to language).

INFO: How I got to 100%, in three moves — each its own section below
  1. Take the data as-is. Raw tables in MotherDuck, exactly as DABStep ships them (except in a database instead of csv or json).
  2. Move the domain knowledge into a hierarchical layer. The agent reads it on demand: domains → summaries → full text. Instead of vector search, it's classical indexing.
  3. Let an LLM author, then tune it against the eval. A cheap, fast model provides answers; a smarter model reads the misses and reshapes the layer. Loop until the number of errors is 0.

Exploring a harder data benchmark

DABStep is a synthetic payments dataset, built by Adyen. The schema is straightforward: transactions, merchants, fee rules, a few lookups. You could read the DDL in a couple minutes.

The rules are where it layers complexity in. A single transaction matches many fee rules across nine dimensions: card scheme, account type, ACI (the authorization characteristics indicator, a code for how the payment was initiated), credit flag, intracountry, MCC, capture delay, monthly volume, monthly fraud level. A NULL in any dimension means "matches anything." And there's no "most specific rule wins" — every matching rule applies and the fees sum. Add fraud-steering questions, MCC quirks, and a manual full of definitions that don't quite match the column values, and you have a benchmark where the schema tells you almost nothing and the domain tells you everything.

I reviewed a handful of other benchmarks (e.g. Spider 2) and landed on DABstep to explore next. It's the inverse of BIRD. Clean data but a complex, adversarial ruleset.

The inspiration for this post came from Anthropic's post on how they run self-service analytics with Claude. Their line that stuck with me: offline eval accuracy should be ~100%. Not 95%, not "state of the art." You should be able to answer 100% of the questions you actually evaluate against. So that was the goal — not generalize to some held-out set, but get there on a hard set of questions a real user might ask.

One honesty note before the numbers, because it's the same one from last time: 445, not 450. I set aside 5 questions where the "gold" answer is provably wrong — our SQL is correct and the gold disagrees with the data. I verified each one. Adyen doesn't publish the answer set, so we have to mine it from the submissions on HuggingFace. For those 5 with low certainty, I excluded them from my eval — the logic I reviewed was sound, but the consensus answers didn't match.

A quick history of different approaches

I didn't get to 100% in one jump. The path was 88 → 93 → 100, and each number was a different idea about where the knowledge should go.

88% — semantic search. An early version of my eval pulled context fragments with vector search: embed the question, find the closest chunks, hand them to the model. It worked, sort of, and it capped out around 88%. Pushing past that meant reranking, relevance tuning, evals for my evals — a deep ML problem I'm not an expert in and didn't want to become one for this.

It's worth being precise about why search caps out, because it's the whole reason the next step worked. Vector search asks the model to translate the question into keywords, embed those, compare vectors, return chunks. The weak link is that first step: question to keywords. That's not what these models are trained to be good at. Reading a document and reasoning about it? That they're extremely good at. An LLM is, in a sense, a distillation of all those embeddings already. My mental model is: sending it back to the raw vectors is asking it to do the worse version of a thing it's great at.

93% — stuff everything in. Next, I ripped out search and went the other direction: dump the context into the prompt, and also bake knowledge into the warehouse like I have done historically with dbt. I added schema comments, macros, views, derived tables. This made the data itself carry the domain complexity. That got me to 93% - which indeed would rank first in the leaderboard.

What I noticed is that now the prompt and the schema were my only tuning surfaces — brittle and hard to tune. Additionally, the prompt was vast and I paid for every token on every request. When one question was wrong, there was no surgical fix — just a bigger prompt and a slightly different view. I was making the data smarter and hoping it rubbed off.

The pivot: stop touching the data

The version that hit 100% did the opposite of the 93% version.

I loaded the data as-is. Raw payments, merchants, fees, dropped into MotherDuck exactly as DABStep ships them. No comments, no macros, no views, no derived tables. Then I put all the domain knowledge in a semantic layer beside the data, and let the agent reach into it on demand. The agent's only relationship with the warehouse is plain DuckDB SQL against plain tables.

This is the through-line for all three posts. The real question was never "do you need a semantic layer." It's where does the business knowledge live?

  • BIRD: in the data model itself. Clean schema, nothing else needed.
  • DABStep at 93%: I tried to bake it into the warehouse. Capped out.
  • DABStep at 100%: in a layer, over raw data. The warehouse stays simple.

Once I pulled the knowledge fully out of the data and into the layer, the layer became something that stands alone.

Hierarchical retrieval, not embeddings

The layer is just text. What makes it usable is how the agent reaches it: a single tool, semantic_lookup, with three modes. It's classical indexing — the kind of thing we built before anyone said "embedding." You can see an example by opening the back pages of any textbook.

  1. Call it with no arguments, you get the list of domains — fees, bucketing, SQL patterns, terminology, answer format.
  2. Call it with a domain, you get a one-line summary of every item in it, each with an ID.
  3. Call it with IDs, you get the full text of those items.

Domains, then summaries, then the context itself. The agent reads the summaries and chooses. No similarity gamble, no "closest chunk," just an explicit, transparent walk down a tree. It costs one or two extra tool calls per question. In exchange, the agent finds the right context every time.

My mental model here is: give the model tools that match how it works. It's a reading-and-reasoning machine. So let it read a menu and pick, instead of asking it to guess keywords for a search engine. Same model, better tool, and the last 7 points showed up.

The model can write its own layer

I didn't hand-author the domains and topics. I gave Claude Opus 4.8 a pile of context: the DABStep manual, which reads like CRM or ERP API docs, full of table nuances and how things map. Then I asked it to cluster the whole thing into domains and topics. Call it 95% Opus, 5% me nudging.

With that prior material already written, hierarchical and domain-based, I bootstrapped a working layer to ~98–99% in about two or three hours.

The catch, and it's the same catch Anthropic hit: the model can only write a good layer because it's continuously tested against questions, answers, and traces. When they tried to auto-generate metric definitions from raw tables and logs, they got plausible-looking definitions that encoded the exact ambiguities they were trying to remove. Authorship without the eval set is just confident guessing. Authorship plus a hard eval set is a flywheel, a closed feedback loop.

The rules that matter can't be fetched

One thing that was counterintuitive to me: you want your model lazy, but not too lazy.

The model serving answers is cheap and fast — gemini-3-flash (via OpenRouter), reasoning turned to low. I did this on purpose; as a general principle, I want users getting answers fast and cheap. But this low-reasoning model has a habit: it skips work it can get away with skipping. Specifically, it'll glance at the domain list and jump straight to writing SQL — never fetching the context that would have saved it.

So the most important rules can't live behind semantic_lookup. They have to be in the always-on skill, the part that's in the prompt every single time.

As a concrete example: there's a family of fraud-steering questions where the guideline literally shows an answer format like {card_scheme}:{fee} — and that format is a trap. The real answer is just a single ACI letter, like E. When that rule lived in fetch-gated context, the cheap model never loaded it and confidently answered TransactPlus:27.51 — the right shape, following the misleading guideline, marked wrong. I promoted the rule from a context fragment into the skill (included every time) and the misses disappeared.

Which means the line between "always-on skill" and "fetched context" isn't really "general vs. specific." It's "can I trust this model to go get this, or do I have to use a firmer hand?" And that line moves with the model. A smarter, slower, more expensive model will fetch more on its own. A cheaper one needs more shoved into the prompt up front. Even where you put the knowledge turns out to be a property of the model you're serving with.

The loop is scriptable

The way I actually got from 95% to 100% is a loop, and it's boring in the good way — boring enough to automate.

I would run the eval on the cheap model. I collected the misses and the full traces — what the agent fetched, what SQL it wrote, what it predicted versus the gold. I then handed the failures to the smartest model I had (Opus 4.8 inside of Claude Code) and let it read the context and the traces and propose changes. I re-ran just the failing questions three times until they were stable. Then re-ran the whole set to catch anything I'd broken in the questions that touch the same context. As a side note, this is why it's important to use a fast, cheap model. A single run of all 445 questions costs ~$8 in gemini-3-flash, and opus 4.8 was somewhere in the range of ~$160 (and took 3 times as long).

The fixes come in three flavors, and naming them helped me see the pattern:

  • Under-specified — the context was thin, so add the missing detail.
  • Too similar — two items blurred together, so make them distinct.
  • Over-specified — a rule written too narrowly ("join this key this way for this case") that should have been general.

What I'm really doing is tuning the shape of the data to match how the model uses it. The traces show me how it was thinking. Then Opus reshapes, and we re-run.

Notice the two models doing two jobs. The expensive one authors and refines the layer. The cheap one serves. The whole run lands at 100% for about $7.91, roughly two cents a question. I pay Opus prices to build the map once, Flash prices every time someone reads it.

The layer is coupled to the model

Everything above is tuned to one model's view of the world — gemini-3-flash, at low reasoning. The layer is not generic. Hand it to a different model and it might do fine — 90%, 98%, who knows. But the edge-case tuning that bought the last few points to 100% is 100% specific to the model I tuned against. You cannot pull the three apart. The LLM, the data, and the layer are a single system, and the layer is the artifact where they meet.

The coupling itself isn't new — I said as much up top. What's easy to miss is that this time it's hidden. LookML and DAX wore the lock-in on the outside; you always knew which tool you'd married. The new layer is just plain text — no DSL, no proprietary modeling language — so it looks free, but it's not. You won't feel the binding until the model changes, or a newer, faster / cheaper / more accurate model drops.

That's just the shape of an AI-native semantic layer. It's not a problem to solve, but it has consequences I didn't fully appreciate until I was building on top of them.

You can't let users freely pick models. If finance runs one model and marketing runs another, they get different answers from the same data and the same layer. Consistency was the entire reason semantic layers exist. So you own the model interface, or you maintain one layer per model. There's no third option where everyone picks their favorite and the numbers still agree.

Versioning is a moving target. This is the strange one to me. A third-party model can change underneath you — Opus shifts behavior through system-prompt updates fairly often — with no change to your data and no change to your layer. Questions that were right start coming back differently. There's no way to detect this drift except by continuously running the eval. And there's no "roll back," because the thing that changed isn't yours. The layer is a living thing you run and re-tune, not a definition you write once and freeze.

This is an argument for owning the model. Not because a smaller open-source model is smarter (it's not), but because it's controllable. It won't change out from under you on someone else's release schedule. That's a real reason to consider running your own, and it's a genuine trade-off to weigh against just using the best hosted model and re-tuning when it moves.

I want to be careful here, because I've spent a while arguing that frontier models are basically commodities for SQL work — pick any of them, you land in the same place. Both things are true, at different layers. The recipe is portable: hierarchical retrieval, an LLM-authored layer, a scriptable refinement loop — that works on any model. The tuned artifact is coupled. The serving model is a cheap, swappable commodity right up until you swap it, at which point you're not changing a setting, you're re-running the loop. You should pick the model deliberately.

Where the knowledge lives

So, the trilogy, as one question: where does the business knowledge live?

Maybe you don't need a separate place for it. For clean data, it lives in the data model itself. For hard domains, you can't bake it into the warehouse — I tried, and it capped out at 93% — so it lives in a curated layer over the data, and probably coupled with good domain modeling, too. This layer is one the model can author, that your eval set keeps honest, and that's bound to the specific model you serve with.

The whole thing — the skill, the semantic layer, the refinement loop — is open source. It runs against a MotherDuck warehouse with raw tables and an agent that writes ordinary SQL, so the domain-to-topic-to-context structure is right there to point at your own data and adapt.

But I'd hold onto the reframe more than the code. You're not maintaining a dictionary of your business. You're maintaining a map of how one specific model sees your business — and that's a thing you have to keep re-drawing as both your business and the models evolve.

Subscribe to motherduck blog

PREVIOUS POSTS