Does "AI-Ready Data" simply mean "Good Data Modeling"?

2026/02/06 - 10 min read

Does "AI-Ready Data" simply mean "Good Data Modeling"?

2026/02/06 - 10 min read

A few months ago, I wrote about why we might not need the semantic layer. The argument was that AI could discover business logic from query history instead of requiring humans to predefine every metric. I believed it. But I didn't have the data to prove it.

Now I do.

This started with a question from one of our investors: "How do the different models perform on BIRD with the MotherDuck MCP?" So I ran the experiment. Three frontier LLMs (Claude Opus 4.5, GPT-5.2, and Gemini 3 Flash), each connected to the database via the MotherDuck MCP server, running against BIRD Mini-Dev. That's the official 500-question development split of the BIRD benchmark. 11 databases covering finance, sports, education, and healthcare. The BIRD team curated it for broad coverage across domains and difficulty levels.

The data models are simple. Average of 7 tables per database. None have more than 13. The joins are mostly one-to-many, max two or three hops, zero many-to-many relationships. The kind of schema you could understand in five minutes by reading the DDL.

The result? 95% accuracy. No semantic layer. No query history. No special context. Just the schema.

But that number needs an asterisk, and honestly, the asterisk is the most interesting part.

What 95% really means

Here's what I actually measured.

The BIRD benchmark scores accuracy using execution accuracy (EX): run the predicted SQL and the gold SQL, compare the result sets, binary pass/fail. Under those strict rules, current state of the art is about 76%. My models scored 64% on train and 58% on test.

Sounds bad. But BIRD's strict scoring has a well-documented problem. A 2025 paper introducing the FLEX metric found that BIRD's execution accuracy only agrees with human experts 62% of the time. Nearly 4 in 10 judgments are wrong, mostly false negatives, where the benchmark rejects answers that humans would accept.

That 62% jumped out at me because it almost exactly matches my blended strict-scoring accuracy of 60.5% (64% train / 58% test). Same observation, different direction. FLEX got there with human reviewers. I got there by relaxing the test harness.

Think about what that means for the leaderboard. If the benchmark only agrees with humans 62% of the time, then to score above 62% under strict rules, you have to start reproducing the benchmark's mistakes. You stop learning to write correct SQL. You start learning to match BIRD's specific, sometimes wrong, interpretation of each question. The systems at 76% have baked those judgment errors into their training. They score higher by getting worse at the actual task.

So I built a more realistic evaluation. I split the 500 questions into a train set (151 questions) and test set (349 questions). I used train to calibrate the evaluation: hand-reviewing failures, curating corrected "platinum" answers where BIRD's gold SQL was wrong, and tuning the partial-match rules. The test set was the holdout. Since I did some prompt optimization on train, I'll show both numbers throughout so you can see how much (or how little) that mattered.

Here's what accuracy looks like as you relax the scoring, tier by tier:

Scoring tier	Train	Test	What it adds
Gold match only (≈ official BIRD)	64.0%	58.2%	Strict result set equality
+ Platinum answers	73.1%	58.5%	Corrects known errors in BIRD's gold SQL (see note below)
+ Formatting tolerance	78.8%	65.5%	DISTINCT differences, extra columns, rounding
+ LLM judge	94.9%	94.4%	"Would a human accept this answer?"

The platinum corrections only exist for the train set, since I hand-reviewed those 151 questions. That's why the platinum tier barely moves on test (+0.3pp vs +9.1pp on train). But look at the judge tier: 94.9% train / 94.4% test. Half a percentage point apart. The evaluation holds up on the holdout even without my hand-curated corrections.

Train set (151 questions, all 3 models):

Model	Strict (≈ BIRD EX)	Realistic	Total cost	Tool calls (p5 / median / p95)
Gemini 3 Flash	68.2%	94.0%	$1.80	3 / 6 / 9
Claude Opus 4.5	64.9%	95.4%	$26.37	4 / 6 / 9
GPT-5.2	58.9%	95.4%	$6.87	4 / 7 / 12

Test set (349 questions, 2 models):

Model	Strict (≈ BIRD EX)	Realistic	Total cost	Tool calls (p5 / median / p95)
Gemini 3 Flash	60.7%	94.6%	$3.96	4 / 6 / 9
GPT-5.2	55.6%	94.3%	$15.32	4 / 7 / 11

Claude Opus wasn't run on the test set. After seeing all three models converge to ~95% on train, spending another $60+ to prove the same point on 349 more questions didn't seem worth it.

The median model makes 6-7 MCP tool calls per question with an iteration limit of 10. A typical question looks like: inspect the schema, explore some columns, draft a query, check the results, refine, done. Some models like GPT-5.2 make multiple tool calls per iteration, which is why its p95 of 12 exceeds the iteration limit.

All three models land at 94-95% under realistic evaluation regardless of where they start under strict scoring. On train, the gap between "best" and "worst" shrinks from 12.6 percentage points to 1.4. On test, from 5.1 to 0.3. Pick any frontier model.

The benchmark is wrong sometimes

BIRD is a good benchmark. It also has bugs. In the train set alone (151 questions), I found 49 where the "gold" SQL is demonstrably incorrect. I didn't hand-review the test set, so the real number across all 500 is likely higher.

Here's one that stuck with me. The question asks for a list of schools whose composite test score exceeds 1,500. The gold SQL checks the count of students scoring above 1,500. Completely different query, completely different answer. You read the question, you read the "correct" answer, and you think: wait, that's not what was asked.

I curated corrected "platinum" answers for these cases. On average, about 14 of the 151 train questions per model matched a platinum answer instead of the gold, adding 9.1 percentage points.

Humans don't care about formatting

On train, another +5.7pp comes from accepting results that are substantively correct but fail strict equality:

Extra columns (30 cases): the model returned the requested data plus some additional context. A human would say "thanks, that's helpful." The benchmark says "fail."
DISTINCT mismatches (41 cases): the model used SELECT DISTINCT when the gold didn't, or vice versa. Unique values match perfectly. A human wouldn't even notice.
Rounding differences (3 cases): gold says 24.67, model says 24.6667. Same number, different precision.

None of these are wrong answers. They're formatting differences that only matter to a string comparison function.

The LLM-in-the-loop

The remaining gap (16pp on train, 29pp on test) comes from an LLM judge. I used Gemini 3 Flash to review each "failed" answer and ask: does this SQL actually answer the question?

The judge does more heavy lifting on test because there are no platinum corrections to catch benchmark bugs first. What kinds of things was it rescuing?

Reason	Count	What happened
Missing rows	57	Model filtered more strictly than gold in a defensible way
Extra rows	33	Model interpreted the question more broadly
Values close	19	Numeric results within tolerance
Empty result	14	Model returned nothing, but the logic was sound
Missing columns	11	Fewer columns returned, but the question was answered

These are judgment calls. Should "list all schools in the district" include charter schools? Reasonable people disagree. The strict benchmark picks one interpretation and penalizes everything else. The judge just asks whether the model's interpretation is defensible.

If you're building AI analytics, this matters. Nobody ships a text-to-SQL product where the user sees raw results with no review step. There's always a human or an LLM checking the output. The 94-95% reflects how these products actually work. The 58-64% reflects how benchmarks work.

So what about context?

You'd expect more context to help. Column comments, descriptions, hints about what the data means. That's the intuition behind semantic layers and context engines.

I tested it. Same 500 questions, all models, with and without column comments on every table.

Schema	Train	Test
No comments	94.9%	94.4%
With comments	96.0%	94.6%
Delta	+1.1pp	+0.2pp

A percentage point on train, barely anything on test. Most questions saw zero change in correctness.

Break it down by database and it gets interesting. The harder the schema already is, the more comments help (blended across train and test):

Database	Base accuracy	Comment effect
debit_card_specializing	85.5% (hardest)	+8.7pp
european_football_2	93.2%	+3.4pp
california_schools	95.7% (easiest)	-2.9pp

Comments help when the schema is genuinely confusing. debit_card_specializing (try to guess what that schema looks like) got the biggest boost. But schemas with intuitive names and obvious relationships? Comments made things worse. The models had already formed a correct mental model, and the comments introduced noise.

Every developer knows this about code comments. Useful for genuine ambiguity. Harmful when they state the obvious. // increment i by 1 has never helped anyone.

Why simple data models work

The BIRD databases aren't enterprise data warehouses. They're simple:

7 tables on average (range: 3-13)
9 foreign keys on average, mostly one-to-many
Zero many-to-many relationships across all 11 databases
Max join depth of 2-3 hops, no deep hierarchies
Only 1 self-join in the entire benchmark

No junction tables. No polymorphic associations. No slowly changing dimensions. Table names and column names tell you most of what you need to know.

LLMs read these schemas the way an experienced analyst reads DDL. They see schools with columns school_name, district, and enrollment, and they know what to do. Foreign key from scores to schools? They know how to join. Nobody needs a semantic layer to explain that "enrollment" means "the number of students."

Good data modeling is the semantic layer. When your tables are well-named and your joins are straightforward, the LLM has everything it needs. There's a growing ecosystem of tools promising to make your data "AI-ready" through context layers and metadata platforms. Some of that will matter for genuinely complex domains. But for most orgs? Clean up your data model. That's the highest-ROI investment you can make.

What I'd invest in first

Every environment is different, but here's how I'd prioritize based on what I've seen.

Start with the data model. Clean tables, clear names, straightforward joins. If an experienced analyst can look at your schema and understand it in a few minutes, an LLM can too.
Then add targeted context. Column comments and metadata, but only where confusion actually exists. Document the debit_card_specializing tables, not the schools tables.
Query history comes next. It gets more important as the domain gets complex, especially for discovering undocumented business rules (like "abnormal GOT > 60", which I wrote about last time). The BIRD databases have simple rules. But I'm working on DABstep next, which has a simple data model but very complex domain rules. The kind of knowledge that lives in people's heads, not in column names. Query history and curated context will matter a lot more there. Even then, the clean data model comes first.

Lastly, don't worry about a formal semantic layer - If your data model is clean and your context is targeted, it adds almost nothing for AI use cases. In fact, it seems to get in the way as AI is great at writing SQL and less great at other tools.

Start now

The bar for "AI-ready data" is lower than the industry is telling you.

You don't need a context engine, a semantic layer, years of query history, or a specialized metadata platform. You need a clean data model and an LLM. Find a domain that is ready for this and start there.

The gap between "benchmark accuracy" and "would a human accept this?" was 31pp on train and 36pp on test. That's a huge gap, and it closes the moment you put a human or LLM in the loop. Which is how every AI analytics product works anyway.

If your data model is clean, start today. Point an LLM at your schema via MCP and ask it questions. If your data model isn't clean, now you know where to start.

Follow-up to What If We Don't Need the Semantic Layer?. All accuracy numbers from 500 BIRD Mini-Dev questions across three frontier models on 11 databases. The evaluation framework is open source. It's heavily vibe-coded, so YMMV, but the data is real and I've looked at all of it.

FAQS

AI-ready data simply means a clean data model. Well-named tables, straightforward joins, and clear column names are enough for frontier LLMs to achieve 95% accuracy on SQL generation — no semantic layer, context engine, or specialized metadata platform required.

Not for many use cases. In BIRD benchmark testing, adding column comments to schemas improved accuracy by just 1.1 percentage points on the training set and 0.2 on the test set. Comments helped only when schemas were genuinely confusing (like a database called "debit_card_specializing"), but actually hurt performance on schemas with intuitive names. Good data modeling is the semantic layer — when your tables and joins are clear, the LLM has everything it needs.

Under strict benchmark scoring, frontier models like Claude, GPT, and Gemini score 58–64% on the BIRD benchmark. But that scoring method only agrees with human experts 62% of the time. Under realistic evaluation — where an LLM or human reviews the output, the way real products work — accuracy reaches 94–95% across all three models, with less than 1.5 percentage points separating them.

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is a widely used benchmark for testing how well AI models convert natural language questions into SQL queries. Its Mini-Dev split contains 500 questions across 11 databases covering finance, sports, education, and healthcare. While BIRD is the standard for measuring text-to-SQL performance, its strict execution accuracy scoring has known limitations — a 2025 study found it disagrees with human judgment nearly 40% of the time.

The data models that LLMs handle best are simple and well-organized: mostly one-to-many joins with a max depth of 2–3 hops, no many-to-many relationships, and descriptive table and column names. LLMs read these schemas the way an experienced analyst reads DDL — if you can understand the schema in five minutes, the LLM can too. Prioritize clean table names, clear foreign keys, and straightforward joins over adding metadata layers.

Start with the data model: clean tables, clear names, straightforward joins. Then add targeted context like column comments, but only where genuine confusion exists — document the confusing tables, not the obvious ones. Next, consider query history for complex domains with undocumented business rules. Skip the formal semantic layer; if your data model is clean and your context is targeted, it adds almost nothing for AI use cases.

Benchmarks like BIRD use strict execution accuracy: they compare result sets byte-for-byte. This penalizes queries that return extra helpful columns, use DISTINCT differently, or round numbers to a different precision — none of which are wrong answers. It also penalizes correct SQL when the benchmark's own gold-standard answer contains errors. In our train set, 49 out of 151 questions had incorrect gold SQL. Real AI analytics products always include a human or LLM review step, which is why practical accuracy (94–95%) far exceeds strict benchmark scores (58–64%).