What 95% really means
Here's what I actually measured.
The BIRD benchmark scores accuracy using execution accuracy (EX): run the predicted SQL and the gold SQL, compare the result sets, binary pass/fail. Under those strict rules, current state of the art is about 76%. My models scored 64% on train and 58% on test.
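To make "binary pass/fail" concrete, here is a minimal sketch of an execution-accuracy check. It is not BIRD's official script or the harness behind these numbers: the schema, the queries, and the helper names `run_query` and `execution_match` are made up for illustration, and it assumes an SQLite database.

```python
import sqlite3

def run_query(conn, sql):
    """Execute sql and return all rows, or None if the query fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def execution_match(conn, predicted_sql, gold_sql):
    """Binary pass/fail: True only if both queries return the same set of rows."""
    predicted = run_query(conn, predicted_sql)
    gold = run_query(conn, gold_sql)
    if predicted is None or gold is None:
        return False
    # Order-insensitive comparison of the result rows.
    return set(predicted) == set(gold)

# Toy database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools (name TEXT, city TEXT, score REAL);
    INSERT INTO schools VALUES
        ('A', 'Oslo', 91.5), ('B', 'Oslo', 78.0), ('C', 'Bergen', 85.25);
""")

gold = "SELECT name FROM schools WHERE score > 80"
same_answer = "SELECT name FROM schools WHERE score >= 85 ORDER BY name DESC"
extra_column = "SELECT name, city FROM schools WHERE score > 80"

strict_same = execution_match(conn, same_answer, gold)    # differently written SQL, same rows
strict_extra = execution_match(conn, extra_column, gold)  # right rows plus one extra column

print(strict_same, strict_extra)
```

The second prediction returns the right schools with one extra column, and a strict check still rejects it. That is the kind of false negative discussed below.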
Sounds bad. But BIRD's strict scoring has a well-documented problem. A 2025 paper introducing the FLEX metric found that BIRD's execution accuracy only agrees with human experts 62% of the time. Nearly 4 in 10 judgments are wrong, mostly false negatives, where the benchmark rejects answers that humans would accept.
That 62% jumped out at me because it almost exactly matches my blended strict-scoring accuracy of 60.5% (64% train / 58% test). Same observation, different direction. FLEX got there with human reviewers. I got there by relaxing the test harness.
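One way to reproduce the 60.5% blend is to weight each split by the number of model runs (questions times models), using the figures from the tables in this post. The weighting scheme is an assumption on my part, and `blended_accuracy` is a hypothetical helper:

```python
# Strict accuracy, question count, and number of models run per split,
# taken from the tables in this post.
splits = {
    "train": {"strict_pct": 64.0, "questions": 151, "models": 3},
    "test": {"strict_pct": 58.2, "questions": 349, "models": 2},
}

def blended_accuracy(splits):
    """Average strict accuracy weighted by runs (questions x models)."""
    total_runs = sum(s["questions"] * s["models"] for s in splits.values())
    weighted = sum(
        s["strict_pct"] * s["questions"] * s["models"] for s in splits.values()
    )
    return weighted / total_runs

blended = blended_accuracy(splits)
print(f"{blended:.1f}%")
```

Weighting by questions alone (151 vs 349) would give roughly 60.0% instead, so the exact figure depends on which weighting you pick.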
Think about what that means for the leaderboard. If the benchmark only agrees with humans 62% of the time, then to score above 62% under strict rules, you have to start reproducing the benchmark's mistakes. You stop learning to write correct SQL. You start learning to match BIRD's specific, sometimes wrong, interpretation of each question. The systems at 76% have baked those judgment errors into their training. They score higher by getting worse at the actual task.
So I built a more realistic evaluation. I split the 500 questions into a train set (151 questions) and test set (349 questions). I used train to calibrate the evaluation: hand-reviewing failures, curating corrected "platinum" answers where BIRD's gold SQL was wrong, and tuning the partial-match rules. The test set was the holdout. Since I did some prompt optimization on train, I'll show both numbers throughout so you can see how much (or how little) that mattered.
Here's what accuracy looks like as you relax the scoring, tier by tier:
| Scoring tier | Train | Test | What it adds |
|---|---|---|---|
| Gold match only (≈ official BIRD) | 64.0% | 58.2% | Strict result set equality |
| + Platinum answers | 73.1% | 58.5% | Corrects known errors in BIRD's gold SQL (see note below) |
| + Formatting tolerance | 78.8% | 65.5% | DISTINCT differences, extra columns, rounding |
| + LLM judge | 94.9% | 94.4% | "Would a human accept this answer?" |
The platinum corrections only exist for the train set, since I hand-reviewed those 151 questions. That's why the platinum tier barely moves on test (+0.3pp vs +9.1pp on train). But look at the judge tier: 94.9% train / 94.4% test. Half a percentage point apart. The evaluation holds up on the holdout even without my hand-curated corrections.
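The tiers are cumulative: a prediction counts at the first tier it passes. The sketch below shows one way such a cascade could look. The tolerance rules here (duplicate rows, float rounding to two decimals, extra predicted columns) are a guess at what "formatting tolerance" covers, not the exact rules used for the numbers above, and the function names are hypothetical. The exact tier compares multisets so that a DISTINCT difference registers as a failure, which is also an assumption. The LLM judge is left out, so anything that returns `None` would be handed to it.

```python
from collections import Counter
from itertools import permutations

def exact_match(pred, gold):
    """Same rows with the same multiplicities, order ignored."""
    return Counter(pred) == Counter(gold)

def _norm(row, ndigits):
    """Round floats so tiny numeric differences do not count as mismatches."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)

def tolerant_match(pred, gold, ndigits=2):
    """Assumed rules: ignore duplicates, float rounding, extra predicted columns."""
    gold_set = {_norm(r, ndigits) for r in gold}
    pred_rows = [_norm(r, ndigits) for r in pred]
    if not pred_rows or not gold_set:
        return not pred_rows and not gold_set
    width = len(next(iter(gold_set)))
    n_cols = len(pred_rows[0])
    # Try every ordered choice of `width` predicted columns against gold.
    for cols in permutations(range(n_cols), width):
        projected = {tuple(row[c] for c in cols) for row in pred_rows}
        if projected == gold_set:
            return True
    return False

def first_passing_tier(pred, gold, platinum=None):
    """Return the first tier passed, or None (left for the LLM judge)."""
    if exact_match(pred, gold):
        return "gold"
    if platinum is not None and exact_match(pred, platinum):
        return "platinum"
    if tolerant_match(pred, gold) or (
        platinum is not None and tolerant_match(pred, platinum)
    ):
        return "formatting"
    return None

gold = [("A", 91.5), ("C", 85.25)]
reordered = first_passing_tier([("C", 85.25), ("A", 91.5)], gold)
extra_col = first_passing_tier([("A", "Oslo", 91.5), ("C", "Bergen", 85.25)], gold)
duplicate = first_passing_tier([("A", 91.5), ("A", 91.5), ("C", 85.25)], gold)
wrong = first_passing_tier([("B", 78.0)], gold)
corrected = first_passing_tier([("A",), ("C",)], [("A",)], platinum=[("A",), ("C",)])
```

Trying every column permutation is fine for result sets a few columns wide. A wider result would need a smarter column-matching step.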
Train set (151 questions, all 3 models):
| Model | Strict (≈ BIRD EX) | Realistic | Total cost | Tool calls (p5 / median / p95) |
|---|---|---|---|---|
| Gemini 3 Flash | 68.2% | 94.0% | $1.80 | 3 / 6 / 9 |
| Claude Opus 4.5 | 64.9% | 95.4% | $26.37 | 4 / 6 / 9 |
| GPT-5.2 | 58.9% | 95.4% | $6.87 | 4 / 7 / 12 |
Test set (349 questions, 2 models):
| Model | Strict (≈ BIRD EX) | Realistic | Total cost | Tool calls (p5 / median / p95) |
|---|---|---|---|---|
| Gemini 3 Flash | 60.7% | 94.6% | $3.96 | 4 / 6 / 9 |
| GPT-5.2 | 55.6% | 94.3% | $15.32 | 4 / 7 / 11 |
Claude Opus wasn't run on the test set. After seeing all three models converge to ~95% on train, spending another $60+ to prove the same point on 349 more questions didn't seem worth it.
The median question takes 6-7 MCP tool calls, with an iteration limit of 10. A typical question looks like: inspect the schema, explore some columns, draft a query, check the results, refine, done. Some models, like GPT-5.2, make multiple tool calls per iteration, which is why its p95 (12 on train, 11 on test) exceeds the iteration limit.
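The difference between iterations and tool calls is easy to see in a toy replay. This is a minimal sketch with a scripted "model" rather than a real MCP client: the tool names, the `Turn` type, and `run_agent` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    """One model iteration: the tool calls it requests, plus a final answer if done."""
    tool_calls: List[str]
    final_sql: Optional[str] = None

def run_agent(turns, max_iterations=10):
    """Replay scripted turns, counting iterations and tool calls separately."""
    tool_calls = 0
    iterations = 0
    for turn in turns[:max_iterations]:
        iterations += 1
        # Every call requested in a turn runs, so one iteration can add several.
        tool_calls += len(turn.tool_calls)
        if turn.final_sql is not None:
            return {"iterations": iterations, "tool_calls": tool_calls,
                    "sql": turn.final_sql}
    # Ran out of iterations (or script) without a final answer.
    return {"iterations": iterations, "tool_calls": tool_calls, "sql": None}

# Hypothetical script: six iterations, several with parallel tool calls.
script = [
    Turn(["inspect_schema"]),
    Turn(["sample_column", "sample_column", "sample_column"]),
    Turn(["run_query", "run_query"]),
    Turn(["sample_column", "run_query"]),
    Turn(["run_query", "run_query"]),
    Turn(["run_query", "run_query"], final_sql="SELECT name FROM schools"),
]

result = run_agent(script)
print(result["iterations"], result["tool_calls"])
```

The run finishes in six iterations, well under the limit of 10, yet it makes 12 tool calls.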
All three models land at 94-95% under realistic evaluation regardless of where they start under strict scoring. On train, the gap between "best" and "worst" shrinks from 9.3 percentage points (68.2% vs 58.9%) to 1.4. On test, from 5.1 to 0.3. Pick any frontier model.