Evals

You cannot improve what you don't measure — and AI outputs are variable and subjective, so measuring is hard. Evals are test suites for AI: a fixed set of cases with known-good answers, run on every change, so "better" is a number instead of a vibe.

Unit tests for a probabilistic machine

When you change a prompt, swap a model, or tweak retrieval, does the system get better or worse? Without a measurement you're guessing from a handful of examples — "vibes-based" development, where a change that fixes one case silently breaks three others you didn't check.

An eval fixes this the way unit tests fixed software: assemble representative cases with expected outcomes, score every version against them, and never ship a regression. The catch is that a single aggregate score can hide a regression: a v2 can score higher overall than v1 and still break a case that v1 passed.

Eval suite — v1 vs v2, case by case

Six test cases, each graded pass/fail. Watch the per-case view, not just the total.

How you grade a machine that improvises

Programmatic checks. For structured tasks — valid JSON, correct number, a required field present — you can grade with plain code: exact match or a rule. Cheap, deterministic, and the gold standard where it applies.
Reference-based metrics. Compare output to a known-good answer with a similarity measure. Useful, but blunt — two good answers can be worded completely differently, so these miss a lot.
LLM-as-judge. Use a strong model to grade outputs against a rubric ("is this answer faithful to the source? cite yes/no and why"). Scales to subjective quality that code can't check — but the judge has its own biases, so it must itself be validated against human ratings.
Human eval & A/B tests. The ground truth for subjective quality is people — expert ratings offline, or real-user A/B tests in production. Slow and costly, so you reserve them for what the cheaper graders can't settle.
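
The first three graders above can be sketched in a few lines each. This is a minimal illustration rather than a reference implementation: the function names (`grade_json`, `token_f1`, `judge_agreement`) and the sample strings are made up for the example, word-overlap F1 stands in for "a similarity measure", and the judge's verdicts are assumed to be already collected as booleans.

```python
import json
import re
from collections import Counter


def grade_json(output: str, required_field: str) -> bool:
    """Programmatic check: output parses as a JSON object containing required_field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_field in parsed


def token_f1(output: str, reference: str) -> float:
    """Reference-based metric: F1 over shared lowercase word tokens."""
    out_tokens = re.findall(r"\w+", output.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not out_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the LLM judge's verdict matches the human rating."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


reference = "The meeting was moved to Friday."
print(grade_json('{"answer": 42}', "answer"))   # True
print(grade_json("answer: 42", "answer"))       # False
# Near-identical wording scores high; an acceptable paraphrase scores low.
print(round(token_f1("The meeting moved to Friday.", reference), 2))
print(round(token_f1("They rescheduled it for the end of the week.", reference), 2))
print(judge_agreement([True, True, False, True], [True, False, False, True]))  # 0.75
```

The paraphrase shares only one word with the reference, so the overlap metric marks it down even though a human would accept it. That gap is what the judge and human graders are there to cover.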

The loop that ties it together is eval-driven development: build a representative set, measure, change one thing, re-measure, keep only what improves the score without regressing a case. It's the difference between engineering and hoping.
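
One way to make that loop concrete is a gate that compares per-case results rather than totals. The sketch below is illustrative only: `CASES`, `run_suite`, `accept`, and the three `system_v*` functions are hypothetical names, exact match stands in for whatever grader fits the task, and plain functions stand in for real model calls.

```python
from typing import Callable

# Hypothetical suite: case name -> (input, expected output).
CASES = {
    "add": ("2+2", "4"),
    "upper": ("shout hi", "HI"),
    "echo": ("say ok", "ok"),
}


def run_suite(system: Callable[[str], str]) -> dict[str, bool]:
    """Score one version of the system: case name -> passed (exact match)."""
    return {name: system(prompt).strip() == expected
            for name, (prompt, expected) in CASES.items()}


def accept(baseline: dict[str, bool],
           candidate: dict[str, bool]) -> tuple[bool, list[str]]:
    """Keep a change only if the pass count rises and no passing case regresses."""
    regressions = [name for name, passed in baseline.items()
                   if passed and not candidate.get(name, False)]
    improved = sum(candidate.values()) > sum(baseline.values())
    return improved and not regressions, regressions


# Stand-in "systems": lookup tables in place of real model calls.
def system_v1(prompt: str) -> str:
    return {"say ok": "ok"}.get(prompt, "")


def system_v2(prompt: str) -> str:
    return {"2+2": "4", "shout hi": "HI"}.get(prompt, "")


def system_v3(prompt: str) -> str:
    return {"2+2": "4", "shout hi": "HI", "say ok": "ok"}.get(prompt, "")


baseline = run_suite(system_v1)
print(accept(baseline, run_suite(system_v2)))  # (False, ['echo'])
print(accept(baseline, run_suite(system_v3)))  # (True, [])
```

Here `system_v2` passes more cases than the baseline but is still rejected, because it breaks a case the baseline passed. `system_v3` improves the score with no regressions and is kept.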

Accuracy, pass@k, and Goodhart

The simplest score is accuracy over a suite of $n$ cases:

$$ \text{accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[\text{case }i\text{ passed}] $$

Because models are stochastic, a single attempt per case can understate what a model is capable of: it might fail on the first try and succeed on the second. pass@k measures the chance that at least one of $k$ samples passes. If each sample passes independently with probability $p$:

$$ \text{pass@}k = 1 - (1-p)^{k} $$
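
The formula takes $p$ as given, but in practice it has to be estimated from samples. One estimator commonly used in code-generation benchmarks draws $n \ge k$ samples per case, counts the $c$ that pass, and computes $1 - \binom{n-c}{k} / \binom{n}{k}$: the chance that a random size-$k$ subset of those samples contains at least one pass. The sketch below assumes that setup; the function name `pass_at_k_estimate` is made up for the example.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets made up entirely of failures;
    # it is 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 5 passed: the k=1 estimate equals the observed pass rate.
for k in (1, 3, 5):
    print(f"pass@{k} estimate: {pass_at_k_estimate(10, 5, k):.3f}")
```

These estimates differ from plugging the observed rate $c/n$ into $1-(1-p)^k$, because the subsets are drawn without replacement from a finite set of samples.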

Plugging $p=0.5$ into the formula gives pass@1 = 50%, pass@3 = 87.5%, and pass@5 = 96.9%, which is why sampling several times and picking the best is a real strategy, provided you have a check that can tell which sample passed.

One warning governs all of it, Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Optimize hard enough against any single eval and the model learns to game it rather than get genuinely better, which is why suites must be broad, held-out, and refreshed.
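
A simple way to keep part of a suite held-out is to assign each case to a split by hashing its id, so the assignment does not depend on who runs it or when. This is one possible sketch, not a prescribed method: `split_for`, the 30% fraction, and the `case-NNN` ids are all assumptions made for the example.

```python
import hashlib


def split_for(case_id: str, held_out_fraction: float = 0.3) -> str:
    """Deterministically assign a case to 'dev' or 'held_out' from a hash of its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "dev"


case_ids = [f"case-{i:03d}" for i in range(20)]
splits = {cid: split_for(cid) for cid in case_ids}
dev = [cid for cid, s in splits.items() if s == "dev"]
held_out = [cid for cid, s in splits.items() if s == "held_out"]
print(len(dev), "dev cases to iterate on;",
      len(held_out), "held-out cases to check before shipping")
```

Because the split depends only on the case id, adding new cases to refresh the suite never moves an existing case from one split to the other.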

Scoring a suite, catching a regression

```python
cases = ["json format", "refusal", "math", "citation", "tone", "edge case"]
v1 = [1, 1, 0, 1, 1, 0]  # 4/6
v2 = [1, 1, 1, 1, 0, 1]  # 5/6 overall...

# Accuracy: the fraction of cases that passed.
acc = lambda v: sum(v) / len(v)
print(f"v1: {acc(v1)*100:.0f}% ({sum(v1)}/{len(v1)})")
print(f"v2: {acc(v2)*100:.0f}% ({sum(v2)}/{len(v2)})")

# A regression is a case v1 passed and v2 fails.
regressed = [c for c, a, b in zip(cases, v1, v2) if a == 1 and b == 0]
print("regressions:", regressed)  # ['tone'] <- v2 broke a case v1 passed


def pass_at_k(p, k):
    """Chance that at least one of k independent samples passes."""
    return 1 - (1 - p)**k


for k in (1, 3, 5):
    print(f"pass@{k} at p=0.5: {pass_at_k(0.5, k)*100:.1f}%")

# Output:
# v1: 67% (4/6)
# v2: 83% (5/6)
# regressions: ['tone']
# pass@1 at p=0.5: 50.0%
# pass@3 at p=0.5: 87.5%
# pass@5 at p=0.5: 96.9%
```

The discipline that separates products from demos

Measurement → money

Evals are the cheapest expensive thing in AI. Building a good suite costs real effort, but not having one costs far more: silent quality decay, shipped regressions, and months spent chasing improvements you can't prove. Every serious AI team runs on evals because they convert an unmeasurable "is it good?" into a number you can defend to a customer or a board — the line between a demo that impresses and a product that's trusted.

They also govern the whole build-vs-optimize spend. Without evals, you can't tell whether a bigger model, a better prompt, or more retrieval actually helped — so you either overspend on capability you didn't need or ship regressions you didn't catch. Evals are how the money aimed at quality lands on quality.

This is the through-line of the whole think tank. An eval is just the research method turned on your own system: define the claim, gather held-out evidence, measure honestly, and beware the metric you're tempted to game. It's the same discipline that separates analysis worth paying for from confident noise — applied to AI itself.
