First Principles / Part IV · Building with AI / Chapter 24
First Principles · Building with AI · 24
You cannot improve what you don't measure — and AI outputs are variable and subjective, so measuring is hard. Evals are test suites for AI: a fixed set of cases with known-good answers, run on every change, so "better" is a number instead of a vibe.
01The answer, then the intuition
When you change a prompt, swap a model, or tweak retrieval, does the system get better or worse? Without a measurement you're guessing from a handful of examples — "vibes-based" development, where a change that fixes one case silently breaks three others you didn't check.
An eval fixes this the way unit tests fixed software: assemble representative cases with expected outcomes, score every version against them, and never ship a regression. The catch is that a single aggregate score can hide a regression. Toggle two model versions — v2 scores higher overall, yet look closely at what it broke:
Six test cases, each graded pass/fail. Watch the per-case view, not just the total.
02Mechanics
The loop that ties it together is eval-driven development: build a representative set, measure, change one thing, re-measure, keep only what improves the score without regressing a case. It's the difference between engineering and hoping.
04The math
expand ▾The simplest score is accuracy over a suite of $n$ cases:
Because models are stochastic, one score understates them — a model might succeed on the second try. pass@k measures the chance at least one of $k$ samples passes, given a per-sample success probability $p$:
At $p=0.5$: pass@1 = 50%, pass@3 = 87.5%, pass@5 = 96.9% — why sampling several times and picking the best is a real strategy. And a warning that governs all of it, Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Optimize hard enough against any single eval and the model learns to game it rather than get genuinely better — which is why suites must be broad, held-out, and refreshed.
05The code
expand ▾Two versions graded case-by-case — the higher total hides a real regression.
evals.py
cases = ["json format", "refusal", "math", "citation", "tone", "edge case"]
v1 = [1, 1, 0, 1, 1, 0] # 4/6
v2 = [1, 1, 1, 1, 0, 1] # 5/6 overall...
acc = lambda v: sum(v) / len(v)
print(f"v1: {acc(v1)*100:.0f}% ({sum(v1)}/{len(v1)})")
print(f"v2: {acc(v2)*100:.0f}% ({sum(v2)}/{len(v2)})")
regressed = [c for c, a, b in zip(cases, v1, v2) if a == 1 and b == 0]
print("regressions:", regressed) # ['tone'] <- v2 broke a case v1 passed
def pass_at_k(p, k): return 1 - (1 - p)**k
for k in (1, 3, 5):
print(f"pass@{k} at p=0.5: {pass_at_k(0.5, k)*100:.1f}%")
# v1: 67% (4/6)
# v2: 83% (5/6)
# regressions: ['tone']
# pass@1 50.0% | pass@3 87.5% | pass@5 96.9%
06The economics
Measurement → money
Evals are the cheapest expensive thing in AI. Building a good suite costs real effort, but not having one costs far more: silent quality decay, shipped regressions, and months spent chasing improvements you can't prove. Every serious AI team runs on evals because they convert an unmeasurable "is it good?" into a number you can defend to a customer or a board — the line between a demo that impresses and a product that's trusted.
They also govern the whole build-vs-optimize spend. Without evals, you can't tell whether a bigger model, a better prompt, or more retrieval actually helped — so you either overspend on capability you didn't need or ship regressions you didn't catch. Evals are how the money aimed at quality lands on quality.
This is the through-line of the whole think tank. An eval is just the research method turned on your own system: define the claim, gather held-out evidence, measure honestly, and beware the metric you're tempted to game. It's the same discipline that separates analysis worth paying for from confident noise — applied to AI itself.
07Going deeper
expand ▾
Liang et al. (2022) — HELM · holistic, multi-metric evaluation of language models.
Zheng et al. (2023) — Judging LLM-as-a-Judge (MT-Bench) · using models to grade, and its biases.
Chen et al. (2021) — Evaluating Code (pass@k) · the pass@k metric.
Goodhart's Law · why every target metric eventually gets gamed.
Cite this chapter: Divergent Compute, "Evals", First Principles, 2026. divergentcompute.com/first-principles-evals · v1.0 · CC-BY.