Divergent Compute.AI Economic Think Tank

First Principles / Part IV · Building with AI / Chapter 24

Evals

You cannot improve what you don't measure — and AI outputs are variable and subjective, so measuring is hard. Evals are test suites for AI: a fixed set of cases with known-good answers, run on every change, so "better" is a number instead of a vibe.

01 The answer, then the intuition

Unit tests for a probabilistic machine

When you change a prompt, swap a model, or tweak retrieval, does the system get better or worse? Without a measurement you're guessing from a handful of examples — "vibes-based" development, where a change that fixes one case silently breaks three others you didn't check.

An eval fixes this the way unit tests fixed software: assemble representative cases with expected outcomes, score every version against them, and never ship a regression. The catch is that a single aggregate score can hide a regression. In the comparison of two model versions below, v2 scores higher overall, yet it breaks a case that v1 passed.

[Interactive figure: Eval suite, v1 vs v2, case by case. Six test cases, each graded pass/fail. Watch the per-case view, not just the total.]

02 Mechanics

How you grade a machine that improvises

  • Programmatic checks. For structured tasks — valid JSON, correct number, a required field present — you can grade with plain code: exact match or a rule. Cheap, deterministic, and the gold standard where it applies.
  • Reference-based metrics. Compare output to a known-good answer with a similarity measure. Useful, but blunt — two good answers can be worded completely differently, so these miss a lot.
  • LLM-as-judge. Use a strong model to grade outputs against a rubric ("is this answer faithful to the source? cite yes/no and why"). Scales to subjective quality that code can't check — but the judge has its own biases, so it must itself be validated against human ratings.
  • Human eval & A/B tests. The ground truth for subjective quality is people — expert ratings offline, or real-user A/B tests in production. Slow and costly, so you reserve them for what the cheaper graders can't settle.
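The first two graders above, plus the step of validating a judge against human ratings, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names (`check_json_field`, `token_f1`, `agreement`), the example outputs, and the labels are all hypothetical, and token-overlap F1 is only one possible similarity measure.

```python
import json
from collections import Counter


def check_json_field(output: str, field: str) -> bool:
    """Programmatic check: output parses as a JSON object containing `field`."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and field in parsed


def token_f1(output: str, reference: str) -> float:
    """Reference-based metric: F1 over shared lowercase whitespace tokens."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def agreement(judge_labels: list[int], human_labels: list[int]) -> float:
    """Judge validation: fraction of cases where judge and human agree."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Hypothetical examples.
print(check_json_field('{"answer": 42}', "answer"))   # True
print(check_json_field("answer: 42", "answer"))       # False

# Same words in a different order score 1.0, but a correct answer that is
# worded differently scores low: the metric is blunt.
print(round(token_f1("Paris is the capital", "The capital is Paris"), 2))  # 1.0
print(round(token_f1("It's Paris", "The capital is Paris"), 2))            # 0.33

# Made-up judge vs. human pass/fail labels on five cases.
print(agreement([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]))    # 0.8
```

A raw agreement rate is the simplest way to compare a judge with human raters; a real validation would likely use more cases and a chance-corrected statistic.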

The loop that ties it together is eval-driven development: build a representative set, measure, change one thing, re-measure, keep only what improves the score without regressing a case. It's the difference between engineering and hoping.
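The "keep only what improves the score without regressing a case" rule can be written as a small gate. The sketch below is one possible reading of that rule, not a prescribed implementation; the function name `should_keep`, the case names, and the results are hypothetical.

```python
def should_keep(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> tuple[bool, list[str]]:
    """Keep a change only if the total improves and no passing case breaks."""
    # A regression is a case the baseline passed that the candidate fails
    # (a case missing from the candidate results counts as a failure).
    regressions = [
        case
        for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    ]
    improved = sum(candidate.values()) > sum(baseline.values())
    return improved and not regressions, regressions


baseline = {"json format": True, "refusal": True, "math": False}
change_a = {"json format": True, "refusal": False, "math": True}  # fixes one, breaks one
change_b = {"json format": True, "refusal": True, "math": True}   # strict improvement

print(should_keep(baseline, change_a))   # (False, ['refusal'])
print(should_keep(baseline, change_b))   # (True, [])
```

Returning the list of regressed cases alongside the decision keeps the per-case view visible instead of collapsing everything into one number.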

04 The math

Accuracy, pass@k, and Goodhart

The simplest score is accuracy over a suite of $n$ cases:

$$ \text{accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[\text{case }i\text{ passed}] $$

Because models are stochastic, a single sample per case can understate what a model is capable of: it might fail on the first try and succeed on the second. pass@k measures the chance that at least one of $k$ independent samples passes, given a per-sample success probability $p$:

$$ \text{pass@}k = 1 - (1-p)^{k} $$
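The formula above takes $p$ as given; in practice it has to be estimated from samples. Chen et al. (2021), the pass@k source listed at the end of this chapter, draw $n \ge k$ samples per problem, count the $c$ that pass, and estimate pass@k as $1 - \binom{n-c}{k} / \binom{n}{k}$. A minimal sketch follows; the function name and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed (requires n >= k).

    Computes 1 - C(n - c, k) / C(n, k): one minus the fraction of size-k
    subsets of the n samples that contain no passing sample.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 10 samples drawn, 5 of them passed.
for k in (1, 3, 5):
    print(f"pass@{k} estimate: {pass_at_k_estimate(10, 5, k):.3f}")
# pass@1 estimate: 0.500
# pass@3 estimate: 0.917
# pass@5 estimate: 0.996
```

With these counts the estimate for $k=3$ is 0.917, not the 0.875 obtained by plugging $\hat{p} = 0.5$ into $1-(1-p)^k$: the estimator counts subsets of the samples actually drawn, without replacement, so the two only coincide as $n$ grows.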

At $p=0.5$: pass@1 = 50%, pass@3 = 87.5%, pass@5 = 96.9%. This is why sampling several times and keeping a passing answer is a real strategy, provided you have a check that can tell which sample passed. One warning governs all of it, Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Optimize hard enough against any single eval and the model learns to game it rather than get genuinely better, which is why suites must be broad, held-out, and refreshed.
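One way to keep part of a suite held out is to assign each case to a split by a stable hash of its ID, so the assignment never changes between runs and nobody tunes against the held-out cases by accident. This is a sketch under assumptions: the function name `is_held_out`, the 20% fraction, and the case IDs are all hypothetical.

```python
import hashlib


def is_held_out(case_id: str, held_out_fraction: float = 0.2) -> bool:
    """Assign a case to the held-out split from a stable hash of its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < held_out_fraction


case_ids = [f"case-{i:03d}" for i in range(1000)]
held_out = [c for c in case_ids if is_held_out(c)]
dev = [c for c in case_ids if not is_held_out(c)]

# Iterate on prompts and models against `dev`; score `held_out` only when
# deciding whether to ship.
print(len(dev), len(held_out))   # roughly an 80/20 split
```

Because the split depends only on the case ID, adding new cases to refresh the suite does not move existing cases between splits.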

05 The code

Scoring a suite, catching a regression

Two versions graded case-by-case — the higher total hides a real regression.

evals.py

cases = ["json format", "refusal", "math", "citation", "tone", "edge case"]
v1 = [1, 1, 0, 1, 1, 0]     # 4/6 pass
v2 = [1, 1, 1, 1, 0, 1]     # 5/6 pass overall...


def accuracy(results):
    """Fraction of cases that passed (1 = pass, 0 = fail)."""
    return sum(results) / len(results)


print(f"v1: {accuracy(v1)*100:.0f}%  ({sum(v1)}/{len(v1)})")
print(f"v2: {accuracy(v2)*100:.0f}%  ({sum(v2)}/{len(v2)})")

# A regression is any case the old version passed and the new one fails.
regressed = [c for c, old, new in zip(cases, v1, v2) if old == 1 and new == 0]
print("regressions:", regressed)      # ['tone']  <- v2 broke a case v1 passed


def pass_at_k(p, k):
    """Chance that at least one of k independent samples passes."""
    return 1 - (1 - p) ** k


for k in (1, 3, 5):
    print(f"pass@{k} at p=0.5: {pass_at_k(0.5, k)*100:.1f}%")

# Output:
# v1: 67%  (4/6)
# v2: 83%  (5/6)
# regressions: ['tone']
# pass@1 at p=0.5: 50.0%
# pass@3 at p=0.5: 87.5%
# pass@5 at p=0.5: 96.9%

06 The economics

The discipline that separates products from demos

Measurement → money

Evals are the cheapest expensive thing in AI. Building a good suite costs real effort, but not having one costs far more: silent quality decay, shipped regressions, and months spent chasing improvements you can't prove. Every serious AI team runs on evals because they convert an unmeasurable "is it good?" into a number you can defend to a customer or a board — the line between a demo that impresses and a product that's trusted.

They also govern the whole build-vs-optimize spend. Without evals, you can't tell whether a bigger model, a better prompt, or more retrieval actually helped — so you either overspend on capability you didn't need or ship regressions you didn't catch. Evals are how the money aimed at quality lands on quality.

This is the through-line of the whole think tank. An eval is just the research method turned on your own system: define the claim, gather held-out evidence, measure honestly, and beware the metric you're tempted to game. It's the same discipline that separates analysis worth paying for from confident noise — applied to AI itself.

07 Going deeper

The primary sources

Liang et al. (2022), "Holistic Evaluation of Language Models" (HELM) · holistic, multi-metric evaluation of language models.
Zheng et al. (2023), "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" · using models to grade, and its biases.
Chen et al. (2021), "Evaluating Large Language Models Trained on Code" · the pass@k metric.
Goodhart's Law · why every target metric eventually gets gamed.

Cite this chapter: Divergent Compute, "Evals", First Principles, 2026. divergentcompute.com/first-principles-evals · v1.0 · CC-BY.
