First Principles / Part VI · Best practices & tools / Chapter 31
The most common and most expensive mistake in applied AI is reaching for the best model when you needed the cheapest one that clears the bar. The right rule is nearly the opposite of the instinct: match the task, don't max the model.
01 The answer, then the intuition
Every task has a quality threshold — the level below which the output is useless and above which extra capability is wasted. Classifying a support ticket needs far less than drafting a legal brief. The job isn't to find the smartest model; it's to find the cheapest model whose capability sits just above your task's threshold, verified with an eval, not a hunch.
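One way to make "verified with an eval, not a hunch" concrete is sketched below. The model names, prices, graded results, and the 0.90 pass-rate bar are all illustrative assumptions, not measurements; in practice each pass/fail list would come from running a candidate model over your own labeled examples and grading the outputs.

```python
# Illustrative only: names, prices, and graded results are made-up placeholders.
PRICE_PER_1M = {"small": 0.10, "mid": 0.60, "frontier": 5.00}  # $ per 1M tokens

# One boolean per eval example: did the model's output pass your grader?
EVAL_RESULTS = {
    "small":    [True, False, True, False, True, True, False, True, True, False],
    "mid":      [True, True, True, True, True, True, False, True, True, True],
    "frontier": [True] * 10,
}


def pass_rate(results):
    """Fraction of eval examples the model passed."""
    return sum(results) / len(results)


def cheapest_passing(threshold):
    """Cheapest model whose measured pass rate meets the threshold, else None."""
    passing = [m for m, r in EVAL_RESULTS.items() if pass_rate(r) >= threshold]
    return min(passing, key=PRICE_PER_1M.get, default=None)


if __name__ == "__main__":
    for name, results in EVAL_RESULTS.items():
        print(f"{name}: pass rate {pass_rate(results):.0%}")
    print("pick at 0.90 bar:", cheapest_passing(0.90))
```

With these placeholder results, a 0.90 bar selects the mid tier and a 0.95 bar forces the frontier. Ten examples is far too few for a real decision; the point is only that the pick comes from measured pass rates rather than from a leaderboard rank.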
Because model cost varies by ~50× across tiers while capability varies far less, this one decision dominates your economics. As the quality bar moves, the pick jumps between tiers, always landing on the cheapest model above the line.
[Figure: illustrative tiers plotted by cost ($/1M tokens) against relative capability, with an adjustable quality threshold.]
02 Mechanics
Put together, model choice is a constrained optimization: minimize cost subject to clearing quality, latency, context, and privacy. The instinct to grab the best model skips the constraint that matters most — the budget.
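The constrained-optimization framing can be sketched as a filter followed by a minimum. This is a minimal illustration, not a definitive implementation: the `Model` and `Requirements` types are hypothetical, the cost and quality figures reuse this chapter's illustrative tiers, and the latency, context, and self-hosting values are placeholders invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    cost: float        # $ per 1M tokens
    quality: float     # eval score, 0-100
    latency_ms: int    # typical response latency
    context: int       # max context window, in tokens
    self_hosted: bool  # can run inside your own infrastructure


@dataclass(frozen=True)
class Requirements:
    min_quality: float
    max_latency_ms: int
    min_context: int
    needs_self_hosting: bool  # stand-in for a privacy constraint


# Hypothetical catalog; latency, context, and hosting values are placeholders.
CATALOG = [
    Model("small-open", 0.10, 55, 200, 8_000, True),
    Model("mid", 0.60, 72, 400, 32_000, False),
    Model("frontier-open", 1.50, 82, 900, 128_000, True),
    Model("frontier-closed", 5.00, 92, 1200, 200_000, False),
]


def feasible(m: Model, r: Requirements) -> bool:
    """True when the model satisfies every hard constraint."""
    return (
        m.quality >= r.min_quality
        and m.latency_ms <= r.max_latency_ms
        and m.context >= r.min_context
        and (m.self_hosted or not r.needs_self_hosting)
    )


def choose(r: Requirements) -> Optional[Model]:
    """Cheapest feasible model, or None if nothing satisfies the constraints."""
    candidates = [m for m in CATALOG if feasible(m, r)]
    return min(candidates, key=lambda m: m.cost, default=None)
```

In this toy catalog, `choose(Requirements(70, 1000, 16_000, False))` returns the mid tier, while the same request with `needs_self_hosting=True` moves to `frontier-open`. Raising `min_quality` to 90 returns `None`, because the only model that clears quality misses the latency limit. Treating the constraints as hard filters and cost as the objective keeps the budget in view at every step.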
04 The math
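The selection rule is a one-line optimization: pick the cheapest model that clears the quality bar $q$. Writing $c(m)$ for a model's cost and $\text{cap}(m)$ for its capability (notation assumed here):
$$m^{*} = \arg\min_{m \,:\, \text{cap}(m) \ge q} c(m)$$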
Routing does better than any single pick. Send everything to a cheap model at cost $c_{\text{cheap}}$, and escalate a fraction $p$ of hard cases to the frontier at $c_{\text{frontier}}$. Expected cost per request:
$$\mathbb{E}[\text{cost}] = c_{\text{cheap}} + p\,c_{\text{frontier}}$$
With the mid model as the cheap tier at \$0.60, a frontier model at \$5.00, and only 20% of requests escalating, that's $0.60 + 0.2\times5.00 = \$1.60$, which is 68% cheaper than sending everything to the frontier, while the escalated hard cases still get top quality. The whole art is a good escalation signal: knowing which requests actually need the expensive model, and letting the rest ride the cheap one.
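The saving can also be written in closed form, which shows when routing stops paying off. This short derivation assumes, as the worked numbers do, that every request pays $c_{\text{cheap}}$ and a fraction $p$ additionally pays $c_{\text{frontier}}$, so the routed cost is $c_{\text{cheap}} + p\,c_{\text{frontier}}$:

$$
\text{saving} \;=\; 1 - \frac{c_{\text{cheap}} + p\,c_{\text{frontier}}}{c_{\text{frontier}}} \;=\; 1 - p - \frac{c_{\text{cheap}}}{c_{\text{frontier}}}
$$

Plugging in the example gives $1 - 0.20 - 0.60/5.00 = 0.68$. The saving is positive only while $p < 1 - c_{\text{cheap}}/c_{\text{frontier}}$, which is $p < 0.88$ here. Above that escalation rate the cheap first pass costs more than it saves, and calling the frontier directly is cheaper.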
05 The code
The selection rule, then the cascade savings.
choose_model.py
```python
# Illustrative tiers: name -> ($ per 1M tokens, relative capability 0-100)
models = {
    "small-open": (0.10, 55),
    "mid": (0.60, 72),
    "frontier-open": (1.50, 82),
    "frontier-closed": (5.00, 92),
}


def cheapest_clearing(q):
    """Return the cheapest model whose capability is at least q, or None."""
    ok = {n: c for n, (c, cap) in models.items() if cap >= q}
    return min(ok, key=ok.get) if ok else None


for q in (50, 70, 85):
    print(f"quality bar {q}: {cheapest_clearing(q)}")
# quality bar 50: small-open
# quality bar 70: mid
# quality bar 85: frontier-closed

# Routing: cheap model first, escalate the hardest 20% of requests to the frontier.
route = 0.60 + 0.20 * 5.00
print(f"always-frontier $5.00 vs route ${route:.2f} "
      f"-> {(1 - route / 5.0) * 100:.0f}% cheaper")
# always-frontier $5.00 vs route $1.60 -> 68% cheaper
```
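The routing arithmetic takes the 20% escalation rate as given. A minimal sketch of where such a rate could come from: a cascade that calls the cheap model first and escalates only when a confidence score falls below a threshold. Everything here is an assumption for illustration: `cheap_model` and `frontier_model` are stand-in stubs rather than real APIs, the confidence score is a uniform random number, and the threshold is chosen so that roughly 20% of requests escalate.

```python
import random

CHEAP_COST, FRONTIER_COST = 0.60, 5.00  # illustrative prices from the routing example


def cheap_model(request, rng):
    """Stub: returns (answer, confidence in [0, 1)). Not a real API."""
    return f"cheap answer to {request}", rng.random()


def frontier_model(request):
    """Stub for the expensive model. Not a real API."""
    return f"frontier answer to {request}"


def cascade(request, rng, threshold=0.2):
    """Try the cheap model; escalate when its confidence is below threshold.

    Returns (answer, cost). Escalated requests pay for both calls.
    """
    answer, confidence = cheap_model(request, rng)
    if confidence >= threshold:
        return answer, CHEAP_COST
    return frontier_model(request), CHEAP_COST + FRONTIER_COST


def average_cost(n_requests=10_000, threshold=0.2, seed=0):
    """Mean cost over simulated requests with a seeded random confidence."""
    rng = random.Random(seed)
    total = sum(cascade(f"req-{i}", rng, threshold)[1] for i in range(n_requests))
    return total / n_requests


if __name__ == "__main__":
    print(f"average routed cost: ${average_cost():.2f}")
```

Because the stub's confidence is uniform, a threshold of 0.2 escalates about one request in five, so the average should land close to the \$1.60 figure. In a real system the hard part is the signal itself: a confidence score that does not track actual errors will escalate the wrong requests, spending frontier money without the quality benefit.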
06 The economics
The choice → money
Model selection is the single highest-leverage decision on an AI product's unit economics. Because cost varies ~50× across tiers, using a frontier model where a mid one would do can multiply your bill by an order of magnitude for quality nobody needed — the classic way a promising AI feature quietly loses money. The discipline of "cheapest that clears the bar," verified by evals, is worth more to a P&L than almost any prompt tweak.
Routing compounds the win. A cascade that sends easy requests to a cheap model and escalates only the hard ones captures most of the frontier's quality at a fraction of the price — the ~68% saving above is typical, not exceptional. It's the applied-AI version of the whole open-vs-closed market dynamic, run inside a single product.
For the Circuit, this is the demand side made rational. As open-weight models keep raising the capability you can get cheaply, the quality bar that requires a frontier model rises too — squeezing the premium tier's addressable work. The everyday choice of which model to call, made by millions of developers, is quietly one of the forces deciding whether the frontier's economics hold.
07 Going deeper
Chen et al. (2023) — FrugalGPT · cascades and routing to cut LLM cost.
LMArena (Chatbot Arena) · human-preference leaderboards — useful, with the eval caveat.
Artificial Analysis · cost, speed, and quality compared across models.
Liang et al. (2022) — HELM · why one benchmark number is never the whole picture.
Cite this chapter: Divergent Compute, "Choosing a model", First Principles, 2026. divergentcompute.com/first-principles-choosing-a-model · v1.0 · CC-BY.