Cost optimization

AI serving cost isn't one number to shave — it's a stack of independent levers that multiply. Each cuts cost by a factor; stacked, they compound. This is the operational discipline that turns a token that loses money into one that makes it.


Independent factors, multiplied

The reason AI costs can fall so dramatically is that the savings multiply. Caching cuts cost in half; batching cuts what's left to a quarter; quantization halves that again; routing and prompt trimming take more still. Because each lever acts on a different part of the cost, they stack — five modest wins become one enormous one.

That's why "AI is too expensive" is usually a solvable engineering problem, not a fixed fact. Toggle the levers and watch a naive $100-per-million-tokens bill collapse toward a couple of dollars:

The cost stack — toggle the levers

Baseline $100 / 1M tokens (naive). Each lever's multiplier is illustrative but realistic.


The five levers, and what each one moves

Prompt caching. When many requests share a long prefix — a system prompt, a document, few-shot examples — cache its KV state once and reuse it, so you don't reprocess it every call. For repeated-context workloads this alone can halve cost or more.
Continuous batching. Pack many requests into each forward pass so one weight-load serves them all. The single biggest lever on a memory-bound GPU's economics.
Quantization. Serve at 4- or 8-bit precision instead of 16-bit, cutting the memory moved per token and letting you batch more, usually with little quality loss (verify with an eval, especially at 4-bit).
Model routing. Send easy requests to a cheap model and escalate only the hard ones. Captures most of the frontier's quality at a fraction of the price.
Prompt & output trimming. Every token is billed, so shorter prompts, tighter instructions, and capped output lengths cut cost on every single call — the least glamorous lever, and often the easiest.
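To make two of these levers concrete, the sketch below estimates a multiplier for prompt caching and for model routing from workload parameters. It is a rough model under stated assumptions, not a pricing formula from any provider: the function names, `cached_discount`, and all the sample numbers are hypothetical, and the routing model assumes every request tries the cheap model first and a fraction is re-run on the expensive one.

```python
def caching_multiplier(prefix_fraction: float, hit_rate: float,
                       cached_discount: float) -> float:
    """Fraction of input-token cost that remains with prefix caching.

    prefix_fraction: share of input tokens in the reusable prefix
    hit_rate:        share of requests that find the prefix already cached
    cached_discount: cost of a cached token relative to an uncached one
                     (assumed value, not a quoted price)
    """
    cached_share = prefix_fraction * hit_rate
    return (1.0 - cached_share) + cached_share * cached_discount


def routing_multiplier(escalation_rate: float, cheap_cost_ratio: float) -> float:
    """Fraction of cost that remains when requests try a cheap model first.

    escalation_rate:  share of requests re-run on the expensive model
    cheap_cost_ratio: cheap model cost relative to the expensive model
    """
    # Every request pays for the cheap model; escalated ones also pay full price.
    return cheap_cost_ratio + escalation_rate


# Hypothetical workload: 80% of the prompt is a shared prefix, 90% cache hits,
# cached tokens at a tenth of the normal price.
print(f"caching multiplier (input cost only): {caching_multiplier(0.8, 0.9, 0.1):.2f}")
# Hypothetical router: cheap model at 10% of the price, 30% of requests escalate.
print(f"routing multiplier: {routing_multiplier(0.3, 0.1):.2f}")
```

Note that the caching multiplier applies to input-token cost only, so its effect on the whole bill depends on how much of the bill is input.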

The discipline is to treat cost as a product of factors and attack each independently, always guarding quality with an eval so a saving doesn't quietly become a regression. Stacked carefully, order-of-magnitude reductions are routine.
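One way to picture the eval guard is a loop that enables levers one at a time and skips any lever that would push the eval score below a floor. This is a minimal sketch, assuming quality changes add up across levers (in practice the eval would be re-run after each change). The `Lever` type, the `quality_delta` values, and the tolerance are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Lever:
    name: str
    cost_multiplier: float  # fraction of cost remaining once enabled (0 < f < 1)
    quality_delta: float    # hypothetical change in eval score once enabled


def apply_guarded(baseline_cost, baseline_score, levers, max_total_drop):
    """Enable levers in order, skipping any that would breach the quality floor."""
    cost, score = baseline_cost, baseline_score
    floor = baseline_score - max_total_drop
    accepted = []
    for lever in levers:
        new_score = score + lever.quality_delta
        if new_score < floor:
            continue  # the saving would become a regression, so leave it off
        cost *= lever.cost_multiplier
        score = new_score
        accepted.append(lever.name)
    return cost, score, accepted


levers = [
    Lever("Prompt caching", 0.50, 0.0),
    Lever("Continuous batching", 0.25, 0.0),
    Lever("Quantization (4-bit)", 0.50, -0.03),
    Lever("Model routing", 0.40, -0.01),
    Lever("Prompt trimming", 0.70, -0.005),
]

cost, score, accepted = apply_guarded(100.0, 0.90, levers, max_total_drop=0.02)
print(f"${cost:.2f}/1M at eval score {score:.3f} with {accepted}")
```

With these invented numbers, 4-bit quantization is the one lever the guard rejects, because on its own it would drop the score by more than the 0.02 allowed; the other four still bring the bill to $3.50 per 1M tokens.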

Why savings compound

Because each lever scales a different part of the pipeline, total cost is the baseline times the product of the multipliers, not the sum:

$$ \text{cost} = \text{baseline} \times \prod_{i} f_i, \qquad 0 < f_i < 1 $$

Multiplication is what makes the effect so large. Five levers of $\{0.5, 0.25, 0.5, 0.4, 0.7\}$ give:

$$ 100 \times 0.5 \times 0.25 \times 0.5 \times 0.4 \times 0.7 = \$1.75 \;\;\Rightarrow\;\; 57\times \text{ cheaper} $$
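Read as a reduction factor rather than a remaining fraction, the same arithmetic is a product of per-lever speedups. This is only a restatement of the line above, with $1/0.7$ left unrounded:

$$ \frac{\text{baseline}}{\text{cost}} = \prod_{i} \frac{1}{f_i} = 2 \times 4 \times 2 \times 2.5 \times \frac{1}{0.7} = \frac{40}{0.7} \approx 57.1 $$

Taking logs turns the product into a sum, $\log(\text{cost}) = \log(\text{baseline}) + \sum_i \log f_i$, so each lever contributes a fixed step on a log scale regardless of the order in which it is applied.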

No single lever did that — the biggest was only 4× on its own. The compounding is the point: a stack of merely-good optimizations produces a great one. It also explains why serving prices have fallen so steeply industry-wide — providers are stacking these same factors, and the product keeps shrinking. (The multipliers aren't truly independent — caching and batching interact — but the multiplicative model is the right first-order intuition.)
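The caveat about independence can be seen in a toy two-component bill. The sketch below is an illustration under invented numbers, not a measurement: it splits cost into an input part and an output part and lets two levers (caching and prompt trimming, chosen here for simplicity rather than the caching and batching pair mentioned above) both act on the input part only.

```python
def bill(input_cost, output_cost, caching=False, trimming=False):
    """Hypothetical bill where both levers touch input cost and leave output alone."""
    if caching:
        input_cost *= 0.2  # assumed: a cached prefix makes input 5x cheaper
    if trimming:
        input_cost *= 0.5  # assumed: trimmed prompts halve the input tokens
    return input_cost + output_cost


base = bill(60.0, 40.0)  # $100 split 60/40 between input and output

# Whole-bill multipliers, each measured with only that lever enabled.
f_caching = bill(60.0, 40.0, caching=True) / base
f_trimming = bill(60.0, 40.0, trimming=True) / base

naive = base * f_caching * f_trimming                    # multiplicative prediction
actual = bill(60.0, 40.0, caching=True, trimming=True)   # both levers together

print(f"f_caching={f_caching:.2f}  f_trimming={f_trimming:.2f}")
print(f"naive product: ${naive:.2f}   actual: ${actual:.2f}")
```

Here the product of standalone multipliers predicts about $36, while enabling both levers actually yields $46, because the untouched output cost puts a floor under the bill. Levers that act on different components compose more cleanly than levers that compete for the same one.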

Stacking the levers

```python
baseline = 100.0  # $ per 1M tokens, naive setup

# Each value is the fraction of cost that remains after the lever is applied.
levers = {
    "Prompt caching": 0.50,
    "Continuous batching": 0.25,
    "Quantization (4-bit)": 0.50,
    "Model routing": 0.40,
    "Prompt trimming": 0.70,
}

total = baseline
for name, f in levers.items():
    total *= f
    print(f"+ {name:22s} x{f} -> ${total:.2f}/1M")

print(f"final ${total:.2f} vs ${baseline:.0f} = {baseline/total:.0f}x cheaper")

# + Prompt caching         x0.5 -> $50.00/1M
# + Continuous batching    x0.25 -> $12.50/1M
# + Quantization (4-bit)   x0.5 -> $6.25/1M
# + Model routing          x0.4 -> $2.50/1M
# + Prompt trimming        x0.7 -> $1.75/1M
# final $1.75 vs $100 = 57x cheaper
```
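The code above always applies all five levers. A small variant, sketched here with the same illustrative multipliers, computes the bill for any subset, which mirrors toggling levers on and off. The names `LEVERS` and `cost_with` are hypothetical helpers introduced for this example.

```python
from itertools import combinations

# Same illustrative multipliers as the chapter's worked example.
LEVERS = {
    "Prompt caching": 0.50,
    "Continuous batching": 0.25,
    "Quantization (4-bit)": 0.50,
    "Model routing": 0.40,
    "Prompt trimming": 0.70,
}


def cost_with(enabled, baseline=100.0):
    """Cost per 1M tokens with only the named levers enabled."""
    total = baseline
    for name in enabled:
        total *= LEVERS[name]
    return total


# For each number of enabled levers, find the cheapest combination.
for k in range(len(LEVERS) + 1):
    best = min(combinations(LEVERS, k), key=cost_with)
    print(f"{k} lever(s): ${cost_with(best):.2f}/1M")
```

Under the multiplicative model the order of the levers does not matter, only which ones are enabled, so a subset is enough to describe any state of the stack.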

How a losing token becomes a winning one

Optimization → money

This chapter is the operational answer to Chapter 29's problem. There, a token was deeply unprofitable at low utilization; here, a stack of levers cuts its cost roughly 50-fold (57× in the worked example), which is what carries it across the line from loss to margin. Cost optimization isn't a nice-to-have; for most AI products it's the difference between a viable business and a subsidized demo.
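The step from cost to margin is simple arithmetic, sketched below. The selling price is a made-up figure for illustration, not a number from Chapter 29; only the $100 and $1.75 costs come from this chapter's worked example.

```python
def gross_margin(price_per_m: float, cost_per_m: float) -> float:
    """Gross margin as a fraction of price, per 1M tokens."""
    return (price_per_m - cost_per_m) / price_per_m


price = 10.0  # hypothetical selling price, $ per 1M tokens

for label, cost in [("naive", 100.0), ("optimized", 1.75)]:
    print(f"{label:9s} cost ${cost:6.2f} -> margin {gross_margin(price, cost):+.1%}")

# Cost reduction needed just to break even at this assumed price.
print(f"break-even needs a {100.0 / price:.0f}x reduction")
```

At this assumed price the naive token loses nine dollars for every dollar of revenue, while the optimized one keeps a little over 80 cents of each dollar. The sign of the margin flips once the stacked reduction passes the break-even factor.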

The compounding is also why the industry's cost curve falls so fast, and why the same capability keeps getting cheaper to serve every year. Providers stack these factors continuously, so the price of a given quality of intelligence deflates — the optimistic half of the Circuit's ledger, pushing against the rising capex on the other side.

For the desk, this is a caution against static analysis. A token that looks unprofitable at today's naive cost may be comfortably profitable once optimized — and today's price may already assume optimizations a competitor hasn't made. Reading AI economics honestly means asking not just "what does it cost?" but "what could it cost, fully optimized?" — because that's the number the market is racing toward.
