Divergent Compute.AI Economic Think Tank

First Principles / Part VI · Best practices & tools / Chapter 32

Cost optimization

AI serving cost isn't one number to shave — it's a stack of independent levers that multiply. Each cuts cost by a factor; stacked, they compound. This is the operational discipline that turns a token that loses money into one that makes it.

01 The answer, then the intuition

Independent factors, multiplied

The reason AI costs can fall so dramatically is that the savings multiply. Caching cuts cost in half; batching cuts what's left to a quarter; quantization halves that again; routing and prompt trimming take more still. Because each lever acts on a different part of the cost, they stack — five modest wins become one enormous one.

That's why "AI is too expensive" is usually a solvable engineering problem, not a fixed fact. With the levers stacked, a naive $100-per-million-tokens bill collapses toward a couple of dollars.

[Interactive figure: The cost stack (toggle the levers). Baseline: $100 / 1M tokens, naive, nothing enabled. Each lever's multiplier is illustrative but realistic.]

03 Mechanics

The five levers, and what each one moves

  • Prompt caching. When many requests share a long prefix — a system prompt, a document, few-shot examples — cache its KV state once and reuse it, so you don't reprocess it every call. For repeated-context workloads this alone can halve cost or more.
  • Continuous batching. Pack many requests into each forward pass so one weight-load serves them all. The single biggest lever on a memory-bound GPU's economics.
  • Quantization. Serve at 4- or 8-bit instead of 16, cutting the memory moved per token and letting you batch more. The quality loss is usually small, but it should be measured on your own eval, especially at 4-bit.
  • Model routing. Send easy requests to a cheap model and escalate only the hard ones. Captures most of the frontier's quality at a fraction of the price.
  • Prompt & output trimming. Every token is billed, so shorter prompts, tighter instructions, and capped output lengths cut cost on every single call — the least glamorous lever, and often the easiest.
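
As a rough sketch of where two of these multipliers could come from, the snippet below estimates a caching multiplier from the share of input tokens that hit a cached prefix, and a routing multiplier from the share of traffic a cheap model can absorb. Every name and number here (`cached_share`, `cached_price_ratio`, `easy_share`, `cheap_price_ratio`) is a hypothetical assumption for illustration, not a measured value or any provider's pricing. It also simplifies: the whole bill is treated as input-token cost, and escalated requests are assumed not to pay for a failed cheap attempt.

```python
def caching_multiplier(cached_share: float, cached_price_ratio: float) -> float:
    """Cost multiplier when a share of input tokens is read from a cached prefix.

    cached_share: fraction of input tokens served from cache (0..1).
    cached_price_ratio: cost of a cached token relative to a fresh one (0..1).
    """
    return (1.0 - cached_share) + cached_share * cached_price_ratio


def routing_multiplier(easy_share: float, cheap_price_ratio: float) -> float:
    """Blended cost multiplier when easy requests go to a cheaper model.

    easy_share: fraction of requests the cheap model handles (0..1).
    cheap_price_ratio: cheap model's cost relative to the large model (0..1).
    """
    return (1.0 - easy_share) + easy_share * cheap_price_ratio


# Hypothetical workload, chosen to land near the chapter's illustrative multipliers.
cache_f = caching_multiplier(cached_share=0.625, cached_price_ratio=0.2)
route_f = routing_multiplier(easy_share=0.75, cheap_price_ratio=0.2)
print(f"caching x{cache_f:.2f}, routing x{route_f:.2f}")
```

The point of writing the multipliers this way is that each one depends on a property of the workload (how much context repeats, how many requests are easy), so the same lever can be worth a lot on one product and little on another.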

The discipline is to treat cost as a product of factors and attack each independently, always guarding quality with an eval so a saving doesn't quietly become a regression. Stacked carefully, order-of-magnitude reductions are routine.
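
One way to make the eval guard concrete is a gate that accepts a lever only when it saves money and keeps an eval score within a tolerance of the baseline. The sketch below is a minimal illustration; `LeverTrial`, `accept_lever`, the scores, and the 1-point tolerance are all hypothetical, and a real gate would run an actual eval suite rather than use hard-coded numbers.

```python
from dataclasses import dataclass


@dataclass
class LeverTrial:
    name: str
    cost_multiplier: float  # measured cost after / cost before
    eval_score: float       # eval score with the lever enabled (0..100)


def accept_lever(trial: LeverTrial, baseline_score: float, max_drop: float = 1.0) -> bool:
    """Accept a lever only if it cuts cost and the eval drop stays within max_drop."""
    saves_money = trial.cost_multiplier < 1.0
    quality_ok = (baseline_score - trial.eval_score) <= max_drop
    return saves_money and quality_ok


baseline_score = 90.0  # hypothetical eval score with no levers enabled
trials = [
    LeverTrial("Quantization (8-bit)", 0.60, 89.6),
    LeverTrial("Quantization (4-bit)", 0.50, 86.0),  # cheaper, but drops 4 points
    LeverTrial("Prompt trimming", 0.70, 90.1),
]
accepted = [t.name for t in trials if accept_lever(t, baseline_score)]
print(accepted)
```

In this made-up run the 4-bit trial is rejected despite having the best multiplier, which is the behavior the gate exists to enforce.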

04 The math

Why savings compound

Because each lever scales a different part of the pipeline, total cost is the baseline times the product of the multipliers, not the baseline minus a sum of savings:

$$ \text{cost} = \text{baseline} \times \prod_{i} f_i, \qquad 0 < f_i < 1 $$

Multiplication is what makes the effect so large. Five levers of $\{0.5, 0.25, 0.5, 0.4, 0.7\}$ give:

$$ \$100 \times 0.5 \times 0.25 \times 0.5 \times 0.4 \times 0.7 = \$1.75 \;\;\Rightarrow\;\; \approx 57\times \text{ cheaper} $$

No single lever did that — the biggest was only 4× on its own. The compounding is the point: a stack of merely-good optimizations produces a great one. It also explains why serving prices have fallen so steeply industry-wide — providers are stacking these same factors, and the product keeps shrinking. (The multipliers aren't truly independent — caching and batching interact — but the multiplicative model is the right first-order intuition.)
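
A quick check with the same five illustrative multipliers shows why the savings cannot simply be added. Writing each lever's saving as $s_i = 1 - f_i$ (a symbol introduced here for convenience), the sum of savings exceeds 100%, which would imply a negative cost, while the product stays well defined:

$$ \sum_i s_i = 0.5 + 0.75 + 0.5 + 0.6 + 0.3 = 2.65 > 1, \qquad 1 - \prod_i f_i = 1 - 0.0175 = 0.9825 $$

Taking logarithms turns the product into a sum, which gives each lever an additive share of the total reduction:

$$ \ln\frac{\text{baseline}}{\text{cost}} = \sum_i \ln\frac{1}{f_i} \approx 0.693 + 1.386 + 0.693 + 0.916 + 0.357 \approx 4.05 $$

On this view batching contributes about a third of the log-reduction ($1.386 / 4.05$), and the other four levers together supply the remaining two thirds.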

05 The code

Stacking the levers

Apply each multiplier in turn and watch the running total fall.

cost_stack.py

baseline = 100.0    # $ per 1M tokens, naive setup
levers = {
    "Prompt caching":       0.50, "Continuous batching": 0.25,
    "Quantization (4-bit)": 0.50, "Model routing":       0.40,
    "Prompt trimming":      0.70,
}

total = baseline
for name, f in levers.items():
    total *= f
    # Pad the multiplier to 4 characters so the "->" column lines up as in the output below.
    print(f"+ {name:22s} x{f:<4}  -> ${total:.2f}/1M")

print(f"final ${total:.2f} vs ${baseline:.0f} = {baseline/total:.0f}x cheaper")
# + Prompt caching         x0.5   -> $50.00/1M
# + Continuous batching    x0.25  -> $12.50/1M
# + Quantization (4-bit)   x0.5   -> $6.25/1M
# + Model routing          x0.4   -> $2.50/1M
# + Prompt trimming        x0.7   -> $1.75/1M
# final $1.75 vs $100 = 57x cheaper
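
To explore what happens with only some levers enabled, the stack can be wrapped in a small function. This is a sketch using the same illustrative multipliers as `cost_stack.py`; the function name `stacked_cost` and its `enabled` argument are assumptions made here, not part of the original script.

```python
from math import prod

LEVERS = {
    "Prompt caching": 0.50,
    "Continuous batching": 0.25,
    "Quantization (4-bit)": 0.50,
    "Model routing": 0.40,
    "Prompt trimming": 0.70,
}


def stacked_cost(baseline: float, enabled: set) -> float:
    """Return cost per 1M tokens with only the named levers switched on."""
    unknown = set(enabled) - set(LEVERS)
    if unknown:
        raise KeyError(f"unknown levers: {sorted(unknown)}")
    # Multiply in a fixed order so results are reproducible across runs.
    return baseline * prod(f for name, f in LEVERS.items() if name in enabled)


nothing = stacked_cost(100.0, set())
two = stacked_cost(100.0, {"Prompt caching", "Continuous batching"})
everything = stacked_cost(100.0, set(LEVERS))
print(f"none ${nothing:.2f} | caching+batching ${two:.2f} | all ${everything:.2f}")
```

With no levers the product is empty and the function returns the baseline unchanged, so the same call covers every combination of toggles.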

06 The economics

How a losing token becomes a winning one

Optimization → money

This chapter is the operational answer to Chapter 29's problem. There, a token was deeply unprofitable at low utilization; here, the illustrative stack of levers cuts its cost by roughly 57×, which is the kind of reduction that can carry it across the line from loss to margin. Cost optimization isn't a nice-to-have; for most AI products it's the difference between a viable business and a subsidized demo.
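
To see how a cost reduction can flip the sign of the margin, consider a sketch with a purely hypothetical selling price of $10 per 1M tokens. That price is an assumption for illustration and is not taken from Chapter 29; the names `PRICE` and `gross_margin` are likewise invented here. The two cost figures are the chapter's illustrative baseline and fully stacked result.

```python
PRICE = 10.0           # hypothetical selling price, $ per 1M tokens
NAIVE_COST = 100.0     # the chapter's illustrative naive baseline
OPTIMIZED_COST = 1.75  # the chapter's illustrative fully stacked cost


def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price; negative means each token loses money."""
    return (price - cost) / price


for label, cost in [("naive", NAIVE_COST), ("optimized", OPTIMIZED_COST)]:
    print(f"{label:9s} cost ${cost:6.2f} -> margin {gross_margin(PRICE, cost):+.1%}")
```

Under these assumed numbers the naive setup loses nine times its revenue on every token, while the optimized one keeps most of the price as margin. The price itself never changed; only the cost side moved.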

The compounding is also why the industry's cost curve falls so fast, and why the same capability keeps getting cheaper to serve every year. Providers stack these factors continuously, so the price of a given quality of intelligence deflates — the optimistic half of the Circuit's ledger, pushing against the rising capex on the other side.

For the desk, this is a caution against static analysis. A token that looks unprofitable at today's naive cost may be comfortably profitable once optimized — and today's price may already assume optimizations a competitor hasn't made. Reading AI economics honestly means asking not just "what does it cost?" but "what could it cost, fully optimized?" — because that's the number the market is racing toward.

07 Going deeper

The primary sources

Anthropic — Prompt Caching · reusing a cached prefix to cut cost and latency.
Kwon et al. (2023) — vLLM / PagedAttention · the batching engine behind the savings.
Chen et al. (2023) — FrugalGPT · routing and cascades as a cost lever.
SemiAnalysis · how inference cost is falling across the stack.

Cite this chapter: Divergent Compute, "Cost optimization", First Principles, 2026. divergentcompute.com/first-principles-cost-optimization · v1.0 · CC-BY.