Divergent Compute.AI Economic Think Tank

First Principles / Part III · Inference & systems / Chapter 14


What is inference?

Training builds the model once. Inference is running it — the forward passes that turn your prompt into an answer, every single time you hit send. It happens in two very different phases, and it is the cost you pay forever.


01 The answer, then the intuition

Read the prompt fast; write the answer slow

Inference splits into two phases that feel nothing alike. Prefill reads your entire prompt in a single parallel pass — every token processed at once, the GPU running flat out. Then decode begins, and the model writes its answer one token at a time: each new token needs its own forward pass, and each pass depends on the one before, so they can't be parallelized.

That asymmetry is the whole story of inference cost. Reading is cheap and parallel; writing is sequential and slow. Click Run inference and watch it happen — the prompt lights up all at once, then the answer crawls out token by token:

[Interactive demo: "Run inference: prefill, then decode". One parallel prefill pass over the prompt, then one forward pass per generated token. Counters shown: prefill passes, decode passes, total forward passes.]
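For readers without the live demo, the same pass counting can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real model: `prefill`, `decode_step`, and `run_inference` are hypothetical names, the "next token" rule is arbitrary arithmetic, and the counting follows this chapter's simplified accounting of one prefill pass plus one pass per generated token.

```python
def prefill(prompt_tokens):
    """One pass over the whole prompt: every token is visible at once."""
    return {"context": list(prompt_tokens)}


def decode_step(state):
    """One pass that yields exactly one token and appends it to the context."""
    context = state["context"]
    # Toy rule (an assumption for illustration): the next token depends on
    # everything produced so far, so step t+1 cannot start before step t ends.
    next_token = (sum(context) * 31 + len(context)) % 1000
    context.append(next_token)
    return next_token


def run_inference(prompt_tokens, gen_len):
    counts = {"prefill": 0, "decode": 0}
    state = prefill(prompt_tokens)      # the single parallel pass
    counts["prefill"] += 1
    answer = []
    for _ in range(gen_len):            # strictly sequential passes
        answer.append(decode_step(state))
        counts["decode"] += 1
    counts["total"] = counts["prefill"] + counts["decode"]
    return answer, counts


answer, counts = run_inference([12, 7, 301, 44], gen_len=5)
print(counts)
# {'prefill': 1, 'decode': 5, 'total': 6}
```

The loop is the point: nothing inside it can be hoisted out or run side by side, because each call reads the token the previous call appended.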

02 Mechanics

Two phases, two bottlenecks

  • Prefill (compute-bound). The whole prompt is pushed through the model in one pass. Because all prompt tokens are available at once, the math is one big parallel matrix multiply — the GPU's compute is the limit, and it's used efficiently. This phase sets the "time to first token."
  • Decode (memory-bound). Now the model generates. Each token requires loading the model's weights from memory to do one forward pass, producing exactly one token, which is appended and fed back in. The arithmetic per step is tiny relative to the weights moved, so the bottleneck flips from compute to memory bandwidth — the GPU mostly waits on memory. This is why generation streams out at a steady, limited pace.
  • The KV cache. Without a cache, every decode step would recompute the keys and values for the whole sequence so far, prompt and generated tokens alike. Instead the model stores those per-token intermediate values (keys and values) in the KV cache, so each step only has to process the newest token. It's what makes decode merely slow instead of catastrophically slow, and it's a major consumer of GPU memory (Chapter 18).
  • No learning happens. Inference only runs the forward pass — no gradients, no weight updates. The model is frozen; it's a pure function from tokens to tokens.
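To see how much work the KV cache saves, here is a rough count of how many token positions get pushed through the model with and without it. It is a sketch under simplifying assumptions: the function names are hypothetical, a "token pass" means one token position processed by the model, and a cacheless decode step is assumed to reprocess the entire sequence that exists before it.

```python
def token_passes_without_cache(prompt_len, gen_len):
    """Every decode step reprocesses the whole sequence built so far."""
    total = 0
    for step in range(gen_len):
        total += prompt_len + step   # prompt plus tokens generated before this step
    return total


def token_passes_with_cache(prompt_len, gen_len):
    """Prefill touches each prompt token once; each decode step touches one token."""
    return prompt_len + gen_len


P, G = 500, 500
no_cache = token_passes_without_cache(P, G)
cached = token_passes_with_cache(P, G)
print(f"without cache: {no_cache:,} token passes")
print(f"with cache:    {cached:,} token passes")
print(f"ratio:         {no_cache / cached:.1f}x")
# without cache: 374,750 token passes
# with cache:    1,000 token passes
# ratio:         374.8x
```

The cacheless count grows with the square of the answer length, while the cached count grows linearly. The price of that saving is the memory needed to hold the cached keys and values.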

So the model you spent $100M training does the same forward-pass arithmetic on every request for the rest of its life. Making that arithmetic fast and cheap is what all of Part III is about.

04 The math

Two FLOPs per parameter, per token

One forward pass through a model with $N$ parameters costs about $2N$ floating-point operations per token (a multiply and an add per weight). For a request with a prompt of $P$ tokens and a generated answer of $G$ tokens:

$$ \text{prefill} \approx 2N\,P \quad(\text{1 parallel pass}), \qquad \text{decode} \approx 2N\,G \quad(G \text{ sequential passes}) $$
$$ \text{total} \approx 2N\,(P + G) $$

The FLOP counts can be equal, but the wall-clock time isn't: prefill's $P$ tokens run together, while decode's $G$ tokens run strictly one after another. The decode phase is also memory-bound: its limiter isn't FLOPs but the bytes of weights streamed per token, roughly $N \cdot (\text{bytes per parameter})$ moved from memory each step. The ratio of the two, about $2N$ FLOPs for every $N \cdot (\text{bytes per parameter})$ bytes moved, is the arithmetic intensity. For a single request with 16-bit weights that is only about one FLOP per byte, which is why decode leaves a GPU's compute mostly idle, and why fewer bits and batching matter so much.
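The memory-bound argument can be made concrete with a back-of-the-envelope roofline estimate. The sketch below is an illustration, not a benchmark: `PEAK_FLOPS` and `MEM_BANDWIDTH` are made-up round numbers for a hypothetical accelerator, the function names are invented here, and the estimate counts only weight traffic (it ignores KV-cache reads, activations, and the fact that a 70B model may be split across several chips).

```python
# Hypothetical accelerator, round numbers chosen for illustration only.
PEAK_FLOPS = 1e15        # floating-point operations per second
MEM_BANDWIDTH = 3e12     # bytes per second


def decode_step_profile(n_params, bytes_per_param, batch_size=1):
    """Estimate one decode step: FLOPs done vs. weight bytes streamed."""
    flops = 2 * n_params * batch_size           # ~2 FLOPs per parameter, per token
    bytes_moved = n_params * bytes_per_param    # weights are read once per step
    compute_time = flops / PEAK_FLOPS
    memory_time = bytes_moved / MEM_BANDWIDTH
    return {
        "intensity": flops / bytes_moved,       # FLOPs per byte
        "compute_time": compute_time,
        "memory_time": memory_time,
        "bound": "memory" if memory_time > compute_time else "compute",
    }


def break_even_batch(bytes_per_param):
    """Batch size at which compute time catches up with memory time."""
    return (PEAK_FLOPS / MEM_BANDWIDTH) * bytes_per_param / 2


fp16 = decode_step_profile(70e9, bytes_per_param=2)
int8 = decode_step_profile(70e9, bytes_per_param=1)
print(f"fp16: {fp16['intensity']:.1f} FLOP/byte, {fp16['bound']}-bound")
print(f"int8: {int8['intensity']:.1f} FLOP/byte, {int8['bound']}-bound")
print(f"break-even batch at fp16: ~{break_even_batch(2):.0f} requests")
```

Under these assumed numbers a single fp16 request sits at about one FLOP per byte, far below what the chip could sustain. Halving the bytes per parameter halves the time spent streaming weights, and batching raises the FLOPs done per byte moved, which are the two levers the paragraph above points to.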

05 The code

The cost of one request

Prefill vs decode FLOPs for a 70B model answering a 500-token prompt with a 500-token reply.

inference.py

def inference_flops(N, prompt_len, gen_len):
    per_token = 2 * N                  # ~2 FLOPs per parameter, per token
    prefill = per_token * prompt_len   # all prompt tokens, ONE parallel pass
    decode  = per_token * gen_len      # gen tokens, that many SEQUENTIAL passes
    return prefill, decode

N = 70e9
pf, dc = inference_flops(N, 500, 500)
print(f"prefill: {pf:.2e} FLOPs  (1 parallel pass, 500 prompt tokens)")
print(f"decode:  {dc:.2e} FLOPs  (500 sequential passes)")
print(f"total:   {pf+dc:.2e} FLOPs per request")
# prefill: 7.00e+13 FLOPs  (1 parallel pass, 500 prompt tokens)
# decode:  7.00e+13 FLOPs  (500 sequential passes)
# total:   1.40e+14 FLOPs per request   <- and decode's are one-at-a-time

06 The economics

The cost that never stops

Inference → money

Training is a one-time capital event. Inference is the bill that arrives forever: every request, from every user, runs the full forward pass again. For a widely used model, the total cost of serving it over its lifetime typically exceeds what training cost, which is why the data centers being built are sized for inference demand, not just training runs.

And the expensive half is decode. Because it's sequential and memory-bound, a single user's generation barely uses a GPU's compute — so providers pack many users' requests together (batching, Chapter 18) to fill the silicon. The entire discipline of inference engineering exists to claw back the efficiency that the one-token-at-a-time nature of decode throws away.
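A rough way to see why batching matters for the bill is to estimate decode throughput and cost per token as the batch grows. Everything numeric below is an assumption for illustration: the accelerator figures and the hourly price are invented round numbers, the function names are hypothetical, and the model counts only weight traffic, ignoring KV-cache reads, scheduling overhead, and multi-chip serving.

```python
# Hypothetical figures, chosen only to make the arithmetic concrete.
PEAK_FLOPS = 1e15          # FLOP/s
MEM_BANDWIDTH = 3e12       # bytes/s
DOLLARS_PER_HOUR = 2.0     # assumed rental price for the accelerator


def decode_tokens_per_second(n_params, bytes_per_param, batch_size):
    """Tokens produced per second when batch_size requests share each step."""
    memory_time = n_params * bytes_per_param / MEM_BANDWIDTH   # stream weights once
    compute_time = 2 * n_params * batch_size / PEAK_FLOPS      # math for every request
    step_time = max(memory_time, compute_time)                 # the slower one wins
    return batch_size / step_time                              # one token per request per step


def cost_per_million_tokens(tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return DOLLARS_PER_HOUR / tokens_per_hour * 1e6


for batch in (1, 8, 64, 512):
    tps = decode_tokens_per_second(70e9, 2, batch)
    print(f"batch {batch:>3}: {tps:8.0f} tok/s, "
          f"${cost_per_million_tokens(tps):.2f} per million tokens")
```

While the step is memory-bound, adding requests to the batch is nearly free, so cost per token falls almost in proportion to batch size. Once compute becomes the limit, throughput stops improving and the cost curve flattens.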

This is the meter the Circuit watches most closely: every token decoded is a real, recurring cost, and the question is whether the revenue per token clears it. Training built the asset; inference is where the money is actually spent — and, one hopes, made.

07 Going deeper

The primary sources

Pope et al. (2022) — Efficiently Scaling Transformer Inference · prefill vs decode, the cost model.
Kwon et al. (2023) — PagedAttention / vLLM · the modern serving engine.
Kaplan et al. (2020) — Scaling Laws · the $2N$ FLOPs-per-token accounting.
How to Scale Your Model (Google DeepMind) · a deep, free reference on inference arithmetic.

Cite this chapter: Divergent Compute, "What is inference?", First Principles, 2026. divergentcompute.com/first-principles-inference · v1.0 · CC-BY.
