What is inference?

Training builds the model once. Inference is running it — the forward passes that turn your prompt into an answer, every single time you hit send. It happens in two very different phases, and it is the cost you pay forever.


Read the prompt fast; write the answer slow

Inference splits into two phases that feel nothing alike. Prefill reads your entire prompt in a single parallel pass — every token processed at once, the GPU running flat out. Then decode begins, and the model writes its answer one token at a time: each new token needs its own forward pass, and each pass depends on the one before, so they can't be parallelized.

That asymmetry is the whole story of inference cost. Reading is cheap and parallel; writing is sequential and slow. Click Run inference and watch it happen — the prompt lights up all at once, then the answer crawls out token by token:

Run inference — prefill, then decode

One parallel prefill pass over the prompt, then one forward pass per generated token.

[Interactive demo: prompt and generated answer, with counters for prefill passes, decode passes, and total forward passes.]
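The pass counts the demo tallies can be sketched in a few lines. This is a toy loop, not a real model: `toy_forward` is a hypothetical stand-in that returns a dummy token, and the only thing tracked is how many forward passes each phase takes, using the same accounting as the caption above (one prefill pass, then one pass per generated token).

```python
def toy_forward(tokens):
    """Hypothetical stand-in for a model: returns a dummy next token."""
    return f"<t{len(tokens)}>"


def run_inference(prompt_tokens, gen_len):
    """Return the generated tokens and a count of forward passes per phase."""
    passes = {"prefill": 0, "decode": 0}

    # Prefill: the whole prompt goes through in ONE parallel pass.
    context = list(prompt_tokens)
    passes["prefill"] += 1

    # Decode: each new token needs its own pass, and its result must be
    # appended to the context before the next pass can start.
    generated = []
    for _ in range(gen_len):
        next_token = toy_forward(context)
        passes["decode"] += 1
        context.append(next_token)
        generated.append(next_token)

    return generated, passes


answer, passes = run_inference(["What", "is", "inference", "?"], gen_len=6)
print(passes, "total:", sum(passes.values()))
# {'prefill': 1, 'decode': 6} total: 7
```

The loop body cannot be reordered or run in parallel, because each call needs the token produced by the previous one.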

Two phases, two bottlenecks

Prefill (compute-bound). The whole prompt is pushed through the model in one pass. Because all prompt tokens are available at once, the math is one big parallel matrix multiply — the GPU's compute is the limit, and it's used efficiently. This phase sets the "time to first token."
Decode (memory-bound). Now the model generates. Each token requires loading the model's weights from memory to do one forward pass, producing exactly one token, which is appended and fed back in. The arithmetic per step is tiny relative to the weights moved, so the bottleneck flips from compute to memory bandwidth — the GPU mostly waits on memory. This is why generation streams out at a steady, limited pace.
The KV cache. To avoid recomputing the whole prompt at every decode step, the model caches the per-token intermediate values (keys and values) — the KV cache. It's what makes decode merely slow instead of catastrophically slow, and it's a major consumer of GPU memory (Chapter 18).
No learning happens. Inference only runs the forward pass — no gradients, no weight updates. The model is frozen; it's a pure function from tokens to tokens.
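To see why the KV cache matters, it helps to count how many token positions have to be pushed through the model with and without it. The function below is a back-of-the-envelope sketch with a hypothetical name (`token_passes`); it counts token positions only and ignores the cost of attention itself, which also grows with context length.

```python
def token_passes(prompt_len, gen_len, use_kv_cache):
    """Count token positions processed over one request."""
    if use_kv_cache:
        # Prefill once, then each decode step processes only the newest token.
        return prompt_len + gen_len
    # No cache: every decode step re-reads the prompt plus everything
    # generated so far.
    return sum(prompt_len + step for step in range(gen_len))


cached = token_passes(500, 500, use_kv_cache=True)
uncached = token_passes(500, 500, use_kv_cache=False)
print(cached, uncached, f"{uncached / cached:.0f}x")
# 1000 374750 375x
```

Under these assumptions the cache trades extra GPU memory for a reduction of a few hundred times in repeated work on a 500-token prompt and 500-token answer.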

So the model you spent $100M training does the same forward-pass arithmetic on every request for the rest of its life. Making that arithmetic fast and cheap is what all of Part III is about.

Two FLOPs per parameter, per token

One forward pass through a model with $N$ parameters costs about $2N$ floating-point operations per token (a multiply and an add per weight). For a request with a prompt of $P$ tokens and a generated answer of $G$ tokens:

$$ \text{prefill} \approx 2N\,P \quad(\text{1 parallel pass}), \qquad \text{decode} \approx 2N\,G \quad(G \text{ sequential passes}) $$

$$ \text{total} \approx 2N\,(P + G) $$

The FLOP counts can be equal, but the wall-clock time isn't: prefill's $P$ tokens run together, while decode's $G$ tokens run strictly one after another. Decode is also memory-bound: its limiter isn't FLOPs but the bytes of weights streamed per token, roughly $N \cdot (\text{bytes per parameter})$ moved from memory each step to perform about $2N$ FLOPs. That ratio of FLOPs to bytes moved, the arithmetic intensity, is only about $2 / (\text{bytes per parameter})$ for a single request, which is why decode leaves a GPU's compute mostly idle, and why fewer bits per weight and batching matter so much.
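A small worked example makes the arithmetic-intensity argument concrete. The sketch below assumes a single request (a batch of one) and counts only weight traffic, ignoring the KV cache and activations. The bandwidth and peak-compute figures are hypothetical round numbers, not the specifications of any particular GPU, and `decode_step_estimate` is an illustrative name.

```python
def decode_step_estimate(n_params, bytes_per_param, mem_bandwidth, peak_flops):
    """Estimate one decode step for a single request (weights only)."""
    flops = 2 * n_params                      # one token through every weight
    bytes_moved = n_params * bytes_per_param  # every weight read once
    intensity = flops / bytes_moved           # FLOPs per byte
    t_memory = bytes_moved / mem_bandwidth    # seconds spent streaming weights
    t_compute = flops / peak_flops            # seconds of pure arithmetic
    return intensity, t_memory, t_compute


N = 70e9
for label, bytes_per_param in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    intensity, t_mem, t_cmp = decode_step_estimate(
        N,
        bytes_per_param,
        mem_bandwidth=2e12,  # assumed: 2 TB/s
        peak_flops=1e15,     # assumed: 1 PFLOP/s
    )
    print(f"{label}: {intensity:.0f} FLOPs/byte, "
          f"memory {t_mem * 1e3:.1f} ms, compute {t_cmp * 1e3:.2f} ms per token")
```

With these assumed numbers, the 16-bit case spends about 70 ms per token streaming weights against about 0.14 ms of arithmetic, and halving the bytes per parameter halves the memory time.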

The cost of one request

```python
def inference_flops(N, prompt_len, gen_len):
    per_token = 2 * N                 # ~2 FLOPs per parameter, per token
    prefill = per_token * prompt_len  # all prompt tokens, ONE parallel pass
    decode = per_token * gen_len      # gen tokens, that many SEQUENTIAL passes
    return prefill, decode


N = 70e9
pf, dc = inference_flops(N, 500, 500)
print(f"prefill: {pf:.2e} FLOPs (1 parallel pass, 500 prompt tokens)")
print(f"decode: {dc:.2e} FLOPs (500 sequential passes)")
print(f"total: {pf+dc:.2e} FLOPs per request")
# prefill: 7.00e+13 FLOPs (1 parallel pass, 500 prompt tokens)
# decode: 7.00e+13 FLOPs (500 sequential passes)
# total: 1.40e+14 FLOPs per request <- and decode's are one-at-a-time
```

The cost that never stops

Inference → money

Training is a one-time capital event. Inference is the bill that arrives forever: every request, from every user, runs the full forward pass again. Across the life of a widely used model, serving it typically costs more in total than training it did, which is why the data centers being built are sized for inference demand, not just training runs.

And the expensive half is decode. Because it's sequential and memory-bound, a single user's generation barely uses a GPU's compute — so providers pack many users' requests together (batching, Chapter 18) to fill the silicon. The entire discipline of inference engineering exists to claw back the efficiency that the one-token-at-a-time nature of decode throws away.
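The effect of batching can be sketched with a simple roofline-style estimate: one decode step takes whichever is longer, streaming the weights once or doing the arithmetic for every request in the batch. Everything here is an assumption for illustration: the hardware figures and hourly price are hypothetical placeholders, the whole model is treated as living on one accelerator, and KV-cache traffic (which in practice limits batch size) is ignored.

```python
def decode_cost_per_million_tokens(n_params, bytes_per_param, batch_size,
                                   mem_bandwidth, peak_flops,
                                   gpu_dollars_per_hour):
    """Rough dollars per million decoded tokens at a given batch size."""
    # Weights are read once per step and shared by every request in the batch.
    t_memory = n_params * bytes_per_param / mem_bandwidth
    # Arithmetic scales with the number of requests in the batch.
    t_compute = 2 * n_params * batch_size / peak_flops
    step_time = max(t_memory, t_compute)

    tokens_per_second = batch_size / step_time  # one token per request per step
    dollars_per_second = gpu_dollars_per_hour / 3600
    return dollars_per_second / tokens_per_second * 1e6


for batch in (1, 8, 64, 512):
    cost = decode_cost_per_million_tokens(
        70e9, 2, batch,
        mem_bandwidth=2e12,        # assumed: 2 TB/s
        peak_flops=1e15,           # assumed: 1 PFLOP/s
        gpu_dollars_per_hour=4.0,  # assumed price
    )
    print(f"batch {batch:>3}: ${cost:,.2f} per million tokens")
```

In this model, cost per token falls in proportion to batch size while the step is memory-bound, then flattens once the batch is large enough for compute to become the limit.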

This is the meter the Circuit watches most closely: every token decoded is a real, recurring cost, and the question is whether the revenue per token clears it. Training built the asset; inference is where the money is actually spent — and, one hopes, made.
