First Principles / Part III · Inference & systems / Chapter 14
Training builds the model once. Inference is running it — the forward passes that turn your prompt into an answer, every single time you hit send. It happens in two very different phases, and it is the cost you pay forever.
01 The answer, then the intuition
Inference splits into two phases that feel nothing alike. Prefill reads your entire prompt in a single parallel pass — every token processed at once, the GPU running flat out. Then decode begins, and the model writes its answer one token at a time: each new token needs its own forward pass, and each pass depends on the one before, so they can't be parallelized.
That asymmetry is the whole story of inference cost. Reading is cheap and parallel; writing is sequential and slow. The prompt is taken in all at once, then the answer crawls out token by token.
[Interactive demo: one parallel prefill pass over the prompt, then one forward pass per generated token.]
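The two phases can be sketched as a toy generation loop. This is a minimal illustration, not a real model: `toy_forward` is a hypothetical stand-in for a forward pass, and the token values are arbitrary integers. What it shows is the data dependency: the prompt is fully known before the first pass, while each decode pass needs the token produced by the pass before it.

```python
def toy_forward(tokens):
    """Hypothetical stand-in for one forward pass over a context.

    A real model would return a probability distribution over the vocabulary;
    here the "next token" is a simple deterministic function of the context.
    """
    return (sum(tokens) + len(tokens)) % 50


def generate(prompt_tokens, gen_len):
    """Return (generated_tokens, number_of_forward_passes)."""
    forward_passes = 0

    # Prefill: every prompt token is known up front, so a real system can
    # process them together in one parallel pass.
    context = list(prompt_tokens)
    generated = [toy_forward(context)]  # in this sketch, prefill also yields the first token
    forward_passes += 1

    # Decode: each pass needs the token produced by the previous pass,
    # so these passes cannot overlap.
    while len(generated) < gen_len:
        context.append(generated[-1])
        generated.append(toy_forward(context))
        forward_passes += 1

    return generated, forward_passes


tokens, passes = generate([3, 1, 4], gen_len=4)
print(tokens, passes)
```

However long the prompt is, it costs one pass here; the number of passes grows only with the length of the answer.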
02 Mechanics
So the model you spent $100M training does the same forward-pass arithmetic on every request for the rest of its life. Making that arithmetic fast and cheap is what all of Part III is about.
04 The math
One forward pass through a model with $N$ parameters costs about $2N$ floating-point operations per token (a multiply and an add per weight). For a request with a prompt of $P$ tokens and a generated answer of $G$ tokens:
$$\text{prefill} \approx 2NP, \qquad \text{decode} \approx 2NG, \qquad \text{total} \approx 2N(P + G)$$
The FLOP counts can be equal, but the wall-clock time isn't: prefill's $P$ tokens run together, while decode's $G$ tokens run strictly one after another. The decode phase is also memory-bound: its limiter isn't FLOPs but the bytes of weights streamed per token, roughly $N \cdot (\text{bytes per parameter})$ moved from memory each step against only about $2N$ FLOPs of work. That ratio of FLOPs to bytes moved, the arithmetic intensity, is why decode leaves a GPU's compute mostly idle, and why fewer bits and batching matter so much.
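To make the memory-bound claim concrete, here is a rough sketch that compares two lower bounds on the time for one decode step of a single request: the time to do the arithmetic and the time to stream the weights. The hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are assumed round numbers for a hypothetical accelerator, not measurements of any real chip, and the estimate ignores the KV cache, activations, and splitting the model across devices.

```python
def decode_step_bounds(n_params, bytes_per_param, peak_flops, mem_bandwidth):
    """Lower bounds on the time for one decode step of a single request.

    peak_flops (FLOP/s) and mem_bandwidth (bytes/s) describe a hypothetical
    accelerator; they are illustrative inputs, not measured values.
    """
    flops = 2 * n_params                      # ~2 FLOPs per parameter per token
    bytes_moved = n_params * bytes_per_param  # every weight is read once per step
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    intensity = flops / bytes_moved           # arithmetic intensity, FLOPs per byte
    return compute_time, memory_time, intensity


N = 70e9
PEAK_FLOPS = 1e15      # assumed: 1 PFLOP/s, held fixed across precisions for simplicity
MEM_BANDWIDTH = 3e12   # assumed: 3 TB/s

for name, bytes_per_param in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    c, m, ai = decode_step_bounds(N, bytes_per_param, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"{name}: compute {c * 1e3:.2f} ms, memory {m * 1e3:.2f} ms, {ai:.0f} FLOPs/byte")
```

Under these assumed numbers, streaming 16-bit weights takes a few hundred times longer than the arithmetic, and halving the bits per weight halves the memory time.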
05 The code
Prefill vs decode FLOPs for a 70B model answering a 500-token prompt with a 500-token reply.
inference.py
def inference_flops(N, prompt_len, gen_len):
    per_token = 2 * N                  # ~2 FLOPs per parameter, per token
    prefill = per_token * prompt_len   # all prompt tokens, ONE parallel pass
    decode = per_token * gen_len       # gen tokens, that many SEQUENTIAL passes
    return prefill, decode


N = 70e9
pf, dc = inference_flops(N, 500, 500)
print(f"prefill: {pf:.2e} FLOPs (1 parallel pass, 500 prompt tokens)")
print(f"decode: {dc:.2e} FLOPs (500 sequential passes)")
print(f"total: {pf+dc:.2e} FLOPs per request")
# prefill: 7.00e+13 FLOPs (1 parallel pass, 500 prompt tokens)
# decode: 7.00e+13 FLOPs (500 sequential passes)
# total: 1.40e+14 FLOPs per request <- and decode's are one-at-a-time
06 The economics
Inference → money
Training is a one-time capital event. Inference is the bill that arrives forever — every request, from every user, runs the full forward pass again. Across a model's life, serving it almost always costs far more in total than training it did, which is why the data centers being built are sized for inference demand, not just training runs.
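A back-of-the-envelope check of that claim, in FLOPs rather than dollars. It rests on two assumptions not made in this chapter: the common estimate that training costs about $6N$ FLOPs per training token (forward plus backward), and a hypothetical training set of 2 trillion tokens. It also ignores the fact that decode FLOPs are used less efficiently than training FLOPs.

```python
def break_even_tokens(n_params, train_tokens):
    """Served tokens at which cumulative inference FLOPs equal training FLOPs."""
    train_flops = 6 * n_params * train_tokens  # assumed ~6N FLOPs per training token
    infer_flops_per_token = 2 * n_params       # ~2N FLOPs per served token
    return train_flops / infer_flops_per_token


N = 70e9
TRAIN_TOKENS = 2e12  # hypothetical training set size, for illustration only

print(f"break-even: {break_even_tokens(N, TRAIN_TOKENS):.1e} served tokens")
```

The model size cancels out: under these assumptions the break-even point is three times the number of training tokens, whatever the value of $N$. A widely used model passes that point and keeps going.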
And the expensive half is decode. Because it's sequential and memory-bound, a single user's generation barely uses a GPU's compute — so providers pack many users' requests together (batching, Chapter 18) to fill the silicon. The entire discipline of inference engineering exists to claw back the efficiency that the one-token-at-a-time nature of decode throws away.
This is the meter the Circuit watches most closely: every token decoded is a real, recurring cost, and the question is whether the revenue per token clears it. Training built the asset; inference is where the money is actually spent — and, one hopes, made.
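Why packing many users' requests together helps can be sketched with the same $2N$ accounting. When a batch of requests shares one decode step, the weights are read from memory once while each request does its own arithmetic, so FLOPs per byte rise with batch size. The hardware numbers below are assumed round figures for a hypothetical accelerator, and the sketch ignores KV-cache traffic, which grows with batch size and context length.

```python
def batched_decode_intensity(n_params, bytes_per_param, batch_size):
    """FLOPs per byte of weights when batch_size requests share one decode step."""
    flops = 2 * n_params * batch_size         # each request does its own arithmetic
    bytes_moved = n_params * bytes_per_param  # weights are read once for the whole batch
    return flops / bytes_moved


def break_even_batch(bytes_per_param, peak_flops, mem_bandwidth):
    """Batch size at which streaming the weights and doing the arithmetic take equal time."""
    return bytes_per_param * peak_flops / (2 * mem_bandwidth)


N = 70e9
PEAK_FLOPS = 1e15      # assumed: 1 PFLOP/s
MEM_BANDWIDTH = 3e12   # assumed: 3 TB/s

for batch in (1, 8, 64):
    print(f"batch {batch}: {batched_decode_intensity(N, 2, batch):.0f} FLOPs/byte")
print(f"break-even batch: {break_even_batch(2, PEAK_FLOPS, MEM_BANDWIDTH):.0f}")
```

Below the break-even batch size the step is limited by memory bandwidth, so adding requests raises throughput almost for free; above it, compute becomes the limit.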
07 Going deeper
Pope et al. (2022) — Efficiently Scaling Transformer Inference · prefill vs decode, the cost model.
Kwon et al. (2023) — PagedAttention / vLLM · the modern serving engine.
Kaplan et al. (2020) — Scaling Laws · the $2N$ FLOPs-per-token accounting.
How to Scale Your Model (Google DeepMind) · a deep, free reference on inference arithmetic.
Cite this chapter: Divergent Compute, "What is inference?", First Principles, 2026. divergentcompute.com/first-principles-inference · v1.0 · CC-BY.