First Principles / Part III · Inference & systems / Chapter 18
Two tricks make serving LLMs affordable. The KV cache stops the model recomputing the whole prompt every step; batching reuses one expensive weight-load across many users at once. Together they turn a wasteful, memory-bound GPU into a profit center — until the cache eats the memory.
01 The answer, then the intuition
Two facts from earlier collide here. From attention: each new token attends to every previous token — so naively, generating token 1,000 would recompute the first 999 from scratch, again and again. From inference: decode is memory-bound, so a single user leaves the GPU's compute mostly idle.
The KV cache fixes the first: store each token's attention keys and values once, and every later step just reads them — turning quadratic recompute into a cheap lookup, at the price of memory. Batching fixes the second: since one weight-load can serve many requests in the same pass, pack dozens of users together and aggregate throughput climbs almost for free.
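To make the saving concrete, here is a minimal sketch that counts how many key/value projections a decode loop performs with and without a cache. The `project_kv` function is a hypothetical stand-in for the real per-token projection, and the token IDs are arbitrary; only the counts matter.

```python
# Toy decode loop: count how many key/value projections each strategy computes.
# project_kv is a hypothetical stand-in for the real per-token K/V projection.

def project_kv(token_id):
    """Pretend projection: returns a (key, value) pair for one token."""
    return (token_id * 2, token_id * 3)

def decode_without_cache(tokens):
    """Re-project every earlier token at every step."""
    projections = 0
    kv = []
    for step in range(1, len(tokens) + 1):
        kv = [project_kv(t) for t in tokens[:step]]  # recomputed from scratch
        projections += len(kv)
    return projections, kv

def decode_with_cache(tokens):
    """Project each token once, append it to the cache, and reuse the rest."""
    projections = 0
    cache = []
    for t in tokens:
        cache.append(project_kv(t))  # only the new token is projected
        projections += 1
    return projections, cache

tokens = list(range(1000))
naive_count, naive_kv = decode_without_cache(tokens)
cached_count, cached_kv = decode_with_cache(tokens)

assert naive_kv == cached_kv  # same keys and values either way
print(f"without cache: {naive_count:,} projections")
print(f"with cache:    {cached_count:,} projections, {len(cached_kv)} entries held in memory")
```

The two strategies end with identical keys and values; the cached version does linear work but must keep every entry resident, which is the memory cost the rest of the chapter is about.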
But the KV cache and the batch compete for the same HBM. As the batch size grows, throughput soars, until the cache exhausts the memory and the whole thing falls over.
[Interactive figure: throughput versus batch size. A 70B model on an 8×80GB node (~500 GB free after weights), each request at 8K context. Illustrative.]
02 Mechanics
So the serving problem is a memory allocation problem in disguise: fit the most revenue-generating requests you can into a fixed budget of very expensive bandwidth and capacity.
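One way to picture that allocation problem is a greedy admission loop that reserves KV-cache space per request until the budget runs out. This is a sketch only: the shortest-context-first policy, the `admit` helper, and the request queue are all illustrative assumptions, and it uses the full multi-head KV size for a model with 80 layers and hidden size 8192 at 16-bit.

```python
# Greedy admission sketch: admit requests until the KV budget is spent.
# Shortest-context-first is an illustrative policy, not a production scheduler.

KV_BYTES_PER_TOKEN = 2 * 80 * 8192 * 2   # K and V, 80 layers, d=8192, 16-bit
FREE_HBM = 500e9                          # bytes left after weights (illustrative)

def admit(context_lengths, budget=FREE_HBM):
    """Return the context lengths admitted and the bytes they reserve."""
    admitted, used = [], 0
    for n in sorted(context_lengths):
        need = n * KV_BYTES_PER_TOKEN
        if used + need > budget:
            break  # sorted ascending, so nothing later fits either
        admitted.append(n)
        used += need
    return admitted, used

# Hypothetical queue: many short chats plus some long-context requests.
queue = [2048] * 40 + [8192] * 20 + [32768] * 4
admitted, used = admit(queue)
print(f"admitted {len(admitted)} of {len(queue)} requests, "
      f"{used / 1e9:.0f} GB of KV cache reserved")
```

With this queue the short requests all fit, only some of the 8K requests do, and none of the 32K requests are admitted: long contexts are the first to be squeezed out of a fixed budget.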
04 The math
For a model with $L$ layers and hidden size $d$, at $b$ bytes per number, the KV cache stores a key and a value per token per layer:
$$\text{KV}_{\text{token}} = 2 \cdot L \cdot d \cdot b$$
For a batch of $B$ requests each holding $n$ tokens of context, the total cache is:
$$\text{KV}_{\text{total}} = B \cdot n \cdot 2 \cdot L \cdot d \cdot b$$
A large model with full multi-head attention ($L=80$, $d=8192$, 16-bit) caches $2\cdot80\cdot8192\cdot2 \approx 2.6$ MB per token, so 8,192 tokens is ~21.5 GB per request. (Real 70B-class models use grouped-query attention and cache roughly 8× less; the batch-vs-cache tension is identical, just at a smaller scale.) Aggregate throughput rises with the batch, throughput $\approx B \times (\text{single-stream rate})$, until either compute saturates or the cache hits the HBM limit. Writing $M_{\text{free}}$ for the HBM left after the weights are loaded, the cache caps the batch at:
$$B_{\max} = \left\lfloor \frac{M_{\text{free}}}{n \cdot 2 \cdot L \cdot d \cdot b} \right\rfloor$$
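The grouped-query remark above can be checked with a short calculation. The head counts here are assumptions chosen to match the "roughly 8×" figure (64 query heads of dimension 128, so $64 \times 128 = 8192 = d$, sharing 8 key/value heads); real models vary.

```python
# KV cache per token under full multi-head attention (MHA) versus
# grouped-query attention (GQA). Head counts are assumed for illustration.

layers, head_dim, bytes_pp = 80, 128, 2   # 16-bit values
q_heads, kv_heads = 64, 8                 # assumed: 64 * 128 = 8192 = d
ctx, free = 8192, 500e9                   # 8K context, ~500 GB free HBM

def kv_per_token(n_kv_heads):
    """Bytes of K and V stored per token across all layers."""
    return 2 * layers * n_kv_heads * head_dim * bytes_pp

mha = kv_per_token(q_heads)   # every query head keeps its own K and V
gqa = kv_per_token(kv_heads)  # query heads share a smaller set of K/V heads

for name, per_tok in (("MHA", mha), ("GQA", gqa)):
    per_req = per_tok * ctx
    b_max = int(free // per_req)
    print(f"{name}: {per_tok / 1e6:.2f} MB/token, "
          f"{per_req / 1e9:.1f} GB/request, max batch {b_max}")
```

Under these assumptions the per-token cache shrinks by the ratio of query heads to key/value heads, and the maximum batch grows by about the same factor.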
Lower precision $b$ or shorter context $n$ shrinks the cache and lets you batch more — which is why serving stacks fight so hard for every byte.
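A small sweep shows how much each lever buys. This sketch reuses the chapter's illustrative numbers (80 layers, hidden size 8192, ~500 GB of free HBM) and assumes an 8-bit KV cache is available as the lower-precision option; the `max_batch` helper is a name introduced here.

```python
# Maximum batch size as a function of KV precision and context length.
layers, d = 80, 8192
free = 500e9  # bytes of HBM left after weights (illustrative)

def max_batch(bytes_pp, ctx):
    """Largest number of concurrent requests whose KV caches fit in free HBM."""
    kv_per_req = 2 * layers * d * bytes_pp * ctx
    return int(free // kv_per_req)

results = {}
for bytes_pp in (2, 1):            # 16-bit and 8-bit KV cache
    for ctx in (8192, 4096, 2048):
        results[(bytes_pp, ctx)] = max_batch(bytes_pp, ctx)
        print(f"{bytes_pp * 8:>2}-bit KV, {ctx:>5} ctx -> "
              f"max batch {results[(bytes_pp, ctx)]}")
```

Halving either the precision or the context roughly doubles the batch, and the two levers compound.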
05 The code
The KV cache size and the batch it caps, for a 70B model on an 8-GPU node.
batching.py
layers, d, bytes_pp = 80, 8192, 2 # 70B-class, 16-bit
kv_per_token = 2 * layers * d * bytes_pp # K and V per layer
ctx = 8192
kv_per_req = kv_per_token * ctx
print(f"KV per token: {kv_per_token/1e6:.2f} MB")
print(f"KV per request @ {ctx} ctx: {kv_per_req/1e9:.1f} GB")
free = 8*80e9 - 140e9 # 8x80GB node minus 140GB weights
B_max = int(free // kv_per_req)
print(f"free HBM: {free/1e9:.0f} GB -> {B_max} concurrent requests")
print(f"throughput at batch {B_max}: {23.9*B_max:.0f} tok/s (vs 23.9 for one)")
# KV per token: 2.62 MB
# KV per request @ 8192 ctx: 21.5 GB
# free HBM: 500 GB -> 23 concurrent requests
# throughput at batch 23: 550 tok/s (vs 23.9 for one) <- 23x, for ~free
06 The economics
Batching → money
This is the chapter where inference economics turns positive. A GPU serving one user is mostly idle silicon being paid for at full price. Batching spreads that fixed cost across many users, so the cost per token falls by nearly the batch factor — the difference between a business and a bonfire. When a provider quotes a low per-token price, efficient batching is how they can afford to.
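The "falls by nearly the batch factor" claim is simple arithmetic. In the sketch below the node price is a hypothetical placeholder, not a quoted rate, and throughput is assumed to scale linearly with the batch, which holds only until compute saturates or the cache runs out of HBM.

```python
# Cost per million tokens as a function of batch size.
NODE_PRICE_PER_HOUR = 20.0    # hypothetical USD/hour for the 8-GPU node
SINGLE_STREAM_TOK_S = 23.9    # single-user decode rate from the chapter's example

def cost_per_million_tokens(batch):
    """USD per million generated tokens, assuming linear throughput scaling."""
    tokens_per_hour = SINGLE_STREAM_TOK_S * batch * 3600
    return NODE_PRICE_PER_HOUR / tokens_per_hour * 1e6

for batch in (1, 4, 23):
    print(f"batch {batch:>2}: ${cost_per_million_tokens(batch):,.2f} per million tokens")
```

Whatever the real hourly price is, the ratio between the single-user cost and the batched cost equals the batch size under the linear-scaling assumption, because the same fixed cost is divided over that many more tokens.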
And it explains the seams in the product. Long contexts are expensive not only in attention but because their KV caches crowd out other users, shrinking the batch and raising everyone's cost. That's why huge context windows carry premium pricing, and why the whole stack — quantization, PagedAttention, continuous batching — exists to squeeze more paying requests into a fixed HBM budget.
For the Circuit, batch efficiency is the hinge of the cost side: it's what lets falling prices coexist with expensive hardware. Improve it, and the break-even the whole build-out is chasing moves closer. It is unglamorous plumbing that quietly decides whether AI serving makes money.
07 Going deeper
Kwon et al. (2023) — PagedAttention / vLLM · managing the KV cache like virtual memory.
Yu et al. (2022) — Orca · continuous (iteration-level) batching.
Pope et al. (2022) — Efficiently Scaling Transformer Inference · the batch/latency trade-off.
Dao et al. (2022) — FlashAttention · computing attention without materializing the big matrix.
Cite this chapter: Divergent Compute, "KV cache & batching", First Principles, 2026. divergentcompute.com/first-principles-kv-cache · v1.0 · CC-BY.