KV cache & batching

Two tricks make serving LLMs affordable. The KV cache stops the model recomputing the whole prompt every step; batching reuses one expensive weight-load across many users at once. Together they turn a wasteful, memory-bound GPU into a profit center — until the cache eats the memory.

Cache the past; share the weights

Two facts from earlier collide here. From attention: each new token attends to every previous token — so naively, generating token 1,000 would recompute the first 999 from scratch, again and again. From inference: decode is memory-bound, so a single user leaves the GPU's compute mostly idle.

The KV cache fixes the first: store each token's attention keys and values once, and every later step just reads them — turning quadratic recompute into a cheap lookup, at the price of memory. Batching fixes the second: since one weight-load can serve many requests in the same pass, pack dozens of users together and aggregate throughput climbs almost for free.
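
The caching idea above can be sketched in a few lines. This is a toy, single-head, single-layer example in plain Python, not any library's implementation; the width `D`, the random weights `W_q`, `W_k`, `W_v`, and the helper names are all assumptions made for illustration. It decodes six tokens twice, once recomputing keys and values for the whole prefix at every step and once appending to a cache, then checks that the outputs agree and counts the K/V projections each approach needed.

```python
import math
import random

D = 4  # toy model width (assumption for illustration)

def project(x, w):
    """Multiply vector x by matrix w, given as a list of rows."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

random.seed(0)

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

def decode_naive(tokens):
    """Recompute K and V for the whole prefix at every step."""
    outputs, kv_projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, W_k) for x in prefix]
        values = [project(x, W_v) for x in prefix]
        kv_projections += 2 * t
        outputs.append(attend(project(prefix[-1], W_q), keys, values))
    return outputs, kv_projections

def decode_cached(tokens):
    """Compute K and V once per token and append them to the cache."""
    k_cache, v_cache, outputs, kv_projections = [], [], [], 0
    for x in tokens:
        k_cache.append(project(x, W_k))
        v_cache.append(project(x, W_v))
        kv_projections += 2
        outputs.append(attend(project(x, W_q), k_cache, v_cache))
    return outputs, kv_projections

naive_out, naive_work = decode_naive(tokens)
cached_out, cached_work = decode_cached(tokens)
same = all(math.isclose(a, b)
           for o1, o2 in zip(naive_out, cached_out)
           for a, b in zip(o1, o2))
print(f"outputs match: {same}")
print(f"K/V projections: naive {naive_work}, cached {cached_work}")
# outputs match: True
# K/V projections: naive 42, cached 12
```

The naive count grows as n(n+1) while the cached count grows as 2n, which is the quadratic-to-linear saving; the price is that `k_cache` and `v_cache` stay resident for the life of the request.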

But the KV cache and the batch compete for the same HBM. Raise the batch size and throughput soars, until the cache exhausts the memory and the whole thing falls over:

Batching — throughput vs the KV cache budget

A 70B model on an 8×80GB node (~500 GB free after weights), each request at 8K context. Illustrative.

[Interactive chart: aggregate throughput, per-user tok/s, and KV cache used (as a share of the 500 GB of free HBM), plotted against batch size from 1 to 140 concurrent requests.]

Why each trick works, and where they fight

The KV cache. In attention, every token produces a key and a value vector. Since past tokens never change, you compute their K and V once and keep them. Each decode step then computes K/V for only the new token and attends against the cache — so per-step work is linear in context length, not quadratic. The cost is memory: the cache grows with every token generated.
Static batching. Group N requests and run their decode steps in one forward pass. The weights are loaded from HBM once and reused across all N, so you get roughly N× the throughput for about the same weight traffic (each request still reads its own KV cache). That is the direct answer to decode being memory-bound.
Continuous batching. Real requests start and finish at different times. Continuous batching (iteration-level scheduling, introduced by Orca and popularized by vLLM) adds and removes requests from the running batch every step, keeping the GPU packed instead of waiting for the slowest request. This accounts for much of the efficiency of modern serving.
The collision. Both the model weights and every request's KV cache must fit in the same HBM. A bigger batch means more caches, so memory — not compute — usually sets the ceiling on batch size. PagedAttention manages the cache like virtual memory (pages, not one big block) to pack far more requests into the same HBM.
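
To see why continuous batching matters, here is a minimal scheduling sketch. It assumes every decode step costs the same, ignores memory limits and prefill, and uses made-up output lengths; the function names are hypothetical and this is not how any particular serving engine is written. Static batching holds each group until its longest request finishes, while continuous batching refills a slot the moment a request completes.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Run requests in fixed groups; each group lasts as long as its slowest member."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """Refill free slots from the queue before every decode step."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps

# Hypothetical workload: output lengths in tokens, 4 batch slots.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
slots = 4
static_steps = static_batching_steps(lengths, slots)
continuous_steps = continuous_batching_steps(lengths, slots)
total_tokens = sum(lengths)

print(f"static:     {static_steps} steps, "
      f"slot utilization {total_tokens / (static_steps * slots):.0%}")
print(f"continuous: {continuous_steps} steps, "
      f"slot utilization {total_tokens / (continuous_steps * slots):.0%}")
# static:     200 steps, slot utilization 32%
# continuous: 110 steps, slot utilization 59%
```

Same hardware, same requests, and in this toy workload nearly half the steps disappear simply because short requests no longer wait on long ones.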

So the serving problem is a memory allocation problem in disguise: fit the most revenue-generating requests you can into a fixed budget of very expensive bandwidth and capacity.
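
The allocation framing can be made concrete with a toy comparison. The sketch below is not vLLM's allocator; it only contrasts two policies under stated assumptions: a hypothetical budget of 188,416 token slots (23 full 8K contexts), a block size of 16 tokens, and an invented mix of current request lengths. Contiguous allocation reserves the full context window for every request up front; paged allocation hands out fixed-size blocks only for tokens that actually exist.

```python
import math

def fits_contiguous(lengths, budget_tokens, max_context):
    """Each admitted request reserves max_context token slots up front."""
    return min(len(lengths), budget_tokens // max_context)

def fits_paged(lengths, budget_tokens, block_tokens):
    """Each request holds only ceil(length / block_tokens) fixed-size blocks."""
    free_blocks = budget_tokens // block_tokens
    admitted = 0
    for n in lengths:
        need = math.ceil(n / block_tokens)
        if need > free_blocks:
            break  # first request that does not fit stops admission
        free_blocks -= need
        admitted += 1
    return admitted

# Hypothetical snapshot: 100 requests with a mix of current lengths (tokens).
lengths = [512, 2048, 1000, 8192, 300] * 20
budget_tokens = 23 * 8192   # assumed KV budget, measured in token slots
max_context = 8192          # context window reserved per request when contiguous
block_tokens = 16           # assumed page size in tokens

contiguous = fits_contiguous(lengths, budget_tokens, max_context)
paged = fits_paged(lengths, budget_tokens, block_tokens)
print(f"contiguous reservation: {contiguous} requests")
print(f"paged allocation:       {paged} requests")
# contiguous reservation: 23 requests
# paged allocation:       78 requests
```

This is a static snapshot: real requests keep growing, so a real scheduler also has to decide what to do when the free blocks run out mid-generation.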

The size of the cache

For a model with $L$ layers and hidden size $d$, at $b$ bytes per number, the KV cache stores a key and a value per token per layer:

$$ \text{KV per token} = 2 \, L \, d \, b $$

For a batch of $B$ requests each holding $n$ tokens of context, the total cache is:

$$ \text{KV total} = 2 \, L \, d \, b \cdot n \cdot B $$

A large model with full multi-head attention ($L=80$, $d=8192$, 16-bit) caches $2\cdot80\cdot8192\cdot2 \approx 2.6$ MB per token — so 8,192 tokens is ~21 GB per request. (Real 70B-class models use grouped-query attention and cache roughly 8× less; the batch-vs-cache tension is identical, just at a smaller scale.) Aggregate throughput rises with the batch, throughput $\approx B \times (\text{single-stream rate})$, until either compute saturates or the cache hits the HBM limit:

$$ B_{\max} \approx \frac{\text{HBM free}}{2 \, L \, d \, b \cdot n} $$

Lower precision $b$ or shorter context $n$ shrinks the cache and lets you batch more — which is why serving stacks fight so hard for every byte.
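
A quick sweep shows how much each lever buys. The sketch below evaluates the $B_{\max}$ formula for the 500 GB of free HBM used earlier, varying one factor at a time. The function name `max_batch` is hypothetical, and the grouped-query case simply assumes the cached K/V width is 8× narrower than $d$, matching the rough factor quoted above rather than any specific model's configuration.

```python
def max_batch(free_bytes, layers, kv_width, bytes_per_value, context):
    """B_max = free HBM / (2 * L * d * b * n), rounded down."""
    per_request = 2 * layers * kv_width * bytes_per_value * context
    return int(free_bytes // per_request)

FREE = 500e9             # bytes of HBM left after weights
LAYERS, WIDTH = 80, 8192  # L and d from the worked example

cases = {
    "16-bit, 8K context": max_batch(FREE, LAYERS, WIDTH, 2, 8192),
    "8-bit, 8K context": max_batch(FREE, LAYERS, WIDTH, 1, 8192),
    "16-bit, 2K context": max_batch(FREE, LAYERS, WIDTH, 2, 2048),
    "16-bit, 8K, K/V 8x narrower": max_batch(FREE, LAYERS, WIDTH // 8, 2, 8192),
}
for name, b_max in cases.items():
    print(f"{name:30s} -> {b_max} concurrent requests")
# 16-bit, 8K context             -> 23 concurrent requests
# 8-bit, 8K context              -> 46 concurrent requests
# 16-bit, 2K context             -> 93 concurrent requests
# 16-bit, 8K, K/V 8x narrower    -> 186 concurrent requests
```

Every term in the denominator is a direct multiplier on the batch ceiling, so halving any of them roughly doubles the number of users that fit.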

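How many users fit

```python
# KV cache sizing for a 70B-class model with full multi-head attention.
layers, d, bytes_pp = 80, 8192, 2           # layers, hidden size, bytes per value (16-bit)
kv_per_token = 2 * layers * d * bytes_pp    # one K and one V vector per layer
ctx = 8192                                  # tokens of context per request
kv_per_req = kv_per_token * ctx

print(f"KV per token: {kv_per_token/1e6:.2f} MB")
print(f"KV per request @ {ctx} ctx: {kv_per_req/1e9:.1f} GB")

free = 8 * 80e9 - 140e9                     # 8x80GB node minus 140GB of weights
B_max = int(free // kv_per_req)             # requests that fit in the free HBM
single_stream = 23.9                        # tok/s for one user (illustrative)
print(f"free HBM: {free/1e9:.0f} GB -> {B_max} concurrent requests")
# Assumes throughput scales linearly with batch size up to B_max.
print(f"throughput at batch {B_max}: {single_stream*B_max:.0f} tok/s (vs {single_stream} for one)")

# KV per token: 2.62 MB
# KV per request @ 8192 ctx: 21.5 GB
# free HBM: 500 GB -> 23 concurrent requests
# throughput at batch 23: 550 tok/s (vs 23.9 for one)   <- 23x, for ~free
```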

Where inference stops losing money

Batching → money

This is the chapter where inference economics turns positive. A GPU serving one user is mostly idle silicon being paid for at full price. Batching spreads that fixed cost across many users, so the cost per token falls by nearly the batch factor — the difference between a business and a bonfire. When a provider quotes a low per-token price, efficient batching is how they can afford to.
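
The arithmetic behind "falls by nearly the batch factor" is short enough to write out. In the sketch below the node price of $20 per hour is a hypothetical placeholder, not a quoted rate; the 23.9 tok/s single-stream rate and the batch of 23 come from the sizing example, and throughput is assumed to scale linearly with batch size, which is the ideal case.

```python
def cost_per_million_tokens(node_dollars_per_hour, single_stream_tok_s, batch):
    """Dollars per million generated tokens, assuming linear scaling with batch."""
    tokens_per_hour = single_stream_tok_s * batch * 3600
    return node_dollars_per_hour / tokens_per_hour * 1e6

NODE_PRICE = 20.0      # hypothetical $/hour for the whole node
SINGLE_STREAM = 23.9   # tok/s for one user, from the sizing example

cost_one = cost_per_million_tokens(NODE_PRICE, SINGLE_STREAM, 1)
cost_batched = cost_per_million_tokens(NODE_PRICE, SINGLE_STREAM, 23)
print(f"batch 1:  ${cost_one:.2f} per million tokens")
print(f"batch 23: ${cost_batched:.2f} per million tokens")
print(f"ratio: {cost_one / cost_batched:.1f}x")
```

Whatever the real hourly price is, it cancels out of the ratio: under these assumptions the cost per token drops by exactly the batch factor, and real systems land somewhat below that.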

And it explains the seams in the product. Long contexts are expensive not only in attention but because their KV caches crowd out other users, shrinking the batch and raising everyone's cost. That's why huge context windows carry premium pricing, and why the whole stack — quantization, PagedAttention, continuous batching — exists to squeeze more paying requests into a fixed HBM budget.

For the Circuit, batch efficiency is the hinge of the cost side: it's what lets falling prices coexist with expensive hardware. Improve it, and the break-even the whole build-out is chasing moves closer. It is unglamorous plumbing that quietly decides whether AI serving makes money.
