The context window

The context window is a model's working memory — the number of tokens it can hold and attend to at once. Everything inside it the model can use; everything outside it, it never sees. And making it bigger is brutally expensive.


How much the model can hold at once

A model has no long-term memory between turns. Within a single request it has only the context window: the maximum number of tokens it can read and reason over together. A whole conversation, a pasted document, the system instructions: it all has to fit. Run past the limit and something has to be cut, typically the oldest tokens, and the model simply doesn't know they existed.
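
A minimal sketch of that "fall off the edge" behaviour, assuming the simplest possible policy: keep the newest tokens and drop the oldest. The name `fit_to_window` and the list-of-integers stand-in for token ids are illustrative only; real systems may instead cut from the middle, summarize older turns, or reject the request outright.

```python
def fit_to_window(token_ids, max_tokens):
    """Keep only the most recent tokens that fit in the window."""
    if max_tokens <= 0:
        return []
    # Slicing from the end keeps the tail; the oldest tokens are dropped.
    return token_ids[-max_tokens:]


history = list(range(10))            # stand-in for 10 token ids, oldest first
visible = fit_to_window(history, 4)  # pretend the window holds only 4 tokens
print(visible)                       # only the 4 newest ids remain
```

Whatever the policy, the effect on the model is the same: anything not returned here is invisible to it.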

Windows have grown fast — from ~2,000 tokens in early GPT-3 to 128,000 in GPT-4-class models to over a million in the largest. But that growth fights a hard wall you've already met: attention's cost grows with the square of the window, and the model must cache every token's keys and values in fast memory. Drag the window and watch what it costs (on a 70B-class model):

The cost of context

[Interactive figure: a slider sets the window from 1K to 1M tokens and reports KV cache memory, the number of 80 GB GPUs needed to hold that cache, and attention work relative to a 1K window. Figures are for one request on an 80-layer, 8192-wide model in fp16.]

Why it's capped, and how it's stretched

Two costs set the ceiling, both from attention:

Compute grows quadratically. Every token attends to every other, so the work scales as $n^2$. Going from a 1K window to a 128K one is a 128× increase in length — and therefore about 16,000× the attention work (128 squared). It explodes.
Memory grows linearly — and ruinously. To avoid recomputing, the model stores every token's keys and values, the KV cache. That cache grows with the window and the model's depth and width, and it lives in scarce high-bandwidth memory.
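
The quadratic claim is easy to check by hand. The sketch below assumes attention work is exactly proportional to $n^2$ and ignores every other cost in the model; `attention_work_ratio` is an illustrative helper, not a measurement.

```python
def attention_work_ratio(n_tokens, baseline_tokens=1024):
    """Attention work relative to a baseline window, assuming work ~ n^2."""
    return (n_tokens / baseline_tokens) ** 2


for ctx in [1024, 32 * 1024, 128 * 1024, 1024 * 1024]:
    ratio = attention_work_ratio(ctx)
    print(f"{ctx:>9,} tokens -> {ratio:>12,.0f}x the 1K attention work")
```

At 128K the ratio is 128 squared, or 16,384, which is the "about 16,000×" quoted above.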

So engineers fight on three fronts: position schemes like ALiBi and RoPE (often with scaling or interpolation) that help a model handle lengths beyond those it was trained on; efficient attention, where FlashAttention cuts memory traffic without changing the result and sliding-window or sparse patterns cut the exponent itself; and simply more memory. One caveat worth knowing: even when a model can read a million tokens, it often makes less use of information in the middle of the window (the "lost in the middle" effect), so a bigger window doesn't automatically mean better understanding.
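
To see how a sliding window changes the exponent, count the query-key pairs that attention has to score. This is a rough sketch under stated assumptions: causal attention, a fixed window of 4,096 tokens chosen purely for illustration, and pair count used as a proxy for work. The helper names are hypothetical.

```python
def causal_pairs(n_tokens):
    """Pairs scored when each token attends to itself and all earlier tokens."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens, window):
    """Pairs scored when each token attends to at most `window` recent tokens,
    itself included."""
    return sum(min(i + 1, window) for i in range(n_tokens))


n, w = 131_072, 4_096  # window size is an arbitrary illustrative choice
full = causal_pairs(n)
local = sliding_window_pairs(n, w)
print(f"full causal:  {full:,} pairs")
print(f"window {w:,}: {local:,} pairs")
```

With the window fixed, the second count grows in proportion to $n$ rather than $n^2$. The price is that each token sees only its recent neighbours directly.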

Quadratic compute, linear cache

For a window of $n$ tokens, width $d$, and $N$ layers, the two costs are:

$$ \text{attention compute} \;\propto\; n^2 d \qquad\qquad \text{KV cache bytes} \;=\; 2 \, N \, n \, d \, b $$

The $2$ counts keys and values, $b$ is the bytes per number ($2$ in fp16). The compute term is the famous quadratic: doubling $n$ quadruples the attention work. The cache term is "only" linear in $n$ — but the constant is enormous, because it also multiplies by every layer and the full width.

Make it concrete. For an 80-layer, 8192-wide model, each token costs $2 \cdot 80 \cdot 8192 \cdot 2 \approx 2.6$ MB of cache. So a 128K-token window holds about 340 GB of KV cache, more than four 80 GB GPUs' worth of memory (five in practice), before the model produces a single word. A one-million-token window needs about 2.6 TB. (These are full multi-head-attention figures, the honest worst case. Real 70B-class models use grouped-query attention, sharing keys and values across groups of query heads to shrink the cache by roughly 8×; it is precisely the trick invented to fight this wall.) That is why long context is a hardware problem, not a software setting.

The KV-cache calculator

    def kv_cache_gb(n_tokens, n_layers, d_model, bytes_per=2):
        # 2 = one key + one value vector per token, per layer
        return 2 * n_layers * n_tokens * d_model * bytes_per / 1e9

    # an 80-layer, 8192-wide model in fp16
    for ctx in [4096, 32768, 131072, 1_000_000]:
        print(f"{ctx:>9,} tokens -> {kv_cache_gb(ctx, 80, 8192):8.1f} GB of KV cache")

    # ->     4,096 tokens ->     10.7 GB of KV cache
    #       32,768 tokens ->     85.9 GB of KV cache
    #      131,072 tokens ->    343.6 GB of KV cache
    #    1,000,000 tokens ->   2621.4 GB of KV cache
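
The grouped-query caveat can be put through the same arithmetic. The sketch below replaces the full width with the number of key/value heads times the head size. The head layout is an assumption, not something stated above: 64 query heads of size 128 (so 64 × 128 = 8192 matches the width) and 8 key/value heads, which is what a "roughly 8×" saving implies.

```python
def kv_cache_gb_gqa(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # 2 = one key + one value vector per token, per layer, per KV head
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per / 1e9


# Assumed layout: head_dim 128; 64 KV heads reproduces full multi-head
# attention at width 8192, 8 KV heads is the grouped-query variant.
full_mha = kv_cache_gb_gqa(131072, 80, n_kv_heads=64, head_dim=128)
grouped = kv_cache_gb_gqa(131072, 80, n_kv_heads=8, head_dim=128)
print(f"multi-head:    {full_mha:6.1f} GB")
print(f"grouped-query: {grouped:6.1f} GB ({full_mha / grouped:.0f}x smaller)")
```

Under these assumptions the 128K cache drops from about 344 GB to about 43 GB: still most of one 80 GB GPU for a single request.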

Where the square meets the customer

Window → money

The context window is where every cost in this Part comes due at once. The tokens set $n$; attention squares it for compute; the KV cache turns it into hundreds of gigabytes of high-bandwidth memory per long request. This is one of the most direct reasons inference is memory-bound, and why providers often charge more for long-context calls: you are renting scarce HBM by the token, and attention compute by the token squared.

It also reframes the build-out. A huge share of data-center spending isn't about training ever-larger models; it's about serving them — holding millions of users' KV caches in memory at once. Every leap in advertised context length ripples straight into demand for the exact resource the Circuit calls the memory wall.
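
A back-of-the-envelope sketch of that serving pressure: how many requests' KV caches fit in memory at once. Every number here is an assumption for illustration: a node of eight 80 GB GPUs (640 GB of HBM), about 140 GB of fp16 weights for a 70B-parameter model, the full multi-head worst-case cache formula from earlier, and no allowance for activations or fragmentation. `max_concurrent_requests` is a hypothetical helper.

```python
def max_concurrent_requests(hbm_gb, weights_gb, n_tokens,
                            n_layers=80, d_model=8192, bytes_per=2):
    """How many full-length KV caches fit in the HBM left after the weights."""
    cache_gb_per_request = 2 * n_layers * n_tokens * d_model * bytes_per / 1e9
    free_gb = hbm_gb - weights_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // cache_gb_per_request)


# Assumed node: 640 GB of HBM, 140 GB of it taken by the weights.
for ctx in [4096, 8192, 32768, 131072]:
    fits = max_concurrent_requests(640, 140, ctx)
    print(f"{ctx:>7,} tokens per request -> {fits} concurrent requests")
```

Under these assumptions the same node that serves dozens of short requests can hold only one 128K-token request at a time, which is the sense in which longer context translates directly into more memory to buy.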

So the context window is the perfect closing note for the foundations: it is the single setting where the token, the embedding, attention, the network, the transformer, and the parameter count all collapse into one number you pay for. Read the spec "1M token context," and now you know precisely what it costs to keep that promise.

That completes Part I, the Foundations. From a token to the full transformer and the economics of running it, you now have the whole machine in view. Part II opens the model itself: how LLMs are trained, fine-tuned, and how they differ.
