First Principles / Part I · Foundations / Chapter 07
The context window is a model's working memory — the number of tokens it can hold and attend to at once. Everything inside it the model can use; everything outside it, it never sees. And making it bigger is brutally expensive.
01 The answer, then the intuition
A model doesn't have long-term memory between turns; within a single request it has the context window — the maximum number of tokens it can read and reason over together. A whole conversation, a pasted document, the system instructions: it all has to fit. Run past the limit and the oldest tokens fall off the edge, or get truncated, and the model simply doesn't know they existed.
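What "falling off the edge" means in practice can be sketched in a few lines. This is an illustration only, not any provider's actual truncation policy: `fit_to_window` is a hypothetical helper, and it approximates tokens as whitespace-separated words instead of using a real tokenizer.

```python
def fit_to_window(messages, max_tokens):
    """Keep the newest messages that fit in the window; drop the oldest.

    Assumption: one whitespace-separated word counts as one token.
    A real tokenizer would give different counts.
    """
    kept, used = [], 0
    # Walk from newest to oldest so the most recent turns survive.
    for msg in reversed(messages):
        cost = len(msg.split())
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "my name is Ada",      # 4 "tokens"
    "I like long walks",   # 4
    "what is my name",     # 4
]
print(fit_to_window(history, max_tokens=8))
# -> ['I like long walks', 'what is my name']
```

With an 8-token budget the first message is gone, so a model asked for the name has nothing to go on: from its point of view the sentence was never written.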
Windows have grown fast, from ~2,000 tokens in early GPT-3 to 128,000 in GPT-4-class models to over a million in the largest. But that growth fights a hard wall you've already met: attention's cost grows with the square of the window, and the model must cache every token's keys and values in fast memory.
[Interactive slider: KV cache and attention work for one request on an 80-layer, 8192-wide (70B-class) model in fp16.]
02 Mechanics
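Two costs set the ceiling, both from attention:
Compute. Every token attends to every other token, so attention work grows with the square of the window.
Memory. The model caches a key and a value vector for every token in every layer, so the KV cache grows linearly with the window, but with a very large constant.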
So engineers fight on three fronts: position tricks like RoPE and ALiBi that let a model generalize to lengths it wasn't trained on; efficient attention like FlashAttention, sliding-window, and sparse patterns that cut the constants (or even the exponent); and simply more memory. One caveat worth knowing: even when a model can read a million tokens, it often attends less to the middle — the "lost in the middle" effect — so a bigger window isn't automatically better understanding.
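To see how a sliding window changes the exponent and not just the constant, here is a rough sketch that only counts which (query, key) pairs get scored. The function name `attended_pairs` and the window size of 256 are illustrative assumptions; real implementations build masks or fused kernels instead of counting in a loop.

```python
def attended_pairs(n, window=None):
    """Count (query, key) pairs scored under causal attention.

    window=None -> full causal attention: token i sees tokens 0..i.
    window=w    -> sliding window: token i sees only the last w tokens,
                   itself included.
    """
    total = 0
    for i in range(n):
        visible = i + 1 if window is None else min(i + 1, window)
        total += visible
    return total


for n in [1_000, 2_000, 4_000]:
    full = attended_pairs(n)
    local = attended_pairs(n, window=256)
    print(f"n={n:>5,}  full={full:>10,}  window-256={local:>10,}")
```

Each doubling of `n` roughly quadruples the full count but only roughly doubles the windowed one. The price is that tokens further back than the window are invisible to a single layer.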
04 The math
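For a window of $n$ tokens, width $d$, and $N$ layers, the two costs are:
$$\text{attention compute} \;\propto\; N \cdot n^2 \cdot d \qquad\qquad \text{KV cache} \;=\; 2 \cdot N \cdot n \cdot d \cdot b \ \text{bytes}$$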
The $2$ counts keys and values, $b$ is the bytes per number ($2$ in fp16). The compute term is the famous quadratic: doubling $n$ quadruples the attention work. The cache term is "only" linear in $n$ — but the constant is enormous, because it also multiplies by every layer and the full width.
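The two scaling claims can be checked numerically. The sketch below assumes a simple count of $4 n^2 d$ floating-point operations per layer for attention (scores plus the weighted sum over values); that constant is an assumption for illustration and ignores projections, softmax, and causal masking. Only the ratios matter here.

```python
def attention_flops(n_tokens, n_layers, d_model):
    # Assumption: 2*n^2*d FLOPs for the scores (Q @ K^T) plus 2*n^2*d
    # for the weighted sum over values, per layer.
    return 4 * n_layers * n_tokens**2 * d_model


def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per=2):
    # 2 = one key + one value vector per token, per layer
    return 2 * n_layers * n_tokens * d_model * bytes_per


base = 4096
for n in [base, 2 * base, 4 * base]:
    c = attention_flops(n, 80, 8192) / attention_flops(base, 80, 8192)
    m = kv_cache_bytes(n, 80, 8192) / kv_cache_bytes(base, 80, 8192)
    print(f"{n:>6,} tokens: compute x{c:.0f}, cache x{m:.0f}")
# ->  4,096 tokens: compute x1, cache x1
#     8,192 tokens: compute x4, cache x2
#    16,384 tokens: compute x16, cache x4
```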
Make it concrete. For an 80-layer, 8192-wide model, each token costs $2 \cdot 80 \cdot 8192 \cdot 2 \approx 2.6$ MB of cache. So a 128K-token window holds about 340 GB of KV cache — roughly five 80 GB GPUs, before the model produces a single word. A one-million-token window is measured in terabytes. (These are full multi-head-attention figures — the honest worst case. Real 70B-class models use grouped-query attention, sharing keys and values across heads to shrink the cache by roughly 8×; it is precisely the trick invented to fight this wall.) That is why long context is a hardware problem, not a software setting.
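The grouped-query saving mentioned in the parenthesis can be sketched the same way. The head counts below (64 query heads sharing 8 key/value heads) are assumptions chosen to reproduce the "roughly 8×" figure; `kv_cache_gb_gqa` is a hypothetical helper, not a library function.

```python
def kv_cache_gb_gqa(n_tokens, n_layers, d_model, n_heads, n_kv_heads, bytes_per=2):
    """KV cache size when n_heads query heads share n_kv_heads K/V heads."""
    head_dim = d_model // n_heads
    kv_width = n_kv_heads * head_dim  # width actually stored per token
    # 2 = one key + one value vector per token, per layer
    return 2 * n_layers * n_tokens * kv_width * bytes_per / 1e9


# Assumed head layout: 64 query heads, 8 shared key/value heads.
mha = kv_cache_gb_gqa(131072, 80, 8192, n_heads=64, n_kv_heads=64)
gqa = kv_cache_gb_gqa(131072, 80, 8192, n_heads=64, n_kv_heads=8)
print(f"MHA: {mha:.1f} GB, GQA: {gqa:.1f} GB, ratio: {mha / gqa:.0f}x")
# -> MHA: 343.6 GB, GQA: 42.9 GB, ratio: 8x
```

Setting `n_kv_heads` equal to `n_heads` recovers the full multi-head figure used in the text.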
05 The code
The exact function behind the slider, runnable.
kv_cache.py
```python
def kv_cache_gb(n_tokens, n_layers, d_model, bytes_per=2):
    # 2 = one key + one value vector per token, per layer
    return 2 * n_layers * n_tokens * d_model * bytes_per / 1e9


# an 80-layer, 8192-wide model in fp16
for ctx in [4096, 32768, 131072, 1_000_000]:
    print(f"{ctx:>9,} tokens -> {kv_cache_gb(ctx, 80, 8192):8.1f} GB of KV cache")

# ->     4,096 tokens ->     10.7 GB of KV cache
#       32,768 tokens ->     85.9 GB of KV cache
#      131,072 tokens ->    343.6 GB of KV cache
#    1,000,000 tokens ->   2621.4 GB of KV cache
```
06 The economics
Window → money
The context window is where every cost in this Part comes due at once. The tokens set $n$; attention squares it for compute; the KV cache turns it into hundreds of gigabytes of high-bandwidth memory per long request. This is the most direct reason inference is memory-bound, and why providers charge more for long-context calls: you are renting scarce HBM by the token and paying for attention compute by the token squared.
It also reframes the build-out. A huge share of data-center spending isn't about training ever-larger models; it's about serving them — holding millions of users' KV caches in memory at once. Every leap in advertised context length ripples straight into demand for the exact resource the Circuit calls the memory wall.
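A back-of-the-envelope sketch shows why serving is memory-limited. Every number here is an assumption for illustration: a node with 8 × 80 GB = 640 GB of HBM, 140 GB of fp16 weights (70B parameters × 2 bytes), the full multi-head cache size from the math above, and no allowance for activations or fragmentation. `max_concurrent_requests` is a hypothetical helper.

```python
def max_concurrent_requests(hbm_gb, weights_gb, ctx_tokens, n_layers, d_model, bytes_per=2):
    """How many full-length requests fit in the memory left after the weights."""
    per_token_bytes = 2 * n_layers * d_model * bytes_per  # keys + values
    free_bytes = (hbm_gb - weights_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // (per_token_bytes * ctx_tokens))


# Assumed node: 640 GB of HBM, 140 GB taken by the weights.
for ctx in [4096, 32768, 131072]:
    n = max_concurrent_requests(640, 140, ctx, 80, 8192)
    print(f"{ctx:>7,}-token requests that fit at once: {n}")
# ->   4,096-token requests that fit at once: 46
#     32,768-token requests that fit at once: 5
#    131,072-token requests that fit at once: 1
```

Under these assumptions, one long-context request occupies the memory that could otherwise serve dozens of short ones.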
So the context window is the perfect closing note for the foundations: it is the single setting where the token, the embedding, attention, the network, the transformer, and the parameter count all collapse into one number you pay for. Read the spec "1M token context," and now you know precisely what it costs to keep that promise.
07 Going deeper
Su et al. (2021) — RoFormer (RoPE) · rotary position encodings that help models extend their length.
Dao et al. (2022) — FlashAttention · making long-context attention fit in memory.
Liu et al. (2023) — Lost in the Middle · why a longer window isn't always better understanding.
Beltagy et al. (2020) — Longformer · sparse attention for long documents.
Cite this chapter: Divergent Compute, "The context window", First Principles, 2026. divergentcompute.com/first-principles-context-window · v1.0 · CC-BY.