First Principles / Part III · Inference & systems / Chapter 17
Chips can now do arithmetic far faster than they can fetch the numbers to work on. That widening gap is the memory wall — and it's why AI accelerators are defined by their high-bandwidth memory (HBM) as much as their compute.
01 The answer, then the intuition
For decades, the arithmetic units on chips have sped up far faster than the memory feeding them. Peak compute has roughly tripled each hardware generation (about every two years) while memory bandwidth grew only about 1.6×, so with every generation the processor spends more of its time waiting on data and less doing math. That is the memory wall, and modern AI slammed straight into it: LLM decode is memory-bound precisely because it reads the whole model from memory for every generated token but does little arithmetic with each byte.
[Interactive figure: relative compute and memory bandwidth by hardware generation, with the gap between them. Illustrative model of the documented trend (Gholami et al. 2024): compute ×3.0 per generation, bandwidth ×1.6 per generation. Bars are relative to generation 0, where the two start even and the gap is ×1.0.]
02 Mechanics
So the mental model flips: for AI, a chip's memory system is often the real product, and the compute is the part that's easy to oversupply.
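04 The math
The roofline model says achievable performance is capped by whichever runs out first, compute or bandwidth, given a workload's arithmetic intensity $I$ (FLOPs done per byte moved). Writing $P_{\text{peak}}$ for peak compute (FLOP/s) and $B_{\text{mem}}$ for memory bandwidth (bytes/s):
$$P(I) = \min\left(P_{\text{peak}},\; I \cdot B_{\text{mem}}\right)$$
The crossover, called the ridge point, sits at:
$$I^{*} = \frac{P_{\text{peak}}}{B_{\text{mem}}}$$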
A workload is memory-bound when $I < I^{*}$. As compute grows ×3 per generation and bandwidth only ×1.6, $I^{*}$ climbs by ×$(3/1.6) \approx 1.9$ each generation — the bar for staying compute-bound keeps rising. LLM decode at batch 1 has an intensity of roughly 1–2 FLOPs/byte, far below a modern GPU's ridge point of several hundred — which is exactly why it wastes most of the chip's compute, and why batching (raising $I$) is the escape.
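As a rough sketch of the roofline arithmetic above, the snippet below computes a ridge point and the attainable throughput at a few batch sizes. The accelerator numbers (1,000 TFLOP/s peak, 3 TB/s bandwidth) are round illustrative assumptions, not the specification of any particular chip, and the function names are hypothetical. It also assumes 16-bit weights and counts only weight traffic, so decode intensity comes out to about one FLOP per byte for each sequence in the batch.

```python
# Roofline sketch. All hardware numbers are illustrative assumptions,
# not the datasheet of any real accelerator.
PEAK_FLOPS = 1000e12   # assumed peak compute, FLOP/s
BANDWIDTH = 3e12       # assumed memory bandwidth, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline cap: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth)


def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Approximate decode intensity, counting weight traffic only.

    Each parameter is read once per step (bytes_per_param bytes) and used
    for one multiply-add (2 FLOPs) per sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


if __name__ == "__main__":
    i_star = ridge_point(PEAK_FLOPS, BANDWIDTH)
    print(f"ridge point: {i_star:.0f} FLOPs/byte")
    for batch in (1, 8, 64, 512):
        i = decode_intensity(batch)
        used = attainable_flops(i, PEAK_FLOPS, BANDWIDTH) / PEAK_FLOPS
        bound = "memory" if i < i_star else "compute"
        print(f"batch {batch:4d}: I = {i:6.1f}  {bound}-bound  "
              f"compute used <= {used:6.1%}")
```

Under these assumed numbers, batch 1 can use at most about 0.3% of peak compute, and the workload only crosses the ridge once the batch reaches several hundred sequences. Real decode also moves KV-cache data, which this sketch ignores.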
05 The code
Relative compute, bandwidth, and the widening gap, using the documented per-generation growth rates.
memory_wall.py
# Gholami et al. "AI and Memory Wall" (2024): per ~2-year generation,
# peak compute grows ~3x, memory bandwidth ~1.6x.
C, B = 3.0, 1.6
for k in range(6):
    comp, bw = C**k, B**k
    gap = comp / bw  # extra arithmetic-per-byte you must find to stay busy
    print(f"gen {k} (+{2*k}yr): compute x{comp:6.1f} bandwidth x{bw:5.1f} gap x{gap:5.2f}")
# gen 0 (+0yr): compute x   1.0 bandwidth x  1.0 gap x 1.00
# gen 3 (+6yr): compute x  27.0 bandwidth x  4.1 gap x 6.59
# gen 5 (+10yr): compute x 243.0 bandwidth x 10.5 gap x23.17 <- the wall
06 The economics
The wall → money
The memory wall rewrites what a GPU purchase really is. Because inference is memory-bound, buyers are often paying for capacity and bandwidth — and getting compute they can't fully use as a side effect. HBM is the scarce, expensive component inside the accelerator, made by just three suppliers (SK Hynix, Samsung, Micron), and its availability has become a genuine gate on how many AI chips can be built at all.
That reshapes the capex story. The bottleneck isn't only fabricating logic; it's stacking enough high-bandwidth memory around it. When a frontier chip is supply-constrained, HBM is frequently why — a detail that turns an obscure memory technology into a macro variable.
For the Circuit, the wall sets a hard floor under costs: no matter how cheap arithmetic gets, moving data stays expensive, so for a given number of bytes moved per token, the price of serving that token can't fall faster than memory improves. Every efficiency lever in this section (fewer bits, batching, better caching) is ultimately a way to do more with each precious byte of bandwidth.
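To make that floor concrete, here is a back-of-the-envelope sketch of the bandwidth limit on decode speed. It assumes each decode step streams all weights from memory once and ignores KV-cache traffic, compute time, and every other overhead, so it is an optimistic bound rather than a prediction. The model size (70 billion parameters) and bandwidth (3 TB/s) are hypothetical round numbers, and the function name is made up for illustration.

```python
# Bandwidth floor on decode, under stated simplifying assumptions:
# every step reads all weights once; KV cache and compute time are ignored.

def max_tokens_per_second(n_params: float, bytes_per_param: float,
                          bandwidth: float, batch_size: int = 1) -> float:
    """Upper bound on generated tokens/s when weight reads are the limit.

    One pass over the weights yields one token for each sequence in the batch.
    """
    bytes_per_step = n_params * bytes_per_param
    steps_per_second = bandwidth / bytes_per_step
    return steps_per_second * batch_size


if __name__ == "__main__":
    N_PARAMS = 70e9     # hypothetical model size
    BANDWIDTH = 3e12    # hypothetical memory bandwidth, bytes/s
    for label, bpp, batch in [("16-bit, batch 1", 2.0, 1),
                              ("4-bit, batch 1", 0.5, 1),
                              ("16-bit, batch 32", 2.0, 32)]:
        tps = max_tokens_per_second(N_PARAMS, bpp, BANDWIDTH, batch)
        print(f"{label:18s} <= {tps:7.1f} tokens/s")
```

With these assumed numbers the 16-bit, batch-1 bound is about 21 tokens per second. Cutting the bits per weight or batching raises the bound without touching the hardware, although at large enough batch sizes the workload becomes compute-bound and this bandwidth-only estimate no longer applies.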
07 Going deeper
Gholami et al. (2024) — AI and Memory Wall · the documented compute-vs-bandwidth divergence.
Williams, Waterman & Patterson (2009) — Roofline · the performance model.
Dao et al. (2022) — FlashAttention · beating the wall by minimizing memory traffic.
SemiAnalysis · HBM supply and its role in accelerator availability.
Cite this chapter: Divergent Compute, "Memory, HBM & the memory wall", First Principles, 2026. divergentcompute.com/first-principles-memory-wall · v1.0 · CC-BY.