Memory, HBM & the memory wall

Chips can now do arithmetic far faster than they can fetch the numbers to work on. That widening gap is the memory wall — and it's why AI accelerators are defined by their high-bandwidth memory (HBM) as much as their compute.


Compute outran memory — and never looked back

For decades, the arithmetic units on chips have sped up much faster than the memory feeding them. Peak compute has roughly tripled each hardware generation while memory bandwidth has grown only about 1.6×, so with every generation the processor spends more of its time waiting on data and less doing math. That is the memory wall, and modern AI slammed straight into it: LLM decode (generating output one token at a time) is memory-bound precisely because it reads the whole model from memory for every token but does little arithmetic with each byte.

Step through the generations and the two quantities diverge: compute races ahead, bandwidth trails, and the gap you must somehow bridge grows wider at each step.

The memory wall — compute vs bandwidth, per generation

Illustrative model of the documented trend (Gholami et al. 2024): compute ×3.0 / generation, bandwidth ×1.6 / generation. Bars are relative to generation 0.

[Interactive figure: two bars, compute and memory bandwidth, plotted by hardware generation from the 2010s to now, with each step about 2 years. At generation 0 both start even at ×1, so the gap is ×1.0; later generations open the wall.]

What HBM is, and why it's never enough

The problem. A GPU's thousands of cores can only work as fast as memory delivers operands. When the arithmetic is cheap but the data movement is expensive, the cores sit idle — "memory-bound." Most LLM inference lives here.
HBM (high-bandwidth memory). The fix is to stack DRAM dies vertically and place the stacks in the same package as the compute die, connected by an extremely wide interface. That buys enormous bandwidth: an H100's HBM moves ~3.35 TB/s, versus tens of GB/s for a typical desktop CPU's DRAM. It is arguably the feature that does most to make a GPU a viable AI chip.
The catches. HBM is expensive to manufacture, limited in capacity (tens of GB per chip), power-hungry, and made by only a handful of suppliers. So you're always short of it — short of capacity (the weights plus the KV cache must fit) and short of bandwidth (they must stream fast enough).
Why it keeps biting. Because compute keeps outpacing bandwidth, each new generation makes the imbalance worse, not better. Chip designers respond with bigger caches, lower precision, and clever data movement — but the wall recedes slowly at best.

So the mental model flips: for AI, a chip's memory system is often the real product, and the compute is the part that's easy to oversupply.
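To make the two shortages concrete, here is a back-of-the-envelope sketch of both checks for batch-1 decode. The 3.35 TB/s bandwidth is the H100 figure quoted above; the model size (7 billion parameters), 16-bit weights, 80 GB capacity and 4 GB KV-cache budget are illustrative assumptions, not figures from this article, and the function names are hypothetical.

```python
# Rough check of both HBM constraints for batch-1 decode.
# All workload numbers below are illustrative assumptions.

def weights_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory footprint of the model weights, in bytes."""
    return n_params * bytes_per_param

def fits(weights_b: float, kv_cache_b: float, hbm_capacity_b: float) -> bool:
    """Capacity check: weights plus KV cache must fit in HBM."""
    return weights_b + kv_cache_b <= hbm_capacity_b

def max_tokens_per_second(weights_b: float, bandwidth_bps: float) -> float:
    """Bandwidth ceiling: each generated token streams all weights once."""
    return bandwidth_bps / weights_b

N_PARAMS = 7e9          # assumed model size
BYTES_PER_PARAM = 2     # assumed 16-bit weights
KV_CACHE_BYTES = 4e9    # assumed KV-cache budget
HBM_CAPACITY = 80e9     # assumed capacity of one accelerator
HBM_BANDWIDTH = 3.35e12 # H100 figure quoted in the text, bytes/s

w = weights_bytes(N_PARAMS, BYTES_PER_PARAM)
ceiling = max_tokens_per_second(w, HBM_BANDWIDTH)

print(f"weights: {w / 1e9:.0f} GB, fits: {fits(w, KV_CACHE_BYTES, HBM_CAPACITY)}")
print(f"ceiling: {ceiling:.0f} tokens/s ({1000 / ceiling:.2f} ms per token)")
```

Under these assumptions the weights fit comfortably, yet a single sequence cannot decode faster than a couple of hundred tokens per second no matter how much compute the chip has. The ceiling ignores KV-cache and activation traffic, so real throughput would be lower.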

The roofline and the ridge point

The roofline model says achievable performance is capped by whichever runs out first, compute or bandwidth, given a workload's arithmetic intensity $I$ (FLOPs done per byte moved):

$$ \text{performance} = \min\!\big(\text{peak compute},\; I \times \text{bandwidth}\big) $$

The crossover — the ridge point — sits at:

$$ I^{*} = \frac{\text{peak compute}}{\text{bandwidth}} $$

A workload is memory-bound when $I < I^{*}$. As compute grows ×3 per generation and bandwidth only ×1.6, $I^{*}$ climbs by $3/1.6 \approx 1.9\times$ each generation, so the bar for staying compute-bound keeps rising. LLM decode at batch 1 has an intensity of roughly 1–2 FLOPs/byte, because each weight is read from memory once and used for a single multiply-add (2 FLOPs). That is far below a modern GPU's ridge point of several hundred FLOPs/byte, which is exactly why batch-1 decode wastes most of the chip's compute, and why batching (reusing each loaded weight across many sequences, which raises $I$) is the escape.
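The roofline formula and the effect of batching can be sketched in a few lines. This is a simplified model under stated assumptions: the peak of 989 TFLOP/s is an approximate dense 16-bit figure assumed for an H100-class GPU, the decode intensity counts only weight traffic (2 FLOPs per weight per sequence, ignoring KV cache and activations), and the function names are hypothetical.

```python
# Minimal roofline sketch: performance = min(peak compute, I * bandwidth).

def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity I* (FLOPs/byte) at which compute and bandwidth limits meet."""
    return peak_flops / bandwidth

def attainable(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline cap on achievable FLOP/s for a given arithmetic intensity."""
    return min(peak_flops, intensity * bandwidth)

def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Simplified decode intensity: each weight is loaded once and used for
    one multiply-add (2 FLOPs) per sequence in the batch."""
    return 2.0 * batch_size / bytes_per_param

PEAK = 989e12  # assumption: approximate dense 16-bit peak, FLOP/s
BW = 3.35e12   # H100 HBM bandwidth quoted in the text, bytes/s

ridge = ridge_point(PEAK, BW)
print(f"ridge point: {ridge:.0f} FLOPs/byte")

for batch in (1, 8, 64, 512):
    intensity = decode_intensity(batch)
    perf = attainable(PEAK, BW, intensity)
    bound = "memory" if intensity < ridge else "compute"
    print(f"batch {batch:4d}: I = {intensity:6.1f} FLOPs/byte, "
          f"{perf / PEAK:6.1%} of peak ({bound}-bound)")
```

With these numbers the ridge sits near 295 FLOPs/byte, so batch-1 decode reaches well under 1% of peak compute, and only a batch in the hundreds crosses into compute-bound territory.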

The wall, generation by generation

    # Gholami et al. "AI and Memory Wall" (2024): per ~2-year generation,
    # peak compute grows ~3x, memory bandwidth ~1.6x.
    C, B = 3.0, 1.6

    for k in range(6):
        comp, bw = C**k, B**k
        gap = comp / bw  # extra arithmetic-per-byte you must find to stay busy
        print(f"gen {k} (+{2*k}yr): compute x{comp:6.1f} bandwidth x{bw:5.1f} gap x{gap:5.2f}")

    # gen 0 (+0yr): compute x   1.0 bandwidth x  1.0 gap x 1.00
    # gen 3 (+6yr): compute x  27.0 bandwidth x  4.1 gap x 6.59
    # gen 5 (+10yr): compute x 243.0 bandwidth x 10.5 gap x23.17   <- the wall
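The same compounding can be inverted to ask how long it takes for the gap to reach a given size. The sketch below assumes the same illustrative growth rates (×3.0 compute, ×1.6 bandwidth, about 2 years per generation); `generations_until` is a hypothetical helper name.

```python
import math

# Same illustrative trend as above: growth per ~2-year generation.
C, B = 3.0, 1.6

def generations_until(gap_factor: float,
                      compute_growth: float = C,
                      bandwidth_growth: float = B) -> float:
    """Generations k needed so that (compute_growth / bandwidth_growth)**k
    equals gap_factor, i.e. k = log(gap) / log(C / B)."""
    per_generation = compute_growth / bandwidth_growth
    return math.log(gap_factor) / math.log(per_generation)

for target in (2, 10, 100):
    k = generations_until(target)
    print(f"gap x{target}: {k:.1f} generations (~{2 * k:.0f} years)")
```

At these rates a tenfold gap takes a little under four generations. Equivalently, a workload has to find roughly 1.9× more arithmetic per byte every generation just to keep the same share of the chip busy.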

You're buying memory, priced as compute

The wall → money

The memory wall rewrites what a GPU purchase really is. Because inference is memory-bound, buyers are often paying for capacity and bandwidth — and getting compute they can't fully use as a side effect. HBM is the scarce, expensive component inside the accelerator, made by just three suppliers (SK Hynix, Samsung, Micron), and its availability has become a genuine gate on how many AI chips can be built at all.

That reshapes the capex story. The bottleneck isn't only fabricating logic; it's stacking enough high-bandwidth memory around it. When a frontier chip is supply-constrained, HBM is frequently why — a detail that turns an obscure memory technology into a macro variable.

For the Circuit, the wall sets a hard floor under costs: no matter how cheap arithmetic gets, moving data stays expensive, so the price of serving a token can't fall faster than memory improves. Every efficiency lever in this section — fewer bits, batching, better caching — is ultimately a way to do more with each precious byte of bandwidth.
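One way to see the cost floor is to turn the bandwidth ceiling into dollars. The sketch below is a rough model only: the hourly price of $2.00, the 7-billion-parameter model and the batch sizes are hypothetical placeholders rather than market data, the function name is made up for illustration, and the model ignores KV-cache traffic and the compute ceiling that large batches eventually hit.

```python
# Bandwidth-bound cost floor per million generated tokens.
# Every input here is a hypothetical placeholder.

def cost_floor_per_million_tokens(gpu_dollars_per_hour: float,
                                  weights_bytes: float,
                                  bandwidth_bps: float,
                                  batch_size: int = 1) -> float:
    """Lower bound on serving cost when decode is limited by weight streaming."""
    steps_per_second = bandwidth_bps / weights_bytes   # one full weight read per step
    tokens_per_second = steps_per_second * batch_size  # one token per sequence per step
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

PRICE = 2.00  # hypothetical $/GPU-hour
BW = 3.35e12  # H100 HBM bandwidth quoted earlier, bytes/s

fp16 = cost_floor_per_million_tokens(PRICE, 7e9 * 2, BW)
int8 = cost_floor_per_million_tokens(PRICE, 7e9 * 1, BW)
batched = cost_floor_per_million_tokens(PRICE, 7e9 * 2, BW, batch_size=32)

print(f"16-bit, batch 1:  ${fp16:.2f} per million tokens")
print(f"8-bit,  batch 1:  ${int8:.2f} per million tokens")
print(f"16-bit, batch 32: ${batched:.3f} per million tokens")
```

In this model faster arithmetic does not appear anywhere in the formula: only fewer bytes per weight, more tokens per weight read, or more bandwidth lower the floor.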
