First Principles / Part III · Inference & systems / Chapter 16
A neural network is, underneath, one operation repeated trillions of times: matrix multiplication. GPUs win because they do that one thing with thousands of cores at once — while a CPU, built for a few fast sequential tasks, does it a handful at a time.
01 The answer, then the intuition
Strip a transformer down and almost all of its work is multiplying big grids of numbers — every attention step and every weight layer is a matrix multiply. And a matrix multiply is embarrassingly parallel: each output number is an independent dot product that doesn't depend on the others, so in principle you could compute all of them at the same time.
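To make the independence claim concrete, here is a minimal sketch in plain Python. It fills the output cells of a small product in a shuffled order and gets the same answer each time, which is only possible because no cell depends on another. The helper names `dot` and `matmul_any_order` are illustrative and not taken from any library.

```python
import random

def dot(row, col):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(row, col))

def matmul_any_order(A, B, seed=0):
    """Multiply A (n x k) by B (k x m), filling output cells in a shuffled order."""
    n, k, m = len(A), len(B), len(B[0])
    cols = [[B[t][j] for t in range(k)] for j in range(m)]  # columns of B
    C = [[None] * m for _ in range(n)]
    cells = [(i, j) for i in range(n) for j in range(m)]
    random.Random(seed).shuffle(cells)  # the order is arbitrary on purpose
    for i, j in cells:
        # Each cell reads only row i of A and column j of B, never another cell.
        C[i][j] = dot(A[i], cols[j])
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_any_order(A, B, seed=0))  # [[19, 22], [43, 50]]
print(matmul_any_order(A, B, seed=1))  # [[19, 22], [43, 50]]
```

Because the loop order does not matter, the cells could just as well be handed to separate workers and computed at the same time.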
A CPU has a few very powerful cores tuned for fast, branchy, sequential work. A GPU has thousands of simpler cores built to run the exact same arithmetic on many numbers at once — plus huge memory bandwidth to feed them. For AI's one repeated operation, that's a perfect match. Race them on the same matrix multiply:
Interactive figure (illustrative): each tile is a chunk of output, and cores fill tiles in parallel. The animation shows the shape of the gap; the real number is in the readout.
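The tile picture can be approximated with a toy model: split the output into tiles, let each worker fill one tile per round, and count the rounds. This is a rough sketch under made-up assumptions. The tile count, the worker counts, and the per-tile times below are hypothetical round numbers, not measurements of any real chip.

```python
import math

def waves_needed(n_tiles, n_workers):
    """Rounds needed when every worker fills one tile per round."""
    return math.ceil(n_tiles / n_workers)

def fill_time(n_tiles, n_workers, time_per_tile):
    """Total time in arbitrary units: rounds times the time one tile takes."""
    return waves_needed(n_tiles, n_workers) * time_per_tile

n_tiles = 4096  # hypothetical: output split into a 64 x 64 grid of tiles

# Hypothetical CPU-like device: 8 fast workers, 1 time unit per tile.
cpu_like = fill_time(n_tiles, n_workers=8, time_per_tile=1.0)

# Hypothetical GPU-like device: 2048 workers, each 4x slower per tile.
gpu_like = fill_time(n_tiles, n_workers=2048, time_per_tile=4.0)

print(cpu_like)             # 512.0
print(gpu_like)             # 8.0
print(cpu_like / gpu_like)  # 64.0
```

Even with each worker four times slower in this toy setup, the wide device finishes far sooner, because it needs 2 rounds instead of 512.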
02 Mechanics
This is also why the field is a hardware story as much as a software one: progress is gated by how many parallel multiply-adds per second per dollar the silicon can deliver.
04 The math
Multiplying two $d \times d$ matrices produces $d^2$ outputs, each a dot product of length $d$ (a multiply and an add per term), so the cost is:

$$\text{FLOPs} = d^2 \cdot 2d = 2d^3$$
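The count of $2d^3$ operations can be checked by brute force on small sizes. The sketch below walks the naive triple loop and tallies one multiply and one add per term, which is the counting convention used here (a strict count would have one fewer add per output). The function name `count_matmul_ops` is illustrative.

```python
def count_matmul_ops(d):
    """Count multiplies and adds in a naive d x d by d x d matrix multiply."""
    multiplies = adds = 0
    for i in range(d):          # each output row
        for j in range(d):      # each output column: d * d outputs in total
            for k in range(d):  # one length-d dot product per output
                multiplies += 1  # a[i][k] * b[k][j]
                adds += 1        # accumulate into the running sum
    return multiplies, adds

for d in (2, 4, 8):
    m, a = count_matmul_ops(d)
    print(d, m + a, 2 * d**3)
# 2 16 16
# 4 128 128
# 8 1024 1024
```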
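Wall-clock time is that count divided by the hardware's throughput. For $d = 8192$ that's about $1.1 \times 10^{12}$ FLOPs. A CPU delivering ~0.5 TFLOP/s of usable fp32 takes seconds; a GPU's tensor cores deliver roughly 1,000 TFLOP/s in bf16 and take about a millisecond:

$$t_{\text{CPU}} \approx \frac{1.1 \times 10^{12}}{0.5 \times 10^{12}} \approx 2.2\ \text{s}, \qquad t_{\text{GPU}} \approx \frac{1.1 \times 10^{12}}{10^{15}} \approx 1.1\ \text{ms}$$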
A roughly 2,000× difference on a single operation — and a model is billions of these. The gap isn't that the GPU's individual cores are faster (they're slower); it's that there are thousands of them doing independent work at once. Parallelism, not clock speed, is the whole game.
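One way to see where the gap comes from is to break peak throughput into three factors: how many compute units there are, how many FLOPs each one completes per clock cycle, and the clock rate. The sketch below does that with hypothetical round numbers, picked only so the products land near the ~0.5 and ~1,000 TFLOP/s figures used in this chapter. They are not the specifications of any real CPU or GPU.

```python
def peak_flops(units, flops_per_unit_per_cycle, clock_hz):
    """Peak FLOP/s as units x per-cycle width x clock rate."""
    return units * flops_per_unit_per_cycle * clock_hz

# Hypothetical CPU-like device: few units, narrow, high clock.
cpu_like = peak_flops(units=8, flops_per_unit_per_cycle=16, clock_hz=4.0e9)

# Hypothetical GPU-like device: many units, very wide, lower clock.
gpu_like = peak_flops(units=512, flops_per_unit_per_cycle=1024, clock_hz=1.9e9)

print(f"CPU-like: {cpu_like / 1e12:.3f} TFLOP/s")
print(f"GPU-like: {gpu_like / 1e12:.1f} TFLOP/s")
print(f"ratio:    {gpu_like / cpu_like:.0f}x")
```

In this made-up comparison the GPU-like device runs at less than half the clock rate, and the whole advantage comes from the first two factors, which is the point the paragraph above makes about parallelism.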
05 The code
One large matrix multiply, timed against a CPU's and a GPU's throughput.
cpu_vs_gpu.py
def matmul_flops(d):
    """FLOPs for multiplying two d x d matrices."""
    return 2 * d**3  # d^2 outputs, each a length-d dot product

d = 8192
flops = matmul_flops(d)

# Throughput in FLOP/s
cpu = 0.5e12  # CPU: ~0.5 TFLOP/s usable fp32
gpu = 990e12  # GPU tensor cores: ~990 TFLOP/s bf16 (H100-class)

print(f"{d}x{d} matmul: {flops:.2e} FLOPs")
print(f"CPU: {flops/cpu:.3f} s")
print(f"GPU: {flops/gpu*1000:.3f} ms")
print(f"speedup: {(flops/cpu)/(flops/gpu):.0f}x")

# Output:
# 8192x8192 matmul: 1.10e+12 FLOPs
# CPU: 2.199 s
# GPU: 1.111 ms
# speedup: 1980x
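The same estimate can be swept across matrix sizes to show the cubic scaling: doubling $d$ multiplies the work, and so the estimated time, by eight. This is a sketch that assumes the GPU sustains the ~990 TFLOP/s figure at every size, which smaller matrices often do not in practice. `matmul_seconds` is an illustrative helper name.

```python
def matmul_flops(d):
    """FLOPs for multiplying two d x d matrices."""
    return 2 * d**3

def matmul_seconds(d, flops_per_second):
    """Idealized time: operation count divided by sustained throughput."""
    return matmul_flops(d) / flops_per_second

gpu = 990e12  # assumed sustained throughput in FLOP/s, same figure as above

for d in (2048, 4096, 8192, 16384):
    print(f"d={d:>5}: {matmul_seconds(d, gpu) * 1000:.3f} ms")

# Doubling d multiplies the estimated time by 2**3 = 8.
print(matmul_seconds(16384, gpu) / matmul_seconds(8192, gpu))
```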
06 The economics
Parallelism → money
Because AI's appetite is for parallel matrix-multiply throughput, the entire build-out resolves to one scarce thing: GPUs. The hundreds of billions in capital expenditure are, concretely, orders for accelerators and the power and buildings to run them. When people say a lab is "compute-constrained," they mean it cannot get enough of these chips.
That scarcity is why a single company, Nvidia, captures so much of the value in AI — it sells the one input everyone needs — and why its data-center revenue became a real-time gauge of the build-out itself. The chip is the bottleneck, the capital line, and the moat all at once.
For the Circuit, this is the supply side of the equation: the cost of intelligence is set by GPU throughput per dollar, and it falls only as fast as the silicon improves. Every efficiency trick in this section is ultimately about extracting more useful tokens from each very expensive chip.
07 Going deeper
NVIDIA — GPU Performance Background · cores, throughput, and the roofline.
NVIDIA — Tensor Core architecture · the matrix-multiply-accumulate unit.
Dao et al. (2022) — FlashAttention · why memory movement, not FLOPs, often dominates.
SemiAnalysis · ongoing economic analysis of the AI hardware supply chain.
Cite this chapter: Divergent Compute, "Why AI needs GPUs", First Principles, 2026. divergentcompute.com/first-principles-gpus · v1.0 · CC-BY.