Divergent Compute.AI Economic Think Tank

First Principles / Part III · Inference & systems / Chapter 16


Why AI needs GPUs

A neural network is, underneath, one operation repeated trillions of times: matrix multiplication. GPUs win because they do that one thing with thousands of cores at once — while a CPU, built for a few fast sequential tasks, does it a handful at a time.


01 The answer, then the intuition

The same sum, a million times over

Strip a transformer down and almost all of its work is multiplying big grids of numbers — every attention step and every weight layer is a matrix multiply. And a matrix multiply is embarrassingly parallel: each output number is an independent dot product that doesn't depend on the others, so in principle you could compute all of them at the same time.
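To make "independent" concrete, here is a minimal sketch in plain Python. The helper names (`dot`, `matmul_in_order`) and the tiny example matrices are made up for illustration. It fills in the output cells first in reading order and then in a shuffled order; because no cell reads any other cell, the result is the same either way, which is what lets hardware compute many cells at once.

```python
import random

def dot(row, col):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(row, col))

def matmul_in_order(A, B, order):
    """Fill C = A @ B one output cell at a time, visiting cells in `order`."""
    n, m = len(A), len(B[0])
    cols = list(zip(*B))                  # columns of B
    C = [[None] * m for _ in range(n)]
    for i, j in order:
        # Cell (i, j) needs only row i of A and column j of B,
        # never another cell of C.
        C[i][j] = dot(A[i], cols[j])
    return C

A = [[1, 2], [3, 4], [5, 6]]              # 3x2
B = [[7, 8, 9], [10, 11, 12]]             # 2x3

cells = [(i, j) for i in range(3) for j in range(3)]
shuffled = cells[:]
random.Random(0).shuffle(shuffled)        # any visiting order works

C_forward = matmul_in_order(A, B, cells)
C_shuffled = matmul_in_order(A, B, shuffled)
print(C_forward)                # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
print(C_forward == C_shuffled)  # True
```

The loop here still runs one cell at a time; the point is only that nothing in the math forces it to.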

A CPU has a few very powerful cores tuned for fast, branchy, sequential work. A GPU has thousands of simpler cores built to run the exact same arithmetic on many numbers at once — plus huge memory bandwidth to feed them. For AI's one repeated operation, that's a perfect match. Race them on the same matrix multiply:

[Interactive figure: CPU vs GPU, the same matrix multiply. Illustrative: each tile is a chunk of output, and cores fill tiles in parallel. CPU: ~8 cores, sequential. GPU: thousands of cores, parallel. Same math; one finishes while the other has barely started.]
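The shape of that race can be sketched with a toy model. Every number below is invented for illustration (4,096 output tiles, 8 fast workers versus 4,096 workers that are each four times slower), and `time_to_fill` is a hypothetical helper, not a measurement of any real chip.

```python
import math

def time_to_fill(tiles, workers, time_per_tile):
    """Time units needed to fill every tile when `workers` tiles run at once."""
    rounds = math.ceil(tiles / workers)   # batches of tiles processed together
    return rounds * time_per_tile

tiles = 4096                              # illustrative number of output tiles

# A few fast workers: 1 time unit per tile (assumed).
cpu_time = time_to_fill(tiles, workers=8, time_per_tile=1)
# Many slower workers: 4 time units per tile (assumed).
gpu_time = time_to_fill(tiles, workers=4096, time_per_tile=4)

print(cpu_time, gpu_time, cpu_time / gpu_time)   # 512 4 128.0
```

Even with each worker four times slower in this toy setup, having one worker per tile wins by a wide margin.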

02 Mechanics

Built for throughput, not latency

  • Thousands of cores. A modern GPU has tens of thousands of arithmetic units running the same instruction across different data at once (SIMT). For one operation repeated over millions of numbers, that's ideal; for tangled, branchy logic, it's wasted.
  • Tensor cores. GPUs now include dedicated units that do a small matrix multiply-accumulate in a single step: hardware built specifically for the one operation neural networks need. Most of the headline FLOP/s comes from these units.
  • Memory bandwidth. All those cores are useless if you can't feed them. GPUs pair the compute with very fast high-bandwidth memory (HBM) — the subject of the next chapter — so weights and activations stream in fast enough to keep the cores busy.
  • The trade-off. A CPU minimizes the time for one task (latency); a GPU maximizes the total work done across many (throughput). AI is almost entirely the second kind of problem, which is why the GPU — originally built to shade millions of pixels in parallel — turned out to be the perfect AI chip by accident.
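The tensor-core idea can be imitated in software. The sketch below is plain Python, not GPU code: `mma_tile` is a hypothetical stand-in for the hardware step, and the tile size of 2 is chosen only to keep the example small. It builds a full matrix multiply out of repeated small "multiply these two tiles and add into an accumulator" steps, then checks the result against the textbook triple loop.

```python
def mma_tile(C_tile, A_tile, B_tile):
    """One multiply-accumulate on small square tiles: C_tile += A_tile @ B_tile."""
    t = len(A_tile)
    for i in range(t):
        for j in range(t):
            C_tile[i][j] += sum(A_tile[i][k] * B_tile[k][j] for k in range(t))

def get_tile(M, r, c, t):
    """Copy the t x t block of M whose top-left corner is (r, c)."""
    return [row[c:c + t] for row in M[r:r + t]]

def tiled_matmul(A, B, t):
    """Square matmul built only from tile-sized multiply-accumulates."""
    n = len(A)
    assert n % t == 0, "this sketch assumes the tile size divides n"
    C = [[0] * n for _ in range(n)]
    for r in range(0, n, t):
        for c in range(0, n, t):
            acc = [[0] * t for _ in range(t)]      # accumulator for one output tile
            for k in range(0, n, t):
                mma_tile(acc, get_tile(A, r, k, t), get_tile(B, k, c, t))
            for i in range(t):
                C[r + i][c:c + t] = acc[i]         # write the finished tile back
    return C

def naive_matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n, t = 4, 2
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % 3 for j in range(n)] for i in range(n)]
print(tiled_matmul(A, B, t) == naive_matmul(A, B))   # True
```

Each output tile has its own accumulator and never reads another tile, so the two outer loops are the part a GPU can spread across its units.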

This is also why the field is a hardware story as much as a software one: progress is gated by how many parallel multiply-adds per second per dollar the silicon can deliver.

04 The math


Why the gap is ~2,000×

Multiplying two $d \times d$ matrices produces $d^2$ outputs, each a dot product of length $d$ (a multiply and an add per term), so the cost is:

$$ \text{FLOPs} = 2\,d^3 $$

Wall-clock time is that count divided by the hardware's throughput. For $d = 8192$ the count is about $1.1 \times 10^{12}$ FLOPs. A CPU delivering ~0.5 TFLOP/s of usable fp32 takes seconds; a GPU's tensor cores deliver roughly 1,000 TFLOP/s in bf16 (H100-class peak):

$$ t = \frac{2\,d^3}{\text{throughput}} \;\Rightarrow\; t_{\text{CPU}} \approx 2.2\ \text{s}, \quad t_{\text{GPU}} \approx 1.1\ \text{ms} $$

That is a roughly 2,000× difference on a single operation, and training or serving a model chains together enormous numbers of them. The gap isn't that the GPU's individual cores are faster (they're slower); it's that there are thousands of them doing independent work at once. Parallelism, not clock speed, is the whole game.
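The $2\,d^3$ count can be checked by brute force for small $d$. This is a minimal sketch with a hypothetical helper, `count_matmul_flops`, that runs the textbook triple loop and tallies one multiply and one add per term. It follows the usual convention of counting the first add into a zero accumulator, so each dot product costs exactly $2d$ operations.

```python
def count_matmul_flops(d):
    """Multiply two d x d matrices of ones naively, counting multiplies and adds."""
    A = [[1.0] * d for _ in range(d)]
    B = [[1.0] * d for _ in range(d)]
    flops = 0
    for i in range(d):                 # d rows of output
        for j in range(d):             # d columns of output -> d^2 cells
            acc = 0.0
            for k in range(d):         # length-d dot product per cell
                acc += A[i][k] * B[k][j]
                flops += 2             # one multiply, one add
    return flops

for d in (2, 4, 8, 16):
    print(d, count_matmul_flops(d), 2 * d**3)
# 2 16 16
# 4 128 128
# 8 1024 1024
# 16 8192 8192
```

Doubling $d$ multiplies the work by eight, which is why large matrices are where the throughput gap matters most.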

05 The code


The race, in numbers

One large matrix multiply, timed against a CPU's and a GPU's throughput.

cpu_vs_gpu.py

def matmul_flops(d):
    return 2 * d**3            # d^2 outputs, each a length-d dot product

d = 8192
flops = matmul_flops(d)

cpu = 0.5e12     # CPU: ~0.5 TFLOP/s usable fp32
gpu = 990e12     # GPU tensor cores: ~990 TFLOP/s bf16 (H100-class)

print(f"{d}x{d} matmul: {flops:.2e} FLOPs")
print(f"CPU: {flops/cpu:.3f} s")
print(f"GPU: {flops/gpu*1000:.3f} ms")
print(f"speedup: {(flops/cpu)/(flops/gpu):.0f}x")
# 8192x8192 matmul: 1.10e+12 FLOPs
# CPU: 2.199 s
# GPU: 1.111 ms
# speedup: 1980x
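The estimate above divides FLOPs by peak throughput, which only holds if memory can feed the cores fast enough. A rough way to check is to compare the compute time with the time needed just to move the matrices. In the sketch below, the memory bandwidth figure (~3.35 TB/s, H100-class HBM) is an assumption that does not come from this chapter, `matmul_bytes` is a hypothetical helper, and the byte count is a best case that reads each input once and writes the output once.

```python
def matmul_flops(d):
    return 2 * d**3                     # d^2 outputs, each a length-d dot product

def matmul_bytes(d, bytes_per_value=2):
    """Best-case traffic: read A and B, write C, each d x d (bf16 = 2 bytes)."""
    return 3 * d * d * bytes_per_value

peak_flops = 990e12    # GPU tensor cores, bf16, as in the listing above
mem_bw = 3.35e12       # ASSUMPTION: ~3.35 TB/s HBM bandwidth (H100-class)

d = 8192
t_compute = matmul_flops(d) / peak_flops
t_memory = matmul_bytes(d) / mem_bw

print(f"compute: {t_compute * 1e3:.3f} ms")
print(f"memory:  {t_memory * 1e3:.3f} ms")
print("limited by:", "compute" if t_compute >= t_memory else "memory")
```

Under these assumptions the large square multiply is limited by compute, which is the case the 1.1 ms figure relies on. Smaller or skinnier multiplies move more bytes per FLOP and can land on the other side of that comparison.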

06 The economics

The build-out is a GPU build-out

Parallelism → money

Because AI's appetite is for parallel matrix-multiply throughput, the entire build-out resolves to one scarce thing: GPUs. The hundreds of billions in capital expenditure are, concretely, orders for accelerators and the power and buildings to run them. When people say a lab is "compute-constrained," they mean it cannot get enough of these chips.

That scarcity is why a single company, Nvidia, captures so much of the value in AI — it sells the one input everyone needs — and why its data-center revenue became a real-time gauge of the build-out itself. The chip is the bottleneck, the capital line, and the moat all at once.

For the Circuit, this is the supply side of the equation: the cost of intelligence is set by GPU throughput per dollar, and it falls only as fast as the silicon improves. Every efficiency trick in this section is ultimately about extracting more useful tokens from each very expensive chip.

07 Going deeper


The primary sources

NVIDIA — GPU Performance Background · cores, throughput, and the roofline.
NVIDIA — Tensor Core architecture · the matrix-multiply-accumulate unit.
Dao et al. (2022) — FlashAttention · why memory movement, not FLOPs, often dominates.
SemiAnalysis · ongoing economic analysis of the AI hardware supply chain.

Cite this chapter: Divergent Compute, "Why AI needs GPUs", First Principles, 2026. divergentcompute.com/first-principles-gpus · v1.0 · CC-BY.
