First Principles / Part III · Inference & systems / Chapter 19
A frontier model is too big for any single GPU. It runs across thousands of them, wired together into racks, pods, and buildings — drawing tens of megawatts. At this scale the network, not the chip, becomes the bottleneck, and power becomes the real limit.
01 The answer, then the intuition
Everything so far assumed the model lived on a GPU or two. Frontier models break that assumption: their weights and caches overflow any single chip, and training them in a human lifetime needs thousands working in concert. So the real unit of AI isn't a GPU — it's a cluster: tens of thousands of GPUs stitched into one enormous parallel computer.
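A back-of-the-envelope sketch shows why the weights alone overflow a chip. The numbers are assumptions for illustration, not figures from this chapter: a hypothetical 1-trillion-parameter model, 2 bytes per parameter (16-bit weights), and 80 GB of memory per GPU. The helper name `min_gpus_for_weights` is likewise made up for this example.

```python
PARAMS = 1_000_000_000_000     # hypothetical model size: 1 trillion parameters
BYTES_PER_PARAM = 2            # assumption: 16-bit weights
GPU_MEMORY_BYTES = 80 * 10**9  # assumption: 80 GB of memory per GPU


def min_gpus_for_weights(params: int, bytes_per_param: int, gpu_bytes: int) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    weight_bytes = params * bytes_per_param
    return -(-weight_bytes // gpu_bytes)  # integer ceiling division


weight_tb = PARAMS * BYTES_PER_PARAM / 1e12
n = min_gpus_for_weights(PARAMS, BYTES_PER_PARAM, GPU_MEMORY_BYTES)
print(f"weights: {weight_tb:.1f} TB -> at least {n} GPUs before any cache or activations")
```

This is only a floor. It ignores the KV cache, activations, and (for training) gradients and optimizer state, so a real deployment would likely need more.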
But you can't just add chips and get proportional speed. They have to talk (sharing weights, gradients, and activations over the network), and that communication is a tax that grows with scale. Scale a cluster from one GPU to hyperscale and three things move: compute soars, efficiency sags under the communication tax, and power climbs into the megawatts.
The figures used here are illustrative but of real-world magnitude: ~1 PFLOP/s and ~1.4 kW all-in per GPU, with MFU (realized ÷ peak) declining with scale as communication grows.
02 Mechanics
So the frontier is an exercise in distributed systems as much as machine learning — and increasingly in electrical engineering, because the thing you eventually run out of is power.
04 The math
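Peak compute is just the GPU count $G$ times per-GPU throughput $f_{\text{GPU}}$; realized compute applies the efficiency $\eta$ (MFU):

$$F_{\text{peak}} = G \cdot f_{\text{GPU}}, \qquad F_{\text{realized}} = \eta \cdot F_{\text{peak}}$$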
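The efficiency drop is an Amdahl-style limit: if a fraction of each step is communication that grows with scale, the useful fraction shrinks. Power, by contrast, scales cleanly with the count, marked up by the facility's overhead (PUE):

$$P = G \cdot P_{\text{GPU}} \cdot \mathrm{PUE}$$

When $P_{\text{GPU}}$ is quoted all-in (chip plus its share of node and cooling), that overhead is already folded in and the PUE factor drops out.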
At $G = 32{,}768$, $f_{\text{GPU}} \approx 1$ PFLOP/s and $P_{\text{GPU}} \approx 1.4$ kW all-in, that's ~33 EFLOP/s of peak but only ~10 realized at ~32% MFU — drawing about 46 MW, the power of a small town. The compute grows linearly with money; the useful compute grows more slowly, and the power grows right along with the bill.
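The Amdahl-style limit described above can be made concrete with a toy model. This sketch is not the chapter's efficiency curve. It assumes a fixed compute time per step and a communication time that grows with $\log_2 G$ (both constants are hypothetical), and computes the useful fraction of each step as $t_{\text{compute}} / (t_{\text{compute}} + t_{\text{comm}})$.

```python
import math

T_COMPUTE = 1.0           # hypothetical: seconds of pure compute per step
T_COMM_PER_DOUBLING = 0.1 # hypothetical: extra communication time per doubling of G


def useful_fraction(num_gpus: int) -> float:
    """Fraction of each step spent computing rather than communicating."""
    t_comm = T_COMM_PER_DOUBLING * math.log2(num_gpus)  # zero for a single GPU
    return T_COMPUTE / (T_COMPUTE + t_comm)


for g in [1, 8, 512, 8192, 32768]:
    print(f"{g:6d} GPUs: useful fraction {useful_fraction(g):.2f}")
```

The useful fraction starts at 1.0 for a single GPU and falls as the communication term grows. Real MFU also includes per-GPU inefficiency, which is why it starts well below 100% even at $G = 1$.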
05 The code
Peak vs. realized compute and power across five scales: watch MFU erode as the cluster grows.
cluster.py
import math

PER_GPU = 1e15  # ~1 PFLOP/s per GPU
W = 1400        # ~watts per GPU, all-in (chip + share of node & cooling)


def mfu(G):
    # Efficiency falls with scale (communication tax); illustrative curve only.
    return 0.62 - 0.02 * math.log2(G)


for G in [1, 8, 512, 8192, 32768]:
    peak = G * PER_GPU / 1e18   # EFLOP/s
    realized = peak * mfu(G)    # EFLOP/s actually delivered
    power = G * W / 1e6         # MW
    print(f"{G:6d} GPUs: peak {peak:5.2f} EF MFU {mfu(G)*100:4.1f}% "
          f"realized {realized:5.2f} EF power {power:5.1f} MW")

# 32768 GPUs: peak 32.77 EF MFU 32.0% realized 10.49 EF power 45.9 MW
# -> compute grows linearly with spend; useful compute lags; power tracks the bill
06 The economics
Clusters → money
This is what the capital expenditure actually buys. When a hyperscaler reports tens of billions in spend, it resolves to this: buildings full of GPUs, the networking to bind them, and the substations to power them. The abstract "$500B build-out" is, concretely, a fleet of these clusters — and the reason the numbers are so large is that a frontier machine is a small industrial facility.
Two constraints now bind harder than chips. First, the scaling tax: because efficiency falls with size, the next doubling of a cluster costs 2× but delivers less than 2× the useful compute — diminishing returns the spending has to outrun. Second, power: at tens of megawatts per cluster and gigawatts in aggregate, AI has become an energy story, gated by grid capacity, turbines, and permits as much as by silicon.
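The scaling tax can be put in numbers with the same illustrative MFU curve used in cluster.py. This is a sketch under that curve's assumptions, plus the assumption that cost is proportional to GPU count; `realized_eflops` and `doubling_gain` are names invented for this example.

```python
import math

PER_GPU_EFLOPS = 1e15 / 1e18  # ~1 PFLOP/s per GPU, expressed in EFLOP/s


def mfu(num_gpus: int) -> float:
    # Same illustrative curve as cluster.py: efficiency falls with log2(G).
    return 0.62 - 0.02 * math.log2(num_gpus)


def realized_eflops(num_gpus: int) -> float:
    """Useful compute: peak throughput scaled by MFU."""
    return num_gpus * PER_GPU_EFLOPS * mfu(num_gpus)


def doubling_gain(num_gpus: int) -> float:
    """Useful-compute multiplier from doubling the cluster (cost multiplier assumed 2x)."""
    return realized_eflops(2 * num_gpus) / realized_eflops(num_gpus)


for g in [8, 512, 8192, 16384]:
    print(f"{g:6d} -> {2 * g:6d} GPUs: cost 2.00x, useful compute {doubling_gain(g):.2f}x")
```

Under this curve every doubling returns somewhat less than 2× the useful compute, and the shortfall widens as the cluster grows.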
For the Circuit, the cluster is the whole cost side made physical — GPUs, network, and power compounding into the denominator that revenue must eventually clear. It's the concrete thing the divergence between spending and payoff is measuring. Everything in this section — quantization, batching, beating the memory wall — exists to wring more value out of these very expensive, very hungry machines.
Part III complete
From a single forward pass, through the tokens-per-second speed limit, the GPU, the memory wall, the KV cache and batching, all the way to a warehouse of thirty thousand chips drawing the power of a town — you've followed inference from one operation to the physical build-out it demands. Part I built the machine; Part II made it a model; Part III is the cost of running it.
Part IV turns from mechanics to use: prompting, retrieval (RAG), agents and tools, evals, and vector search. It covers how you actually build something with all this.
07 Going deeper
Shoeybi et al. (2019), Megatron-LM · tensor and pipeline model parallelism.
Narayanan et al. (2021), Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM · MFU and the scaling tax.
The Llama 3 Herd of Models (2024) · a real 16K-GPU training cluster, described.
SemiAnalysis · data-center power, networking, and cluster economics.
Cite this chapter: Divergent Compute, "The data-center cluster", First Principles, 2026. divergentcompute.com/first-principles-cluster · v1.0 · CC-BY.