Divergent Compute.AI Economic Think Tank

First Principles / Part III · Inference & systems / Chapter 19

The data-center cluster

A frontier model is too big for any single GPU. It runs across thousands of them, wired together into racks, pods, and buildings — drawing tens of megawatts. At this scale the network, not the chip, becomes the bottleneck, and power becomes the real limit.

01 The answer, then the intuition

One model, ten thousand chips, one machine

Everything so far assumed the model lived on a GPU or two. Frontier models break that assumption: their weights and caches overflow any single chip, and training them in a human lifetime needs thousands working in concert. So the real unit of AI isn't a GPU — it's a cluster: tens of thousands of GPUs stitched into one enormous parallel computer.

But you can't just add chips and get proportional speed. They have to talk, sharing weights, gradients, and activations over the network, and that communication is a tax that grows with scale. Scale a cluster from one GPU to hyperscale and three things move: peak compute soars, efficiency sags under the communication tax, and power climbs into the megawatts.

Figure (interactive): Scale a cluster: compute, efficiency, and power. Cluster scale runs from 1 GPU to 32,768 GPUs; the readouts are peak compute, efficiency (MFU), realized compute, and power draw.

Assumes ~1 PFLOP/s and ~1.4 kW all-in per GPU; MFU (realized ÷ peak) declines with scale as communication grows. Illustrative values at real-world magnitudes.

03 Mechanics

How thousands of GPUs become one computer

  • Splitting the model. When a model won't fit on one GPU, you split it. Tensor parallelism shards each layer's matrices across GPUs in a node; pipeline parallelism puts different layers on different nodes; data parallelism runs full replicas on different batches. Real clusters combine all three.
  • The interconnect. Split work means constant communication. Inside a node, GPUs talk over NVLink at roughly a terabyte per second per GPU; across nodes, they use InfiniBand or specialized Ethernet, which is fast but far slower than a GPU's own memory. Training's all-reduce (summing gradients across every GPU each step) makes the network a first-class bottleneck.
  • The physical hierarchy. 8 GPUs to a server, several servers to a rack, racks into pods, pods into a building. A modern AI data center is tens of thousands of GPUs, kilometers of fiber, and a power substation — a single machine the size of a warehouse.
  • The scaling tax. Because of communication, doubling the GPUs doesn't double the useful work. Model FLOPs utilization (MFU) — realized work over peak — drops as you scale, often from ~50–60% at modest size toward ~30–40% at the frontier. You pay for the chips; the network eats part of what you get.
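
As a rough illustration of how the three kinds of parallelism above combine, the sketch below factors a GPU count into tensor, pipeline, and data-parallel degrees and maps a flat rank to its position in that grid. The `Layout` class, the function names, the rank ordering (tensor index varying fastest, so a tensor-parallel group of eight stays inside one server), and the 400B-parameter, 2-bytes-per-weight example are all assumptions made for illustration, not a description of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    tp: int  # tensor-parallel degree: GPUs sharing each layer's matrices
    pp: int  # pipeline-parallel degree: stages of consecutive layers
    dp: int  # data-parallel degree: full replicas of the model


def make_layout(num_gpus: int, tp: int, pp: int) -> Layout:
    """Whatever is left after tensor and pipeline splitting becomes replicas."""
    if tp < 1 or pp < 1 or num_gpus % (tp * pp) != 0:
        raise ValueError("num_gpus must be a positive multiple of tp * pp")
    return Layout(tp=tp, pp=pp, dp=num_gpus // (tp * pp))


def coords(rank: int, layout: Layout) -> tuple[int, int, int]:
    """Map a flat GPU rank to (dp_index, pp_index, tp_index).

    Assumed ordering: the tensor index varies fastest, then pipeline, then data.
    """
    tp_idx = rank % layout.tp
    pp_idx = (rank // layout.tp) % layout.pp
    dp_idx = rank // (layout.tp * layout.pp)
    return dp_idx, pp_idx, tp_idx


def weight_gb_per_gpu(params: float, bytes_per_param: float, layout: Layout) -> float:
    """Weights are sharded by tp and pp; data parallelism only replicates them."""
    return params * bytes_per_param / (layout.tp * layout.pp) / 1e9


# Hypothetical example: 16,384 GPUs, 8-way tensor and 16-way pipeline parallelism.
layout = make_layout(16_384, tp=8, pp=16)
print(layout)                               # dp works out to 128 replicas
print(coords(8, layout))                    # first GPU of the second pipeline stage
print(weight_gb_per_gpu(400e9, 2, layout))  # 6.25 GB of weights per GPU
```

Note that adding replicas does nothing for per-GPU weight memory in this plain layout; only the tensor and pipeline degrees shrink the shard each GPU holds.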

So the frontier is an exercise in distributed systems as much as machine learning — and increasingly in electrical engineering, because the thing you eventually run out of is power.
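
To put a number on the all-reduce cost named in the interconnect bullet, a common back-of-envelope model is the bandwidth term of a ring all-reduce, in which each GPU sends about 2(N−1)/N times the payload. The sketch below applies it. The function name, the 10 GB gradient payload, and the 900 GB/s and 50 GB/s link speeds are illustrative assumptions, and the model ignores latency, congestion, and any overlap of communication with compute.

```python
def ring_allreduce_seconds(num_gpus: int, payload_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce across num_gpus GPUs."""
    if num_gpus < 1 or link_bytes_per_s <= 0:
        raise ValueError("need at least one GPU and a positive link speed")
    # Reduce-scatter plus all-gather: each GPU sends 2 * (N - 1) / N of the payload.
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_s


GRADIENT_BYTES = 10e9   # assumed gradient payload per step
INTRA_NODE_BPS = 900e9  # assumed NVLink-class bandwidth, bytes/s
INTER_NODE_BPS = 50e9   # assumed 400 Gb/s network link, bytes/s

for label, n, bandwidth in [
    ("8 GPUs inside one node", 8, INTRA_NODE_BPS),
    ("1,024 GPUs across nodes", 1024, INTER_NODE_BPS),
]:
    seconds = ring_allreduce_seconds(n, GRADIENT_BYTES, bandwidth)
    print(f"{label}: {seconds * 1e3:.1f} ms per all-reduce")
```

Under these assumptions the same payload takes roughly twenty times longer across nodes than inside one, which is the gap that pushes frameworks to keep the chattiest traffic inside a node.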

04 The math

Compute, the tax, and the wattage

Peak compute is just the count times per-GPU throughput; realized compute applies the efficiency $\eta$ (MFU):

$$ \text{realized} = G \cdot f_{\text{GPU}} \cdot \eta(G), \qquad \eta(G) \;\text{falls as}\; G \;\text{grows} $$

The efficiency drop is an Amdahl-style limit: if a fraction of each step is communication that grows with scale, the useful fraction shrinks. And power scales cleanly with the count, marked up by the facility's overhead (PUE):

$$ P = G \cdot P_{\text{GPU}} \cdot \text{PUE} $$

At $G = 32{,}768$, $f_{\text{GPU}} \approx 1$ PFLOP/s, and $P_{\text{GPU}} \cdot \text{PUE} \approx 1.4$ kW all-in (the facility overhead is already folded into that figure), that's ~33 EFLOP/s of peak but only ~10 EFLOP/s realized at ~32% MFU, drawing about 46 MW, the power of a small town. Peak compute grows linearly with money; useful compute grows more slowly, and power grows right along with the bill.
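
One simple way to make the Amdahl-style limit concrete is to assume communication is not overlapped with compute, so that efficiency equals the single-GPU efficiency divided by one plus the ratio of communication time to compute time. The sketch below uses that assumed model to back out the overhead implied by the ~32% figure, and evaluates the power formula alongside it. The function names, the 0.62 single-GPU efficiency, and the choice of PUE = 1.0 (because the 1.4 kW figure is described as all-in) are assumptions for illustration.

```python
def efficiency(eta_single: float, comm_ratio: float) -> float:
    """Assumed model: eta(G) = eta_single / (1 + T_comm / T_compute)."""
    return eta_single / (1.0 + comm_ratio)


def implied_comm_ratio(eta_single: float, eta_cluster: float) -> float:
    """Invert the model: the T_comm / T_compute that yields eta_cluster."""
    return eta_single / eta_cluster - 1.0


def facility_power_mw(num_gpus: int, watts_per_gpu: float, pue: float = 1.0) -> float:
    """P = G * P_GPU * PUE, in megawatts."""
    return num_gpus * watts_per_gpu * pue / 1e6


ratio = implied_comm_ratio(eta_single=0.62, eta_cluster=0.32)
print(f"implied communication/compute ratio: {ratio:.2f}")
print(f"round trip through the model: MFU = {efficiency(0.62, ratio):.2f}")
print(f"facility power at 32,768 GPUs: {facility_power_mw(32_768, 1400):.1f} MW")
```

Read this way, a drop from 62% to 32% corresponds to exposed communication time of about 0.94 times the compute time per step. Real systems overlap much of their communication with compute, so this is an upper-bound style reading, not a measurement.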

05 The code

A cluster, by the numbers

Peak vs realized compute and power across five scales — watch MFU erode as the cluster grows.

cluster.py

import math
PER_GPU = 1e15          # ~1 PFLOP/s per GPU
W = 1400                # ~watts per GPU, all-in (chip + share of node & cooling)

def mfu(G):             # efficiency falls with scale (communication tax)
    return 0.62 - 0.02 * math.log2(G)

for G in [1, 8, 512, 8192, 32768]:
    peak = G * PER_GPU / 1e18                 # EFLOP/s
    realized = peak * mfu(G)
    power = G * W / 1e6                        # MW
    print(f"{G:6d} GPUs: peak {peak:5.2f} EF  MFU {mfu(G)*100:4.1f}%  "
          f"realized {realized:5.2f} EF  power {power:5.1f} MW")
# 32768 GPUs: peak 32.77 EF  MFU 32.0%  realized 10.49 EF  power 45.9 MW
#   -> compute grows linearly with spend; useful compute lags; power tracks the bill

06 The economics

The build-out, made physical

Clusters → money

This is what the capital expenditure actually buys. When a hyperscaler reports tens of billions in spend, it resolves to this: buildings full of GPUs, the networking to bind them, and the substations to power them. The abstract "$500B build-out" is, concretely, a fleet of these clusters — and the reason the numbers are so large is that a frontier machine is a small industrial facility.

Two constraints now bind harder than chips. First, the scaling tax: because efficiency falls with size, the next doubling of a cluster costs 2× but delivers less than 2× the useful compute — diminishing returns the spending has to outrun. Second, power: at tens of megawatts per cluster and gigawatts in aggregate, AI has become an energy story, gated by grid capacity, turbines, and permits as much as by silicon.
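
The diminishing return on a doubling can be sketched with the same illustrative efficiency curve used in this chapter's cluster.py: 62% on a single GPU, minus two points per doubling. That curve is a stand-in assumption, not a measured one, and `doubling_gain` is a hypothetical helper name.

```python
import math


def mfu(num_gpus: int) -> float:
    """Illustrative efficiency curve: 62% at one GPU, minus 2 points per doubling."""
    return 0.62 - 0.02 * math.log2(num_gpus)


def doubling_gain(num_gpus: int) -> float:
    """Useful compute at 2G divided by useful compute at G (spend doubles)."""
    return 2 * mfu(2 * num_gpus) / mfu(num_gpus)


for g in [8, 512, 16_384]:
    print(f"{g:6d} -> {2 * g:6d} GPUs: spend x2.00, "
          f"useful compute x{doubling_gain(g):.2f}")
```

On this curve the last doubling, from 16,384 to 32,768 GPUs, returns about 1.88× the useful compute for 2× the spend, and the shortfall widens with each further doubling.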

For the Circuit, the cluster is the whole cost side made physical — GPUs, network, and power compounding into the denominator that revenue must eventually clear. It's the concrete thing the divergence between spending and payoff is measuring. Everything in this section — quantization, batching, beating the memory wall — exists to wring more value out of these very expensive, very hungry machines.

Part III complete

You now know what it takes to run a model

From a single forward pass, through the tokens-per-second speed limit, the GPU, the memory wall, the KV cache and batching, all the way to a warehouse of thirty thousand chips drawing the power of a town — you've followed inference from one operation to the physical build-out it demands. Part I built the machine; Part II made it a model; Part III is the cost of running it.

Part IV turns from mechanics to use: prompting, retrieval (RAG), agents and tools, evals, and vector search. In short, how you actually build something with all this.

07 Going deeper

The primary sources

Shoeybi et al. (2019), Megatron-LM · tensor (intra-layer) model parallelism.
Narayanan et al. (2021), Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM · pipeline parallelism, utilization at scale, and the scaling tax.
The Llama 3 Herd of Models (2024) · a real 16K-GPU training cluster, described.
SemiAnalysis · data-center power, networking, and cluster economics.

Cite this chapter: Divergent Compute, "The data-center cluster", First Principles, 2026. divergentcompute.com/first-principles-cluster · v1.0 · CC-BY.
