Scaling laws

The single most consequential fact in modern AI: model loss falls as a smooth power law as you add compute, data, and parameters. On a log-log plot it's a straight line — predictable enough to bet hundreds of billions on.

A straight line you can bank on

You'd expect intelligence to be unpredictable. The surprise of the last five years is that, at least for next-token loss, it isn't. Plot a model's loss against the compute used to train it, on log-log axes, and the points fall on a straight line across many orders of magnitude. Double, then re-double the compute, and the loss keeps dropping by the same predictable fraction.

That line is a scaling law, and its straightness is why the industry looks the way it does: a lab can forecast, before spending a dollar, roughly how good the next model will be. Move along the compute axis and loss slides down the curve by the same ratio for every factor of ten.

[Figure: The scaling law, loss vs training compute. Illustrative power law (Kaplan/Chinchilla form). Loss is on a relative scale; the straightness is the point. Horizontal axis: training compute (FLOPs), log scale from 10¹⁸ to 10²⁶. Vertical axis: predicted loss (relative).]

Three knobs, one line

Three resources. Loss falls as a power law in each of parameters $N$, training tokens $D$, and compute $C \approx 6ND$. More of any one helps — but only if the others keep up.
Compute-optimal (Chinchilla). For a fixed compute budget, there's a best split between model size and data — grow both together, roughly 20 tokens per parameter. Early models like GPT-3 were too big for their data; Chinchilla showed a smaller, better-fed model wins. This is the rule that reshaped how labs train.
Predictability. Because the curve is smooth, labs run small "scaling experiments," fit the line, and extrapolate to forecast a giant model's loss before committing the budget. Training frontier models is an engineering plan, not a leap of faith.
What the loss hides. Smoothly falling loss can still produce emergent jumps in specific abilities — a capability that's absent, then suddenly present, as scale crosses a threshold. The aggregate is predictable; the surprises live in the details.
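The compute-optimal rule above can be turned into a back-of-envelope calculator. This is a minimal sketch that assumes only the two approximations already quoted, $C \approx 6ND$ and roughly 20 tokens per parameter; the function name `compute_optimal_split` is hypothetical, and real training recipes deviate from both numbers.

```python
import math

TOKENS_PER_PARAM = 20.0        # rule of thumb quoted above (approximate)
FLOPS_PER_PARAM_TOKEN = 6.0    # from the approximation C ≈ 6·N·D

def compute_optimal_split(compute_flops):
    """Return (N, D) such that D = 20·N and 6·N·D = compute_flops."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

for c in [1e21, 1e23, 1e25]:
    n, d = compute_optimal_split(c)
    print(f"C={c:.0e} FLOPs -> N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

Because both $N$ and $D$ scale as the square root of $C$ under these assumptions, every 100× in compute buys a model about 10× larger trained on about 10× more tokens.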

The catch is baked into the shape: a power law with a small exponent means diminishing returns. Each equal proportional cut in loss costs another full factor of ten in compute. The line is friendly to forecasting and brutal to budgets.

The power law, and its price

Empirically, loss follows a power law in compute (and similarly in $N$ and $D$):

$$ L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C} \quad\Longleftrightarrow\quad \log L = \alpha_C\big(\log C_c - \log C\big) $$

Taking the log makes it a straight line with slope $-\alpha_C$, which is why log-log plots are the field's native language. The full Chinchilla form separates the contributions of parameters and data and adds an irreducible floor $E$:

$$ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$
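To see why more of one resource only helps if the others keep up, here is a small sketch that holds $N$ fixed and grows $D$ in the formula above. The constants are placeholder values chosen for illustration, not fitted results from any paper, and `chinchilla_loss` is a hypothetical helper name.

```python
# Illustrative placeholder constants, NOT fitted values from any paper.
E, A, B = 1.7, 400.0, 400.0
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

n_fixed = 1e9                          # hold the model at 1B parameters
plateau = E + A / n_fixed**ALPHA       # limit of L as D -> infinity with N fixed

for d in [1e9, 1e10, 1e11, 1e12, 1e13]:
    loss = chinchilla_loss(n_fixed, d)
    print(f"D={d:.0e} -> L={loss:.3f} (gap to plateau {loss - plateau:.3f})")

print(f"plateau for N={n_fixed:.0e}: {plateau:.3f}; irreducible floor E={E}")
```

With $N$ frozen, extra data shrinks only the $B/D^{\beta}$ term, so the loss stalls at $E + A/N^{\alpha}$, above the floor $E$. Getting closer to $E$ requires growing the model as well.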

The small exponent is the whole economic story. With $\alpha_C \approx 0.05$, every $10\times$ in compute multiplies loss by $10^{-0.05} \approx 0.89$ — a fixed ~11% cut per decade. Constant progress therefore demands exponentially growing spend: to keep the loss falling in a straight line, the compute (and the bill) must rise geometrically. Predictable, and relentless.
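Inverting the single-term law makes the price of a given loss reduction explicit. This is a short worked derivation that assumes only the form $L(C) \approx (C_c/C)^{\alpha_C}$ and the illustrative exponent $\alpha_C \approx 0.05$ used here:

$$ C = C_c\,L^{-1/\alpha_C} \quad\Longrightarrow\quad \frac{C_2}{C_1} = \left(\frac{L_1}{L_2}\right)^{1/\alpha_C} $$

With $1/\alpha_C = 20$, a 10% cut in loss ($L_2 = 0.9\,L_1$) costs $(1/0.9)^{20} \approx 8.2\times$ the compute, and halving the loss costs $2^{20} \approx 10^{6}\times$.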

The same cut, every decade

```python
alpha = 0.050   # compute exponent (illustrative, Kaplan-ish)
Cc = 1e21       # reference compute at which the relative loss equals 1

def loss(C):
    """Relative loss predicted by the single-term power law."""
    return (Cc / C) ** alpha

prev = None
for C in [1e21, 1e22, 1e23, 1e24, 1e25]:
    L = loss(C)
    # Ratio to the previous step, i.e. the effect of one more 10x of compute
    ratio = "" if prev is None else f" (x{L/prev:.3f} per 10x)"
    print(f"C={C:.0e} -> loss {L:.4f}{ratio}")
    prev = L

# C=1e+21 -> loss 1.0000
# C=1e+22 -> loss 0.8913 (x0.891 per 10x)
# C=1e+23 -> loss 0.7943 (x0.891 per 10x)
# C=1e+24 -> loss 0.7079 (x0.891 per 10x)
# C=1e+25 -> loss 0.6310 (x0.891 per 10x)   <- 10x the spend, same 11% gain
```
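The forecasting workflow runs in the other direction: measure loss on a few small runs, fit a straight line in log-log space, and extrapolate. The sketch below assumes the single-term form $L(C) \approx (C_c/C)^{\alpha_C}$ and uses synthetic, noise-free "measurements" generated from a known law so the fit can be checked. The data, `fit_power_law`, and `predict_loss` are all hypothetical; real runs are noisy and are usually fit with more careful methods.

```python
import math

# Hypothetical small-run measurements: (compute in FLOPs, relative loss).
# Generated from a known power law so the recovered exponent can be verified.
TRUE_ALPHA, TRUE_CC = 0.05, 1e21
runs = [(c, (TRUE_CC / c) ** TRUE_ALPHA) for c in [1e17, 1e18, 1e19, 1e20]]

def fit_power_law(points):
    """Ordinary least squares on (log10 C, log10 L); returns (alpha, log10 Cc)."""
    xs = [math.log10(c) for c, _ in points]
    ys = [math.log10(l) for _, l in points]
    n = len(points)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    alpha = -slope                    # log L = alpha*log Cc - alpha*log C
    log10_cc = intercept / alpha
    return alpha, log10_cc

alpha_hat, log10_cc_hat = fit_power_law(runs)

def predict_loss(compute_flops):
    """Extrapolate the fitted line to a compute budget outside the data."""
    return 10 ** (alpha_hat * (log10_cc_hat - math.log10(compute_flops)))

print(f"fitted alpha = {alpha_hat:.4f}")
print(f"extrapolated loss at 1e25 FLOPs = {predict_loss(1e25):.4f}")
```

The runs span $10^{17}$ to $10^{20}$ FLOPs, yet the fitted line is read off at $10^{25}$, five orders of magnitude beyond the data. That extrapolation is only as trustworthy as the assumption that the line stays straight.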

The physics that justifies the bet

The law → money

Scaling laws are the reason the build-out is rational rather than reckless. Because more compute reliably buys a better model, spending on clusters is a forecast, not a gamble — a lab can project the return on the next $10 billion and act on it. Remove that predictability and the whole capital cycle collapses; it's the closest thing AI has to a law of physics for investors.

But the same line contains the warning. The exponent is small, so returns diminish: each equal gain costs another factor of ten. Progress on the straight line requires spending that grows geometrically — which is exactly why capex is exploding, and exactly why the question of whether the payoff keeps pace is not academic. And a second limit looms: the data wall, since the world contains only so many quality tokens to train on.

This is the beating heart of the Circuit. The scaling law is the divergence in one equation: capability climbs smoothly while cost climbs exponentially, and the entire thesis rides on whether revenue can track the second curve as it chases the first. Everything the earlier chapters described — the chips, the memory, the tokens — is ultimately in service of buying another step down this line.
