First Principles / Part V · The frontier & the industry / Chapter 26
The single most consequential fact in modern AI: model loss falls as a smooth power law as you add compute, data, and parameters. On a log-log plot it's a straight line — predictable enough to bet hundreds of billions on.
01 The answer, then the intuition
You'd expect intelligence to be unpredictable. The surprise of the last five years is that, at least for next-token loss, it isn't. Plot a model's loss against the compute used to train it, on log-log axes, and the points fall on a straight line across many orders of magnitude. Double, then re-double the compute, and the loss keeps dropping by the same predictable fraction.
That line is a scaling law, and its straightness is why the industry looks the way it does: a lab can forecast, before spending a dollar, roughly how good the next model will be. Slide the compute budget up and loss moves down the curve by the same ratio for every factor of ten.
Figure: illustrative power law (Kaplan/Chinchilla form). Loss is on a relative scale; the straightness is the point.
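The straight-line claim can be checked numerically. The sketch below generates synthetic points from an assumed power law (the exponent 0.05 and the scale constant 1e21 are illustrative values, not measurements) and fits a least-squares line to the logarithms. The helper name `fit_loglog_slope` is made up for this example.

```python
import math

def fit_loglog_slope(xs, ys):
    """Least-squares slope and intercept of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data from an assumed power law: loss = (scale / compute) ** exponent.
ASSUMED_EXPONENT = 0.05   # illustrative, not a measured value
ASSUMED_SCALE = 1e21      # illustrative scale constant

compute = [10.0 ** k for k in range(21, 26)]
losses = [(ASSUMED_SCALE / c) ** ASSUMED_EXPONENT for c in compute]

slope, intercept = fit_loglog_slope(compute, losses)
print(f"fitted log-log slope: {slope:.4f}")
```

Up to floating-point error, the fitted slope comes back as the negative of the assumed exponent. Real training runs are noisier, so a fit on measured losses would carry uncertainty that this toy data does not.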
02 Mechanics
The catch is baked into the shape: a power law with a small exponent means diminishing returns. Each further cut in loss by the same fraction costs another full factor of ten in compute. The line is friendly to forecasting and brutal to budgets.
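To put numbers on the diminishing returns, the power law can be inverted: if loss scales as compute raised to a small negative exponent, the compute multiplier needed for a given fractional cut follows directly. This is a minimal sketch assuming the illustrative exponent of 0.05 used in this chapter; `compute_multiplier` is a hypothetical helper, not part of any library.

```python
def compute_multiplier(loss_ratio, exponent):
    """Factor by which compute must grow so that loss is multiplied by loss_ratio.

    Assumes loss is proportional to compute ** (-exponent), so
    L2 / L1 = (C2 / C1) ** (-exponent), hence C2 / C1 = loss_ratio ** (-1 / exponent).
    """
    if not 0.0 < loss_ratio <= 1.0:
        raise ValueError("loss_ratio must be in (0, 1]")
    if exponent <= 0.0:
        raise ValueError("exponent must be positive")
    return loss_ratio ** (-1.0 / exponent)

ASSUMED_EXPONENT = 0.05  # illustrative value, not a measurement

for cut in (0.05, 0.10, 0.25, 0.50):
    factor = compute_multiplier(1.0 - cut, ASSUMED_EXPONENT)
    print(f"{cut:.0%} lower loss -> about {factor:.3g}x the compute")
```

Under this assumption, halving the loss takes 2 to the power 20, roughly a million times the compute. The number is extremely sensitive to the exponent, so it should be read as an illustration of the shape rather than a forecast.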
04 The math
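Empirically, loss follows a power law in compute (and similarly in parameter count $N$ and training tokens $D$):
$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
Here $C_c$ is a fitted constant and $\alpha_C$ is the compute exponent. Taking the log gives $\log L = \alpha_C \log C_c - \alpha_C \log C$, a straight line with slope $-\alpha_C$, which is why log-log plots are the field's native language. The full Chinchilla form separates the contributions of $N$ and $D$ and adds an irreducible floor $E$:
$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
where $A$, $B$, $\alpha$, and $\beta$ are fitted constants.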
The small exponent is the whole economic story. With $\alpha_C \approx 0.05$, every $10\times$ in compute multiplies loss by $10^{-0.05} \approx 0.89$ — a fixed ~11% cut per decade. Constant progress therefore demands exponentially growing spend: to keep the loss falling in a straight line, the compute (and the bill) must rise geometrically. Predictable, and relentless.
05 The code
Loss at five compute budgets spanning four orders of magnitude. Note the identical ratio at every step.
scaling.py
```python
alpha = 0.050  # compute exponent (illustrative, Kaplan-ish)
Cc = 1e21      # reference compute at which the relative loss equals 1


def loss(C):
    return (Cc / C) ** alpha


prev = None
for C in [1e21, 1e22, 1e23, 1e24, 1e25]:
    L = loss(C)
    ratio = "" if prev is None else f" (x{L/prev:.3f} per 10x)"
    print(f"C={C:.0e} -> loss {L:.4f}{ratio}")
    prev = L

# C=1e+21 -> loss 1.0000
# C=1e+22 -> loss 0.8913 (x0.891 per 10x)
# C=1e+23 -> loss 0.7943 (x0.891 per 10x)
# C=1e+24 -> loss 0.7079 (x0.891 per 10x)
# C=1e+25 -> loss 0.6310 (x0.891 per 10x) <- 10x the spend, same 11% gain
```
06 The economics
The law → money
Scaling laws are the reason the build-out is rational rather than reckless. Because more compute reliably buys a better model, spending on clusters is a forecast, not a gamble — a lab can project the return on the next $10 billion and act on it. Remove that predictability and the whole capital cycle collapses; it's the closest thing AI has to a law of physics for investors.
But the same line contains the warning. The exponent is small, so returns diminish: each further cut of the same fraction costs another factor of ten. Progress on the straight line requires spending that grows geometrically, which is exactly why capex is exploding, and exactly why the question of whether the payoff keeps pace is not academic. And a second limit looms: the data wall, since the world contains only so many quality tokens to train on.
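The data wall can be made concrete with a back-of-envelope estimate of how many training tokens a compute-optimal run would want. This sketch rests on two assumptions that do not come from this chapter's own derivation: the common approximation that training compute is about 6 FLOPs per parameter per token, and the roughly 20 tokens per parameter rule of thumb associated with Hoffmann et al. (2022). The function name `compute_optimal_split` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C is roughly 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: Chinchilla-style rule of thumb

def compute_optimal_split(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the two assumptions above.

    With C = 6 * N * D and D = 20 * N, we get N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

for budget in (1e23, 1e24, 1e25, 1e26):
    n_params, n_tokens = compute_optimal_split(budget)
    print(f"C={budget:.0e} FLOPs -> N ~ {n_params:.2e} params, D ~ {n_tokens:.2e} tokens")
```

Under these assumptions token demand grows with the square root of compute, so every 100x in budget asks for about 10x more training data. Whether that much quality text exists is the open question the data wall points at.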
This is the beating heart of the Circuit. The scaling law is the divergence in one equation: capability climbs smoothly while cost climbs exponentially, and the entire thesis rides on whether revenue can track the second curve as it chases the first. Everything the earlier chapters described — the chips, the memory, the tokens — is ultimately in service of buying another step down this line.
07 Going deeper
Kaplan et al. (2020) — Scaling Laws for Neural Language Models · the original power laws.
Hoffmann et al. (2022) — Chinchilla · compute-optimal training, 20 tokens/parameter.
Wei et al. (2022) — Emergent Abilities of LLMs · when smooth loss hides sudden jumps.
Epoch AI — Trends in Machine Learning · measured compute, data, and cost trajectories.
Cite this chapter: Divergent Compute, "Scaling laws", First Principles, 2026. divergentcompute.com/first-principles-scaling-laws · v1.0 · CC-BY.