First Principles / Part II · Models / Chapter 09
Pretraining is the long, brutally expensive first phase where a model reads trillions of tokens of text and, by predicting the next one over and over, teaches itself language, facts, and reasoning — with no labels and no human in the loop.
01 The answer, then the intuition
The model from the last chapter isn't born knowing anything — its parameters start as random noise. Pretraining is how those billions of numbers become a model of language. The recipe is almost absurdly simple: feed it a colossal pile of text, and at every position ask it to predict the next token. When it's wrong, nudge the weights to be a little less wrong. Do that for trillions of tokens.
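The recipe above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not how a real run is built: the "model" is a bigram table `W` (one row of logits per previous character) rather than a transformer, the corpus is a single made-up sentence, and `learning_rate` and the step count are hypothetical values chosen for this toy.

```python
import numpy as np

# Toy corpus and character-level "tokenizer" (both are illustrative assumptions).
text = "the cat sat on the mat. the cat ate. "
vocab = sorted(set(text))
token_id = {ch: i for i, ch in enumerate(vocab)}
tokens = np.array([token_id[ch] for ch in text])

inputs, targets = tokens[:-1], tokens[1:]   # predict token i+1 from token i
V = len(vocab)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, V))     # parameters start as small random noise

def mean_loss_and_grad(W):
    logits = W[inputs]                                   # (N, V): one row of scores per position
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(targets)
    loss = -np.log(probs[np.arange(n), targets]).mean()
    # Gradient of the mean cross-entropy w.r.t. the logits is (probs - one_hot) / n.
    dlogits = probs.copy()
    dlogits[np.arange(n), targets] -= 1.0
    dlogits /= n
    grad = np.zeros_like(W)
    np.add.at(grad, inputs, dlogits)                     # accumulate rows that share an input token
    return loss, grad

learning_rate = 1.0  # hypothetical value chosen for this toy problem
initial_loss, _ = mean_loss_and_grad(W)
for step in range(200):
    loss, grad = mean_loss_and_grad(W)
    W -= learning_rate * grad               # nudge the weights to be a little less wrong
final_loss, _ = mean_loss_and_grad(W)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The loop has the same shape as the real thing: predict, measure the surprise, step the weights. Only the scale differs.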
Because the right answer is always just "the next word in the text," no human has to label anything: the data supervises itself. To predict text well, the model is quietly forced to learn grammar, facts, code, arithmetic, and reasoning, because all of those help it guess what comes next.
[Figure: illustrative training run. The loss falls and the model's writing sharpens as it sees more tokens.]
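To make "the data supervises itself" concrete, here is a small sketch of how raw text becomes labeled examples with no annotator involved. The whitespace split is an assumption standing in for a real tokenizer, and the sentence is made up.

```python
text = "the cat sat on the mat"
tokens = text.split()            # whitespace split stands in for a real tokenizer

# Every position yields a training example:
# the context so far is the input, and the next token is the label.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(" ".join(context), "->", target)
```

Six tokens give five examples, and every label was already sitting in the text.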
02 Mechanics
The output is a base model: fluent, knowledgeable, and completely unhelpful. It will happily continue your text, but it hasn't learned to follow instructions or be safe. Turning it into a usable assistant is the next chapter.
04 The math
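For one position, with the model's predicted distribution $P$ and the true next token $t$, the loss is the negative log-probability it assigned to the truth:
$$\ell = -\log P(t)$$
Pretraining minimizes the average of this over the whole corpus of $D$ tokens, where $t_i$ is the token at position $i$ and $t_{<i}$ are the tokens before it. This is the cross-entropy loss:
$$\mathcal{L}(\theta) = -\frac{1}{D}\sum_{i=1}^{D}\log P_\theta(t_i \mid t_{<i})$$
It is minimized by gradient descent on the parameters $\theta$ with learning rate $\eta$: $\;\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}$. Empirically the loss follows a scaling law, falling as a power of the compute $C$:
$$\mathcal{L}(C) \approx a\,C^{-\alpha}, \qquad \alpha > 0$$
with constants $a$ and $\alpha$ fitted to measured runs.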
This smooth, predictable curve is what makes "spend more, get a better model" a plannable investment rather than a gamble — the single fact underneath the entire build-out.
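One way to see why a power law is "plannable": it is a straight line in log-log space, so a few small runs pin down the line and larger budgets can be read off it. The sketch below assumes the form $\mathcal{L}(C) = a\,C^{-\alpha}$ and uses hypothetical constants and synthetic, noise-free points. Nothing here is a measured result.

```python
import numpy as np

# Hypothetical constants, chosen only to draw an exact power law L(C) = a * C**(-alpha).
a, alpha = 10.0, 0.05
compute = np.logspace(18, 24, num=7)        # compute budgets C (arbitrary units)
loss = a * compute ** (-alpha)

# A power law is a straight line in log-log space: log L = log a - alpha * log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
fitted_alpha, fitted_a = -slope, np.exp(intercept)

# Extrapolate to a budget 10x larger than any point used in the fit.
predicted = fitted_a * (1e25) ** (-fitted_alpha)
print(f"alpha ~ {fitted_alpha:.3f}, predicted loss at C=1e25: {predicted:.3f}")
```

Real measurements are noisy, so an actual fit would carry error bars. The point is only that the functional form lets small experiments forecast large ones.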
05 The code
The cross-entropy of a single prediction: the quantity that, summed over trillions of tokens, is pretraining.
loss.py
import numpy as np

def softmax(z):
    # subtract the max logit before exponentiating, for numerical stability
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, true_token):
    p = softmax(logits)
    return -np.log(p[true_token])  # surprise at the real next token

# model's logits over a tiny vocab; the true next token is index 0
logits = [6.0, 1.5, 1.8, 1.2]
print(round(cross_entropy(logits, 0), 3))  # 0.034: confident and correct, tiny loss
print(round(cross_entropy(logits, 2), 3))  # 4.234: it bet wrong, large loss
# training = average this over D tokens and take a gradient step, ~D/batch times
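The last comment in the listing, "average this over D tokens and take a gradient step", can be sketched as follows. This is a simplification under stated assumptions: the batch of logits and the `step_size` are hypothetical, and the step is taken directly on the logits, whereas a real model backpropagates the same gradient through to its parameters.

```python
import numpy as np

def softmax_rows(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(logits, targets):
    p = softmax_rows(logits)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

def grad_wrt_logits(logits, targets):
    # d(mean loss)/d(logits) = (softmax - one_hot) / number_of_positions
    g = softmax_rows(logits)
    g[np.arange(len(targets)), targets] -= 1.0
    return g / len(targets)

# Hypothetical batch: three positions over the same four-token vocabulary.
logits = np.array([[6.0, 1.5, 1.8, 1.2],
                   [0.2, 0.1, 3.0, 0.4],
                   [1.0, 1.0, 1.0, 1.0]])
targets = np.array([0, 2, 3])

before = mean_cross_entropy(logits, targets)
step_size = 0.5  # hypothetical value
after = mean_cross_entropy(logits - step_size * grad_wrt_logits(logits, targets), targets)
print(f"mean loss before step: {before:.4f}, after: {after:.4f}")
```

The gradient has a simple reading: for each position, lower the score of every token in proportion to the probability it was given, and raise the score of the token that actually came next.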
06 The economics
The run → money
Pretraining is the most concentrated cost in all of AI — a single training run for a frontier model costs tens to hundreds of millions of dollars in compute, burned over weeks or months on a cluster of tens of thousands of GPUs that exists largely for this purpose. It is the sharpest spike of the build-out's capital expenditure, and it happens before the model has earned a cent.
The economics only work because the result is an asset: one expensive run produces a base model that is then served to hundreds of millions of users, amortizing the cost across billions of cheap-by-comparison inference calls. The scaling law is what makes the bet rational — spend predictably more on the run, get a predictably better asset.
So pretraining is the training half of the two clocks: the enormous, upfront, capitalized cost that the Circuit asks whether inference revenue will ever repay. Every new frontier run raises the stakes of that question.
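The amortization argument is simple arithmetic, sketched below. Every figure is a hypothetical placeholder chosen for round numbers, not a reported cost or volume. Only the shape of the calculation comes from the text.

```python
# All figures below are hypothetical placeholders, not reported costs.
training_cost_usd = 100e6          # one pretraining run
inference_calls = 50e9             # calls served over the model's lifetime
serving_cost_per_call_usd = 0.002  # marginal compute cost of one call

amortized_training_per_call = training_cost_usd / inference_calls
total_cost_per_call = amortized_training_per_call + serving_cost_per_call_usd

print(f"training share per call: ${amortized_training_per_call:.4f}")
print(f"all-in cost per call:    ${total_cost_per_call:.4f}")
```

The upfront cost is fixed, so its share per call shrinks as volume grows. Whether revenue per call clears the all-in figure is the repayment question posed above.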
07 Going deeper
Brown et al. (2020) — GPT-3 · pretraining at scale and few-shot learning.
Hoffmann et al. (2022) — Chinchilla · how to spend a compute budget between parameters and data.
Wei et al. (2022) — Emergent Abilities of LLMs · capabilities that appear with scale.
Gao et al. (2020) — The Pile · what a pretraining corpus actually looks like.
Cite this chapter: Divergent Compute, "How models are pretrained", First Principles, 2026. divergentcompute.com/first-principles-pretraining · v1.0 · CC-BY.