Divergent Compute.AI Economic Think Tank

First Principles / Part II · Models / Chapter 09

How models are pretrained

Pretraining is the long, brutally expensive first phase where a model reads trillions of tokens of text and, by predicting the next one over and over, teaches itself language, facts, and reasoning — with no labels and no human in the loop.

01 The answer, then the intuition

Learning the world by guessing the next word

The model from the last chapter isn't born knowing anything — its parameters start as random noise. Pretraining is how those billions of numbers become a model of language. The recipe is almost absurdly simple: feed it a colossal pile of text, and at every position ask it to predict the next token. When it's wrong, nudge the weights to be a little less wrong. Do that for trillions of tokens.

Because the right answer is always just "the next word in the text," no human has to label anything: the data supervises itself. To predict text well, the model is quietly forced to learn grammar, facts, code, arithmetic, and reasoning, because all of those help it guess what comes next. The illustrative training run below shows capability emerging as the loss falls:

Figure (interactive, illustrative): a training run. As the model sees more training tokens, the loss falls (lower is better) and its sample output sharpens.
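To make "the data supervises itself" concrete, here is a minimal sketch of how raw text becomes training examples with no labelling step. The whitespace split is a stand-in assumption (real models use subword tokenizers), and `next_token_pairs` is a hypothetical helper name.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Toy whitespace "tokenizer" (an assumption; real models use subword tokenizers).
text = "the cat sat on the mat"
tokens = text.split()

pairs = next_token_pairs(tokens)
for context, target in pairs:
    # The target is simply the next token in the text, so no human label is needed.
    print(" ".join(context), "->", target)
```

A six-token sentence yields five training examples; a corpus of trillions of tokens yields trillions of them in the same way.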

02 Mechanics

The data, the loss, and the months of compute

  • The data. A frontier model trains on something like 10–15 trillion tokens — a filtered, deduplicated slice of the web, books, and code. Data quality and mix matter as much as quantity; much of the craft of pretraining is in the cleaning.
  • The objective. At each position the model predicts a distribution over the next token, and the loss is the cross-entropy — how surprised it was by the real next token. Averaged over the whole corpus, that single number is what training drives down.
  • The compute. In a dense model every parameter takes part in processing every token, so training costs about $6ND$ floating-point operations, where $N$ is the number of parameters and $D$ the number of training tokens. For a frontier model that is on the order of $10^{25}$–$10^{26}$ operations: thousands to tens of thousands of GPUs running for weeks to months.
  • Emergence. As scale grows the loss falls along a smooth power law, but specific abilities — arithmetic, translation, in-context learning — can appear rather suddenly once the model is big enough. You can't fully predict what a bigger model will be able to do, only that it will do better.
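The $6ND$ rule from the compute bullet can be checked with back-of-the-envelope arithmetic. The sketch below takes $N$ as the parameter count and $D$ as the number of training tokens; the specific model size, cluster size, and sustained per-GPU throughput are hypothetical inputs chosen only to land inside the range quoted above, not figures for any real run.

```python
import math

def training_flops(n_params, n_tokens):
    """Approximate training compute with the 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

def training_days(flops, n_gpus, flops_per_gpu_per_s):
    """Wall-clock days if every GPU sustains the given throughput."""
    seconds = flops / (n_gpus * flops_per_gpu_per_s)
    return seconds / 86_400

# Hypothetical inputs, chosen only to land in the range quoted above.
N = 400e9          # parameters
D = 15e12          # training tokens
GPUS = 16_000      # cluster size
SUSTAINED = 4e14   # FLOP/s actually achieved per GPU (assumed)

C = training_flops(N, D)
days = training_days(C, GPUS, SUSTAINED)
print(f"{C:.1e} FLOPs, about {days:.0f} days")   # 3.6e+25 FLOPs, about 65 days
```

Under these assumptions the run takes roughly two months; halving the cluster or the achieved throughput doubles that, which is why utilization matters as much as GPU count.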

The output is a base model: fluent, knowledgeable, and completely unhelpful. It will happily continue your text, but it hasn't learned to follow instructions or be safe. Turning it into a usable assistant is the next chapter.

04 The math

Cross-entropy, minimized over a corpus

For one position, with the model's predicted distribution $P$ and the true next token $t$, the loss is the negative log-probability it assigned to the truth:

$$ \ell = -\log P(t) $$

Pretraining minimizes the average of this over the whole corpus of $D$ tokens — the cross-entropy loss:

$$ \mathcal{L} = -\frac{1}{D}\sum_{k=1}^{D} \log P\!\left(t_k \mid t_{<k}\right) $$

by gradient descent on the parameters $\theta$: $\;\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}$. Empirically the loss follows a scaling law — it falls as a power of the compute $C$:

$$ \mathcal{L}(C) \approx \mathcal{L}_\infty + \left(\frac{C_0}{C}\right)^{\alpha} $$

This smooth, predictable curve is what makes "spend more, get a better model" a plannable investment rather than a gamble — the single fact underneath the entire build-out.
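To see what "plannable" means in practice, the sketch below evaluates the scaling-law form above at a few compute budgets. The constants `L_INF`, `C0`, and `ALPHA` are hypothetical values picked for illustration, not fitted to any real model.

```python
def scaling_loss(compute, l_inf, c0, alpha):
    """L(C) = L_inf + (C0 / C) ** alpha, the form given above."""
    return l_inf + (c0 / compute) ** alpha

# Hypothetical constants for illustration only, not fitted to any real model.
L_INF, C0, ALPHA = 1.7, 1e21, 0.05

budgets = [1e23, 1e24, 1e25, 1e26]
losses = [scaling_loss(c, L_INF, C0, ALPHA) for c in budgets]
for c, loss in zip(budgets, losses):
    print(f"C = {c:.0e}  loss = {loss:.3f}")

# Each 10x in compute multiplies the reducible part (loss - L_inf) by 10**-alpha.
ratios = [(losses[i + 1] - L_INF) / (losses[i] - L_INF) for i in range(3)]
print(ratios)
```

The ratios come out identical at every step: under this form, each tenfold increase in compute removes the same fraction of the remaining reducible loss, and the irreducible floor $\mathcal{L}_\infty$ is never crossed.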

05 The code

The loss the whole run minimizes

The cross-entropy of a single prediction. Averaged over trillions of tokens, this is the quantity that pretraining minimizes.

loss.py

import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())           # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(logits, true_token):
    p = softmax(logits)
    return -np.log(p[true_token])     # surprise at the real next token

# model's logits over a tiny vocab; the true next token is index 0
logits = [6.0, 1.5, 1.8, 1.2]
print(round(cross_entropy(logits, 0), 3))   # 0.034  — confident and correct: tiny loss
print(round(cross_entropy(logits, 2), 3))   # 4.234  — it bet wrong: large loss

# training = average this over D tokens and take a gradient step, ~D/batch times
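The closing comment above ("average this over D tokens and take a gradient step") can be sketched end to end on a toy problem. This is not a transformer: the model here is a bigram table `W`, a deliberately tiny stand-in where row `i` holds the logits for the token that follows token `i`. The corpus, learning rate, and step count are arbitrary assumptions.

```python
import numpy as np

text = "abababababababab"                  # toy corpus (hypothetical)
vocab = sorted(set(text))
ids = np.array([vocab.index(ch) for ch in text])
x, y = ids[:-1], ids[1:]                   # each token predicts the next one

V = len(vocab)
W = np.zeros((V, V))                       # W[i] = logits for the token after i

def mean_loss_and_grad(W, x, y):
    logits = W[x]                                          # shape (D, V)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)                      # softmax per position
    loss = -np.log(p[np.arange(len(y)), y]).mean()         # average cross-entropy
    d_logits = p.copy()
    d_logits[np.arange(len(y)), y] -= 1.0                  # softmax minus one-hot
    grad = np.zeros_like(W)
    np.add.at(grad, x, d_logits / len(y))                  # accumulate per row of W
    return loss, grad

lr = 1.0                                   # learning rate (eta), chosen arbitrarily
history = []
for step in range(50):
    loss, grad = mean_loss_and_grad(W, x, y)
    history.append(loss)
    W -= lr * grad                         # theta <- theta - eta * grad

# The first value is ln(2), about 0.693: a uniform guess over two tokens.
print(round(history[0], 3), round(history[-1], 3))
```

The loss starts at the uniform-guess level and falls as the table learns that "b" follows "a" and vice versa. A real run replaces the table with a transformer, the full-corpus average with mini-batches, and plain gradient descent with a more elaborate optimizer, but the loop has the same shape.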

06 The economics

The hundred-million-dollar single event

The run → money

Pretraining is the most concentrated cost in all of AI — a single training run for a frontier model costs tens to hundreds of millions of dollars in compute, burned over weeks or months on a cluster of tens of thousands of GPUs that exists largely for this purpose. It is the sharpest spike of the build-out's capital expenditure, and it happens before the model has earned a cent.

The economics only work because the result is an asset: one expensive run produces a base model that is then served to hundreds of millions of users, amortizing the cost across billions of cheap-by-comparison inference calls. The scaling law is what makes the bet rational — spend predictably more on the run, get a predictably better asset.
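The amortization argument is simple division, sketched below. The run cost and call volumes are hypothetical round numbers picked from within the ranges quoted above, and the calculation ignores the separate cost of serving each call.

```python
def cost_per_call(run_cost_usd, inference_calls):
    """Training cost attributed to each inference call if spread evenly."""
    return run_cost_usd / inference_calls

# Hypothetical figures, picked from within the ranges quoted above.
RUN_COST_USD = 100e6      # one pretraining run
for calls in (1e9, 10e9, 100e9):
    print(f"{calls:.0e} calls -> ${cost_per_call(RUN_COST_USD, calls):.4f} per call")
```

At ten billion calls, a hundred-million-dollar run adds about one cent of training cost to each call; every further tenfold increase in usage cuts that share by ten.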

So pretraining is the training half of the two clocks: the enormous, upfront, capitalized cost that the Circuit asks whether inference revenue will ever repay. Every new frontier run raises the stakes of that question.

07 Going deeper

The primary sources

Brown et al. (2020) — GPT-3 · pretraining at scale and few-shot learning.
Hoffmann et al. (2022) — Chinchilla · how to spend a compute budget between parameters and data.
Wei et al. (2022) — Emergent Abilities of LLMs · capabilities that appear with scale.
Gao et al. (2020) — The Pile · what a pretraining corpus actually looks like.

Cite this chapter: Divergent Compute, "How models are pretrained", First Principles, 2026. divergentcompute.com/first-principles-pretraining · v1.0 · CC-BY.
