First Principles / Part II · Models / Chapter 11
GPT, Claude, Gemini, Llama, Mistral, DeepSeek — almost all of them are the same transformer recipe you just learned. What actually separates them lives on a handful of axes: scale, data, architecture, alignment, openness, and context.
01 The answer, then the intuition
It's tempting to imagine each lab guards some secret architecture. Mostly they don't. Strip the branding away and you find the same decoder-only transformer, trained with the same next-token objective and aligned the same way. The famous models differ by degree on a small set of dials, not by kind.
That's genuinely clarifying: once you understand one model, you understand them all, and the marketing resolves into six measurable choices. Each axis controls one thing, and real, publicly documented models sit at different points along it.
[Interactive figure: the six axes and where publicly documented models sit on each. Only publicly disclosed facts are shown; frontier closed-model sizes are undisclosed and marked as such.]
02 Mechanics
Benchmarks try to collapse all of this into a single leaderboard number. Treat those with suspicion — they're easily gamed and rarely capture the axis you actually care about for your task.
04 The math
Training compute $C$, in FLOPs, is set by parameters $N$ and tokens $D$:
$$C \approx 6ND$$
The Chinchilla result says that for a fixed compute budget $C$, loss is minimized when parameters and data grow together, each scaling roughly as the square root of compute:
$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$
The rule of thumb that fell out: about 20 tokens per parameter for compute-optimal training. This is why model size alone tells you little. A 405B model trained on too few tokens is undertrained; an 8B model trained on 15 trillion tokens is pushed far past compute-optimal on purpose — because the lab is optimizing not for training cost but for cheap inference later. Different objective, different point on the curve.
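The rule of thumb can be turned into a quick calculator. With training compute approximated as $C = 6ND$, substituting $D = 20N$ gives $C = 120N^2$, so $N_{\text{opt}} = \sqrt{C/120}$. The sketch below is a rough illustration, not a reproduction of the paper's fit: it treats the 20:1 ratio as exact, and the helper names `chinchilla_optimal` and `tokens_per_param` are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0  # assumption: the rule-of-thumb ratio, treated as exact


def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend compute_flops at 20 tokens per parameter.

    Uses C = 6 * N * D with D = 20 * N, so C = 120 * N**2.
    """
    n_opt = math.sqrt(compute_flops / (6.0 * TOKENS_PER_PARAM))
    d_opt = TOKENS_PER_PARAM * n_opt
    return n_opt, d_opt


def tokens_per_param(params: float, tokens: float) -> float:
    """Ratio of training tokens to parameters; about 20 is compute-optimal."""
    return tokens / params


if __name__ == "__main__":
    # Sizes and token counts as quoted in this chapter.
    for name, n, d in [
        ("GPT-3 (2020)", 175e9, 300e9),
        ("Llama-3 8B", 8e9, 15e12),
        ("Llama-3.1 405B", 405e9, 15e12),
    ]:
        ratio = tokens_per_param(n, d)
        n_opt, d_opt = chinchilla_optimal(6 * n * d)
        print(f"{name}: {ratio:.1f} tokens/param; "
              f"same compute at 20:1 -> {n_opt / 1e9:.0f}B params, "
              f"{d_opt / 1e12:.2f}T tokens")
```

On the chapter's figures, GPT-3 comes out near 1.7 tokens per parameter, Llama-3.1 405B near 37, and Llama-3 8B near 1,875. Spent at 20:1, the 8B model's training budget would instead have bought a model of roughly 77B parameters.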
05 The code
Compute $6ND$ for four models with publicly disclosed sizes and token counts. Note the surprise in the output.
compute.py
# training compute C = 6 * N(params) * D(tokens); all figures publicly disclosed
models = [
    ("GPT-3 (2020)", 175e9, 300e9),     # 175B params, 300B tokens
    ("Llama-3 8B", 8e9, 15e12),         # 8B params, 15T tokens
    ("Llama-3 70B", 70e9, 15e12),       # 70B params, 15T tokens
    ("Llama-3.1 405B", 405e9, 15e12),   # 405B params, 15T tokens
]
for name, N, D in models:
    C = 6 * N * D
    # .3g keeps 175B and 15T from being rounded to 2e+11 and 2e+13
    print(f"{name:16s} N={N:.3g} D={D:.3g} C={C:.2e} FLOPs")
# GPT-3 (2020)     N=1.75e+11 D=3e+11 C=3.15e+23 FLOPs
# Llama-3 8B       N=8e+09 D=1.5e+13 C=7.20e+23 FLOPs   <- more compute than GPT-3 at ~1/22 the size
# Llama-3 70B      N=7e+10 D=1.5e+13 C=6.30e+24 FLOPs
# Llama-3.1 405B   N=4.05e+11 D=1.5e+13 C=3.65e+25 FLOPs
Llama-3 8B is about 22× smaller than GPT-3 yet consumed roughly 2.3× as much training compute, because it was fed 50× the data. Parameter count is a label on the box, not a measure of what went into it.
06 The economics
Differences → money
These axes aren't just technical — they carve the market. Open-weight models commoditize the bottom: anyone can run Llama or Mistral on their own hardware, so the price of "good enough" intelligence collapses toward the cost of electricity. Closed frontier models defend a premium at the top, charging per token for the best available quality. The whole industry is a tug-of-war between those two forces.
The compute-optimal math explains how the open tier got so cheap: labs now deliberately overtrain small models, spending more up front so that inference, the cost that recurs with every request, stays cheap. A small, heavily trained model that runs on one GPU can undercut a frontier API on price for most everyday tasks. That pressure on margins, even as capability rises, is exactly the tension the Circuit tracks.
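The trade between training cost and inference cost can be put in rough numbers. The sketch below rests on an assumption not stated in this chapter: that serving one token with a dense model costs about $2N$ FLOPs. The helper names are hypothetical, the 70B comparison model trained at 20 tokens per parameter is invented for illustration, and the served-token volumes are not measured figures.

```python
def training_flops(params: float, tokens: float) -> float:
    """Training compute, C = 6 * N * D, as used in the chapter."""
    return 6.0 * params * tokens


def inference_flops(params: float, served_tokens: float) -> float:
    """Assumed inference cost: about 2 * N FLOPs per served token (dense model)."""
    return 2.0 * params * served_tokens


def lifetime_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus inference compute over the model's deployed life."""
    return training_flops(params, train_tokens) + inference_flops(params, served_tokens)


def break_even_served_tokens(n_small: float, d_small: float,
                             n_large: float, d_large: float) -> float:
    """Served-token volume T at which both models have equal lifetime compute.

    Solves 6*Ns*Ds + 2*Ns*T = 6*Nl*Dl + 2*Nl*T for T. A negative result
    means the small model is cheaper at every volume.
    """
    return 3.0 * (n_small * d_small - n_large * d_large) / (n_large - n_small)


if __name__ == "__main__":
    small = (8e9, 15e12)        # overtrained small model, sizes as quoted in the chapter
    large = (70e9, 20 * 70e9)   # hypothetical 70B model trained at 20 tokens/param
    t_star = break_even_served_tokens(*small, *large)
    print(f"break-even after {t_star:.2e} served tokens")
    for served in (1e11, 1e12, 1e13):
        s = lifetime_flops(*small, served)
        b = lifetime_flops(*large, served)
        print(f"served={served:.0e}  small={s:.2e}  large={b:.2e}")
```

Under these assumptions the overtrained 8B model costs more to train than the hypothetical 70B model, but overtakes it on total compute after roughly a trillion served tokens. The comparison counts FLOPs only and says nothing about whether the two models are equally capable.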
So "which model is best" is the wrong question economically. The right one is which point on these six dials fits your task and your budget — and whether the frontier premium survives as the open tier keeps catching up.
07 Going deeper
Llama 3 Herd of Models (2024) · disclosed sizes, token counts, and design choices.
Hoffmann et al. (2022) — Chinchilla · the compute-optimal ~20-tokens-per-parameter rule.
Mixtral of Experts (2024) · mixture-of-experts: more parameters, same per-token cost.
DeepSeek-V3 (2024) · an open MoE frontier model, with disclosed training cost.
Cite this chapter: Divergent Compute, "Differences between LLMs", First Principles, 2026. divergentcompute.com/first-principles-model-differences · v1.0 · CC-BY.