Differences between LLMs

GPT, Claude, Gemini, Llama, Mistral, DeepSeek — almost all of them are the same transformer recipe you just learned. What actually separates them lives on a handful of axes: scale, data, architecture, alignment, openness, and context.

One recipe, a few dials

It's tempting to imagine each lab guards some secret architecture. Mostly they don't. Strip the branding away and you find the same decoder-only transformer, trained with the same next-token objective and post-trained with the same family of alignment techniques. The famous models differ by degree on a small set of dials, not by kind.

That's genuinely clarifying: once you understand one model, you understand the basic shape of them all, and the marketing resolves into six measurable choices. Each axis controls something specific, and real, publicly documented models sit at known points on it.

The six axes of difference

[Interactive chart omitted. It shows only publicly disclosed facts; frontier closed-model sizes are undisclosed and shown as such.]

What each dial actually changes

Scale. More parameters = more capacity, but only if matched with enough data. A bigger model isn't automatically better — an undertrained giant loses to a well-fed smaller one.
Data. The mix, quality, and quantity of training tokens. This is the least visible and arguably most decisive dial — and the hardest to copy, since each lab's data pipeline is proprietary.
Architecture tweaks. The core is shared, but labs vary attention (grouped-query, sliding-window), use mixture-of-experts to grow parameters without growing per-token cost, and extend context length. Differences of efficiency, not of kind.
Alignment. The post-training — SFT data, RLHF quality, system behaviour — is where a model's "personality," refusal style, and reliability come from. Two models with identical pretraining can feel completely different after this step.
Openness. Open-weight (Llama, Mistral, Qwen, DeepSeek — you can download and run them) vs closed (GPT, Claude, Gemini — API only). This is a business and safety choice, not a capability one.
Context window. How much the model can attend to at once — from 8K to a million-plus tokens. A real capability difference for long-document work, and a real cost difference.
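
The cost side of the architecture and context-window dials can be made concrete with a back-of-the-envelope estimate of key-value cache memory. This is a minimal sketch, not any lab's implementation: the function name `kv_cache_bytes` is hypothetical, the model shape (32 layers, 32 query heads, head dimension 128, 8 shared key-value heads) is an illustrative assumption, and 16-bit storage is assumed.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers one key tensor and one value tensor per layer.
    bytes_per_value=2 assumes 16-bit storage (fp16 or bf16).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3

# Illustrative shape (assumed, not taken from the text above).
LAYERS, QUERY_HEADS, HEAD_DIM = 32, 32, 128
SHARED_KV_HEADS = 8  # grouped-query attention: several query heads share one KV head

for seq_len in (8_192, 131_072, 1_048_576):
    full = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, seq_len)         # one KV head per query head
    grouped = kv_cache_bytes(LAYERS, SHARED_KV_HEADS, HEAD_DIM, seq_len)  # shared KV heads
    print(f"{seq_len:>9,} tokens  full={full / GIB:6.1f} GiB  grouped={grouped / GIB:6.1f} GiB")
```

Under these assumptions an 8,192-token sequence needs 4 GiB of cache with one key-value head per query head and 1 GiB with grouped-query attention. Both figures grow linearly with context length, which is why long context is a cost difference as well as a capability one.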

Benchmarks try to collapse all of this into a single leaderboard number. Treat those with suspicion — they're easily gamed and rarely capture the axis you actually care about for your task.

Why a smaller model can win

Training compute $C$, measured in FLOPs, is set approximately by the parameter count $N$ and the number of training tokens $D$:

$$ C \approx 6\,N\,D $$

The Chinchilla result says that for a fixed compute budget $C$, loss is minimized when parameters and data grow together. Both scale roughly as the square root of compute, so their ratio stays constant, and the fitted value of that ratio is about 20:

$$ N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \approx 20\,N_{\text{opt}} $$

The rule of thumb that fell out: about 20 tokens per parameter for compute-optimal training. This is why model size alone tells you little. A 405B model trained on too few tokens is undertrained; an 8B model trained on 15 trillion tokens (about 1,875 tokens per parameter) is pushed far past compute-optimal on purpose, because the lab is optimizing not for training cost but for cheap inference later. Different objective, different point on the curve.
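
As a worked instance of the relations above, treat the 20-tokens-per-parameter ratio as exact. It is an empirical rule of thumb, so the results are approximate. Substituting it into the compute estimate gives a closed form for the compute-optimal split:

$$ C \approx 6\,N\,D, \quad D = 20\,N \quad\Rightarrow\quad C \approx 120\,N^{2} \quad\Rightarrow\quad N_{\text{opt}} \approx \sqrt{C/120}, \qquad D_{\text{opt}} \approx 20\,N_{\text{opt}} $$

For the budget of an 8B model trained on 15 trillion tokens, $C \approx 6 \times 8\times10^{9} \times 15\times10^{12} = 7.2\times10^{23}$ FLOPs:

$$ N_{\text{opt}} \approx \sqrt{7.2\times10^{23}/120} = \sqrt{6\times10^{21}} \approx 7.7\times10^{10}, \qquad D_{\text{opt}} \approx 20\,N_{\text{opt}} \approx 1.5\times10^{12} $$

Under that assumption, the same budget would be compute-optimal for a model of roughly 77B parameters trained on about 1.5 trillion tokens. Spending it on 8B parameters and 15 trillion tokens trades training efficiency for a model that is about ten times smaller to serve.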

The giant a small model out-trained

    # Training compute C = 6 * N (params) * D (tokens); all figures publicly disclosed.
    models = [
        ("GPT-3 (2020)", 175e9, 300e9),      # 175B params, 300B tokens
        ("Llama-3 8B", 8e9, 15e12),          # 8B params, 15T tokens
        ("Llama-3 70B", 70e9, 15e12),        # 70B params, 15T tokens
        ("Llama-3.1 405B", 405e9, 15e12),    # 405B params, 15T tokens
    ]

    for name, N, D in models:
        C = 6 * N * D  # estimated training FLOPs
        print(f"{name:16s} N={N:.3g} D={D:.3g} C={C:.3e} FLOPs")

    # Output:
    # GPT-3 (2020)     N=1.75e+11 D=3e+11 C=3.150e+23 FLOPs
    # Llama-3 8B       N=8e+09 D=1.5e+13 C=7.200e+23 FLOPs
    # Llama-3 70B      N=7e+10 D=1.5e+13 C=6.300e+24 FLOPs
    # Llama-3.1 405B   N=4.05e+11 D=1.5e+13 C=3.645e+25 FLOPs
    #
    # Llama-3 8B used MORE training compute than GPT-3, at about 1/22 the size.
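
To see how far each of these models sits from the 20-tokens-per-parameter rule of thumb, the same public figures can be turned into a ratio. This is a small sketch; the names `CHINCHILLA_RATIO` and `tokens_per_param` are hypothetical helpers introduced here, and the rule of thumb is treated as a fixed constant for illustration.

```python
CHINCHILLA_RATIO = 20  # tokens per parameter, the rule of thumb (treated as exact here)

# (name, parameters, training tokens): the publicly disclosed figures used in this piece
MODELS = [
    ("GPT-3 (2020)", 175e9, 300e9),
    ("Llama-3 8B", 8e9, 15e12),
    ("Llama-3 70B", 70e9, 15e12),
    ("Llama-3.1 405B", 405e9, 15e12),
]


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


for name, n_params, n_tokens in MODELS:
    ratio = tokens_per_param(n_params, n_tokens)
    factor = ratio / CHINCHILLA_RATIO  # above 1 means trained past the rule of thumb
    print(f"{name:16s} {ratio:8.1f} tokens/param  ({factor:.2f}x the rule of thumb)")
```

By this measure GPT-3 saw fewer than 2 tokens per parameter, well short of the rule of thumb, while Llama-3 8B saw 1,875, far beyond it. The ratio says nothing about data quality, which is a separate dial.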

The market the dials create

Differences → money

These axes aren't just technical — they carve the market. Open-weight models commoditize the bottom: anyone can run Llama or Mistral on their own hardware, so the price of "good enough" intelligence collapses toward the cost of electricity. Closed frontier models defend a premium at the top, charging per token for the best available quality. The whole industry is a tug-of-war between those two forces.

The compute-optimal math explains the strategy: labs now deliberately overtrain small models, spending more upfront so that inference, the part you pay for on every request, is cheap. A small, heavily-trained model that runs on one GPU can undercut a frontier API on price for most everyday tasks. That pressure on margins, even as capability rises, is exactly the tension the Circuit tracks.
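
The trade between upfront training and ongoing inference can be sketched with rough arithmetic. The estimate of about 2 × N FLOPs per served token is a common approximation for a forward pass and is an assumption here, not a figure from this piece. The function names are hypothetical, and the "large" model is a hypothetical one sized by the 20-tokens-per-parameter rule for a similar training budget.

```python
def training_flops(n_params, n_tokens):
    """Training estimate C ~ 6 * N * D."""
    return 6 * n_params * n_tokens


def inference_flops(n_params, served_tokens):
    """Assumed approximation: about 2 * N FLOPs per served token."""
    return 2 * n_params * served_tokens


def breakeven_served_tokens(n_tokens):
    """Served tokens at which inference compute equals training compute.

    6*N*D == 2*N*T  =>  T == 3*D, independent of model size.
    """
    return 3 * n_tokens


small = (8e9, 15e12)     # 8B params, 15T tokens (deliberately overtrained)
large = (77e9, 1.55e12)  # hypothetical: about 20 tokens per parameter, similar budget

for label, (n, d) in (("small", small), ("large", large)):
    print(f"{label}: train={training_flops(n, d):.2e} FLOPs, "
          f"inference={inference_flops(n, 1):.1e} FLOPs/token, "
          f"break-even after {breakeven_served_tokens(d):.2e} served tokens")
```

With similar training budgets, the small model costs roughly a tenth as much compute per served token under this approximation. The arithmetic does not show that the two models would be equally capable; it only shows where the compute goes.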

So "which model is best" is the wrong question economically. The right one is which point on these six dials fits your task and your budget — and whether the frontier premium survives as the open tier keeps catching up.
