Divergent Compute.AI Economic Think Tank

First Principles / Part II · Models / Chapter 11

Differences between LLMs

GPT, Claude, Gemini, Llama, Mistral, DeepSeek — almost all of them are the same transformer recipe you just learned. What actually separates them lives on a handful of axes: scale, data, architecture, alignment, openness, and context.

01 The answer, then the intuition

One recipe, a few dials

It's tempting to imagine each lab guards some secret architecture. Mostly they don't. Strip the branding away and you find the same decoder-only transformer, trained with the same next-token objective and aligned with the same family of post-training techniques. The famous models differ by degree on a small set of dials, not by kind.

That's genuinely clarifying: once you understand one model, you understand most of what matters about the rest, and the marketing resolves into six measurable choices. Each axis controls something specific, and real, publicly documented models sit at different points along it.

[Interactive figure: the six axes of difference. Only publicly disclosed facts are shown; frontier closed-model sizes are undisclosed and marked as such.]

02 Mechanics

What each dial actually changes

  • Scale. More parameters = more capacity, but only if matched with enough data. A bigger model isn't automatically better — an undertrained giant loses to a well-fed smaller one.
  • Data. The mix, quality, and quantity of training tokens. This is the least visible and arguably most decisive dial — and the hardest to copy, since each lab's data pipeline is proprietary.
  • Architecture tweaks. The core is shared, but labs vary attention (grouped-query, sliding-window), use mixture-of-experts to grow total parameters while only a subset runs for each token, so per-token compute grows far more slowly than model size, and extend context length. These are differences of efficiency, not of kind.
  • Alignment. The post-training — SFT data, RLHF quality, system behaviour — is where a model's "personality," refusal style, and reliability come from. Two models with identical pretraining can feel completely different after this step.
  • Openness. Open-weight (Llama, Mistral, Qwen, DeepSeek — you can download and run them) vs closed (GPT, Claude, Gemini — API only). This is a business and safety choice, not a capability one.
  • Context window. How much the model can attend to at once — from 8K to a million-plus tokens. A real capability difference for long-document work, and a real cost difference.
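Two of the dials above, grouped-query attention and context length, meet in one number: the memory needed to cache keys and values during generation. The following is a minimal sketch of that arithmetic, not any lab's implementation. The function name `kv_cache_bytes` is hypothetical, the layer and head counts are assumed from the published Llama-3 8B configuration, and 16-bit storage is an assumption.

```python
# Sketch: KV-cache memory for one sequence, as a function of context length
# and the number of key/value heads (grouped-query attention shrinks it).

def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence.

    The leading 2 counts the two cached tensors (keys and values).
    bytes_per_value=2 assumes 16-bit storage (fp16/bf16).
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value

GIB = 1024 ** 3

# Assumed shape: the published Llama-3 8B configuration
# (32 layers, 32 query heads, 8 key/value heads, head dimension 128).
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 32, 32, 8, 128

for context in (8_192, 131_072):
    gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context)
    # Hypothetical comparison: the same model with one KV head per query head.
    mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, context)
    print(f"{context:>7,} tokens  GQA={gqa / GIB:5.1f} GiB  full MHA={mha / GIB:5.1f} GiB")
```

Under these assumptions the cache is 1 GiB per sequence at 8K tokens and 16 GiB at 128K, and four times that without grouping. The cost grows linearly with context, which is why a long window is a real cost difference and not only a capability one.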

Benchmarks try to collapse all of this into a single leaderboard number. Treat those with suspicion — they're easily gamed and rarely capture the axis you actually care about for your task.
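The mixture-of-experts point in the list above can be made concrete with a few lines of arithmetic. This is a rough sketch: the parameter counts are the rounded totals and per-token active counts reported in the Mixtral and DeepSeek-V3 papers, the helper `forward_flops_per_token` is a hypothetical name, and "about 2 FLOPs per active parameter per token" is the usual forward-pass approximation that ignores attention over the context.

```python
# Sketch: total vs per-token ("active") parameters for mixture-of-experts models.
# Parameter counts are rounded figures as reported in each model's paper.

models = [
    # name,          total params, active params per token
    ("Llama-3 70B",   70e9,  70e9),   # dense: every parameter is used for every token
    ("Mixtral 8x7B",  47e9,  13e9),   # 2 of 8 experts routed per token
    ("DeepSeek-V3",  671e9,  37e9),
]

def forward_flops_per_token(active_params):
    """Rough forward-pass cost: about 2 FLOPs per active parameter per token."""
    return 2 * active_params

for name, total, active in models:
    share = active / total
    flops = forward_flops_per_token(active)
    print(f"{name:14s} total={total / 1e9:5.0f}B  active={active / 1e9:4.0f}B "
          f"({share:4.0%})  ~{flops:.1e} FLOPs/token")
```

By this estimate DeepSeek-V3 carries almost ten times the parameters of a 70B dense model but spends roughly half the compute per token. All of those parameters still have to sit in memory, so the saving is in compute, not in hardware footprint.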

04 The math

Why a smaller model can win

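Training compute is set by parameters $N$ and tokens $D$. For a dense transformer, each token costs roughly 2 FLOPs per parameter in the forward pass and about 4 in the backward pass, which gives the standard approximation: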

$$ C \approx 6\,N\,D $$

The Chinchilla result says that for a fixed compute budget $C$, loss is minimized when parameters and data grow together: both scale roughly as the square root of compute, so their ratio stays roughly constant:

$$ N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \approx 20\,N_{\text{opt}} $$

The rule of thumb that fell out: about 20 tokens per parameter for compute-optimal training, an empirical fit rather than an exact constant. This is why model size alone tells you little. A 405B model trained on too few tokens is undertrained; an 8B model trained on 15 trillion tokens (about 1,875 tokens per parameter) is pushed far past compute-optimal on purpose, because the lab is optimizing not for training cost but for cheap inference later. Different objective, different point on the curve.
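Combining $C \approx 6ND$ with $D \approx 20N$ gives $C \approx 120N^2$, so a compute budget can be split into a model size and a token count directly. The sketch below assumes the 20:1 ratio holds exactly, which it does not in practice. The function name `compute_optimal` is hypothetical, and the sanity check uses Chinchilla's own published configuration of 70B parameters and 1.4T tokens.

```python
import math

TOKENS_PER_PARAM = 20  # rule of thumb from the text, not an exact constant

def compute_optimal(C, ratio=TOKENS_PER_PARAM):
    """Split a FLOP budget C into (N, D) with C = 6*N*D and D = ratio*N."""
    N = math.sqrt(C / (6 * ratio))
    return N, ratio * N

# Sanity check against Chinchilla itself: 70B parameters, 1.4T tokens.
C_chinchilla = 6 * 70e9 * 1.4e12
N_opt, D_opt = compute_optimal(C_chinchilla)
print(f"C={C_chinchilla:.2e}  N_opt={N_opt:.2e}  D_opt={D_opt:.2e}")

# How far each model discussed in this chapter sits from the 20:1 rule.
for name, N, D in [("GPT-3", 175e9, 300e9), ("Llama-3 8B", 8e9, 15e12),
                   ("Llama-3 70B", 70e9, 15e12), ("Llama-3.1 405B", 405e9, 15e12)]:
    print(f"{name:16s} {D / N:7.1f} tokens per parameter")
```

Read against the rule, GPT-3 sits well below 20 tokens per parameter and every Llama 3 model sits above it, the smallest by the widest margin.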

05 The code

The giant a small model out-trained

Compute $6ND$ for four models with publicly disclosed sizes and token counts. Note the surprise in the output.

compute.py

# Training compute C ≈ 6 * N (params) * D (tokens); all figures publicly disclosed.
models = [
    ("GPT-3 (2020)",   175e9, 300e9),   # 175B params, 300B tokens
    ("Llama-3 8B",       8e9,  15e12),  # 8B params, 15T tokens
    ("Llama-3 70B",     70e9,  15e12),  # 70B params, 15T tokens
    ("Llama-3.1 405B", 405e9,  15e12),  # 405B params, 15T tokens
]
for name, N, D in models:
    C = 6 * N * D
    # Two decimals so that 15T tokens prints as 1.50e+13 instead of rounding to 2e+13.
    print(f"{name:16s} N={N:.2e}  D={D:.2e}  C={C:.2e} FLOPs")
# GPT-3 (2020)     N=1.75e+11  D=3.00e+11  C=3.15e+23 FLOPs
# Llama-3 8B       N=8.00e+09  D=1.50e+13  C=7.20e+23 FLOPs   <- more compute than GPT-3, at 1/22 the size
# Llama-3 70B      N=7.00e+10  D=1.50e+13  C=6.30e+24 FLOPs
# Llama-3.1 405B   N=4.05e+11  D=1.50e+13  C=3.65e+25 FLOPs

Llama-3 8B is about 22× smaller than GPT-3 yet consumed roughly 2.3× as much training compute, because it was fed 50× the data. Parameter count is a label on the box, not a measure of what went into it.

06 The economics

The market the dials create

Differences → money

These axes aren't just technical; they carve the market. Open-weight models commoditize the bottom: anyone can run Llama or Mistral on their own hardware, so the price of "good enough" intelligence collapses toward the cost of the hardware and electricity needed to run it. Closed frontier models defend a premium at the top, charging per token for the best available quality. The whole industry is a tug-of-war between those two forces.

The compute-optimal math explains a deliberate strategy: labs now overtrain small models, spending more upfront so that inference, the part you pay for forever, is cheap. A small, heavily trained model that runs on one GPU can undercut a frontier API on price for many everyday tasks. That pressure on margins, even as capability rises, is exactly the tension the Circuit tracks.
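The trade between upfront training and ongoing inference can be sketched by adding a serving term to the training formula. This is an illustration under loud assumptions: inference is taken as about 2 FLOPs per parameter per token, both models are dense, the two configurations are hypothetical, and the comparison only matters if they reach comparable quality, which this arithmetic does not establish. The name `lifetime_flops` is made up for the example.

```python
def lifetime_flops(params, train_tokens, served_tokens):
    """Training (~6*N*D) plus inference (~2*N per token served)."""
    return 6 * params * train_tokens + 2 * params * served_tokens

# Two hypothetical dense models trained with the same FLOP budget.
BUDGET = 6 * 8e9 * 15e12             # budget of an 8B model trained on 15T tokens
small = (8e9, 15e12)                 # overtrained small model
N_big = (BUDGET / (6 * 20)) ** 0.5   # compute-optimal size for that budget (20:1 rule)
big = (N_big, 20 * N_big)            # about 77B parameters, about 1.5T tokens

for served in (0, 1e12, 1e14):
    s = lifetime_flops(*small, served)
    b = lifetime_flops(*big, served)
    print(f"served={served:.0e}  small={s:.2e}  big={b:.2e}  big/small={b / s:.2f}")
```

With no tokens served the two cost the same by construction. Every token served afterwards costs nearly ten times more compute on the larger model, so the gap widens with usage, which is the sense in which inference is the part you pay for forever.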

So "which model is best" is the wrong question economically. The right one is which point on these six dials fits your task and your budget — and whether the frontier premium survives as the open tier keeps catching up.

07 Going deeper

The primary sources

Llama 3 Herd of Models (2024) · disclosed sizes, token counts, and design choices.
Hoffmann et al., Training Compute-Optimal Large Language Models (2022) · the Chinchilla paper and its compute-optimal ~20-tokens-per-parameter rule.
Mixtral of Experts (2024) · mixture-of-experts: more parameters, same per-token cost.
DeepSeek-V3 (2024) · an open MoE frontier model, with disclosed training cost.

Cite this chapter: Divergent Compute, "Differences between LLMs", First Principles, 2026. divergentcompute.com/first-principles-model-differences · v1.0 · CC-BY.
