Divergent Compute.AI Economic Think Tank

First Principles / Part I · Foundations / Chapter 06

Parameters & weights

A parameter is a single learned number — one weight in one of the model's matrices. A large model is billions of them, and together those numbers are the model. "175 billion parameters" means 175 billion learned knobs.

01 The answer, then the intuition

The model is its numbers

Every multiply you've met — the embedding lookup, the attention projections, the feed-forward layers — uses weights. A parameter is just one of those weights: a single number the model learned during training. There's nothing else in the model. The "knowledge" isn't stored anywhere special; it's distributed across all the weights, and the model file you download is literally a giant list of them.

That's why parameter count is the headline spec. It sets how much the model can learn, how much memory it takes to hold, and how much compute it takes to run. Build your own and watch those three numbers move — set the depth and width of a transformer:

[Interactive calculator: set the depth and width of a transformer to see its total parameters, the memory to hold it (fp16), and the number of 80 GB GPUs needed just to load it.]

02 Mechanics

Where the billions live

The parameters aren't spread evenly. In each transformer layer they sit in a handful of matrices:

  • Attention — four square matrices ($W_Q, W_K, W_V, W_O$), each $d \times d$, so $4d^2$ per layer.
  • Feed-forward — two matrices that expand to a wider hidden size (usually $4d$) and back, about $8d^2$ per layer. This is where most parameters live.
  • Embeddings — one $V \times d$ table at the bottom, and a same-sized one at the top (often tied to save space).
  • Norms & biases — a rounding error by comparison.

So a layer holds roughly $12d^2$ parameters, and a model with $N$ layers holds about $12Nd^2$ plus the embeddings. They begin as random noise and are nudged, one tiny gradient step at a time, until the whole pile of numbers predicts text well. Training is the act of setting these parameters; everything else in AI is what you do with them once they're set.
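The per-layer breakdown above can be sketched in a few lines. This is a minimal illustration, assuming the usual $4d$ feed-forward width and ignoring norms and biases; the function name `layer_breakdown` is hypothetical and chosen only for this example.

```python
def layer_breakdown(d_model, d_ff=None):
    """Approximate weights per transformer layer, ignoring norms and biases."""
    if d_ff is None:
        d_ff = 4 * d_model              # assumption: the usual 4x expansion
    attention = 4 * d_model ** 2        # W_Q, W_K, W_V, W_O, each d x d
    feed_forward = 2 * d_model * d_ff   # expand to d_ff, then project back
    return {"attention": attention, "feed_forward": feed_forward}


parts = layer_breakdown(d_model=12288)  # GPT-3's width, used as an example
total = sum(parts.values())
for name, count in parts.items():
    print(f"{name:>12}: {count:>13,}  ({count / total:.0%} of the layer)")
print(f"{'total':>12}: {total:>13,}")
```

With the default $4d$ hidden size, attention holds one third of each layer's weights and the feed-forward block holds two thirds, whatever the width.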

04 The math

Counting the knobs

With $N$ layers, width $d$, and vocabulary $V$, the parameter count is dominated by the layers:

$$ N_{\text{params}} \;\approx\; \underbrace{N\,(4d^2 + 8d^2)}_{\text{attention + FFN}} \;+\; \underbrace{V d}_{\text{embedding}} \;=\; 12\,N\,d^2 + V d $$

For a model of GPT-3's shape ($N = 96$, $d = 12288$, $V = 50{,}257$) that comes to about $174.6$ billion, which rounds to the famous $175$ billion. Notice the $d^2$: widening a model is quadratically expensive in parameters, while adding layers is only linear.
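Written out term by term for that shape, the arithmetic is:

$$ 12\,N\,d^2 = 12 \times 96 \times 12288^2 = 173{,}946{,}175{,}488, \qquad V d = 50{,}257 \times 12288 = 617{,}558{,}016 $$

$$ N_{\text{params}} \;\approx\; 173{,}946{,}175{,}488 + 617{,}558{,}016 = 174{,}563{,}733{,}504 $$

The embedding table contributes well under 1% of the total. The same formula shows the asymmetry between width and depth:

$$ 12\,N\,(2d)^2 = 4 \cdot 12\,N\,d^2, \qquad 12\,(2N)\,d^2 = 2 \cdot 12\,N\,d^2 $$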

Two consequences set the cost. The memory to hold the weights is $N_{\text{params}}$ times the bytes per number — $2$ bytes in half-precision, $1$ in int8. And running the model takes about $2N_{\text{params}}$ floating-point operations per token. So the single number on the box determines both the storage and the compute.
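Both consequences are one multiplication each. The sketch below is illustrative only: the helper names are hypothetical, 1 GB is taken as $10^9$ bytes, and the fp32 entry (4 bytes) is added for comparison. It counts the weights alone, so the memory a running model needs for activations or caches is not included.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}   # storage per weight


def weight_memory_gb(n_params, precision="fp16"):
    """Memory to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def flops_per_token(n_params):
    """Rough cost of running the model: about 2 operations per weight."""
    return 2 * n_params


n_params = 175e9   # the headline GPT-3 figure
for precision in ("fp16", "int8"):
    print(f"{precision}: {weight_memory_gb(n_params, precision):,.0f} GB")
print(f"{flops_per_token(n_params):.2e} FLOPs per token")
```

For 175 billion parameters this gives 350 GB in fp16, 175 GB in int8, and about $3.5 \times 10^{11}$ operations per token.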

05 The code

The parameter calculator, in code

The exact function behind the slider above — runnable, and it reproduces GPT-3's 175B.

params.py

def transformer_params(n_layers, d_model, vocab, d_ff=None):
    """Approximate weight count; ignores norms, biases, and positional embeddings."""
    if d_ff is None:
        d_ff = 4 * d_model                            # default FFN width: 4x the model width
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff   # attention (QKVO) + FFN
    embedding = vocab * d_model                       # token embedding table (output table assumed tied)
    return n_layers * per_layer + embedding

# GPT-3 scale: 96 layers, width 12288, vocab 50257
N = transformer_params(96, 12288, 50257)
print(f"{N:,} parameters")          # 174,563,733,504 parameters
print(f"{N/1e9:.0f}B params  |  {N*2/1e9:.0f} GB in fp16")   # 175B params  |  349 GB in fp16
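As a usage sketch, the same formula makes the width-versus-depth point from the math section concrete. The function is repeated here so the snippet runs on its own; the variable names `base`, `deeper`, and `wider` are illustrative.

```python
def transformer_params(n_layers, d_model, vocab, d_ff=None):
    """Approximate weight count: attention + FFN per layer, plus one embedding table."""
    if d_ff is None:
        d_ff = 4 * d_model
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff
    return n_layers * per_layer + vocab * d_model


base = transformer_params(96, 12288, 50257)      # GPT-3's shape
deeper = transformer_params(192, 12288, 50257)   # double the layers
wider = transformer_params(96, 24576, 50257)     # double the width

print(f"2x depth: {deeper / base:.2f}x the parameters")
print(f"2x width: {wider / base:.2f}x the parameters")
```

Doubling the depth roughly doubles the count, while doubling the width comes out just under four times, because the embedding table grows only linearly with width.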

06 The economics

The number on the box is the bill

Count → money

Parameter count is the closest thing the AI economy has to a single price tag. It sets memory: a 175B model needs about 350 GB just to sit in half-precision, which means at least five 80 GB GPUs before it has done any work. It sets compute: roughly $2N_{\text{params}}$ operations per token to run, and about $6\,N_{\text{params}}\,D$ to train on $D$ tokens. And by the scaling laws, it sets capability: bigger has reliably meant better, which is why the number keeps climbing.
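The memory and training figures above follow from two rules of thumb, sketched here under stated assumptions. The helper names are hypothetical, the GPU count covers the weights only, and the token count `n_tokens` is an illustrative assumption, not a figure from this chapter.

```python
import math


def gpus_to_load(n_params, bytes_per_param=2, gpu_memory_gb=80):
    """Minimum number of GPUs whose combined memory fits the weights alone."""
    weight_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_memory_gb)


def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 operations per weight per token."""
    return 6 * n_params * n_tokens


n_params = 175e9
n_tokens = 300e9   # assumption: training tokens, chosen for illustration

print(f"GPUs just to load: {gpus_to_load(n_params)}")
print(f"Training compute:  {training_flops(n_params, n_tokens):.2e} FLOPs")
```

Under these assumptions the weights need five 80 GB GPUs (350 GB divided by 80 GB, rounded up), and training costs on the order of $3 \times 10^{23}$ operations.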

That is the whole flywheel of the build-out. "Make it bigger" improves the product, but every extra parameter is more HBM to hold it, more GPUs to serve it, and more energy to train it. The race to larger parameter counts is the race that fills the data centers, and the reason memory, not raw compute, is so often the binding constraint.

So when you read "405 billion parameters," translate it: that many learned numbers, that much memory, that much silicon. The parameter count you just dialed is the unit in which the entire Circuit is denominated.

07 Going deeper

The primary sources

Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3) · the 175-billion-parameter model.
Kaplan et al. (2020) — Scaling Laws for Neural Language Models · why bigger reliably means better.
Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (Chinchilla) · how to balance parameters against training tokens.
Phuong & Hutter (2022) — Formal Algorithms for Transformers · the parameter tables, written out.

Cite this chapter: Divergent Compute, "Parameters & weights", First Principles, 2026. divergentcompute.com/first-principles-parameters · v1.0 · CC-BY.
