Parameters & weights

A parameter is a single learned number — one weight in one of the model's matrices. A large model is billions of them, and together those numbers are the model. "175 billion parameters" means 175 billion learned knobs.


The model is its numbers

Every multiply you've met — the embedding lookup, the attention projections, the feed-forward layers — uses weights. A parameter is just one of those weights: a single number the model learned during training. There's nothing else in the model. The "knowledge" isn't stored anywhere special; it's distributed across all the weights, and the model file you download is literally a giant list of them.

That's why parameter count is the headline spec. It sets how much the model can learn, how much memory it takes to hold, and how much compute it takes to run. Build your own and watch those three numbers move — set the depth and width of a transformer:

Build a model — watch the count

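[Interactive calculator. Inputs: layers ($N$), default 96; width ($d_{\text{model}}$), default 12288; vocabulary ($V$), default 50,257. Outputs: total parameters, memory to hold the weights in fp16, and the number of 80 GB GPUs needed just to load it.]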

Where the billions live

The parameters aren't spread evenly. In each transformer layer they sit in a handful of matrices:

Attention — four square matrices ($W_Q, W_K, W_V, W_O$), each $d \times d$, so $4d^2$ per layer.
Feed-forward — two matrices that expand to a wider hidden size (usually $4d$) and back, about $8d^2$ per layer. This is where most parameters live.
Embeddings — one $V \times d$ table at the bottom, and a same-sized one at the top (often tied to save space).
Norms & biases — a rounding error by comparison.

So a layer holds roughly $12d^2$ parameters, and a model with $N$ layers holds about $12Nd^2$ plus the embeddings. They begin as random noise and are nudged, one tiny gradient step at a time, until the whole pile of numbers predicts text well. Training is the act of setting these parameters; everything else in AI is what you do with them once they're set.
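
The breakdown above can be sketched as a few lines of arithmetic. This is a rough illustration, not a definitive accounting: it ignores norms and biases, assumes the feed-forward hidden size is $4d$, and assumes the input and output embedding tables are tied. The function names and the `tied_embeddings` flag are made up for this example.

```python
def layer_breakdown(d_model, ff_mult=4):
    """Approximate per-layer parameter counts, ignoring norms and biases."""
    attention = 4 * d_model**2                        # W_Q, W_K, W_V, W_O, each d x d
    feed_forward = 2 * d_model * (ff_mult * d_model)  # expand to ff_mult * d, then project back
    return {"attention": attention, "feed_forward": feed_forward}


def model_breakdown(n_layers, d_model, vocab, tied_embeddings=True):
    """Approximate whole-model counts by component."""
    per_layer = layer_breakdown(d_model)
    tables = 1 if tied_embeddings else 2  # input table, plus an output table if untied
    return {
        "attention": n_layers * per_layer["attention"],
        "feed_forward": n_layers * per_layer["feed_forward"],
        "embeddings": tables * vocab * d_model,
    }


parts = model_breakdown(n_layers=96, d_model=12288, vocab=50257)
total = sum(parts.values())
for name, count in parts.items():
    print(f"{name:>12}: {count / 1e9:7.2f}B  ({count / total:.1%})")
```

At this shape the feed-forward matrices hold about two thirds of the total, attention about one third, and the embedding table well under one percent.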

Counting the knobs

With $N$ layers, width $d$, and vocabulary $V$, the parameter count is dominated by the layers:

$$ N_{\text{params}} \;\approx\; \underbrace{N\,(4d^2 + 8d^2)}_{\text{attention + FFN}} \;+\; \underbrace{V d}_{\text{embedding}} \;=\; 12\,N\,d^2 + V d $$

For a model of GPT-3's shape ($N = 96$, $d = 12288$, $V = 50{,}257$), that comes to about $174.6$ billion, which rounds to the famous 175 billion. Notice the $d^2$: widening a model is quadratically expensive in parameters, while adding layers is only linear.
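
To see the quadratic-versus-linear point concretely, here is a small sketch that applies the approximation $12Nd^2 + Vd$ to a GPT-3-shaped model, then doubles the depth and the width in turn. The helper name `approx_params` is hypothetical, and the count inherits every simplification of the formula.

```python
def approx_params(n_layers, d_model, vocab=50257):
    """The approximation from the text: 12 * N * d^2 + V * d."""
    return 12 * n_layers * d_model**2 + vocab * d_model


base = approx_params(96, 12288)
deeper = approx_params(2 * 96, 12288)  # double the number of layers
wider = approx_params(96, 2 * 12288)   # double the width

print(f"base:   {base / 1e9:.0f}B")
print(f"deeper: {deeper / 1e9:.0f}B ({deeper / base:.2f}x)")
print(f"wider:  {wider / 1e9:.0f}B ({wider / base:.2f}x)")
```

Doubling the layers roughly doubles the count, while doubling the width roughly quadruples it. Neither ratio is exact, because the embedding term $Vd$ does not grow with depth and grows only linearly with width.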

Two consequences set the cost. The memory to hold the weights is $N_{\text{params}}$ times the bytes per number — $2$ bytes in half-precision, $1$ in int8. And running the model takes about $2N_{\text{params}}$ floating-point operations per token. So the single number on the box determines both the storage and the compute.
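
Both rules of thumb are one-line calculations. The sketch below assumes a round 175 billion parameters and reports memory in decimal gigabytes ($10^9$ bytes); the function and table names are illustrative only.

```python
BYTES_PER_PARAM = {"fp16": 2, "int8": 1}  # bytes per stored number, as given in the text


def weight_memory_gb(n_params, dtype="fp16"):
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def flops_per_token(n_params):
    """Rule of thumb: about 2 floating-point operations per parameter per token."""
    return 2 * n_params


n_params = 175e9
for dtype in ("fp16", "int8"):
    print(f"{dtype}: {weight_memory_gb(n_params, dtype):.0f} GB")
print(f"{flops_per_token(n_params):.2e} FLOPs per token")
```

This covers the weights only. Anything else a running model keeps in memory is outside what this estimate counts.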

The parameter calculator, in code

```python
def transformer_params(n_layers, d_model, vocab, d_ff=None):
    d_ff = d_ff or 4 * d_model
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff  # attention (QKVO) + FFN
    embedding = vocab * d_model                      # token embedding table
    return n_layers * per_layer + embedding

# GPT-3 scale: 96 layers, width 12288, vocab 50257
N = transformer_params(96, 12288, 50257)
print(f"{N:,} parameters")  # 174,563,733,504 parameters
print(f"{N/1e9:.0f}B params | {N*2/1e9:.0f} GB in fp16")  # 175B params | 349 GB in fp16
```

The number on the box is the bill

Count → money

Parameter count is the closest thing the AI economy has to a single price tag. It sets memory: a 175B model needs about 350 GB just to sit in half-precision, which means at least five 80 GB GPUs before it has done any work. It sets compute: roughly $2N_{\text{params}}$ operations per token to run, and about $6\,N_{\text{params}}\,D$ to train on $D$ tokens. And by the scaling laws, it sets capability: bigger has reliably meant better, which is why the number keeps climbing.
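
The same arithmetic can be turned into a small cost sketch. It reads $D$ as the number of training tokens, and the 300 billion tokens used below is an illustrative assumption rather than a figure from this article. The function names are hypothetical, and the GPU count covers only the memory for the weights themselves.

```python
import math


def gpus_to_hold(n_params, bytes_per_param=2, gpu_memory_gb=80):
    """Minimum number of GPUs whose combined memory fits the weights alone."""
    weight_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_memory_gb)


def training_flops(n_params, n_tokens):
    """Rule of thumb: about 6 floating-point operations per parameter per training token."""
    return 6 * n_params * n_tokens


n_params = 175e9
n_tokens = 300e9  # illustrative assumption, not a figure from the text

print(f"{gpus_to_hold(n_params)} x 80 GB GPUs to hold the fp16 weights")
print(f"{training_flops(n_params, n_tokens):.2e} FLOPs to train")
```

Rounding up matters here: 350 GB divided by 80 GB is about 4.4, so the weights do not fit on four GPUs and need a fifth.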

That is the whole flywheel of the build-out. "Make it bigger" improves the product, but every extra parameter is more HBM to hold it, more GPUs to serve it, and more energy to train it. The race to larger parameter counts is the race that fills the data centers, and the reason memory, not raw compute, is so often the binding constraint.

So when you read "405 billion parameters," translate it: that many learned numbers, that much memory, that much silicon. The parameter count you just dialed is the unit in which the entire Circuit is denominated.
