What is a transformer?

A transformer is the assembly. Take embeddings, attention, and a small neural network, wire them into a repeatable block, and stack that block dozens of times. That stack is what an LLM is.


The parts you already know, wired together

You've met every piece. A transformer just connects them. Text becomes tokens, tokens become embeddings, and then the data flows through a stack of identical blocks. Inside each block, attention lets tokens share information, and a small neural network processes each one. Repeat, and out the top come probabilities for the next token.

[Interactive diagram: the transformer and its repeating block. Data flows up; the shaded block is the unit that repeats. Each stage corresponds to one of the earlier chapters.]

One block, repeated

The whole architecture is one block applied over and over. Inside it, two ideas you've seen, plus two pieces of plumbing that make deep stacks trainable:

Self-attention mixes information across tokens — the only place they talk to each other.
The feed-forward network then processes each token independently. This is where most of a model's parameters actually live.
Residual connections add a layer's input back to its output, so information (and gradients) can skip straight through a hundred layers without vanishing.
Layer normalization rescales the numbers at each step to keep training stable.
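The claim that the feed-forward network holds most of a block's parameters can be checked with a quick count. This is only a sketch under assumptions the text does not state: four $d \times d$ attention matrices (query, key, value, plus an output projection) and a feed-forward hidden width of four times $d$. The function name and the `ffn_mult` parameter are made up for illustration.

```python
def block_param_counts(d, ffn_mult=4):
    """Count weight-matrix parameters in one block (biases and norms ignored).

    Assumes four d x d attention projections (query, key, value, output)
    and a feed-forward hidden width of ffn_mult * d. Both are illustrative
    assumptions, not figures from the text.
    """
    attention = 4 * d * d              # Wq, Wk, Wv and an assumed output projection
    ffn = 2 * d * (ffn_mult * d)       # W1 is d x (ffn_mult*d), W2 is (ffn_mult*d) x d
    return attention, ffn

attn_params, ffn_params = block_param_counts(d=1024)
ffn_share = ffn_params / (attn_params + ffn_params)
print(attn_params, ffn_params, round(ffn_share, 3))
```

Under these assumptions the feed-forward weights are about two thirds of the block, whatever the width.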

One detail matters: attention is order-blind. Shuffle the tokens and each one gets exactly the same output, just in the shuffled order. So before the first block, each embedding gets a positional encoding added, a signal telling the model where each token sits. Stack ~12 blocks for a small model, ~100 for a frontier one, cap it with an output layer that turns the final vectors into next-token probabilities, and you have GPT.
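The order-blindness is easy to see numerically. The sketch below shuffles the input rows of a single unmasked attention layer and checks that the outputs are the same vectors in shuffled order, then repeats the check after adding a positional signal. The sinusoidal encoding is one common choice and an assumption here; the text does not say which encoding is meant.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def sinusoidal_positions(n, d):
    """Sinusoidal positional encoding (an assumed choice); d must be even."""
    pos = np.arange(n)[:, None]
    freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(pos * freq)   # even columns
    pe[:, 1::2] = np.cos(pos * freq)   # odd columns
    return pe

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
perm = np.array([4, 2, 0, 3, 1])       # an arbitrary reordering of the tokens

# Without positions: shuffling the inputs only shuffles the outputs.
out = attention(x, Wq, Wk, Wv)
out_shuffled = attention(x[perm], Wq, Wk, Wv)
order_blind = np.allclose(out_shuffled, out[perm])

# With positions added after the shuffle, the same tokens sit at new
# positions, so their outputs change.
pe = sinusoidal_positions(n, d)
out_pos = attention(x + pe, Wq, Wk, Wv)
out_pos_shuffled = attention(x[perm] + pe, Wq, Wk, Wv)
position_aware = not np.allclose(out_pos_shuffled, out_pos[perm])

print(order_blind, position_aware)
```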

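The block as two residual steps

Let $\mathbf{x}$ be the matrix of token vectors entering a block. In the post-norm layout of the original transformer, the block is two residual-and-normalize steps, first attention, then the feed-forward network (many GPT-style models instead normalize before each sub-layer, but the structure is the same):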

$$ \mathbf{x}' = \text{LayerNorm}\big(\mathbf{x} + \text{Attention}(\mathbf{x})\big) $$

$$ \mathbf{x}'' = \text{LayerNorm}\big(\mathbf{x}' + \text{FFN}(\mathbf{x}')\big) $$

where the feed-forward network is a two-layer MLP applied to each token, $\text{FFN}(\mathbf{x}) = \max(0,\,\mathbf{x}W_1)\,W_2$. Crucially the block maps $\mathbb{R}^{n\times d} \to \mathbb{R}^{n\times d}$ — same shape in, same shape out — which is precisely why you can stack it $N$ times:

$$ f(\mathbf{x}) = \text{Block}_N\big(\cdots \text{Block}_2(\text{Block}_1(\mathbf{x}))\big) $$
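Because each block returns the same shape it receives, the composition above is just a fold over a list of blocks. The sketch below uses a deliberately simplified stand-in block (one linear map and a tanh inside a residual-and-normalize step) rather than real attention and feed-forward layers; `make_block` is a hypothetical helper, not part of any library.

```python
import numpy as np
from functools import reduce

def layernorm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def make_block(W):
    """Return a stand-in block: residual + norm around one linear map and tanh."""
    return lambda x: layernorm(x + np.tanh(x @ W))

rng = np.random.default_rng(0)
n, d, N = 3, 4, 6                       # tokens, width, number of blocks
blocks = [make_block(rng.standard_normal((d, d))) for _ in range(N)]

x = rng.standard_normal((n, d))
# Block_N(... Block_2(Block_1(x))): feed each output into the next block.
out = reduce(lambda h, blk: blk(h), blocks, x)
print(x.shape, out.shape)
```

Each block gets its own weights; only the shape contract is shared, which is all that stacking needs.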

The final vectors are projected by an unembedding matrix to vocabulary-sized scores (logits) and softmaxed into the probability of the next token. That's the entire forward pass of a GPT.
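The last step can be sketched in a few lines. Here `h` is a random stand-in for the output of the final block, and `W_U` is a hypothetical unembedding matrix; the vocabulary size of 10 is arbitrary. Taking the last row of the probabilities as the next-token prediction follows the usual GPT convention, which the text implies but does not spell out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, vocab = 3, 4, 10
h = rng.standard_normal((n, d))          # stand-in for the final block's output
W_U = rng.standard_normal((d, vocab))    # hypothetical unembedding matrix

logits = h @ W_U                         # (n, vocab): one score per vocabulary entry
probs = softmax(logits)                  # each row sums to 1
next_token_probs = probs[-1]             # the last position predicts the next token
next_token = int(next_token_probs.argmax())   # greedy pick, for illustration
print(logits.shape, next_token)
```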

A transformer block in numpy

```python
import numpy as np

np.random.seed(0)

def softmax(x):
    # subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def layernorm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def ffn(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2    # 2-layer ReLU MLP

def block(x, Wq, Wk, Wv, W1, W2):
    x = layernorm(x + attention(x, Wq, Wk, Wv))   # attention + residual + norm
    x = layernorm(x + ffn(x, W1, W2))             # feed-forward + residual + norm
    return x

x = np.random.randn(3, 4)                                # 3 tokens, width 4
Wq, Wk, Wv = (np.random.randn(4, 4) for _ in range(3))
W1, W2 = np.random.randn(4, 8), np.random.randn(8, 4)    # FFN hidden = 8
print(block(x, Wq, Wk, Wv, W1, W2).shape)                # (3, 4): stack it as many times as you like
```

Depth times width times the square

Architecture → money

The transformer's bill is its shape. Each of the $N$ blocks runs attention, whose cost grows as $O(n^2 d)$ (the $O(n^2)$ in sequence length from the last chapter), and a feed-forward network whose cost scales with the square of the width, $O(n\,d^2)$. Multiply by the number of layers and you have the whole forward pass. A frontier model is roughly $N \approx 100$ blocks, width $d$ in the tens of thousands, over $n$ in the thousands, and it runs that for every token, for hundreds of millions of users.
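To see how the three knobs multiply, here is a back-of-the-envelope FLOP counter. It is a rough sketch, not a benchmark: it counts only matrix multiplies at 2 FLOPs per multiply-add, uses three attention projections and a feed-forward hidden width of `ffn_mult * d`, and ignores softmax, normalization, and the output layer. The function name, the `ffn_mult` default, and the example sizes are all illustrative assumptions.

```python
def forward_flops(n, d, N, ffn_mult=4):
    """Rough matmul FLOPs for one forward pass over n tokens.

    Assumes 2 FLOPs per multiply-add, three d x d attention projections,
    and a feed-forward hidden width of ffn_mult * d.
    """
    attn_proj = 3 * 2 * n * d * d              # x @ Wq, x @ Wk, x @ Wv
    attn_scores = 2 * n * n * d                # Q @ K.T
    attn_mix = 2 * n * n * d                   # attention weights @ V
    ffn = 2 * 2 * n * d * (ffn_mult * d)       # two matmuls through the hidden layer
    quadratic_in_n = N * (attn_scores + attn_mix)
    quadratic_in_d = N * (attn_proj + ffn)
    return {
        "quadratic_in_n": quadratic_in_n,
        "quadratic_in_d": quadratic_in_d,
        "total": quadratic_in_n + quadratic_in_d,
    }

base = forward_flops(n=1024, d=512, N=12)
longer = forward_flops(n=2048, d=512, N=12)    # double the context
wider = forward_flops(n=1024, d=1024, N=12)    # double the width
deeper = forward_flops(n=1024, d=512, N=24)    # double the depth
for name, cost in [("base", base), ("longer", longer), ("wider", wider), ("deeper", deeper)]:
    print(name, cost["total"])
```

Doubling the depth doubles the total, doubling the context quadruples the attention term, and doubling the width quadruples the projection and feed-forward term.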

This is why the architecture is the capex. Every knob that makes models better — more layers, more width, longer context — multiplies straight into FLOPs, GPUs, power, and memory. The transformer turned "make it bigger" into a reliable recipe, and that recipe is what the entire build-out is paying for.

You now hold the whole chain: a token ($n$), its embedding ($d$), the attention ($n^2$), the network ($N$ layers of weights) — assembled here into the machine whose cost is the Circuit. The transformer is where the foundations become an economy.
