First Principles / Part I · Foundations / Chapter 05
A transformer is the assembly. Take embeddings, attention, and a small neural network, wire them into a repeatable block, and stack that block dozens of times. That stack is what an LLM is.
01 The answer, then the intuition
You've met every piece. A transformer just connects them. Text becomes tokens, tokens become embeddings, and then the data flows through a stack of identical blocks. Inside each block, attention lets tokens share information, and a small neural network processes each one. Repeat, and out the top come probabilities for the next token.
[Figure: the transformer architecture, stage by stage. Each stage is one of the earlier chapters doing its job. Data flows up. The shaded block is the unit that repeats.]
02 Mechanics
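The whole architecture is one block applied over and over. Inside it are two ideas you've seen, attention and a feed-forward network, plus two pieces of plumbing that make deep stacks trainable: residual connections, which add each sub-layer's input back onto its output, and layer normalization, which rescales each token vector before it moves on.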
One detail matters: attention is order-blind. Shuffle the tokens and each token still gets the same output, just in the shuffled order. So before the first block, each embedding has a positional encoding added to it, a signal telling the model where each token sits. Stack ~12 blocks for a small model or ~100 for a frontier one, cap the stack with an output layer that turns the final vectors into next-token probabilities, and you have GPT.
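The order-blindness claim can be checked numerically. The sketch below is a minimal illustration, not the chapter's own code: it assumes the sinusoidal scheme from Vaswani et al. (2017) as the positional encoding (the chapter does not say which scheme it has in mind), and the names `sinusoidal_positions`, `perm`, `order_blind`, and `order_aware` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def sinusoidal_positions(n, d):
    # Assumed scheme (Vaswani et al., 2017): even columns get sin, odd columns get cos.
    pos = np.arange(n)[:, None]             # (n, 1) token positions
    i = np.arange(d // 2)[None, :]          # (1, d/2) frequency index
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

n, d = 5, 8                                 # toy sizes, chosen for illustration
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
perm = np.array([3, 0, 4, 1, 2])            # a fixed shuffle of the 5 tokens

# Without positions: shuffling the input rows just shuffles the output rows.
out = attention(x, Wq, Wk, Wv)
out_shuffled = attention(x[perm], Wq, Wk, Wv)
order_blind = np.allclose(out_shuffled, out[perm])

# With positions: a token's vector now depends on where it sits,
# so the shuffled sequence no longer produces a shuffled copy of the output.
pe = sinusoidal_positions(n, d)
out_pos = attention(x + pe, Wq, Wk, Wv)
out_pos_shuffled = attention(x[perm] + pe, Wq, Wk, Wv)
order_aware = not np.allclose(out_pos_shuffled, out_pos[perm])

print(order_blind, order_aware)
```

Both flags should come out true: attention alone cannot tell the two orderings apart, and adding a position signal is what lets it. Learned positional tables, which many GPT-style models use instead, would serve the same purpose here.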
04 The math
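Let $\mathbf{x}$ be the matrix of token vectors entering a block. The block is exactly two residual-and-normalize steps, first attention, then the feed-forward network:
$$\mathbf{x}' = \text{LayerNorm}\big(\mathbf{x} + \text{Attention}(\mathbf{x})\big), \qquad \text{Block}(\mathbf{x}) = \text{LayerNorm}\big(\mathbf{x}' + \text{FFN}(\mathbf{x}')\big)$$
where the feed-forward network is a two-layer MLP applied to each token, $\text{FFN}(\mathbf{x}) = \max(0,\,\mathbf{x}W_1)\,W_2$. Crucially, the block maps $\mathbb{R}^{n\times d} \to \mathbb{R}^{n\times d}$, same shape in, same shape out, which is precisely why you can stack it $N$ times:
$$\text{Block}_N\big(\cdots\,\text{Block}_2\big(\text{Block}_1(\mathbf{x})\big)\cdots\big)$$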
The final vectors are projected by an unembedding matrix to vocabulary-sized scores (logits) and softmaxed into the probability of the next token. That's the entire forward pass of a GPT.
05 The code
The block, composing the pieces from the earlier chapters. Note that the output shape equals the input shape: that's the stacking property.
block.py
import numpy as np

np.random.seed(0)

def softmax(x):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def layernorm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def ffn(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2  # 2-layer ReLU MLP

def block(x, Wq, Wk, Wv, W1, W2):
    x = layernorm(x + attention(x, Wq, Wk, Wv))  # attention + residual + norm
    x = layernorm(x + ffn(x, W1, W2))            # feed-forward + residual + norm
    return x

x = np.random.randn(3, 4)                               # 3 tokens, width 4
Wq, Wk, Wv = (np.random.randn(4, 4) for _ in range(3))
W1, W2 = np.random.randn(4, 8), np.random.randn(8, 4)   # FFN hidden = 8
print(block(x, Wq, Wk, Wv, W1, W2).shape)  # (3, 4): stack it as many times as you like
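To see the stacking property at work, here is a rough sketch of the whole forward pass: embed the tokens, add positions, run several blocks, then unembed and softmax. It is self-contained, so it redefines the block in compact form. Everything beyond the block is an assumption made for illustration: the toy sizes, the random (untrained) weights, the table-lookup positional encoding, and the helper names `init_block` and `forward`. Like block.py, it leaves out the causal mask and multiple heads that real GPT models use.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layernorm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def block(x, p):
    # Same two residual-and-normalize steps as block.py, with weights held in a dict.
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    x = layernorm(x + softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V)
    x = layernorm(x + np.maximum(0, x @ p["W1"]) @ p["W2"])
    return x

def init_block(d, hidden):
    # Small random weights; a trained model would have learned these.
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d),
              "W1": (d, hidden), "W2": (hidden, d)}
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}

vocab, d, n_blocks, max_len = 10, 4, 3, 16      # toy sizes, chosen for illustration
embed = rng.standard_normal((vocab, d))         # token embedding table
positions = rng.standard_normal((max_len, d))   # positional table (assumed scheme)
blocks = [init_block(d, 2 * d) for _ in range(n_blocks)]
unembed = rng.standard_normal((d, vocab))       # projects final vectors to logits

def forward(token_ids):
    x = embed[token_ids] + positions[: len(token_ids)]  # tokens -> embeddings + positions
    for p in blocks:                                     # same block shape, N times
        x = block(x, p)
    logits = x @ unembed                                 # (n, vocab) scores
    return softmax(logits[-1])                           # next-token distribution

probs = forward(np.array([1, 7, 2]))
print(probs.shape, probs.sum())
```

Because every block returns an array of the shape it received, the loop needs no bookkeeping: changing `n_blocks` is the only edit required to make the stack deeper.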
06 The economics
Architecture → money
The transformer's bill is its shape. Each of the $N$ blocks runs an attention step (the $O(n^2)$ from the last chapter, or $O(n^2 d)$ once the vector width is counted) and a feed-forward network whose cost scales with the square of the width, $O(n\,d^2)$. Add the two, multiply by the number of layers, and you have the whole forward pass: $O\big(N\,(n^2 d + n\,d^2)\big)$. A frontier model is roughly $N \approx 100$ blocks, width $d$ in the tens of thousands, over $n$ in the thousands, and it runs that for every token, for hundreds of millions of users.
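The scaling above can be turned into a back-of-the-envelope calculator. This is a rough sketch under stated assumptions, not a precise accounting: it counts only the large matrix multiplies (two FLOPs per multiply-add), assumes three $d \times d$ projections per block as in this chapter's attention, and assumes a feed-forward hidden width of `ffn_mult * d` with a default of 4, a common convention that the chapter itself does not specify. The function name `forward_flops` and the two model shapes are hypothetical.

```python
def forward_flops(n, d, n_blocks, ffn_mult=4):
    """Rough FLOP count for one forward pass over n tokens (matmuls only)."""
    qkv = 3 * 2 * n * d * d                 # three d x d projections for every token
    scores = 2 * n * n * d                  # Q @ K.T
    mix = 2 * n * n * d                     # attention weights @ V
    ffn = 2 * 2 * n * d * (ffn_mult * d)    # two matmuls through the hidden layer
    per_block = {
        "attention_n2": scores + mix,       # grows with the square of context length
        "projections_d2": qkv,              # grows with the square of width
        "ffn_d2": ffn,                      # grows with the square of width
    }
    total = n_blocks * sum(per_block.values())
    return total, per_block

# Two illustrative shapes (assumed, not taken from any specific model).
small_total, small_parts = forward_flops(n=1024, d=768, n_blocks=12)
large_total, large_parts = forward_flops(n=4096, d=16384, n_blocks=100)

print(f"small: {small_total:.3e} FLOPs")
print(f"large: {large_total:.3e} FLOPs")
print(f"ratio: {large_total / small_total:.0f}x")
```

Doubling `n` quadruples the `attention_n2` term while only doubling the width-squared terms, and doubling `d` quadruples the width-squared terms, which is the sense in which every knob multiplies straight into compute.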
This is why the architecture is the capex. Every knob that makes models better — more layers, more width, longer context — multiplies straight into FLOPs, GPUs, power, and memory. The transformer turned "make it bigger" into a reliable recipe, and that recipe is what the entire build-out is paying for.
You now hold the whole chain: a token ($n$), its embedding ($d$), the attention ($n^2$), the network ($N$ layers of weights) — assembled here into the machine whose cost is the Circuit. The transformer is where the foundations become an economy.
07 Going deeper
Vaswani et al. (2017) — Attention Is All You Need · the transformer architecture itself.
Radford et al. (2018) — GPT · the decoder-only transformer that became the LLM.
Phuong & Hutter (2022) — Formal Algorithms for Transformers · the architecture written out precisely.
Alammar — The Illustrated Transformer · the canonical visual walkthrough.
Cite this chapter: Divergent Compute, "What is a transformer?", First Principles, 2026. divergentcompute.com/first-principles-transformer · v1.0 · CC-BY.