First Principles / Part I · Foundations / Chapter 04
Attention is the move that lets every token look at every other token and decide which ones matter for it right now. It's how a model figures out that "it" refers to "the animal" — and it's the heart of the transformer.
01 The answer, then the intuition
To understand a word, you need its context. "Bank" means something different by a river than in a sentence about money; "it" only makes sense once you know what it points to. Earlier recurrent models read left to right, squeezing everything seen so far through a fixed-size hidden state, and distant words got forgotten. Attention fixed that with a brutally direct idea: let every token look at all the others at once, and learn how much to weight each one.
[Interactive figure: a single attention head over one sentence. Selecting a word draws arcs to the words it attends to; thicker arcs mean more attention. Note where "it" looks.]
Illustrative weights for one head. Real models run dozens of these heads in parallel, each learning a different kind of relationship.
02 Mechanics
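Each token produces three vectors from its embedding $x$, via learned weight matrices $W_Q$, $W_K$, $W_V$:
Query $q = xW_Q$ · what this token is looking for.
Key $k = xW_K$ · what this token offers to be matched against.
Value $v = xW_V$ · the content this token contributes when it is attended to.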
The mechanism is a soft dictionary lookup. Compare one token's query against every token's key (a dot product) to get a relevance score; the better the match, the higher the score. Run those scores through a softmax so they become weights that sum to one, then take a weighted average of the values. That blend is the token's new, context-aware representation — "it" has now mixed in a large dose of "animal."
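The lookup for a single token can be traced by hand. The sketch below is a minimal illustration with made-up numbers: the three tokens, their keys and values, and the query for "it" are all hypothetical, chosen so that the query matches the key of "animal". It follows the prose above literally (dot products, softmax, weighted average) and leaves out the scaling factor introduced in the math section.

```python
import numpy as np

# Hypothetical toy data: 3 tokens with 2-dimensional keys and values.
tokens = ["the", "animal", "it"]
keys = np.array([[1.0, 0.0],
                 [0.0, 2.0],
                 [0.5, 0.5]])
values = np.array([[10.0, 0.0],
                   [0.0, 10.0],
                   [5.0, 5.0]])
query_it = np.array([0.0, 2.0])  # query for "it", picked to align with "animal"

scores = keys @ query_it                 # one dot product per key: relevance scores
weights = np.exp(scores - scores.max())  # softmax, shifted by the max for stability
weights /= weights.sum()                 # weights now sum to one
blended = weights @ values               # weighted average of the values

for tok, w in zip(tokens, weights):
    print(f"{tok:>6}: {w:.3f}")
print("new representation of 'it':", blended.round(2))
```

With these numbers nearly all of the weight lands on "animal", so the blended vector sits close to that token's value vector.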
Stack this inside the network layers, run many heads in parallel (each catching a different relationship — syntax, coreference, topic), and repeat over dozens of layers. That stack is the transformer, and it's the next chapter.
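Running several heads in parallel amounts to splitting the feature dimension into slices and attending within each slice independently. The sketch below assumes the layout from Vaswani et al.: the model width `d_model` is divided evenly among `n_heads`, and a final output projection `W_o` mixes the heads back together. The function name, the random weights, and the sizes are illustrative, not taken from any particular model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); each W_*: (d_model, d_model). Returns output and per-head maps."""
    n, d_model = X.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): one slice of features per head
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    weights = softmax(scores)                             # one attention map per head
    heads = weights @ V                                   # (n_heads, n, d_head)
    # Concatenate heads along the feature axis, then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out, maps = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, maps.shape)  # (5, 8) (2, 5, 5)
```

Each head gets its own $n \times n$ attention map, which is what lets different heads specialise in different relationships.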
04 The math
Pack the queries, keys and values for all $n$ tokens into matrices $Q, K, V$, each of shape $n \times d_k$. The entire operation is one line:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
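Read it inside-out. $QK^{\top}$ is an $n \times n$ matrix: every token's query dotted with every token's key, the full grid of relevance scores. Dividing by $\sqrt{d_k}$ keeps the scores from growing with the dimension: a dot product of $d_k$ roughly unit-variance terms has standard deviation near $\sqrt{d_k}$, and unscaled scores that large would push the softmax toward putting all its weight on a single token. The $\text{softmax}$ turns each row into a probability distribution (the attention weights). Multiplying by $V$ takes the weighted average of the values.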
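The effect of the $\sqrt{d_k}$ division can be checked numerically. The sketch below assumes query and key components are independent standard normal draws, which is an idealisation rather than a property of trained models, and measures the spread of the dot products before and after scaling. The sample count and the dimensions tried are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
stds = {}
for d_k in (4, 64, 1024):
    q = rng.standard_normal((4_000, d_k))
    k = rng.standard_normal((4_000, d_k))
    raw = (q * k).sum(axis=1)        # 4,000 sample query-key dot products
    scaled = raw / np.sqrt(d_k)      # the same scores after the sqrt(d_k) division
    stds[d_k] = (raw.std(), scaled.std())
    print(f"d_k={d_k:>4}  raw std={raw.std():6.2f}  scaled std={scaled.std():.2f}")
```

Under this assumption the raw spread grows roughly like $\sqrt{d_k}$, while the scaled scores stay near a standard deviation of one regardless of dimension.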
Keep your eye on that $QK^{\top}$. It is $n \times n$, so its size grows with the square of the number of tokens. Hold that thought; it is the most expensive fact in all of AI, and it returns in section 06.
05 The code
The whole mechanism, runnable. Note the attention weights: each row sums to one.
attention.py

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every query vs every key
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights       # blended values, and the attention map

np.random.seed(0)
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
out, w = attention(Q, K, V)
print(w.round(2))      # [[0.68 0.3  0.02]
                       #  [0.29 0.7  0.01]
                       #  [0.5  0.19 0.31]]
print(w.sum(axis=1))   # [1. 1. 1.]
06 The economics
n² → money
Attention's superpower — letting every token see every other — is also its bill. That $QK^{\top}$ grid is $n \times n$, so the compute grows as $O(n^2)$ in the number of tokens. Double the context, and you quadruple the attention work. This single fact is the most consequential number in AI economics.
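The doubling claim is plain arithmetic, and a few lines make it concrete. The sketch below counts only the work of forming the $QK^{\top}$ grid for one head; the helper name and the head dimension of 128 are assumptions for illustration, and real costs also include the multiplication by $V$, the other heads, and the rest of the network.

```python
def attention_score_cost(n_tokens, d_k):
    """Rough cost of forming QK^T for one head: grid entries and multiply-adds."""
    entries = n_tokens * n_tokens     # size of the n x n score matrix
    multiply_adds = entries * d_k     # each entry is a d_k-term dot product
    return entries, multiply_adds

d_k = 128  # hypothetical head dimension
for n in (1_000, 2_000, 4_000, 1_000_000):
    entries, macs = attention_score_cost(n, d_k)
    print(f"n={n:>9,}  entries={entries:.2e}  multiply-adds={macs:.2e}")
```

Going from 1,000 to 2,000 tokens takes the grid from one million to four million entries; at a million tokens it has $10^{12}$ entries per head.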
It's why early models capped context at a few thousand tokens, why "1 million token context" is a genuine engineering feat rather than a setting, and why serving models is so often memory-bound: to avoid recomputing past tokens at every generation step, models cache each token's keys and values in every layer (the KV cache), and that cache grows linearly with the sequence and devours high-bandwidth memory. The $n$ you met in the token chapter, attention squares.
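A back-of-the-envelope estimate shows how quickly the KV cache fills memory. The sketch below assumes the cache stores one key and one value vector per token, per layer, per key-value head; the function name and the model shape (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Keys and values (hence the factor 2) for every token, layer, and KV head."""
    return 2 * n_layers * n_kv_heads * d_head * n_tokens * bytes_per_value

# Hypothetical model shape; 2 bytes per value corresponds to fp16 or bf16.
cfg = dict(n_layers=32, n_kv_heads=32, d_head=128, bytes_per_value=2)
for n in (4_096, 131_072, 1_000_000):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    print(f"{n:>9,} tokens -> {gib:,.1f} GiB per sequence")
```

Under these assumptions each token costs half a mebibyte, so a 4,096-token sequence needs 2 GiB and a million-token sequence needs roughly 488 GiB, for a single sequence. The growth is linear in $n$, but the constant is large.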
So the line of demand runs straight through this layer: longer, smarter context → quadratic compute and linear-but-relentless memory → more GPUs and more HBM → the build-out. A whole research frontier exists just to soften that cost: FlashAttention keeps the exact $O(n^2)$ arithmetic but avoids writing the full $n \times n$ grid to slow memory, while sparse and linear attention try to lower the exponent itself. When the Circuit talks about the memory wall, this $n^2$ is the wall.
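One of those softening tricks can be sketched in NumPy. The code below illustrates the running-softmax idea that FlashAttention builds on: process keys and values one block at a time and rescale the partial results, so the full $n \times n$ grid is never held at once. It is a simplified sketch, not the published GPU kernel: it tiles only over keys, ignores the GPU memory hierarchy, and the function names and block size are made up for illustration. The arithmetic is still $O(n^2)$; only the peak memory changes.

```python
import numpy as np

def attention_reference(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # full (n, n) grid
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def attention_blockwise(Q, K, V, block=64):
    """Exact attention over key/value blocks, using a running softmax."""
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max score per query
    row_sum = np.zeros((n, 1))           # running softmax denominator per query
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d_k)                   # only (n, block) at a time
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        rescale = np.exp(row_max - new_max)           # shrink earlier accumulators
        p = np.exp(s - new_max)                       # unnormalised weights for this block
        row_sum = row_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ Vb
        row_max = new_max
    return out / row_sum                              # normalise once at the end

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((200, 16)) for _ in range(3))
print(np.allclose(attention_reference(Q, K, V), attention_blockwise(Q, K, V)))
```

The two functions agree to floating-point tolerance, which is the point: the blockwise version is a reordering of the same computation, not an approximation.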
07 Going deeper
Vaswani et al. (2017) — Attention Is All You Need · the transformer, and this exact formula.
Bahdanau, Cho & Bengio (2014) — Neural Machine Translation by Jointly Learning to Align and Translate · attention, before transformers.
Dao et al. (2022) — FlashAttention · making the O(n²) memory-efficient on real GPUs.
Alammar — The Illustrated Transformer · the canonical visual walkthrough.
Cite this chapter: Divergent Compute, "What is attention?", First Principles, 2026. divergentcompute.com/first-principles-attention · v1.0 · CC-BY.