What is attention?

Attention is the move that lets every token look at every other token and decide which ones matter for it right now. It's how a model figures out that "it" refers to "the animal" — and it's the heart of the transformer.

Read at your depth: 01 The answer · 02 Intuition · 03 Mechanics · 04 The math · 05 The code · 06 The economics · 07 Sources

Every word looks at every other word

To understand a word, you need its context. "Bank" means something different by a river than in a sentence about money; "it" only makes sense once you know what it points to. Earlier recurrent models read left to right and squeezed everything seen so far into one fixed-size state, so distant words had often faded by the time they were needed. Attention fixed that with a brutally direct idea: let every token look at all the others at once, and learn how much to weight each one.

[Interactive figure: a single attention head over one sentence. Clicking a word draws arcs to the words it attends to, and thicker arcs mean larger weights. Notice where "it" looks. The weights shown are illustrative.]

Real models run dozens of these heads in parallel, each learning a different kind of relationship.

Query, key, value

Each token produces three vectors from its embedding, via learned weight matrices:

Query — what this token is looking for.
Key — what each token offers, as an advertisement.
Value — the actual content each token will hand over if attended to.
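The three projections above can be sketched in a few lines of NumPy. Everything concrete here is an assumption made for illustration: the sizes `n_tokens`, `d_model` and `d_k`, the names `W_q`, `W_k`, `W_v`, and the random numbers standing in for learned embeddings and weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_k = 5, 8, 4   # illustrative sizes, not from any real model

# One embedding row per token (random stand-ins for learned embeddings).
X = rng.standard_normal((n_tokens, d_model))

# Weight matrices; random here because nothing has been trained.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token advertises
V = X @ W_v   # what each token hands over if attended to

print(Q.shape, K.shape, V.shape)  # (5, 4) (5, 4) (5, 4)
```

Each row of `Q`, `K` and `V` belongs to one token, so the same three matrices serve every position in the sentence.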

The mechanism is a soft dictionary lookup. Compare one token's query against every token's key (a dot product) to get a relevance score; the better the match, the higher the score. Run those scores through a softmax so they become weights that sum to one, then take a weighted average of the values. That blend is the token's new, context-aware representation — "it" has now mixed in a large dose of "animal."
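To make the lookup concrete, here is a toy version for a single query. The vectors are hand-picked assumptions, the word labels in the comments are purely illustrative, and the scaling factor introduced with the formula later is left out to match the description above.

```python
import numpy as np

# Hand-picked toy vectors (assumptions for illustration only).
keys = np.array([[1.0, 0.0],     # token 0, say "animal"
                 [0.0, 1.0],     # token 1, say "street"
                 [0.5, 0.5]])    # token 2, say "it"
values = np.array([[10.0, 0.0],
                   [0.0, 10.0],
                   [5.0, 5.0]])
query = np.array([2.0, 0.0])     # the query produced by "it"

scores = keys @ query                      # dot product with every key
weights = np.exp(scores - scores.max())    # softmax, shifted for stability
weights /= weights.sum()
blended = weights @ values                 # weighted average of the values

print(scores)             # [2. 0. 1.]
print(weights.round(2))   # [0.67 0.09 0.24]
print(blended.round(2))   # [7.88 2.12]
```

The query lines up best with the first key, so about two thirds of the blended result comes from the first value.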

Stack this inside the network layers, run many heads in parallel (each catching a different relationship — syntax, coreference, topic), and repeat over dozens of layers. That stack is the transformer, and it's the next chapter.
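Running several heads in parallel might look like the sketch below. It assumes the common arrangement in which the model dimension is split evenly across heads and an output matrix (called `W_o` here) mixes the concatenated heads back together. The function name, sizes and random weights are all illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads   # assumes d_model is divisible by n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): one slice of features per head.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores)                             # one attention map per head
    heads = weights @ V                                   # (n_heads, n, d_head)
    # Concatenate the heads back to (n, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4   # illustrative sizes
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)   # (6, 16) (4, 6, 6)
```

Each of the four heads gets its own 6 × 6 attention map, which is what lets different heads specialize in different relationships.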

Scaled dot-product attention

Pack the queries, keys and values for all $n$ tokens into matrices $Q, K, V$, each of shape $n \times d_k$. The entire operation is one line:

$$ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V $$

Read it inside-out. $QK^{\top}$ is an $n \times n$ matrix: every token's query dotted with every token's key, the full grid of relevance scores. Dividing by $\sqrt{d_k}$ keeps the scores from growing with the vector dimension; left unscaled, large dot products push the softmax toward a near one-hot output where gradients become tiny. The $\text{softmax}$ turns each row into a probability distribution (the attention weights). Multiplying by $V$ takes the weighted average of the values.
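A quick experiment shows why $\sqrt{d_k}$ is the right divisor. It assumes the components of each query and key are independent with mean 0 and variance 1, in which case a dot product over $d_k$ components has variance $d_k$ and so a standard deviation of $\sqrt{d_k}$. The sample size and the three dimensions tried are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
raw_std, scaled_std = {}, {}

for d_k in (4, 64, 1024):
    # 2,000 random query/key pairs with zero-mean, unit-variance components.
    q = rng.standard_normal((2000, d_k))
    k = rng.standard_normal((2000, d_k))
    dots = (q * k).sum(axis=1)                  # one dot product per pair
    raw_std[d_k] = dots.std()                   # grows roughly like sqrt(d_k)
    scaled_std[d_k] = (dots / np.sqrt(d_k)).std()   # stays near 1
    print(f"d_k={d_k:5d}  raw std={raw_std[d_k]:6.2f}  scaled std={scaled_std[d_k]:.2f}")
```

Under these assumptions the raw spread should land near 2, 8 and 32, while the scaled spread stays close to 1 whatever the dimension.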

Keep your eye on that $QK^{\top}$. It is $n \times n$ — its size grows with the square of the number of tokens. Hold that thought; it is the most expensive fact in all of AI, and it returns in layer 06.

Attention in seven lines

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every query vs every key
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights       # blended values, and the attention map

np.random.seed(0)
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
out, w = attention(Q, K, V)
print(w.round(2))
# [[0.68 0.3  0.02]
#  [0.29 0.7  0.01]
#  [0.5  0.19 0.31]]
print(w.sum(axis=1))  # [1. 1. 1.]
```

The square that built the data centers

n² → money

Attention's superpower — letting every token see every other — is also its bill. That $QK^{\top}$ grid is $n \times n$, so the compute grows as $O(n^2)$ in the number of tokens. Double the context, and you quadruple the attention work. This single fact is the most consequential number in AI economics.
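The quadrupling is easy to check with arithmetic. The helper names below are hypothetical, and the multiply-add count is a rough sketch covering only the two matrix products in the formula, assuming values of width $d_k$, for one head in one layer.

```python
def score_entries(n_tokens: int) -> int:
    """Entries in the n x n grid QK^T for one head in one layer."""
    return n_tokens * n_tokens

def attention_multiply_adds(n_tokens: int, d_k: int) -> int:
    """Rough count: n*n*d_k for QK^T plus n*n*d_k for weights @ V."""
    return 2 * n_tokens * n_tokens * d_k

for n in (1_000, 2_000, 4_000, 8_000):
    print(f"{n:>6,} tokens -> {score_entries(n):>12,} scores, "
          f"{attention_multiply_adds(n, d_k=64):>16,} multiply-adds")
```

Every doubling of `n` multiplies both columns by four, so going from 1,000 to 8,000 tokens costs 64 times as much attention work.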

It's why early models capped context at a few thousand tokens, why "1 million token context" is a genuine engineering feat rather than a setting, and why serving models is so often memory-bound. To avoid recomputing earlier tokens at every generation step, models cache each token's keys and values (the KV cache), and that cache grows linearly with the sequence and devours high-bandwidth memory. Attention takes the $n$ you met in the token chapter and squares it.
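A back-of-the-envelope estimate shows how quickly the KV cache adds up. The model shape below (32 layers, 32 key/value heads of width 128, 2 bytes per number) is a hypothetical configuration chosen only to make the arithmetic concrete, and `kv_cache_bytes` is an illustrative helper rather than any library's API.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    # Two tensors (keys and values) per layer, each n_tokens x n_kv_heads x d_head.
    return 2 * n_layers * n_tokens * n_kv_heads * d_head * bytes_per_value

# Hypothetical model shape, not taken from any specific model.
cfg = dict(n_layers=32, n_kv_heads=32, d_head=128)

for n in (4_096, 131_072, 1_000_000):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    print(f"{n:>9,} tokens -> {gib:8.1f} GiB per sequence")
```

With these assumed numbers each token costs half a mebibyte, so 4,096 tokens need 2 GiB and a million tokens need several hundred GiB for a single sequence. The growth is linear in `n`, but it is paid for every sequence being served at once.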

So the line of demand runs straight through this layer: longer, smarter context → quadratic compute and linear-but-relentless memory → more GPUs and more HBM → the build-out. A whole research frontier exists just to tame that cost: FlashAttention computes exact attention without ever storing the full $n \times n$ grid in memory, while sparse and linear attention trade exactness for a lower exponent. When the Circuit talks about the memory wall, this $n^2$ is the wall.
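One idea behind the memory-saving approaches can be sketched in NumPy: process the keys and values one block at a time and keep a running softmax, so the full $n \times n$ grid is never held at once. This is only a sketch of that streaming-softmax idea and not FlashAttention itself, which also tiles the queries and is written as a fused GPU kernel. The function names and block size are assumptions.

```python
import numpy as np

def full_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # materializes the (n, n) grid
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))      # running unnormalized weighted sum of values
    row_max = np.full((n, 1), -np.inf)    # running max score per query
    row_sum = np.zeros((n, 1))            # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d_k)                       # only (n, block) at a time
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        rescale = np.exp(row_max - new_max)               # re-base old totals on the new max
        p = np.exp(s - new_max)
        row_sum = row_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
print(np.allclose(full_attention(Q, K, V), tiled_attention(Q, K, V, block=4)))  # True
```

The result matches ordinary attention to floating-point precision. The amount of arithmetic is still quadratic; what shrinks is how much of the grid has to sit in memory at any moment.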
