Divergent Compute.AI Economic Think Tank

First Principles / Part I · Foundations / Chapter 04

What is attention?

Attention is the move that lets every token look at every other token and decide which ones matter for it right now. It's how a model figures out that "it" refers to "the animal" — and it's the heart of the transformer.

Read at your depth:  01 The answer · 02 Intuition · 03 Mechanics · 04 The math · 05 The code · 06 The economics · 07 Sources

01 The answer, then the intuition

Every word looks at every other word

To understand a word, you need its context. "Bank" means something different by a river than in a sentence about money; "it" only makes sense once you know what it points to. Earlier recurrent models read left to right, squeezing everything they had seen into a single fixed-size state, so distant words tended to fade. Attention fixed that with a brutally direct idea: let every token look at all the others at once, and learn how much to weight each one.

Figure (interactive in the original): a single attention head over one sentence. Arcs show how much each word attends to the others; thicker means more.

The weights are illustrative, for one head. Real models run dozens of these heads in parallel, each learning a different kind of relationship.

03 Mechanics

Query, key, value

Each token produces three vectors from its embedding, via learned weight matrices:

  • Query — what this token is looking for.
  • Key — what each token offers, as an advertisement.
  • Value — the actual content each token will hand over if attended to.
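As a minimal sketch of those three projections: the sizes below are toy values and the weight matrices are random rather than learned, so the names `X`, `W_q`, `W_k`, `W_v` and `d_model` are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 5, 8, 4              # 5 tokens, toy sizes (assumed for illustration)
X = rng.standard_normal((n, d_model))  # one embedding row per token

# Learned during training in a real model; random here because nothing is trained.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token advertises
V = X @ W_v   # what each token hands over if attended to

print(Q.shape, K.shape, V.shape)   # (5, 4) (5, 4) (5, 4)
```

Every token goes through the same three matrices, so one set of weights serves a sentence of any length.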

The mechanism is a soft dictionary lookup. Compare one token's query against every token's key (a dot product) to get a relevance score; the better the match, the higher the score. Run those scores through a softmax so they become weights that sum to one, then take a weighted average of the values. That blend is the token's new, context-aware representation — "it" has now mixed in a large dose of "animal."
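The lookup described above can be traced by hand for a single query. This is only a sketch: the three tokens, their keys and their values are hand-picked toy numbers chosen so that the query from "it" matches the key of "animal", not weights from a real model.

```python
import numpy as np

# Hand-picked toy vectors (assumed for illustration, not real model weights).
tokens = ["the", "animal", "it"]
keys = np.array([[1.0, 0.0],
                 [0.0, 2.0],
                 [0.5, 0.5]])
values = np.array([[1.0, 0.0, 0.0],    # one-hot values make the blend easy to read
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
query_it = np.array([0.0, 2.0])        # the query produced by "it"

scores = keys @ query_it / np.sqrt(len(query_it))   # one relevance score per token
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax: the weights now sum to 1
blended = weights @ values             # weighted average of the values

for tok, w in zip(tokens, weights):
    print(f"{tok:>6}: {w:.2f}")
```

Because the values here are one-hot, `blended` is just the weight vector itself, and "animal" receives by far the largest share.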

Stack this inside the network layers, run many heads in parallel (each catching a different relationship — syntax, coreference, topic), and repeat over dozens of layers. That stack is the transformer, and it's the next chapter.
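Running several heads in parallel can be sketched as one batched computation. The code assumes the common formulation in which the model dimension is split evenly across heads and the concatenated heads pass through an output projection `W_o`; that projection, the function name `multi_head_attention` and all sizes are assumptions for illustration, with random weights standing in for learned ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own slice
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores)                             # one attention map per head
    heads = weights @ V                                   # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # glue the heads back together
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4         # toy sizes
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)        # (6, 16) (4, 6, 6)
```

Each of the four attention maps is a separate $n \times n$ grid, which is where "each head catches a different relationship" lives once the weights are trained.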

04 The math

Scaled dot-product attention

Pack the queries, keys and values for all $n$ tokens into matrices $Q, K, V$, each of shape $n \times d_k$. The entire operation is one line:

$$ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V $$

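Read it inside-out. $QK^{\top}$ is an $n \times n$ matrix: every token's query dotted with every token's key, the full grid of relevance scores. Dividing by $\sqrt{d_k}$ keeps the scores at a stable scale as the dimension grows: if the components of a query and a key are roughly independent with unit variance, their dot product has variance $d_k$, and scores that large would push the softmax toward a near one-hot output with vanishing gradients. The $\text{softmax}$ turns each row into a probability distribution (the attention weights). Multiplying by $V$ takes the weighted average of the values.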
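The effect of the $\sqrt{d_k}$ term is easy to check numerically. The sketch below assumes queries and keys with independent standard-normal components, which is an idealization rather than a property of trained models; the `stats` dictionary is just a holder for the measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
stats = {}   # d_k -> (std of raw dot products, std after scaling)

for d_k in (4, 64, 1024):
    q = rng.standard_normal((2000, d_k))
    k = rng.standard_normal((2000, d_k))
    raw = (q * k).sum(axis=-1)          # 2000 independent query-key dot products
    scaled = raw / np.sqrt(d_k)
    stats[d_k] = (raw.std(), scaled.std())
    print(f"d_k={d_k:5d}  std raw={raw.std():6.2f}  std scaled={scaled.std():5.2f}")
```

The raw spread grows roughly like $\sqrt{d_k}$, while the scaled spread stays near one whatever the dimension.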

Keep your eye on that $QK^{\top}$. It is $n \times n$ — its size grows with the square of the number of tokens. Hold that thought; it is the most expensive fact in all of AI, and it returns in layer 06.

05 The code

Attention in seven lines

The whole mechanism, runnable. Note the attention weights — each row sums to one.

attention.py

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores  = Q @ K.T / np.sqrt(d_k)   # (n, n): every query vs every key
    weights = softmax(scores)          # each row sums to 1
    return weights @ V, weights        # blended values, and the attention map

np.random.seed(0)
Q = np.random.randn(3, 4); K = np.random.randn(3, 4); V = np.random.randn(3, 4)
out, w = attention(Q, K, V)
print(w.round(2))           # [[0.68 0.3  0.02]
                            #  [0.29 0.7  0.01]
                            #  [0.5  0.19 0.31]]
print(w.sum(axis=1))        # [1. 1. 1.]

06 The economics

The square that built the data centers

n² → money

Attention's superpower — letting every token see every other — is also its bill. That $QK^{\top}$ grid is $n \times n$, so the compute grows as $O(n^2)$ in the number of tokens. Double the context, and you quadruple the attention work. This single fact is the most consequential number in AI economics.
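The doubling rule can be read straight off the size of the score grid. A small sketch, counting entries for one head in one layer; the helper name `attention_score_entries` is made up for this example.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Entries in the n x n score grid for one head in one layer."""
    return n_tokens * n_tokens

baseline = attention_score_entries(1_000)
for n in (1_000, 2_000, 4_000, 8_000):
    entries = attention_score_entries(n)
    print(f"n={n:>6,}  score entries={entries:>12,}  ({entries // baseline}x the n=1,000 grid)")
```

Each doubling of $n$ multiplies the grid by four, so going from 1,000 to 8,000 tokens is three doublings and a 64x larger grid.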

It's why early models capped context at a few thousand tokens, why "1 million token context" is a genuine engineering feat rather than a setting, and why running models is so often memory-bound: to avoid recomputing, models cache every token's keys and values (the KV cache), whose size grows linearly with sequence length and devours high-bandwidth memory. The $n$ you met in the token chapter, attention squares.
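A back-of-the-envelope estimate shows how quickly the KV cache adds up. The configuration below (32 layers, 32 key/value heads of width 128, 2 bytes per stored number) is a hypothetical model chosen for round numbers, not a description of any specific system, and `kv_cache_bytes` is a helper invented for this sketch.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * n_tokens * n_layers * n_kv_heads * d_head * bytes_per_value

# Hypothetical model configuration (assumed for illustration).
config = dict(n_layers=32, n_kv_heads=32, d_head=128)

for n in (4_096, 32_768, 1_000_000):
    gib = kv_cache_bytes(n, **config) / 2**30
    print(f"n={n:>9,} tokens  KV cache ~ {gib:7.1f} GiB per sequence")
```

Under these assumptions a single 4,096-token sequence already holds 2 GiB of cache, and the figure scales in direct proportion to the number of tokens.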

So the line of demand runs straight through this layer: longer, smarter context → quadratic compute and linear-but-relentless memory → more GPUs and more HBM → the build-out. A whole research frontier exists just to soften that cost: FlashAttention computes exact attention without ever storing the full $n \times n$ grid in GPU memory, while sparse and linear attention cut the quadratic compute itself. When the Circuit talks about the memory wall, this $n^2$ is the wall.
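One of those softening tricks can be sketched in NumPy: process the keys and values in blocks and keep a running softmax, so the full $n \times n$ grid never exists at once. This illustrates the block-wise idea only; it is not the FlashAttention GPU kernel, and the function names and block size are assumptions for this example.

```python
import numpy as np

def attention_full(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # materializes the (n, n) grid
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def attention_chunked(Q, K, V, block=4):
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))                   # unnormalized weighted sum so far
    running_max = np.full((n, 1), -np.inf)             # largest score seen per query
    running_sum = np.zeros((n, 1))                     # softmax denominator so far
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d_k)                    # only an (n, block) slice of scores
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        rescale = np.exp(running_max - new_max)        # shrink old totals to the new max
        p = np.exp(s - new_max)
        running_sum = running_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ Vb
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
print(np.allclose(attention_full(Q, K, V), attention_chunked(Q, K, V, block=3)))  # True
```

The arithmetic is still quadratic in $n$; what changes is that only an `(n, block)` slice of scores is held in memory at any moment.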

07 Going deeper

The primary sources

Vaswani et al. (2017) — Attention Is All You Need · the transformer, and this exact formula.
Bahdanau, Cho & Bengio (2014) — Neural Machine Translation by Jointly Learning to Align and Translate · attention, before transformers.
Dao et al. (2022) — FlashAttention · making the O(n²) memory-efficient on real GPUs.
Alammar — The Illustrated Transformer · the canonical visual walkthrough.

Cite this chapter: Divergent Compute, "What is attention?", First Principles, 2026. divergentcompute.com/first-principles-attention · v1.0 · CC-BY.
