What is an embedding?

An embedding turns a token into a point in space — a list of numbers — arranged so that meaning becomes distance. Similar things sit close together, and relationships become directions you can do arithmetic with.


Meaning, turned into geometry

In the last chapter a token became an integer ID. But an ID like 3404 carries no meaning — 3405 isn't "one more" of anything. So the very first thing a model does is replace each ID with an embedding: a vector of a few hundred to a few thousand numbers.

The magic is in how those numbers are arranged. The model learns them so that tokens used in similar ways end up near each other in the space. "Dog" and "cat" land close; "dog" and "Tuesday" land far apart. Meaning is no longer a symbol — it's a location.

And because it's a real vector space, you can do arithmetic on meaning. The famous example: take the vector for king, subtract man, add woman, and you land very close to queen. The "royalty" and "gender" relationships are directions in the space.

The embedding space

[Figure: a 2-D schematic of the embedding space. Words that mean similar things cluster; relationships are parallel arrows. Real embeddings live in hundreds of dimensions; this is flattened to two so we can see it.]

Where the numbers come from

Every model holds an embedding table: a big matrix with one row per vocabulary token. If the vocabulary has $|V|$ tokens and the model width is $d$, that table holds $|V| \times d$ numbers (for the smallest GPT-2: 50,257 rows of 768 numbers each). Tokenizing gives you IDs; the embedding step is just a lookup, so row 3404 of the table is the vector for token 3404.
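The lookup can be sketched in a few lines of NumPy. The table here is random rather than learned, and the sizes and token IDs are made-up toy values chosen for illustration, not taken from any real model:

```python
import numpy as np

# Hypothetical toy sizes; a real model would use something like 50,257 x 768.
vocab_size, d = 10_000, 8

rng = np.random.default_rng(0)
# One row per vocabulary token. Random here; learned in a real model.
embedding_table = rng.normal(size=(vocab_size, d))

# Token IDs as a tokenizer might produce them (illustrative values).
token_ids = np.array([3404, 17, 3404])

# The embedding step is plain row indexing: one d-dim vector per ID.
vectors = embedding_table[token_ids]

print(vectors.shape)                           # (3, 8)
print(np.array_equal(vectors[0], vectors[2]))  # True: same ID, same row
```

Nothing is computed at this step. The same ID always returns the same row, which is why the table alone cannot tell two senses of a word apart.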

Those numbers aren't hand-set; they're learned. During training the model nudges the vectors so that tokens appearing in similar contexts drift together. This is the distributional idea that word2vec popularized, summed up in the linguist J. R. Firth's line "you shall know a word by the company it keeps." Modern LLMs learn their embedding table jointly with everything else. The table row for a token is the same wherever the token appears; context enters in the layers above, so the model's internal vector for "bank" shifts depending on whether the sentence is about rivers or money.
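One way to picture the nudge is a single gradient step on a skip-gram-style objective, which rewards a high dot product between two tokens seen together. This is a minimal sketch with made-up 2-d vectors and a hypothetical learning rate; it is not how any particular LLM is trained, only an illustration of two vectors drifting together:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two tokens that co-occur in the training text (toy vectors).
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
lr = 0.1  # hypothetical learning rate

before = cosine(u, v)

# Loss for a co-occurring pair: -log(sigmoid(u . v)).
# Its gradient w.r.t. u is -(1 - sigmoid(u . v)) * v, and symmetrically for v.
g = 1.0 - sigmoid(u @ v)
u, v = u + lr * g * v, v + lr * g * u  # one gradient-descent step on both vectors

after = cosine(u, v)
print(before < after)  # True: the pair now points in more similar directions
```

Repeated over many co-occurring pairs (and balanced by pairs that should be pushed apart), steps like this are what arrange the space.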

The same trick works on anything you can show a model — sentences, images, audio, code. Turn it into a vector, and "similar" becomes "close." That single idea is the engine under semantic search, recommendations, and retrieval.

Distance, similarity, and analogy

An embedding is a vector $\mathbf{v} \in \mathbb{R}^d$. To ask "how similar are two tokens," we usually don't use straight-line distance. We use the angle between their vectors, via cosine similarity:

$$ \cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}} $$

It runs from $-1$ (opposite) through $0$ (unrelated) to $1$ (identical direction). Angle, not magnitude, because what matters is which way a vector points in meaning-space, not how long it is.
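The three landmark values, and the point about magnitude, can be checked directly. A minimal sketch with arbitrary 2-d vectors chosen only to make the geometry obvious:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])

print(round(cosine(a, 3 * a), 3))                  # 1.0   same direction, longer vector
print(round(cosine(a, -a), 3))                     # -1.0  opposite direction
print(round(cosine(a, np.array([-2.0, 1.0])), 3))  # 0.0   perpendicular

# Straight-line distance disagrees: a and 3*a are far apart as points,
# even though they point exactly the same way.
print(round(np.linalg.norm(a - 3 * a), 3))         # 4.472
```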

Analogies fall out of the same geometry. If a relationship — "royal," say — is a consistent direction $\mathbf{r}$, then $\text{king} \approx \text{man} + \mathbf{r}$ and $\text{queen} \approx \text{woman} + \mathbf{r}$. Subtract to isolate the direction and re-add it elsewhere:

$$ \text{king} - \text{man} + \text{woman} \;\approx\; \text{queen} $$

The model never "knew" that queens are royal women. The geometry encodes it, because that's how the words were used.

Similarity and analogy in a few lines

    import numpy as np

    # toy 4-d embeddings (illustrative; real ones are hundreds of dims)
    emb = {
        "king":  np.array([0.9, 0.8, 0.1, 0.7]),
        "queen": np.array([0.9, 0.1, 0.8, 0.7]),
        "man":   np.array([0.2, 0.8, 0.1, 0.1]),
        "woman": np.array([0.2, 0.1, 0.8, 0.1]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # 1) similar words point the same way
    print(round(cosine(emb["king"], emb["queen"]), 3))  # 0.749

    # 2) analogy: king - man + woman ~ queen
    v = emb["king"] - emb["man"] + emb["woman"]
    best = max(emb, key=lambda w: cosine(v, emb[w]))
    print(best, round(cosine(v, emb["queen"]), 3))  # queen 1.0

Why "meaning as distance" became an industry

Geometry → money

The moment meaning is a location, finding relevant information becomes finding nearby points. That single move powers semantic search, recommendations, and retrieval-augmented generation (RAG) — the dominant way companies put their own data into an LLM. An entire category of infrastructure, the vector database, exists just to store billions of embeddings and find the nearest ones in milliseconds.
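At its simplest, "finding nearby points" is a brute-force scan: score every stored vector against the query and keep the top few. The sketch below uses made-up 3-d vectors and hypothetical document names. Production vector databases typically rely on approximate indexes rather than a full scan to stay fast at billions of vectors.

```python
import numpy as np

# Hypothetical document embeddings (toy 3-d values, not from a real model).
doc_names = ["refund policy", "shipping times", "careers page", "return a purchase"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
    [0.8, 0.3, 0.1],
])

def top_k(query, vecs, k=2):
    """Return indices of the k rows of vecs most cosine-similar to query."""
    vecs_n = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = vecs_n @ query_n       # cosine similarity per document
    return np.argsort(-scores)[:k]  # highest score first

query = np.array([1.0, 0.2, 0.0])   # stands in for an embedded user question
hits = top_k(query, doc_vecs)
print([doc_names[i] for i in hits])  # ['refund policy', 'return a purchase']
```

Normalizing once up front turns every cosine similarity into a plain dot product, so the whole scan is a single matrix-vector multiply.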

It isn't free. Every document, product, and message gets embedded (compute), and the vectors are held in fast memory to be searched at scale (memory). The embedding table itself is $|V| \times d$ parameters, the same $|V| \cdot d$ that the token chapter traded against sequence length. At the scale of the modern web, "turn everything into a vector and keep it searchable" is its own slice of the data-center demand.
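A back-of-the-envelope calculation makes the two costs concrete. The table figures use the smallest GPT-2 configuration (50,257 tokens, width 768); the corpus size and float32 precision for the vector store are assumptions picked purely for illustration:

```python
# Embedding table: |V| x d parameters (smallest GPT-2: 50,257 tokens, width 768).
vocab_size, d = 50_257, 768
table_params = vocab_size * d
print(table_params)  # 38597376

# Vector store: assumed corpus size and precision, for illustration only.
num_vectors = 1_000_000_000  # hypothetical: one billion stored embeddings
bytes_per_float = 4          # float32
store_bytes = num_vectors * d * bytes_per_float
print(store_bytes / 1e12)    # 3.072 (terabytes, before any index overhead)
```

Under these assumptions the stored vectors, not the model's own table, dominate the memory bill.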

So embeddings are the second atom — after the token — of the AI economy: the layer that turns meaning into something you can store, search, and bill for. See where it lands in the Circuit.
