Divergent Compute.AI Economic Think Tank

First Principles / Part I · Foundations / Chapter 02


What is an embedding?

An embedding turns a token into a point in space — a list of numbers — arranged so that meaning becomes distance. Similar things sit close together, and relationships become directions you can do arithmetic with.


01 The answer, then the intuition

Meaning, turned into geometry

In the last chapter a token became an integer ID. But an ID like 3404 carries no meaning — 3405 isn't "one more" of anything. So the very first thing a model does is replace each ID with an embedding: a vector of a few hundred to a few thousand numbers.

The magic is in how those numbers are arranged. The model learns them so that tokens used in similar ways end up near each other in the space. "Dog" and "cat" land close; "dog" and "Tuesday" land far apart. Meaning is no longer a symbol — it's a location.

And because it's a real vector space, you can do arithmetic on meaning. The famous example: take the vector for king, subtract man, add woman, and the nearest word vector to the result is typically queen. The "royalty" and "gender" relationships are directions in the space.

Figure (interactive on the original page): The embedding space. A 2-D schematic in which words with similar meanings cluster and relationships appear as parallel arrows. Real embeddings live in hundreds of dimensions; this is flattened to two so we can see it.

02 Mechanics

Where the numbers come from

Every model holds an embedding table: a big matrix with one row per vocabulary token. If the vocabulary has $|V|$ tokens and the model width is $d$, that table holds $|V| \times d$ numbers (for GPT-2 small: 50,257 rows of 768). Tokenizing gives you IDs; the embedding step is just a lookup: row 3404 of the table is the vector for token 3404.
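The lookup can be sketched in a few lines. This is a minimal illustration, not any particular model's code: the table sizes, the random values, and the token IDs are all made up, and a trained model would hold learned values instead.

```python
import numpy as np

# Hypothetical sizes for illustration only.
vocab_size, d_model = 5000, 8

rng = np.random.default_rng(0)
# Random stand-in for a learned embedding table: one row per token ID.
embedding_table = rng.normal(size=(vocab_size, d_model))

# Made-up token IDs, as a tokenizer might produce them.
token_ids = np.array([3404, 3405, 17])

# The embedding step is plain row indexing.
vectors = embedding_table[token_ids]
print(vectors.shape)  # (3, 8)

# The same lookup written as a one-hot matrix product.
one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
same = np.allclose(one_hot @ embedding_table, vectors)
print(same)  # True
```

The one-hot form shows why the table counts as ordinary model weights: indexing a row is equivalent to multiplying a one-hot vector by the matrix, only much cheaper.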

Those numbers aren't hand-set; they're learned. During training the model nudges the vectors so that tokens appearing in similar contexts drift together. This is the distributional idea that word2vec put to work: "you shall know a word by the company it keeps" (J. R. Firth, 1957). Modern LLMs learn their embeddings jointly with everything else. The table lookup itself is static, with one row per token, and context enters in the layers that follow: the representation of "bank" shifts depending on whether the sentence is about rivers or money.
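To see "the company it keeps" at work without training anything, here is a rough sketch that swaps in a simpler technique: raw co-occurrence counts instead of learned vectors. It is not how word2vec or an LLM is trained, and the four-sentence corpus is invented for the example. Each word's vector is just a count of the other words that share a sentence with it.

```python
import numpy as np

# Invented toy corpus.
corpus = [
    "the dog chased the ball",
    "the cat chased the ball",
    "we meet on tuesday",
    "we leave on tuesday",
]

sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# counts[w, c] = how often word c appears in the same sentence as word w.
counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j, c in enumerate(s):
            if i != j:
                counts[index[w], index[c]] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

dog, cat, tuesday = (counts[index[w]] for w in ("dog", "cat", "tuesday"))
sim_dog_cat = cosine(dog, cat)          # identical contexts
sim_dog_tuesday = cosine(dog, tuesday)  # no shared contexts
print(round(sim_dog_cat, 3), round(sim_dog_tuesday, 3))
```

In this corpus "dog" and "cat" keep exactly the same company, so their count vectors point the same way, while "dog" and "tuesday" share no context words at all. Learned embeddings compress the same kind of signal into far fewer dimensions.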

The same trick works on anything you can show a model — sentences, images, audio, code. Turn it into a vector, and "similar" becomes "close." That single idea is the engine under semantic search, recommendations, and retrieval.

03 The math

Distance, similarity, and analogy

An embedding is a vector $\mathbf{v} \in \mathbb{R}^d$. To ask "how similar are two tokens," we don't use straight-line distance — we use the angle between their vectors, via cosine similarity:

$$ \cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}} $$

It runs from $-1$ (opposite) through $0$ (unrelated) to $1$ (identical direction). Angle, not magnitude, because what matters is which way a vector points in meaning-space, not how long it is.
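A small check of the "angle, not magnitude" point, using made-up 3-d vectors: stretching a vector leaves its cosine similarity unchanged, while the straight-line distance grows.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([1.0, 2.0, 3.0])
v = 10.0 * u                      # same direction, ten times longer
w = -u                            # opposite direction
x = np.array([3.0, 0.0, -1.0])    # orthogonal to u: 3 + 0 - 3 = 0

cos_uv = cosine(u, v)             # close to 1: same direction
cos_uw = cosine(u, w)             # close to -1: opposite
cos_ux = cosine(u, x)             # 0: unrelated
dist_uv = np.linalg.norm(u - v)   # large, despite identical direction

print(round(cos_uv, 6), round(cos_uw, 6), round(cos_ux, 6))
print(round(dist_uv, 2))
```

Many embedding pipelines normalize vectors to unit length up front, at which point cosine similarity reduces to a plain dot product.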

Analogies fall out of the same geometry. If a relationship — "royal," say — is a consistent direction $\mathbf{r}$, then $\text{king} \approx \text{man} + \mathbf{r}$ and $\text{queen} \approx \text{woman} + \mathbf{r}$. Subtract to isolate the direction and re-add it elsewhere:

$$ \text{king} - \text{man} + \text{woman} \;\approx\; \text{queen} $$
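Written out step by step, under the idealized assumption that both approximations hold with the same direction $\mathbf{r}$:

$$ \text{king} - \text{man} + \text{woman} \;\approx\; (\text{man} + \mathbf{r}) - \text{man} + \text{woman} \;=\; \text{woman} + \mathbf{r} \;\approx\; \text{queen} $$

In real embeddings the match is only approximate, so the result is read off as the nearest word vector by cosine similarity, usually with the three input words excluded from the candidates.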

The model never "knew" that queens are royal women. The geometry encodes it, because that's how the words were used.

04 The code

Similarity and analogy in a few lines

Cosine similarity and the king/queen analogy, on toy vectors — runnable as-is.

embeddings.py

import numpy as np

# toy 4-d embeddings (illustrative; real ones are hundreds of dims)
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.2, 0.8, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.8, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# 1) similar words point the same way
print(round(cosine(emb["king"], emb["queen"]), 3))   # 0.749

# 2) analogy: king - man + woman  ~  queen
v = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(v, emb[w]))
print(best, round(cosine(v, emb["queen"]), 3))        # queen 1.0

The interactive space above is the same idea, with the vectors arranged by hand so you can watch the parallelogram close.

05 The economics

Why "meaning as distance" became an industry

Geometry → money

The moment meaning is a location, finding relevant information becomes finding nearby points. That single move powers semantic search, recommendations, and retrieval-augmented generation (RAG), one of the most common ways companies put their own data in front of an LLM. An entire category of infrastructure, the vector database, exists just to store billions of embeddings and find the nearest ones in milliseconds.
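"Finding nearby points" can be sketched as an exhaustive nearest-neighbor search. Everything here is illustrative: the document titles, the 3-d vectors, and the `top_k` helper are invented for the example, and in practice the vectors would come from an embedding model. Vector databases typically replace this full scan with approximate nearest-neighbor indexes to stay fast at scale.

```python
import numpy as np

def top_k(query, doc_vectors, k=2):
    """Return indices and scores of the k rows most cosine-similar to query."""
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = docs @ q                    # cosine similarity to every document
    order = np.argsort(-scores)[:k]      # highest scores first
    return order, scores[order]

# Hand-made toy embeddings, one row per document.
docs = ["refund policy", "return an item", "office opening hours"]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])

# Stands in for the embedding of a question such as "how do I get my money back?"
query_vector = np.array([1.0, 0.2, 0.0])

idx, scores = top_k(query_vector, doc_vectors, k=2)
results = [docs[i] for i in idx]
print(results)  # ['refund policy', 'return an item']
```

In a RAG setup, the text of the top results would then be placed into the model's prompt alongside the question.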

It isn't free. Every document, product, and message gets embedded (compute), and the vectors are held in fast memory to be searched at scale (memory). The embedding table itself is $|V| \times d$ parameters, the same $|V| \cdot d$ that the token chapter traded against sequence length. At the scale of the modern web, "turn everything into a vector and keep it searchable" is its own slice of data-center demand.
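A back-of-the-envelope version of those two costs. The GPT-2 small figures (50,257 tokens, width 768) are real; the one-billion-vector corpus, the float32 storage, and the helper names are assumptions chosen only to make the arithmetic concrete.

```python
def embedding_table_params(vocab_size, d_model):
    """Parameters in the embedding table: |V| * d."""
    return vocab_size * d_model

def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone (float32 by default), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value

gpt2_params = embedding_table_params(50_257, 768)
corpus_bytes = index_bytes(1_000_000_000, 768)   # hypothetical corpus size

print(f"{gpt2_params:,} parameters")       # 38,597,376 parameters
print(f"{corpus_bytes / 1e12:.3f} TB")     # 3.072 TB
```

The table is a fixed cost of the model; the index grows linearly with the number of items embedded, which is why storage and memory dominate at web scale.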

So embeddings are the second atom — after the token — of the AI economy: the layer that turns meaning into something you can store, search, and bill for. See where it lands in the Circuit.

06 Going deeper

The primary sources

Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space (word2vec) · the paper that made king − man + woman famous.
Pennington, Socher & Manning (2014), GloVe: Global Vectors for Word Representation · global word vectors from co-occurrence.
Devlin et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · contextual embeddings: the same word, different vector by context.
Reimers & Gurevych (2019), Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks · embedding whole sentences for search.

Cite this chapter: Divergent Compute, "What is an embedding?", First Principles, 2026. divergentcompute.com/first-principles-embedding · v1.0 · CC-BY.
