First Principles / Part I · Foundations / Chapter 01
A token is the basic unit of text an AI model reads and writes — not a word, not a letter, but a chunk. Models don't see language; they see tokens.
Figure: "Tokenization isn't magic." shown above its token chunks. 3 words become 6 tokens. The model reads the chunks on the bottom, never the sentence on top.
01 The answer, then the intuition
Before a language model reads a single thing, your text is chopped into tokens by a piece of software called a tokenizer. A token can be a whole word (cat), a fragment of one (token + ization), a single character, a space, or a punctuation mark. Common words usually become one token; rare words get split into pieces.
The model then works entirely in tokens — it predicts the next token, over and over, and the tokens are stitched back into text at the end. Three things you'll meet everywhere downstream are all measured in tokens: the context window (how much it can read at once), the price (APIs bill per token), and the speed (tokens per second).
A useful rule of thumb for English: 1 token ≈ 4 characters ≈ 0.75 words. So a 1,000-word essay is roughly 1,300 tokens.
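The rule of thumb is simple enough to turn into a back-of-envelope estimator. The sketch below is illustrative only: the function names are made up for this example, and the ratios are the approximate English averages quoted above, not properties of any particular tokenizer.

```python
def estimate_tokens_from_words(words: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a word count (English rule of thumb)."""
    return round(words / words_per_token)


def estimate_tokens_from_chars(chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (English rule of thumb)."""
    return round(chars / chars_per_token)


print(estimate_tokens_from_words(1_000))  # 1333, i.e. "roughly 1,300 tokens"
print(estimate_tokens_from_chars(6_000))  # 1500
```

Treat the result as a planning number. For anything that affects billing or a context limit, count with the real tokenizer instead.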
02 Mechanics
Almost every modern model tokenizes with a variant of Byte-Pair Encoding (BPE). The idea is borrowed from a 1994 data-compression trick and is delightfully simple. You build the vocabulary by merging:
1. Start with a base alphabet of single bytes (256 symbols).
2. Count every adjacent pair of symbols in the training corpus.
3. Merge the most frequent pair into a new symbol and add it to the vocabulary.
4. Repeat until the vocabulary reaches the target size: cl100k at ~100,000, the newest at ~200,000.
The result is a learned list of merges. To tokenize new text, the tokenizer greedily applies those merges: frequent strings like "the" survive as one token, while a rare word like "antidisestablishmentarianism" gets rebuilt from several pieces. Each final token maps to an integer ID, and that ID indexes a row in the model's embedding table, which is the next chapter. Text → tokens → IDs → vectors.
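To make the "apply the merges, then look up IDs" step concrete, here is a minimal sketch. It replays a merge list in the order the merges were learned, which is a simplification of what production encoders do. The merge list, the lowercase-letter base alphabet, and the names `apply_merges` and `encode` are all hypothetical, chosen only for illustration; real tokenizers start from the 256 bytes and carry tens of thousands of merges.

```python
import string


def apply_merges(word, merges):
    """Split a word into characters, then replay each learned merge in order."""
    tokens = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)  # the pair collapses into one symbol
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


# Hypothetical merge list, as if learned from a corpus full of "the" and "-ing".
merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

# Toy vocabulary: base symbols first (IDs 0-25), then one new ID per merge.
vocab = list(string.ascii_lowercase) + [a + b for a, b in merges]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(word):
    """Text -> tokens -> IDs."""
    return [token_to_id[t] for t in apply_merges(word, merges)]


print(apply_merges("the", merges))    # ['the']
print(apply_merges("thing", merges))  # ['th', 'ing']
print(apply_merges("zebra", merges))  # ['z', 'e', 'b', 'r', 'a']
print(encode("thing"))                # [26, 29]
```

The frequent string collapses to a single token, while the word no merge covers stays as five single-character pieces. The integer IDs are what the model's embedding table is indexed by.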
Two consequences worth holding onto: spaces are usually glued to the front of the next word (" token", with a leading space, is a different token from "token"), and the same word can tokenize differently depending on what's around it. That is why token counts feel slightly unpredictable, and why you should measure, not guess.
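The leading-space behaviour comes from a pre-tokenization step that chops text into chunks before any merges run. The sketch below uses a deliberately simplified regular expression, an assumption for illustration and not the pattern any production tokenizer actually ships, to show how a space ends up attached to the word that follows it.

```python
import re

# Simplified, assumed pattern: an optional leading space followed by a run of
# letters, digits, or punctuation; any remaining whitespace is its own chunk.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pretokenize(text):
    """Split text into chunks, keeping each space glued to the next word."""
    return PRETOKENIZE.findall(text)


chunks = pretokenize("token and token")
print(chunks)                 # ['token', ' and', ' token']
print(chunks[0] == chunks[2])  # False: 'token' and ' token' are different strings
```

Because the first and last chunks differ by one character, they follow different merge paths and can end up with different IDs, even though a human reads them as the same word.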
03 The math
Let the vocabulary be a finite set of size $|V|$. A tokenizer is a function that maps a string $s$ to a sequence of token IDs $t = (t_1, \dots, t_n)$, $t_i \in \{0, \dots, |V|-1\}$, where $n$ is the token length of $s$.
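BPE builds $V$ greedily. Beginning from the base alphabet $V_0$ (the 256 bytes), it repeatedly chooses the adjacent pair with the highest corpus frequency and adds its concatenation to the vocabulary:
$$(a^\ast, b^\ast) = \arg\max_{(a,b) \in V_k \times V_k} \mathrm{count}(a, b), \qquad V_{k+1} = V_k \cup \{a^\ast b^\ast\},$$
where $\mathrm{count}(a, b)$ is how often $a$ is immediately followed by $b$ in the corpus as currently segmented, stopping when $|V_k|$ reaches the target. There is no clean global optimum; BPE is a greedy compressor, but it reliably shortens sequences for a given vocabulary budget.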
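That budget is the whole game, because tokenization sets two of the model's biggest costs:
- Sequence length $n$: attention compares every token to every other, so its compute grows as $O(n^2)$.
- Vocabulary size $|V|$: the embedding table has one row per token, so its parameter count grows with $|V|$.
So a tokenizer is a dial between $n$ and $|V|$, between sequence length and parameter count, and every model picks a point on it. Hold the $O(n^2)$; it returns in layer 06.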
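A rough way to feel the two ends of that dial is to compute them. This sketch assumes the parameter cost in question is an embedding table of $|V|$ rows by $d$ columns, and that attention cost is proportional to the number of token pairs, $n^2$. The width `d_model = 4096` and both function names are hypothetical values picked for illustration.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in an embedding table: one d_model-wide row per token."""
    return vocab_size * d_model


def attention_pairs(n_tokens: int) -> int:
    """Token-to-token comparisons in one full attention pass."""
    return n_tokens * n_tokens


d_model = 4096  # hypothetical embedding width

for vocab_size in (100_000, 200_000):
    print(f"|V| = {vocab_size:>7,}: {embedding_params(vocab_size, d_model):,} parameters")

for n in (1_000, 2_000, 4_000):
    print(f"n = {n:>5,}: {attention_pairs(n):,} pairs")
```

Doubling the vocabulary doubles the table; doubling the sequence length quadruples the pairs. A larger vocabulary is one way to spend parameters in order to buy shorter sequences.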
04 The code
A minimal BPE trainer. It learns merges from a toy corpus exactly as described above, runnable as-is in Python.
train_bpe.py
```python
from collections import Counter


def get_pairs(tokens):
    """Count each adjacent pair of symbols in one tokenized word."""
    return Counter(zip(tokens, tokens[1:]))


def merge(tok, a, b):
    """Replace every adjacent (a, b) in tok with the merged symbol a + b."""
    out, i = [], 0
    while i < len(tok):
        if i < len(tok) - 1 and tok[i] == a and tok[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tok[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    # words: list of strings; start from characters
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tok in corpus:
            pairs.update(get_pairs(tok))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((a, b))
        # merge that pair everywhere
        corpus = [merge(tok, a, b) for tok in corpus]
    return merges


words = ["token"] * 6 + ["tokenizer"] * 3 + ["tokenization"] * 2
print(train_bpe(words, num_merges=5))
# -> [('t', 'o'), ('to', 'k'), ('tok', 'e'), ('toke', 'n'), ('token', 'i')]  (pieces merge up into 'token')
```
In production you don't train your own — you call the real one. OpenAI's tiktoken gives the exact tokenizer the models use:
count_tokens.py
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5
ids = enc.encode("Tokenization isn't magic.")
print(ids)       # [3404, 2065, 4536, 956, 11204, 13]
print(len(ids))  # 6 tokens
print([enc.decode([i]) for i in ids])
# ['Token', 'ization', ' isn', "'t", ' magic', '.']
```
The live widget at the top runs exactly this tokenizer in your browser — type into it and watch the chunks change.
05 The economics: why this is a Divergent Compute chapter
Mechanics → money
Everything you just read has a price tag. The token isn't only the unit of text — it's the unit of cost. Models are billed per token; a frontier model runs roughly $2–$5 per million input tokens. One million tokens is roughly 750,000 words — several thousand pages. Multiply by hundreds of millions of users sending thousands of tokens each, and you arrive at the spend that built the data centers.
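The per-token pricing arithmetic can be sketched in a few lines. The price range is the rough $2 to $5 per million input tokens quoted above, not a quote from any specific provider, and `input_cost_usd` is a name invented for this example.

```python
def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `tokens` input tokens at a given per-million price."""
    return tokens / 1_000_000 * usd_per_million_tokens


LOW, HIGH = 2.0, 5.0  # assumed price range, USD per million input tokens

# One million tokens, i.e. roughly 750,000 words of English.
print(input_cost_usd(1_000_000, LOW), input_cost_usd(1_000_000, HIGH))  # 2.0 5.0

# A 1,000-word essay at roughly 1,333 tokens costs a fraction of a cent.
essay_tokens = round(1_000 / 0.75)
print(f"{input_cost_usd(essay_tokens, LOW):.4f} to {input_cost_usd(essay_tokens, HIGH):.4f} USD")
```

Any single request is nearly free. The spend only becomes large when it is multiplied across very many users and very many requests.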
And remember the $O(n^2)$ from the math layer. Because attention compares every token to every other, a longer context costs compute that grows with the square of the token count — and the model must hold a KV cache whose memory grows linearly with every token in the conversation. That cache is a primary reason AI is starved for high-bandwidth memory. The token, in other words, is what runs straight into the memory wall.
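The linear growth of the KV cache is easy to estimate, assuming the usual layout of one key vector and one value vector stored per token, per layer, per key-value head. The model dimensions below (32 layers, 8 key-value heads, head dimension 128, 2-byte values) are hypothetical and chosen only to make the numbers round; they do not describe any particular model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached keys and values; the leading 2 is one key plus one value."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical model dimensions, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

for n in (8_192, 16_384, 32_768):
    gib = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"n = {n:>6,}: {gib:.0f} GiB of cache, {n * n:,} attention pairs")
```

Each doubling of the context doubles the cache and quadruples the attention pairs. The cache is per conversation, so serving many users at once multiplies it again, which is where the pressure on high-bandwidth memory comes from.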
So the chain is short and direct: more tokens → more compute and more memory → more HBM and more data centers → the build-out. The thing on this page is the atom that the whole Circuit is made of — see how it flows through the memory supply chain and the economics in pictures.
06 Going deeper
Sennrich, Haddow & Birch (2016) — Neural Machine Translation of Rare Words with Subword Units · the paper that brought BPE to NLP.
Gage (1994) — A New Algorithm for Data Compression · the original byte-pair-encoding idea.
Radford et al. (2019) — GPT-2 · introduced byte-level BPE for language models.
OpenAI — tiktoken · the production tokenizer (cl100k_base, o200k_base).
Kudo & Richardson (2018) — SentencePiece · the language-agnostic alternative used by many open models.
Cite this chapter: Divergent Compute, "What is a token?", First Principles, 2026. divergentcompute.com/first-principles-token · v1.0 · CC-BY.