Divergent Compute · AI Economic Think Tank

First Principles / Part I · Foundations / Chapter 01

What is a token?

A token is the basic unit of text an AI model reads and writes — not a word, not a letter, but a chunk. Models don't see language; they see tokens.

Tokenization isn't magic.

3 words become 6 tokens. The model reads the chunks on the bottom, never the sentence on top.

01 The answer, then the intuition

The model never sees your sentence

Before a language model reads a single thing, your text is chopped into tokens by a piece of software called a tokenizer. A token can be a whole word (cat), a fragment of one (token + ization), a single character, a space, or a punctuation mark. Common words usually become one token; rare words get split into pieces.

The model then works entirely in tokens — it predicts the next token, over and over, and the tokens are stitched back into text at the end. Three things you'll meet everywhere downstream are all measured in tokens: the context window (how much it can read at once), the price (APIs bill per token), and the speed (tokens per second).

A useful rule of thumb for English: 1 token ≈ 4 characters ≈ 0.75 words. So a 1,000-word essay is roughly 1,300 tokens.
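
The rule of thumb can be turned into a quick estimator. This is only a rough sketch for English text: the function names are illustrative, the constants are the ones quoted above, and real counts depend on the tokenizer and the text itself.

```python
# Rough English-only estimates from the rule of thumb above.
# The function names are illustrative, not part of any library.
CHARS_PER_TOKEN = 4.0    # 1 token is about 4 characters
WORDS_PER_TOKEN = 0.75   # 1 token is about 0.75 words

def estimate_tokens_from_words(num_words: int) -> int:
    return round(num_words / WORDS_PER_TOKEN)

def estimate_tokens_from_chars(num_chars: int) -> int:
    return round(num_chars / CHARS_PER_TOKEN)

print(estimate_tokens_from_words(1000))                              # 1333
print(estimate_tokens_from_chars(len("Tokenization isn't magic.")))  # 6
```

An estimate like this is fine for budgeting, but it is no substitute for running the actual tokenizer on the actual text.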

Interactive demo: the GPT-4 tokenizer (cl100k_base), run live on your own text. It reports tokens, characters, characters per token, and cost at $2.50 per 1M tokens.

02 Mechanics

How the chunks get made: Byte-Pair Encoding

Almost every modern model tokenizes with a variant of Byte-Pair Encoding (BPE). The idea is borrowed from a 1994 data-compression trick and is delightfully simple. You build the vocabulary by merging:

  • Start with the smallest possible units — individual bytes (so any character in any language is representable).
  • Count every adjacent pair of units across a huge text corpus.
  • Merge the single most frequent pair into one new unit, and add it to the vocabulary.
  • Repeat thousands of times until the vocabulary hits a target size — GPT-2 stopped at ~50,000, GPT-4's cl100k at ~100,000, the newest at ~200,000.

The result is a learned list of merges. To tokenize new text, the tokenizer greedily applies those merges in the order they were learned: frequent strings like "the" survive as one token, while a rare word like antidisestablishmentarianism gets rebuilt from several pieces. Each final token maps to an integer ID, and that ID indexes a row in the model's embedding table, which is the subject of the next chapter. Text → tokens → IDs → vectors.
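
The last step of that pipeline, from ID to vector, is a plain row lookup. Here is a minimal sketch, assuming a made-up four-token vocabulary and a tiny embedding table with invented values. Real tables are learned during training and have one row per vocabulary entry.

```python
# Toy vocabulary and embedding table; every value here is invented for illustration.
vocab = {"Token": 0, "ization": 1, " isn": 2, "'t": 3}
embedding_table = [
    [0.1, 0.3],    # row 0 -> "Token"
    [0.7, -0.2],   # row 1 -> "ization"
    [-0.5, 0.4],   # row 2 -> " isn"
    [0.0, 0.9],    # row 3 -> "'t"
]

def embed(tokens):
    """Map token strings to integer IDs, then use each ID as a row index."""
    ids = [vocab[t] for t in tokens]
    vectors = [embedding_table[i] for i in ids]
    return ids, vectors

ids, vectors = embed(["Token", "ization"])
print(ids)      # [0, 1]
print(vectors)  # [[0.1, 0.3], [0.7, -0.2]]
```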

Two consequences worth holding onto: spaces are usually glued to the front of the next word (" token", with a leading space, is a different token from "token"), and the same word can tokenize differently depending on what's around it. That is why token counts feel slightly unpredictable, and why you should measure, not guess.
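
The leading-space behaviour can be seen with a toy splitter. This is a simplified sketch, not the pattern any production tokenizer uses: the regular expression is an assumption, chosen only to show a space attaching to the word that follows it.

```python
import re

# Simplified pre-split, for illustration only: an optional leading space stays
# attached to the next word, and each punctuation mark becomes its own piece.
PIECE = re.compile(r" ?\w+|[^\w\s]")

def toy_split(text):
    return PIECE.findall(text)

print(toy_split("token token"))   # ['token', ' token']
print(toy_split("A token."))      # ['A', ' token', '.']
```

The first two pieces differ only by the leading space, so a vocabulary that contains both gives them different IDs.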

03 The math

Vocabulary, merges, and the cost it sets

Let the vocabulary be a finite set of size $|V|$. A tokenizer is a function that maps a string $s$ to a sequence of token IDs $\;t = (t_1, \dots, t_n),\; t_i \in \{0, \dots, |V|-1\}$, where $n$ is the token length of $s$.

BPE builds $V$ greedily. Beginning from the base alphabet $V_0$ (the 256 bytes), it repeatedly chooses the adjacent pair with the highest corpus frequency and adds its concatenation to the vocabulary:

$$ (a^*, b^*) = \arg\max_{(a,b)} \; \text{freq}(a,b), \qquad V_{k+1} = V_k \cup \{\, a^*b^* \,\} $$

stopping when $|V_k|$ reaches the target. There is no clean global optimum — BPE is a greedy compressor — but it reliably shortens sequences for a given vocabulary budget.

That budget is the whole game, because tokenization sets two of the model's biggest costs:

  • Sequence length drives compute. Self-attention compares every token with every other token, so its cost per layer grows as $O(n^2 \cdot d)$ in the number of tokens $n$ (with model width $d$). Halving your token count roughly quarters the attention work.
  • Vocabulary size drives parameters. The embedding and output layers each hold $|V| \cdot d$ parameters. A bigger vocabulary makes sequences shorter (good) but those two matrices larger (costly).
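
The two bullets above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the width d = 4096 and the vocabulary sizes are hypothetical round numbers, constant factors are dropped from the attention cost, and the embedding and output matrices are assumed to be separate, as the text describes.

```python
def attention_cost(n: int, d: int) -> int:
    """Per-layer attention work up to a constant factor: n^2 * d."""
    return n * n * d

def embedding_params(vocab_size: int, d: int) -> int:
    """Parameters in the embedding plus the output layer: 2 * |V| * d."""
    return 2 * vocab_size * d

d = 4096  # hypothetical model width

# Doubling the sequence length quadruples the attention work.
print(attention_cost(2000, d) / attention_cost(1000, d))             # 4.0

# Doubling the vocabulary doubles the two |V| x d matrices.
print(embedding_params(100_000, d))                                  # 819200000
print(embedding_params(200_000, d) - embedding_params(100_000, d))   # 819200000
```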

So a tokenizer is a dial between $n$ and $|V|$, between sequence length and parameter count, and every model picks a point on it. Hold on to the $O(n^2)$; it returns in the economics layer.

04 The code

BPE in a few lines — and the real thing

A minimal BPE trainer. It learns merges from a toy corpus as described above, except that it starts from characters rather than bytes to keep the output readable. Runnable as-is in Python.

train_bpe.py

from collections import Counter

def get_pairs(tokens):
    return Counter(zip(tokens, tokens[1:]))

def train_bpe(words, num_merges):
    # words: list of strings; start from characters
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tok in corpus:
            pairs.update(get_pairs(tok))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        # merge that pair everywhere
        corpus = [merge(tok, a, b) for tok in corpus]
    return merges

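def merge(tok, a, b):
    """Replace every adjacent (a, b) in tok with the merged unit a + b."""
    out, i = [], 0
    while i < len(tok):
        if i < len(tok) - 1 and tok[i] == a and tok[i + 1] == b:
            out.append(a + b)    # merge the pair into one unit
            i += 2
        else:
            out.append(tok[i])   # keep the unit unchanged
            i += 1
    return out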

words = ["token"] * 6 + ["tokenizer"] * 3 + ["tokenization"] * 2
print(train_bpe(words, num_merges=5))
# -> [('t','o'), ('to','k'), ('tok','e'), ('toke','n'), ('token','i')]  (pieces merge up into 'token')
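
The trainer returns only the merge list. To tokenize a new word, replay those merges in the order they were learned. The sketch below is self-contained and uses the merge list printed above; `apply_merges` is an illustrative name, and a production tokenizer does the same job with faster data structures.

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying learned merges in the order they were learned."""
    tokens = list(word)  # start from characters, like the toy trainer
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Merge list copied from the trainer output above.
merges = [("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"), ("token", "i")]

print(apply_merges("token", merges))      # ['token']
print(apply_merges("tokenizer", merges))  # ['tokeni', 'z', 'e', 'r']
print(apply_merges("took", merges))       # ['to', 'o', 'k']
```

The frequent word collapses to one token, while the unseen word "took" stays in pieces because only the first merge applies to it.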

In production you don't train your own — you call the real one. OpenAI's tiktoken gives the exact tokenizer the models use:

count_tokens.py

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # GPT-4 / GPT-3.5
ids = enc.encode("Tokenization isn't magic.")
print(ids)                 # [3404, 2065, 4536, 956, 11204, 13]
print(len(ids))            # 6 tokens
print([enc.decode([i]) for i in ids])
# ['Token', 'ization', ' isn', "'t", ' magic', '.']

The live widget at the top runs exactly this tokenizer in your browser — type into it and watch the chunks change.

05 The economics: why this is a Divergent Compute chapter

The token is the atom of the AI economy

Mechanics → money

Everything you just read has a price tag. The token isn't only the unit of text — it's the unit of cost. Models are billed per token; a frontier model runs roughly $2–$5 per million input tokens. One million tokens is roughly 750,000 words — several thousand pages. Multiply by hundreds of millions of users sending thousands of tokens each, and you arrive at the spend that built the data centers.
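
The billing arithmetic is one line. Here is a minimal sketch, assuming a flat price of $2.50 per million input tokens; the user count and tokens per user are hypothetical round numbers chosen only to show the scale, not measurements.

```python
def cost_usd(num_tokens: int, usd_per_million: float) -> float:
    """Flat per-token billing: tokens / 1M * price per million."""
    return num_tokens / 1_000_000 * usd_per_million

PRICE = 2.50  # assumed USD per 1M input tokens

print(cost_usd(1_000_000, PRICE))          # 2.5
print(round(cost_usd(1_300, PRICE), 5))    # 0.00325 (the 1,000-word essay)

# Hypothetical scale: 100 million users sending 2,000 tokens each.
users, tokens_each = 100_000_000, 2_000
print(cost_usd(users * tokens_each, PRICE))  # 500000.0
```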

And remember the $O(n^2)$ from the math layer. Because attention compares every token to every other, a longer context costs compute that grows with the square of the token count — and the model must hold a KV cache whose memory grows linearly with every token in the conversation. That cache is a primary reason AI is starved for high-bandwidth memory. The token, in other words, is what runs straight into the memory wall.
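
The linear growth of the KV cache is easy to make concrete. This is a sketch under stated assumptions: the layer count, head count, and head size are hypothetical and describe no specific model, and values are stored in 2 bytes each (16-bit precision).

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor 2) stored for every token at every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# Hypothetical configuration, not any specific model.
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

print(kv_cache_bytes(1, **cfg))               # 524288 bytes per token
print(kv_cache_bytes(8_000, **cfg) / 2**30)   # 3.90625 GiB for an 8,000-token context
print(kv_cache_bytes(16_000, **cfg) / kv_cache_bytes(8_000, **cfg))  # 2.0
```

Doubling the context doubles the cache, and that memory is held per conversation for as long as the conversation is being served.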

So the chain is short and direct: more tokens → more compute and more memory → more HBM and more data centers → the build-out. The thing on this page is the atom that the whole Circuit is made of — see how it flows through the memory supply chain and the economics in pictures.

06 Going deeper

The primary sources

Sennrich, Haddow & Birch (2016) — Neural Machine Translation of Rare Words with Subword Units · the paper that brought BPE to NLP.
Gage (1994) — A New Algorithm for Data Compression · the original byte-pair-encoding idea.
Radford et al. (2019) — GPT-2 · introduced byte-level BPE for language models.
OpenAI — tiktoken · the production tokenizer (cl100k_base, o200k_base).
Kudo & Richardson (2018) — SentencePiece · the language-agnostic alternative used by many open models.

Cite this chapter: Divergent Compute, "What is a token?", First Principles, 2026. divergentcompute.com/first-principles-token · v1.0 · CC-BY.
