What is a token?

A token is the basic unit of text an AI model reads and writes — not a word, not a letter, but a chunk. Models don't see language; they see tokens.

Read at your depth: 01 The answer · 02 Intuition · 03 Mechanics · 04 The math · 05 The code · 06 The economics · 07 Sources — the deep layers expand on click.

Tokenization isn't magic.

↓

[Token] [ization] [ isn] ['t] [ magic] [.]

3 words become 6 tokens. The model reads the chunks on the bottom, never the sentence on top.

The model never sees your sentence

Before a language model reads a single thing, your text is chopped into tokens by a piece of software called a tokenizer. A token can be a whole word ("cat"), a fragment of one ("token" + "ization"), a single character, a space, or a punctuation mark. Common words usually become one token; rare words get split into pieces.

The model then works entirely in tokens — it predicts the next token, over and over, and the tokens are stitched back into text at the end. Three things you'll meet everywhere downstream are all measured in tokens: the context window (how much it can read at once), the price (APIs bill per token), and the speed (tokens per second).

A useful rule of thumb for English: 1 token ≈ 4 characters ≈ 0.75 words. So a 1,000-word essay is roughly 1,300 tokens.
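The rule of thumb is easy to turn into a quick estimator. The sketch below is only that arithmetic; the helper names are made up for illustration, it assumes ordinary English prose, and it is no substitute for running a real tokenizer.

```python
# Rough token estimates for English text, using the rule of thumb above:
# 1 token ≈ 4 characters ≈ 0.75 words. Helper names are illustrative only.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens from character count (about 4 characters per token)."""
    return round(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens from word count (about 0.75 words per token)."""
    return round(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(1000))  # 1333, i.e. roughly 1,300 tokens
print(estimate_tokens_from_chars("Tokenization isn't magic."))  # 6
```

Estimates like these are fine for budgeting, but code, non-English text, and unusual words can land far from the average.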

Try it — the real GPT tokenizer, live

[Interactive tokenizer widget: for text you type, it reports tokens, characters, chars / token, and cost at $2.50/1M, using the GPT-4 tokenizer (cl100k_base).]

How the chunks get made: Byte-Pair Encoding

Almost every modern model tokenizes with a variant of Byte-Pair Encoding (BPE). The idea is borrowed from a 1994 data-compression trick and is delightfully simple. You build the vocabulary by merging:

1. Start with the smallest possible units: individual bytes, so any character in any language is representable.
2. Count every adjacent pair of units across a huge text corpus.
3. Merge the single most frequent pair into one new unit, and add it to the vocabulary.
4. Repeat thousands of times until the vocabulary hits a target size: GPT-2 stopped at ~50,000, GPT-4's cl100k at ~100,000, and newer encodings such as o200k_base at ~200,000.

The result is a learned list of merges. To tokenize new text, the tokenizer greedily applies those merges in the order they were learned: frequent strings like "the" survive as one token, while a rare word like "antidisestablishmentarianism" gets rebuilt from several pieces. Each final token maps to an integer ID, and that ID indexes a row in the model's embedding table, which is the next chapter. Text → tokens → IDs → vectors.
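The encoding side can be sketched in a few lines. This is a toy under stated assumptions: the merge list and vocabulary are hand-written for illustration, the base units are characters rather than bytes, and `apply_merges` is a hypothetical helper, not the API of any real tokenizer library.

```python
# Toy BPE encoder: replay learned merges, in training order, on new text.
# The merge list and vocabulary below are illustrative, not from a real model.

def apply_merges(word, merges):
    """Split a word into characters, then apply each merge rule in order."""
    pieces = list(word)
    for a, b in merges:
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == a and pieces[i + 1] == b:
                merged.append(a + b)  # fuse the adjacent pair into one token
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


merges = [("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"), ("token", "i")]

# Vocabulary: base characters first, then one new entry per merge.
base = sorted(set("tokenizer" + "tokenization"))
vocab = {tok: idx for idx, tok in enumerate(base + [a + b for a, b in merges])}

for word in ["token", "tokenizer", "rotten"]:
    pieces = apply_merges(word, merges)
    ids = [vocab[p] for p in pieces]
    print(word, pieces, ids)
# token ['token'] [12]
# tokenizer ['tokeni', 'z', 'e', 'r'] [13, 8, 1, 6]
# rotten ['r', 'o', 't', 't', 'e', 'n'] [6, 5, 7, 7, 1, 4]
```

The frequent string collapses to a single ID, while "rotten", which matches none of the merges, stays as six single-character tokens.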

Two consequences worth holding onto: spaces are usually glued to the front of the next word (" token", with a leading space, is a different token from "token"), and the same word can tokenize differently depending on what's around it. That is why token counts feel slightly unpredictable, and why you should measure, not guess.

Vocabulary, merges, and the cost it sets

Let the vocabulary be a finite set $V$ of size $|V|$. A tokenizer is a function that maps a string $s$ to a sequence of token IDs $t = (t_1, \dots, t_n)$, with $t_i \in \{0, \dots, |V|-1\}$, where $n$ is the token length of $s$.

BPE builds $V$ greedily. Beginning from the base alphabet $V_0$ (the 256 bytes), it repeatedly chooses the adjacent pair with the highest corpus frequency and adds its concatenation to the vocabulary:

$$ (a^*, b^*) = \arg\max_{(a,b)} \; \text{freq}(a,b), \qquad V_{k+1} = V_k \cup \{\, a^*b^* \,\} $$

stopping when $|V_k|$ reaches the target. There is no clean global optimum — BPE is a greedy compressor — but it reliably shortens sequences for a given vocabulary budget.

That budget is the whole game, because tokenization sets two of the model's biggest costs:

Sequence length drives compute. Self-attention compares every token with every other token, so its cost per layer grows as $O(n^2 \cdot d)$ in the number of tokens $n$ (with model width $d$). Halving your token count roughly quarters the attention work.
Vocabulary size drives parameters. The embedding and output layers each hold $|V| \cdot d$ parameters. A bigger vocabulary makes sequences shorter (good) but those two matrices larger (costly).

So a tokenizer is a dial between $n$ and $|V|$, that is, between sequence length and parameter count, and every model picks a point on it. Hold on to the $O(n^2)$; it returns in layer 06.
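To make the two scaling claims concrete, here is a back-of-envelope sketch. The model width and sequence lengths are hypothetical, the attention cost is counted only up to a constant factor, and the parameter count assumes the embedding and output matrices are stored separately.

```python
# Back-of-envelope scaling. All sizes are hypothetical, chosen for illustration.

def attention_cost(n_tokens: int, d_model: int) -> int:
    """Per-layer self-attention cost up to a constant factor: n^2 * d."""
    return n_tokens ** 2 * d_model


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in the input embedding plus the output layer: 2 * |V| * d."""
    return 2 * vocab_size * d_model


d = 4096  # assumed model width

# Halving the sequence length quarters the attention work.
ratio = attention_cost(2000, d) / attention_cost(1000, d)
print(ratio)  # 4.0

# Doubling the vocabulary doubles the embedding and output parameters.
for vocab_size in (50_000, 100_000, 200_000):
    print(vocab_size, embedding_params(vocab_size, d))
# 50000 409600000
# 100000 819200000
# 200000 1638400000
```

The quadratic term rewards shorter sequences heavily, while the vocabulary term grows only linearly; that asymmetry is one way to read the dial described above.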

BPE in a few lines, and the real thing

```python
from collections import Counter


def get_pairs(tokens):
    """Count adjacent pairs within one tokenized word."""
    return Counter(zip(tokens, tokens[1:]))


def merge(tok, a, b):
    """Replace every adjacent (a, b) in tok with the merged unit a + b."""
    out, i = [], 0
    while i < len(tok):
        if i < len(tok) - 1 and tok[i] == a and tok[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tok[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    # words: list of strings; this toy starts from characters, not bytes
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tok in corpus:
            pairs.update(get_pairs(tok))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((a, b))
        # merge that pair everywhere
        corpus = [merge(tok, a, b) for tok in corpus]
    return merges


words = ["token"] * 6 + ["tokenizer"] * 3 + ["tokenization"] * 2
print(train_bpe(words, num_merges=5))
# -> [('t', 'o'), ('to', 'k'), ('tok', 'e'), ('toke', 'n'), ('token', 'i')]
# (pieces merge up into 'token')
```

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5
ids = enc.encode("Tokenization isn't magic.")
print(ids)       # [3404, 2065, 4536, 956, 11204, 13]
print(len(ids))  # 6 tokens
print([enc.decode([i]) for i in ids])
# ['Token', 'ization', ' isn', "'t", ' magic', '.']
```

The token is the atom of the AI economy

Mechanics → money

Everything you just read has a price tag. The token isn't only the unit of text; it's the unit of cost. API usage is billed per token, and a frontier model runs roughly $2–$5 per million input tokens. One million tokens is roughly 750,000 words, or several thousand pages. Multiply by hundreds of millions of users sending thousands of tokens each, and you arrive at the spend that built the data centers.
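The billing arithmetic is a single multiplication. The sketch below assumes a flat price per million input tokens; the figure used is a placeholder in the range quoted above, not a quote for any particular model or provider, and `input_cost_usd` is a hypothetical helper.

```python
# Token-based pricing arithmetic. The price is a placeholder, not a quote
# for any particular model or provider.

def input_cost_usd(n_tokens: int, usd_per_million: float) -> float:
    """Cost of n_tokens input tokens at a flat per-million-token price."""
    return n_tokens / 1_000_000 * usd_per_million


price = 2.50  # assumed USD per 1M input tokens

# One million tokens (roughly 750,000 words) at the assumed price.
print(f"${input_cost_usd(1_000_000, price):.2f}")  # $2.50

# A 1,000-word essay at roughly 1,300 tokens.
print(f"${input_cost_usd(1_300, price):.5f}")  # $0.00325
```

A single request costs a fraction of a cent; the totals only become large once they are multiplied across users and requests.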

And remember the $O(n^2)$ from the math layer. Because attention compares every token to every other, a longer context costs compute that grows with the square of the token count — and the model must hold a KV cache whose memory grows linearly with every token in the conversation. That cache is a primary reason AI is starved for high-bandwidth memory. The token, in other words, is what runs straight into the memory wall.
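The linear growth of the KV cache can be sketched with the usual sizing formula: two tensors (keys and values) per layer, per token. The model shape below is hypothetical and not that of any specific model, and the estimate covers a single sequence with 2-byte values and no batching or cache compression.

```python
# KV-cache size grows linearly with the number of tokens held in context.
# The model dimensions are hypothetical, not those of any specific model.

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes for keys and values: 2 tensors per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token


config = dict(n_layers=32, n_kv_heads=32, head_dim=128)  # assumed shape, fp16

for n in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(n, **config) / 2**30
    print(f"{n:>7} tokens -> {gib:.2f} GiB")
#    1000 tokens -> 0.49 GiB
#   10000 tokens -> 4.88 GiB
#  100000 tokens -> 48.83 GiB
```

Under these assumptions every token adds half a mebibyte to the cache, so a long conversation occupies tens of gibibytes of fast memory on its own.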

So the chain is short and direct: more tokens → more compute and more memory → more HBM and more data centers → the build-out. The thing on this page is the atom that the whole Circuit is made of — see how it flows through the memory supply chain and the economics in pictures.
