First Principles / Part II · Models / Chapter 08
A large language model is a transformer trained to do one thing: predict the next token. Run that prediction over and over, feeding each guess back in, and the next-token machine becomes a writer.
01 The answer, then the intuition
Everything in Part I builds to this. An LLM takes your text as tokens, runs it through the transformer stack, and produces a probability for every possible next token. It picks one, sticks it on the end, and runs again on the longer text. Repeat until done. That loop — predict, append, repeat — is all "generation" is.
The surprise of the last few years is that doing this one humble task well enough, at enough scale, produces translation, code, reasoning, and conversation as side effects. To answer a question, the most likely continuation of "the answer is" turns out to be the answer.
Take a single prediction step. The model gives a distribution over next tokens; a single knob, the temperature, controls how boldly it picks.
Illustrative probabilities for one prediction step. Temperature reshapes the same distribution.
At low temperature the model is confident and repetitive; at high temperature it gets creative — and eventually incoherent.
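One rough way to put a number on "confident" versus "incoherent" is the entropy of the distribution at each temperature. The sketch below is only illustrative: the logits are made up, `entropy_bits` is a helper name introduced here rather than anything from the chapter, and it uses the standard library only.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits (T must be > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(p):
    """Shannon entropy in bits: 0 means certain, log2(len(p)) means uniform."""
    return -sum(q * math.log2(q) for q in p if q > 0)


# Made-up logits for six candidate tokens (an assumption, for illustration only).
logits = [4.2, 2.8, 2.3, 1.9, 0.8, 1.2]

for T in [0.1, 0.5, 1.0, 1.7, 100.0]:
    p = softmax(logits, T)
    print(f"T={T:>5}: top prob {max(p):.2f}, entropy {entropy_bits(p):.2f} bits")
```

Entropy rises with temperature: close to 0 bits at very low T, where one token takes nearly all the probability, and close to log2(6) ≈ 2.58 bits at very high T, where all six tokens are almost equally likely.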
02 Mechanics
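The mechanics are a short loop:
1. Split the text so far into tokens.
2. Run the tokens through the transformer stack to get a score (a logit) for every token in the vocabulary.
3. softmax turns those into probabilities.
4. Pick one token from that distribution.
5. Append it to the text and go back to step 2.

So an "LLM" is not a database of answers. It is a single learned function for "what token is likely next," wrapped in a loop. The next chapters open up where that function's knowledge comes from (pretraining) and how it's shaped to be helpful.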
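The loop in this section can be sketched end to end with a toy stand-in for the model. Everything here is an assumption made for illustration: the five-word vocabulary, the `TABLE` lookup that plays the role of the transformer, the `<eos>` end-of-sequence marker used to decide when generation is "done", and the choice to always pick the most likely token (greedy) instead of sampling.

```python
import math

# Hypothetical toy "model": a lookup table from the last word to logits over
# the vocabulary. A real LLM conditions on the whole context, not one word.
VOCAB = ["<eos>", "the", "sky", "is", "blue"]
TABLE = {
    "the":  [0.0, 0.0, 3.0, 0.0, 1.0],
    "sky":  [0.0, 0.0, 0.0, 3.0, 0.0],
    "is":   [0.5, 0.0, 0.0, 0.0, 3.0],
    "blue": [3.0, 0.0, 0.0, 0.0, 0.0],
}


def next_logits(tokens):
    """Score every vocabulary entry given the context (here: only the last token)."""
    return TABLE[VOCAB[tokens[-1]]]


def softmax(logits):
    """Turn logits into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate_greedy(prompt, max_new_tokens=10):
    tokens = [VOCAB.index(w) for w in prompt.split()]  # toy word-level tokenizer
    for _ in range(max_new_tokens):
        probs = softmax(next_logits(tokens))
        choice = max(range(len(probs)), key=probs.__getitem__)  # most likely token
        if VOCAB[choice] == "<eos>":  # stop when the model predicts end-of-sequence
            break
        tokens.append(choice)  # append, then run again on the longer text
    return " ".join(VOCAB[t] for t in tokens)


print(generate_greedy("the"))  # prints: the sky is blue
```

Swapping the greedy pick for a random draw from `probs` would give the sampling behaviour that temperature controls.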
04 The math
Given the tokens so far, the model outputs a logit $z_i$ for each token $i$ in the vocabulary. With temperature $T$, the probability of the next token is:

$$P(x_t = i \mid x_1, \dots, x_{t-1}) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
As $T \to 0$ this collapses onto the single highest-logit token (greedy, deterministic); at $T = 1$ it is the model's raw distribution; as $T$ grows the probabilities flatten toward uniform: more random, eventually nonsense. Generating a whole sequence just multiplies these step by step:

$$P(x_1, \dots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})$$
Training maximizes exactly this probability on real text — equivalently, it minimizes the average cross-entropy (the negative log of the right token's probability). "Predict the next token" is the entire objective.
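A small numeric sketch may help connect the sequence probability to the cross-entropy. The per-step probabilities below are invented for illustration, and the function names `sequence_log_prob` and `cross_entropy` are introduced here, not taken from the chapter. The code works in log space, which is the usual way to avoid underflow when many small probabilities are multiplied.

```python
import math


def sequence_log_prob(step_probs):
    """log P(sequence): the sum of the log-probabilities of each correct token."""
    return sum(math.log(p) for p in step_probs)


def cross_entropy(step_probs):
    """Average negative log-probability of the correct token, in nats."""
    return -sequence_log_prob(step_probs) / len(step_probs)


# Made-up probabilities the model assigned to the *correct* next token
# at each of four steps (an assumption, for illustration only).
step_probs = [0.63, 0.20, 0.90, 0.05]

log_p = sequence_log_prob(step_probs)
print(f"P(sequence)   = {math.exp(log_p):.5f}")
print(f"cross-entropy = {cross_entropy(step_probs):.3f} nats per token")
```

Raising any of the four probabilities raises the product and lowers the average, which is the sense in which maximizing the sequence probability and minimizing the cross-entropy are the same objective.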
05 The code
Temperature sampling, then the autoregressive loop. Runnable: it's the whole idea.
generate.py
```python
import numpy as np


def softmax(z, T=1.0):
    """Temperature-scaled softmax; T must be > 0."""
    z = np.array(z, dtype=float) / T
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


# illustrative logits over a tiny vocabulary, after "The sky is"
vocab = ["blue", "clear", "grey", "dark", "falling", "the"]
logits = [4.2, 2.8, 2.3, 1.9, 0.8, 1.2]

for T in [0.5, 1.0, 1.7]:
    p = softmax(logits, T)
    print(f"T={T}:", ", ".join(f"{w} {pi:.2f}" for w, pi in zip(vocab, p)))
# T=0.5: blue 0.91, clear 0.06, grey 0.02, dark 0.01, falling 0.00, the 0.00
# T=1.0: blue 0.63, clear 0.16, grey 0.09, dark 0.06, falling 0.02, the 0.03
# T=1.7: blue 0.43, clear 0.19, grey 0.14, dark 0.11, falling 0.06, the 0.07


def generate(next_logits, prompt, n, T=1.0):
    """Autoregressive loop; next_logits maps a context (token list) to a logit vector."""
    toks = list(prompt)
    for _ in range(n):
        p = softmax(next_logits(toks), T)
        toks.append(int(np.random.choice(len(p), p=p)))  # sample the next token
    return toks
```
06 The economics
Prediction → money
Because generation is a loop, an LLM cannot produce its answer all at once. It must run a forward pass for every single token it writes, each one conditioned on all the tokens before it. That sequential dependency is why responses stream in word by word, and why latency and cost scale with output length. Every token costs roughly $2N$ floating-point operations for a model with $N$ parameters.
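To get a feel for the scale, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of about two floating-point operations per parameter per token for a forward pass and ignores the extra attention cost that grows with context length. The model size and response length are hypothetical values chosen for illustration, not measurements.

```python
def forward_flops_per_token(n_params):
    """Rule-of-thumb compute for one forward pass on one token: about 2 * N."""
    return 2 * n_params


def generation_flops(n_params, n_output_tokens):
    """Total forward-pass compute to emit n_output_tokens, one pass per token."""
    return forward_flops_per_token(n_params) * n_output_tokens


# Hypothetical model size and response length (assumptions, not measurements).
N = 7_000_000_000       # 7 billion parameters
OUTPUT_TOKENS = 500

total = generation_flops(N, OUTPUT_TOKENS)
print(f"{forward_flops_per_token(N):.1e} FLOPs per token")           # 1.4e+10
print(f"{total:.1e} FLOPs for {OUTPUT_TOKENS} output tokens")        # 7.0e+12
```

The total grows linearly with both model size and output length, which is the scaling the paragraph above describes.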
This is the unit the whole industry is priced in. Providers bill per input and output token, and output tokens typically cost more because each one needs its own sequential forward pass, while input tokens can be processed together in parallel. Multiply by hundreds of millions of users generating billions of tokens a day, and "predict the next token" becomes the single largest recurring workload in computing: the inference demand that, even more than training, is what the data centers are being built to serve.
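Per-token billing reduces to simple arithmetic, sketched below. The prices are hypothetical placeholders in dollars per million tokens, chosen only so that output costs more than input; real prices vary by provider and model. `request_cost` is a name introduced here for illustration.

```python
def request_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost of one request when input and output tokens are priced separately."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


# Hypothetical prices in dollars per million tokens (assumptions, not real quotes).
PRICE_IN = 1.00
PRICE_OUT = 4.00

cost = request_cost(input_tokens=2_000, output_tokens=500,
                    price_in_per_million=PRICE_IN, price_out_per_million=PRICE_OUT)
print(f"${cost:.4f} per request")  # $0.0040 per request
```

With these made-up prices, 500 output tokens cost as much as 2,000 input tokens, so a short answer to a long prompt and a long answer to a short prompt can end up costing about the same.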
So the humble loop you just sped through is the business. The token is the product; the model is the factory; and the Circuit is the account of whether that factory's output ever justifies its cost.
07 Going deeper
Radford et al. (2019) — GPT-2 · language modeling as a path to general capability.
Brown et al. (2020) — GPT-3 · scale turns next-token prediction into few-shot learning.
Holtzman et al. (2019) — The Curious Case of Neural Text Degeneration · why top-p / nucleus sampling beats greedy.
Jurafsky & Martin — Speech and Language Processing · the language-modeling objective, from the ground up.
Cite this chapter: Divergent Compute, "What is an LLM?", First Principles, 2026. divergentcompute.com/first-principles-llm · v1.0 · CC-BY.