Divergent Compute.AI Economic Think Tank

First Principles / Part II · Models / Chapter 08

What is an LLM?

A large language model is a transformer trained to do one thing: predict the next token. Run that prediction over and over, feeding each guess back in, and the next-token machine becomes a writer.

01 The answer, then the intuition

It is autocomplete — taken extremely seriously

Everything in Part I builds to this. An LLM takes your text as tokens, runs it through the transformer stack, and produces a probability for every possible next token. It picks one, sticks it on the end, and runs again on the longer text. Repeat until done. That loop — predict, append, repeat — is all "generation" is.

The surprise of the last few years is that doing this one humble task well enough, at enough scale, produces translation, code, reasoning, and conversation as side effects. Given a question, the most likely continuation of "the answer is" turns out to be the answer.

Below is that prediction step. The model gives a distribution over next tokens; a single knob, the temperature, controls how boldly it picks.

[Interactive figure: next-token prediction with a temperature slider. Illustrative probabilities for one prediction step; temperature reshapes the same distribution.]

At low temperature the model is confident and repetitive; at high temperature it gets creative — and eventually incoherent.

02 Mechanics

Predict, sample, repeat

The mechanics are a short loop:

  • The objective. During training the model is shown oceans of text and asked, at every position, to predict the next token. Its only goal is to make the real next token likely. No hand-written labels, no human in the loop: the text itself supplies the answer, and the task is just "guess what comes next," billions of times.
  • The output. At generation time the final vector becomes a logit for every token in the vocabulary; a softmax turns those into probabilities.
  • Sampling. You then pick a token. Greedy takes the most likely. Temperature divides the logits by a constant before the softmax, making the choice sharper (below 1) or flatter (above 1). Top-p samples only from the smallest set of tokens that covers, say, 90% of the probability. These knobs are the difference between a dull, deterministic model and a lively one.
  • Autoregression. Append the chosen token and run the whole thing again. Each new token is conditioned on everything so far — which is exactly why the context window matters.
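
The sampling choices in the list above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the function names (`greedy`, `top_p_filter`, `sample_top_p`) and the logit values are assumptions made up for the example.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T       # temperature divides the logits
    e = np.exp(z - z.max())                  # subtract the max for stability
    return e / e.sum()

def greedy(logits):
    """Index of the single most likely token."""
    return int(np.argmax(logits))

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]              # most likely first
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, p)) + 1    # first position where cum >= p
    keep = order[:n_keep]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def sample_top_p(logits, T=1.0, p=0.9, rng=None):
    """Temperature, then top-p, then one random draw."""
    if rng is None:
        rng = np.random.default_rng()
    q = top_p_filter(softmax(logits, T), p)
    return int(rng.choice(len(q), p=q))

# Illustrative (assumed) logits over a tiny vocabulary
vocab  = ["blue", "clear", "grey", "dark", "falling", "the"]
logits = [4.2, 2.8, 2.3, 1.9, 0.8, 1.2]

q = top_p_filter(softmax(logits), p=0.9)
print("greedy:", vocab[greedy(logits)])                            # greedy: blue
print("kept by top-p:", [w for w, qi in zip(vocab, q) if qi > 0])  # blue, clear, grey, dark
print("sampled:", vocab[sample_top_p(logits)])
```

With these values the four likeliest tokens cover about 95% of the probability, so top-p at 0.9 discards only the two least likely ones before sampling.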

So an "LLM" is not a database of answers. It is a single learned function for "what token is likely next," wrapped in a loop. The next chapters open up where that function's knowledge comes from — pretraining — and how it's shaped to be helpful.

04 The math

A distribution over the vocabulary

Given the tokens so far, the model outputs a logit $z_i$ for each token $i$ in the vocabulary. With temperature $T$, the probability of the next token is:

$$ P(\text{token} = i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$

As $T \to 0$ this collapses onto the single highest-logit token (greedy, deterministic); at $T = 1$ it is the model's raw distribution; as $T$ grows the probabilities flatten toward uniform — more random, eventually nonsense. Generating a whole sequence just multiplies these step by step:

$$ P(t_1, t_2, \dots, t_m) = \prod_{k=1}^{m} P\!\left(t_k \mid t_1, \dots, t_{k-1}\right) $$

Training maximizes exactly this probability on real text — equivalently, it minimizes the average cross-entropy (the negative log of the right token's probability). "Predict the next token" is the entire objective.
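
To make the equivalence concrete, the sketch below computes both quantities for one short sequence. The per-step probabilities are hypothetical numbers, standing in for what a model assigned to the correct token at each position; the function names are illustrative.

```python
import numpy as np

def sequence_log_prob(step_probs):
    """log P(t_1..t_m): the product of step probabilities, taken in log space."""
    return float(np.sum(np.log(step_probs)))

def avg_cross_entropy(step_probs):
    """Mean negative log-probability of the correct token, in nats."""
    return float(-np.mean(np.log(step_probs)))

# Hypothetical probability given to the *correct* token at each of 4 positions
step_probs = [0.5, 0.25, 0.8, 0.1]

log_p = sequence_log_prob(step_probs)
ce = avg_cross_entropy(step_probs)

print(f"sequence probability: {np.exp(log_p):.4f}")   # 0.5 * 0.25 * 0.8 * 0.1 = 0.0100
print(f"average cross-entropy: {ce:.4f} nats")
print(f"-log_p / m:            {-log_p / len(step_probs):.4f} nats")
```

The last two lines print the same number: the average cross-entropy is the negative log of the sequence probability divided by its length, so pushing one down pushes the other up.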

05 The code

Sampling, and the generation loop

Temperature sampling, then the autoregressive loop. Runnable — it's the whole idea.

generate.py

import numpy as np

def softmax(z, T=1.0):
    z = np.array(z, dtype=float) / T      # temperature divides the logits
    e = np.exp(z - z.max())               # subtract the max for numerical stability
    return e / e.sum()

# illustrative logits over a tiny vocabulary, after "The sky is"
vocab  = ["blue", "clear", "grey", "dark", "falling", "the"]
logits = [4.2,    2.8,     2.3,    1.9,    0.8,       1.2]

for T in [0.5, 1.0, 1.7]:
    p = softmax(logits, T)
    print(f"T={T}:", ", ".join(f"{w} {pi:.2f}" for w, pi in zip(vocab, p)))
# T=0.5: blue 0.91, clear 0.06, grey 0.02, dark 0.01, falling 0.00, the 0.00
# T=1.0: blue 0.63, clear 0.16, grey 0.09, dark 0.06, falling 0.02, the 0.03
# T=1.7: blue 0.43, clear 0.19, grey 0.14, dark 0.11, falling 0.06, the 0.07

def generate(next_logits, prompt, n, T=1.0):     # next_logits: context -> logit vector
    toks = list(prompt)
    for _ in range(n):
        p = softmax(next_logits(toks), T)
        toks.append(int(np.random.choice(len(p), p=p)))   # sample the next token
    return toks
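
To see the loop run end to end, it needs something to play the role of the model. The sketch below plugs in a stand-in: a hypothetical four-word vocabulary and a made-up bigram table (`BIGRAM`, `toy_next_logits`) that looks only at the last token. `softmax` and `generate` are repeated from the listing above so the block runs on its own.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.array(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(next_logits, prompt, n, T=1.0):
    toks = list(prompt)
    for _ in range(n):
        p = softmax(next_logits(toks), T)
        toks.append(int(np.random.choice(len(p), p=p)))
    return toks

vocab = ["the", "sky", "is", "blue"]

# Hypothetical bigram table: row = last token id, column = logit for the next id
BIGRAM = np.array([
    [0.0, 5.0, 0.0, 1.0],   # after "the":  mostly "sky"
    [0.0, 0.0, 5.0, 0.0],   # after "sky":  mostly "is"
    [1.0, 0.0, 0.0, 5.0],   # after "is":   mostly "blue"
    [5.0, 0.0, 0.0, 0.0],   # after "blue": mostly "the"
])

def toy_next_logits(toks):
    # A real model conditions on the whole context; this toy uses only the last token.
    return BIGRAM[toks[-1]]

np.random.seed(0)
out = generate(toy_next_logits, prompt=[0], n=6, T=0.1)
print(" ".join(vocab[t] for t in out))   # at T=0.1 effectively greedy: the sky is blue the sky is
```

Raising `T` toward 1 or beyond makes the off-pattern transitions show up, which is the same sharpen-or-flatten effect as in the printed distributions above.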

06 The economics

Sold by the token, one at a time

Prediction → money

Because generation is a loop, an LLM cannot produce its answer all at once: it must run a forward pass for every single token it writes, each one conditioned on all the tokens before it. That sequential dependency is why responses stream in token by token, and why latency and cost scale with output length. Every token is one trip through all $N$ parameters, at a cost of roughly $2N$ floating-point operations.
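
As a rough sketch of how compute grows with output length, the snippet below uses the common approximation of about 2 floating-point operations per parameter per generated token. It ignores attention cost over the context, and the parameter count is an assumed value chosen only for illustration.

```python
def decode_flops(n_params, n_output_tokens):
    """Approximate forward-pass compute: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params * n_output_tokens

N = 7e9   # assumed parameter count, for illustration only

for m in (10, 100, 1000):
    print(f"{m:>5} output tokens -> {decode_flops(N, m):.1e} FLOPs")
#    10 output tokens -> 1.4e+11 FLOPs
#   100 output tokens -> 1.4e+12 FLOPs
#  1000 output tokens -> 1.4e+13 FLOPs
```

The growth is linear: ten times the output means ten times the compute, and because the steps cannot overlap, roughly ten times the wait.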

This is the unit the whole industry is priced in. Providers bill per input and output token; output tokens usually cost more because each one needs its own sequential forward pass, whereas the input tokens of a prompt can be processed together in parallel. Multiply by hundreds of millions of users generating billions of tokens a day, and "predict the next token" becomes the single largest recurring workload in computing: the inference demand that, even more than training, is what the data centers are being built to serve.
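
Per-token billing reduces to simple arithmetic. The sketch below is illustrative only: the prices are hypothetical placeholders, not any provider's actual rates, and `request_cost` is a name invented for the example.

```python
def request_cost(n_in, n_out, price_in_per_m, price_out_per_m):
    """Dollar cost of one request, with prices quoted per million tokens."""
    return (n_in * price_in_per_m + n_out * price_out_per_m) / 1_000_000

# Hypothetical prices, in dollars per million tokens
PRICE_IN, PRICE_OUT = 1.00, 4.00

cost = request_cost(n_in=2_000, n_out=500,
                    price_in_per_m=PRICE_IN, price_out_per_m=PRICE_OUT)
print(f"${cost:.4f}")   # $0.0040
```

At these assumed rates the 500 output tokens cost as much as the 2,000 input tokens, which is the asymmetry the paragraph above describes.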

So the humble loop you just sped through is the business. The token is the product; the model is the factory; and the Circuit is the account of whether that factory's output ever justifies its cost.

07 Going deeper

The primary sources

Radford et al. (2019) — GPT-2 · language modeling as a path to general capability.
Brown et al. (2020) — GPT-3 · scale turns next-token prediction into few-shot learning.
Holtzman et al. (2019) — The Curious Case of Neural Text Degeneration · why top-p / nucleus sampling beats greedy.
Jurafsky & Martin — Speech and Language Processing · the language-modeling objective, from the ground up.

Cite this chapter: Divergent Compute, "What is an LLM?", First Principles, 2026. divergentcompute.com/first-principles-llm · v1.0 · CC-BY.
