First Principles / Part II · Models / Chapter 10
A pretrained base model knows language but has no manners — it just continues text. Fine-tuning and RLHF are the two phases that turn that raw predictor into a model that follows instructions, stays helpful, and refuses harm.
01 The answer, then the intuition
Out of pretraining you get a base model: brilliant at predicting text, useless as an assistant. Ask it "How do I reset my password?" and it might continue with a list of more questions — because that's what such text looks like on the web. It has the knowledge; it lacks the job description.
Two phases fix that. Supervised fine-tuning (SFT) shows it thousands of high-quality instruction→response examples, teaching it the shape of being helpful. Then RLHF — reinforcement learning from human feedback — has people rank competing answers, distills those preferences into a reward signal, and nudges the model toward replies humans actually prefer: clearer, more honest, and safe.
Walk one prompt through the three stages (base model, after SFT, after RLHF). The knowledge never changes, only the behaviour.
Illustrative responses showing what each alignment phase adds.
02 Mechanics
SFT trains on instruction → ideal response pairs written by humans. The model learns the assistant format: when it sees a question, it produces an answer, not more questions. The compute is cheap relative to pretraining, but the data is hand-written and expensive per token.
One subtlety worth holding onto: RLHF optimizes for what raters prefer, which is only a proxy for what's actually good. Push too hard and you get models that are confidently agreeable rather than correct. That gap is the central open problem the field calls alignment.
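The SFT step described above comes down to ordinary next-token cross-entropy on the demonstration. Below is a minimal NumPy sketch of that loss. It assumes (the chapter does not say this) that the loss is averaged over response tokens only, with prompt tokens masked out, which is a common convention rather than a requirement. The function name `sft_loss`, the toy shapes, and the random logits are all hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sft_loss(logits, targets, is_response):
    """Mean next-token cross-entropy over response positions only.

    logits:      (T, V) model scores for the next token at each position
    targets:     (T,)   the token that actually came next
    is_response: (T,)   True where the target token belongs to the response
    """
    logp = log_softmax(logits)
    token_nll = -logp[np.arange(len(targets)), targets]
    mask = is_response.astype(float)  # assumption: prompt tokens carry no loss
    return float((token_nll * mask).sum() / mask.sum())

# toy example: 5 positions, vocabulary of 4 tokens (all values hypothetical)
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))
targets = np.array([1, 3, 0, 2, 2])
is_response = np.array([False, False, True, True, True])

loss = sft_loss(logits, targets, is_response)
print(round(loss, 3))
```

Because of the mask, changing the model's predictions at prompt positions leaves the loss untouched: the model is graded only on how well it reproduces the human-written answer.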
04 The math
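The reward model turns "humans prefer A over B" into numbers via the Bradley–Terry model. If the reward model scores responses $r(A)$ and $r(B)$, the modeled probability that $A$ is preferred is:
$$P(A \succ B) = \sigma\big(r(A) - r(B)\big) = \frac{e^{r(A)}}{e^{r(A)} + e^{r(B)}}$$
where $\sigma(x) = 1/(1+e^{-x})$ is the logistic sigmoid.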
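The reward model is trained to maximize that probability on the human-labeled pairs, i.e. to minimize $-\log \sigma(r(A)-r(B))$ whenever a human chose $A$. The policy $\pi$ is then optimized to earn reward while staying near the reference (SFT) model $\pi_{\text{ref}}$:
$$\max_{\pi}\; \mathbb{E}_{y \sim \pi}\big[r(y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\text{ref}}\big)$$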
The first term pulls toward what humans like; the $\beta\,\mathrm{KL}$ term is the leash that stops it from degenerating into reward-hacking gibberish. Tuning $\beta$ is the whole art — too loose and the model games the reward, too tight and RLHF does nothing.
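To see the leash at work, here is a small sketch of the reward-minus-$\beta\,\mathrm{KL}$ trade-off for a single prompt with only three candidate responses. In that discrete toy case the best policy has a known closed form, $\pi^*(y) \propto \pi_{\text{ref}}(y)\,e^{r(y)/\beta}$, so no RL loop is needed. Real systems instead optimize a neural policy with an RL algorithm (InstructGPT used PPO). The probabilities, rewards, and function names below are hypothetical.

```python
import numpy as np

def objective(pi, pi_ref, r, beta):
    # expected reward minus beta * KL(pi || pi_ref)
    kl = np.sum(pi * np.log(pi / pi_ref))
    return float(np.sum(pi * r) - beta * kl)

def optimal_policy(pi_ref, r, beta):
    # closed-form maximizer in the discrete case: pi_ref * exp(r / beta), normalized
    # (subtracting r.max() only improves numerical stability)
    w = pi_ref * np.exp((r - r.max()) / beta)
    return w / w.sum()

pi_ref = np.array([0.5, 0.3, 0.2])  # hypothetical SFT probabilities for three responses
r = np.array([0.0, 1.0, 3.0])       # hypothetical reward-model scores

for beta in (0.1, 1.0, 10.0):
    pi = optimal_policy(pi_ref, r, beta)
    print(beta, np.round(pi, 3), round(objective(pi, pi_ref, r, beta), 3))
```

With a small $\beta$ nearly all probability lands on the highest-reward response regardless of what the SFT model thought of it. With a large $\beta$ the policy barely moves from $\pi_{\text{ref}}$.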
05 The code
How a single human comparison becomes a training loss for the reward model. Runnable.
reward.py
import numpy as np
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
# reward model's current scores for two candidate responses
r_chosen, r_rejected = 2.3, 0.8 # a human preferred the first one
p_agree = sigmoid(r_chosen - r_rejected) # model's prob the human's pick is better
loss = -np.log(p_agree) # Bradley-Terry preference loss
print(round(p_agree, 3)) # 0.818 — model mostly agrees with the human
print(round(loss, 3)) # 0.201 — small loss; a confident wrong pick costs far more
# training nudges r_chosen up and r_rejected down until the model ranks like people do
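The nudge in that last comment can be made concrete. The sketch below applies a few plain gradient-descent steps directly to the two scores, using the derivative of the Bradley–Terry loss. This is a simplification: a real reward model updates network weights, not the scores themselves, and the step size `lr` is a hypothetical value chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

r_chosen, r_rejected = 2.3, 0.8
lr = 0.5  # hypothetical step size

for step in range(3):
    p_agree = sigmoid(r_chosen - r_rejected)
    loss = -np.log(p_agree)
    # d(loss)/d(r_chosen) = -(1 - p_agree); d(loss)/d(r_rejected) = +(1 - p_agree)
    grad_chosen = -(1.0 - p_agree)
    grad_rejected = 1.0 - p_agree
    # gradient descent: r_chosen moves up, r_rejected moves down
    r_chosen -= lr * grad_chosen
    r_rejected -= lr * grad_rejected
    print(step, round(float(loss), 3), round(float(r_chosen), 3), round(float(r_rejected), 3))
```

The gradient has magnitude $1 - p_{\text{agree}}$, so the update shrinks as the model comes to agree with the human and is largest when the model confidently ranked the pair the wrong way round.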
06 The economics
Alignment → money
Fine-tuning is a rounding error next to pretraining in compute — but it is where a model becomes sellable. Nobody subscribes to a base model; people pay for the aligned assistant. The hundred-million-dollar pretraining asset only earns revenue after this comparatively cheap polishing step gives it manners.
The cost here is a different shape: not GPUs but people. SFT demonstrations and preference labels are written and ranked by human annotators, and high-quality human feedback at scale is a real, recurring line item — increasingly the scarce input as raw compute becomes plentiful. "Data" stops meaning scraped web text and starts meaning curated human judgment.
So alignment is the conversion step in the Circuit: the move that turns a capitalized training asset into a product with paying users. It's also where the quality questions live — whether the helpful, confident answers people are paying for are actually right is the thing our research desk keeps testing.
07 Going deeper
Ouyang et al. (2022) — InstructGPT · the SFT + RLHF recipe behind ChatGPT.
Bai et al. (2022) — Constitutional AI · using AI feedback to reduce the human-labeling burden.
Rafailov et al. (2023) — Direct Preference Optimization (DPO) · skipping the separate reward model.
Christiano et al. (2017) — Deep RL from Human Preferences · the original idea, before LLMs.
Cite this chapter: Divergent Compute, "Fine-tuning & RLHF", First Principles, 2026. divergentcompute.com/first-principles-finetuning · v1.0 · CC-BY.