Fine-tuning & RLHF

A pretrained base model knows language but has no manners: it just continues text. Fine-tuning and RLHF are the two phases that turn that raw predictor into a model that follows instructions, stays helpful, and refuses harmful requests.


Teaching a text-predictor to behave like an assistant

Out of pretraining you get a base model: brilliant at predicting text, useless as an assistant. Ask it "How do I reset my password?" and it might continue with a list of more questions — because that's what such text looks like on the web. It has the knowledge; it lacks the job description.

Two phases fix that. Supervised fine-tuning (SFT) shows it thousands of high-quality instruction→response examples, teaching it the shape of being helpful. Then RLHF (reinforcement learning from human feedback) has people rank competing answers, distills those preferences into a reward signal, and nudges the model toward replies humans actually prefer: clearer, more honest, and safer.

Base → SFT → RLHF: same prompt, evolving behaviour

[Interactive figure: illustrative responses to one prompt from the base model, after SFT, and after RLHF, showing what each alignment phase adds. The knowledge never changes; only the behaviour does.]

Two phases, two kinds of signal

SFT (supervised fine-tuning). Continue training the base model — same next-token objective — but now on a curated set of instruction → ideal response pairs written by humans. The model learns the assistant format: when it sees a question, produce an answer, not more questions. Cheap relative to pretraining, but the data is hand-written and expensive per token.
The reward model. Humans are shown two responses to the same prompt and pick the better one. A separate model is trained on thousands of these comparisons to output a scalar reward — a learned proxy for "what humans prefer." Ranking is far easier and more reliable for humans than writing perfect answers.
RLHF (policy optimization). The model is then optimized, typically with PPO, to produce responses the reward model scores highly, while a KL penalty keeps it from drifting too far from the sensible SFT model and "gaming" the reward. The more recent and simpler DPO targets the same KL-constrained objective by training directly on the preference pairs, with no separate reward model.
The result. Helpfulness, honesty, and harmlessness — the behaviour, not the knowledge. This is "alignment," and it's the difference between a research artifact and a product millions will pay for.
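The SFT step above keeps the pretraining objective (predict the next token) and only changes the data. A minimal sketch of that loss on a single instruction→response pair follows. The vocabulary, logits, and token ids are made up, the names `sft_loss` and `response_mask` are hypothetical, and scoring only the response tokens is a common convention assumed here rather than something the text specifies.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sft_loss(logits, targets, response_mask):
    """Mean next-token cross-entropy over response positions only.

    logits:        (seq_len, vocab) model scores at each position
    targets:       (seq_len,) id of the token that actually comes next
    response_mask: (seq_len,) 1.0 where the target belongs to the response
    """
    logp = log_softmax(logits)
    token_nll = -logp[np.arange(len(targets)), targets]
    return float((token_nll * response_mask).sum() / response_mask.sum())

# toy example: 5 positions, vocabulary of 4 tokens (all values are made up)
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))
targets = np.array([1, 3, 0, 2, 2])
# assumption: the first two targets are prompt tokens and are not scored
response_mask = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

print(sft_loss(logits, targets, response_mask))
```

With the mask in place, the model is graded on how well it writes the answer, not on how well it predicts the user's question.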

One subtlety worth holding onto: RLHF optimizes for what raters prefer, which is a proxy for what's actually good. Push too hard and you get models that are confidently agreeable rather than correct, a failure mode known as sycophancy and one face of the broader open problem the field calls alignment.

Preferences become a reward

The reward model turns "humans prefer A over B" into numbers via the Bradley–Terry model. If the reward model scores responses $r(A)$ and $r(B)$, the modeled probability that $A$ is preferred is:

$$ P(A \succ B) = \sigma\!\big(r(A) - r(B)\big) = \frac{1}{1 + e^{-(r(A)-r(B))}} $$

It's trained to maximize that probability on the human-labeled pairs, i.e. to minimize $-\log \sigma(r(A)-r(B))$ whenever a human chose $A$. Writing $r(x,y)$ for the reward of response $y$ to prompt $x$, the policy $\pi$ is then optimized to earn reward while staying near the reference (SFT) model $\pi_{\text{ref}}$:

$$ \max_{\pi}\; \mathbb{E}_{y\sim\pi}\big[r(x,y)\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi \,\|\, \pi_{\text{ref}}\big) $$

The first term pulls toward what humans like; the $\beta\,\mathrm{KL}$ term is the leash that stops it from degenerating into reward-hacking gibberish. Tuning $\beta$ is much of the art: too small (a loose leash) and the model games the reward; too large (a tight one) and RLHF barely changes anything.
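To see the leash at work, it helps to shrink the problem to one prompt with a handful of candidate responses. In that discrete setting the objective above has a well-known closed-form maximizer, $\pi^*(y) \propto \pi_{\text{ref}}(y)\, e^{r(y)/\beta}$. The sketch below computes it for a few values of $\beta$. All probabilities and rewards are made up, and the function names are hypothetical; real RLHF cannot enumerate responses like this, which is why an iterative optimizer such as PPO is used instead.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions with full support
    return float(np.sum(p * np.log(p / q)))

def objective(pi, pi_ref, rewards, beta):
    # expected reward minus the KL leash, for a single prompt
    return float(pi @ rewards) - beta * kl(pi, pi_ref)

def optimal_policy(pi_ref, rewards, beta):
    # closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)
    logits = np.log(pi_ref) + rewards / beta
    logits -= logits.max()  # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

# four candidate responses to one prompt (all numbers are made up)
pi_ref = np.array([0.50, 0.30, 0.15, 0.05])  # SFT model's probabilities
rewards = np.array([0.0, 1.0, 2.0, 6.0])     # last one stands in for a reward-model exploit

for beta in (0.1, 1.0, 10.0):
    pi = optimal_policy(pi_ref, rewards, beta)
    # columns: beta, optimized probabilities, distance from the SFT model
    print(beta, np.round(pi, 3), round(kl(pi, pi_ref), 3))
```

At small $\beta$ nearly all probability lands on the highest-reward response, however unlikely the SFT model found it; at large $\beta$ the policy stays close to $\pi_{\text{ref}}$ and the reward barely matters.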

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# reward model's current scores for two candidate responses
r_chosen, r_rejected = 2.3, 0.8  # a human preferred the first one

p_agree = sigmoid(r_chosen - r_rejected)  # model's prob the human's pick is better
loss = -np.log(p_agree)                   # Bradley-Terry preference loss

print(round(p_agree, 3))  # 0.818: model mostly agrees with the human
print(round(loss, 3))     # 0.201: small loss; a confident wrong pick costs far more
# training nudges r_chosen up and r_rejected down until the model ranks like people do
```
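The "nudging" can be made concrete with a few steps of gradient descent. The sketch below fits a tiny reward model to synthetic preference pairs using the same Bradley–Terry loss. Everything about the setup is an assumption for illustration: responses are 3-dimensional feature vectors, the reward model is linear, and the "human" labels come from a hidden linear score `true_w`. It is a stand-in for the much larger network used in practice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# hypothetical data: pairs of responses, labeled by a hidden linear preference
true_w = np.array([2.0, -1.0, 0.5])
a = rng.normal(size=(500, 3))
b = rng.normal(size=(500, 3))
a_wins = (a @ true_w) > (b @ true_w)
chosen = np.where(a_wins[:, None], a, b)
rejected = np.where(a_wins[:, None], b, a)
diff = chosen - rejected

w = np.zeros(3)  # reward model parameters: r(x) = w . x
lr = 0.5
for step in range(200):
    margin = diff @ w  # r(chosen) - r(rejected) for every pair
    # gradient of mean(-log sigmoid(margin)) with respect to w
    grad = -((1.0 - sigmoid(margin))[:, None] * diff).mean(axis=0)
    w -= lr * grad

loss = -np.log(sigmoid(diff @ w)).mean()
accuracy = (diff @ w > 0).mean()
print(loss, accuracy)  # loss starts at log(2), about 0.693, when w is all zeros
```

Each update raises the score of chosen responses and lowers the score of rejected ones, weighted by how wrong the model currently is about that pair; pairs it already ranks confidently contribute almost nothing.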

The cheap phase that creates all the value

Alignment → money

Fine-tuning is a rounding error next to pretraining in compute — but it is where a model becomes sellable. Nobody subscribes to a base model; people pay for the aligned assistant. The hundred-million-dollar pretraining asset only earns revenue after this comparatively cheap polishing step gives it manners.

The cost here is a different shape: not GPUs but people. SFT demonstrations and preference labels are written and ranked by human annotators, and high-quality human feedback at scale is a real, recurring line item — increasingly the scarce input as raw compute becomes plentiful. "Data" stops meaning scraped web text and starts meaning curated human judgment.

So alignment is the conversion step in the Circuit: the move that turns a capitalized training asset into a product with paying users. It's also where the quality questions live — whether the helpful, confident answers people are paying for are actually right is the thing our research desk keeps testing.
