Divergent Compute.AI Economic Think Tank

First Principles / Part II · Models / Chapter 10

Fine-tuning & RLHF

A pretrained base model knows language but has no manners — it just continues text. Fine-tuning and RLHF are the two phases that turn that raw predictor into a model that follows instructions, stays helpful, and refuses harm.

01 The answer, then the intuition

Teaching a text-predictor to behave like an assistant

Out of pretraining you get a base model: brilliant at predicting text, useless as an assistant. Ask it "How do I reset my password?" and it might continue with a list of more questions — because that's what such text looks like on the web. It has the knowledge; it lacks the job description.

Two phases fix that. Supervised fine-tuning (SFT) shows it thousands of high-quality instruction→response examples, teaching it the shape of being helpful. Then RLHF (reinforcement learning from human feedback) has people rank competing answers, distills those preferences into a reward signal, and nudges the model toward replies humans actually prefer: clearer, more honest, and safer.

Walk one prompt through the three stages and the knowledge never changes, only the behaviour:

[Interactive figure: Base → SFT → RLHF, same prompt, evolving behaviour. Illustrative responses showing what each alignment phase adds, in three panels: Base model, + SFT, + RLHF.]

03 Mechanics

Two phases, two kinds of signal

  • SFT (supervised fine-tuning). Continue training the base model — same next-token objective — but now on a curated set of instruction → ideal response pairs written by humans. The model learns the assistant format: when it sees a question, produce an answer, not more questions. Cheap relative to pretraining, but the data is hand-written and expensive per token.
  • The reward model. Humans are shown two responses to the same prompt and pick the better one. A separate model is trained on thousands of these comparisons to output a scalar reward — a learned proxy for "what humans prefer." Ranking is far easier and more reliable for humans than writing perfect answers.
  • RLHF (policy optimization). The model is then optimized, typically with PPO, to produce responses the reward model scores highly, while a KL penalty keeps it from drifting too far from the sensible SFT model and "gaming" the reward. The more recent and simpler DPO targets the same objective without a separate reward model by training directly on the preference pairs.
  • The result. Helpfulness, honesty, and harmlessness — the behaviour, not the knowledge. This is "alignment," and it's the difference between a research artifact and a product millions will pay for.
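
The SFT step in the list above can be sketched in a few lines. This is a minimal illustration rather than any lab's training code: the toy shapes, the random logits, and the `response_mask` convention (scoring only the response tokens, not the instruction) are assumptions added here; the text itself only says the objective is the same next-token loss used in pretraining.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sft_loss(logits, targets, response_mask):
    """Mean next-token negative log-likelihood over response positions only."""
    logp = log_softmax(logits)                            # (seq_len, vocab)
    token_nll = -logp[np.arange(len(targets)), targets]   # (seq_len,)
    return (token_nll * response_mask).sum() / response_mask.sum()

# toy example (hypothetical numbers): 5 positions, vocabulary of 4 tokens
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))             # stand-in for model outputs
targets = np.array([1, 3, 0, 2, 2])          # the human-written next token at each position
response_mask = np.array([0, 0, 1, 1, 1])    # first two positions belong to the instruction

random_loss = sft_loss(logits, targets, response_mask)
uniform_loss = sft_loss(np.zeros((5, 4)), targets, response_mask)
print(round(float(uniform_loss), 3))   # 1.386, i.e. ln(4): a model with no preference among 4 tokens
```

Training lowers this number by raising the probability of the demonstrated response tokens. Because the mask zeroes out the instruction positions, the model is graded on how it answers, not on how well it predicts the question.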

One subtlety worth holding onto: RLHF optimizes for what raters prefer, which is a proxy for what's actually good. Push too hard and you get models that are confidently agreeable rather than correct, a failure mode known as sycophancy and one instance of the reward over-optimization that remains an open problem in alignment.
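
For the DPO route mentioned in the list above, the per-pair loss from Rafailov et al. (2023) can be sketched as follows. The log-probabilities and the value `beta=0.1` are hypothetical numbers chosen for illustration; in practice each log-probability is the sum of token log-probabilities of a whole response under the policy or the frozen SFT reference.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # implicit reward of a response = beta * log(policy prob / reference prob)
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return np.log1p(np.exp(-margin))   # equals -log(sigmoid(margin))

# policy identical to the reference: zero margin, so the loss is ln(2)
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
print(round(float(at_reference), 3))   # 0.693

# policy has shifted probability toward the chosen response and away from the rejected one
after_shift = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(float(after_shift) < float(at_reference))   # True
```

The shape is the same Bradley–Terry comparison used for a reward model, except that the "score" of a response is how much more likely the policy makes it than the reference does. That is why no separate reward model needs to be trained.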

04 The math

Preferences become a reward

The reward model turns "humans prefer A over B" into numbers via the Bradley–Terry model. If the reward model scores responses $r(A)$ and $r(B)$, the modeled probability that $A$ is preferred is:

$$ P(A \succ B) = \sigma\!\big(r(A) - r(B)\big) = \frac{1}{1 + e^{-(r(A)-r(B))}} $$

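The reward model is trained to maximize that probability on the human-labeled pairs, i.e. to minimize $-\log \sigma(r(A)-r(B))$ whenever a human chose $A$. Writing $r(x,y)$ for the score of response $y$ to prompt $x$, with prompts drawn from a dataset $\mathcal{D}$, the policy $\pi$ is then optimized to earn reward while staying near the reference (SFT) model $\pi_{\text{ref}}$:

$$ \max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D}}\Big[\, \mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi(\cdot\mid x) \,\|\, \pi_{\text{ref}}(\cdot\mid x)\big) \Big] $$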

The first term pulls toward what humans like; the $\beta\,\mathrm{KL}$ term is the leash that stops the policy from degenerating into reward-hacking gibberish. Choosing $\beta$ is the key trade-off: too small (a loose leash) and the model games the reward; too large (a tight leash) and the policy barely moves from the SFT model.
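
The effect of $\beta$ can be seen on a toy case. For a single prompt with a finite set of candidate responses, the objective above has a known closed-form maximizer, $\pi^*(y) \propto \pi_{\text{ref}}(y)\, e^{r(y)/\beta}$. The sketch below assumes three candidate responses with made-up reference probabilities and rewards, where the third is a rare response the reward model happens to over-score.

```python
import numpy as np

def optimal_policy(ref_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite response set."""
    logits = np.log(ref_probs) + rewards / beta
    logits -= logits.max()               # stabilise the exponentials
    probs = np.exp(logits)
    return probs / probs.sum()

def kl(p, q):
    mask = p > 0                         # treat 0 * log(0) as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# hypothetical numbers for one prompt with three candidate responses
ref_probs = np.array([0.70, 0.25, 0.05])   # SFT model's probabilities
rewards = np.array([0.5, 1.0, 3.0])        # reward model's scores

for beta in (100.0, 1.0, 0.01):
    pi = optimal_policy(ref_probs, rewards, beta)
    print(beta, np.round(pi, 3), round(kl(pi, ref_probs), 3))
```

With a large $\beta$ the optimum is almost identical to the SFT model and the KL is near zero. With a tiny $\beta$ nearly all probability moves to the single highest-reward response, whatever the reference model thought of it. Real RLHF never solves this in closed form, since the response space is far too large, but the toy shows which way each end of the dial pushes.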

05 The code

The preference loss, in nine lines

How a single human comparison becomes a training loss for the reward model. Runnable.

reward.py

import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

# reward model's current scores for two candidate responses
r_chosen, r_rejected = 2.3, 0.8        # a human preferred the first one

p_agree = sigmoid(r_chosen - r_rejected)   # model's prob the human's pick is better
loss    = -np.log(p_agree)                 # Bradley-Terry preference loss

print(round(p_agree, 3))   # 0.818  — model mostly agrees with the human
print(round(loss, 3))      # 0.201  — small loss; a confident wrong pick costs far more

# training nudges r_chosen up and r_rejected down until the model ranks like people do
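
The "nudging" in that last comment can be made concrete with a few gradient steps. As a simplifying assumption, this sketch treats the two scores as free parameters and updates them directly; in a real reward model the same gradient would flow back into the network weights that produce the scores. The starting scores and the learning rate are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_step(r_chosen, r_rejected, lr=0.5):
    """One gradient-descent step on loss = -log(sigmoid(r_chosen - r_rejected))."""
    p_agree = sigmoid(r_chosen - r_rejected)
    grad_chosen = -(1.0 - p_agree)     # d loss / d r_chosen
    grad_rejected = 1.0 - p_agree      # d loss / d r_rejected
    return r_chosen - lr * grad_chosen, r_rejected - lr * grad_rejected

# start from a reward model that disagrees with the human
r_chosen, r_rejected = 0.8, 2.3
losses = []
for _ in range(20):
    losses.append(-np.log(sigmoid(r_chosen - r_rejected)))
    r_chosen, r_rejected = preference_step(r_chosen, r_rejected)

print(round(float(losses[0]), 3))   # 1.701, the cost of a confident wrong ranking
print(r_chosen > r_rejected)        # True, the ranking has flipped to match the human
```

The gradient is proportional to `1 - p_agree`, so pairs the model already ranks correctly barely move, while confident disagreements produce the largest updates.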

06 The economics

The cheap phase that creates all the value

Alignment → money

Fine-tuning is a rounding error next to pretraining in compute: Ouyang et al. (2022) report about 60 petaflop/s-days for their 175B RLHF model against 3,640 for GPT-3 pretraining, under 2%. But it is where a model becomes sellable. Nobody subscribes to a base model; people pay for the aligned assistant. The hundred-million-dollar pretraining asset only earns revenue after this comparatively cheap polishing step gives it manners.

The cost here is a different shape: not GPUs but people. SFT demonstrations and preference labels are written and ranked by human annotators, and high-quality human feedback at scale is a real, recurring line item — increasingly the scarce input as raw compute becomes plentiful. "Data" stops meaning scraped web text and starts meaning curated human judgment.

So alignment is the conversion step in the Circuit: the move that turns a capitalized training asset into a product with paying users. It's also where the quality questions live — whether the helpful, confident answers people are paying for are actually right is the thing our research desk keeps testing.

07 Going deeper

The primary sources

Ouyang et al. (2022) — InstructGPT · the SFT + RLHF recipe behind ChatGPT.
Bai et al. (2022) — Constitutional AI · using AI feedback to reduce the human-labeling burden.
Rafailov et al. (2023) — Direct Preference Optimization (DPO) · skipping the separate reward model.
Christiano et al. (2017) — Deep RL from Human Preferences · the original idea, before LLMs.

Cite this chapter: Divergent Compute, "Fine-tuning & RLHF", First Principles, 2026. divergentcompute.com/first-principles-finetuning · v1.0 · CC-BY.
