Prompting

Prompting

A prompt is a program — written in plain English, for a machine whose only instinct is to continue text. You don't change the weights; you change the context you condition them on. Done well, that's enough to steer the model completely.

Read at your depth: 01 The answer · 02 Intuition · 03 Mechanics · 04 The math · 05 The code · 06 The economics · 07 Sources

Programming without changing the program

Because a model just predicts the next token given everything before it, the tokens you supply are the program. The frozen weights are the interpreter; your prompt is the code. This is why the same model can write poetry, extract JSON, or debug Python — you're not switching models, you're switching the context that conditions its predictions.

Three levers do most of the work. Zero-shot is just an instruction. Few-shot adds worked examples so the model infers the exact pattern you want — remarkably, with no training at all. Chain-of-thought asks it to reason step by step, spending more tokens to think before answering. Switch between them on the same task and watch the output sharpen:

Prompt lab — one task, three techniques

Task: classify a review's sentiment and return strict JSON. Illustrative outputs; token counts are representative.

The prompt sent to the model

Illustrative output

prompt tokens: — vs zero-shot: —

Why context alone can steer a frozen model

Zero-shot. Just describe the task. The model relies entirely on what it learned in pretraining and alignment. Fast and cheap, but format and edge cases are hit-or-miss.
Few-shot (in-context learning). Put a few input→output examples in the prompt. The model infers the pattern and continues it — with no weight update. This was the headline surprise of GPT-3: examples in the context act like temporary training. It nails formats and conventions that are hard to describe in words.
Chain-of-thought. Ask it to "think step by step." By generating intermediate reasoning tokens before the answer, the model effectively does more computation — each token is another forward pass — which sharply improves multi-step and math problems.
System prompts & structure. A system message sets persistent role and rules; clear delimiters, explicit output schemas, and "return only JSON" instructions reduce ambiguity. You're shaping the probability distribution toward the tokens you want.

The craft is real but bounded: prompting can only elicit what's already in the weights. When the model simply lacks the knowledge or skill, no wording fixes it — that's when you reach for retrieval or fine-tuning (next chapters).

Everything a prompt does is condition the same distribution. The model samples the answer $y$ from:

$$ y \sim P(y \mid \text{prompt}) $$

Few-shot just makes the prompt longer — the examples $\{(x_i,y_i)\}$ are extra conditioning tokens, so "in-context learning" is Bayesian conditioning, not gradient descent. Nothing in the weights changes:

$$ P\big(y \mid x,\, (x_1,y_1),\dots,(x_k,y_k)\big) $$

Chain-of-thought factorizes the answer through an intermediate reasoning $r$, letting the model spend computation on the path before committing:

$$ P(y \mid x) = \sum_{r} P(y \mid r, x)\,P(r \mid x) $$

Generating $r$ token-by-token turns "think harder" into literal extra forward passes — more compute at inference time, which is why it helps on hard problems and costs more.

instr, per_example, output = 18, 22, 12 # representative token counts zero = instr + output # instruction only few3 = instr + 3*per_example + output # + three worked examples cot = instr + 45 + output # + a reasoning trace in the output print(f"zero-shot: {zero} tokens/call") print(f"3-shot: {few3} tokens/call ({few3/zero:.1f}x)") print(f"CoT: {cot} tokens/call ({cot/zero:.1f}x)") # zero-shot: 30 tokens/call # 3-shot: 96 tokens/call (3.2x) # CoT: 75 tokens/call (2.5x) <- better answers, more tokens, every call

The cheapest way to program — with a running meter

Prompting → money

Prompting is the cheapest possible way to customize a model: no training run, no data pipeline, just words — change it and redeploy in seconds. That's why most AI products start here. But the cost moves from upfront to per call: every example in a few-shot prompt and every step of chain-of-thought is more tokens, paid on every single request, forever.

At scale that arithmetic dominates. A prompt that's 3× longer is roughly 3× the inference bill across millions of calls — so serious teams trim prompts token by token, cache shared prefixes, and reserve chain-of-thought for the queries that truly need it. Prompt engineering is, underneath, cost engineering.

It also frames the build-vs-buy choice the next chapters unpack: prompting is a recurring per-token cost, while fine-tuning is an upfront cost that can shorten prompts later. For the Circuit, prompting is the demand side in miniature — the knob that turns a fixed model into useful work, one metered token at a time.

Programming without changing the program

Prompt lab — one task, three techniques

Why context alone can steer a frozen model

Conditioning, not learning

The price of a better prompt

The cheapest way to program — with a running meter

The primary sources