First Principles / Part IV · Building with AI / Chapter 21
A model's knowledge is frozen at training time and blind to your private or recent data. Retrieval-augmented generation fixes that without retraining: find the relevant documents, put them in the prompt, and let the model answer grounded in real, current text.
01 The answer, then the intuition
Ask a bare model about your company's refund policy and it will guess — confidently, and often wrong — because that fact was never in its training data. It's taking a closed-book exam on material it never studied. RAG turns it into an open-book exam: before answering, it looks up the relevant page and reads from it.
The lookup is by meaning, not keywords. The question and every document are turned into embeddings, and the closest chunks by cosine similarity are pulled into the prompt. Toggle retrieval on and off and watch the same model go from a plausible fabrication to a grounded, cited fact:
[Interactive demo: a tiny vector store of company facts, ranked by similarity to the question "What's our refund window?". Toggle retrieval and compare the answers.]
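The toggle can be sketched in code as the difference between two prompts. This is a minimal illustration, not the demo's actual implementation: `build_prompt` and its instruction wording are hypothetical, and the retrieved chunk is one of the toy facts from this chapter's store.

```python
def build_prompt(question, retrieved=None):
    """Return the prompt text; `retrieved` is a list of chunk strings or None."""
    if not retrieved:
        # Retrieval off: the model sees only the question (closed book).
        return f"Question: {question}\nAnswer:"
    # Retrieval on: number each chunk so the answer can cite it.
    sources = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}\nAnswer:"
    )


question = "What's our refund window?"
closed_book = build_prompt(question)
open_book = build_prompt(question, ["Refunds within 30 days of purchase."])

print(closed_book)
print("---")
print(open_book)
```

The model and its weights are identical in both cases. Only the text placed in front of it changes.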
02 Mechanics
RAG's limits are retrieval's limits: if the right chunk isn't found, or the chunking is poor, the model is back to guessing. Most "RAG doesn't work" stories are really "retrieval didn't return the right thing" — which is why the vector-search chapter matters.
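Chunking is one of the places where retrieval quality is won or lost. As a rough sketch of one common approach (fixed-size word windows with overlap, so a fact that straddles a boundary still appears whole in at least one chunk), consider the helper below. The function name `chunk_words` and the default sizes are assumptions chosen for illustration; the chapter does not prescribe a chunking method.

```python
def chunk_words(text, size=40, overlap=10):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 100-word document: w0 w1 ... w99
doc = " ".join(f"w{i}" for i in range(100))
chunks = chunk_words(doc, size=40, overlap=10)
print(len(chunks))
print(chunks[1].split()[:3])
```

Windows that are too small strip away the context a chunk needs to be understood; windows that are too large dilute the embedding with unrelated text. Either way, the right passage can fail to rank near the top.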
04 The math
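Every chunk $d_i$ and the query $q$ are embedded into the same space. Writing $\mathbf{q}$ and $\mathbf{d}_i$ for those embedding vectors, relevance is cosine similarity:
$$\text{score}(q, d_i) = \frac{\mathbf{q} \cdot \mathbf{d}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d}_i \rVert}$$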
Take the top-$k$ chunks $R = \text{top-}k_i\,\text{score}(q,d_i)$, and generate the answer conditioned on both the retrieved text and the question:
$$P(y \mid R, q)$$
Compared with the bare $P(y\mid q)$ from the last chapter, the only change is what's in the context — retrieval injects grounded evidence. That's the whole trick: RAG is prompting where the context is fetched, by meaning, at query time.
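As a worked example of the cosine score, take the toy three-dimensional vectors from this chapter's code listing: the query embedding $\mathbf{q} = (0.9, 0.1, 0.2)$ and the refund chunk's embedding $\mathbf{d}_1 = (0.88, 0.12, 0.15)$. The bold symbols for embedding vectors are a notational assumption made here for clarity, and the vectors are illustrative, not real model embeddings.

```latex
\begin{aligned}
\mathbf{q} \cdot \mathbf{d}_1 &= (0.9)(0.88) + (0.1)(0.12) + (0.2)(0.15) = 0.834 \\
\lVert \mathbf{q} \rVert &= \sqrt{0.81 + 0.01 + 0.04} = \sqrt{0.86} \approx 0.927 \\
\lVert \mathbf{d}_1 \rVert &= \sqrt{0.7744 + 0.0144 + 0.0225} = \sqrt{0.8113} \approx 0.901 \\
\text{score}(q, d_1) &= \frac{0.834}{0.927 \times 0.901} \approx 0.998
\end{aligned}
```

A score this close to 1 means the two vectors point in almost the same direction, which is why the refund chunk ranks first for this query.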
05 The code
Rank a tiny store by cosine similarity to a query and take the top-2. This is the heart of every RAG system.
retrieve.py
import numpy as np


def cos(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.array(a, float), np.array(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


query = [0.9, 0.1, 0.2]  # "refund window" (toy embedding)

# Chunk text -> toy embedding
store = {
    "Refunds within 30 days of purchase.": [0.88, 0.12, 0.15],
    "Shipping takes 3-5 business days.": [0.10, 0.90, 0.20],
    "Warranty covers defects for 1 year.": [0.30, 0.20, 0.85],
    "Returns accepted in original box.": [0.80, 0.25, 0.10],
}

# Sort chunks from most to least similar to the query.
ranked = sorted(store.items(), key=lambda kv: -cos(query, kv[1]))
for text, v in ranked:
    print(f"{cos(query, v):.3f}  {text}")

# 0.998  Refunds within 30 days of purchase.   <- retrieved
# 0.977  Returns accepted in original box.     <- retrieved
# 0.537  Warranty covers defects for 1 year.
# 0.256  Shipping takes 3-5 business days.
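With more than a handful of chunks, the same ranking is usually computed as one matrix product instead of a Python loop. The sketch below is one way to do that with NumPy, reusing the toy store from the listing; the helper name `top_k` and its signature are hypothetical.

```python
import numpy as np


def top_k(query, texts, vectors, k=2):
    """Return the k (score, text) pairs most similar to the query by cosine."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(vectors, dtype=float)
    # Normalize to unit length so a plain dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q                   # one score per stored chunk
    best = np.argsort(-scores)[:k]   # indices of the k highest scores
    return [(float(scores[i]), texts[i]) for i in best]


texts = [
    "Refunds within 30 days of purchase.",
    "Shipping takes 3-5 business days.",
    "Warranty covers defects for 1 year.",
    "Returns accepted in original box.",
]
vectors = [
    [0.88, 0.12, 0.15],
    [0.10, 0.90, 0.20],
    [0.30, 0.20, 0.85],
    [0.80, 0.25, 0.10],
]

hits = top_k([0.9, 0.1, 0.2], texts, vectors, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

Normalizing the stored vectors once at indexing time, instead of on every query, is a common refinement. It is left out here to keep the sketch short.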
06 The economics
Grounding → money
RAG is how a frozen model becomes an enterprise product. It makes knowledge current (update the store, not the weights), private (your data never enters training), and auditable (every claim can cite a source). For most business uses that trio beats fine-tuning, because company data changes daily and someone always needs to know where an answer came from.
The cost shifts to two places: the retrieval infrastructure (a vector database and the embedding pipeline) and, on every call, the tokens of injected context, which enlarge the prompt and its KV cache. So RAG trades a modest per-query token premium for accuracy and trust, and it scales cheaply because each document is embedded once when it is added or updated, a cost amortized over every future question.
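The per-query token premium is simple arithmetic: the number of injected chunks times the tokens per chunk times the input-token price. The sketch below makes that concrete. Every number in it (4 chunks, 250 tokens each, 2.00 per million input tokens) is a hypothetical placeholder, not a quoted price from any provider.

```python
def added_cost_per_query(k, tokens_per_chunk, price_per_million_input_tokens):
    """Extra input cost of injecting k retrieved chunks into one prompt."""
    added_tokens = k * tokens_per_chunk
    return added_tokens * price_per_million_input_tokens / 1_000_000


# Hypothetical numbers, for illustration only.
cost = added_cost_per_query(
    k=4, tokens_per_chunk=250, price_per_million_input_tokens=2.00
)
print(f"{cost:.4f}")  # 0.0020 currency units added per query
```

Raising $k$ or the chunk size raises this premium linearly, which is one reason to retrieve a few well-chosen chunks instead of many.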
This is close to home: grounding claims in citable sources is exactly what a research desk like the Circuit is for. The same discipline that makes RAG trustworthy — show the source, quote the evidence, let the reader check — is the discipline that separates analysis worth paying for from confident noise.
07 Going deeper
Lewis et al. (2020) — Retrieval-Augmented Generation · the original RAG paper.
Karpukhin et al. (2020) — Dense Passage Retrieval · embedding-based retrieval.
Gao et al. (2023) — RAG for LLMs: A Survey · the modern design space.
Brown et al. (2020) — GPT-3 · why in-context conditioning works at all.
Cite this chapter: Divergent Compute, "What is RAG?", First Principles, 2026. divergentcompute.com/first-principles-rag · v1.0 · CC-BY.