What is RAG?

What is RAG?

A model's knowledge is frozen at training time and blind to your private or recent data. Retrieval-augmented generation fixes that without retraining: find the relevant documents, put them in the prompt, and let the model answer grounded in real, current text.

Read at your depth: 01 The answer · 02 Intuition · 03 Mechanics · 04 The math · 05 The code · 06 The economics · 07 Sources

Open-book, not closed-book

Ask a bare model about your company's refund policy and it will guess — confidently, and often wrong — because that fact was never in its training data. It's taking a closed-book exam on material it never studied. RAG turns it into an open-book exam: before answering, it looks up the relevant page and reads from it.

The lookup is by meaning, not keywords. The question and every document are turned into embeddings, and the closest chunks by cosine similarity are pulled into the prompt. Toggle retrieval on and off and watch the same model go from a plausible fabrication to a grounded, cited fact:

RAG — grounding a frozen model, live

A tiny vector store of company facts. Toggle retrieval and compare the answers.

Question: "What's our refund window?"

Vector store · ranked by similarity to the question

Answer

The pipeline, end to end

Index (once, offline). Split your documents into chunks, turn each into an embedding vector, and store them in a vector database (next chapters). This is the "library" the model will consult.
Retrieve (per query). Embed the user's question with the same model, then find the chunks whose vectors are nearest — semantic search. Take the top-$k$ (here, top-2). Keyword search would miss "refund window" ↔ "refunds within 30 days"; embeddings match on meaning.
Augment. Paste the retrieved chunks into the prompt as context, with an instruction like "answer using only the sources below, and cite them."
Generate. The model answers conditioned on real text it can quote — so it's current, grounded, and auditable, and far less likely to hallucinate. Update the store and the answers update instantly; no retraining.

RAG's limits are retrieval's limits: if the right chunk isn't found, or the chunking is poor, the model is back to guessing. Most "RAG doesn't work" stories are really "retrieval didn't return the right thing" — which is why the vector-search chapter matters.

Every chunk $d_i$ and the query $q$ are embedded into the same space. Relevance is cosine similarity:

$$ \text{score}(q, d_i) = \cos(\mathbf{e}_q, \mathbf{e}_{d_i}) = \frac{\mathbf{e}_q \cdot \mathbf{e}_{d_i}}{\lVert \mathbf{e}_q\rVert\,\lVert \mathbf{e}_{d_i}\rVert} $$

Take the top-$k$ chunks $R = \text{top-}k_i\,\text{score}(q,d_i)$, and generate the answer conditioned on both the retrieved text and the question:

$$ y \sim P\big(y \mid R,\, q\big) $$

Compared with the bare $P(y\mid q)$ from the last chapter, the only change is what's in the context — retrieval injects grounded evidence. That's the whole trick: RAG is prompting where the context is fetched, by meaning, at query time.

import numpy as np def cos(a, b): a, b = np.array(a, float), np.array(b, float) return a @ b / (np.linalg.norm(a) * np.linalg.norm(b)) query = [0.9, 0.1, 0.2] # "refund window" (toy embedding) store = { "Refunds within 30 days of purchase.": [0.88, 0.12, 0.15], "Shipping takes 3-5 business days.": [0.10, 0.90, 0.20], "Warranty covers defects for 1 year.": [0.30, 0.20, 0.85], "Returns accepted in original box.": [0.80, 0.25, 0.10], } ranked = sorted(store.items(), key=lambda kv: -cos(query, kv[1])) for text, v in ranked: print(f"{cos(query, v):.3f} {text}") # 0.998 Refunds within 30 days of purchase. <- retrieved # 0.977 Returns accepted in original box. <- retrieved # 0.537 Warranty covers defects for 1 year. # 0.256 Shipping takes 3-5 business days.

Fresh, private, and auditable — for the price of a lookup

Grounding → money

RAG is how a frozen model becomes an enterprise product. It makes knowledge current (update the store, not the weights), private (your data never enters training), and auditable (every claim can cite a source). For most business uses that trio beats fine-tuning, because company data changes daily and someone always needs to know where an answer came from.

The cost shifts to two places: the retrieval infrastructure (a vector database and the embedding pipeline) and, on every call, the tokens of injected context — which enlarge the prompt and its KV cache. So RAG trades a modest per-query token premium for accuracy and trust, and it scales cheaply because indexing is a one-time cost amortized over every future question.

This is close to home: grounding claims in citable sources is exactly what a research desk like the Circuit is for. The same discipline that makes RAG trustworthy — show the source, quote the evidence, let the reader check — is the discipline that separates analysis worth paying for from confident noise.

Open-book, not closed-book

RAG — grounding a frozen model, live

The pipeline, end to end

Retrieve by similarity, condition on the result

Retrieval in a dozen lines

Fresh, private, and auditable — for the price of a lookup

The primary sources