Divergent Compute.AI Economic Think Tank

First Principles / Part IV · Building with AI / Chapter 21

What is RAG?

A model's knowledge is frozen at training time and blind to your private or recent data. Retrieval-augmented generation fixes that without retraining: find the relevant documents, put them in the prompt, and let the model answer grounded in real, current text.

01 The answer, then the intuition

Open-book, not closed-book

Ask a bare model about your company's refund policy and it will guess — confidently, and often wrong — because that fact was never in its training data. It's taking a closed-book exam on material it never studied. RAG turns it into an open-book exam: before answering, it looks up the relevant page and reads from it.

The lookup is by meaning, not keywords. The question and every document are turned into embeddings, and the closest chunks by cosine similarity are pulled into the prompt. Toggle retrieval on and off and watch the same model go from a plausible fabrication to a grounded, cited fact:

[Interactive demo: "RAG: grounding a frozen model, live". A tiny vector store of company facts is ranked by similarity to the question "What's our refund window?", and toggling retrieval changes the answer.]

02 Mechanics

The pipeline, end to end

  • Index (once, offline). Split your documents into chunks, turn each into an embedding vector, and store them in a vector database (next chapters). This is the "library" the model will consult.
  • Retrieve (per query). Embed the user's question with the same model, then find the chunks whose vectors are nearest — semantic search. Take the top-$k$ (here, top-2). Keyword search would miss "refund window" ↔ "refunds within 30 days"; embeddings match on meaning.
  • Augment. Paste the retrieved chunks into the prompt as context, with an instruction like "answer using only the sources below, and cite them."
  • Generate. The model answers conditioned on real text it can quote — so it's current, grounded, and auditable, and far less likely to hallucinate. Update the store and the answers update instantly; no retraining.
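
The index and augment steps above can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: `chunk_text`, `build_prompt`, the 40-word chunk size, and the exact prompt layout are assumptions made for the example, and the embedding and retrieval steps are left out because they depend on the embedding model you choose.

```python
def chunk_text(text, max_words=40):
    """Index step (simplified): split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_prompt(question, retrieved_chunks):
    """Augment step: number the retrieved chunks so the model can cite them as [1], [2], ..."""
    sources = "\n".join(f"[{n}] {chunk}" for n, chunk in enumerate(retrieved_chunks, start=1))
    return (
        "Answer using only the sources below, and cite them.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical output of the retrieve step for the refund question (top-2).
retrieved = [
    "Refunds within 30 days of purchase.",
    "Returns accepted in original box.",
]
prompt = build_prompt("What's our refund window?", retrieved)
print(prompt)
```

The string returned by `build_prompt` is what would be sent to the model in the generate step.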

RAG's limits are retrieval's limits: if the right chunk isn't found, or the chunking is poor, the model is back to guessing. Most "RAG doesn't work" stories are really "retrieval didn't return the right thing" — which is why the vector-search chapter matters.
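
One way to make a retrieval miss visible, rather than letting the model guess, is to withhold chunks whose similarity falls below a cutoff. The sketch below assumes the similarity scores have already been computed; the function name `select_context`, the toy scores in the "miss" case, and the 0.8 cutoff are illustrative assumptions, and a sensible cutoff depends on the embedding model.

```python
def select_context(scored_chunks, k=2, min_score=0.8):
    """Return up to k chunks, best first, dropping any scored below min_score.

    scored_chunks: list of (text, similarity) pairs.
    An empty result signals that retrieval found nothing relevant.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, score in ranked[:k] if score >= min_score]


# Toy scores for a question the store can answer...
hit = select_context([
    ("Refunds within 30 days of purchase.", 0.998),
    ("Shipping takes 3-5 business days.", 0.256),
])
# ...and for one it cannot: every chunk is far from the question.
miss = select_context([
    ("Warranty covers defects for 1 year.", 0.31),
    ("Shipping takes 3-5 business days.", 0.22),
])

print(hit)   # ['Refunds within 30 days of purchase.']
print(miss)  # []
```

An empty list gives the application a chance to say it has no source for the question instead of passing an ungrounded prompt to the model.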

04 The math

Retrieve by similarity, condition on the result

Every chunk $d_i$ and the query $q$ are embedded into the same space. Relevance is cosine similarity:

$$ \text{score}(q, d_i) = \cos(\mathbf{e}_q, \mathbf{e}_{d_i}) = \frac{\mathbf{e}_q \cdot \mathbf{e}_{d_i}}{\lVert \mathbf{e}_q\rVert\,\lVert \mathbf{e}_{d_i}\rVert} $$

Take the $k$ highest-scoring chunks, $R = \underset{d_i}{\text{top-}k}\;\text{score}(q, d_i)$, and generate the answer conditioned on both the retrieved text and the question:

$$ y \sim P\big(y \mid R,\, q\big) $$

Compared with the bare $P(y\mid q)$ from the last chapter, the only change is what's in the context — retrieval injects grounded evidence. That's the whole trick: RAG is prompting where the context is fetched, by meaning, at query time.
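
As a worked instance of the score, take the toy vectors from this chapter's code listing: $\mathbf{e}_q = (0.9, 0.1, 0.2)$ for the query and $\mathbf{e}_{d} = (0.88, 0.12, 0.15)$ for the refund chunk. These three-dimensional vectors are illustrative only; real embeddings have hundreds or thousands of dimensions.

$$ \mathbf{e}_q \cdot \mathbf{e}_{d} = (0.9)(0.88) + (0.1)(0.12) + (0.2)(0.15) = 0.834 $$

$$ \lVert \mathbf{e}_q \rVert = \sqrt{0.86} \approx 0.927, \qquad \lVert \mathbf{e}_{d} \rVert = \sqrt{0.8113} \approx 0.901 $$

$$ \text{score}(q, d) = \frac{0.834}{0.927 \times 0.901} \approx 0.998 $$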

05 The code

Retrieval in a dozen lines

Rank a tiny store by cosine similarity to a query and take the top-2. This is the heart of every RAG system.

retrieve.py

import numpy as np

def cos(a, b):
    a, b = np.array(a, float), np.array(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = [0.9, 0.1, 0.2]                          # "refund window" (toy embedding)
store = {
    "Refunds within 30 days of purchase.": [0.88, 0.12, 0.15],
    "Shipping takes 3-5 business days.":    [0.10, 0.90, 0.20],
    "Warranty covers defects for 1 year.":  [0.30, 0.20, 0.85],
    "Returns accepted in original box.":    [0.80, 0.25, 0.10],
}
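# Rank every chunk by similarity to the query, highest first.
ranked = sorted(store.items(), key=lambda kv: -cos(query, kv[1]))
top_k = ranked[:2]                               # the two chunks that would go into the prompt
for text, v in ranked:
    print(f"{cos(query, v):.3f}  {text}")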
# 0.998  Refunds within 30 days of purchase.   <- retrieved
# 0.977  Returns accepted in original box.     <- retrieved
# 0.537  Warranty covers defects for 1 year.
# 0.256  Shipping takes 3-5 business days.
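
The listing above calls `cos` once per chunk. A common variation, sketched here on the same toy data, stacks the embeddings into a matrix, normalizes the rows once, and gets every score from a single matrix-vector product. The names `top_k`, `matrix`, and `texts` are assumptions for this sketch, and it scans every row exhaustively; vector databases typically use approximate nearest-neighbor indexes instead.

```python
import numpy as np

def top_k(query, matrix, k=2):
    """Return (indices, scores) of the k rows of matrix most cosine-similar to query."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(matrix, dtype=float)
    q = q / np.linalg.norm(q)                          # unit-length query
    m = m / np.linalg.norm(m, axis=1, keepdims=True)   # unit-length rows
    scores = m @ q                                     # cosine similarity per row
    order = np.argsort(-scores)[:k]                    # highest scores first
    return order, scores[order]

texts = [
    "Refunds within 30 days of purchase.",
    "Shipping takes 3-5 business days.",
    "Warranty covers defects for 1 year.",
    "Returns accepted in original box.",
]
matrix = [
    [0.88, 0.12, 0.15],
    [0.10, 0.90, 0.20],
    [0.30, 0.20, 0.85],
    [0.80, 0.25, 0.10],
]

idx, scores = top_k([0.9, 0.1, 0.2], matrix, k=2)
for i, s in zip(idx, scores):
    print(f"{s:.3f}  {texts[i]}")
# 0.998  Refunds within 30 days of purchase.
# 0.977  Returns accepted in original box.
```

Because both sides are normalized to unit length, the dot product equals the cosine similarity, so the scores match the listing above.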

06 The economics

Fresh, private, and auditable — for the price of a lookup

Grounding → money

RAG is how a frozen model becomes an enterprise product. It makes knowledge current (update the store, not the weights), private (your data never enters training), and auditable (every claim can cite a source). For most business uses that trio beats fine-tuning, because company data changes daily and someone always needs to know where an answer came from.

The cost shifts to two places: the retrieval infrastructure (a vector database and the embedding pipeline) and, on every call, the tokens of injected context, which enlarge the prompt and its KV cache. So RAG trades a modest per-query token premium for accuracy and trust, and it scales cheaply because each document is indexed once and that cost is amortized over every future question.
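
A back-of-the-envelope sketch of that trade-off follows. Every number in it (chunk size, token price, indexing cost, query volume) is a hypothetical placeholder chosen for easy arithmetic, not a quote from any provider; substitute your own figures.

```python
# All figures below are hypothetical placeholders, not real provider prices.
K = 2                           # chunks injected per query
TOKENS_PER_CHUNK = 200          # average chunk length in tokens
PRICE_PER_1K_INPUT = 0.001      # dollars per 1,000 input tokens
INDEXING_COST = 5.00            # dollars to embed the corpus once
QUERIES = 100_000               # queries served from that index


def rag_premium_per_query(k, tokens_per_chunk, price_per_1k):
    """Extra input-token cost of the injected context on a single call."""
    return k * tokens_per_chunk * price_per_1k / 1000


premium = rag_premium_per_query(K, TOKENS_PER_CHUNK, PRICE_PER_1K_INPUT)
amortized_index = INDEXING_COST / QUERIES   # one-time cost spread over all queries

print(f"context premium per query: ${premium:.6f}")
print(f"indexing cost per query:   ${amortized_index:.6f}")
```

With these made-up inputs, the indexing share per query shrinks as query volume grows, while the context premium is paid again on every call.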

This is close to home: grounding claims in citable sources is exactly what a research desk like the Circuit is for. The same discipline that makes RAG trustworthy — show the source, quote the evidence, let the reader check — is the discipline that separates analysis worth paying for from confident noise.

07 Going deeper

The primary sources

Lewis et al. (2020) — Retrieval-Augmented Generation · the original RAG paper.
Karpukhin et al. (2020) — Dense Passage Retrieval · embedding-based retrieval.
Gao et al. (2023) — RAG for LLMs: A Survey · the modern design space.
Brown et al. (2020) — GPT-3 · why in-context conditioning works at all.

Cite this chapter: Divergent Compute, "What is RAG?", First Principles, 2026. divergentcompute.com/first-principles-rag · v1.0 · CC-BY.
