Divergent Compute.AI Economic Think Tank

First Principles / Part IV · Building with AI / Chapter 25

First Principles · Building with AI · 25

Vector search

Under every RAG system is one hard problem: find the handful of vectors nearest a query, among millions or billions. Comparing them all is too slow — so approximate nearest-neighbor search trades a sliver of accuracy for enormous speed.

Read at your depth:  01 The answer · 02 Intuition · 03 Mechanics · 04 The math · 05 The code · 06 The economics · 07 Sources

01The answer, then the intuition

Don't check everything — check the right neighborhood

An embedding turns every document into a point in high-dimensional space, where nearness means similar meaning. Retrieval is then just geometry: find the points closest to the query point. The naive way — measure the distance to every stored vector and sort — is exact but costs a full scan per query. At a billion vectors, that's hopeless.

The insight of approximate nearest-neighbor (ANN) search is that you don't need to check everything. Pre-build an index that lets you jump to the right neighborhood and examine only a few candidates. You give up a tiny chance of missing a true neighbor in exchange for searching orders of magnitude fewer points. Move the query below and watch exact scan versus approximate:

Nearest-neighbor search — exact vs approximate

Click or drag anywhere to move the query. The 3 nearest points light up; examined points are ringed.

points compared
correct top-3 found

02Mechanics

How you skip the full scan

  • The problem is scale. Exact search compares the query to all $N$ vectors, each of dimension $d$ — fine for thousands, ruinous for billions. Every RAG query would otherwise touch the entire store.
  • HNSW graphs. The dominant method builds a navigable "small-world" graph linking each vector to nearby ones, with a few long-range links layered on top. Search starts anywhere and greedily hops toward the query, reaching the neighborhood in roughly logarithmic steps — examining a few dozen points instead of billions.
  • IVF & quantization. Alternatives cluster the space into cells and search only the nearest cells (IVF), and product quantization compresses each vector into a few bytes so more fit in memory. These cut both time and storage, at some cost to precision.
  • The trade-off: recall. ANN can occasionally miss a true neighbor — measured as recall@k. Turning a knob (examine more candidates) raises recall toward 100% but costs speed. Vector databases — pgvector, FAISS, Qdrant, Pinecone, Weaviate — are largely engines for tuning this recall-speed-memory triangle.

So vector search is the unglamorous machinery that makes semantic retrieval possible at real scale — the reason "search by meaning over everything you own" is a product and not a research demo.

04The math

expand ▾

From linear to logarithmic

The $k$ nearest neighbors of a query $q$ minimize distance over the store:

$$ \text{kNN}(q) = \operatorname*{arg\,min}_{|S|=k}\; \sum_{d_i \in S} \lVert q - d_i \rVert $$

Exact search evaluates that distance for every vector — cost $O(N \cdot d)$, linear in the number of documents. An HNSW index changes the exponent: greedy graph traversal reaches the query's neighborhood in roughly

$$ O\big(d \cdot \log N\big) \quad\text{expected, per query} $$

The difference between $N$ and $\log N$ is the whole game: at a billion vectors, $\log_2 N \approx 30$, so ANN examines on the order of tens of candidates instead of a billion. The price is recall@k $= \frac{|\text{retrieved} \cap \text{true top-}k|}{k}$, which a good index keeps at 95–99% while searching a vanishing fraction of the data.

05The code

expand ▾

Exact search, and why it doesn't scale

Exact k-NN is a full scan. The comparison count is the whole reason ANN indexes exist.

knn.py

import numpy as np

q = np.array([5.0, 5.0])
store = {"A":[5.2,4.8], "B":[1.0,1.0], "C":[6.0,5.5], "D":[8.5,2.0],
         "E":[4.6,5.3], "F":[9.0,9.0], "G":[2.0,7.0], "H":[5.1,6.2]}

comparisons = 0
dists = {}
for name, p in store.items():
    comparisons += 1                       # exact = compare to EVERY vector
    dists[name] = np.linalg.norm(q - np.array(p))

top3 = sorted(dists, key=dists.get)[:3]
print(f"exact scan: {comparisons} comparisons")     # 8 (here) ... N at scale
print("3 nearest:", [(n, round(float(dists[n]), 2)) for n in top3])
# exact scan: 8 comparisons
# 3 nearest: [('A', 0.28), ('E', 0.5), ('C', 1.12)]
# an HNSW index would touch ~log(N) candidates and return the same 3

06The economics

The plumbing that makes AI memory affordable

Search → money

Vector search is why AI can have a usable memory at all. Without ANN, grounding a model in a large corpus would mean scanning every document on every query — computationally and financially impossible at scale. ANN turns that linear cost into a logarithmic one, and that single change is what makes retrieval cheap enough to put under every product. The vector database has become its own infrastructure category precisely because this problem is universal.

The economics live in the recall-speed-memory triangle. Tighter recall costs more compute and memory per query; looser recall is cheaper but risks missing the right chunk and sending the model back to guessing. Tuning that trade-off is a real operating decision, because retrieval quality caps the quality of everything built on top of it.

For the Circuit, this is the demand side's foundation: the same embeddings that opened this book close the loop here, as the searchable memory that lets AI reason over a company's — or a researcher's — entire corpus. It's the least visible layer, and one of the most load-bearing.

Part IV complete

You can now build with a model, not just describe one

From prompting a frozen model, through grounding it with retrieval, giving it tools to act, choosing among adaptation strategies, measuring quality with evals, and the vector search underneath it all — you've gone from theory to a working mental model of an AI application. Part I built the machine, Part II the model, Part III the cost of running it, and Part IV the craft of using it.

That's 25 chapters and four complete parts — the mechanics, end to end. What remains is the part only this think tank is built to tell: where these mechanics meet the money. See the full curriculum →

07Going deeper

expand ▾

The primary sources

Malkov & Yashunin (2016) — HNSW · the navigable small-world graph index.
Johnson et al. (2017) — Billion-scale similarity search (FAISS) · ANN at scale.
Jégou et al. (2011) — Product Quantization · compressing vectors for search.
ANN-Benchmarks · the recall-vs-speed trade-off, measured across libraries.

Cite this chapter: Divergent Compute, "Vector search", First Principles, 2026. divergentcompute.com/first-principles-vector-search · v1.0 · CC-BY.

← Chapter 24
Evals
Part V · Next →
Scaling laws