Divergent Compute.AI Economic Think Tank

First Principles / Part II · Models / Chapter 13

Quantization & distillation

A frontier model is enormous. Two techniques shrink it to run cheaply: quantization stores each weight in fewer bits, and distillation trains a small "student" to copy a big "teacher" — keeping most of the quality at a fraction of the size.

01 The answer, then the intuition

Same model, fewer bits — and a smaller copycat

From the parameters chapter you know the rule: memory = parameter count × bytes per parameter. The second factor is the lever. By default each weight is a 16-bit number, but most of that precision is wasted: the model works almost as well if each weight is squeezed into 8 bits, or even 4. That's quantization. It cuts weight memory in direct proportion to the bit-width, and the hardware bill falls with it.

Distillation attacks the first factor. Instead of compressing one model, you train a much smaller one to imitate the big model's outputs — learning from the teacher's full probability distribution, not just the right answer. The student ends up far smaller while inheriting much of the teacher's skill.

Quantization is the bigger everyday lever. Pick a model and drag the precision down — watch the hardware bill collapse:

[Interactive calculator: quantization. Memory = parameters × (bits / 8); each GPU is assumed to have 80 GB, with ~20% runtime overhead. Inputs: model and precision (4-bit, 8-bit, 16-bit, 32-bit; default 16-bit). Outputs: weights in memory, 80 GB GPUs needed, and the ratio vs 16-bit.]

02 Mechanics

Compressing the weights, and the knowledge

  • Quantization. Map each high-precision weight to a low-bit integer and store one shared scale per group of weights, so that w ≈ scale × round(w / scale). Going from 16 to 8 bits roughly halves weight memory for almost no quality loss; 4-bit (with careful methods like GPTQ or AWQ) cuts it to roughly a quarter of the 16-bit size with a small, often acceptable hit. It is still the same model, with its weights stored more coarsely.
  • Why it works. Neural networks are remarkably robust to noise in their weights. The signal lives in the overall pattern, not the 12th decimal place, so throwing away low-order bits costs surprisingly little.
  • Distillation. Run the big teacher on lots of inputs and record its full output distribution (the "soft labels"). Train a small student to match those distributions. The soft targets carry far more information than a single correct token, because the teacher reveals how confident it is across all options, so the student gets more signal from every example and can reach a given quality at a smaller size.
  • Used together. Modern small models are often distilled from a big teacher and then quantized — the two compressions compound, which is how genuinely capable models end up running on a laptop or phone.
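The quantization bullet above can be sketched in a few lines of plain Python. This is a minimal illustration, not how GPTQ or AWQ work: it assumes symmetric "absmax" scaling (the scale is set by the largest weight in the group), and the function names and sample weights are made up for the example.

```python
def quantize_group(weights, bits):
    """Symmetric absmax quantization of one group: returns (scale, integer codes)."""
    qmax = 2 ** (bits - 1) - 1            # largest code: 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_group(scale, codes):
    """Recover approximate weights: w_hat = scale * code."""
    return [scale * q for q in codes]


def max_error(weights, bits):
    """Worst absolute round-trip error in the group, plus the scale used."""
    scale, codes = quantize_group(weights, bits)
    restored = dequantize_group(scale, codes)
    return max(abs(w - r) for w, r in zip(weights, restored)), scale


group = [0.12, -0.53, 0.07, 0.91, -0.33, 0.004, -0.76, 0.25]  # made-up weights
for bits in (8, 4):
    err, scale = max_error(group, bits)
    print(f"{bits}-bit: scale={scale:.5f}  max error={err:.5f}")
```

Keeping one scale per small group, rather than one per tensor, stops a single outlier weight from inflating the step size for every other weight.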

Both are lossy. The craft is spending the loss where it doesn't matter — and measuring honestly whether the smaller model still does your task, rather than trusting a benchmark.

04 The math

Bits, scales, and soft targets

Memory for the weights is linear in the bit-width $b$:

$$ \text{memory} = N \times \frac{b}{8}\ \text{bytes} $$

Quantization to $b$ bits picks a scale $s$ for a group of weights and rounds each to the nearest representable level, then dequantizes on use:

$$ w_q = \mathrm{round}\!\left(\frac{w}{s}\right), \qquad \hat{w} = s\,w_q \approx w $$
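The equation leaves the choice of $s$ open. One common option, assumed here only for illustration, is symmetric "absmax" scaling: size the grid so the largest weight in the group lands on the largest signed $b$-bit level. Rounding to the nearest level then bounds the error on every weight in the group by half a step:

$$ s = \frac{\max_i |w_i|}{2^{\,b-1}-1}, \qquad |w - \hat{w}| \le \frac{s}{2} $$

For a group whose largest weight has magnitude $0.91$ (a made-up value), $b = 8$ gives $s = 0.91/127 \approx 0.0072$, while $b = 4$ gives $s = 0.91/7 = 0.13$. The 4-bit grid is roughly eighteen times coarser, which is one reason 4-bit calls for more careful methods than 8-bit.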

Distillation trains the student distribution $p_S$ to match the teacher's softened distribution $p_T$ (temperature $T$ flattens both), minimizing the KL divergence:

$$ \mathcal{L}_{\text{distill}} = \mathrm{KL}\!\big(p_T \,\|\, p_S\big) = \sum_i p_T(i)\,\log\frac{p_T(i)}{p_S(i)} $$
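The loss above does not spell out how the temperature enters. In the standard formulation from Hinton et al., both distributions are softmaxes over logits divided by $T$. The logit symbols $z^{\text{teacher}}$ and $z^{\text{student}}$ are introduced here for illustration. Note that the subscript in $p_T$ stands for "teacher", while the $T$ inside the exponent is the temperature:

$$ p_T(i) = \frac{\exp\!\big(z^{\text{teacher}}_i / T\big)}{\sum_j \exp\!\big(z^{\text{teacher}}_j / T\big)}, \qquad p_S(i) = \frac{\exp\!\big(z^{\text{student}}_i / T\big)}{\sum_j \exp\!\big(z^{\text{student}}_j / T\big)} $$

At $T = 1$ this is the ordinary softmax. As $T$ grows, the probabilities move toward uniform, so the teacher's low-probability options become easier for the student to see.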

The teacher's soft probabilities — "70% cat, 25% dog, 5% fox" — teach the student the structure a hard label ("cat") never could. That extra signal is why a small student can punch so far above its size.
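A small sketch makes the cat/dog/fox point concrete. It assumes the temperature-softmax form of distillation and uses two hypothetical students that both answer "cat" with the same confidence, so a hard label cannot tell them apart. All logits and names below are invented for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), skipping terms where p_i == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between the softened teacher and softened student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


def hard_label_loss(student_logits, correct_index):
    """Cross-entropy against a single correct answer (the 'hard label')."""
    return -math.log(softmax(student_logits)[correct_index])


# Teacher logits chosen so that softmax at T=1 gives 70% cat, 25% dog, 5% fox.
teacher_logits = [math.log(0.70), math.log(0.25), math.log(0.05)]
close_student = [1.0, 0.0, -1.5]   # hypothetical: ranks dog above fox, like the teacher
wrong_student = [1.0, -1.5, 0.0]   # hypothetical: same top answer, but fox above dog

for name, student in [("close", close_student), ("wrong", wrong_student)]:
    print(name,
          "hard-label loss:", round(hard_label_loss(student, 0), 4),
          "distill loss:", round(distill_loss(teacher_logits, student), 4))
```

Both students get the same hard-label loss, because they assign "cat" the same probability. Only the distillation loss penalizes the student that thinks a cat looks more like a fox than a dog.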

05 The code

The bill, bit by bit

The same 70B model at four precisions — memory and GPUs. This is the calculator above, in seven lines.

quantize.py

import math

# Weight memory in GB: N parameters at `bits` bits each (8 bits per byte, 1e9 bytes per GB).
def mem_gb(N, bits): return N * bits / 8 / 1e9
# Whole GPUs needed: add ~20% runtime overhead, then round up to 80 GB cards.
def gpus(m, cap=80, overhead=1.2): return math.ceil(m * overhead / cap)

N = 70e9                                  # a 70-billion-parameter model
for bits in [32, 16, 8, 4]:
    m = mem_gb(N, bits)
    print(f"{bits:>2}-bit: {m:6.1f} GB  ->  {gpus(m)} GPU(s)")
# Output:
# 32-bit:  280.0 GB  ->  5 GPU(s)
# 16-bit:  140.0 GB  ->  3 GPU(s)
#  8-bit:   70.0 GB  ->  2 GPU(s)
#  4-bit:   35.0 GB  ->  1 GPU(s)   <- a 70B model on a single GPU
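The listing above counts only the weight bits. Group-wise quantization also stores one shared scale per group, which adds a little to the effective bit-width. The sketch below estimates that overhead. The group size of 128 and the 16-bit scale are assumptions chosen for illustration, not values from this chapter, and real formats may store extra metadata such as zero-points.

```python
def effective_bits(bits, group_size=128, scale_bits=16):
    """Bits per weight once each group's shared scale is counted."""
    return bits + scale_bits / group_size


def mem_gb(n_params, bits_per_weight):
    """Weight memory in GB (1e9 bytes) at a given bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1e9


N = 70e9                                  # the same 70-billion-parameter model
for bits in (8, 4):
    eff = effective_bits(bits)
    print(f"{bits}-bit + scales: {eff:.3f} bits/weight -> {mem_gb(N, eff):.1f} GB")
# 8-bit + scales: 8.125 bits/weight -> 71.1 GB
# 4-bit + scales: 4.125 bits/weight -> 36.1 GB
```

Under these assumptions the scales cost about one extra gigabyte on a 70B model, so the headline numbers barely move. Smaller groups track the weights more closely but raise this overhead.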

06 The economics

The deflation underneath everything

Compression → money

Quantization and distillation are the deflationary force in AI. Every bit you drop is a proportional cut in weight memory, and with it a cut in the hardware and energy needed per token served (GPU counts fall in whole-GPU steps rather than smoothly). Distillation moves a capability from an expensive frontier model into one a fraction of the size. Together they are why the price of a given level of intelligence keeps falling fast even as the frontier rises.

This is the optimistic half of the Circuit's ledger. The build-out spends ever more on the frontier, but compression relentlessly drives down the cost of serving everything below it — pushing models onto single GPUs, laptops, and phones, where the marginal cost approaches electricity. The open tier rides this hardest.

It also sharpens the central tension. If a distilled, quantized open model captures most of the value at a tenth of the cost, what exactly is the frontier premium being paid for — and for how long? That question, asked with real numbers, is the whole point of this think tank.

Part II complete

You now know what a model is — and what it costs

From a base model learning language by guessing the next word, through alignment, the real differences between models, vision, and the compression that makes any of it affordable — you've followed the model from raw weights to a product with a price tag. Part I built the machine; Part II made it a model.

Part III opens the machinery that runs it: inference, GPUs, the memory wall, the KV cache, and the data-center cluster. It covers what it actually takes to serve a model to millions at once, and where the real costs live.

07 Going deeper

The primary sources

Hinton et al. (2015) — Distilling the Knowledge in a Neural Network · the soft-target idea.
Dettmers et al. (2022) — LLM.int8() · 8-bit inference for large models with no quality loss.
Frantar et al. (2022) — GPTQ · accurate 4-bit post-training quantization.
Dettmers et al. (2023) — QLoRA · fine-tuning a 4-bit model on a single GPU.

Cite this chapter: Divergent Compute, "Quantization & distillation", First Principles, 2026. divergentcompute.com/first-principles-quantization · v1.0 · CC-BY.
