First Principles / Part III · Inference & systems / Chapter 15
01 The answer, then the intuition
You felt this in the last chapter: decode writes one token per forward pass, and each pass has to read the entire model out of memory. So the speed limit isn't how fast the GPU can compute — it's how fast it can move the weights. A bigger model means more bytes to stream per token, which means fewer tokens per second. It's almost that simple.
Two numbers matter to a user: time to first token (how long until the answer starts, set by prefill) and tokens per second (how fast it then types, set by decode). The calculator below computes the second from first principles. Pick a model size and drag the precision; watch a 70B model crawl at 16-bit, and watch quantization to 4-bit roughly quadruple its speed:
Single stream on one GPU with ~3.35 TB/s memory bandwidth (H100-class). tok/s ≈ bandwidth ÷ (params × bytes per param).
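As a rough check of that formula, here is the arithmetic for the 70B case. It assumes 16-bit weights (2 bytes per parameter) and counts only weight traffic, so treat it as an upper bound rather than a measured figure:

$$\text{bytes per token} \approx 70 \times 10^{9} \times 2 = 1.4 \times 10^{11}, \qquad \text{tok/s} \approx \frac{3.35 \times 10^{12}}{1.4 \times 10^{11}} \approx 23.9$$

At 4 bits (0.5 bytes per parameter) the same division gives about 95.7 tok/s, four times the 16-bit figure.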
02 Mechanics
End-to-end latency ≈ TTFT + (tokens × per-token time). A long answer from a big model is slow on both counts.
So "tokens per second" is two metrics wearing one name: a per-user speed bounded by bandwidth, and an aggregate throughput bounded by compute once the batch is full. Quantization helps the first; batching helps the second.
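The sum of TTFT and per-token time can be sketched in a few lines. This is a minimal illustration, not a measurement: `ASSUMED_TTFT_S` is a hypothetical placeholder (real TTFT depends on prompt length and hardware), and the per-token time uses the chapter's bandwidth-only estimate.

```python
BW = 3.35e12  # bytes/s, the H100-class memory bandwidth used in this chapter

def per_token_time(n_params, bits):
    """Seconds per decoded token if each step streams the whole model once."""
    return n_params * (bits / 8) / BW

def end_to_end_latency(ttft_s, n_tokens, n_params, bits):
    """TTFT + (tokens x per-token time), in seconds."""
    return ttft_s + n_tokens * per_token_time(n_params, bits)

ASSUMED_TTFT_S = 0.5  # hypothetical value, chosen only for illustration

for bits in (16, 8, 4):
    total = end_to_end_latency(ASSUMED_TTFT_S, 1000, 70e9, bits)
    print(f"70B {bits:2d}-bit, 1000-token reply: {total:5.1f} s")
```

With these assumptions the decode term dominates: for a 1000-token reply, halving the bits roughly halves the total, while the assumed TTFT barely registers.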
04 The math
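Each decode step streams roughly the whole model, $N$ parameters at $b/8$ bytes each (with $b$ the bits per parameter), from memory. With memory bandwidth $\text{BW}$ (bytes/sec), the time per token and the rate are:

$$t_{\text{token}} \approx \frac{N \cdot b/8}{\text{BW}}, \qquad \text{tok/s} \approx \frac{1}{t_{\text{token}}} = \frac{\text{BW}}{N \cdot b/8}$$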
End-to-end latency for a prompt of $P$ tokens and an answer of $G$ tokens:

$$T_{\text{total}} \approx \text{TTFT}(P) + G \cdot \frac{N \cdot b/8}{\text{BW}}$$
The roofline idea: decode lives on the memory-bandwidth slope, so halving the bytes (quantize) roughly doubles the rate. Batching $B$ requests reuses the same weight-load across all of them, so aggregate throughput rises toward $B \times$ — until the work becomes compute-bound and hits the GPU's FLOPs ceiling instead. Inference engineering is the art of climbing that roofline.
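The batching argument can be sketched as a two-term roofline: each decode step takes whichever is longer, loading the weights once or doing the arithmetic for every request in the batch. The sketch below makes several assumptions that are not from this chapter: `PEAK_FLOPS` is a round illustrative number rather than a datasheet figure, about 2 FLOPs per parameter per token is a common rule of thumb, and KV-cache traffic is ignored entirely.

```python
BW = 3.35e12          # bytes/s, the H100-class bandwidth used in this chapter
PEAK_FLOPS = 1.0e15   # ASSUMPTION: round illustrative peak compute, FLOP/s
FLOPS_PER_PARAM = 2   # ASSUMPTION: ~one multiply and one add per weight per token

def decode_step_time(n_params, bits, batch):
    """Seconds per decode step: the slower of weight streaming and compute."""
    memory_s = n_params * (bits / 8) / BW                        # weights read once per step
    compute_s = batch * FLOPS_PER_PARAM * n_params / PEAK_FLOPS  # work grows with batch
    return max(memory_s, compute_s)

def aggregate_tok_per_sec(n_params, bits, batch):
    """Total tokens per second across all requests in the batch."""
    return batch / decode_step_time(n_params, bits, batch)

for batch in (1, 8, 64, 512, 2048):
    agg = aggregate_tok_per_sec(70e9, 16, batch)
    print(f"batch {batch:4d}: {agg:8.1f} tok/s total, {agg / batch:5.1f} per stream")
```

Under these assumptions, aggregate throughput scales linearly with batch size while the step is memory-bound, then flattens once compute time exceeds weight-load time; past that point, adding requests only lowers each stream's speed.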
05 The code
Single-stream decode rate for three model sizes at three precisions, on an H100-class GPU.
tokens_per_sec.py
BW = 3.35e12  # H100 HBM3 memory bandwidth, ~3.35 TB/s

def toks_per_sec(N, bits):
    bytes_streamed = N * (bits / 8)  # ~whole model moved per token
    return BW / bytes_streamed

for N, nm in [(7e9, "7B"), (70e9, "70B"), (175e9, "175B")]:
    rates = " ".join(f"{b}-bit {toks_per_sec(N, b):5.1f}" for b in [16, 8, 4])
    print(f"{nm:5s}: {rates} tok/s")
# 7B   : 16-bit 239.3 8-bit 478.6 4-bit 957.1 tok/s
# 70B  : 16-bit  23.9 8-bit  47.9 4-bit  95.7 tok/s
# 175B : 16-bit   9.6 8-bit  19.1 4-bit  38.3 tok/s

print(f"70B 16-bit, 1000-token reply: {1000 / toks_per_sec(70e9, 16):.1f} s")  # 41.8 s
06 The economics
tokens/sec → money
A GPU costs the same per hour whether it serves one user or a hundred. So the number that decides whether inference makes money is tokens per second per GPU — and therefore tokens per dollar. Everything in this chapter is a lever on that number: quantization raises per-user speed, batching raises aggregate throughput, and both cut the cost of every token served.
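To make the step from tokens per second to tokens per dollar concrete, here is a minimal sketch. The GPU price and the batched aggregate rate are hypothetical placeholders, not figures from this chapter; the two single-stream rates are the 70B numbers computed in the code section.

```python
def cost_per_million_tokens(gpu_dollars_per_hour, tok_per_sec):
    """Dollars of GPU time spent per one million generated tokens."""
    tokens_per_hour = tok_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

ASSUMED_GPU_PRICE = 2.00  # hypothetical $/GPU-hour, for illustration only

scenarios = [
    ("70B 16-bit, single stream", 23.9),    # rate from the chapter's code
    ("70B 4-bit, single stream", 95.7),     # rate from the chapter's code
    ("70B batched (hypothetical)", 2000.0), # ASSUMED aggregate rate, not measured
]
for label, rate in scenarios:
    cost = cost_per_million_tokens(ASSUMED_GPU_PRICE, rate)
    print(f"{label:28s}: ${cost:6.2f} per million tokens")
```

Because the hourly price is fixed, cost per token is inversely proportional to tokens per second per GPU: any lever that multiplies the rate divides the cost by the same factor.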
This is why providers obsess over batching and why they price output tokens the way they do: a memory-bound GPU running a single user is mostly idle silicon being paid for. Filling it is the difference between a gross margin and a loss. The same model can be a profitable product or a money pit depending entirely on how well its operator climbs this roofline.
For the Circuit, tokens/sec/dollar is the efficiency term in the whole equation: as it improves — through better hardware, lower precision, and smarter serving — the cost side of the ledger falls, buying the build-out more time for revenue to catch up. It is the most quietly important number in AI economics.
07 Going deeper
Pope et al. (2022) — Efficiently Scaling Transformer Inference · latency vs throughput trade-offs.
Databricks — LLM Inference Performance Engineering · TTFT, tokens/sec, batching in practice.
NVIDIA — GPU Performance Background (roofline) · memory-bound vs compute-bound.
MLPerf Inference · the standard latency/throughput benchmark.
Cite this chapter: Divergent Compute, "Latency, throughput, tokens/sec", First Principles, 2026. divergentcompute.com/first-principles-latency · v1.0 · CC-BY.