Latency, throughput, tokens/sec

How fast is an AI model? The honest answer is a formula. Because decode is memory-bound, a model's speed is set by how fast it can stream its weights out of memory — tokens per second ≈ bandwidth ÷ model bytes.


Speed is bandwidth divided by size

You felt this in the last chapter: decode writes one token per forward pass, and each pass has to read the entire model out of memory. So the speed limit isn't how fast the GPU can compute — it's how fast it can move the weights. A bigger model means more bytes to stream per token, which means fewer tokens per second. It's almost that simple.

Two numbers matter to a user: time to first token (how long until the answer starts — set by prefill) and tokens per second (how fast it then types — set by decode). The calculator below computes the second from first principles. Pick a model size and drag the precision; watch a 70B model crawl, and watch quantization nearly quadruple it:

Tokens/sec calculator: the memory-bound speed limit

Single stream on one GPU with ~3.35 TB/s memory bandwidth (H100-class). tok/s ≈ bandwidth ÷ (params × bytes).

[Interactive calculator. Inputs: model size and precision (16-bit, 8-bit, or 4-bit). Outputs: decode tokens/sec, time for a 1,000-token reply, and bytes streamed per token.]

Latency, throughput, and the tension between them

Time to first token (TTFT). Set by prefill — how long to process your whole prompt before the first word appears. Grows with prompt length; this is the "thinking…" pause.
Inter-token latency / tokens-per-second. Set by decode. Each token streams the model's weights from memory, so per-user speed ≈ memory bandwidth ÷ model bytes. This is the steady typing rate you watch.
End-to-end latency. TTFT + (output tokens × per-token time). A long answer from a big model is slow twice over: there are more tokens to generate, and each one takes longer.
Throughput. The system's total tokens/sec across all users at once. Here's the twist: because one user's decode barely uses the GPU's compute, you can run many users' requests in the same forward pass — batching — and total throughput climbs almost for free, even though each individual user's speed is unchanged or slightly slower.
The tension. Latency is one user's experience; throughput is the system's efficiency. Bigger batches raise throughput (and lower cost per token) but can raise latency. Tuning that trade-off is the central job of an inference team.
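To make these definitions concrete, here is a minimal sketch of how the metrics could be derived from the arrival time of each streamed token. The function names, the `StreamMetrics` container, and the trace (first token at 0.5 s, then one token every 40 ms) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float          # time to first token
    inter_token_s: float   # mean gap between consecutive tokens
    tokens_per_sec: float  # per-user decode rate, excluding the TTFT wait
    end_to_end_s: float    # request start to last token


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Derive per-user latency metrics from token arrival timestamps (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a decode rate")
    ttft = token_times[0] - request_start
    gaps = len(token_times) - 1
    inter_token = (token_times[-1] - token_times[0]) / gaps
    end_to_end = token_times[-1] - request_start
    return StreamMetrics(ttft, inter_token, 1.0 / inter_token, end_to_end)


def aggregate_throughput(streams: Sequence[Sequence[float]], window_start: float) -> float:
    """System throughput: every user's tokens over the shared wall-clock window."""
    total_tokens = sum(len(s) for s in streams)
    window_end = max(s[-1] for s in streams)
    return total_tokens / (window_end - window_start)


# Hypothetical trace: first token after 0.5 s, then one token every 40 ms.
times = [0.5 + 0.04 * i for i in range(101)]
m = stream_metrics(0.0, times)
print(f"TTFT {m.ttft_s:.2f} s, {m.tokens_per_sec:.1f} tok/s, end-to-end {m.end_to_end_s:.2f} s")

# Eight users served in the same batch, all seeing the same trace.
print(f"aggregate: {aggregate_throughput([times] * 8, 0.0):.1f} tok/s")
```

The per-user rate is the same for each of the eight streams, while the aggregate figure is roughly eight times larger: the same pair of numbers that "tokens per second" can refer to.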

So "tokens per second" is two metrics wearing one name: a per-user speed bounded by memory bandwidth, and an aggregate throughput that batching raises until compute becomes the limit. Quantization helps the first; batching helps the second.

The roofline of decode

Each decode step streams roughly the whole model from memory: $N$ parameters at $b$ bits ($b/8$ bytes) each. With memory bandwidth $\text{BW}$ (bytes/sec), the time per token and the rate are:

$$ t_{\text{token}} \approx \frac{N \cdot (b/8)}{\text{BW}}, \qquad \text{tokens/sec} \approx \frac{\text{BW}}{N \cdot (b/8)} $$

End-to-end latency for a prompt of $P$ and an answer of $G$ tokens:

$$ \text{latency} \approx \underbrace{t_{\text{prefill}}(P)}_{\text{TTFT}} \;+\; G \cdot t_{\text{token}} $$
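As a quick numerical check of the latency formula, the sketch below plugs in the chapter's 3.35 TB/s bandwidth and a 70B model. The TTFT value is an assumption: the text gives no prefill time, so `ASSUMED_TTFT_S` is a placeholder, and the function names are illustrative.

```python
BW = 3.35e12  # bytes/sec, the H100-class bandwidth used in this chapter


def token_time(n_params: float, bits: int, bw: float = BW) -> float:
    """Seconds per decoded token when each step streams the whole model once."""
    return n_params * (bits / 8) / bw


def end_to_end_latency(ttft_s: float, n_generated: int, n_params: float, bits: int) -> float:
    """latency ≈ TTFT + G * t_token, with G = n_generated."""
    return ttft_s + n_generated * token_time(n_params, bits)


ASSUMED_TTFT_S = 0.5  # hypothetical prefill time; not a figure from the text

for bits in (16, 8, 4):
    total = end_to_end_latency(ASSUMED_TTFT_S, 1000, 70e9, bits)
    print(f"70B at {bits:2d}-bit, 1,000-token answer: {total:5.1f} s")
```

With these inputs the decode term dominates: at 16-bit the answer takes about 42 s, of which only the assumed 0.5 s is TTFT, so shrinking the bytes per token moves the total far more than shaving prefill would.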

The roofline idea: decode lives on the memory-bandwidth slope, so halving the bytes (quantize) roughly doubles the rate. Batching $B$ requests reuses the same weight-load across all of them, so aggregate throughput rises toward $B \times$ — until the work becomes compute-bound and hits the GPU's FLOPs ceiling instead. Inference engineering is the art of climbing that roofline.
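The roofline argument can be sketched as a two-term model: a decode step takes as long as the slower of "stream the weights once" and "do the arithmetic for every request in the batch". This is a simplification under stated assumptions, not a serving benchmark. `PEAK_FLOPS` is an assumed round number for an H100-class GPU, the 2 FLOPs per parameter per token is a common rule of thumb rather than a figure from the text, and KV-cache and activation traffic are ignored.

```python
BW = 3.35e12         # bytes/sec, memory bandwidth from this chapter
PEAK_FLOPS = 1.0e15  # FLOP/s; ASSUMED round number for an H100-class GPU
FLOPS_PER_PARAM = 2  # ASSUMED: one multiply and one add per weight per token


def step_time(n_params: float, bits: int, batch: int) -> float:
    """Seconds for one decode step that emits one token per request in the batch."""
    memory_s = n_params * (bits / 8) / BW                        # weights read once per step
    compute_s = FLOPS_PER_PARAM * n_params * batch / PEAK_FLOPS  # arithmetic grows with batch
    return max(memory_s, compute_s)


def rates(n_params: float, bits: int, batch: int) -> tuple[float, float]:
    """Return (per-user tok/s, aggregate tok/s) for a given batch size."""
    t = step_time(n_params, bits, batch)
    return 1.0 / t, batch / t


def crossover_batch(bits: int) -> float:
    """Batch size at which compute time equals memory time (model size cancels out)."""
    return PEAK_FLOPS * (bits / 8) / (FLOPS_PER_PARAM * BW)


for b in (1, 8, 64, 256, 1024):
    per_user, aggregate = rates(70e9, 16, b)
    print(f"batch {b:4d}: {per_user:5.1f} tok/s per user, {aggregate:7.0f} tok/s aggregate")

print(f"16-bit crossover at batch ≈ {crossover_batch(16):.0f}")
```

Under these assumptions the per-user rate stays flat while the batch is below the crossover (about 300 at 16-bit), and aggregate throughput scales with the batch. Past it, aggregate throughput plateaus at the compute ceiling and each user's rate starts to fall.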

Speed, from the bandwidth up

    BW = 3.35e12  # H100 HBM3 memory bandwidth, ~3.35 TB/s (bytes/sec)

    def toks_per_sec(N, bits):
        """Memory-bound decode rate for N parameters stored at `bits` bits each."""
        bytes_streamed = N * (bits / 8)  # ~whole model moved per token
        return BW / bytes_streamed

    for N, nm in [(7e9, "7B"), (70e9, "70B"), (175e9, "175B")]:
        rates = " ".join(f"{b}-bit {toks_per_sec(N, b):5.1f}" for b in [16, 8, 4])
        print(f"{nm:5s}: {rates} tok/s")
    # 7B   : 16-bit 239.3 8-bit 478.6 4-bit 957.1 tok/s
    # 70B  : 16-bit  23.9 8-bit  47.9 4-bit  95.7 tok/s
    # 175B : 16-bit   9.6 8-bit  19.1 4-bit  38.3 tok/s

    print(f"70B 16-bit, 1000-token reply: {1000 / toks_per_sec(70e9, 16):.1f} s")  # 41.8 s

Throughput is the unit of margin

tokens/sec → money

A GPU costs the same per hour whether it serves one user or a hundred. So the number that decides whether inference makes money is tokens per second per GPU — and therefore tokens per dollar. Everything in this chapter is a lever on that number: quantization raises per-user speed, batching raises aggregate throughput, and both cut the cost of every token served.
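The step from tokens per second to tokens per dollar is plain arithmetic, sketched below. The GPU price is a hypothetical placeholder, not a quoted rate, and the batched case assumes that 64 users still fit under the memory-bound limit so that aggregate throughput scales with the batch.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars of GPU time spent per one million generated tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6


ASSUMED_GPU_PRICE = 2.00  # hypothetical $/GPU-hour, for illustration only
SINGLE_STREAM = 23.9      # tok/s: 70B at 16-bit, single user (from this chapter)
ASSUMED_BATCH = 64        # hypothetical batch size, assumed still memory-bound

single = cost_per_million_tokens(ASSUMED_GPU_PRICE, SINGLE_STREAM)
batched = cost_per_million_tokens(ASSUMED_GPU_PRICE, SINGLE_STREAM * ASSUMED_BATCH)
print(f"single user : ${single:6.2f} per million tokens")
print(f"batch of {ASSUMED_BATCH} : ${batched:6.2f} per million tokens")
```

Whatever the real hourly price, the ratio is what matters: the GPU bill is fixed, so cost per token falls in direct proportion to aggregate throughput.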

This is why providers obsess over batching and why they price output tokens the way they do: a memory-bound GPU running a single user is mostly idle silicon being paid for. Filling it is the difference between a gross margin and a loss. The same model can be a profitable product or a money pit depending entirely on how well its operator climbs this roofline.

For the Circuit, tokens/sec/dollar is the efficiency term in the whole equation: as it improves — through better hardware, lower precision, and smarter serving — the cost side of the ledger falls, buying the build-out more time for revenue to catch up. It is the most quietly important number in AI economics.
