Divergent Compute.AI Economic Think Tank

First Principles / Part III · Inference & systems / Chapter 15

Latency, throughput, tokens/sec

How fast is an AI model? The honest answer is a formula. Because decode is memory-bound, a model's speed is set by how fast it can stream its weights out of memory — tokens per second ≈ bandwidth ÷ model bytes.

01 The answer, then the intuition

Speed is bandwidth divided by size

You felt this in the last chapter: decode writes one token per forward pass, and each pass has to read the entire model out of memory. So the speed limit isn't how fast the GPU can compute — it's how fast it can move the weights. A bigger model means more bytes to stream per token, which means fewer tokens per second. It's almost that simple.

Two numbers matter to a user: time to first token (how long until the answer starts — set by prefill) and tokens per second (how fast it then types — set by decode). The calculator below computes the second from first principles. Pick a model size and drag the precision; watch a 70B model crawl, and watch quantization nearly quadruple it:

Tokens/sec calculator — the memory-bound speed limit

Single stream on one GPU with ~3.35 TB/s memory bandwidth (H100-class). tok/s ≈ bandwidth ÷ (params × bytes per param).

[Interactive calculator. Inputs: model size and precision (16-bit, 8-bit, or 4-bit). Outputs: decode tokens/sec, time for a 1,000-token reply, and bytes streamed per token.]
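
The two user-facing numbers can also be read directly off a stream's timestamps. The following is a minimal sketch, assuming you have recorded when the request was sent and when each token arrived. The function name `stream_metrics` and the sample trace are hypothetical and not taken from any serving library.

```python
def stream_metrics(request_time, token_times):
    """Return (ttft_seconds, decode_tokens_per_second) from arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Time to first token: the wait between sending the request and token #1.
    ttft = token_times[0] - request_time
    if len(token_times) < 2:
        return ttft, None  # decode rate is undefined with a single token
    # Decode rate counts the gaps between tokens, so the prefill wait is excluded.
    decode_time = token_times[-1] - token_times[0]
    return ttft, (len(token_times) - 1) / decode_time

# Hypothetical trace: request at t=0, first token at 0.5 s, then one every 40 ms.
times = [0.5 + 0.04 * i for i in range(101)]
ttft, rate = stream_metrics(0.0, times)
print(f"TTFT {ttft:.2f} s, decode {rate:.1f} tok/s")
```

Keeping the first-token wait out of the rate is a deliberate choice here: it separates the prefill pause from the steady typing speed, which is how the two numbers are defined above.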

02 Mechanics

Latency, throughput, and the tension between them

  • Time to first token (TTFT). Set by prefill — how long to process your whole prompt before the first word appears. Grows with prompt length; this is the "thinking…" pause.
  • Inter-token latency / tokens-per-second. Set by decode. Each token streams the model's weights from memory, so per-user speed ≈ memory bandwidth ÷ model bytes. This is the steady typing rate you watch.
  • End-to-end latency. TTFT + (output tokens × per-token time). A long answer from a big model is slow twice over: more tokens to generate, and more time spent on each one.
  • Throughput. The system's total tokens/sec across all users at once. Here's the twist: because one user's decode barely uses the GPU's compute, you can run many users' requests in the same forward pass — batching — and total throughput climbs almost for free, even though each individual user's speed is unchanged or slightly slower.
  • The tension. Latency is one user's experience; throughput is the system's efficiency. Bigger batches raise throughput (and lower cost per token) but can raise latency. Tuning that trade-off is the central job of an inference team.

So "tokens per second" is two metrics wearing one name: a per-user speed bounded by memory bandwidth, and an aggregate throughput bounded by compute once the batch is large enough to keep the GPU's arithmetic units busy. Quantization helps the first; batching helps the second.

04 The math

The roofline of decode

Each decode step streams roughly the whole model from memory: $N$ parameters stored at $b$ bits ($b/8$ bytes) each. With memory bandwidth $\text{BW}$ (bytes/sec), the time per token and the single-stream rate are:

$$ t_{\text{token}} \approx \frac{N \cdot (b/8)}{\text{BW}}, \qquad \text{tokens/sec} \approx \frac{\text{BW}}{N \cdot (b/8)} $$

End-to-end latency for a prompt of $P$ and an answer of $G$ tokens:

$$ \text{latency} \approx \underbrace{t_{\text{prefill}}(P)}_{\text{TTFT}} \;+\; G \cdot t_{\text{token}} $$
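
As a rough illustration of the latency formula, the sketch below plugs in the 70B, 16-bit case on the same ~3.35 TB/s GPU. The prefill time is a placeholder assumption (0.5 s), since $t_{\text{prefill}}$ is not derived here, and the function names are illustrative only.

```python
BW = 3.35e12  # bytes/sec, the H100-class bandwidth figure used in this chapter

def t_token(n_params, bits):
    """Seconds per decoded token when each step streams the whole model."""
    return n_params * (bits / 8) / BW

def latency(t_prefill, n_generated, n_params, bits):
    """End-to-end latency: TTFT plus G decode steps."""
    return t_prefill + n_generated * t_token(n_params, bits)

T_PREFILL = 0.5  # seconds; ASSUMED value, for illustration only

for g in (10, 100, 1000):
    total = latency(T_PREFILL, g, 70e9, 16)
    decode_share = (total - T_PREFILL) / total
    print(f"G={g:5d}: {total:6.2f} s total, decode is {decode_share:.0%} of it")
```

Under these assumptions the prefill pause is a large share of a ten-token reply but close to a rounding error on a thousand-token one, where the $G \cdot t_{\text{token}}$ term dominates.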

The roofline idea: decode lives on the memory-bandwidth slope, so halving the bytes (quantize) roughly doubles the rate. Batching $B$ requests reuses the same weight-load across all of them, so aggregate throughput rises toward $B \times$ — until the work becomes compute-bound and hits the GPU's FLOPs ceiling instead. Inference engineering is the art of climbing that roofline.
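
The roofline argument can be sketched in a few lines: a decode step takes whichever is longer, streaming the weights once or doing the arithmetic for every request in the batch. This is a simplified model under stated assumptions, not a performance prediction. It uses the common approximation of about 2 FLOPs per parameter per token, an assumed dense 16-bit peak of roughly 989 TFLOP/s for an H100-class GPU held fixed across precisions, and it ignores KV-cache reads, activation traffic, and the gap between peak and achieved throughput. All function names are illustrative.

```python
BW = 3.35e12          # bytes/sec, memory bandwidth used in this chapter
PEAK_FLOPS = 989e12   # FLOP/s; ASSUMED dense 16-bit peak, H100-class

def step_time(n_params, bits, batch):
    """Roofline estimate of one decode step serving `batch` requests."""
    t_memory = n_params * (bits / 8) / BW           # weights are read once per step
    t_compute = 2 * n_params * batch / PEAK_FLOPS   # ~2 FLOPs per parameter per token
    return max(t_memory, t_compute)

def aggregate_tok_per_sec(n_params, bits, batch):
    """Total tokens/sec across the batch (one token per request per step)."""
    return batch / step_time(n_params, bits, batch)

def crossover_batch(bits):
    """Batch size at which memory time equals compute time."""
    return PEAK_FLOPS * (bits / 8) / (2 * BW)

for b in (1, 8, 64, 256, 512, 1024):
    agg = aggregate_tok_per_sec(70e9, 16, b)
    print(f"batch {b:4d}: {agg:8.0f} tok/s total, {agg / b:5.1f} per user")
print(f"crossover batch at 16-bit: ~{crossover_batch(16):.0f}")
```

In this model, aggregate throughput grows linearly with the batch while per-user speed stays flat, then flattens once the batch passes the crossover and per-user speed starts to fall. Real systems bend earlier because the KV cache adds memory traffic that grows with the batch.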

05 The code

Speed, from the bandwidth up

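Single-stream decode rate for three model sizes at three precisions, using the bandwidth of an H100-class GPU. Only bandwidth is modeled; memory capacity is ignored (a 70B model at 16-bit is 140 GB, more than a single 80 GB H100 holds).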

tokens_per_sec.py

BW = 3.35e12   # H100 HBM3 memory bandwidth, ~3.35 TB/s

def toks_per_sec(N, bits):
    bytes_streamed = N * (bits / 8)     # ~whole model moved per token
    return BW / bytes_streamed

for N, nm in [(7e9,"7B"), (70e9,"70B"), (175e9,"175B")]:
    rates = "   ".join(f"{b}-bit {toks_per_sec(N,b):5.1f}" for b in [16,8,4])
    print(f"{nm:5s}: {rates}  tok/s")
# 7B   : 16-bit 239.3   8-bit 478.6   4-bit 957.1  tok/s
# 70B  : 16-bit  23.9   8-bit  47.9   4-bit  95.7  tok/s
# 175B : 16-bit   9.6   8-bit  19.1   4-bit  38.3  tok/s

print(f"70B 16-bit, 1000-token reply: {1000/toks_per_sec(70e9,16):.1f} s")  # 41.8 s

06 The economics

Throughput is the unit of margin

tokens/sec → money

A GPU costs the same per hour whether it serves one user or a hundred. So the number that decides whether inference makes money is tokens per second per GPU — and therefore tokens per dollar. Everything in this chapter is a lever on that number: quantization raises per-user speed, batching raises aggregate throughput, and both cut the cost of every token served.

This is why providers obsess over batching and why they price output tokens the way they do: a memory-bound GPU running a single user is mostly idle silicon being paid for. Filling it is the difference between a gross margin and a loss. The same model can be a profitable product or a money pit depending entirely on how well its operator climbs this roofline.
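
To make "tokens per dollar" concrete, here is a minimal sketch of cost per million output tokens as the batch grows. The hourly GPU price is a hypothetical figure chosen only for illustration, and the model assumes the batch is small enough that decode stays memory-bound, so aggregate throughput scales linearly with batch size.

```python
BW = 3.35e12               # bytes/sec, as elsewhere in this chapter
GPU_PRICE_PER_HOUR = 3.00  # USD; HYPOTHETICAL rental price, for illustration only

def cost_per_million_tokens(n_params, bits, batch):
    """USD per 1M output tokens, assuming decode is memory-bound at this batch."""
    tok_per_sec = batch * BW / (n_params * (bits / 8))
    tokens_per_hour = tok_per_sec * 3600
    return GPU_PRICE_PER_HOUR / tokens_per_hour * 1e6

for batch in (1, 8, 32):
    usd = cost_per_million_tokens(70e9, 16, batch)
    print(f"batch {batch:2d}: ${usd:6.2f} per million tokens")
```

The GPU's hourly price never changes in this calculation; only the number of tokens it is divided across does, which is why filling the batch moves the cost per token so sharply.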

For the Circuit, tokens/sec/dollar is the efficiency term in the whole equation: as it improves — through better hardware, lower precision, and smarter serving — the cost side of the ledger falls, buying the build-out more time for revenue to catch up. It is the most quietly important number in AI economics.

07 Going deeper

The primary sources

Pope et al. (2022) — Efficiently Scaling Transformer Inference · latency vs throughput trade-offs.
Databricks — LLM Inference Performance Engineering · TTFT, tokens/sec, batching in practice.
NVIDIA — GPU Performance Background (roofline) · memory-bound vs compute-bound.
MLPerf Inference · the standard latency/throughput benchmark.

Cite this chapter: Divergent Compute, "Latency, throughput, tokens/sec", First Principles, 2026. divergentcompute.com/first-principles-latency · v1.0 · CC-BY.
