First Principles / Part II · Models / Chapter 12
A model that "sees" doesn't need a new brain. An image is just chopped into patches, each patch turned into a token in the same vector space as words — and fed into the same transformer. Everything becomes tokens.
01 The answer, then the intuition
The deep trick of multimodal models is that there's no trick. The transformer doesn't care whether a token came from a word or a corner of a photo — it just attends over a sequence of vectors. So to make a model "see," you convert an image into a sequence of vectors and splice them in next to the text tokens.
The conversion is mechanical: cut the image into a grid of small patches, flatten each patch, and run it through a linear layer that projects it into the model's embedding dimension — exactly where word tokens live. Now text and image share one sequence, and attention can relate "the cat" in the prompt to the patch of the photo where the cat is.
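The cutting and flattening step can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's preprocessing: the function name `patchify`, the nested-list image layout, and the choice to drop edge pixels that do not fill a whole patch are all assumptions made for the example.

```python
def patchify(image, P):
    """Cut an H x W x C image (nested lists) into flattened P x P patches.

    Returns a list of (H // P) * (W // P) vectors, each of length P * P * C,
    in row-major patch order. Rows or columns that do not fill a whole
    patch are dropped (an assumption of this sketch).
    """
    H, W = len(image), len(image[0])
    patches = []
    for top in range(0, H - H % P, P):
        for left in range(0, W - W % P, P):
            flat = []
            for r in range(top, top + P):
                for c in range(left, left + P):
                    flat.extend(image[r][c])  # append this pixel's C channel values
            patches.append(flat)
    return patches


# Toy 4 x 4 image with 3 channels; each pixel stores its own (row, col, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, P=2)
print(len(patches), len(patches[0]))  # 4 12
```

Each of the four vectors is what the linear layer would then project into the embedding dimension.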
[Interactive figure: a stylized 4×4 image is patchified and projected into the transformer's input sequence, where each patch becomes exactly one token alongside the words.]
02 Mechanics
[text… , img_1, img_2, … , img_196]. From here the main transformer is unchanged: it just attends over a longer, mixed sequence.
So "multimodal" is less a new architecture than a set of adapters that turn every kind of input into tokens the one transformer can read.
04 The math
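An image of height $H$ and width $W$, cut into square patches of side $P$, yields a number of patch tokens:

$$N = \frac{H}{P} \cdot \frac{W}{P} = \frac{HW}{P^2}$$

For example, $H = W = 224$ and $P = 16$ give $N = 14 \times 14 = 196$.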
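Each patch is a block of $P \times P \times C$ pixel values ($C$ = colour channels). Flattened to a vector $x_p \in \mathbb{R}^{P^2 C}$, it's projected to the embedding dimension $d$ by a learned matrix $E$ and given a position embedding $e^{\text{pos}}_p$:

$$z^{\text{img}}_p = x_p E + e^{\text{pos}}_p, \qquad E \in \mathbb{R}^{P^2 C \times d}, \quad e^{\text{pos}}_p \in \mathbb{R}^{d}, \quad p = 1, \dots, N$$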
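The projection step, a flattened patch multiplied by the matrix $E$ and offset by a position embedding, can be sketched in plain Python as follows. The function name `project_patch`, the toy sizes, and the random values are assumptions for illustration; in a real model $E$ and the position embeddings are learned, not sampled.

```python
import random


def project_patch(x_p, E, pos):
    """Return x_p @ E + pos for one flattened patch.

    x_p: list of length P*P*C (the flattened patch)
    E:   list of P*P*C rows, each of length d (projection matrix)
    pos: list of length d (position embedding for this patch)
    """
    d = len(pos)
    z = list(pos)  # start from the position embedding
    for value, row in zip(x_p, E):
        for j in range(d):
            z[j] += value * row[j]  # accumulate the matrix product
    return z


rng = random.Random(0)
patch_dim, d = 12, 8  # toy sizes: P = 2, C = 3, embedding dimension 8
E = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in range(patch_dim)]
pos = [rng.gauss(0, 0.02) for _ in range(d)]
x_p = [rng.random() for _ in range(patch_dim)]  # stand-in pixel values

z = project_patch(x_p, E, pos)
print(len(z))  # 8: one d-dimensional token per patch
```

Whatever the patch size, the output always has length $d$, which is what lets image tokens sit in the same sequence as word tokens.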
The full input is then the concatenation $[\,z^{\text{text}}_1,\dots,z^{\text{text}}_m,\; z^{\text{img}}_1,\dots,z^{\text{img}}_N\,]$ — one sequence of $d$-dimensional vectors. Because attention cost grows with the square of sequence length, those $N$ image tokens are not free.
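To get a feel for the quadratic term, the sketch below compares the number of pairwise attention interactions with and without image tokens. It counts only the squared sequence length and ignores every other cost in the model; the 500-token text length is a hypothetical figure chosen for the example.

```python
def relative_attention_cost(m, N):
    """Ratio of (m + N)^2 to m^2: how much the pairwise attention term
    grows when N image tokens are appended to m text tokens."""
    return ((m + N) ** 2) / (m ** 2)


m = 500  # hypothetical prompt length in text tokens
for N in (196, 576, 1024):
    print(N, round(relative_attention_cost(m, N), 2))
# 196 1.94
# 576 4.63
# 1024 9.29
```

Under these assumptions, one 1024-token image roughly triples the sequence length but multiplies the pairwise term by about nine.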
05 The code
Patch counts for real input sizes: the number that lands in your context window per image.
patches.py
def patch_tokens(H, W, P):
    """Number of patch tokens for an H x W image cut into P x P patches."""
    # N = (H // P) * (W // P); equals (H * W) / P**2 when P divides H and W
    return (H // P) * (W // P)


for H, W, P in [(224, 224, 16), (336, 336, 14), (512, 512, 16)]:
    print(f"{H}x{W}, patch {P}: {patch_tokens(H, W, P)} tokens")

# 224x224, patch 16: 196 tokens
# 336x336, patch 14: 576 tokens
# 512x512, patch 16: 1024 tokens   <- one image ≈ a page of text, in tokens
# a 10-word text prompt is ~13 tokens; at these sizes one image is roughly 15x to 80x larger
06 The economics
Pixels → money
Because images are billed as tokens, vision is expensive. A single picture can be hundreds to over a thousand tokens — a short text prompt is a dozen. So sending one photo can cost more than a long paragraph, and the quadratic attention bill means a few high-resolution images can dominate a request's entire compute.
Video is the extreme case: it's images stacked through time, so a few seconds of footage can be tens of thousands of tokens. This is why "just feed it the whole video" is an economic, not just technical, problem — and why providers downsample frames aggressively. Multimodality is one of the largest forces inflating the inference demand the build-out is racing to supply.
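The arithmetic behind that claim can be sketched directly. The example assumes every sampled frame is patchified independently with no temporal compression, and the frame rates and frame size are hypothetical; real systems differ in how they sample and compress frames.

```python
def video_tokens(seconds, fps, H, W, P):
    """Patch tokens for a clip if each sampled frame is patchified on its own.

    Assumes no temporal compression or token merging (a simplification).
    """
    frames = int(seconds * fps)
    return frames * (H // P) * (W // P)


# 5 seconds of 224x224 frames with 16-pixel patches (196 tokens per frame)
print(video_tokens(5, 30, 224, 224, 16))  # 29400 at a full 30 frames per second
print(video_tokens(5, 1, 224, 224, 16))   # 980 when downsampled to 1 frame per second
```

Under these assumptions, dropping from 30 frames per second to 1 cuts the token count by a factor of 30, which is the lever frame downsampling pulls.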
The strategic read for the Circuit: every new modality multiplies the tokens per interaction, which multiplies the compute per user, which raises the revenue each interaction must eventually justify. Seeing and hearing are powerful — and they move the break-even line further out.
07 Going deeper
Dosovitskiy et al. (2020) — An Image is Worth 16×16 Words (ViT) · images as patch tokens.
Radford et al. (2021) — CLIP · aligning image and text in one embedding space.
Alayrac et al. (2022) — Flamingo · bridging a vision encoder into a language model.
Liu et al. (2023) — LLaVA · a simple, open visual-instruction recipe.
Cite this chapter: Divergent Compute, "Multimodal models", First Principles, 2026. divergentcompute.com/first-principles-multimodal · v1.0 · CC-BY.