Multimodal models

A model that "sees" doesn't need a new brain. An image is just chopped into patches, each patch turned into a token in the same vector space as words — and fed into the same transformer. Everything becomes tokens.


A picture is just more tokens

The deep trick of multimodal models is that there's no trick. The transformer doesn't care whether a token came from a word or a corner of a photo — it just attends over a sequence of vectors. So to make a model "see," you convert an image into a sequence of vectors and splice them in next to the text tokens.

The conversion is mechanical: cut the image into a grid of small patches, flatten each patch, and run it through a linear layer that projects it into the model's embedding dimension — exactly where word tokens live. Now text and image share one sequence, and attention can relate "the cat" in the prompt to the patch of the photo where the cat is.
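To make the cutting step concrete, here is a minimal sketch in plain Python. The `patchify` helper and the toy 4×4 single-channel image are illustrative assumptions, not any particular model's preprocessing; real pipelines work on multi-channel tensors and batch the operation.

```python
def patchify(image, patch):
    """Split an H x W grid (a list of rows) into flattened patch vectors.

    Patches come back in row-major order: left to right, top to bottom.
    Assumes H and W are exact multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch row by row into one vector.
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(flat)
    return patches


# A toy 4x4 single-channel "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch=2)
print(len(patches))   # 4
print(patches[0])     # [0, 1, 4, 5]
print(patches[3])     # [10, 11, 14, 15]
```

Each inner list is one flattened patch, ready to be projected into the embedding space. The row-major order is a convention chosen for this sketch; what matters is that the order stays fixed so position embeddings mean something.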

[Interactive figure: One image → patches → tokens in the same sequence. A stylized 4×4 image is patchified and projected, and each patch becomes exactly one token in the input sequence to the transformer.]

From pixels to a shared sequence

Patchify. Split the image (say 224×224) into fixed-size patches (say 16×16). For those numbers that yields a 14×14 grid, or 196 patches.
Embed. Flatten each patch's pixels and pass them through one linear layer that maps them to the model's embedding dimension $d$. Add a position embedding so the model knows where each patch sat. This patch-embedding front end comes from the Vision Transformer (ViT).
Concatenate. Place the image tokens into the sequence with the text tokens: [text… , img_1, img_2, … , img_196]. From here the main transformer is unchanged — it just attends over a longer, mixed sequence.
Train the bridge. The projection (and often the whole stack) is trained on image–text pairs so the visual tokens land in a space the language model already understands — the lineage of CLIP and models like LLaVA.
Other modalities, same idea. Audio becomes spectrogram patches; video becomes many frames' worth of patches across time. Generation runs the other way: the model emits image or audio tokens, which a diffusion or codec decoder often turns back into pixels or sound.
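The embed and concatenate steps above can be sketched in a few lines. This is a toy under stated assumptions: `embed_patches`, the sizes, and the random stand-ins for the learned matrix, the position embeddings and the text embeddings are all hypothetical, chosen only to show how the shapes line up.

```python
import random


def embed_patches(patches, E, pos):
    """Compute z_p = E x_p + pos_p for each flattened patch x_p.

    E is a d x n matrix (a list of d rows); pos holds one d-vector per patch.
    """
    tokens = []
    for x, p in zip(patches, pos):
        z = [sum(w * v for w, v in zip(row, x)) + p_i
             for row, p_i in zip(E, p)]
        tokens.append(z)
    return tokens


random.seed(0)
d, n, num_patches, num_text = 8, 4, 4, 3   # toy sizes, for illustration only

# Random stand-ins: in a real model E and pos are learned, and the text
# vectors come from the language model's own embedding table.
E = [[random.gauss(0, 0.02) for _ in range(n)] for _ in range(d)]
pos = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(num_patches)]
patches = [[random.random() for _ in range(n)] for _ in range(num_patches)]
text_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_text)]

image_tokens = embed_patches(patches, E, pos)
sequence = text_tokens + image_tokens      # [text..., img_1, ..., img_N]

print(len(sequence), len(sequence[0]))     # 7 8
```

The only thing the downstream transformer sees is `sequence`: seven vectors of the same width, with nothing marking which ones started life as pixels.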

So "multimodal" is less a new architecture than a set of adapters that turn every kind of input into tokens the one transformer can read.

Patches, counted and projected

An image of height $H$ and width $W$, cut into square patches of side $P$ (with $P$ dividing both), yields this many patch tokens:

$$ N_{\text{patches}} = \frac{H}{P} \times \frac{W}{P} = \frac{H\,W}{P^2} $$

Each patch is a block of $P \times P \times C$ pixel values ($C$ = colour channels). Flattened to a vector $x_p \in \mathbb{R}^{P^2 C}$, it's projected to the embedding dimension $d$ by a learned matrix $E$ and given a position embedding:

$$ z_p = E\, x_p + \text{pos}_p, \qquad E \in \mathbb{R}^{d \times P^2 C} $$

The full input is then the concatenation $[\,z^{\text{text}}_1,\dots,z^{\text{text}}_m,\; z^{\text{img}}_1,\dots,z^{\text{img}}_N\,]$ — one sequence of $d$-dimensional vectors. Because attention cost grows with the square of sequence length, those $N$ image tokens are not free.
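As a rough worked example, assume a short prompt of $m = 13$ text tokens and one 224×224 image at patch size 16, so $N = 196$. Counting only the pairwise attention scores (the part that scales with the square of sequence length, ignoring everything that scales linearly), adding the image multiplies that term by:

$$ \frac{(m+N)^2}{m^2} = \frac{(13+196)^2}{13^2} = \frac{43681}{169} \approx 258 $$

The specific numbers are illustrative assumptions; the point is that the image does not add to the attention cost so much as multiply it.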

How many tokens is a picture?

```python
def patch_tokens(H, W, P):
    """Number of patch tokens: N = (H * W) / P^2 when P divides H and W."""
    return (H // P) * (W // P)


for H, W, P in [(224, 224, 16), (336, 336, 14), (512, 512, 16)]:
    print(f"{H}x{W}, patch {P}: {patch_tokens(H, W, P)} tokens")

# 224x224, patch 16: 196 tokens
# 336x336, patch 14: 576 tokens
# 512x512, patch 16: 1024 tokens   <- one image ≈ a page of text, in tokens
# A 10-word text prompt is ~13 tokens; a single image can be 50x larger.
```

Why a photo costs like a page

Pixels → money

Because images are billed as tokens, vision is expensive. A single picture can be hundreds to over a thousand tokens, while a short text prompt is about a dozen. So sending one photo can cost more than a long paragraph, and the quadratic attention bill means a few high-resolution images can dominate a request's entire compute.
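A quick sketch of how lopsided a request can get. The request below is hypothetical, and it assumes one token per patch as in the simplified picture used here; real providers may resize, tile or pool images before counting tokens.

```python
def image_share(text_tokens, image_tokens_each, num_images):
    """Return (total tokens, fraction of those tokens that come from images)."""
    image_total = image_tokens_each * num_images
    total = text_tokens + image_total
    return total, image_total / total


# Hypothetical request: a 200-token prompt plus three 512x512 images
# at patch size 16, which is 1024 patch tokens each.
total, share = image_share(text_tokens=200, image_tokens_each=1024, num_images=3)
print(total)            # 3272
print(f"{share:.0%}")   # 94%
```

Under these assumptions the written prompt is a rounding error next to the pictures attached to it.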

Video is the extreme case: it's images stacked through time, so a few seconds of footage can be tens of thousands of tokens. This is why "just feed it the whole video" is an economic, not just technical, problem — and why providers downsample frames aggressively. Multimodality is one of the largest forces inflating the inference demand the build-out is racing to supply.
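The frame-rate arithmetic is easy to check. The sketch below assumes every sampled frame is patchified independently with no temporal compression, and the clip length, resolution and frame rates are illustrative rather than any provider's actual settings.

```python
def video_tokens(seconds, fps, height, width, patch):
    """Patch tokens for a clip if each sampled frame is patchified on its own."""
    frames = int(seconds * fps)
    per_frame = (height // patch) * (width // patch)
    return frames * per_frame


# Illustrative numbers only: a 5-second clip of 224x224 frames, patch size 16.
for fps in (30, 4, 1):
    print(f"{fps:>2} fps: {video_tokens(5, fps, 224, 224, 16):,} tokens")

# 30 fps: 29,400 tokens
#  4 fps: 3,920 tokens
#  1 fps: 980 tokens
```

Dropping from 30 frames per second to 1 cuts the token count by a factor of 30, which is why frame sampling is the first lever anyone reaches for.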

The strategic read for the Circuit: every new modality multiplies the tokens per interaction, which multiplies the compute per user, which raises the revenue each interaction must eventually justify. Seeing and hearing are powerful — and they move the break-even line further out.
