Divergent Compute.AI Economic Think Tank

First Principles / Part II · Models / Chapter 12

Multimodal models

A model that "sees" doesn't need a new brain. An image is just chopped into patches, each patch turned into a token in the same vector space as words — and fed into the same transformer. Everything becomes tokens.

01 The answer, then the intuition

A picture is just more tokens

The deep trick of multimodal models is that there's no trick. The transformer doesn't care whether a token came from a word or a corner of a photo — it just attends over a sequence of vectors. So to make a model "see," you convert an image into a sequence of vectors and splice them in next to the text tokens.

The conversion is mechanical: cut the image into a grid of small patches, flatten each patch, and run it through a linear layer that projects it into the model's embedding dimension — exactly where word tokens live. Now text and image share one sequence, and attention can relate "the cat" in the prompt to the patch of the photo where the cat is.

Figure: One image → patches → tokens in the same sequence. A stylized 4×4 image is patchified and projected; each patch becomes exactly one token in the input sequence to the transformer, sitting right alongside the words.

02 Mechanics

From pixels to a shared sequence

  • Patchify. Split the image (say 224×224) into fixed patches (say 16×16). That yields a grid of patches — for those numbers, 14×14 = 196 of them.
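  • Embed. Flatten each patch's pixels and pass them through one linear layer that maps them to the model's embedding dimension $d$. Add a position embedding so the model knows where each patch sat. This patch embedding is the input stage of a Vision Transformer (ViT); in practice the patches often also pass through the ViT's own transformer layers before they reach the language model.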
  • Concatenate. Place the image tokens into the sequence with the text tokens: [text… , img_1, img_2, … , img_196]. From here the main transformer is unchanged — it just attends over a longer, mixed sequence.
  • Train the bridge. The projection (and often the whole stack) is trained on image–text pairs so the visual tokens land in a space the language model already understands — the lineage of CLIP and models like LLaVA.
  • Other modalities, same idea. Audio becomes spectrogram patches; video becomes frames-worth of patches across time. Generation runs the other way — producing image or audio tokens, often via a diffusion or codec decoder.

So "multimodal" is less a new architecture than a set of adapters that turn every kind of input into tokens the one transformer can read.
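The patchify, embed, and concatenate steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any particular model's implementation: the sizes (`H`, `W`, `P`, `C`, `d`) are tiny, the matrix `E`, the position embeddings `pos`, and the text embeddings are random stand-ins for learned weights, and the helper names `patchify` and `embed` are hypothetical.

```python
import random

def patchify(image, P):
    """Split an H x W x C nested-list image into flattened P x P patches, row-major."""
    H, W = len(image), len(image[0])
    assert H % P == 0 and W % P == 0, "sketch assumes the patch size divides the image"
    patches = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            flat = [value
                    for row in image[top:top + P]
                    for pixel in row[left:left + P]
                    for value in pixel]
            patches.append(flat)              # length P * P * C
    return patches

def embed(patches, E, pos):
    """Compute z_p = E x_p + pos_p for every patch (E has d rows, P*P*C columns)."""
    return [[sum(w * x for w, x in zip(row, patch)) + pos_p[i]
             for i, row in enumerate(E)]
            for patch, pos_p in zip(patches, pos)]

rng = random.Random(0)
H, W, P, C, d = 8, 8, 4, 3, 6                 # toy sizes, chosen only for illustration

image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
patches = patchify(image, P)

# Random stand-ins for weights that a real model would learn.
E = [[rng.gauss(0, 0.02) for _ in range(P * P * C)] for _ in range(d)]
pos = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in patches]

image_tokens = embed(patches, E, pos)
text_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]  # 3 fake word embeddings
sequence = text_tokens + image_tokens         # one mixed sequence of d-dimensional vectors

print(len(patches), len(patches[0]))          # 4 48
print(len(sequence), len(sequence[0]))        # 7 6
```

The point of the sketch is the last line: after embedding, nothing in `sequence` records which vectors came from words and which from pixels, so the transformer that consumes it needs no change.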

04 The math

Patches, counted and projected

An image of height $H$ and width $W$, cut into square patches of side $P$, yields a number of patch tokens:

$$ N_{\text{patches}} = \frac{H}{P} \times \frac{W}{P} = \frac{H\,W}{P^2} $$

Each patch is a block of $P \times P \times C$ pixel values ($C$ = colour channels). Flattened to a vector $x_p \in \mathbb{R}^{P^2 C}$, it's projected to the embedding dimension $d$ by a learned matrix $E$ and given a position embedding:

$$ z_p = E\, x_p + \text{pos}_p, \qquad E \in \mathbb{R}^{d \times P^2 C} $$
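As a worked instance, take the 224×224 image and 16×16 patches from the mechanics section, with $C = 3$ colour channels, and assume an embedding dimension of $d = 768$ (an illustrative value, not one this chapter fixes):

$$ N_{\text{patches}} = \frac{224}{16} \times \frac{224}{16} = 14 \times 14 = 196, \qquad x_p \in \mathbb{R}^{16^2 \cdot 3} = \mathbb{R}^{768}, \qquad E \in \mathbb{R}^{768 \times 768} $$

Under that assumption the projection $E$ holds $768 \times 768 = 589{,}824$ weights, which is small next to the transformer that consumes its output.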

The full input is then the concatenation $[\,z^{\text{text}}_1,\dots,z^{\text{text}}_m,\; z^{\text{img}}_1,\dots,z^{\text{img}}_N\,]$ — one sequence of $d$-dimensional vectors. Because attention cost grows with the square of sequence length, those $N$ image tokens are not free.
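To put a number on "not free", the sketch below counts query-key pairs in full self-attention with and without one image. It is a rough illustration only: the 13-token prompt is an assumed size, `attention_pairs` is a hypothetical helper, and the count covers just the attention score matrix (the feed-forward layers grow linearly with sequence length, not quadratically).

```python
def attention_pairs(n_tokens):
    """Number of query-key pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens

text_tokens = 13      # assumed size of a short text prompt
image_tokens = 196    # one 224x224 image at patch size 16

text_only = attention_pairs(text_tokens)
with_image = attention_pairs(text_tokens + image_tokens)

print(text_only, with_image, round(with_image / text_only))   # 169 43681 258
```

Adding one small image multiplies the sequence length by about 16 but the attention pairs by about 258.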

05 The code

How many tokens is a picture?

Patch counts for real input sizes — the number that lands in your context window per image.

patches.py

def patch_tokens(H, W, P):
    """Patch tokens for an H x W image cut into P x P patches."""
    # Floor division drops any partial patch at the edges;
    # when P divides H and W this equals (H * W) / P**2.
    return (H // P) * (W // P)

for H, W, P in [(224, 224, 16), (336, 336, 14), (512, 512, 16)]:
    print(f"{H}x{W}, patch {P}: {patch_tokens(H, W, P)} tokens")
# 224x224, patch 16: 196 tokens
# 336x336, patch 14: 576 tokens
# 512x512, patch 16: 1024 tokens   <- one image ≈ a page of text, in tokens

# a 10-word text prompt is ~13 tokens, so these images are roughly 15x to 80x larger
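The listing above assumes the patch size divides the image evenly. When it does not, a pipeline has to resize, crop, or pad first. The variant below is a sketch of the padding case only, where partial patches at the right and bottom edges are padded out to full ones; `patch_tokens_padded` is a hypothetical helper, and real models each make their own choice here.

```python
import math

def patch_tokens_padded(H, W, P):
    """Patch count when partial edge patches are padded up to full P x P patches."""
    return math.ceil(H / P) * math.ceil(W / P)

def patch_tokens_cropped(H, W, P):
    """Patch count when partial edge patches are dropped (floor division)."""
    return (H // P) * (W // P)

print(patch_tokens_padded(500, 375, 14))    # 972  (36 x 27)
print(patch_tokens_cropped(500, 375, 14))   # 910  (35 x 26)
```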

06 The economics

Why a photo costs like a page

Pixels → money

Because images are billed as tokens, vision is expensive. A single picture can be hundreds to over a thousand tokens — a short text prompt is a dozen. So sending one photo can cost more than a long paragraph, and the quadratic attention bill means a few high-resolution images can dominate a request's entire compute.
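A small sketch makes the imbalance concrete. It assumes images are billed at one token per patch, which is a simplification (real providers use their own image-token accounting), and the price constant is a made-up placeholder, not any provider's rate. `request_cost` is a hypothetical helper.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00   # hypothetical USD price, for illustration only

def patch_tokens(H, W, P):
    """One token per full P x P patch."""
    return (H // P) * (W // P)

def request_cost(text_tokens, image_sizes, P=16):
    """Return (total tokens, image share of tokens, cost in USD) for one request."""
    image_tokens = sum(patch_tokens(H, W, P) for H, W in image_sizes)
    total = text_tokens + image_tokens
    cost = total * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000
    return total, image_tokens / total, cost

# A 13-token prompt plus three 512x512 images.
total, share, cost = request_cost(text_tokens=13, image_sizes=[(512, 512)] * 3)
print(total, f"{share:.1%}", f"${cost:.6f}")   # 3085 99.6% $0.009255
```

Under these assumptions the text is a rounding error: over 99% of the billed tokens in the request are pixels.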

Video is the extreme case: it's images stacked through time, so a few seconds of footage can be tens of thousands of tokens. This is why "just feed it the whole video" is an economic, not just technical, problem — and why providers downsample frames aggressively. Multimodality is one of the largest forces inflating the inference demand the build-out is racing to supply.
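The arithmetic behind that claim is short. The sketch below assumes every sampled frame is tokenized independently as a 224×224 image at patch size 16, with no temporal compression; the frame rates are illustrative and `video_tokens` is a hypothetical helper.

```python
def video_tokens(seconds, fps, H, W, P):
    """Tokens for a clip when each sampled frame is patchified on its own."""
    frames = int(seconds * fps)
    return frames * (H // P) * (W // P)

full_rate = video_tokens(seconds=3, fps=30, H=224, W=224, P=16)
sampled = video_tokens(seconds=3, fps=1, H=224, W=224, P=16)

print(full_rate, sampled)   # 17640 588
```

Three seconds at 30 frames per second is already 17,640 tokens, while sampling one frame per second cuts that to 588, which is the kind of 30x saving that makes aggressive frame downsampling attractive.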

The strategic read for the Circuit: every new modality multiplies the tokens per interaction, which multiplies the compute per user, which raises the revenue each interaction must eventually justify. Seeing and hearing are powerful — and they move the break-even line further out.

07 Going deeper

The primary sources

Dosovitskiy et al. (2020) — An Image is Worth 16×16 Words (ViT) · images as patch tokens.
Radford et al. (2021) — CLIP · aligning image and text in one embedding space.
Alayrac et al. (2022) — Flamingo · bridging a vision encoder into a language model.
Liu et al. (2023) — LLaVA · a simple, open visual-instruction recipe.

Cite this chapter: Divergent Compute, "Multimodal models", First Principles, 2026. divergentcompute.com/first-principles-multimodal · v1.0 · CC-BY.
