May 18, 2026 · paperjuice · 18 min read · 4200 words

LLaVA — One Matrix Multiplication Taught a Language Model to See.

You paste a photo of your fridge into ChatGPT and ask "what can I cook with this?" It scans the shelves, names the ingredients, suggests a pasta recipe. You don't think twice.

But here's what should bother you: the language model at the heart of that answer learned from text. Words. Sentences. Wikipedia. Reddit. Code. Not a single pixel. So how does a machine that learned language suddenly see?

In April 2023, a team from UW-Madison, Microsoft Research, and Columbia published an answer so simple I thought I was missing something. Take a vision model. Take a language model. Connect them with one matrix multiplication. Then train on data that GPT-4 wrote — without GPT-4 ever seeing a single image.

The paper is called LLaVA — Large Language and Vision Assistant. NeurIPS 2023 Oral. And it changed how the entire field thinks about multimodal AI.

The problem: two brilliant experts who don't speak the same language

By early 2023, AI had two incredible powers that couldn't talk to each other.

CLIP, OpenAI's vision model, could look at any image and produce a rich set of "visual features" — numbers that encode what's in the picture. Dogs, cars, text on signs, facial expressions. CLIP learned this from 400 million image-text pairs. It sees almost everything.

Vicuna, an open-source chatbot fine-tuned from Meta's LLaMA, could hold conversations, follow instructions, write code, reason through problems. The best open language model of its time.

The problem? CLIP's visual features live in one mathematical space. Vicuna's word embeddings live in a completely different one. Imagine a brilliant French chef and a brilliant Japanese chef in the same kitchen — both are world-class, but they literally cannot read each other's recipes.

The expensive solutions that came before

DeepMind's Flamingo solved this by inserting gated cross-attention layers throughout the language model — like stationing a professional interpreter on every floor of an office building. Powerful, but architecturally complex and expensive to train.

Salesforce's BLIP-2 built a Q-Former — an entire separate transformer whose only job was translating between vision and language. Like hiring a whole translation department with its own budget and staff, just to relay messages.

Both approaches shared a belief: the gap between vision and language is huge, so you need a heavy bridge to cross it.

LLaVA asked: what if the gap isn't that big?

LLaVA's idea: a phrase book instead of an interpreter

What if, instead of building an elaborate bridge between vision and language, you handed both sides a simple phrase book — and then gave them something interesting to talk about?

[Figure: LLaVA architecture. 🖼️ Your photo → CLIP ViT-L/14 (❄ frozen) → visual features Zv → W, the phrase book (🔥 trainable) → image tokens concatenated with text tokens → Vicuna 13B (❄ / 🔥) → "It's a golden retriever on a beach..." Hv = W · Zv: one matrix multiplication is the entire bridge.]

Fig 1 — The complete LLaVA architecture. CLIP sees the image. One matrix W converts its output into tokens the language model understands. Vicuna generates the response. No cross-attention, no Q-Former, no new components.

CLIP looks at your image and produces visual features — a grid of numbers that capture what it sees. Those features are rich, but they're in CLIP's mathematical language, not Vicuna's.

A single trainable matrix W multiplies those features: Hv = W · Zv. No activation function. No normalization. No stack of layers. One matrix, applied to every patch feature. It converts CLIP's visual features into tokens that are the same size and shape as Vicuna's word embeddings.
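To make Hv = W · Zv concrete, here is a toy sketch in plain Python. The function name `project` and the tiny sizes are made up for illustration. The "real" sizes in the comments (256 patch features of width 1024 going to an embedding width of 5120) are assumptions taken from the public CLIP ViT-L/14 and Vicuna-13B configurations, not something this post states.

```python
# Toy sketch of Hv = W . Zv using plain Python lists.
# Assumed real sizes (not stated in the post): 256 patch features of
# width 1024 from CLIP ViT-L/14, projected to Vicuna-13B's width of 5120.

def project(z_v, w):
    """Map each visual feature (a row of z_v) into the LLM embedding space.

    z_v: n_patches rows, each of length d_vision
    w:   d_llm rows, each of length d_vision
    Returns n_patches rows, each of length d_llm.
    """
    return [
        [sum(w_ij * z_j for w_ij, z_j in zip(w_row, z_row)) for w_row in w]
        for z_row in z_v
    ]

# Tiny stand-in sizes so the example runs instantly.
z_v = [[1.0, 2.0, 3.0],   # patch 0
       [0.0, 1.0, 0.0]]   # patch 1  -> 2 patches, d_vision = 3
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]     # d_llm = 4

h_v = project(z_v, w)
print(h_v)  # [[1.0, 2.0, 3.0, 6.0], [0.0, 1.0, 0.0, 1.0]]

# Size of the bridge under the assumed real dimensions.
bridge_params = 1024 * 5120
print(bridge_params)  # 5242880
```

Under those assumed sizes the whole bridge is roughly 5 million numbers, which is small next to a 13B-parameter language model.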

Those converted image tokens sit next to the text tokens in a sequence. From Vicuna's perspective, it just sees a longer sentence. Some tokens came from an image, others from text, but they all look the same. Vicuna does what it always does — predicts the next token, then the next, until it's built a complete response.
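A rough sketch of that "longer sentence" idea, with embeddings faked as two-number lists. The helper name `build_input_sequence` and the choice to put image tokens first are assumptions for illustration only.

```python
# Sketch: projected image tokens and text embeddings form one sequence.
# Embeddings are faked as short lists; real ones are much wider vectors.

def build_input_sequence(image_tokens, text_tokens):
    """Place projected image tokens ahead of the text embeddings.

    The ordering (image first) is an assumption made for this sketch.
    """
    widths = {len(token) for token in image_tokens + text_tokens}
    if len(widths) != 1:
        raise ValueError("image and text tokens must share one embedding width")
    return image_tokens + text_tokens

image_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 projected patches
text_tokens = [[1.0, 0.0], [0.0, 1.0]]               # 2 word embeddings

sequence = build_input_sequence(image_tokens, text_tokens)
print(len(sequence))  # 5
```

The width check is the whole point of the projection: once every token has the same width, the language model has no way to tell which ones started life as pixels.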

That's the whole architecture. One matrix is the entire vision-language bridge.

Flamingo hired a professional interpreter. BLIP-2 built a translation department. LLaVA handed both sides a phrase book — and discovered they already spoke almost the same language.

Why go this simple? Because a lightweight projection lets you iterate on data experiments fast. When your architecture is one matrix, you can test ten ideas in the time a Q-Former finishes one. That speed turned out to matter more than architectural complexity.

The data trick: a blind teacher writing visual exams

If the architecture is dead simple, the data has to carry all the weight. And this is where the paper gets genuinely creative.

The problem: there's no dataset of "look at this image and follow these instructions." We have billions of image-caption pairs, but captions are boring. "A dog on a beach." "A group of people at a table." That's labeling, not thinking. Nobody asks "what challenges might this dog face?" or "describe the mood of this scene."

Building that data with humans would take months. So LLaVA asked GPT-4 to write it.

The twist? In April 2023, GPT-4 was text-only. It had never seen an image in its life. So how do you get a blind model to write visual training data?

You describe the image in words

Instead of showing GPT-4 the actual image, they gave it text descriptions from the COCO dataset — five captions written by different humans, plus a list of bounding boxes showing where each object is in the image.

For a photo of people packing luggage into a black SUV, GPT-4 gets: "A group of people standing outside a black vehicle with various luggage." "People try to fit all their luggage in an SUV." Plus: person: [0.68, 0.24], suitcase: [0.76, 0.41], bicycle: [0.28, 0.36].
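Here is one way that text-only context could be assembled. The function names and the "Captions:" / "Objects:" layout are hypothetical; the captions and numbers are the ones quoted above, copied as they appear in this post.

```python
# Sketch: turn captions and object positions into text for a text-only model.
# Layout and helper names are invented for illustration.

def format_object(name, coords):
    """Render one object as 'name: [x, y, ...]' with two decimals per number."""
    numbers = ", ".join(f"{value:.2f}" for value in coords)
    return f"{name}: [{numbers}]"

def build_symbolic_context(captions, objects):
    """Join captions and object lines into a single prompt-ready string."""
    lines = ["Captions:"]
    lines.extend(captions)
    lines.append("")
    lines.append("Objects:")
    lines.extend(format_object(name, coords) for name, coords in objects)
    return "\n".join(lines)

captions = [
    "A group of people standing outside a black vehicle with various luggage.",
    "People try to fit all their luggage in an SUV.",
]
objects = [
    ("person", [0.68, 0.24]),
    ("suitcase", [0.76, 0.41]),
    ("bicycle", [0.28, 0.36]),
]

context = build_symbolic_context(captions, objects)
print(context)
```

Everything the generating model knows about the photo is in that string, which is why caption quality matters so much for this trick.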

From this text alone — no pixels — GPT-4 generates three types of training data:

[Figure: How GPT-4 writes visual training data without seeing images. COCO image captions plus object positions (text only) → GPT-4 (text only) → three outputs: Conversations (58K samples, multi-turn Q&A), Detailed Descriptions (23K samples), Complex Reasoning (77K samples) = 158K total instruction-following samples.]

Fig 2 — GPT-4 receives text descriptions of COCO images and generates three types of instruction-following data. It never sees a pixel. A handful of human-written examples seed the process.

Conversations (58K) — natural multi-turn Q&A. "What type of vehicle is that?" "A black SUV." "Where is it parked?" "In an underground garage." The kind of back-and-forth you'd have with a friend looking at the same photo. Only questions with definite answers are included — nothing vague.

Detailed descriptions (23K) — rich scene narratives. Not "a parking lot." Instead: "Three people stand around a black SUV in an underground parking area. One near the left, one in the middle, one on the right. Two backpacks, two suitcases, and a bicycle are visible." The kind of observation you'd expect from someone actually standing there.

Complex reasoning (77K) — and this is the crucial one. "What challenges do these people face?" GPT-4 doesn't say "fitting luggage." It writes a structured answer about packing efficiency, passenger comfort, driver visibility, and trip planning. That's not describing. That's thinking about what the image implies.

A model that has never seen a single pixel wrote 158K training samples that taught another model to understand images. A blind teacher raising students who can see.

158K samples is tiny. LAION-5B has more than 5 billion image-text pairs. But LLaVA bet that variety and depth beat volume. That bet paid off.
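A quick back-of-the-envelope check of the mix, using only the sample counts quoted in this post. The 5 billion figure for LAION is a round number, so treat the final ratio as an order of magnitude.

```python
# Sketch: how the 158K instruction samples split across the three types.
SAMPLES = {
    "conversation": 58_000,
    "detailed_description": 23_000,
    "complex_reasoning": 77_000,
}

total = sum(SAMPLES.values())
shares = {kind: round(100 * count / total, 1) for kind, count in SAMPLES.items()}

print(total)   # 158000
print(shares)  # {'conversation': 36.7, 'detailed_description': 14.6, 'complex_reasoning': 48.7}

# Rough scale comparison against a ~5 billion pair web dataset.
LAION_PAIRS = 5_000_000_000
print(round(LAION_PAIRS / total))  # 31646
```

Nearly half the data is complex reasoning, and the whole set is tens of thousands of times smaller than a web-scale caption corpus.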

Two-stage training: first learn the words, then learn to think

You can't just multiply random numbers between CLIP and Vicuna and expect intelligent answers. The phrase book needs to be calibrated. LLaVA does this in two stages:

[Figure: Two-stage training. Stage 1, The Handshake: 595K simple image-caption pairs ("Describe the image" → original caption); CLIP ❄ frozen, W 🔥 trains, Vicuna ❄ frozen; 4 hours on 8 A100 GPUs. Stage 2, The Education: 158K GPT-4 instruction-following samples (conversations + reasoning + descriptions); CLIP ❄ frozen, W 🔥 trains, Vicuna 🔥 trains; 10 hours on 8 A100 GPUs.]

Fig 3 — Two-stage training. Stage 1 calibrates the phrase book (only W trains). Stage 2 teaches the model to reason visually (W + Vicuna both train). Total: ~14 hours.

Stage 1: the handshake

Take 595K image-caption pairs from CC3M. Freeze CLIP. Freeze Vicuna. Train only the projection matrix W.

The task is simple: show an image, ask "describe this," and the correct answer is the original caption. W learns what visual features "sound like" in Vicuna's language — like calibrating a phrase book so both sides recognize basic vocabulary.
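A minimal sketch of how a plain caption pair might be wrapped into that "describe this" format. The prompt templates, the image file name, and the helper name are all invented for illustration; the post only says the model is asked to describe the image and the caption is the target.

```python
import random

# Hypothetical prompt templates, invented for this sketch.
BRIEF_PROMPTS = [
    "Describe the image.",
    "Give a brief description of this picture.",
    "What is shown in this image?",
]

def make_alignment_sample(image_id, caption, rng):
    """Wrap one image-caption pair as an instruction-following example."""
    return {
        "image": image_id,
        "instruction": rng.choice(BRIEF_PROMPTS),
        "target": caption,  # the original caption is the expected answer
    }

rng = random.Random(0)  # seeded so runs are repeatable
sample = make_alignment_sample("cc3m_000001.jpg", "A dog on a beach.", rng)
print(sample["instruction"])
print(sample["target"])  # A dog on a beach.
```

No new labels are needed for this stage: the caption that already exists does double duty as the answer.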

Four hours on 8 A100 GPUs. One epoch. After this, Vicuna can "read" image tokens. It doesn't know what to do with them yet — but it can read them.

Stage 2: the education

Now unfreeze Vicuna. CLIP stays frozen — it already sees fine. Train both W and the language model on the 158K instruction-following samples.

This is the stage that transforms the model. Before: "I can describe images." After: "I can reason about images, follow complex instructions, and hold multi-turn visual conversations." The gap between describing and reasoning is everything.

Ten hours on the same 8 GPUs. Total training: ~14 hours.
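The two-stage recipe fits in a few lines of config. This is a sketch with made-up names (`Stage`, `MODULES`, `STAGES`), not the authors' training script; the sample counts, frozen/trainable split, and hours are the ones given in this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    samples: int
    trainable: frozenset  # modules that receive gradient updates
    hours: float          # wall-clock time on 8 A100 GPUs, per the post

MODULES = frozenset({"clip", "projection", "vicuna"})

STAGES = [
    Stage("alignment", 595_000, frozenset({"projection"}), 4),
    Stage("instruction_tuning", 158_000, frozenset({"projection", "vicuna"}), 10),
]

for stage in STAGES:
    frozen = sorted(MODULES - stage.trainable)
    print(stage.name, "trains", sorted(stage.trainable), "freezes", frozen)

total_hours = sum(stage.hours for stage in STAGES)
gpu_hours = total_hours * 8
print(total_hours)  # 14
print(gpu_hours)    # 112
```

Note that CLIP never appears in a `trainable` set: the vision encoder stays frozen from start to finish.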

Flamingo needed billions of image-text pairs and hundreds of TPU-days. BLIP-2 needed 129 million images for its Q-Former alone. LLaVA finished in less time than a transatlantic flight.

Does it actually work?

The authors built their own benchmarks because nothing existed for evaluating multimodal instruction-following. They used GPT-4 as a judge — scoring responses on helpfulness, relevance, accuracy, and detail.

The ceiling? A text-only GPT-4 that gets perfect ground-truth descriptions of each image. So LLaVA is being measured against a model with essentially flawless vision. The scores are reported as percentages of this ceiling.
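A sketch of what "percentage of the ceiling" means in practice. The ratings below are made up, and both the 1-to-10 scale and the ratio-of-totals formula are assumptions for illustration rather than a confirmed description of the authors' scoring script.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Candidate's total judge rating as a percentage of the reference's total."""
    reference_total = sum(reference_ratings)
    if reference_total == 0:
        raise ValueError("reference ratings must not sum to zero")
    return 100.0 * sum(candidate_ratings) / reference_total

# Made-up judge ratings for three questions, purely illustrative.
llava_ratings = [8, 7, 9]
reference_ratings = [9, 9, 10]

print(round(relative_score(llava_ratings, reference_ratings), 1))  # 85.7
```

A score near 100 means the judge rated the candidate about as highly as the text-only reference that was handed ground-truth descriptions.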

The numbers:

  • 85.1% relative to GPT-4 on LLaVA-Bench (COCO) — a model with one linear layer reaches 85% of the ceiling set by GPT-4 with perfect input
  • 96.5% on complex reasoning. Read that again. On questions requiring multi-step logical inference about visual scenes, LLaVA nearly matches GPT-4 with perfect information. 96.5%.
  • Destroys existing open models. On the harder In-the-Wild benchmark: LLaVA 67.3% vs BLIP-2's 38.1% (+29 points) vs OpenFlamingo's 19.1% (+48 points)
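The rounded "+29" and "+48" come straight from subtracting the In-the-Wild scores listed above. A tiny check, using only those three numbers:

```python
# In-the-Wild relative scores as quoted in this post.
IN_THE_WILD = {"LLaVA": 67.3, "BLIP-2": 38.1, "OpenFlamingo": 19.1}

gaps = {
    name: round(IN_THE_WILD["LLaVA"] - score, 1)
    for name, score in IN_THE_WILD.items()
    if name != "LLaVA"
}
print(gaps)  # {'BLIP-2': 29.2, 'OpenFlamingo': 48.2}
```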

The gap isn't about quality — it's about behavior. Ask "what's unusual about this?" and BLIP-2 describes the scene. LLaVA explains why it's unusual. BLIP-2 labels. LLaVA reasons. That's the instruction-tuning difference.

The surprise: a blind model makes a seeing one smarter

Here's the part that made me put my coffee down.

On ScienceQA — 21K multimodal science questions — LLaVA alone scores a strong 90.92%. Meanwhile, text-only GPT-4, which literally cannot see images, scores 82.69% using just the question text. Different strengths, different blind spots.

The authors tried something: when LLaVA and GPT-4 disagree, ask GPT-4 to review both answers and make a final call. The result: 92.53% — a new state-of-the-art.

A model that cannot see images improved performance on image-based questions. How? Because some "visual" questions can actually be solved with pure logic. When LLaVA made a visual mistake on one of those, GPT-4's reasoning overruled it. The text model knew when the image didn't matter.
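The disagree-then-arbitrate logic is short enough to sketch. Everything here is hypothetical scaffolding: `ensemble_answer` and `toy_arbiter` are invented names, and the arbiter is a stand-in for a second call to GPT-4, not a real API.

```python
def ensemble_answer(question, llava_answer, gpt4_answer, arbitrate):
    """Keep the shared answer when both models agree; otherwise ask the arbiter."""
    if llava_answer == gpt4_answer:
        return llava_answer
    return arbitrate(question, llava_answer, gpt4_answer)

def toy_arbiter(question, llava_answer, gpt4_answer):
    # Stand-in rule for illustration only: trust the text model on questions
    # flagged as solvable by logic, and the vision model otherwise.
    return gpt4_answer if "logic" in question else llava_answer

print(ensemble_answer("logic puzzle about levers", "B", "C", toy_arbiter))   # C
print(ensemble_answer("what colour is the liquid?", "A", "D", toy_arbiter))  # A
print(ensemble_answer("any question", "B", "B", toy_arbiter))                # B
```

The arbiter only runs on disagreements, so the extra model call is paid for exactly the questions where it can change the outcome.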

By the authors' account, this was the first time GPT-4 had been used for model ensembling. It shouldn't have worked this well.

Things nobody taught it to do

The numbers are impressive. The examples are eerie.

LLaVA identifies Elon Musk in a regular headshot and in a meme where he's dressed as a Doge — despite Musk never appearing in LLaVA's training data. CLIP saw him during its own pre-training on 400M internet images, and somehow one matrix multiplication was enough to relay that identity to Vicuna.

Show it the "chicken nugget world map" meme and ask what's going on. It doesn't just describe the image. It explains the joke — the juxtaposition between the majestic text about Earth and the mundane reality of chicken nuggets. Genuine humor comprehension from a model trained to describe photos.

Give it a hand-drawn sketch of a website. It produces working HTML, CSS, and JavaScript — with one minor bug. From a pencil drawing. OCR was barely in the training data.

These aren't planned features. They're emergent — they arise from combining a strong vision encoder, a capable language model, and the right training signal. The phrase book didn't just translate. It unlocked capabilities neither side knew it had.

Where it breaks (and what each failure teaches)

The "bag of patches" problem. A fridge contains strawberries and yogurt on different shelves. "Is there strawberry-flavored yogurt?" LLaVA says yes. It sees "strawberry" and "yogurt" separately but misses that they're different items. It reads the image as a bag of ingredients, not a composed scene.

The resolution ceiling. CLIP sees images at 224×224 pixels. Brand names on small labels, text on distant signs, fine details — all lost. The vision encoder, not the language model, is the bottleneck.
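Some quick arithmetic on why small text disappears. This sketch assumes the "/14" in ViT-L/14 means 14-pixel square patches, which is the usual ViT naming convention; the 336-pixel row is a hypothetical comparison, not a setting described in this post.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side

print(patch_grid(224, 14))  # (16, 256)
print(patch_grid(336, 14))  # (24, 576)  hypothetical higher-resolution input
```

At 224×224 the whole image is summarized by a 16×16 grid, so a small label occupying a fraction of one patch has very little room to register.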

Confident hallucination. When the visual signal is weak, Vicuna's fluency takes over. It generates beautiful, confident descriptions that don't match reality. The language model is too good at sounding right — even when the eyes are wrong.

Two ablations that change how you think

Skip Stage 1, and the model stumbles. Train directly on instruction data without the alignment pre-training? ScienceQA drops from 90.92% to 85.81%, a 5.11-point fall. Without calibrating the phrase book first, the LLM has a much harder time making sense of visual tokens. It's like reading a book in a language you've only half studied.

Data diversity beats data volume. Train on only conversation data: 73.8%. Add all three types: 85.1% — an 11.3-point jump. But here's the subtle part: adding reasoning and description data improved performance on conversation questions too, by 6.6 points. Teaching the model to reason made it better at chatting. Variety doesn't just add skills — it deepens existing ones.
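Both ablations are differences between scores, so it helps to keep percentage points and relative percent apart. A small sketch using only the numbers quoted in this section (the helper names are made up):

```python
def point_drop(before, after):
    """Absolute difference in percentage points."""
    return round(before - after, 2)

def relative_drop(before, after):
    """Difference as a percentage of the starting score."""
    return round(100 * (before - after) / before, 2)

# Skipping Stage 1 on ScienceQA.
print(point_drop(90.92, 85.81))     # 5.11
print(relative_drop(90.92, 85.81))  # 5.62

# All three data types vs conversation-only.
print(point_drop(85.1, 73.8))       # 11.3
```

The two measures happen to land close together for the ScienceQA ablation, but they answer different questions, so it is worth saying which one you mean.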

Why should you care?

  1. Simplicity wins when data is good. One linear layer with 158K high-quality samples competes with multi-billion-parameter bridges trained on hundreds of millions of pairs. If you're building AI systems, invest in data quality before architectural complexity.
  2. You can bootstrap multimodal data from text-only models. GPT-4 never saw an image, yet it wrote training data that taught another model to see. This "describe → generate → train" pipeline is now standard across the field. The data trick is arguably the paper's biggest legacy.
  3. Instruction tuning is the unlock. Without it: 21.5%. With it: 85.1%. Same architecture. Same starting weights. The only difference is what you train on. That 63.6-point gap is the difference between babbling and reasoning.

The one-paragraph version

LLaVA connects a frozen CLIP vision encoder to the Vicuna language model through a single trainable matrix: one multiplication that converts image features into tokens the LLM can read. It trains in two stages: align the visual tokens using 595K caption pairs (4 hours), then instruction-tune on 158K samples that GPT-4 generated from text descriptions of images it never saw (10 hours). Despite this radical simplicity, LLaVA scores 96.5% relative to text-only GPT-4 on complex visual reasoning, beats BLIP-2 by 29 points on the In-the-Wild benchmark, and reaches a state-of-the-art 92.53% on ScienceQA when GPT-4 arbitrates disagreements. The gap between vision and language was never that wide. We just needed the right data to close it.

The napkin takeaway

If connecting vision to language is a translation problem:

  • Flamingo = hired a professional interpreter at every floor of the building (gated cross-attention at every layer)
  • BLIP-2 = built a full translation department with its own staff (Q-Former — an entire separate transformer)
  • LLaVA = handed them a phrase book and realized they already spoke almost the same language (one matrix + 158K good examples)

Same two people. Same conversation. Wildly different assumptions about how hard translation is.

Paper: "Visual Instruction Tuning" — Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. UW-Madison, Microsoft Research, Columbia. NeurIPS 2023 (Oral).

© cvam — written in plaintext, served warm