May 17, 2026 · paperjuice · 14 min read · 3200 words

Flamingo: What If Your AI Could Learn a New Task Just by Seeing a Few Examples?


You want your AI to count flamingos in a photo. The usual approach? Collect 50,000 labeled images of flamingos. Annotate each one by hand. Train a model for three days. Pray it generalizes. Then repeat the entire process when someone asks it to count penguins instead.

DeepMind looked at this and said: what if the model just needed to see four examples?

The paper is called Flamingo, it was published at NeurIPS 2022, and it introduced a visual language model that learns new vision tasks the same way GPT-3 learns new text tasks — by reading a few examples in its prompt. No fine-tuning. No retraining. Just show and tell.

The problem: every new task means starting over.

In 2022, vision AI had a frustrating split personality. On one side, you had powerful vision models (like NFNet or CLIP) that understood images beautifully — but couldn't talk. On the other side, you had massive language models (like Chinchilla) that could write essays — but were completely blind.

If you wanted a model that could look at an image and answer a question about it, you had to build a custom pipeline for each task. Visual question answering? Train a model. Image captioning? Train a different model. Count objects in a video? Yet another model.

That's not intelligence. That's an assembly line.

Flamingo's big idea: give language models eyes.

Flamingo's core insight is almost embarrassingly elegant: take a frozen language model that already understands language, give it a way to see images, and let it learn new visual tasks the same way GPT-3 learns new text tasks — through examples in the prompt.
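To make "examples in the prompt" concrete, here is a minimal sketch of how a few-shot prompt might be assembled. The `<image>` placeholder and `<EOC>` end-of-chunk marker follow the paper's description of interleaved prompts. The function name, the "Output:" wording, and the file names are illustrative assumptions, not a real Flamingo API.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    image: str  # path or ID of a support image (hypothetical)
    text: str   # the answer we want the model to imitate


def build_few_shot_prompt(shots, query_image):
    """Interleave support examples with a final, unanswered query.

    Returns (images, prompt): the images in order, plus a text prompt in
    which each "<image>" placeholder marks where the matching image sits.
    """
    images, chunks = [], []
    for shot in shots:
        images.append(shot.image)
        chunks.append(f"<image>Output: {shot.text}<EOC>")
    # The query gets an image and an open-ended "Output:" for the model to finish.
    images.append(query_image)
    chunks.append("<image>Output:")
    return images, "".join(chunks)


shots = [
    Shot("flamingos_1.jpg", "3 flamingos"),
    Shot("flamingos_2.jpg", "7 flamingos"),
]
images, prompt = build_few_shot_prompt(shots, "penguins.jpg")
print(prompt)
```

Switching from flamingos to penguins means swapping the support examples, not retraining anything.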

But "give it eyes" is easy to say and brutally hard to engineer. Flamingo solves this with three architectural innovations that work together like a relay team:

[Figure: Flamingo architecture. Images and videos pass through a frozen Vision Encoder (NFNet-F6), then the Perceiver Resampler (variable → fixed) to produce visual tokens. Trainable Gated XATTN-Dense layers are interleaved with frozen LM blocks, which take text input and produce text output.]

Fig 1 — Flamingo's architecture. Frozen pretrained models (dashed) stay untouched. Only the Perceiver Resampler and Gated Cross-Attention layers (solid yellow) are trained: roughly 10B of the 80B parameters in the largest model.

1. The Perceiver Resampler — one size fits all.

Here's the first problem: images come in different sizes. Videos have different numbers of frames. But a language model expects a fixed-size input. How do you squeeze an entire video into something a text model can digest?

Flamingo's answer is the Perceiver Resampler. Think of it as a translator who always writes exactly the same length summary, whether you hand them a tweet or a novel. It takes the raw visual features — which could be 50 tokens for a small image or 500 tokens for a long video — and compresses them into exactly 64 fixed visual tokens.

It does this using learned queries — 64 slots that "ask questions" of the visual features through cross-attention, picking out the most important information. The result? A clean, constant-size visual representation regardless of what went in.

The Perceiver Resampler is a universal adapter. One photo, ten video frames, a 4K image: it always outputs the same 64 tokens. Plug and play.

[Figure: Perceiver Resampler, variable in, fixed out. Visual features v₁ … vₙ (variable size) supply the keys and values; learned queries q₁ … q₆₄ (always 64) supply the queries to a cross-attention Transformer, which outputs tokens t₁ … t₆₄ (always 64). Whether the input is 1 image (50 tokens) or 30 video frames (1500 tokens), the Perceiver Resampler always outputs exactly 64 visual tokens.]

Fig 2 — The Perceiver Resampler compresses variable-length visual features into a fixed set of 64 tokens using learned queries and cross-attention.
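The "variable in, fixed out" behavior can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it uses a single attention head, skips the learned projection matrices and feed-forward layers, and uses random numbers as stand-ins for trained queries and real visual features. The names `resample`, `fake_features`, and the dimension `DIM = 8` are assumptions made for the example, and plain Python lists keep it dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def resample(visual_feats, queries):
    """One single-head cross-attention step: queries attend over visual features."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*visual_feats)]  # transpose to (d x n)
    scores = matmul(queries, keys_t)                     # (num_queries x n)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, visual_feats)                 # (num_queries x d)


random.seed(0)
DIM, NUM_QUERIES = 8, 64
# Stand-in for the learned queries; in a real model these are trained parameters.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]


def fake_features(n):
    """Random stand-in for n visual feature vectors from the vision encoder."""
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


for n in (50, 500):
    out = resample(fake_features(n), queries)
    print(n, "features in ->", len(out), "tokens out")
```

The output length depends only on the number of queries, which is why the language model never has to care how big the image or how long the video was.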

2. Gated cross-attention — the gentle injection.

Now you have 64 visual tokens. But how do you feed them to a language model that was trained purely on text? You can't just concatenate them — that would confuse a model that's never seen visual information before.

Flamingo's solution is brilliant: insert new gated cross-attention layers between the existing frozen language model layers. These layers let text tokens "look at" the visual tokens and absorb relevant information.

The "gated" part is the clever bit. Each cross-attention layer's output is scaled by tanh(α), where α is a learnable scalar initialized to zero. Since tanh(0) = 0, the gate starts closed: the visual information has no effect, and the model behaves exactly like the original language model. During training, α moves away from zero and the gate gradually opens, letting visual information flow in at whatever strength training finds useful.

Imagine giving someone glasses for the first time, but the lenses start completely transparent. The prescription gradually sharpens as the person learns to use their new vision. That's the tanh gate.
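Here is a tiny sketch of that gate in action. It shows only the residual gating step, with made-up numbers standing in for the frozen language model's hidden state and the cross-attention output. The function name `gated_residual` and the parameter name `alpha` are assumptions for illustration.

```python
import math


def gated_residual(text_hidden, cross_attn_out, alpha):
    """Compute x + tanh(alpha) * cross_attention_output, elementwise."""
    gate = math.tanh(alpha)
    return [x + gate * c for x, c in zip(text_hidden, cross_attn_out)]


text_hidden = [0.5, -1.0, 2.0]     # hidden state from the frozen LM (made-up values)
cross_attn_out = [0.3, 0.3, -0.4]  # what the visual cross-attention wants to add

closed = gated_residual(text_hidden, cross_attn_out, alpha=0.0)  # at initialization
opened = gated_residual(text_hidden, cross_attn_out, alpha=1.0)  # after some training

print("gate closed:", closed)
print("gate open:  ", opened)
```

With `alpha=0.0` the output is identical to the input, so the freshly inserted layers cannot disturb the pretrained language model on the first training step.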

3. Interleaved attention — see the right image at the right time.

Here's a subtle but critical design choice. When Flamingo processes a sequence with multiple images, each text token only attends to the most recent preceding image — not all previous images.

Why? Imagine reading a children's book. On page 3, when the text says "the red balloon," you look at the picture on page 3 — not the pictures from pages 1 and 2. That's exactly what Flamingo does.

This per-image attention masking has a massive practical benefit: the model is trained with only 5 image-text pairs per sequence, but at test time, it can generalize to 32 or more pairs. It learned a pattern, not a fixed window size.

[Figure: Per-image cross-attention masking. Image A ("A cat on a mat"), Image B ("A dog in a park"), Image C (caption to be generated). When generating the answer for Image C, the text ignores A and B and attends only to C. Trained with 5 images per sequence, the model generalizes to 32+ images at test time, because each text segment only looks at one image.]

Fig 3 — Each text token only cross-attends to its nearest preceding image. This lets the model generalize from 5-shot training to 32-shot evaluation.
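The masking rule is simple enough to write out. The sketch below builds a boolean mask where each token may attend only to the most recent image at or before its position. The token list, the `<image>` tag, and both function names are illustrative assumptions, and letting the `<image>` token itself attend to the image it introduces is a choice made for this example.

```python
def image_index_per_token(tokens, image_tag="<image>"):
    """For each token, the index of the most recent image so far (-1 if none)."""
    current, result = -1, []
    for tok in tokens:
        if tok == image_tag:
            current += 1
        result.append(current)
    return result


def cross_attention_mask(tokens, num_images, image_tag="<image>"):
    """mask[t][i] is True when token t may attend to image i."""
    idx = image_index_per_token(tokens, image_tag)
    return [[i == k for i in range(num_images)] for k in idx]


tokens = ["<image>", "A", "cat", "<image>", "A", "dog", "<image>", "A"]
mask = cross_attention_mask(tokens, num_images=3)

for tok, row in zip(tokens, mask):
    print(f"{tok:8s}", " ".join("x" if ok else "." for ok in row))
```

Every row has at most one allowed image no matter how many images the sequence holds, which is the property that lets a model trained on 5 images cope with 32.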

The training recipe: web-scale, no hand-labeling.

Flamingo's training data is remarkably clever in its simplicity. No human annotators. No curated datasets. Just the internet.

Three data sources, all scraped from the web:

  • M3W — 43 million webpages with naturally interleaved images and text (the HTML itself becomes training data)
  • ALIGN + LTIP — 2.1 billion image-text pairs from alt-text and long captions
  • VTP — 27 million short video clips paired with text descriptions

And here's the key constraint: the vision encoder (NFNet-F6) and the language model (Chinchilla, 70B parameters) are both completely frozen. Flamingo only trains the Perceiver Resampler and the gated cross-attention layers, which add roughly 10 billion new parameters to a model of about 80B in total. That's around 13% of the model trained from scratch. Everything else keeps its pretrained knowledge intact.

Flamingo doesn't rebuild the brain. It adds a visual cortex to an existing brain and only trains the new wiring.

Does it work? Embarrassingly well.

Flamingo was tested on 16 benchmarks spanning image captioning, visual question answering, video understanding, and visual classification. The results are staggering.

The headline numbers:

  • State of the art on 6 of 16 tasks — with just 32 few-shot examples and zero fine-tuning
  • Beats fine-tuned models that used 1,000× more labeled data on several benchmarks
  • 103.2 CIDEr on COCO captioning (4-shot), rising to 113.8 with 32 shots
  • 67.6 on VQAv2 (32-shot) vs. the fine-tuned SOTA of 80.2. But remember, Flamingo saw 32 examples, while the fine-tuned model trained on hundreds of thousands

Read that second bullet again. Models that were custom-trained on thousands of labeled examples, with architectures designed specifically for one task — beaten by a general-purpose model that just looked at a handful of examples in its prompt.

The surprise: dialogue falls out for free.

Here's something DeepMind didn't explicitly train for, and probably didn't fully expect.

Because Flamingo handles interleaved images and text, it turns out the model can do multi-turn visual dialogue out of the box. You show it an image, ask a question, get an answer, ask a follow-up — and it holds context across the conversation.

In one demo, researchers showed Flamingo a DALL·E 2 generated image of a "soup monster", something very unlikely to appear in any training data, and had a coherent multi-turn conversation about it. The model described what it saw, speculated about what the creature might eat, and even cracked a joke.

This is emergent behavior. Nobody designed a "dialogue module." The architecture was flexible enough that conversation just... worked.

Scaling: bigger model, better few-shot.

[Figure: Flamingo few-shot performance by model size. Relative score for the 3B, 9B, and 80B models at 4-shot, 8-shot, and 32-shot. More shots + bigger model = consistently better. 80B Flamingo with 32 shots is the sweet spot.]

Fig 4 — Flamingo's few-shot performance scales cleanly with both model size and number of examples, mirroring GPT-3's scaling behavior in the language domain.

Like GPT-3 in the text world, Flamingo shows clean scaling trends in the vision world. Three model sizes were tested: Flamingo-3B, Flamingo-9B, and Flamingo-80B.

Two trends hold broadly across the benchmarks:

  1. Bigger models are better few-shot learners. Flamingo-80B consistently crushes Flamingo-3B, even with the same number of examples.
  2. More examples generally help. Going from 4-shot to 32-shot improves each model on most benchmarks, though the size of the gain varies by task.

When fine-tuned (unfreezing the vision encoder with a small learning rate), Flamingo-80B sets a new state of the art on 5 additional benchmarks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes.

Why should you care?

If you work with images, video, or multimodal AI, Flamingo matters for three reasons:

  1. You don't need a dataset for every task. Flamingo showed that a single model with a few examples in its prompt can match or beat, on several benchmarks, task-specific models fine-tuned on thousands of times more labeled data. The era of "collect 50K images, label them, train a model" is ending.
  2. Frozen pretrained models are your friends. By keeping the vision encoder and language model frozen, Flamingo avoids catastrophic forgetting and keeps training costs manageable. This design pattern — freeze the big stuff, train the glue — is now standard in modern VLMs.
  3. The architecture pattern survived. Flamingo's gated cross-attention approach directly influenced models like IDEFICS and OpenFlamingo, and it shaped the generation of visual-language models that followed. If you use GPT-4V, Claude with vision, or Gemini, you're likely benefiting from ideas Flamingo helped popularize.

The one-paragraph version.

Flamingo is a visual language model from DeepMind that learns new vision tasks from just a handful of examples — no fine-tuning needed. It works by bridging a frozen vision encoder and a frozen language model with two key innovations: a Perceiver Resampler that compresses any image or video into a fixed set of visual tokens, and gated cross-attention layers that let the language model "see" those tokens without forgetting how to read. Trained on billions of image-text pairs scraped from the web, Flamingo-80B beats fine-tuned specialist models on 6 of 16 benchmarks using only 32 examples — outperforming systems trained on 1,000× more data.

The napkin takeaway.

If teaching AI to see is like learning a language:

  • Traditional vision models = enrolling in a 4-year university program for each language
  • CLIP = learning to match photos with captions in a flashcard app
  • Flamingo = showing a polyglot four example sentences and watching them start speaking fluently

Same brain. Same eyes. Radically less hand-holding.

Paper: "Flamingo: a Visual Language Model for Few-Shot Learning" — Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech et al., DeepMind. NeurIPS 2022.

© cvam — written in plaintext, served warm