May 13, 2026 · paperjuice · 13 min read · 2900 words

FlashAttention-3 — Teaching Old Attention New Hardware Tricks.

paperjuice ml attention gpu-optimization flashattention-series

NVIDIA released the H100 GPU in 2022. It was a monster — 3× the tensor core throughput of the A100, new asynchronous execution pipelines, and native FP8 support. Every CUDA kernel on earth got faster for free.

Except FlashAttention. It got faster, sure, but it reached only about 35% utilization on the H100. The old code was leaving two-thirds of the new hardware's power unused.

The A100 version couldn't tap into the H100's best features because those features didn't exist when FlashAttention-2 was written. It's like buying a car with a turbocharger and never pushing the boost button because your old car didn't have one.

Part 3 of the FlashAttention evolution series. Part 2 covered how FA-2 doubled speed through better work partitioning. Now it's time to exploit new hardware.

The problem: new GPU, old tricks

NVIDIA's Hopper architecture (H100) introduced three game-changing features:

  • Asynchronous execution — the Tensor Memory Accelerator (TMA) can load data from HBM while tensor cores are busy computing. Two things happening at once instead of taking turns.
  • Warp specialization — different warps can be dedicated to different roles (some load data, some compute). Like having dedicated sous chefs for prep and line cooks for cooking, instead of everyone doing everything.
  • FP8 precision — native hardware support for 8-bit floating point. Half the bits of FP16 means you can move twice the data in the same time and fit twice the data in the same memory.

FlashAttention-2 couldn't use any of these. It was written for the A100's programming model, where everything happens synchronously — load data, wait, compute, wait, load more data. On the H100, that's like running on one leg when you have two.

The big idea: overlap everything

FlashAttention-3's core principle is asynchronous pipelining. Instead of the linear "load → compute → load → compute" pattern, it overlaps loading and computing so neither the memory unit nor the tensor cores are ever idle.
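To make the serial version concrete, here's a minimal NumPy sketch of the blockwise loop (the online-softmax recurrence from the earlier posts). The names and shapes are mine, not the paper's; the point is the strict alternation: fetch a K/V block, then compute on it, then fetch the next one.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    """Serial FlashAttention-style loop: fetch a K/V block, then compute on it.
    Strictly one step at a time, which is the pattern FA-3 breaks apart."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator

    for start in range(0, K.shape[0], block):
        Kj = K[start:start + block]          # "load": an HBM -> shared-memory copy in the real kernel
        Vj = V[start:start + block]
        S = Q @ Kj.T / np.sqrt(d)            # "compute": the QK^T matmul (tensor cores)
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])     # softmax numerator (the non-matmul work)
        scale = np.exp(row_max - new_max)    # rescale previous partial results
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vj  # the PV matmul
        row_max = new_max

    return out / row_sum[:, None]
```

Every trip around that loop alternates between fetching and crunching, and everything FlashAttention-3 does is about breaking that alternation apart.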

1. Warp specialization — divide and conquer

FlashAttention-3 splits its warps into two teams: producers and consumers.

Producers are responsible for loading the next block of K and V from HBM into shared memory using TMA. They don't do any math. They're the warehouse team, stocking shelves.

Consumers take whatever's been loaded and run the attention computation — the QKᵀ matmul, softmax, and PV matmul. They're the cashiers, processing customers.

The genius is that these happen simultaneously. While consumers are computing attention on block j, producers are already loading block j+1. When consumers finish, the next block is already waiting. Zero idle time.

FlashAttention-2 ran like a relay race — one runner at a time. FlashAttention-3 runs like a factory assembly line — loading and computing happen in parallel, continuously.
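Here's a toy Python sketch of the producer/consumer split. It's an analogy for the schedule, not H100 code: a background thread stands in for the producer warps issuing TMA copies, a bounded queue of size 2 stands in for double-buffered shared memory, and all the names are made up for illustration.

```python
import threading
import queue
import numpy as np

def attention_pipelined(Q, K, V, block=128, buffers=2):
    """Toy producer/consumer version of the blockwise loop: a background thread
    stages K/V blocks (the producer warps + TMA) while the main thread computes
    on whatever has already been staged (the consumer warps). The bounded queue
    plays the role of double-buffered shared memory."""
    staged = queue.Queue(maxsize=buffers)

    def producer():
        for start in range(0, K.shape[0], block):
            staged.put((K[start:start + block], V[start:start + block]))
        staged.put(None)  # sentinel: nothing left to load

    threading.Thread(target=producer, daemon=True).start()

    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)
    row_sum = np.zeros(n)

    while (item := staged.get()) is not None:
        Kj, Vj = item                        # consume whatever block is already staged
        S = Q @ Kj.T / np.sqrt(d)
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])
        scale = np.exp(row_max - new_max)
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vj
        row_max = new_max

    return out / row_sum[:, None]
```

Python's GIL means the threads don't truly overlap, so read this as a picture of the schedule: loads get queued ahead of the compute that consumes them, which is what producer warps and TMA buy you on real hardware.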

2. Ping-pong scheduling — interleaving matmul and softmax

Here's a subtle problem: even with data loading overlapped, there's another bottleneck inside the computation itself. The two matmuls (QKᵀ and PV) use tensor cores, but the softmax between them uses general-purpose cores. While softmax runs, tensor cores sit idle.

FlashAttention-3 fixes this with a "ping-pong" schedule. It works on two tiles at once. While tile A's softmax runs on the general-purpose cores, tile B's matmul runs on the tensor cores. Then they swap. Neither unit is ever waiting for the other.

It's like a juggler keeping two balls in the air — one hand catches while the other throws.

Fig 1 — The ping-pong pipeline: TMA loads, tensor-core matmuls, and softmax all overlap, so no unit sits idle.
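A tiny scheduling sketch (invented step granularity, nothing GPU-specific) shows why two in-flight tiles keep both units busy: at every step, one tile's matmul occupies the tensor cores while the other tile's softmax occupies the general-purpose cores, and then they swap.

```python
def ping_pong_schedule(steps=6):
    """Toy schedule for two in-flight tiles, A and B: while one tile's matmul
    runs on the tensor cores, the other tile's softmax runs on the
    general-purpose cores, and they swap roles every step."""
    for t in range(steps):
        matmul_tile, softmax_tile = ("A", "B") if t % 2 == 0 else ("B", "A")
        print(f"step {t}:  tensor cores -> tile {matmul_tile} matmul  |  "
              f"softmax unit -> tile {softmax_tile} softmax")

ping_pong_schedule()
```

With only one tile in flight, every softmax step would leave the tensor cores parked; the offset between tiles A and B is what fills those gaps.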

3. FP8 with incoherent processing — fast AND accurate

FP8 (8-bit floating point) is tempting because it halves the memory bandwidth needed compared to FP16. But naively quantizing attention scores to FP8 introduces significant numerical errors. Some tokens get huge attention scores while others are tiny — and 8 bits can't represent both accurately.

FlashAttention-3 uses a clever trick called incoherent processing. Before quantizing, it multiplies Q and K by the same random orthogonal matrix (in practice a Hadamard transform with random signs), which "spreads out" the outliers and makes the values more uniform. Because the matrix is orthogonal, the rotation cancels inside QKᵀ, so the attention scores are mathematically unchanged; only the quantization error shrinks. The net effect: you get FP8 speed with 2.6× lower numerical error than naive FP8 attention.

Think of it like packing a suitcase. If you have a few bulky items and many small ones, they pack poorly — wasted space everywhere. But if you reorganize everything to be roughly the same size first, everything fits snugly. Same contents, better packing, less wasted precision.
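A small NumPy experiment makes the packing intuition tangible. This isn't the FA-3 kernel (the paper uses a random-sign Hadamard transform and real FP8 hardware); here a QR-based random orthogonal matrix and a crude absmax-scaled 8-bit-style quantizer stand in for both, just to show that rotating before quantizing shrinks the error while leaving QKᵀ mathematically unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(x, levels=256):
    """Crude stand-in for FP8: per-tensor absmax scaling onto a uniform grid."""
    scale = np.abs(x).max() / (levels / 2 - 1)
    return np.round(x / scale) * scale

d = 128
Q = rng.standard_normal((64, d))
K = rng.standard_normal((64, d))
Q[:, 0] *= 50   # inject an outlier feature, like the ones seen in real activations
K[:, 0] *= 50

# Random orthogonal matrix via QR (the paper uses a random-sign Hadamard transform).
M, _ = np.linalg.qr(rng.standard_normal((d, d)))

exact = Q @ K.T
naive = fake_quantize(Q) @ fake_quantize(K).T             # quantize directly
rotated = fake_quantize(Q @ M) @ fake_quantize(K @ M).T   # rotate first, then quantize

# Because M is orthogonal, (QM)(KM)^T equals QK^T exactly; nothing to undo afterwards.
print("max error, naive quantization:  ", np.abs(naive - exact).max())
print("max error, rotate-then-quantize:", np.abs(rotated - exact).max())
```

The quantizer here is a uniform grid rather than real FP8 (whose spacing is non-uniform), so the absolute numbers don't mean much; the gap between the two printed errors is the point.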

Does it actually work?

  • 1.5–2.0× speedup over FlashAttention-2 on H100 with FP16
  • 740 TFLOPs/s in FP16 — 75% of the H100's theoretical maximum
  • ~1.2 PFLOPs/s in FP8 — yes, that's petaFLOPS, from a single GPU
  • 2.6× lower numerical error than naive FP8 attention

Let's put that in perspective. FlashAttention-1 on A100 did about 120 TFLOPs/s. FlashAttention-2 pushed that to 225 TFLOPs/s. Now FlashAttention-3 on H100 hits 740 TFLOPs/s. That's roughly a 6× improvement in about two years — and most of it came from smarter software, not just faster hardware.

The surprise: the hardware gap keeps growing

Here's something I didn't expect. Even after all these optimizations, FlashAttention-3 only hits 75% utilization on H100. Where's the other 25%?

The answer is softmax. On the A100, the bottleneck was memory bandwidth. On the H100, memory bandwidth is better but tensor cores are so fast that now the bottleneck is the general-purpose cores running softmax. The faster tensor cores get, the more the non-matmul operations become the choking point.

This is a hardware design trend: each GPU generation doubles matmul speed but barely speeds up everything else. It's like building a faster engine without widening the exhaust pipe. FlashAttention-4 (spoiler) will have to deal with this problem getting even worse on the next GPU generation.

Why should you care?

  1. Hardware-software co-design is mandatory now. You can't just write "GPU code" anymore. You need to write "H100 code" or "A100 code." Each architecture has different bottlenecks and different features to exploit.
  2. FP8 is real and it works. FlashAttention-3 proved that lower precision doesn't have to mean worse results. With the right tricks (incoherent processing), FP8 attention is both faster and surprisingly accurate.
  3. The bottleneck shifted. Attention's bottleneck went from memory bandwidth (FA-1 era) → GPU utilization (FA-2 era) → non-matmul throughput (FA-3 era). Each paper solved the current bottleneck and revealed the next one.

The one-paragraph version

FlashAttention-3 rewrites FlashAttention for the H100's new capabilities: warp specialization to overlap data loading with computation, a ping-pong schedule to run matmul and softmax simultaneously on different hardware units, and FP8 support with incoherent processing for accurate low-precision math. The result: 1.5–2× faster than FA-2, hitting 75% of H100's theoretical max in FP16 and crossing the petaFLOP barrier in FP8.

The napkin takeaway

The FlashAttention evolution so far:

  • FA-1 = stopped wasting time in traffic (eliminated memory round-trips)
  • FA-2 = learned to use all lanes (better GPU utilization through work partitioning)
  • FA-3 = upgraded to a hybrid engine (overlap loading + computing + softmax, all running simultaneously)

Each version solved the bottleneck the previous one revealed.

Paper: "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision" — Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. Together AI / Colfax / NVIDIA / Georgia Tech. 2024.

Next up: FlashAttention-4 — Blackwell GPUs arrive, tensor cores get so fast that exponential functions become the bottleneck, and the team has to fake math in software to keep up.

← FlashAttention-2 (Paper Juice) | FlashAttention-4 (Paper Juice) →
© cvam — written in plaintext, served warm