Flash Attention FA4 for Blackwell GPUs | Boost Performance & Speed

Last week a new version of Flash Attention was released, already the fourth iteration. This time the developers focused on optimizing for the new Blackwell data-center architecture (B200 and GB200). Just as FA3 was tailored to Hopper, FA4 targets these server parts, so the performance gains on consumer 5090-series GPUs are modest.

For the BF16 format, the new algorithm runs up to 1.3x faster than cuDNN 9.13 and up to 2.7x faster than Triton. It delivers an impressive 1.6 PFLOPS, which is about 71% of the theoretical maximum for B200. It's worth noting that some of these optimizations have already appeared in recent cuDNN versions.

Key tricks include software emulation of the exponential function (exp), conditional softmax scaling, and, in the backward pass, the use of tensor memory and the 2-CTA MMA mode. These approaches significantly reduce pressure on shared memory. In addition, the entire kernel is now written in Python using CuTe-DSL, with no heavy C++ templates, which sped up compilation by 20-30x.
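To make the "conditional softmax scaling" idea concrete, here is a minimal NumPy sketch (my own illustration, not FA4's actual kernel code): in the standard online softmax, the running sum is rescaled on every block; the conditional variant rescales only when a new block actually raises the running maximum, skipping the correction work otherwise.

```python
import numpy as np

def online_softmax_rowsum(score_blocks):
    """Online softmax denominator for one query row, processed block by block.

    Toy sketch of conditional rescaling: the accumulated sum `s` is
    corrected only when a block raises the running maximum `m`, instead
    of unconditionally on every block.
    """
    m = -np.inf   # running maximum of scores seen so far
    s = 0.0       # running sum of exp(score - m)
    rescales = 0  # how many times we actually had to rescale
    for block in score_blocks:
        m_new = max(m, float(block.max()))
        if m_new > m:  # rescale only when the max actually changes
            s *= np.exp(m - m_new) if np.isfinite(m) else 0.0
            m = m_new
            rescales += 1
        s += np.exp(block - m).sum()
    return m, s, rescales
```

For attention scores that arrive in roughly descending order, most blocks never raise the maximum, so the rescale branch (and its exp) is skipped most of the time, which is where the savings come from.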

What is tensor memory? It's a dedicated ultra-fast on-chip buffer in Blackwell, located right next to the tensor cores, where intermediate results can be kept, reducing trips to shared memory. The 2-CTA MMA mode is a method where one matrix multiplication is performed cooperatively by two CTAs instead of one, which allows larger tiles and reduces shared-memory traffic, making the backward pass more efficient.
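Why larger tiles cut traffic can be seen with back-of-envelope arithmetic (a toy model of attention tiling, not FA4's actual schedule): the full K/V is streamed once per tile of query rows, so if a 2-CTA pair can cooperatively cover twice as many query rows as a single CTA, K/V is re-streamed half as often.

```python
import math

def kv_stream_count(seq_len, q_tile_rows):
    """How many times all of K/V is streamed through on-chip memory
    when queries are processed in tiles of `q_tile_rows` rows.
    Toy tiling model: one full K/V pass per query tile.
    """
    return math.ceil(seq_len / q_tile_rows)

# One CTA covering 64 query rows vs. a 2-CTA pair covering 128:
# kv_stream_count(4096, 64)  -> 64 full K/V passes
# kv_stream_count(4096, 128) -> 32 full K/V passes
```

The tile sizes here (64 and 128 rows) are illustrative assumptions; the real kernel's tile shapes depend on head dimension, data type, and register/shared-memory budgets.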

If you’re interested in this area, I recommend finding a detailed breakdown on YouTube explaining how FA4 works — it’s very clearly explained.

Most importantly: there is a scientific paper and open-source code available for those who want to dive deeper.
