Most H100 deployments today run FlashAttention 2. FA3 shipped in 2024 and adds warp-specialized scheduling and FP8 attention support for Hopper, but the actual throughput gain depends heavily on sequence length and whether your stack is current enough to use it. This post covers what changes architecturally, what the throughput numbers look like at 2K-128K sequence lengths on H100 SXM5 and H200 SXM5, and whether migrating from FA2 to FA3 is worth the effort for your workload.
For the next generation beyond FA3, see FlashAttention-4 on Blackwell.
FlashAttention Version Timeline
A quick reference before getting into the architecture details:
| Version | Year | Hardware target | Key advance |
|---|---|---|---|
| FA1 | 2022 | Ampere (A100) | IO-aware tiling, HBM read reduction |
| FA2 | 2023 | Ampere + Ada | Warp-level parallelism, MQA/GQA support |
| FA3 | 2024 | Hopper (H100, H200) | Async warp specialization, FP8 support |
| FA4 | 2025 | Blackwell (B200, B300) | TMA tile execution, SM100 native |
FA1 established the core insight: recomputing attention on-chip is faster than reading a materialized attention matrix from HBM. FA2 improved the parallelism strategy so more of the GPU's threads were doing useful work. FA3 added Hopper-specific features that FA2's kernel design couldn't use. FA4 targets Blackwell SM100 and is covered in the FlashAttention-4 guide.
Modern frameworks auto-detect the GPU architecture and select the appropriate version. vLLM v0.6+ and SGLang v0.4+ pick FA3 on H100/H200, FA4 on B200/B300, and FA2 on A100/L40S without any configuration needed. You only need explicit flags when overriding for debugging or benchmarking.
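The selection rule in the table can be sketched as a lookup on CUDA compute capability. This is an illustration only: `pick_flash_attention` is a hypothetical helper, not the selection code of vLLM or SGLang, and it assumes the usual capability numbers (8.0 for A100, 8.9 for L40S and RTX 4090, 9.0 for H100/H200, 10.x for B200/B300).

```python
def pick_flash_attention(compute_capability: str) -> str:
    """Map a compute capability string such as "9.0" to a FlashAttention generation.

    Hypothetical helper that mirrors the version table; real frameworks
    apply additional checks (dtype, head dimension, installed packages).
    """
    major = int(compute_capability.split(".")[0])
    if major == 10:   # Blackwell data-center parts (B200, B300)
        return "FA4"
    if major == 9:    # Hopper (H100, H200)
        return "FA3"
    if major == 8:    # Ampere (8.0) and Ada Lovelace (8.9)
        return "FA2"
    raise ValueError(f"no mapping for compute capability {compute_capability}")


for cap in ("8.0", "8.9", "9.0", "10.0"):
    print(cap, "->", pick_flash_attention(cap))
```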
How FlashAttention 2 Works
FA2 rests on four mechanisms. The first two carry over from the original FlashAttention; the last two are FA2's own additions:
Block-wise softmax tiling. FA2 keeps query, key, and value tiles in SRAM during computation. Rather than loading the full attention matrix from HBM, it computes attention in chunks that fit in the GPU's on-chip memory. This cuts HBM reads by the ratio of sequence length to tile size.
Online softmax. FA2 computes softmax using a running maximum and denominator, which lets it process attention blocks sequentially without materializing the N×N attention matrix. This is what makes tiling possible without numerical instability.
Warp-level parallelism across sequence positions. FA2 assigns different warps to different sequence positions, improving parallelism over FA1's thread-block-level splitting.
GQA and MQA support. FA2 added kernels for grouped-query attention and multi-query attention, used by Llama, Mistral, and most other models released since 2023.
FA2 works on Ampere (A100) and Ada Lovelace (L40S, RTX 4090) through the same kernel path. On Hopper, FA2 runs but does not use any of Hopper's async pipeline hardware.
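To make the tiling and online-softmax ideas concrete, here is a minimal NumPy sketch. It is a CPU illustration of the arithmetic, not the CUDA kernel: the function names and the tile size are assumptions chosen for the example, and it processes K/V in tiles while carrying a running row maximum and denominator, so the full N×N score matrix is never held at once.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference: materializes the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, tile=16):
    """Attention over K/V tiles using a running max and denominator."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running maximum per query row
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, k.shape[0], tile):
        k_t = k[start:start + tile]
        v_t = v[start:start + tile]
        s = (q @ k_t.T) * scale          # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)  # rescales earlier partial results
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_t
        row_max = new_max
    return out / row_sum


rng = np.random.default_rng(0)
q = rng.standard_normal((32, 8))
k = rng.standard_normal((48, 8))
v = rng.standard_normal((48, 8))
diff = np.abs(tiled_attention(q, k, v) - naive_attention(q, k, v)).max()
print(f"max abs difference vs. naive attention: {diff:.2e}")
```

The `correction` factor is the key step: whenever a new tile raises the running maximum, the partial output and denominator accumulated so far are rescaled to the new reference point.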
What FlashAttention 3 Adds for Hopper
FA3 is not a general improvement over FA2. It is a Hopper-specific rewrite that uses three SM90 features FA2 cannot access:
Warp specialization. FA3 splits warps into producers and consumers. Producer warps fetch Q/K/V tiles asynchronously using Hopper's Tensor Memory Accelerator (TMA). Consumer warps run the GEMMs through asynchronous WGMMA instructions and compute softmax on tiles already in SRAM. In FA2, every warp does both data fetching and computation, which creates a sequential dependency: compute stops and waits for data. FA3 removes most of this stall by overlapping the two.
Ping-pong scheduling. FA3 runs two consumer warpgroups and alternates them: while one warpgroup computes softmax on its tile, the other issues its GEMMs to the tensor cores, and then the roles swap. Meanwhile, producer warps load the next tiles into a free SRAM buffer. This keeps the tensor cores fed continuously without pipeline bubbles.
FP8 attention computation. FA3 can execute the Q@K^T and softmax(.)@V matrix multiplications in FP8 instead of BF16. The output is upcast to BF16 before returning. FP8 attention requires Transformer Engine's per-tensor scaling headers and flash-attn 2.7+.
Here is a direct comparison of the two architectures:
| Attribute | FA2 | FA3 (Hopper) |
|---|---|---|
| Kernel model | Thread-block tiling, online softmax | Warp-specialized, async pipelining |
| Hardware target | Ampere/Ada (SM80/SM89) | Hopper (SM90) |
| FP8 support | No | Yes (Q@K^T and softmax(.)@V matmuls) |
| GQA/MQA | Yes | Yes |
| Relative throughput (8K-32K context) | Baseline | 1.5-1.75x |
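The benefit of overlapping loads with compute can be seen in a toy timing model. This is an idealized sketch, not a measurement: the function names and the per-tile costs are made up for illustration, and it assumes every tile takes the same time to load and to compute.

```python
def serialized_time(n_tiles: int, load: float, compute: float) -> float:
    """Every tile is loaded, then computed, with no overlap."""
    return n_tiles * (load + compute)


def overlapped_time(n_tiles: int, load: float, compute: float) -> float:
    """Producer loads tile i+1 while the consumer computes tile i."""
    if n_tiles == 0:
        return 0.0
    # First load cannot be hidden; afterwards each step costs the slower
    # of the two stages; the final compute has no load left to overlap.
    return load + (n_tiles - 1) * max(load, compute) + compute


n, load, compute = 64, 1.0, 2.0  # arbitrary time units, hypothetical values
print("serialized:", serialized_time(n, load, compute))
print("overlapped:", overlapped_time(n, load, compute))
```

In this model the overlapped schedule approaches `max(load, compute)` per tile, so the gain is largest when load and compute times are comparable and shrinks when one stage dominates.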
H100 SXM5: FA2 vs FA3 Throughput Benchmarks
The table below shows attention kernel throughput in TFLOPS at different sequence lengths on H100 SXM5. Figures are directional, based on the FA3 research paper (Shah et al., 2024) and community benchmarks. Verify against your specific model architecture before making capacity decisions.
| Sequence length | FA2 (H100 SXM5) | FA3 (H100 SXM5) | FA3/FA2 ratio |
|---|---|---|---|
| 2K | ~210 TFLOPS | ~250 TFLOPS | ~1.19x |
| 8K | ~195 TFLOPS | ~300 TFLOPS | ~1.54x |
| 32K | ~155 TFLOPS | ~275 TFLOPS | ~1.77x |
| 128K | ~90 TFLOPS | ~220 TFLOPS | ~2.44x |
The pattern is consistent with FA3's architecture: the gain is modest at short contexts (compute-bound, where better memory pipelining matters less) and grows substantially at long contexts (memory-bandwidth-bound, where eliminating pipeline stalls reduces idle time). FA4 shows the same pattern over FA3.
H200 SXM5: FA2 vs FA3 Throughput Benchmarks
H200 SXM5 has the same compute capacity as H100 SXM5 in FP16 TFLOPS, but 4.8 TB/s HBM3e bandwidth versus H100's 3.35 TB/s. That 43% bandwidth increase matters most at long context lengths where attention is memory-bound.
| Sequence length | FA2 (H200 SXM5) | FA3 (H200 SXM5) | FA3/FA2 ratio |
|---|---|---|---|
| 2K | ~215 TFLOPS | ~260 TFLOPS | ~1.21x |
| 8K | ~275 TFLOPS | ~425 TFLOPS | ~1.55x |
| 32K | ~220 TFLOPS | ~395 TFLOPS | ~1.80x |
| 128K | ~130 TFLOPS | ~305 TFLOPS | ~2.35x |
Figures are directional based on published benchmarks. Verify against your workload before using these numbers in architecture decisions.
Two things stand out from this table. H200's absolute throughput at 8K+ context is substantially higher than H100's, reflecting the bandwidth advantage. FA3's relative gain over FA2 on H200 is slightly lower than on H100 at 128K (2.35x vs 2.44x). This is expected: H200's higher baseline bandwidth already compensates for some of the memory latency that FA3 hides through async pipelining. FA3 is still the right choice on H200, but H200 leaves less performance on the table with FA2 than H100 does.
FP8 Attention on FA3
FA3 exposes a second compute path where the Q@K^T and softmax(.)@V matrix multiplications run in FP8 instead of BF16. This is different from what people usually mean by "FP8 inference," which refers to weight quantization at the model level. FP8 attention is about the precision of the attention operation itself.
The precision split in FA3 FP8 attention:
- Q, K, V projections: BF16 (set at the model/vLLM level, not by FA3)
- Q@K^T matmul: FP8
- Softmax: computed in FP32 on the matmul accumulator, with the resulting probabilities cast to FP8 (with scaling) for the next matmul
- Softmax(.)@V matmul: FP8
- Output: upcast to BF16 before returning to the model
FA3 FP8 attention requires NVIDIA Transformer Engine's per-tensor scaling infrastructure to handle the dynamic range calibration that keeps FP8 numerics stable.
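A rough feel for what per-tensor scaling does can be had from a NumPy emulation. This is a sketch under stated assumptions, not the FA3 or Transformer Engine API: `fake_e4m3` and `quantize_per_tensor` are hypothetical helpers, the rounding keeps 4 significant bits and clips at 448 (the largest finite E4M3 value) while ignoring subnormals, and the inputs are random rather than real activations.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def fake_e4m3(x):
    """Crude E4M3 emulation: clip to range, keep 4 significant bits."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mantissa, exponent = np.frexp(x)             # x = mantissa * 2**exponent
    mantissa = np.round(mantissa * 16.0) / 16.0  # 1 implicit + 3 stored bits
    return np.ldexp(mantissa, exponent)


def quantize_per_tensor(x):
    """Scale the tensor so its largest magnitude maps to E4M3_MAX."""
    scale = np.abs(x).max() / E4M3_MAX
    return fake_e4m3(x / scale), scale


rng = np.random.default_rng(0)
q = rng.standard_normal((64, 128)).astype(np.float32)
k = rng.standard_normal((64, 128)).astype(np.float32)

q8, q_scale = quantize_per_tensor(q)
k8, k_scale = quantize_per_tensor(k)

scores_ref = q @ k.T
scores_fp8 = (q8 @ k8.T) * (q_scale * k_scale)  # undo both scales after the matmul

rel_err = np.linalg.norm(scores_fp8 - scores_ref) / np.linalg.norm(scores_ref)
print(f"relative error of Q@K^T with simulated FP8 inputs: {rel_err:.4f}")
```

The scale factor is what keeps the tensor inside the narrow FP8 range; a single outlier inflates `scale` and costs precision for every other element, which is one reason calibration matters.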
Throughput gain. FA3 FP8 attention adds roughly 15-25% throughput on top of FA3 BF16 at 8K+ context. The gain is additive with FP8 weight quantization, since they affect different parts of the computation.
Accuracy. At standard context lengths (2K-32K), the quality gap between FA3 FP8 attention and FA3 BF16 is within measurement noise on common benchmarks. At very long context (128K+), numerical precision matters more and FP8 attention can degrade output quality. Always validate on your target task before enabling FP8 attention in production.
For a full breakdown of FP8 formats (E4M3 vs E5M2) and how they interact with weight quantization, see FP8 quantization explained.
vLLM and SGLang: End-to-End Throughput Impact
Attention kernel throughput does not translate 1:1 to end-to-end serving throughput, since prefill and decode both involve non-attention operations like linear layers, normalization, and sampling. But FA3 does produce measurable improvements in serving metrics, especially at larger batch sizes and longer input lengths.
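Amdahl's law gives a quick way to see why the kernel speedup is diluted. The sketch below is illustrative: the attention fractions are hypothetical, and the 1.54x kernel speedup is the 8K figure from the H100 table above.

```python
def end_to_end_speedup(attention_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: only the attention share of the runtime gets faster."""
    return 1.0 / ((1.0 - attention_fraction) + attention_fraction / kernel_speedup)


kernel_speedup = 1.54  # FA3/FA2 ratio at 8K on H100 SXM5 (table above)
for fraction in (0.1, 0.4, 0.7):  # hypothetical share of time spent in attention
    total = end_to_end_speedup(fraction, kernel_speedup)
    print(f"attention = {fraction:.0%} of runtime -> {total:.2f}x end to end")
```

The attention share grows with input length, which is why the serving-level gains show up mainly at longer inputs and larger batches.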
Directional figures for Llama 3.3 70B, BF16, H100 SXM5, 100 concurrent requests:
Time to first token (TTFT):
- At 2K input: FA3 and FA2 are roughly equivalent (less than 5% difference). Prefill at short context is compute-bound by the linear layers.
- At 8K input: FA3 reduces TTFT by 15-20%. The prefill phase becomes attention-heavy at longer inputs.
Inter-token latency (ITL):
- At batch size 8: The difference is small (less than 5%). Memory bandwidth at small batch sizes is not attention-dominated.
- At batch size 32+: FA3 reduces ITL by 10-15%. Larger batch sizes increase the attention memory footprint, where FA3's async pipelining helps.
SGLang shows a similar pattern. SGLang's RadixAttention cache layer works with both FA2 and FA3 without modification.
These are directional figures based on published community benchmarks. Your numbers will differ based on model, hardware SKU, and serving configuration. For a full framework comparison across vLLM, TensorRT-LLM, and SGLang, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
When FA2 Is Still the Right Call
FA3 is better than FA2 on H100 and H200. That said, there are cases where staying on FA2 is the practical choice:
Older CUDA or PyTorch stacks. FA3 requires CUDA 12.3+ and PyTorch 2.2+. Running PyTorch 2.0 or CUDA 12.2 means enabling FA3 requires a full stack upgrade. The upgrade is worth doing for a new deployment, but not necessarily mid-flight for a stable production service.
Non-Hopper hardware. FA3 is SM90-only. A100, L40S, and RTX 4090 all run FA2 regardless of what flash-attn version you have installed. If you're deploying across a mixed GPU fleet, FA3 activates only on the Hopper nodes; Ampere/Ada nodes run FA2 automatically. You don't need separate container images.
Non-power-of-2 head dimensions. FA3 as of flash-attn 2.7 has coverage gaps for non-standard head dimensions. If your model uses head dimensions that aren't powers of 2 (such as 96 or 192), test FA3 compatibility before deploying.
Custom masking patterns. FA3 did not ship all of FA2's custom mask options at launch. If you rely on non-causal masking modes or sliding window attention with unusual patterns, test the specific mask configuration before switching.
For most Hopper deployments running standard Llama, Mistral, or Qwen model families with 128-dim heads, FA3 works without issues and the upgrade path is straightforward.
Migrating from FA2 to FA3
Install

```bash
# FA3 support requires flash-attn >= 2.7.0 and CUDA 12.3+
pip install "flash-attn>=2.7.0" --no-build-isolation

# Verify FA3 is available
python -c "import flash_attn
print(flash_attn.__version__)
from flash_attn.flash_attn_interface import flash_attn_func_v3
print('FA3 available')
"
```

vLLM

```bash
# Default: auto-selects FA3 on Hopper (vLLM 0.6+)
docker run --gpus all vllm/vllm-openai:latest --model <model> --dtype bfloat16

# Explicit FA3
vllm serve <model> --attention-backend flash_attn_v3

# Force FA2 for debugging or comparison
vllm serve <model> --attention-backend flash_attn
```

SGLang

```bash
# SGLang v0.4+ auto-selects FA3 on H100/H200
python -m sglang.launch_server --model-path <model> --tp 1 --dtype bfloat16

# Explicit FA3
python -m sglang.launch_server --model-path <model> --attention-backend flashattn3
```

Common pitfalls
- `flash_attn` version below 2.7 silently falls back to FA2 on Hopper without error or warning.
- FA3 requires `--dtype bfloat16` or `--dtype float16`. Running `--dtype float32` disables FA3.
- If you see `CUDA error: invalid device function`, your CUDA version is below 12.3.
- FA3 is the default in vLLM 0.6+ on Hopper. If your logs don't show the FA3 backend, your flash-attn version is likely the issue.
For a full production deployment guide covering tensor parallelism, memory optimization, and serving configuration alongside FA3, see the vLLM production deployment guide.
Cost Impact: Tokens/sec/$ on Spheron H100 and H200
The attention kernel choice affects throughput directly, which changes cost per token. Using live pricing from Spheron as of 25 May 2026 (H100 SXM5: $3.84/hr, H200 SXM5: $4.56/hr on-demand):
| Setup | GPU | Backend | Tokens/sec (70B, batch 32) | $/hr (on-demand) | $/1M tokens |
|---|---|---|---|---|---|
| FA2 BF16 | H100 SXM5 | FA2 | ~12,000 | $3.84 | $0.089 |
| FA3 BF16 | H100 SXM5 | FA3 | ~18,000 | $3.84 | $0.059 |
| FA3 + FP8 attn | H100 SXM5 | FA3 + FP8 | ~21,000 | $3.84 | $0.051 |
| FA2 BF16 | H200 SXM5 | FA2 | ~16,000 | $4.56 | $0.079 |
| FA3 BF16 | H200 SXM5 | FA3 | ~23,000 | $4.56 | $0.055 |
Formula: Cost per 1M tokens = ($/hr) / (tokens/sec × 3,600) × 1,000,000
Tokens/sec figures are directional estimates for Llama 3.3 70B at 8K context, batch 32. Actual throughput depends on model architecture, batch size, and input length distribution.
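The formula is easy to rerun with your own measured throughput. The snippet below recomputes the table's last column from its hourly price and tokens/sec columns; the function name is just a label for this example.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Cost per 1M tokens = ($/hr) / (tokens/sec * 3,600) * 1,000,000."""
    return dollars_per_hour / (tokens_per_second * 3600) * 1_000_000


# (setup, $/hr, tokens/sec) taken from the table above
rows = [
    ("FA2 BF16, H100 SXM5", 3.84, 12_000),
    ("FA3 BF16, H100 SXM5", 3.84, 18_000),
    ("FA3 + FP8 attn, H100 SXM5", 3.84, 21_000),
    ("FA2 BF16, H200 SXM5", 4.56, 16_000),
    ("FA3 BF16, H200 SXM5", 4.56, 23_000),
]
for setup, price, tps in rows:
    print(f"{setup:28s} ${cost_per_million_tokens(price, tps):.3f} per 1M tokens")
```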
Switching from FA2 to FA3 BF16 on H100 SXM5 drops cost from $0.089 to $0.059 per million tokens with the same hardware at the same hourly rate. FA3 + FP8 attention cuts it further to $0.051. On H200 SXM5, FA3 BF16 at $0.055/1M tokens is competitive with H100 FA3 + FP8 attention ($0.051), with H200's 141 GB VRAM headroom as the additional advantage for larger models or longer KV caches.
You can benchmark these configurations on H100 GPU rental or try H200 SXM5 for long-context inference on Spheron, both available on-demand with no commitment. Running both back-to-back on identical hardware takes under an hour and gives you real numbers for your specific workload.
Pricing fluctuates based on GPU availability. The prices above are based on 25 May 2026 and may have changed. Check current GPU pricing for live rates.
Quick Setup Guide
Run nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader to confirm compute capability 9.0 (Hopper). H100 SXM5 and H200 SXM5 both report 9.0, which is required for FA3. Also run nvcc --version to confirm CUDA 12.3 or later. FA3 will not activate on CUDA 12.2 or older even on the correct GPU architecture; vLLM and SGLang will silently fall back to FA2 without any error message.
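If you want to script this check, the two command outputs can be parsed and compared against the thresholds above. This is a minimal sketch: the helper names are hypothetical, and the sample strings are illustrative of the typical output shape rather than captured from a specific machine.

```python
import re


def parse_gpu_line(line: str):
    """Parse one line of `nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader`."""
    name, cap = (field.strip() for field in line.rsplit(",", 1))
    major, minor = (int(part) for part in cap.split("."))
    return name, (major, minor)


def parse_nvcc_release(text: str):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def fa3_prerequisites_met(compute_cap, cuda_version) -> bool:
    """Thresholds from this guide: compute capability 9.0 and CUDA 12.3 or later."""
    return compute_cap == (9, 0) and cuda_version >= (12, 3)


# Illustrative sample strings, assumed to match the usual output format
gpu_name, cap = parse_gpu_line("NVIDIA H100 80GB HBM3, 9.0")
cuda = parse_nvcc_release("Cuda compilation tools, release 12.4, V12.4.131")
print(gpu_name, cap, cuda, fa3_prerequisites_met(cap, cuda))
```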
Run pip install "flash-attn>=2.7.0" --no-build-isolation to install the version that includes FA3 kernel support for Hopper. The --no-build-isolation flag makes pip build the package in your current environment instead of an isolated one, so the build can see the PyTorch installation and CUDA toolchain you already have. After installation, verify FA3 availability by running python -c 'from flash_attn.flash_attn_interface import flash_attn_func_v3; print("FA3 available")'. If the import fails, your CUDA version may be below 12.3 or the GPU architecture is not SM90.
Launch vLLM with --dtype bfloat16 and FA3 activates automatically on H100/H200 in vLLM v0.6+. Check the startup logs for 'Using FlashAttention-3 backend' to confirm FA3 is active. For explicit control, pass --attention-backend flash_attn_v3 to force FA3 or --attention-backend flash_attn to force FA2 for side-by-side comparison benchmarking.
SGLang v0.4+ auto-selects FA3 on Hopper when launched with --dtype bfloat16. To explicitly specify FA3, pass --attention-backend flashattn3 to the launch_server command. Verify the backend is active by checking SGLang startup output for the attention kernel selection message. Both FA3 BF16 and FA3 FP8 attention modes are available in SGLang 0.4+.
Use vLLM's built-in benchmark to measure tokens/sec before and after switching to FA3: python -m vllm.benchmarks.benchmark_throughput --model <model-id> --input-len 8192 --output-len 512 --num-prompts 100. Run once with --attention-backend flash_attn (FA2) and once with --attention-backend flash_attn_v3 (FA3), then compare throughput. Use 8K input-len to capture the regime where FA3's gains are most pronounced, and record GPU utilization during both runs.
Enable FP8 in vLLM with --kv-cache-dtype fp8, which stores the KV cache in FP8, then run your evaluation benchmarks on a held-out test set. Compare output quality metrics (perplexity, task accuracy, or generation quality) against a BF16 attention baseline. FP8 attention is generally safe at 2K-32K context lengths but may degrade at 128K+ where numerical precision matters more. Run at least 500 evaluation samples to catch tail-case quality regressions before enabling FP8 attention in production.
Frequently Asked Questions
Is FlashAttention 3 faster than FlashAttention 2 on H100?
Yes. On H100 SXM5, FA3 is roughly 1.5-1.75x faster than FA2 for attention computation at 8K-32K sequence lengths, rising to about 2.4x at 128K. At short contexts (2K), the gain is closer to 1.2x because the workload is more compute-bound than memory-bound. FA3's warp specialization and async pipelining deliver the largest gains when memory bandwidth is the bottleneck, which happens at longer context lengths and larger batch sizes.
Does FlashAttention 3 support FP8 attention?
Yes. FA3 exposes an FP8 compute path for the Q@K^T and softmax(.)@V matrix multiplications inside the attention kernel. The output is upcast to BF16 before returning to the model. FP8 attention adds roughly 15-25% throughput on top of FA3 BF16 at 8K+ context. It requires CUDA 12.3+, flash-attn 2.7+, and per-tensor scaling headers from NVIDIA Transformer Engine. FP8 attention is separate from FP8 weight quantization, which is configured at the model precision level in vLLM or SGLang.
How do I enable FlashAttention 3 in vLLM?
vLLM v0.6+ auto-selects FA3 on Hopper GPUs (H100, H200) when you run with --dtype bfloat16 or float16. No flag is required for the default path. To explicitly force FA3, pass --attention-backend flash_attn_v3. To revert to FA2 for debugging, use --attention-backend flash_attn. Ensure flash-attn >= 2.7.0 is installed in your container, since older versions silently fall back to FA2 on Hopper without any error.
Does FlashAttention 3 work on A100 or L40S?
No. FA3's warp specialization and async data pipelining require Hopper's SM90 architecture. The flash-attn package installs on A100 (SM80) and L40S (SM89) without errors, but the FA3 compute path does not activate. vLLM and SGLang automatically fall back to FA2 on Ampere and Ada Lovelace GPUs. For A100 and L40S workloads, FA2 with GQA/MQA is the correct configuration.
When should I stay on FlashAttention 2?
Stick with FA2 if you are on CUDA 12.2 or older, on PyTorch 2.1 or older, running on non-Hopper GPUs (A100, L40S, RTX 4090), or using features FA3 does not fully support, such as non-standard head dimensions or custom masking patterns. If your stack passes all these checks, upgrading to FA3 is low-risk since vLLM and SGLang handle backend selection automatically.
How much faster is FlashAttention 3 at long context?
At 128K sequence length on H100 SXM5, FA3 achieves roughly 2.4x the attention TFLOPS of FA2. The gap grows with context length because attention becomes increasingly memory-bandwidth-bound at longer sequences, and FA3's async pipelining hides memory latency more effectively than FA2's synchronous approach. At 2K context, the gain is only about 1.2x since the workload is more compute-bound.
How does FlashAttention 3 on H200 compare with H100?
H200 SXM5 shows higher absolute attention throughput with FA3 than H100 SXM5, due to its 4.8 TB/s HBM3e bandwidth versus H100's 3.35 TB/s. The relative speedup of FA3 over FA2 is slightly smaller on H200, because H200's higher baseline bandwidth compensates for some of the latency that FA3 hides. In practice, both H100 and H200 see 1.2-2.4x FA3 gains depending on sequence length, with H200 always showing higher absolute throughput.
