FlashAttention-4 is the attention kernel built for NVIDIA's Blackwell SM100 architecture, and it changes the throughput math for long-context inference workloads. If you're running LLMs on Spheron B200 instances or planning to move workloads from Hopper to Blackwell, FA4 is the primary reason attention-heavy tasks get substantially faster on the new hardware.
This post covers how FA4's SM100 tile architecture works, what the benchmarks show versus FA3 and FA2, how to set it up in vLLM and SGLang, and how to decide whether a migration from H100 makes sense for your workload. For background on what makes Blackwell different from Hopper at the hardware level, see the NVIDIA B200 complete guide.
What Changed in FlashAttention-4: The SM100 Tile Architecture
FlashAttention (FA) has always been about keeping attention computation in fast SRAM rather than repeatedly reading and writing to HBM. FA2 improved the original tiling algorithm to better utilize GPU parallelism. FA3 on Hopper (H100, H200) added warp specialization: separate warps handle softmax and matrix multiplications concurrently, and asynchronous data pipelines overlap compute with HBM prefetch. FA3 achieved roughly 1.75x the throughput of FA2 on H100.
FA4 for Blackwell replaces FA3's warp-specialized pipeline with SM100's Tensor Memory Accelerator (TMA) and a new tile-based execution model. The key differences:
| Attribute | FA2 | FA3 (Hopper) | FA4 (Blackwell SM100) |
|---|---|---|---|
| Kernel model | Thread block tiling | Warp specialization + async pipelining | TMA tile prefetch + SM100 tile execution |
| Hardware target | Ampere/Ada (SM80/SM89) | Hopper (SM90) | Blackwell data-center (SM100/SM103) |
| Memory management | Explicit SRAM tile management | Async warp-level prefetch | TMA hardware-managed tile DMA |
| Key optimization | IO-optimal tiling, online softmax | Overlapped softmax/GEMM warps | TMA prefetch eliminates softmax pipeline stalls |
| Relative throughput | Baseline | ~1.5-1.75x FA2 (H100) | ~1.5-2x FA3 (B200, long sequences) |
The TMA on SM100 lets the GPU autonomously manage tile transfers between HBM and SRAM without warp-level code. In FA3, warp-specialized code coordinates the overlap between attention softmax and matrix multiply phases. In FA4, TMA hardware takes over the prefetch scheduling, which removes a class of pipeline stall that FA3 couldn't eliminate in software.
The practical result is that FA4's tile execution on SM100 sustains near-peak HBM bandwidth utilization at long sequence lengths, where FA3 leaves more bandwidth on the table due to synchronization overhead.
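The tiling-plus-online-softmax recurrence that every FlashAttention generation shares can be sketched in NumPy. This is an illustrative reference implementation of the algorithm only, not the GPU kernel; FA2, FA3, and FA4 differ in how tiles are scheduled and prefetched on hardware, not in this core math, and the tile size here is arbitrary.

```python
import numpy as np

def naive_attention(q, k, v):
    """Materializes the full n x n score matrix (what FlashAttention avoids)."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, tile=32):
    """Online softmax over K/V tiles: keep a running row max and denominator
    so no n x n matrix is ever materialized."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    m = np.full(n, -np.inf)   # running row max
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = (q @ kt.T) * scale                 # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)         # rescale earlier accumulators
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ vt
        m = m_new
    return out / l[:, None]
```

Both functions return identical results; the tiled version simply never holds more than one `tile`-wide slice of scores at a time, which is the property the SRAM-resident kernels exploit.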
FA4 is built using the same CuTeDSL tile abstraction that powers custom kernel development in CUDA 13. For context on writing custom tile kernels with that API, see the CUDA 13 tile programming guide.
FlashAttention-4 vs FlashAttention-3 vs FlashAttention-2: Performance Benchmarks
The throughput advantage of FA4 over FA3 grows with sequence length. At short sequences (2K-8K tokens), the attention computation is compute-bound, not memory-bound, so the HBM optimization delivers modest gains. At long sequences (32K-512K), attention becomes memory-bandwidth-bound, and FA4's TMA-managed tiling cuts HBM reads dramatically.
Directional benchmarks based on published FA4 research from Tri Dao and collaborators (Together AI) on B200 and H100 hardware:
| Sequence length | FA2 (A100) | FA3 (H100 SXM5) | FA4 (B200 SXM6) | FA4 / FA3 ratio (directional) |
|---|---|---|---|---|
| 2K | ~120 TFLOPS | ~350 TFLOPS | ~450 TFLOPS | ~1.3x |
| 8K | ~100 TFLOPS | ~310 TFLOPS | ~480 TFLOPS | ~1.5x |
| 32K | ~65 TFLOPS | ~270 TFLOPS | ~510 TFLOPS | ~1.9x |
| 128K | ~30 TFLOPS | ~200 TFLOPS | ~430 TFLOPS | ~2.1x |
| 512K | ~10 TFLOPS | ~150 TFLOPS | ~390 TFLOPS | ~2.6x |
These are directional figures based on published benchmarks. Results vary by model architecture, head dimension, batch size, and precision format. Verify against your workload before using these numbers in architecture decisions.
The FA4 vs FA3 column tells the real story: the longer the context, the bigger the gap. For workloads running at 2K-8K context lengths, migrating to Blackwell for FA4 gives a meaningful but not transformative attention speedup. For 32K+ workloads, the gap is large enough to change the cost model entirely.
For end-to-end inference framework comparisons that incorporate these attention kernel differences, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
Hardware Requirements: Which GPUs Support FA4 and Fallback Options
Framework auto-detection in vLLM v0.17+ and SGLang v0.4+ picks the attention backend from the GPU's compute capability, so no manual configuration is needed. FA4 activates on SM100 (B200) and SM103 (B300); FA3 on SM90 (H100/H200); and FA2 on SM80/SM89 (A100, L40S, RTX 4090) as well as SM120 (RTX 5090, RTX PRO 6000), since consumer and workstation Blackwell lack the TMEM subsystem FA4 requires.
| GPU | FA version | Compute capability | VRAM | Notes |
|---|---|---|---|---|
| B200 SXM6 | FA4 | 10.0 (SM100) | 192 GB HBM3e | Primary FA4 target |
| B300 SXM6 (Blackwell Ultra) | FA4 | 10.3 (SM103) | 288 GB HBM3e | Highest throughput |
| RTX 5090 | FA2 | 12.0 (SM120) | 32 GB GDDR7 | Consumer Blackwell, no TMEM |
| RTX PRO 6000 | FA2 | 12.0 (SM120) | 96 GB GDDR7 | Workstation Blackwell, no TMEM |
| H100 SXM5/PCIe | FA3 | 9.0 | 80 GB HBM3/HBM2e | Best Hopper option |
| H200 SXM5 | FA3 | 9.0 | 141 GB HBM3e | High-VRAM Hopper |
| A100 SXM4/PCIe | FA2 | 8.0 | 40/80 GB HBM2e | FA2 with MQA/GQA |
| L40S | FA2 | 8.9 | 48 GB GDDR6 | Ada Lovelace |
| RTX 4090 | FA2 | 8.9 | 24 GB GDDR6X | Consumer Ada |
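The auto-detection described above can be sketched as a simple mapping from compute capability to backend. This is a simplified illustration, not the frameworks' actual dispatch code, which keys off more than compute capability:

```python
def pick_fa_backend(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the FA tier from the table above."""
    cc = major * 10 + minor
    if cc in (100, 103):       # B200 (SM100), B300 (SM103)
        return "FA4"
    if cc == 90:               # H100 / H200 (SM90)
        return "FA3"
    if cc in (80, 89, 120):    # Ampere, Ada, and TMEM-less SM120 Blackwell
        return "FA2"
    return "unsupported"

# On a live machine (requires PyTorch with CUDA):
#   import torch
#   print(pick_fa_backend(*torch.cuda.get_device_capability()))
```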
For teams staying on Hopper hardware and wanting to get the most out of FA3, see H100 GPU rental on Spheron and the vLLM production deployment guide for FA3 configuration.
FA4 and FP4 quantization are complementary on Blackwell. FA4 handles the attention computation (using BF16 or FP8 precision), while FP4 quantizes the model weights. Running both together is the highest-throughput configuration available on B200. See the FP4 quantization guide for the full decision tree on when FP4 quality tradeoffs are worth it.
Setting Up FlashAttention-4 with vLLM and SGLang on Spheron Blackwell Instances
vLLM Setup
vLLM v0.17.0+ enables FA4 automatically on Blackwell hardware. No special flags are needed:
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92
```

Verify FA4 is active by checking the startup logs for `Using FlashAttention-4 backend`. If you need to force a specific backend for testing or comparison purposes:
```bash
# Force FA4 explicitly (Blackwell only)
--attention-backend flash_attn

# Force FA3 for comparison on Blackwell
--attention-backend flash_attn_3
```

SGLang Setup
SGLang v0.4+ also auto-selects FA4 on SM100:
```bash
docker run --gpus all --ipc=host -p 30000:30000 \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 1 \
  --context-length 131072 \
  --dtype bfloat16 \
  --mem-fraction-static 0.88
```

The `--mem-fraction-static 0.88` flag reserves memory for FA4's larger tile buffers on B200. If you see OOM errors during prefill at long context lengths, lower it to 0.85.
For production configuration options for both frameworks, see the SGLang production deployment guide and vLLM production deployment guide.
TensorRT-LLM: FP4 + FA4 for Maximum Throughput
TensorRT-LLM combines NVFP4 weight quantization with FA4 attention for the highest raw throughput on B200:
```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-NVFP4",
    dtype="fp4",
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain FlashAttention tiling"], sampling_params)
```

FP4 handles the weight compute; FA4 handles the attention. The combination is the correct default for any Blackwell deployment where FP4 quality is acceptable.
Long-Context Performance: FA4's Impact on 128K-1M Token Inference Latency
Standard attention has O(n²) memory complexity. At 128K tokens, the full attention matrix would be 131,072 × 131,072, or roughly 17 billion entries. Even in FP16, that's 32 GB just for the attention matrix, repeated across every attention layer. FlashAttention avoids materializing this matrix by computing attention in tiles that stay in SRAM. The effectiveness of tiling at different sequence lengths is what separates FA2, FA3, and FA4.
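The arithmetic is worth checking directly. A one-line helper (name is illustrative) computes the size of a single materialized attention matrix:

```python
def attn_matrix_gib(seq_len: int, bytes_per_el: int = 2) -> float:
    """GiB needed to materialize one full seq_len x seq_len attention
    matrix; bytes_per_el=2 corresponds to FP16."""
    return seq_len * seq_len * bytes_per_el / 2**30

# 128K context = 131,072 tokens, FP16:
print(attn_matrix_gib(131072))   # 32.0 (GiB) -- per matrix, per layer
```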
FA4's TMA-managed tiles on SM100 are more efficient at very long sequences because the hardware handles all the tile DMA scheduling that FA3 does in software. The result shows up in time-to-first-token (TTFT) at long context lengths:
| Context length | H100 SXM5 + FA3 (TTFT) | B200 SXM6 + FA4 (TTFT) | Speedup |
|---|---|---|---|
| 8K | ~85 ms | ~65 ms | 1.3x |
| 32K | ~280 ms | ~155 ms | 1.8x |
| 128K | ~1,100 ms | ~490 ms | 2.2x |
| 512K | ~4,200 ms | ~1,500 ms | 2.8x |
These estimates are for a 70B parameter model at batch size 1 using BF16 precision. Actual latencies depend on model architecture, tensor parallelism degree, and hardware SKU. The directional pattern holds: FA4's advantage grows with context length.
For context windows where even B200's 192 GB HBM3e isn't enough to hold the full KV cache, see the KV cache optimization guide for prefix caching and chunked prefill strategies that reduce KV cache memory pressure. For workloads that push beyond HBM limits entirely, see NVMe KV cache offloading for LLM inference. For a broader explanation of why memory bandwidth is the dominant bottleneck in long-context inference, see the AI memory wall inference guide.
Real-World Throughput Gains: Tokens Per Second Before and After FA4
Attention is one component of total inference throughput, but it's the dominant factor at long context lengths and large batch sizes. The throughput gains below reflect combined prefill + decode throughput at batch size 32 using BF16 on a single GPU (multi-GPU figures scale approximately linearly with tensor parallelism):
| Model | H100 SXM5 + FA3 (tok/s) | B200 SXM6 + FA4 (tok/s) | Throughput gain (directional) |
|---|---|---|---|
| Llama 3.1 8B | ~75,000 | ~140,000 | ~1.9x |
| Llama 3.1 70B | ~18,000 | ~35,000 | ~1.9x |
| Llama 3.1 405B | ~3,200 (8xH100) | ~6,400 (8xB200) | ~2.0x |
These are directional estimates at 8K context length. The gap widens at 32K+.
Cost per million tokens (on-demand pricing, 70B at batch 32):
| Config | $/hr | Throughput (tok/s) | $/1M tokens |
|---|---|---|---|
| H100 SXM5 + FA3, on-demand | $2.54 | 18,000 | $0.039 |
| H100 SXM5 + FA3, spot | $0.80 | 18,000 | $0.012 |
| B200 SXM6 + FA4, on-demand | $5.54 | 35,000 | $0.044 |
| Spheron B200 instances + FA4, spot | $1.71 | 35,000 | $0.014 |
| Bare-metal B300 SXM6 + FA4, spot (extrapolated ~1.3x B200 throughput) | $2.45 | ~45,000 | ~$0.015 |
Formula: Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000
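The formula above is easy to encode and reproduces the table rows (function name is illustrative):

```python
def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """Cost per 1M tokens = ($/hr) / (tokens/sec * 3600) * 1,000,000."""
    return usd_per_hour / (tokens_per_sec * 3600) * 1_000_000

# H100 SXM5 on-demand: $2.54/hr at 18,000 tok/s
print(round(cost_per_million_tokens(2.54, 18_000), 3))   # 0.039
# B200 SXM6 on-demand: $5.54/hr at 35,000 tok/s
print(round(cost_per_million_tokens(5.54, 35_000), 3))   # 0.044
```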
B200 SXM6 on-demand at $0.044/1M is roughly 13% above H100 on-demand at $0.039/1M. The higher B200 hourly rate ($5.54/hr vs $2.54/hr) outpaces the throughput advantage at current pricing. The better comparison for throughput-constrained teams: getting 35K tok/s from one B200 at $5.54/hr vs two H100s at $5.08/hr total. The gap is small enough that B200 on-demand is a reasonable call when consolidating to fewer nodes matters. The spot story is different in kind: B200 spot ($0.014/1M) runs a modest premium over H100 spot ($0.012/1M) per token, but delivers the throughput of two H100s in a single GPU, reducing tensor parallelism overhead for latency-sensitive workloads.
Pricing fluctuates based on GPU availability. The prices above are based on 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Migration Guide: Upgrading Your Existing Inference Stack to FlashAttention-4
Step 1: Audit your GPU fleet compute capability
```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```

- `10.0` or `10.3`: Blackwell data-center (B200 SM100, B300 SM103), FA4 supported
- `12.0`: consumer/workstation Blackwell (RTX 5090, RTX PRO 6000, SM120), FA2 only, no TMEM
- `9.0`: Hopper (H100/H200), FA3 is correct
- `8.0` or `8.9`: Ampere/Ada, FA2 only
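For a multi-node fleet, the `nvidia-smi` output can be bucketed programmatically. A minimal sketch, assuming the exact CSV query shown above (adjust the parsing if your query differs):

```python
def audit_fleet(nvidia_smi_csv: str) -> dict:
    """Bucket each GPU from `nvidia-smi --query-gpu=name,compute_cap
    --format=csv,noheader` output into its FA tier."""
    tiers = {"10.0": "FA4", "10.3": "FA4", "9.0": "FA3",
             "8.0": "FA2", "8.9": "FA2", "12.0": "FA2"}
    fleet = {}
    for line in nvidia_smi_csv.strip().splitlines():
        name, cap = (field.strip() for field in line.split(","))
        fleet[name] = tiers.get(cap, "unsupported")
    return fleet

sample = "NVIDIA B200, 10.0\nNVIDIA H100 80GB HBM3, 9.0"
print(audit_fleet(sample))
# {'NVIDIA B200': 'FA4', 'NVIDIA H100 80GB HBM3': 'FA3'}
```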
Step 2: Check your framework version
- vLLM: FA4 requires v0.17.0+ (released March 7, 2026). Check with `python -c "import vllm; print(vllm.__version__)"`
- SGLang: FA4 requires v0.4+. Check with `python -c "import sglang; print(sglang.__version__)"`
Step 3: Benchmark on a spot B200 instance first
Spot pricing makes evaluation cheap. Run the standard vLLM benchmark before committing to on-demand:
```bash
python -m vllm.benchmarks.benchmark_throughput \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --input-len 2048 \
  --output-len 512 \
  --num-prompts 100
```

Compare against your current H100 numbers. If the FA4 throughput gain doesn't cover the migration effort at current pricing, H100 spot may still be the right call for your cost model.
Step 4: Evaluate the FP4 quality tradeoff
If you're migrating to B200, FA4 + FP4 is the recommended target configuration. FP4 adds another ~1.5-2x throughput on top of FA4's gains for weight-compute-bound models. The quality tradeoff is task-dependent. For the full FP4 decision framework, see the FP4 quantization guide.
FA3 fallback for teams staying on Hopper
If you're not ready to migrate to Blackwell, FA3 on H100/H200 is fully supported and requires no configuration in vLLM or SGLang. To explicitly force FA3 (for example, if auto-detection selects an older backend):
```bash
# vLLM: force FA3 on Hopper
--attention-backend flash_attn_3
```

Migration cost comparison
| Config | Hourly rate | 70B throughput | $/1M tokens |
|---|---|---|---|
| H100 SXM5 FA3, on-demand | $2.54/hr | 18K tok/s | $0.039 |
| H100 SXM5 FA3, spot | $0.80/hr | 18K tok/s | $0.012 |
| B200 SXM6 FA4, on-demand | $5.54/hr | 35K tok/s | $0.044 |
| B200 SXM6 FA4, spot | $1.71/hr | 35K tok/s | $0.014 |
Pricing fluctuates based on GPU availability. The prices above are based on 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
The migration break-even depends on your usage pattern. On-demand B200 SXM6 FA4 costs $0.044/1M tokens versus $0.039/1M for H100 on-demand, so on a raw per-token basis H100 is slightly cheaper. The more useful comparison for throughput-constrained teams: one B200 at $5.54/hr delivers 35K tok/s, while two H100s to match that capacity cost ~$5.08/hr total. The ~9% premium for B200 may be worth paying to run a single GPU instead of coordinating two nodes. Spot B200 FA4 at $1.71/hr delivers nearly 2x the throughput of H100 spot in one GPU, which makes it the better option for teams where capacity and simplicity outweigh marginal per-token cost differences.
For broader cost reduction strategies beyond attention kernels, see the GPU cost optimization playbook.
FlashAttention-4 is live on Spheron's Blackwell instances today. Rent a B200 or B300 by the minute with no contracts, and see FA4 throughput gains on your actual workload before committing.
Rent B200 on Spheron → | Rent B300 on Spheron → | View all GPU pricing →
