Engineering

FlashAttention-4 on GPU Cloud: Blackwell Inference Guide (2026)

Written by Mitrasish, Co-founder · Apr 21, 2026
Tags: FlashAttention 4 · Blackwell · GPU Cloud · LLM Inference · vLLM · SGLang · B200 · Long Context · Attention Kernel
FlashAttention-4 is the attention kernel built for NVIDIA's Blackwell SM100 architecture, and it changes the throughput math for long-context inference workloads. If you're running LLMs on Spheron B200 instances or planning to move workloads from Hopper to Blackwell, FA4 is the primary reason attention-heavy tasks get substantially faster on the new hardware.

This post covers how FA4's SM100 tile architecture works, what the benchmarks show versus FA3 and FA2, how to set it up in vLLM and SGLang, and how to decide whether a migration from H100 makes sense for your workload. For background on what makes Blackwell different from Hopper at the hardware level, see the NVIDIA B200 complete guide.

What Changed in FlashAttention-4: The SM100 Tile Architecture

FlashAttention (FA) has always been about keeping attention computation in fast SRAM rather than repeatedly reading and writing to HBM. FA2 improved the original tiling algorithm to better utilize GPU parallelism. FA3 on Hopper (H100, H200) added warp specialization: separate warps handle softmax and matrix multiplications concurrently, and asynchronous data pipelines overlap compute with HBM prefetch. FA3 achieved roughly 1.75x the throughput of FA2 on H100.

FA4 for Blackwell replaces FA3's warp-specialized pipeline with SM100's Tensor Memory Accelerator (TMA) and a new tile-based execution model. The key differences:

| Attribute | FA2 | FA3 (Hopper) | FA4 (Blackwell SM100) |
|---|---|---|---|
| Kernel model | Thread block tiling | Warp specialization + async pipelining | TMA tile prefetch + SM100 tile execution |
| Hardware target | Ampere/Ada (SM80/SM89) | Hopper (SM90) | Blackwell data-center (SM100/SM103) |
| Memory management | Explicit SRAM tile management | Async warp-level prefetch | TMA hardware-managed tile DMA |
| Key optimization | IO-optimal tiling, online softmax | Overlapped softmax/GEMM warps | TMA prefetch eliminates softmax pipeline stalls |
| Relative throughput | Baseline | ~1.5-1.75x FA2 (H100) | ~1.5-2x FA3 (B200, long sequences) |
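Compounding the relative-throughput column gives a rough sense of the full FA2-to-FA4 gap. A quick sanity check in Python, using the table's directional multiples (not measured numbers):

```python
# Directional per-generation throughput multiples from the table above.
fa3_vs_fa2 = (1.5, 1.75)   # FA3 relative to FA2 (H100)
fa4_vs_fa3 = (1.5, 2.0)    # FA4 relative to FA3 (B200, long sequences)

# Compounding the low and high ends of each range.
fa4_vs_fa2 = (fa3_vs_fa2[0] * fa4_vs_fa3[0],
              fa3_vs_fa2[1] * fa4_vs_fa3[1])

print(f"FA4 vs FA2 (directional): ~{fa4_vs_fa2[0]:.2f}x to ~{fa4_vs_fa2[1]:.2f}x")
# Roughly 2.25x at the low end and 3.5x at the high end
```

Treat this as an order-of-magnitude check, not a benchmark; the measured gap depends heavily on sequence length, as the benchmark section below shows.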

The TMA on SM100 lets the GPU autonomously manage tile transfers between HBM and SRAM without warp-level code. In FA3, warp-specialized code coordinates the overlap between attention softmax and matrix multiply phases. In FA4, TMA hardware takes over the prefetch scheduling, which removes a class of pipeline stall that FA3 couldn't eliminate in software.

The practical result is that FA4's tile execution on SM100 sustains near-peak HBM bandwidth utilization at long sequence lengths, where FA3 leaves more bandwidth on the table due to synchronization overhead.

FA4 is built using the same CuTeDSL tile abstraction that powers custom kernel development in CUDA 13. For context on writing custom tile kernels with that API, see the CUDA 13 tile programming guide.

FlashAttention-4 vs FlashAttention-3 vs FlashAttention-2: Performance Benchmarks

The throughput advantage of FA4 over FA3 grows with sequence length. At short sequences (2K-8K tokens), the attention computation is compute-bound, not memory-bound, so the HBM optimization delivers modest gains. At long sequences (32K-512K), attention becomes memory-bandwidth-bound, and FA4's TMA-managed tiling cuts HBM reads dramatically.
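One way to see the memory-bound regime: during decode, every new token re-reads the entire KV cache from HBM, and that traffic grows linearly with context length. A back-of-envelope sketch, assuming a hypothetical Llama-70B-style GQA layout (80 layers, 8 KV heads, head dim 128, BF16) and B200's nominal ~8 TB/s HBM bandwidth; all shape numbers here are illustrative assumptions:

```python
# Illustrative KV-cache traffic per decode step for an assumed
# Llama-70B-style GQA layout. Not measured from any real deployment.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2  # BF16

def kv_bytes_per_decode_step(seq_len: int) -> int:
    # 2x for K and V; the whole cache is streamed once per new token.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * seq_len

HBM_BW = 8e12  # ~8 TB/s nominal B200 bandwidth (directional)
for n in (8 * 1024, 128 * 1024):
    b = kv_bytes_per_decode_step(n)
    print(f"{n // 1024:>3}K ctx: {b / 2**30:5.1f} GiB/step, "
          f"~{b / HBM_BW * 1e3:.2f} ms floor at peak bandwidth")
```

At 128K context this sketch streams ~40 GiB per generated token, so even at full theoretical bandwidth the per-token latency floor is several milliseconds: bandwidth, not FLOPS, sets the ceiling.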

Directional benchmarks based on published FA4 research from Tri Dao and collaborators (Together AI) on B200 and H100 hardware:

| Sequence length | FA2 (A100) | FA3 (H100 SXM5) | FA4 (B200 SXM6) | FA4 / FA3 ratio (directional) |
|---|---|---|---|---|
| 2K | ~120 TFLOPS | ~350 TFLOPS | ~450 TFLOPS | ~1.3x |
| 8K | ~100 TFLOPS | ~310 TFLOPS | ~480 TFLOPS | ~1.5x |
| 32K | ~65 TFLOPS | ~270 TFLOPS | ~510 TFLOPS | ~1.9x |
| 128K | ~30 TFLOPS | ~200 TFLOPS | ~430 TFLOPS | ~2.1x |
| 512K | ~10 TFLOPS | ~150 TFLOPS | ~390 TFLOPS | ~2.6x |
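The ratio column can be recomputed directly from the TFLOPS figures, which doubles as a template for plugging in your own measured numbers:

```python
# Recompute the FA4/FA3 ratio column from the directional TFLOPS
# figures in the table above. Substitute your own benchmark numbers
# here when evaluating a migration.
table = {           # seq_len: (FA3 on H100 TFLOPS, FA4 on B200 TFLOPS)
    "2K":   (350, 450),
    "8K":   (310, 480),
    "32K":  (270, 510),
    "128K": (200, 430),
    "512K": (150, 390),
}

for seq, (fa3, fa4) in table.items():
    print(f"{seq:>5}: FA4/FA3 = {fa4 / fa3:.1f}x")
```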

These are directional figures based on published benchmarks. Results vary by model architecture, head dimension, batch size, and precision format. Verify against your workload before using these numbers in architecture decisions.

The FA4 vs FA3 column tells the real story: the longer the context, the bigger the gap. For workloads running at 2K-8K context lengths, migrating to Blackwell for FA4 gives a meaningful but not transformative attention speedup. For 32K+ workloads, the gap is large enough to change the cost model entirely.

For end-to-end inference framework comparisons that incorporate these attention kernel differences, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Hardware Requirements: Which GPUs Support FA4 and Fallback Options

Framework auto-detection (vLLM v0.17+, SGLang v0.4+) picks the correct attention backend based on GPU compute capability, so no manual configuration is needed. FA4 activates on SM100 (B200) and SM103 (B300); FA3 on SM90 (H100/H200); FA2 on SM80/SM89 (A100, L40S, RTX 4090) and on SM120 (RTX 5090, RTX PRO 6000), which lacks the TMEM subsystem FA4 requires.

| GPU | FA version | Compute capability | VRAM | Notes |
|---|---|---|---|---|
| B200 SXM6 | FA4 | 10.0 (SM100) | 192 GB HBM3e | Primary FA4 target |
| B300 SXM6 (Blackwell Ultra) | FA4 | 10.3 (SM103) | 288 GB HBM3e | Highest throughput |
| RTX 5090 | FA2 | 12.0 (SM120) | 32 GB GDDR7 | Consumer Blackwell, no TMEM |
| RTX PRO 6000 | FA2 | 12.0 (SM120) | 96 GB GDDR7 | Workstation Blackwell, no TMEM |
| H100 SXM5/PCIe | FA3 | 9.0 | 80 GB HBM3/HBM2e | Best Hopper option |
| H200 SXM5 | FA3 | 9.0 | 141 GB HBM3e | High-VRAM Hopper |
| A100 SXM4/PCIe | FA2 | 8.0 | 40/80 GB HBM2e | FA2 with MQA/GQA |
| L40S | FA2 | 8.9 | 48 GB GDDR6 | Ada Lovelace |
| RTX 4090 | FA2 | 8.9 | 24 GB GDDR6X | Consumer Ada |

For teams staying on Hopper hardware and wanting to get the most out of FA3, see H100 GPU rental on Spheron and the vLLM production deployment guide for FA3 configuration.

FA4 and FP4 quantization are complementary on Blackwell. FA4 handles the attention computation (using BF16 or FP8 precision), while FP4 quantizes the model weights. Running both together is the highest-throughput configuration available on B200. See the FP4 quantization guide for the full decision tree on when FP4 quality tradeoffs are worth it.

Setting Up FlashAttention-4 with vLLM and SGLang on Spheron Blackwell Instances

vLLM Setup

vLLM v0.17.0+ enables FA4 automatically on Blackwell hardware. No special flags are needed:

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92

Verify FA4 is active by checking the startup logs for "Using FlashAttention-4 backend". If you need to force a specific backend for testing or comparison:

bash
# Force FA4 explicitly (Blackwell only)
--attention-backend flash_attn

# Force FA3 for comparison on Blackwell
--attention-backend flash_attn_3

SGLang Setup

SGLang v0.4+ also auto-selects FA4 on SM100:

bash
docker run --gpus all --ipc=host -p 30000:30000 \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 1 \
  --context-length 131072 \
  --dtype bfloat16 \
  --mem-fraction-static 0.88

The --mem-fraction-static 0.88 flag reserves memory for FA4's larger tile buffers on B200. If you see OOM errors during prefill at long context lengths, lower this to 0.85.
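The flag is a fraction of total HBM, so on a 192 GB B200 the difference between 0.88 and 0.85 amounts to a few GB of extra headroom for prefill spikes. Illustrative arithmetic only; SGLang's actual accounting also subtracts CUDA context and framework overhead:

```python
# Rough HBM split implied by --mem-fraction-static on a 192 GB B200.
# Illustrative only: real frameworks subtract additional overheads.
HBM_GB = 192
for frac in (0.88, 0.85):
    reserved = HBM_GB * frac
    headroom = HBM_GB * (1 - frac)
    print(f"mem-fraction-static {frac}: ~{reserved:.0f} GB static, "
          f"~{headroom:.0f} GB headroom")
```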

For production configuration options for both frameworks, see the SGLang production deployment guide and vLLM production deployment guide.

TensorRT-LLM: FP4 + FA4 for Maximum Throughput

TensorRT-LLM combines NVFP4 weight quantization with FA4 attention for the highest raw throughput on B200:

python
from tensorrt_llm import LLM, SamplingParams

# NVFP4-quantized checkpoint; FA4 attention is selected automatically on SM100
llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-NVFP4",
    dtype="fp4",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain FlashAttention tiling"], sampling_params)
print(outputs[0].outputs[0].text)

FP4 handles the weight compute; FA4 handles the attention. The combination is the correct default for any Blackwell deployment where FP4 quality is acceptable.

Long-Context Performance: FA4's Impact on 128K-1M Token Inference Latency

Standard attention has O(n²) memory complexity. At 128K tokens, the full attention matrix would be 128K × 128K = 16 billion entries. Even in FP16, that's 32 GB just for the attention matrix, repeated across every attention layer. FlashAttention avoids materializing this matrix by computing attention in tiles that stay in SRAM. The effectiveness of tiling at different sequence lengths is what separates FA2, FA3, and FA4.
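The quadratic blow-up described above is easy to reproduce:

```python
# Size of a fully materialized FP16 attention score matrix at various
# context lengths -- the memory FlashAttention avoids ever allocating.
def attn_matrix_gib(seq_len: int, bytes_per_entry: int = 2) -> float:
    """GiB needed to materialize one seq_len x seq_len score matrix (FP16)."""
    return seq_len ** 2 * bytes_per_entry / 2**30

for n in (8 * 1024, 32 * 1024, 128 * 1024, 512 * 1024):
    print(f"{n // 1024:>3}K tokens: {attn_matrix_gib(n):,.1f} GiB")
# At 128K tokens the matrix alone is 32 GiB, matching the figure above
```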

FA4's TMA-managed tiles on SM100 are more efficient at very long sequences because the hardware handles all the tile DMA scheduling that FA3 does in software. The result shows up in time-to-first-token (TTFT) at long context lengths:

| Context length | H100 SXM5 + FA3 (TTFT) | B200 SXM6 + FA4 (TTFT) | Speedup |
|---|---|---|---|
| 8K | ~85 ms | ~65 ms | 1.3x |
| 32K | ~280 ms | ~155 ms | 1.8x |
| 128K | ~1,100 ms | ~490 ms | 2.2x |
| 512K | ~4,200 ms | ~1,500 ms | 2.8x |

These estimates are for a 70B parameter model at batch size 1 using BF16 precision. Actual latencies depend on model architecture, tensor parallelism degree, and hardware SKU. The directional pattern holds: FA4's advantage grows with context length.

For context windows where even B200's 192 GB HBM3e isn't enough to hold the full KV cache, see the KV cache optimization guide for prefix caching and chunked prefill strategies that reduce KV cache memory pressure. For workloads that push beyond HBM limits entirely, see NVMe KV cache offloading for LLM inference. For a broader explanation of why memory bandwidth is the dominant bottleneck in long-context inference, see the AI memory wall inference guide.

Real-World Throughput Gains: Tokens Per Second Before and After FA4

Attention is one component of total inference throughput, but it's the dominant factor at long context lengths and large batch sizes. The throughput gains below reflect combined prefill + decode throughput at batch size 32 using BF16 on a single GPU (multi-GPU figures scale approximately linearly with tensor parallelism):

| Model | H100 SXM5 + FA3 (tok/s) | B200 SXM6 + FA4 (tok/s) | Throughput gain (directional) |
|---|---|---|---|
| Llama 3.1 8B | ~75,000 | ~140,000 | ~1.9x |
| Llama 3.1 70B | ~18,000 | ~35,000 | ~1.9x |
| Llama 3.1 405B | ~3,200 (8xH100) | ~6,400 (8xB200) | ~2.0x |

These are directional estimates at 8K context length. The gap widens at 32K+.

Cost per million tokens (on-demand pricing, 70B at batch 32):

| Config | $/hr | Throughput (tok/s) | $/1M tokens |
|---|---|---|---|
| H100 SXM5 + FA3, on-demand | $2.54 | 18,000 | $0.039 |
| H100 SXM5 + FA3, spot | $0.80 | 18,000 | $0.012 |
| B200 SXM6 + FA4, on-demand | $5.54 | 35,000 | $0.044 |
| Spheron B200 instances + FA4, spot | $1.71 | 35,000 | $0.014 |
| Bare-metal B300 SXM6 + FA4, spot (extrapolated ~1.3x B200 throughput) | $2.45 | ~45,000 | ~$0.015 |

Formula: Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000
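The formula as a small helper, reproducing the table rows above so you can plug in your own rates and measured throughput:

```python
# Cost per 1M tokens = ($/hr) / (tokens/sec * 3600) * 1e6
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    return hourly_usd / (tokens_per_sec * 3600) * 1_000_000

print(f"H100 on-demand: ${cost_per_million_tokens(2.54, 18_000):.3f}/1M")  # ~$0.039
print(f"B200 on-demand: ${cost_per_million_tokens(5.54, 35_000):.3f}/1M")  # ~$0.044
print(f"B200 spot:      ${cost_per_million_tokens(1.71, 35_000):.3f}/1M")  # ~$0.014
```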

B200 SXM6 on-demand at $0.044/1M is roughly 13% above H100 on-demand at $0.039/1M. The higher B200 hourly rate ($5.54/hr vs $2.54/hr) outpaces the throughput advantage at current pricing. The better comparison for throughput-constrained teams: getting 35K tok/s from one B200 at $5.54/hr vs two H100s at $5.08/hr total. The gap is small enough that B200 on-demand is a reasonable call when consolidating to fewer nodes matters. The spot story is different in kind: B200 spot ($0.014/1M) runs a modest premium over H100 spot ($0.012/1M) per token, but delivers the throughput of two H100s in a single GPU, reducing tensor parallelism overhead for latency-sensitive workloads.

Pricing fluctuates based on GPU availability. The prices above are based on 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Migration Guide: Upgrading Your Existing Inference Stack to FlashAttention-4

Step 1: Audit your GPU fleet compute capability

bash
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
  • 10.0 or 10.3: Blackwell data-center (B200 SM100, B300 SM103), FA4 supported
  • 12.0: Consumer/workstation Blackwell (RTX 5090, RTX PRO 6000, SM120), FA2 only, no TMEM
  • 9.0: Hopper (H100/H200), FA3 is correct
  • 8.0 or 8.9: Ampere/Ada, FA2 only
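The mapping above can be sketched as a small helper for scripting a fleet audit. This is illustrative only, not the actual vLLM/SGLang detection code:

```python
# Map a compute-capability string (as printed by nvidia-smi) to the
# FlashAttention generation a framework would select. Illustrative
# sketch of the detection logic, not framework source.
def fa_version_for(compute_cap: str) -> str:
    major = int(compute_cap.split(".")[0])
    if major == 10:   # SM100/SM103: B200, B300
        return "FA4"
    if major == 12:   # SM120 consumer/workstation Blackwell: no TMEM
        return "FA2"
    if major == 9:    # SM90: H100/H200
        return "FA3"
    return "FA2"      # SM80/SM89 Ampere/Ada and older

for cap in ("10.0", "10.3", "12.0", "9.0", "8.0", "8.9"):
    print(cap, "->", fa_version_for(cap))
```

Feed it the output of the nvidia-smi query from Step 1 to audit a mixed fleet in one pass.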

Step 2: Check your framework version

  • vLLM: FA4 requires v0.17.0+ (released March 7, 2026). Check with python -c "import vllm; print(vllm.__version__)"
  • SGLang: FA4 requires v0.4+. Check with python -c "import sglang; print(sglang.__version__)"

Step 3: Benchmark on a spot B200 instance first

Spot pricing makes evaluation cheap. Run the standard vLLM benchmark before committing to on-demand:

bash
python -m vllm.benchmarks.benchmark_throughput \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --input-len 2048 \
  --output-len 512 \
  --num-prompts 100

Compare against your current H100 numbers. If FA4 throughput doesn't cover the migration effort at current pricing, H100 spot may still be the right call for your cost model.

Step 4: Evaluate the FP4 quality tradeoff

If you're migrating to B200, FA4 + FP4 is the recommended target configuration. FP4 adds another ~1.5-2x throughput on top of FA4's gains for weight-compute-bound models. The quality tradeoff is task-dependent. For the full FP4 decision framework, see the FP4 quantization guide.

FA3 fallback for teams staying on Hopper

If you're not ready to migrate to Blackwell, FA3 on H100/H200 is fully supported and requires no configuration in vLLM or SGLang. To explicitly force FA3 (for example, if auto-detection selects an older backend):

bash
# vLLM: force FA3 on Hopper
--attention-backend flash_attn_3

Migration cost comparison

| Config | Hourly rate | 70B throughput | $/1M tokens |
|---|---|---|---|
| H100 SXM5 FA3, on-demand | $2.54/hr | 18K tok/s | $0.039 |
| H100 SXM5 FA3, spot | $0.80/hr | 18K tok/s | $0.012 |
| B200 SXM6 FA4, on-demand | $5.54/hr | 35K tok/s | $0.044 |
| B200 SXM6 FA4, spot | $1.71/hr | 35K tok/s | $0.014 |

Pricing fluctuates based on GPU availability. The prices above are based on 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The migration break-even depends on your usage pattern. On-demand B200 SXM6 FA4 costs $0.044/1M tokens versus $0.039/1M for H100 on-demand, so on a raw per-token basis H100 is slightly cheaper. The more useful comparison for throughput-constrained teams: one B200 at $5.54/hr delivers 35K tok/s, while two H100s to match that capacity cost ~$5.08/hr total. The ~9% premium for B200 may be worth paying to run a single GPU instead of coordinating two nodes. Spot B200 FA4 at $1.71/hr delivers nearly 2x the throughput of H100 spot in one GPU, which makes it the better option for teams where capacity and simplicity outweigh marginal per-token cost differences.

For broader cost reduction strategies beyond attention kernels, see the GPU cost optimization playbook.


FlashAttention-4 is live on Spheron's Blackwell instances today. Rent a B200 or B300 by the minute with no contracts, and see FA4 throughput gains on your actual workload before committing.

Rent B200 on Spheron → | Rent B300 on Spheron → | View all GPU pricing →
