Engineering

FlashAttention-4 on GPU Cloud: Blackwell Inference Guide (2026)

Written by Mitrasish, Co-founder · Apr 21, 2026
Tags: FlashAttention 4 · Blackwell · GPU Cloud · LLM Inference · vLLM · SGLang · B200 · Long Context · Attention Kernel
FlashAttention-4 is the attention kernel built for NVIDIA's Blackwell SM100 architecture, and it changes the throughput math for long-context inference workloads. If you're running LLMs on Spheron B200 instances or planning to move workloads from Hopper to Blackwell, FA4 is the primary reason attention-heavy tasks get substantially faster on the new hardware.

This post covers how FA4's SM100 tile architecture works, what the benchmarks show versus FA3 and FA2, how to set it up in vLLM and SGLang, and how to decide whether a migration from H100 makes sense for your workload. For background on what makes Blackwell different from Hopper at the hardware level, see the NVIDIA B200 complete guide.

What Changed in FlashAttention-4: The SM100 Tile Architecture

FlashAttention (FA) has always been about keeping attention computation in fast SRAM rather than repeatedly reading and writing to HBM. FA2 improved the original tiling algorithm to better utilize GPU parallelism. FA3 on Hopper (H100, H200) added warp specialization: separate warps handle softmax and matrix multiplications concurrently, and asynchronous data pipelines overlap compute with HBM prefetch. FA3 achieved roughly 1.75x the throughput of FA2 on H100.

FA4 for Blackwell replaces FA3's warp-specialized pipeline with SM100's Tensor Memory Accelerator (TMA) and a new tile-based execution model. The key differences:

| Attribute | FA2 | FA3 (Hopper) | FA4 (Blackwell SM100) |
|---|---|---|---|
| Kernel model | Thread block tiling | Warp specialization + async pipelining | TMA tile prefetch + SM100 tile execution |
| Hardware target | Ampere/Ada (SM80/SM89) | Hopper (SM90) | Blackwell data-center (SM100/SM103) |
| Memory management | Explicit SRAM tile management | Async warp-level prefetch | TMA hardware-managed tile DMA |
| Key optimization | IO-optimal tiling, online softmax | Overlapped softmax/GEMM warps | TMA prefetch eliminates softmax pipeline stalls |
| Relative throughput | Baseline | ~1.5-1.75x FA2 (H100) | ~1.5-2x FA3 (B200, long sequences) |
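Compounding the relative-throughput column gives a rough sense of the full FA2-to-FA4 gap. A quick sanity check in Python, using the table's directional multiples (not measured numbers):

```python
# Directional per-generation throughput multiples from the table above.
fa3_vs_fa2 = (1.5, 1.75)   # FA3 relative to FA2 (H100)
fa4_vs_fa3 = (1.5, 2.0)    # FA4 relative to FA3 (B200, long sequences)

# Compounding the low and high ends of each range.
fa4_vs_fa2 = (fa3_vs_fa2[0] * fa4_vs_fa3[0],
              fa3_vs_fa2[1] * fa4_vs_fa3[1])

print(f"FA4 vs FA2 (directional): ~{fa4_vs_fa2[0]:.2f}x to ~{fa4_vs_fa2[1]:.2f}x")
# Roughly 2.25x at the low end and 3.5x at the high end
```

Treat this as an order-of-magnitude check, not a benchmark; the measured gap depends heavily on sequence length, as the benchmark section below shows.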

The TMA on SM100 lets the GPU autonomously manage tile transfers between HBM and SRAM without warp-level code. In FA3, warp-specialized code coordinates the overlap between attention softmax and matrix multiply phases. In FA4, TMA hardware takes over the prefetch scheduling, which removes a class of pipeline stall that FA3 couldn't eliminate in software.

The practical result is that FA4's tile execution on SM100 sustains near-peak HBM bandwidth utilization at long sequence lengths, where FA3 leaves more bandwidth on the table due to synchronization overhead.

FA4 is built using the same CuTeDSL tile abstraction that powers custom kernel development in CUDA 13. For context on writing custom tile kernels with that API, see the CUDA 13 tile programming guide.

FlashAttention-4 vs FlashAttention-3 vs FlashAttention-2: Performance Benchmarks

The throughput advantage of FA4 over FA3 grows with sequence length. At short sequences (2K-8K tokens), the attention computation is compute-bound, not memory-bound, so the HBM optimization delivers modest gains. At long sequences (32K-512K), attention becomes memory-bandwidth-bound, and FA4's TMA-managed tiling cuts HBM reads dramatically.
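One way to see the memory-bound regime: during decode, every new token re-reads the entire KV cache from HBM, and that traffic grows linearly with context length. A back-of-envelope sketch, assuming a hypothetical Llama-70B-style GQA layout (80 layers, 8 KV heads, head dim 128, BF16) and B200's nominal ~8 TB/s HBM bandwidth; all shape numbers here are illustrative assumptions:

```python
# Illustrative KV-cache traffic per decode step for an assumed
# Llama-70B-style GQA layout. Not measured from any real deployment.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2  # BF16

def kv_bytes_per_decode_step(seq_len: int) -> int:
    # 2x for K and V; the whole cache is streamed once per new token.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * seq_len

HBM_BW = 8e12  # ~8 TB/s nominal B200 bandwidth (directional)
for n in (8 * 1024, 128 * 1024):
    b = kv_bytes_per_decode_step(n)
    print(f"{n // 1024:>3}K ctx: {b / 2**30:5.1f} GiB/step, "
          f"~{b / HBM_BW * 1e3:.2f} ms floor at peak bandwidth")
```

At 128K context this sketch streams ~40 GiB per generated token, so even at full theoretical bandwidth the per-token latency floor is several milliseconds: bandwidth, not FLOPS, sets the ceiling.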

Directional benchmarks based on published FA4 research from Tri Dao and collaborators (Together AI) on B200 and H100 hardware:

| Sequence length | FA2 (A100) | FA3 (H100 SXM5) | FA4 (B200 SXM6) | FA4 / FA3 ratio (directional) |
|---|---|---|---|---|
| 2K | ~120 TFLOPS | ~350 TFLOPS | ~450 TFLOPS | ~1.3x |
| 8K | ~100 TFLOPS | ~310 TFLOPS | ~480 TFLOPS | ~1.5x |
| 32K | ~65 TFLOPS | ~270 TFLOPS | ~510 TFLOPS | ~1.9x |
| 128K | ~30 TFLOPS | ~200 TFLOPS | ~430 TFLOPS | ~2.1x |
| 512K | ~10 TFLOPS | ~150 TFLOPS | ~390 TFLOPS | ~2.6x |
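The ratio column can be recomputed directly from the TFLOPS figures, which doubles as a template for plugging in your own measured numbers:

```python
# Recompute the FA4/FA3 ratio column from the directional TFLOPS
# figures in the table above. Substitute your own benchmark numbers
# here when evaluating a migration.
table = {           # seq_len: (FA3 on H100 TFLOPS, FA4 on B200 TFLOPS)
    "2K":   (350, 450),
    "8K":   (310, 480),
    "32K":  (270, 510),
    "128K": (200, 430),
    "512K": (150, 390),
}

for seq, (fa3, fa4) in table.items():
    print(f"{seq:>5}: FA4/FA3 = {fa4 / fa3:.1f}x")
```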

These are directional figures based on published benchmarks. Results vary by model architecture, head dimension, batch size, and precision format. Verify against your workload before using these numbers in architecture decisions.

The FA4 vs FA3 column tells the real story: the longer the context, the bigger the gap. For workloads running at 2K-8K context lengths, migrating to Blackwell for FA4 gives a meaningful but not transformative attention speedup. For 32K+ workloads, the gap is large enough to change the cost model entirely.

For end-to-end inference framework comparisons that incorporate these attention kernel differences, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Hardware Requirements: Which GPUs Support FA4 and Fallback Options

Framework auto-detection (vLLM v0.17+, SGLang v0.4+) picks the correct attention backend based on GPU compute capability, so no manual configuration is needed. FA4 activates on SM100 (B200) and SM103 (B300); FA3 on SM90 (H100/H200); FA2 on SM80/SM89 (A100, L40S, RTX 4090) and on SM120 (RTX 5090, RTX PRO 6000), which lacks the TMEM subsystem FA4 requires.

| GPU | FA version | Compute capability | VRAM | Notes |
|---|---|---|---|---|
| B200 SXM6 | FA4 | 10.0 (SM100) | 192 GB HBM3e | Primary FA4 target |
| B300 SXM6 (Blackwell Ultra) | FA4 | 10.3 (SM103) | 288 GB HBM3e | Highest throughput |
| RTX 5090 | FA2 | 12.0 (SM120) | 32 GB GDDR7 | Consumer Blackwell, no TMEM |
| RTX PRO 6000 | FA2 | 12.0 (SM120) | 96 GB GDDR7 | Workstation Blackwell, no TMEM |
| H100 SXM5/PCIe | FA3 | 9.0 | 80 GB HBM3/HBM2e | Best Hopper option |
| H200 SXM5 | FA3 | 9.0 | 141 GB HBM3e | High-VRAM Hopper |
| A100 SXM4/PCIe | FA2 | 8.0 | 40/80 GB HBM2e | FA2 with MQA/GQA |
| L40S | FA2 | 8.9 | 48 GB GDDR6 | Ada Lovelace |
| RTX 4090 | FA2 | 8.9 | 24 GB GDDR6X | Consumer Ada |

For teams staying on Hopper hardware and wanting to get the most out of FA3, see H100 GPU rental on Spheron and the vLLM production deployment guide for FA3 configuration.

FA4 and FP4 quantization are complementary on Blackwell. FA4 handles the attention computation (using BF16 or FP8 precision), while FP4 quantizes the model weights. Running both together is the highest-throughput configuration available on B200. See the FP4 quantization guide for the full decision tree on when FP4 quality tradeoffs are worth it.

Setting Up FlashAttention-4 with vLLM and SGLang on Spheron Blackwell Instances

vLLM Setup

vLLM v0.17.0+ enables FA4 automatically on Blackwell hardware. No special flags are needed:

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92

Verify FA4 is active by checking the startup logs for "Using FlashAttention-4 backend". If you need to force a specific backend for testing or comparison:

bash
# Force FA4 explicitly (Blackwell only)
--attention-backend flash_attn

# Force FA3 for comparison on Blackwell
--attention-backend flash_attn_3

SGLang Setup

SGLang v0.4+ also auto-selects FA4 on SM100:

bash
docker run --gpus all --ipc=host -p 30000:30000 \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 1 \
  --context-length 131072 \
  --dtype bfloat16 \
  --mem-fraction-static 0.88

The --mem-fraction-static 0.88 flag reserves memory for FA4's larger tile buffers on B200. If you see OOM errors during prefill at long context lengths, lower this to 0.85.
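The flag is a fraction of total HBM, so on a 192 GB B200 the difference between 0.88 and 0.85 amounts to a few GB of extra headroom for prefill spikes. Illustrative arithmetic only; SGLang's actual accounting also subtracts CUDA context and framework overhead:

```python
# Rough HBM split implied by --mem-fraction-static on a 192 GB B200.
# Illustrative only: real frameworks subtract additional overheads.
HBM_GB = 192
for frac in (0.88, 0.85):
    reserved = HBM_GB * frac
    headroom = HBM_GB * (1 - frac)
    print(f"mem-fraction-static {frac}: ~{reserved:.0f} GB static, "
          f"~{headroom:.0f} GB headroom")
```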

For production configuration options for both frameworks, see the SGLang production deployment guide and vLLM production deployment guide.

TensorRT-LLM: FP4 + FA4 for Maximum Throughput

TensorRT-LLM combines NVFP4 weight quantization with FA4 attention for the highest raw throughput on B200:

python
from tensorrt_llm import LLM, SamplingParams

# NVFP4-quantized checkpoint; FA4 attention is selected automatically on SM100
llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-NVFP4",
    dtype="fp4",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain FlashAttention tiling"], sampling_params)
print(outputs[0].outputs[0].text)

FP4 handles the weight compute; FA4 handles the attention. The combination is the correct default for any Blackwell deployment where FP4 quality is acceptable.

Long-Context Performance: FA4's Impact on 128K-1M Token Inference Latency

Standard attention has O(n²) memory complexity. At 128K tokens, the full attention matrix would be 128K × 128K = 16 billion entries. Even in FP16, that's 32 GB just for the attention matrix, repeated across every attention layer. FlashAttention avoids materializing this matrix by computing attention in tiles that stay in SRAM. The effectiveness of tiling at different sequence lengths is what separates FA2, FA3, and FA4.
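The quadratic blow-up described above is easy to reproduce:

```python
# Size of a fully materialized FP16 attention score matrix at various
# context lengths -- the memory FlashAttention avoids ever allocating.
def attn_matrix_gib(seq_len: int, bytes_per_entry: int = 2) -> float:
    """GiB needed to materialize one seq_len x seq_len score matrix (FP16)."""
    return seq_len ** 2 * bytes_per_entry / 2**30

for n in (8 * 1024, 32 * 1024, 128 * 1024, 512 * 1024):
    print(f"{n // 1024:>3}K tokens: {attn_matrix_gib(n):,.1f} GiB")
# At 128K tokens the matrix alone is 32 GiB, matching the figure above
```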

FA4's TMA-managed tiles on SM100 are more efficient at very long sequences because the hardware handles all the tile DMA scheduling that FA3 does in software. The result shows up in time-to-first-token (TTFT) at long context lengths:

| Context length | H100 SXM5 + FA3 (TTFT) | B200 SXM6 + FA4 (TTFT) | Speedup |
|---|---|---|---|
| 8K | ~85 ms | ~65 ms | 1.3x |
| 32K | ~280 ms | ~155 ms | 1.8x |
| 128K | ~1,100 ms | ~490 ms | 2.2x |
| 512K | ~4,200 ms | ~1,500 ms | 2.8x |

These estimates are for a 70B parameter model at batch size 1 using BF16 precision. Actual latencies depend on model architecture, tensor parallelism degree, and hardware SKU. The directional pattern holds: FA4's advantage grows with context length.

For context windows where even B200's 192 GB HBM3e isn't enough to hold the full KV cache, see the KV cache optimization guide for prefix caching and chunked prefill strategies that reduce KV cache memory pressure. For workloads that push beyond HBM limits entirely, see NVMe KV cache offloading for LLM inference. For a broader explanation of why memory bandwidth is the dominant bottleneck in long-context inference, see the AI memory wall inference guide.

Real-World Throughput Gains: Tokens Per Second Before and After FA4

Attention is one component of total inference throughput, but it's the dominant factor at long context lengths and large batch sizes. The throughput gains below reflect combined prefill + decode throughput at batch size 32 using BF16 on a single GPU (multi-GPU figures scale approximately linearly with tensor parallelism):

| Model | H100 SXM5 + FA3 (tok/s) | B200 SXM6 + FA4 (tok/s) | Throughput gain (directional) |
|---|---|---|---|
| Llama 3.1 8B | ~75,000 | ~140,000 | ~1.9x |
| Llama 3.1 70B | ~18,000 | ~35,000 | ~1.9x |
| Llama 3.1 405B | ~3,200 (8xH100) | ~6,400 (8xB200) | ~2.0x |

These are directional estimates at 8K context length. The gap widens at 32K+.

Cost per million tokens (on-demand pricing, 70B at batch 32):

| Config | $/hr | Throughput (tok/s) | $/1M tokens |
|---|---|---|---|
| H100 SXM5 + FA3, on-demand | $2.54 | 18,000 | $0.039 |
| H100 SXM5 + FA3, spot | $0.80 | 18,000 | $0.012 |
| B200 SXM6 + FA4, on-demand | $5.54 | 35,000 | $0.044 |
| Spheron B200 instances + FA4, spot | $1.71 | 35,000 | $0.014 |
| Bare-metal B300 SXM6 + FA4, spot (extrapolated ~1.3x B200 throughput) | $2.45 | ~45,000 | ~$0.015 |

Formula: Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000
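The formula as a small helper, reproducing the table rows above so you can plug in your own rates and measured throughput:

```python
# Cost per 1M tokens = ($/hr) / (tokens/sec * 3600) * 1e6
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    return hourly_usd / (tokens_per_sec * 3600) * 1_000_000

print(f"H100 on-demand: ${cost_per_million_tokens(2.54, 18_000):.3f}/1M")  # ~$0.039
print(f"B200 on-demand: ${cost_per_million_tokens(5.54, 35_000):.3f}/1M")  # ~$0.044
print(f"B200 spot:      ${cost_per_million_tokens(1.71, 35_000):.3f}/1M")  # ~$0.014
```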

B200 SXM6 on-demand at $0.044/1M is roughly 13% above H100 on-demand at $0.039/1M. The higher B200 hourly rate ($5.54/hr vs $2.54/hr) outpaces the throughput advantage at current pricing. The better comparison for throughput-constrained teams: getting 35K tok/s from one B200 at $5.54/hr vs two H100s at $5.08/hr total. The gap is small enough that B200 on-demand is a reasonable call when consolidating to fewer nodes matters. The spot story is different in kind: B200 spot ($0.014/1M) runs a modest premium over H100 spot ($0.012/1M) per token, but delivers the throughput of two H100s in a single GPU, reducing tensor parallelism overhead for latency-sensitive workloads.

Pricing fluctuates based on GPU availability. The prices above are based on 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Migration Guide: Upgrading Your Existing Inference Stack to FlashAttention-4

Step 1: Audit your GPU fleet compute capability

bash
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
  • 10.0 or 10.3: Blackwell data-center (B200 SM100, B300 SM103), FA4 supported
  • 12.0: Consumer/workstation Blackwell (RTX 5090, RTX PRO 6000, SM120), FA2 only, no TMEM
  • 9.0: Hopper (H100/H200), FA3 is correct
  • 8.0 or 8.9: Ampere/Ada, FA2 only
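The mapping above can be sketched as a small helper for scripting a fleet audit. This is illustrative only, not the actual vLLM/SGLang detection code:

```python
# Map a compute-capability string (as printed by nvidia-smi) to the
# FlashAttention generation a framework would select. Illustrative
# sketch of the detection logic, not framework source.
def fa_version_for(compute_cap: str) -> str:
    major = int(compute_cap.split(".")[0])
    if major == 10:   # SM100/SM103: B200, B300
        return "FA4"
    if major == 12:   # SM120 consumer/workstation Blackwell: no TMEM
        return "FA2"
    if major == 9:    # SM90: H100/H200
        return "FA3"
    return "FA2"      # SM80/SM89 Ampere/Ada and older

for cap in ("10.0", "10.3", "12.0", "9.0", "8.0", "8.9"):
    print(cap, "->", fa_version_for(cap))
```

Feed it the output of the nvidia-smi query from Step 1 to audit a mixed fleet in one pass.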

Step 2: Check your framework version

  • vLLM: FA4 requires v0.17.0+ (released March 7, 2026). Check with python -c "import vllm; print(vllm.__version__)"
  • SGLang: FA4 requires v0.4+. Check with python -c "import sglang; print(sglang.__version__)"

Step 3: Benchmark on a spot B200 instance first

Spot pricing makes evaluation cheap. Run the standard vLLM benchmark before committing to on-demand:

bash
python -m vllm.benchmarks.benchmark_throughput \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --input-len 2048 \
  --output-len 512 \
  --num-prompts 100

Compare against your current H100 numbers. If FA4 throughput doesn't cover the migration effort at current pricing, H100 spot may still be the right call for your cost model.

Step 4: Evaluate the FP4 quality tradeoff

If you're migrating to B200, FA4 + FP4 is the recommended target configuration. FP4 adds another ~1.5-2x throughput on top of FA4's gains for weight-compute-bound models. The quality tradeoff is task-dependent. For the full FP4 decision framework, see the FP4 quantization guide.

FA3 fallback for teams staying on Hopper

If you're not ready to migrate to Blackwell, FA3 on H100/H200 is fully supported and requires no configuration in vLLM or SGLang. To explicitly force FA3 (for example, if auto-detection selects an older backend):

bash
# vLLM: force FA3 on Hopper
--attention-backend flash_attn_3

Migration cost comparison

| Config | Hourly rate | 70B throughput | $/1M tokens |
|---|---|---|---|
| H100 SXM5 FA3, on-demand | $2.54/hr | 18K tok/s | $0.039 |
| H100 SXM5 FA3, spot | $0.80/hr | 18K tok/s | $0.012 |
| B200 SXM6 FA4, on-demand | $5.54/hr | 35K tok/s | $0.044 |
| B200 SXM6 FA4, spot | $1.71/hr | 35K tok/s | $0.014 |

Pricing fluctuates based on GPU availability. The prices above are based on 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

The migration break-even depends on your usage pattern. On-demand B200 SXM6 FA4 costs $0.044/1M tokens versus $0.039/1M for H100 on-demand, so on a raw per-token basis H100 is slightly cheaper. The more useful comparison for throughput-constrained teams: one B200 at $5.54/hr delivers 35K tok/s, while two H100s to match that capacity cost ~$5.08/hr total. The ~9% premium for B200 may be worth paying to run a single GPU instead of coordinating two nodes. Spot B200 FA4 at $1.71/hr delivers nearly 2x the throughput of H100 spot in one GPU, which makes it the better option for teams where capacity and simplicity outweigh marginal per-token cost differences.

For broader cost reduction strategies beyond attention kernels, see the GPU cost optimization playbook.


FlashAttention-4 is live on Spheron's Blackwell instances today. Rent a B200 or B300 by the minute with no contracts, and see FA4 throughput gains on your actual workload before committing.

Rent B200 on Spheron → | Rent B300 on Spheron → | View all GPU pricing →
