Tutorial

Deploy FlashInfer on GPU Cloud: LLM Inference Kernels for vLLM and SGLang (2026 Guide)

If you are running vLLM or SGLang today, you may already be using FlashInfer. SGLang routes attention through FlashInfer by default on both Hopper and Blackwell. In vLLM, FlashInfer is the default on Blackwell (B200/B300); on Hopper (H100/H200), FlashAttention is vLLM's default and FlashInfer is opt-in. The question is not whether to use it but whether you have configured it correctly, understand what it is doing under the hood, and are getting the full performance gains it offers. For broader context on how these frameworks use FlashInfer, the vLLM production deployment guide and SGLang production deployment guide both cover the serving layer in more depth.

This guide covers what FlashInfer actually is, why it exists separately from FlashAttention, how to verify it is running, how to enable MLA support in SGLang, and how to measure what it is worth in TTFT, ITL, and cost-per-token terms on H100, H200, and B200 instances.

What FlashInfer Is

FlashInfer, developed jointly by researchers at NVIDIA, CMU, and UW, is not a newer version of FlashAttention. They serve different purposes and should not be confused.

FlashAttention-2, FlashAttention-3, and FlashAttention-4 are pure attention kernel libraries. They compute scaled dot-product attention faster than the naive implementation by tiling the computation to fit in SRAM and fusing the softmax into the matrix multiply. That is all they do. For context on how FA2 and FA3 differ, see the FlashAttention-2 vs FlashAttention-3 guide.

FlashInfer is a broader inference operator library. It handles:

  • Block-sparse KV cache attention (used by PagedAttention in vLLM)
  • Multi-head latent attention (MLA, the compressed-KV variant used by DeepSeek-style architectures)
  • Grouped GEMM for MoE expert layers
  • FP4/FP8 quantized attention for Blackwell

| Scope | FlashAttention (2/3/4) | FlashInfer |
| --- | --- | --- |
| Attention kernel math | Yes | Yes |
| Block-sparse KV cache | No | Yes |
| MLA (DeepSeek-style) | No | Yes |
| Grouped GEMM for MoE | No | Yes |
| FP4 attention (Blackwell) | FA4 only | Yes |
| JIT kernel compilation | No | Yes |

FlashAttention-4 overlaps with FlashInfer on Blackwell FP8/FP4 attention, but FlashInfer's broader scope is why it has become the default attention operator in major serving frameworks rather than raw FlashAttention. For the Blackwell-specific attention story, see the FlashAttention-4 guide.

Which Frameworks Route Through FlashInfer

As of late 2024 and into 2025, FlashInfer is the default or preferred attention backend across the major open-source serving frameworks:

vLLM: In recent vLLM releases, FlashInfer is the default attention backend on Blackwell (B200/B300). On Hopper (H100/H200), FlashAttention is vLLM's default; opt in with VLLM_ATTENTION_BACKEND=FLASHINFER. FlashInfer handles block-sparse KV attention via flashinfer.BatchPrefillWithPagedKVCacheWrapper and flashinfer.BatchDecodeWithPagedKVCacheWrapper. These functions operate directly on vLLM's paged KV blocks without requiring a copy into contiguous memory first.

SGLang: Uses FlashInfer as the default attention backend for Hopper and Blackwell. For DeepSeek-style MLA models, pass --attention-backend flashinfer to route through the FlashInfer MLA kernel path. MLA is enabled by default for DeepSeek models; FA3 is SGLang's current default MLA backend.

TensorRT-LLM: Uses a subset of FlashInfer kernels for GQA and MQA layers internally, though TRT-LLM's primary kernel path is its own compiled engine.

MLC-Engine: Bundles FlashInfer for its attention kernels in the Apache TVM compilation path.

| Framework | FlashInfer enabled by default? | Flag or condition |
| --- | --- | --- |
| vLLM (recent) | Blackwell only; Hopper opt-in | VLLM_ATTENTION_BACKEND=FLASHINFER on Hopper |
| SGLang | Yes (Hopper, Blackwell) | --attention-backend flashinfer |
| SGLang (MLA via FlashInfer) | No (FA3 is default) | --attention-backend flashinfer |
| TensorRT-LLM | Partial | Internal; not user-configurable |
| MLC-Engine | Yes | Bundled, always active |

Architecture: JIT Kernels, Block-Sparse KV, and Unified APIs

Three architectural choices make FlashInfer different from a standard attention library.

JIT kernel compilation. FlashInfer compiles CUDA kernels at first call for each unique (batch_size, num_heads, head_dim, page_size) tuple. Compiled artifacts are cached to disk at FLASHINFER_CACHE_DIR (defaults to a temp directory). The first 1-5 requests are slower while the JIT cache builds up. After that, all subsequent calls for the same shape hit the cached path with zero recompilation overhead.
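The compile-once, reuse-afterwards behavior described above can be pictured with a minimal sketch. This is not FlashInfer's implementation: `ShapeKey`, `KernelCache`, and `fake_compile` are hypothetical names invented for illustration, and the "compile" step only builds a Python closure.

```python
from typing import Callable, Dict, NamedTuple


class ShapeKey(NamedTuple):
    """Hypothetical cache key mirroring the tuple named in the text."""
    batch_size: int
    num_heads: int
    head_dim: int
    page_size: int


class KernelCache:
    """Toy stand-in for a JIT cache: compile on first use, reuse afterwards."""

    def __init__(self, compile_fn: Callable[[ShapeKey], Callable[[], str]]):
        self._compile_fn = compile_fn
        self._kernels: Dict[ShapeKey, Callable[[], str]] = {}
        self.compile_count = 0  # number of times the slow path ran

    def get(self, key: ShapeKey) -> Callable[[], str]:
        kernel = self._kernels.get(key)
        if kernel is None:
            # Slow path: taken only the first time a shape is seen.
            kernel = self._compile_fn(key)
            self._kernels[key] = kernel
            self.compile_count += 1
        return kernel


def fake_compile(key: ShapeKey) -> Callable[[], str]:
    """Placeholder for real kernel compilation (assumption, illustration only)."""
    return lambda: f"kernel{tuple(key)}"


cache = KernelCache(fake_compile)
decode_shape = ShapeKey(batch_size=32, num_heads=64, head_dim=128, page_size=16)
cache.get(decode_shape)              # first call compiles
cache.get(decode_shape)              # second call hits the cache
cache.get(ShapeKey(8, 64, 128, 16))  # a new shape compiles again
print(cache.compile_count)
```

Three lookups trigger two compilations here, which is the pattern behind the slower first few requests.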

Block-sparse KV cache. Standard attention operates on a contiguous KV matrix. vLLM's PagedAttention, described in the continuous batching and PagedAttention guide, stores KV cache in non-contiguous memory pages to handle variable-length sequences efficiently. The traditional approach is to copy these pages into a contiguous buffer before running attention. FlashInfer's block-sparse attention kernel operates directly on the paged format, eliminating the copy step entirely. At 32K+ context lengths, this copy overhead becomes significant, which is why FlashInfer shows 28-30% speedup on long-context workloads.
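To make the "no copy" idea concrete, here is a small pure-Python sketch for one query vector and one head. It is an illustration only, not FlashInfer's kernel: the function names, the page layout, and the sample data are all assumptions. `attend_paged` walks a page table and keeps a running softmax, while `attend_contiguous` is the reference that needs the KV entries in one contiguous list.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend_contiguous(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Reference path: softmax(q.k / sqrt(d)) over one contiguous KV list."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def attend_paged(q, k_pages, v_pages, page_table, seq_len, page_size):
    """Walk the page table with a running (online) softmax; no gather copy."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(v_pages[0][0])
    for slot, page_id in enumerate(page_table):
        tokens_in_page = min(page_size, seq_len - slot * page_size)
        for t in range(tokens_in_page):
            score = dot(q, k_pages[page_id][t]) * scale
            new_max = max(running_max, score)
            # Rescale everything accumulated so far to the new maximum.
            correction = math.exp(running_max - new_max)
            weight = math.exp(score - new_max)
            denom = denom * correction + weight
            acc = [a * correction + weight * v
                   for a, v in zip(acc, v_pages[page_id][t])]
            running_max = new_max
    return [a / denom for a in acc]


# Made-up data: 5 tokens spread over physical pages 3, 0, 2 of a 4-page pool.
page_size, seq_len = 2, 5
tokens_k = [[0.1 * (i + 1), -0.2 * i] for i in range(seq_len)]
tokens_v = [[float(i), 1.0 - i] for i in range(seq_len)]
page_table = [3, 0, 2]
k_pages = [[[0.0, 0.0] for _ in range(page_size)] for _ in range(4)]
v_pages = [[[0.0, 0.0] for _ in range(page_size)] for _ in range(4)]
for i in range(seq_len):
    page_id = page_table[i // page_size]
    k_pages[page_id][i % page_size] = tokens_k[i]
    v_pages[page_id][i % page_size] = tokens_v[i]

q = [0.5, -1.0]
paged = attend_paged(q, k_pages, v_pages, page_table, seq_len, page_size)
dense = attend_contiguous(q, tokens_k, tokens_v)
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(paged, dense)))
```

Both paths give the same output up to floating-point rounding; the paged path simply never materializes the contiguous buffer.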

Unified GEMM and attention. A single library handling both matrix-multiply layers and attention enables operator fusion that reduces intermediate tensor allocations between layers. On Blackwell with FP4 attention, this matters particularly because the quantization and dequantization steps can be fused into the attention kernel itself.

Measured Performance Gains

The performance numbers below come from FlashInfer's published benchmarks and vLLM/SGLang release documentation. They represent directional ranges rather than exact values, since results vary by model architecture, sequence length, and hardware generation.

  • 29-69% ITL reduction vs Triton-based kernel implementations at medium-to-high concurrency (batch 8-64). The gain is larger at higher concurrency where the batched decode kernel's efficiency advantage is most pronounced.
  • 28-30% long-context speedup at 32K+ tokens, primarily from eliminating the KV page copy before attention.
  • 13-17% parallel-generation gains from the fused batched decode kernel at batch 1-4.

Directional benchmark estimates for Llama-3.1-70B on H200 SXM5, batch 32, 2048 input / 512 output tokens per request:

| Setup | TTFT p50 (ms) | TTFT p95 (ms) | ITL p50 (ms) | Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| H200 + torch_sdpa (baseline) | ~120 | ~480 | ~28 | ~22,000 |
| H200 + FlashInfer | ~82 | ~310 | ~18 | ~28,000 |
| H100 + FlashInfer | ~155 | ~620 | ~26 | ~17,000 |

Results are directional estimates based on published FlashInfer benchmarks and vLLM release notes. Actual results depend on model size, batch composition, and context distribution. For end-to-end serving engine comparisons that include FlashInfer-backed vLLM, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

Install and Enable FlashInfer in vLLM on a Spheron Instance

Provision an H100, H200, or B200 instance

Follow the Spheron getting-started guide to provision your GPU instance, choose Ubuntu 22.04 with CUDA 12.4+, and connect via SSH. FlashInfer requires CUDA 12.3 or later; anything below that will fail to compile the SM90 kernel targets. H200 SXM5 on Spheron is the best balance of memory bandwidth (4.8 TB/s) and HBM capacity (141 GB) for 70B model serving with FlashInfer. For smaller workloads or cost-sensitive setups, H100 SXM5 instances run the same FlashInfer kernel path at a lower hourly rate.

Verify FlashInfer is already active in vLLM

bash
python -c "import vllm; print(vllm.__version__)"
# Should output a recent vLLM version

docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.92
# Check logs for: "Using FlashInfer backend" or "attention_backend=FlashInfer"
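
If you capture the startup logs to a file or a string, a short script can do the check instead of reading them by eye. This is a sketch: the two marker strings are the ones quoted above and the exact wording may differ between vLLM releases, `find_flashinfer_line` is a hypothetical helper, and the sample log lines are made up.

```python
from typing import Optional

# Marker strings quoted from the text above; real log wording can vary by release.
FLASHINFER_MARKERS = ("Using FlashInfer backend", "attention_backend=FlashInfer")


def find_flashinfer_line(log_text: str) -> Optional[str]:
    """Return the first log line containing a FlashInfer marker, else None."""
    for line in log_text.splitlines():
        if any(marker in line for marker in FLASHINFER_MARKERS):
            return line.strip()
    return None


# Made-up sample log lines for illustration only.
sample_log = """INFO engine starting
INFO Using FlashInfer backend
INFO server ready"""
print(find_flashinfer_line(sample_log))
```

A `None` result means neither marker appeared, which on Hopper usually just indicates the backend was not opted into.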

Force or override the attention backend

bash
# On Hopper (H100/H200): opt in via env var (recommended) or CLI flag
export VLLM_ATTENTION_BACKEND=FLASHINFER
# or: --attention-backend flashinfer

# On Blackwell (B200/B300): FlashInfer is the default; no flag needed

# Revert to XFORMERS for comparison
export VLLM_ATTENTION_BACKEND=XFORMERS
# or: --attention-backend xformers

# Revert to native SDPA for baseline
export VLLM_ATTENTION_BACKEND=TORCH_SDPA
# or: --attention-backend torch_sdpa

On Blackwell (B200/B300), FlashInfer is vLLM's default. On Hopper (H100/H200), FlashAttention is the default; set VLLM_ATTENTION_BACKEND=FLASHINFER to opt in. Only override this for debugging or A/B latency testing.
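
The per-architecture rule above can be encoded in a launch script. The sketch below is an assumption-laden illustration: `attention_env_for` is a hypothetical helper, and it only covers the two cases the text describes (compute capability 9.x for Hopper, 10.x for data-center Blackwell).

```python
from typing import Dict, Tuple


def attention_env_for(capability: Tuple[int, int]) -> Dict[str, str]:
    """Env vars needed to end up on FlashInfer, following the defaults stated above."""
    major, _minor = capability
    if major == 10:
        # Blackwell (B200/B300): FlashInfer is already the default, nothing to set.
        return {}
    if major == 9:
        # Hopper (H100/H200): FlashAttention is the default, so opt in explicitly.
        return {"VLLM_ATTENTION_BACKEND": "FLASHINFER"}
    raise ValueError(f"compute capability {capability} is not covered by this sketch")


print(attention_env_for((9, 0)))
print(attention_env_for((10, 0)))
```

On a real node the capability tuple would come from `torch.cuda.get_device_capability()`, and the returned mapping would be merged into the environment before starting the server.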

JIT cache warm-up

bash
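# Set an explicit cache directory so compiled kernels survive server restarts
# (pick a path outside /tmp if they must also survive a machine reboot)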
export FLASHINFER_CACHE_DIR=/tmp/flashinfer_cache

The first 1-5 requests trigger JIT compilation for each unique (batch_size, num_heads, head_dim, page_size) shape your workload uses. This adds 1-3 seconds per unique shape on first encounter. To avoid exposing users to this latency, send a representative warm-up batch through the model before opening the endpoint to live traffic. The compiled kernels persist at FLASHINFER_CACHE_DIR across server restarts.
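
A warm-up pass can be scripted against the OpenAI-compatible endpoint. The following is a minimal sketch under stated assumptions: `build_warmup_payloads` and `send_warmup` are hypothetical helpers, the prompt lengths are arbitrary examples, repeated words only approximate token counts, and requests are sent one at a time, so larger batch shapes are not exercised.

```python
import json
import urllib.request
from typing import Dict, Iterable, List


def build_warmup_payloads(model: str, prompt_lengths: Iterable[int],
                          max_tokens: int = 16) -> List[Dict]:
    """One completion request per representative prompt length (in words)."""
    return [
        {"model": model, "prompt": "warm " * n, "max_tokens": max_tokens}
        for n in prompt_lengths
    ]


def send_warmup(base_url: str, payloads: List[Dict]) -> None:
    """POST each payload to the OpenAI-compatible completions route."""
    for payload in payloads:
        request = urllib.request.Request(
            f"{base_url}/v1/completions",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=300) as response:
            response.read()  # output is discarded; the goal is to trigger JIT


payloads = build_warmup_payloads("meta-llama/Llama-3.1-8B-Instruct", [32, 512, 2048])
print(len(payloads))
# send_warmup("http://localhost:8000", payloads)  # run once the server is up
```

Choose prompt lengths that resemble production traffic so the shapes compiled during warm-up are the ones live requests will hit.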

Enable FlashInfer MLA in SGLang

For DeepSeek V3, V4, and similar models using multi-head latent attention, MLA is enabled by default in SGLang, with FA3 as the current default MLA backend. Passing --attention-backend flashinfer switches MLA to the FlashInfer kernel path explicitly.

DeepSeek-V3 is a 671B parameter MoE model. In bfloat16, the weights alone require roughly 1.3 TB of GPU memory, so you need a multi-GPU instance. The command below uses --tp 8 (tensor parallelism across 8 GPUs), which requires an 8×B200 node (192 GB × 8 = 1.536 TB). An 8×H200 node (141 GB × 8 = 1.128 TB) does not have enough VRAM for bfloat16 at this model size and will OOM during model loading:

bash
docker run --gpus all --ipc=host -p 30000:30000 \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --context-length 32768 \
  --dtype bfloat16 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.88
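
The memory arithmetic behind the 8×B200 requirement can be checked with a few lines. This sketch counts weights only, in decimal GB, using the figures stated above; `weights_gb` and `node_capacity_gb` are hypothetical helpers, and KV cache, activations, and runtime overhead are deliberately ignored.

```python
def weights_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weight memory in decimal GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def node_capacity_gb(gpus: int, gb_per_gpu: int) -> int:
    """Total HBM across a node in GB."""
    return gpus * gb_per_gpu


needed = weights_gb(671, 2)           # bfloat16 uses 2 bytes per parameter
b200_node = node_capacity_gb(8, 192)
h200_node = node_capacity_gb(8, 141)

print("weights GB:", needed)
print("8xB200 GB:", b200_node, "fits weights:", needed <= b200_node)
print("8xH200 GB:", h200_node, "fits weights:", needed <= h200_node)
```

Because KV cache and activations come on top of the weights, treat a "fits" result as a necessary condition rather than a guarantee.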

MLA (multi-head latent attention) compresses the KV cache by projecting keys and values into a lower-dimensional latent space. DeepSeek models use this to reduce KV cache memory by 60-70% compared to standard MHA. When using --attention-backend flashinfer, SGLang routes MLA computation through FlashInfer's dedicated MLA kernel, which handles the compressed format directly and reduces attention compute time by 20-35% compared to the decomposed expansion-then-attention path. FA3 is the current default MLA backend and is used when no explicit attention backend is passed.

Architecture-Specific Kernel Selection: Hopper vs Blackwell

FlashInfer selects and JIT-compiles different kernel targets based on the GPU's compute capability:

| GPU | Architecture | FlashInfer kernel path | JIT target |
| --- | --- | --- | --- |
| H100 SXM5 | SM90 (Hopper) | FA3-style warp-specialized | sm_90a |
| H200 SXM5 | SM90 (Hopper) | Same as H100, more HBM bandwidth | sm_90a |
| B200 SXM6 | SM100 (Blackwell) | TMA tile-based, FP4 attention | sm_100a |
| B300 SXM6 | SM103 (Blackwell Ultra) | Same as B200, 288 GB HBM3e | sm_103a |
| A100 SXM4 | SM80 (Ampere) | FA2-style, no FP8 | sm_80 |
| L40S | SM89 (Ada) | FA2 + FP8 experimental | sm_89 |

For B200 SXM6 instances on Spheron, the SM100 kernel path pairs TMA (Tensor Memory Accelerator) tile-based attention with native FP4 attention support; FP4 attention is not available to FA3 on Hopper. The architectural shift from Hopper to Blackwell for FlashInfer is similar to what FA4 provides in its kernel path.

JIT compile caveat for mixed fleets. Each unique kernel target compiles separately. If you run a fleet with both H200 and B200 nodes, mount FLASHINFER_CACHE_DIR per GPU architecture type or each instance will recompile from scratch at startup. A shared NFS mount works but requires separate subdirectories per architecture.
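
One way to get separate subdirectories per architecture is to derive the cache path from the GPU's compute capability at startup. The sketch below assumes the JIT targets listed in the table above; `cache_dir_for` and the `/cache` root are hypothetical names chosen for illustration.

```python
import os
import posixpath
from typing import Dict, Tuple

# JIT targets copied from the table above, keyed by compute capability.
JIT_TARGETS: Dict[Tuple[int, int], str] = {
    (8, 0): "sm_80",
    (8, 9): "sm_89",
    (9, 0): "sm_90a",
    (10, 0): "sm_100a",
    (10, 3): "sm_103a",
}


def cache_dir_for(capability: Tuple[int, int], root: str = "/cache") -> str:
    """Return an architecture-specific subdirectory under a shared cache root."""
    return posixpath.join(root, JIT_TARGETS[capability])


# Build the environment for a Hopper node without mutating the current process.
env = dict(os.environ, FLASHINFER_CACHE_DIR=cache_dir_for((9, 0)))
print(env["FLASHINFER_CACHE_DIR"])
```

On a live node the capability would come from `torch.cuda.get_device_capability()`, so H200 and B200 instances sharing one mount write to different subdirectories.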

Benchmark Walkthrough: TTFT and ITL Before vs After on H200

Run both backends sequentially on the same H200 instance:

bash
# Step 1: Start with torch_sdpa baseline
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --dtype bfloat16 \
  --attention-backend torch_sdpa \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384

# Step 2: Benchmark
python -m vllm.benchmarks.benchmark_serving \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 200 \
  --concurrency 32 \
  --host localhost --port 8000

# Step 3: Stop, restart with FlashInfer
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --dtype bfloat16 \
  --attention-backend flashinfer \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384

# Step 4: Run same benchmark again

Representative results for Llama-3.1-70B at batch 32, with live-fetched Spheron pricing as of 19 Jun 2026:

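| Config | TTFT p50 (ms) | TTFT p95 (ms) | ITL p50 (ms) | Throughput (tok/s) | $/hr on-demand | $/1M tokens |
| --- | --- | --- | --- | --- | --- | --- |
| H200 + torch_sdpa | ~120 | ~480 | ~28 | ~22,000 | $4.82 | ~$0.061 |
| H200 + FlashInfer | ~82 | ~310 | ~18 | ~28,000 | $4.82 | ~$0.048 |
| H100 + FlashInfer | ~155 | ~620 | ~26 | ~17,000 | $4.06 | ~$0.066 |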
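
To turn two benchmark runs into relative changes, a small helper is enough. The numbers below are the approximate H200 figures from the table above, not new measurements, and `pct_change` is a hypothetical helper name.

```python
def pct_change(before: float, after: float) -> float:
    """Signed percentage change from before to after."""
    return (after - before) / before * 100.0


# Approximate H200 figures taken from the table above.
baseline = {"ttft_p50_ms": 120, "ttft_p95_ms": 480, "itl_p50_ms": 28, "tok_s": 22000}
flashinfer = {"ttft_p50_ms": 82, "ttft_p95_ms": 310, "itl_p50_ms": 18, "tok_s": 28000}

for metric in baseline:
    print(f"{metric}: {pct_change(baseline[metric], flashinfer[metric]):+.1f}%")
```

Negative values are improvements for the latency metrics; a positive value is an improvement for throughput.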

Cost per 1M tokens formula: ($/hr) / (tok/s × 3600) × 1,000,000

H200 + FlashInfer example: 4.82 / (28000 × 3600) × 1,000,000 = 4.82 / 100,800,000 × 1,000,000 = $0.048 per 1M tokens
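
The same formula as a small function, using the hourly rates and throughput figures from the table above. `cost_per_million_tokens` is a hypothetical helper name.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """($/hr) / (tok/s * 3600) * 1,000,000."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


h200_sdpa = cost_per_million_tokens(4.82, 22_000)
h200_flashinfer = cost_per_million_tokens(4.82, 28_000)
h100_flashinfer = cost_per_million_tokens(4.06, 17_000)

print(f"H200 + torch_sdpa:  ${h200_sdpa:.3f} per 1M tokens")
print(f"H200 + FlashInfer:  ${h200_flashinfer:.3f} per 1M tokens")
print(f"H100 + FlashInfer:  ${h100_flashinfer:.3f} per 1M tokens")

# Daily difference at 100M tokens per day (100 units of 1M tokens).
daily_savings = (h200_sdpa - h200_flashinfer) * 100
print(f"Savings at 100M tokens/day: ${daily_savings:.2f}")
```

Swap in your own measured throughput and current hourly rate to get a figure that reflects your workload.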

The ~$0.013 per 1M tokens gap between FlashInfer and the torch_sdpa baseline on H200 scales linearly with volume. At 100M tokens per day, that is roughly $1.30/day in savings with no infrastructure change. For deeper cost modeling across batch sizes and concurrency levels, see the AI inference cost economics guide.

Pricing fluctuates based on GPU availability. The prices above reflect rates on 19 Jun 2026 and may have changed. Check current GPU pricing for live rates.

FlashInfer vs TensorRT-LLM Kernels: When to Use Each

| Scenario | FlashInfer (via vLLM/SGLang) | TensorRT-LLM kernels |
| --- | --- | --- |
| General LLM serving (Llama, Qwen, Mistral) | Best default | Overkill unless at extreme throughput |
| MoE models (DeepSeek V3/V4, Kimi K2) | FlashInfer MLA in SGLang | TRT-LLM for maximum throughput |
| Fixed production model, max throughput | Fine | Compile once, cache; 10-20% ahead |
| Dynamic model switching | Best fit | Compile cost makes this painful |
| Long-context (32K+ tokens) | Strong fit | Comparable |
| B200 FP4 workloads | FP4 attention kernel | TRT-LLM NVFP4 path is equivalent |

The practical distinction: if you need to swap models without a 60-120 minute engine recompile, FlashInfer via vLLM or SGLang is the right answer. If you have a locked-down production setup with a single model that has been running for weeks and you need the last 10-20% of throughput, TRT-LLM's static compilation will edge ahead. For a head-to-head comparison with full benchmark data, see the TensorRT-LLM production deployment guide.

For MoE models, the FlashInfer MLA path in SGLang combined with DeepGEMM for grouped GEMM gives throughput competitive with TRT-LLM without the static compilation step. See the DeepEP and DeepGEMM guide for the full MoE kernel setup.

Troubleshooting: Common FlashInfer Issues

JIT compilation errors on first request. Check that CUDA is 12.3+:

bash
nvcc --version
python -c "import torch; print(torch.cuda.get_device_capability())"
# Should return (9, 0) for Hopper or (10, 0) for Blackwell

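If get_device_capability() returns (8, 0) (A100), the SM90 kernel target will not compile; on that GPU FlashInfer is limited to the FA2-style SM80 path. For the Hopper or Blackwell kernel paths, verify you are on an H100 or newer GPU.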
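
The CUDA version check can also be automated by parsing the `nvcc --version` output. This is a sketch: `parse_nvcc_release` and `cuda_is_new_enough` are hypothetical helpers, the 12.3 minimum is the one stated in this guide, and the sample string follows the usual `release X.Y` line that nvcc prints.

```python
import re
from typing import Optional, Tuple

MIN_CUDA = (12, 3)  # minimum version stated in the text above


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' part of nvcc's output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_is_new_enough(nvcc_output: str) -> bool:
    """True when a release was found and it meets the minimum."""
    release = parse_nvcc_release(nvcc_output)
    return release is not None and release >= MIN_CUDA


# Sample line in the format nvcc prints; the version shown is only an example.
sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(parse_nvcc_release(sample), cuda_is_new_enough(sample))
```

In practice the input would be the captured output of `nvcc --version`, for example via `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.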

VLLM_USE_FLASHINFER_MOE_FP4=1 is MoE-specific. This environment variable activates FlashInfer's FP4 MoE kernel on Blackwell only. It applies to MoE expert layers and has no effect on dense models: dense NVFP4 models such as Llama on B200 use a separate FP4 code path, so setting the variable for them is harmless but does nothing.

SGLang MLA backend selection. To use the FlashInfer MLA kernel path in SGLang for DeepSeek-style models, pass --attention-backend flashinfer. MLA is enabled by default for DeepSeek models in SGLang; FA3 is the current default MLA backend and is used if no backend flag is passed. Note that --attention-backend flashinfer also exists in vLLM (for standard Hopper attention), but vLLM handles MLA separately from the attention backend selector, so the two flags are not equivalent across frameworks.

Cache directory permission errors in Docker. If running without a volume mount, FLASHINFER_CACHE_DIR defaults to a temp path that may not persist between container restarts. Set it explicitly to a mounted volume:

bash
-v /host/flashinfer-cache:/cache \
-e FLASHINFER_CACHE_DIR=/cache

This ensures JIT artifacts survive container restarts and do not recompile on every deployment.


Spheron's H200 instances give you a clean environment to A/B FlashInfer-backed vs default kernels, with per-minute billing so the latency win translates directly into a documented cost-per-token number before you commit.

Quick Setup Guide

  1. Provision an H100, H200, or B200 GPU instance on Spheron

    Log in to app.spheron.ai, select your GPU model (H100 SXM5, H200 SXM5, or B200 SXM6), choose Ubuntu 22.04 with CUDA 12.4+, and deploy. FlashInfer requires CUDA 12.3 or later. The instance is live in under 2 minutes with SSH root access.

  2. Install FlashInfer standalone or via vLLM

    FlashInfer is bundled with recent vLLM releases. Install vLLM with: pip install vllm. To install FlashInfer standalone for custom kernel work: pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.6/. Verify with: python -c 'import flashinfer; print(flashinfer.__version__)'.

  3. Enable FlashInfer as the attention backend in vLLM

    On Blackwell (B200/B300), FlashInfer is vLLM's default attention backend in recent releases. On Hopper (H100/H200), FlashAttention is the default; set VLLM_ATTENTION_BACKEND=FLASHINFER or add --attention-backend flashinfer to your vllm serve command to use FlashInfer. Check startup logs for 'Using FlashInfer backend'. To revert: VLLM_ATTENTION_BACKEND=TORCH_SDPA or --attention-backend torch_sdpa for baseline comparison.

  4. Enable FlashInfer MLA attention in SGLang

    For DeepSeek-style models with multi-head latent attention, pass --attention-backend flashinfer to your SGLang server launch command. MLA is enabled by default for DeepSeek models in SGLang; FA3 is the current default MLA backend. Passing --attention-backend flashinfer switches the serving stack to the FlashInfer MLA kernel path explicitly. Do not confuse this with vLLM's attention backend flag, which is a separate setting for non-MLA attention.

  5. Benchmark TTFT and ITL before and after on H200

    Start vLLM twice with --attention-backend torch_sdpa and --attention-backend flashinfer respectively. Run: python -m vllm.benchmarks.benchmark_serving --model meta-llama/Llama-3.1-70B-Instruct --dataset-name random --random-input-len 2048 --random-output-len 512 --num-prompts 200 --concurrency 32. Compare TTFT p50/p95 and ITL p50 between runs.

  6. Calculate cost per token and compare against default kernels

    Use the formula: ($/hr) divided by (tok/s times 3600) times 1,000,000. On an H200 at $4.82/hr running 28,000 tok/s with FlashInfer: 4.82 / (28000 x 3600) x 1,000,000 = $0.048 per 1M tokens. With the torch_sdpa baseline at 22,000 tok/s: $0.061 per 1M tokens. The ~27% throughput gain from FlashInfer translates to a direct reduction in cost per token.

Frequently Asked Questions

What is FlashInfer, and why do vLLM and SGLang use it?

FlashInfer (developed at NVIDIA, CMU, and UW) is a unified operator library for LLM inference that covers block-sparse KV cache attention, multi-head latent attention (MLA), grouped GEMM for MoE layers, and FP4/FP8 quantized attention for Blackwell. vLLM uses FlashInfer as the default attention backend on Blackwell (B200/B300) in recent releases, because FlashInfer handles PagedAttention's block-sparse KV format directly, eliminating the KV copy step before attention computation. On Hopper (H100/H200), FlashAttention is vLLM's default; set VLLM_ATTENTION_BACKEND=FLASHINFER to opt in. SGLang uses FlashInfer as the default attention backend on both Hopper and Blackwell, with a dedicated MLA kernel path for DeepSeek-style architectures.

How do I enable FlashInfer in vLLM?

In vLLM, FlashInfer is the default attention backend on Blackwell (B200/B300). On Hopper (H100/H200), FlashAttention is vLLM's default; set VLLM_ATTENTION_BACKEND=FLASHINFER (env var) or pass --attention-backend flashinfer (CLI) to enable FlashInfer explicitly. Check the startup logs for 'Using FlashInfer backend' or 'attention_backend=FlashInfer' to confirm it is active. To revert to a different backend for comparison testing, use VLLM_ATTENTION_BACKEND=TORCH_SDPA or --attention-backend torch_sdpa.

Does FlashInfer support both Hopper and Blackwell GPUs?

Yes. FlashInfer supports SM80 (A100), SM89 (L40S, RTX 4090), SM90 (Hopper: H100, H200), SM100 (Blackwell data-center: B200), and SM103 (B300). On Hopper, it compiles warp-specialized kernels targeting sm_90a. On Blackwell, it uses TMA tile-based kernels with FP4 attention support targeting sm_100a. Each architecture gets a separate JIT-compiled kernel that is cached to disk after the first request.

How much faster is FlashInfer?

Benchmarks from FlashInfer's own published data and vLLM/SGLang release notes show 29-69% inter-token latency (ITL) reduction compared to Triton-based kernel implementations at medium-to-high concurrency, 28-30% speedup on long-context workloads (32K+ tokens) due to eliminating the KV copy overhead from block-sparse paging, and 13-17% parallel-generation gains from fused batched decode kernels. Results vary by model architecture, batch size, and context length.

When should I use FlashInfer instead of TensorRT-LLM kernels?

FlashInfer via vLLM or SGLang is the right default for general LLM serving with dynamic model switching, long-context workloads, MoE models with DeepSeek-style MLA, and any setup where you need to swap models without a recompile step. TensorRT-LLM's native kernels are worth the setup cost for fixed production models at extreme throughput where a one-time 1-2 hour engine build is acceptable. They typically run 10-20% ahead of FlashInfer on throughput-bound benchmarks with a static model.
