FP4 quantization is native to Blackwell GPUs - the B200, B300, RTX 5090, and RTX PRO 6000 - and is not available on any previous GPU generation. For inference workloads, Blackwell's FP4 tensor cores deliver roughly double the TFLOPS of FP8 on the same hardware. If you're currently paying for H100 or H200 instances running FP8, a Blackwell GPU with FP4 can deliver significantly more throughput per dollar - provided your model and task can handle the precision reduction.
That qualifier matters. FP4 comes with real quality tradeoffs. The precision loss is minor for many production AI applications, but measurable for others. The wrong assumption is that FP4 is simply "free throughput." The right framing is: FP4 is a precision-performance lever that changes your cost model, and your job is to verify whether the tradeoff is acceptable for your specific use case.
This post gives you the full picture: what FP4 is, which hardware supports it, real throughput and cost comparisons, framework status, and a decision framework for when to migrate.
What FP4 Is - The Short Version
Floating-point precision formats define how many bits are used to represent each number in a computation. More bits means more precision and more memory consumed per value. Fewer bits means faster computation but more rounding error.
The progression from highest to lowest precision:
| Format | Bit width | Bytes per value | GPU support |
|---|---|---|---|
| FP32 | 32-bit | 4 bytes | All GPUs - training standard |
| BF16 / FP16 | 16-bit | 2 bytes | All modern GPUs - standard inference format |
| FP8 | 8-bit | 1 byte | Hopper (H100, H200) and Blackwell - ~1.5-2x throughput vs FP16 |
| FP4 | 4-bit | 0.5 bytes | Blackwell only (B200, B300, RTX 5090, RTX PRO 6000) - ~2x throughput vs FP8 |
Each halving of bit width roughly doubles the number of values that fit in the same memory and doubles the number of operations you can perform per second - because you're moving twice as many values per memory access, and tensor cores can process twice as many values per clock. The penalty is rounding error: smaller representations can't capture fine-grained numerical differences, which introduces quantization noise into computations.
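The memory side of that arithmetic is easy to sanity-check. Here's a minimal sketch of weight-only VRAM per precision (this counts weights only; KV cache, activations, and framework overhead come on top):

```python
# Rough VRAM needed just for model weights at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

def weight_vram_gb(n_params: float, fmt: str) -> float:
    """Weight memory in GB for n_params parameters stored in fmt."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    # e.g. a 70B model drops from 140 GB at FP16 to 35 GB at FP4
    print(f"70B model @ {fmt}: {weight_vram_gb(70e9, fmt):.0f} GB")
```

At FP4, a 70B model's weights fit comfortably on a single B200 (192 GB) with room left for KV cache.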
FP4 vs INT4: why the distinction matters. Both use 4 bits per value, but they represent numbers differently. INT4 is a fixed-range integer format with no exponent - it can represent 16 discrete integer values. NVIDIA's NVFP4 (the E2M1 format: 1 sign bit, 2 exponent bits, 1 mantissa bit) is a floating-point format that can represent values across a much wider dynamic range with variable precision. For transformer inference, where activation and weight distributions span many orders of magnitude, FP4's floating-point representation typically preserves more information than INT4 at the same bit width. This is why NVFP4 tends to produce better output quality than INT4 at the same memory footprint.
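The E2M1 element grid is small enough to enumerate directly. The sketch below lists every representable magnitude and rounds an input to the nearest one; note that in the full NVFP4 scheme, each small block of elements is additionally rescaled by a shared scale factor, which is how this narrow range covers real weight distributions:

```python
import itertools

def e2m1_values():
    """All magnitudes representable by E2M1:
    1 sign, 2 exponent, 1 mantissa bit, exponent bias 1."""
    vals = set()
    for exp, man in itertools.product(range(4), range(2)):
        if exp == 0:
            vals.add(man * 0.5)                     # subnormal: (m/2) * 2^(1-bias)
        else:
            vals.add((1 + man / 2) * 2 ** (exp - 1))  # normal: (1 + m/2) * 2^(exp-bias)
    return sorted(vals)

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest representable E2M1 value (sign handled separately)."""
    grid = e2m1_values()
    return min(grid, key=lambda v: abs(v - abs(x))) * (1 if x >= 0 else -1)

print(e2m1_values())        # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(quantize_e2m1(2.7))   # 3.0 -- a rounding error of 0.3
```

Only eight magnitudes exist, and the gaps widen as values grow - that spacing is exactly the quantization noise the calibration tooling discussed later tries to manage.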
What NVFP4 is not: it's not the same as bitsandbytes' NF4 format used in QLoRA fine-tuning. NF4 is a different 4-bit quantization scheme optimized for weight storage during fine-tuning. NVFP4 is a native tensor core format in Blackwell hardware, accelerated by dedicated compute units that didn't exist in Hopper or earlier architectures.
Which Blackwell GPUs Support FP4
| GPU | FP4 Support | Generation | VRAM |
|---|---|---|---|
| RTX 5090 | ✅ Yes | Blackwell consumer | 32 GB GDDR7 |
| RTX PRO 6000 | ✅ Yes | Blackwell workstation | 96 GB GDDR7 |
| B200 | ✅ Yes | Blackwell datacenter | 192 GB HBM3e |
| B300 (Blackwell Ultra) | ✅ Yes | Blackwell Ultra datacenter | 288 GB HBM3e |
| H200 SXM | ❌ No | Hopper (FP8 max) | 141 GB HBM3e |
| H100 SXM / PCIe | ❌ No | Hopper (FP8 max) | 80 GB HBM3 / HBM2e |
| A100 | ❌ No | Ampere (INT8 max) | 80 GB HBM2e |
| RTX 4090 | ❌ No | Ada Lovelace (FP8 partial) | 24 GB GDDR6X |
The key implication: FP4 requires a hardware migration. If you're running on H100 or H200 - even with FP8 fully optimized - you cannot access FP4 without moving to Blackwell hardware. FP8 is the precision ceiling for the Hopper generation.
FP4 vs FP8 vs FP16 - Performance and Quality Comparison
| Format | GPU support | Relative throughput | VRAM per parameter | Quality vs FP16 |
|---|---|---|---|---|
| FP16 / BF16 | All modern GPUs | 1× (baseline) | 2 bytes | Reference |
| FP8 | H100, H200, Blackwell | ~1.5-2× | 1 byte | Negligible difference for most tasks |
| INT8 | Most modern GPUs | ~1.5× | 1 byte | Model-dependent |
| INT4 | Most modern GPUs | ~2-3× | 0.5 bytes | Model-dependent, typically worse than FP4 |
| FP4 (NVFP4) | Blackwell only | ~2-4× vs FP16 | 0.5 bytes | Noticeably below for some tasks |
Important caveats on the throughput column: the "2-4× vs FP16" range for FP4 reflects two compounding factors - reduced memory pressure (at a quarter of FP16's bytes per weight, 4× as many weights move per unit of memory bandwidth) and dedicated FP4 tensor core operations on Blackwell. At the TFLOPS level, the B200 delivers ~18,000 sparse FP4 TFLOPS vs ~9,000 sparse FP8 TFLOPS on the same chip, a clean 2× ratio. But real inference throughput gains are workload-dependent. Memory-bandwidth-bound workloads (large models at small batch sizes) will see gains closer to the theoretical maximum. Compute-bound workloads (small models at large batch sizes) will see less. Real-world FP4 vs FP8 improvement in production inference typically falls in the 1.5-2× range.
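The memory-bound vs compute-bound distinction can be estimated with a crude roofline model. The sketch below ignores attention FLOPs, KV-cache traffic, and kernel overheads, and the ~8 TB/s bandwidth figure for B200-class HBM3e is an assumed round number - treat it as a back-of-envelope tool, not a benchmark:

```python
def decode_tokens_per_sec(n_params: float, bytes_per_param: float, batch: int,
                          peak_tflops: float, mem_tb_per_s: float):
    """Crude roofline estimate for one decode step of a dense LLM.
    Ignores attention/KV-cache traffic and kernel launch overheads."""
    flops = 2 * n_params * batch                 # ~2 FLOPs per weight per token
    bytes_moved = n_params * bytes_per_param     # weights streamed once per step
    t_compute = flops / (peak_tflops * 1e12)
    t_memory = bytes_moved / (mem_tb_per_s * 1e12)
    bound = "compute" if t_compute > t_memory else "memory"
    return batch / max(t_compute, t_memory), bound

# 70B model in FP4 on a B200-class GPU (18,000 sparse FP4 TFLOPS from the
# text above; ~8 TB/s HBM bandwidth is an assumption):
print(decode_tokens_per_sec(70e9, 0.5, 1, 18_000, 8))      # small batch: memory-bound
print(decode_tokens_per_sec(70e9, 0.5, 4096, 18_000, 8))   # huge batch: compute-bound
```

The crossover is the point where halving bytes-per-weight stops helping: once a workload is compute-bound, FP4's remaining advantage comes only from the tensor core TFLOPS ratio, not the memory savings.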
Quality tradeoffs: the honest version. FP4 introduces more quantization error than FP8. Where FP8 inference is considered production-safe for virtually all tasks (quality difference vs FP16 is negligible across standard benchmarks), FP4 is task-dependent:
- Low impact: conversational AI, creative writing, general instruction-following, classification, summarization. Most end users cannot perceive the difference.
- Moderate impact: code generation, factual Q&A on specific domains. Small but measurable accuracy gaps may appear.
- Higher impact: complex multi-step reasoning, math problems, scientific analysis. FP4 quantization errors can compound through reasoning chains, leading to more frequent incorrect conclusions.
The variability depends on whether the model was calibrated for FP4. Models that went through FP4-aware post-training quantization (PTQ) using tools like NVIDIA ModelOpt preserve significantly more quality than models that are dynamically quantized at inference time. For your specific deployment, the only reliable answer is to benchmark your model on your task set.
Real Throughput Numbers on Blackwell FP4
MLPerf Inference v5.0 (April 2025) published the first official B200 FP4 benchmark results for Llama 2 70B. MLPerf Inference v5.1 (September 2025) improved further on those results. Single-GPU FP4 benchmark data for the RTX 5090 is still pending as of March 2026. Here's what's confirmed and what's estimated:
| GPU | Precision | Model | Tokens/sec | Source |
|---|---|---|---|---|
| H100 SXM | FP8 | Llama 2 70B | ~24,525 | MLPerf Inference v4.1 (8-GPU system) |
| H200 SXM | FP8 | Llama 2 70B | ~34,988 | MLPerf Inference v5.0 (8-GPU system, offline) |
| B200 (8-GPU, FP4) | FP4 | Llama 2 70B | ~98,858 | MLPerf Inference v5.0 (8-GPU system, offline) - first official B200 FP4 result |
| B200 (8-GPU, FP4) | FP4 | Llama 2 70B | ~102,725 | MLPerf Inference v5.1 (8-GPU system, offline) |
| B200 single GPU | FP8 | Llama 70B-class | ~6,972* | Estimated: per-GPU H100 SXM (~3,066 tok/s) scaled by B200/H100 FP8 TFLOPS ratio (9,000 ÷ 3,958 ≈ 2.274) |
| B200 single GPU | FP4 | Llama 70B-class | ~13,944* | Estimated as 2× per-GPU FP8 estimate, based on B200 FP4/FP8 TFLOPS ratio (18,000 to 9,000 sparse TFLOPS) |
| RTX 5090 | FP16 | Llama 3.1 8B | ~3,500 | vLLM benchmark (Spheron) |
| RTX 5090 | FP4 | Llama 3.1 8B | benchmark pending | No confirmed public number as of March 2026 |
Estimated values: the single-GPU B200 FP8 figure is derived from the per-GPU H100 SXM baseline (24,525 ÷ 8 ≈ 3,066 tok/s) scaled by the B200/H100 FP8 TFLOPS ratio (9,000 ÷ 3,958 ≈ 2.274), giving ~6,972 tok/s. The FP4 estimate assumes 2× FP8 throughput based on B200's FP4/FP8 sparse TFLOPS ratio (18,000 vs 9,000), giving ~13,944 tok/s. For context, the official 8×B200 FP4 result from MLPerf Inference v5.0 (~98,858 tok/s) is roughly 2.8× the 8×H200 FP8 result from the same round (~34,988 tok/s). Note that per-GPU throughput in multi-GPU (tensor-parallel) systems runs lower than single-GPU estimates due to communication overhead, and real throughput varies with model architecture, batch size, sequence length, and serving framework efficiency. Run your own benchmark before making production decisions.
The RTX 5090 FP16 number (3,500 tok/s for Llama 3.1 8B) is from direct vLLM measurement. FP4 inference numbers for the RTX 5090 are not yet published - framework support is still maturing (see the framework section below).
Cost Per Million Tokens - FP4 vs FP8 vs FP16
The formula for cost per million tokens:
Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000

For 70B-class models (e.g., Llama 3.3 70B), where B200 and H100/H200 are the relevant choices:
| Config | $/hr (on-demand) | $/hr (spot) | Tokens/sec | Cost/M tokens (on-demand) | Cost/M tokens (spot) |
|---|---|---|---|---|---|
| H100 SXM, FP8 | $2.50 | $0.99 | ~3,066* | $0.227* | $0.090* |
| H200 SXM, FP8 | $1.56 | - | ~4,374* | $0.099* | - |
| B200, FP8 | $5.89 | - | ~6,972* | $0.235* | - |
| B200, FP4 | $5.89 | - | ~13,944* | $0.117* | - |
Estimated values: H100 SXM and H200 SXM throughput figures (~3,066 and ~4,374 tok/s) are per-GPU estimates derived by dividing the 8-GPU MLPerf Inference results (24,525 and 34,988 tok/s respectively) by 8, then applied to single-GPU on-demand pricing. B200 throughput figures for single-GPU configurations are extrapolated from NVIDIA's published FP8/FP4 TFLOPS ratios, not confirmed public benchmarks. Treat all starred figures as directional estimates and run your own benchmarks before making production decisions.
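The cost formula above is simple enough to reproduce in a few lines. This sketch plugs in the table's throughput estimates (which, per the caveats above, are directional rather than confirmed benchmarks):

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Cost per 1M generated tokens: ($/hr) / (tok/s * 3600 s) * 1e6."""
    return dollars_per_hour / (tokens_per_sec * 3600) * 1_000_000

# Reproduce two 70B-class rows from the table above:
print(f"H100 SXM FP8: ${cost_per_million_tokens(2.50, 24525 / 8):.3f}/M")  # ~$0.227/M
print(f"B200 FP4:     ${cost_per_million_tokens(5.89, 13944):.3f}/M")      # ~$0.117/M
```

Swap in your own measured tokens/sec and current hourly rates before using this for capacity planning.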
For 8B-class models (e.g., Llama 3.1 8B, Mistral 7B), where RTX 5090 and H100 PCIe are more relevant:
| Config | $/hr (on-demand) | $/hr (spot) | Tokens/sec | Cost/M tokens (on-demand) | Cost/M tokens (spot) |
|---|---|---|---|---|---|
| RTX 5090, FP16 | $0.76 | - | ~3,500 | $0.060 | - |
| RTX 5090, FP4 | $0.76 | - | ~7,000* | $0.030* | - |
| H100 PCIe, FP16 | $2.01 | - | ~3,900 | $0.143 | - |
| H100 PCIe, FP8 | $2.01 | - | ~5,900 | $0.095 | - |
Estimated values: RTX 5090 FP4 throughput and cost figures are estimated at 2x FP16 based on Blackwell's FP4 tensor core ratio. No confirmed public benchmark exists as of March 2026. Verify before using for production cost planning.
GPU pricing fluctuates over time based on availability and market conditions. The rates above are based on Spheron marketplace pricing as of March 15, 2026. Spot pricing varies by GPU and availability. Check current GPU pricing for live rates before making infrastructure decisions.
What the table shows: On-demand B200 at FP4 achieves a cost-per-token (~$0.117/M) that is roughly 2x cheaper than H100 SXM FP8 (~$0.227/M) at current on-demand rates. H100 SXM spot pricing ($0.99/hr) brings its cost-per-token down to ~$0.090/M, which is more competitive with B200 FP4 on-demand. H200 SXM FP8 at $1.56/hr remains the lowest cost-per-token option at ~$0.099/M for 70B-class inference in this comparison. The throughput advantage of B200 FP4 is most meaningful when comparing against H100 SXM on-demand or when VRAM capacity (192 GB on B200 vs 80 GB on H100) is the limiting factor for your workload.
Which Models Support FP4 Well
Not all models tolerate FP4 quantization equally. The key variable is whether the model has been calibrated for FP4 using PTQ tooling, versus dynamically quantized at inference time.
Better FP4 quality outcomes:
- Models specifically trained or calibrated for FP4 - NVIDIA has released FP4-ready weights for several model families via Hugging Face under the `nvidia/` namespace, including Llama 3.1, Llama 3.3, Llama 4 Scout, DeepSeek-R1, and DeepSeek-V3.x, covering both dense and MoE architectures. Mistral Large 3 (675B) also has an NVFP4 checkpoint available at `mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4`, built via llm-compressor in collaboration with vLLM and Red Hat teams.
- Smaller, densely-trained models (7B-13B parameters) - less headroom for quantization error to propagate
- Instruction-following and conversational models - output quality for open-ended generation is less sensitive to small numerical errors than reasoning tasks
- Models with pre-calibrated FP4 weights on Hugging Face under the `nvidia/` namespace. Confirmed NVFP4 releases as of March 2026 include: Llama 4 Scout 17B-16E (`nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4`), Llama 3.3 70B (`nvidia/Llama-3.3-70B-Instruct-NVFP4`), Llama 3.1 8B (`nvidia/Llama-3.1-8B-Instruct-NVFP4`), Llama 3.1 405B (`nvidia/Llama-3.1-405B-Instruct-NVFP4`), DeepSeek-R1 (`nvidia/DeepSeek-R1-NVFP4`), DeepSeek-R1-0528 (`nvidia/DeepSeek-R1-0528-NVFP4`), and DeepSeek-V3.2 (`nvidia/DeepSeek-V3.2-NVFP4`), all ready for TensorRT-LLM or vLLM inference on Blackwell
Higher FP4 quality loss:
- Models quantized without calibration (dynamic quantization from FP16) - errors are larger and distributed less evenly
- Complex reasoning tasks: math, multi-step logic, scientific analysis - quantization errors compound through reasoning chains
- Long-context tasks where small errors accumulate over hundreds of attention steps
- Very large models (200B+) where each layer's errors stack - though these also gain the most from FP4's VRAM reduction
The practical recommendation: if NVIDIA ModelOpt or a provider like Hugging Face offers pre-calibrated FP4 weights for your model, start there. If no calibrated weights exist, run your task-specific evaluation before committing to FP4. Useful benchmarks: MMLU for general knowledge, HumanEval for code generation, or your production task sample set. Don't rely only on perplexity - it doesn't capture task-specific quality degradation well.
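The core of such a task-specific evaluation is just a paired comparison on the same prompts. Here's a minimal sketch of the scoring step, using exact match as a stand-in for whatever metric your task actually needs (pass@1, F1, human review); the stub data is purely illustrative, and in practice the two answer lists come from querying your FP16 and FP4 deployments:

```python
def exact_match_rate(reference_answers, model_answers):
    """Fraction of task items where the model's answer matches the reference."""
    assert len(reference_answers) == len(model_answers)
    hits = sum(r.strip().lower() == m.strip().lower()
               for r, m in zip(reference_answers, model_answers))
    return hits / len(reference_answers)

def fp4_quality_delta(refs, fp16_answers, fp4_answers):
    """Accuracy drop going from the FP16 deployment to the FP4 one."""
    return exact_match_rate(refs, fp16_answers) - exact_match_rate(refs, fp4_answers)

# Toy illustration with stub data:
refs = ["paris", "4", "blue"]
delta = fp4_quality_delta(refs, ["Paris", "4", "blue"], ["Paris", "5", "blue"])
print(f"FP4 accuracy drop: {delta:.1%}")
```

The useful output is the delta, not either absolute score: decide an acceptable threshold for your task before running the comparison, so the result is a go/no-go answer rather than a judgment call.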
Framework Support - How to Actually Use FP4
FP4 framework support is newer and less uniformly available than FP8. Here's the state as of March 2026:
TensorRT-LLM has the most mature FP4 support for Blackwell. Version 0.17 and higher includes native NVFP4 quantization for B200 and other Blackwell GPUs. The recommended workflow uses NVIDIA ModelOpt for PTQ calibration, then TensorRT-LLM for engine building and serving. This is the production-grade path for FP4 inference on datacenter Blackwell GPUs.
vLLM supports FP4 for both mixture-of-experts (MoE) models and dense models on Blackwell. MoE NVFP4 models use NVIDIA's FlashInfer FP4 kernel (enabled via VLLM_USE_FLASHINFER_MOE_FP4=1). Dense NVFP4 models like nvidia/Llama-3.1-8B-Instruct-NVFP4 can now also be served directly with vLLM on Blackwell hardware, using pre-quantized weights from Hugging Face or models quantized via llm-compressor. TensorRT-LLM remains the highest-throughput option for dense NVFP4 models. Check the vLLM changelog for the latest Blackwell FP4 updates.
To serve a pre-quantized FP4 MoE model with vLLM on Blackwell (example with Llama 4 Scout, which is a 17B active / 109B total mixture-of-experts model):
```bash
# For FP4 MoE models on Blackwell, enable the FlashInfer FP4 kernel.
# Pass the variable explicitly with -e so it is set inside the container.
docker run --gpus all -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm/vllm-openai:latest \
  --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4
```

The quantization format is determined by the pre-quantized model weights, not a runtime flag. The `VLLM_USE_FLASHINFER_MOE_FP4=1` variable activates the FlashInfer FP4 kernel for MoE layers specifically. For dense NVFP4 models, TensorRT-LLM delivers maximum throughput; vLLM and SGLang also support dense NVFP4 models on Blackwell hardware. NVIDIA publishes pre-quantized FP4 models on Hugging Face under the `nvidia/` namespace - search for `-FP4` or `-NVFP4` suffix variants. For on-the-fly quantization, use llm-compressor (`vllm-project/llm-compressor`) with the NVFP4 scheme to produce a quantized checkpoint first.
Hugging Face transformers + bitsandbytes: bitsandbytes supports NF4 (used for QLoRA fine-tuning) and INT4 formats, but does not currently support NVIDIA's native NVFP4 tensor core format for inference. These are different 4-bit formats. For NVFP4 inference on Blackwell hardware, all three major serving frameworks work: TensorRT-LLM (highest throughput, recommended for production), vLLM (flexible, Hugging Face-native), and SGLang.
SGLang: SGLang now supports NVFP4 on Blackwell GPUs, including models like Llama-3.1-8B-Instruct and DeepSeek-R1 variants. In collaboration with NVIDIA, SGLang serving DeepSeek-R1 with NVFP4 MoE kernels on Blackwell delivers up to 4x throughput improvement over Hopper for the same workload. Check the SGLang releases for the latest updates.
When NOT to Use FP4
FP4 is not appropriate for every deployment:
- When you haven't validated quality for your specific task - FP4 is not a drop-in replacement for FP8 or FP16. Quality varies by model and task. Always benchmark before production.
- When your model doesn't have calibrated FP4 weights - dynamic FP4 quantization from FP16 produces larger accuracy gaps than PTQ-calibrated weights. If calibrated weights aren't available, you're accepting more quality risk.
- For high-stakes applications - legal, medical, financial, and scientific tasks where small output errors have real consequences. The cost savings don't justify the quality risk in these domains.
- For production training - FP4 training on Blackwell has advanced to production-scale validation but is not yet broadly available in standard training frameworks. NVIDIA demonstrated NVFP4 pretraining at scale for the first time in MLPerf Training v5.1 (November 2025), achieving 3x faster time-to-train versus Hopper with 2,560 Blackwell GPUs. Despite that milestone, most production training workflows still use FP16, BF16, or FP8 for broad framework compatibility and stability. FP8 training remains the most mature reduced-precision training format available for general use today.
- When you're compute-bound, not memory-bound - FP4's advantage is largest when memory bandwidth is the bottleneck (large models, small batch sizes). If you're running small models at large batch sizes and are already compute-saturated, the gains are smaller.
- When framework support isn't yet stable for your stack - FP4 tooling is newer than FP8. If your production stack relies on a framework that doesn't have confirmed, stable FP4 support for Blackwell, deploying FP4 adds operational risk.
Should You Migrate from H100 to Blackwell for FP4?
Here's the decision framework:
Step 1: Run the cost math for your workload.
Calculate your current cost-per-token on H100 (FP8). Then estimate what it would be on B200 FP4, using the table above as a starting point. If B200 FP4 cost-per-token is lower (and it will be for large-batch, large-model workloads), migration makes financial sense if quality holds.
Step 2: Validate quality on your task.
Get FP4-calibrated weights for your model (from NVIDIA NGC, Hugging Face, or via ModelOpt PTQ). Run your task benchmark comparing FP16, FP8, and FP4 outputs. If the quality delta is below your acceptable threshold, proceed.
Step 3: Check availability.
Blackwell GPUs (B200, B300) are newer hardware with constrained supply. Verify that the GPU you need is available on Spheron for your target region and volume before building a migration plan around it.
When migration makes sense:
- You're running 70B+ models where B200's 192 GB VRAM gives you headroom H100 can't provide, and your cost-per-token on B200 FP4 (~$0.117/M on-demand at March 2026 rates) is lower than running H100 SXM on-demand (~$0.227/M)
- Your model has validated FP4 quality on your task set
- You need the highest single-GPU throughput for latency-sensitive serving
When to stay on H100/H200:
- H100 SXM pricing on Spheron (currently $2.50/hr on-demand, $0.99/hr spot as of March 15, 2026) gives you cost-per-token around $0.227/M on-demand or ~$0.090/M on spot for single-GPU 70B FP8 serving; GPU pricing fluctuates with availability
- Your task requires quality that FP4 can't reliably deliver
- You're running models under 30B parameters where the RTX 5090 may be more cost-effective than either
For context on choosing between GPU generations more broadly, see the RTX 5090 vs H100 vs B200 comparison, our GPU memory requirements guide for LLMs, and the NVIDIA B300 Blackwell Ultra guide for where Blackwell Ultra fits in the stack.
Blackwell GPUs with native FP4 support - including the RTX 5090 and B200 - are available on Spheron. Run your own throughput and quality benchmarks before committing to a migration. Start with the cost math, validate on your task, then move.
