NVFP4 vs MXFP4: 4-Bit Quantization Format Decision Guide for LLM Inference (2026)

NVFP4 and MXFP4 are not the same format. Both use E2M1 four-bit values for weights, but they differ in two places that matter for accuracy: block size and scale precision. NVFP4 uses 16-element blocks with an FP8 (E4M3) scale. MXFP4 uses 32-element blocks with an E8M0 scale. On paper, these look like minor technical differences. In practice, they affect which toolchain you use, which hardware supports your checkpoint, and how much quality you lose at inference time.

If you're already running 4-bit inference on Blackwell and want the cost and throughput picture, the FP4 quantization guide covers the full hardware context. For the complete MXFP4 microscaling standard and MR-GPTQ workflow, see the MXFP4 deep dive. This post focuses on the direct format comparison: where the two differ, when each is the better choice, and how to deploy both on Spheron Blackwell GPUs.

The Two Standards: Format Spec Comparison

Both formats store weights in E2M1 encoding: 1 sign bit, 2 exponent bits, 1 mantissa bit. That part is identical. The difference is how each format handles the block-level scaling that makes 4-bit inference work.

Format	Block size	Scale type	Scale bits per block	Hardware native support
OCP MXFP4	32 elements	E8M0	8 bits	NVIDIA Blackwell, AMD MI355X
NVFP4	16 elements	FP8 (E4M3)	8 bits	NVIDIA Blackwell only

The overhead per weight value is identical for both: 8 scale bits divided across the block size. For OCP MXFP4, that's 8 bits / 32 = 0.25 bits of overhead per weight. For NVFP4, it's 8 bits / 16 = 0.5 bits per weight. So NVFP4 carries double the scale storage overhead of MXFP4.

Where NVFP4 pays more in overhead, it gains in accuracy. Smaller blocks mean each group of 16 values shares a scale that fits their local distribution more tightly, whereas a 32-element block in MXFP4 has to accommodate the full range of values across double the elements. When weight distributions have outliers or high within-block variance, NVFP4's smaller blocks preserve more precision.

The scale format difference also matters. E8M0 (8 bits, all exponent, no mantissa) gives a wide dynamic range with 255 distinct scale values but no fine-grained mantissa precision. FP8 E4M3 (4-bit exponent, 3-bit mantissa) trades some dynamic range for 127 distinct scale values with finer mantissa coverage. For typical transformer weight distributions, the higher mantissa bits in FP8 scale tend to outweigh E8M0's wider range.

When discussing quantization tooling for producing either format, see the TensorRT Model Optimizer guide for the full calibration workflow.

Accuracy Head-to-Head

Three factors drive the quality difference between NVFP4 and MXFP4.

Block size. Smaller blocks capture more local distribution shape. In transformer weight matrices, rows often contain 2-5 outlier values with magnitudes much larger than the rest of the block. In a 32-element MXFP4 block, a single outlier pulls the shared scale up, causing the non-outlier values to be rounded aggressively. In a 16-element NVFP4 block, that same outlier affects half as many neighbors. For attention projection layers and certain feedforward weight matrices with high row variance, this difference is measurable in downstream task benchmarks.

Scale precision. FP8 E4M3 scale in NVFP4 provides finer mantissa resolution than E8M0's pure-exponent scale. Most weight distributions do not need the extreme dynamic range of E8M0. They benefit more from finer scale resolution at their working range. This is why NVFP4 tends to outperform MXFP4 at equivalent calibration quality, even controlling for block size.

Calibration interaction. With MR-GPTQ (ICLR 2026), MXFP4 closes most of the gap. MR-GPTQ applies block-wise Hadamard rotation before quantization, distributing outliers across all channels in each block. After rotation, the 32-element blocks no longer have the worst-case outlier concentration, and the E8M0 scale disadvantage shrinks. See the MXFP4 microscaling guide for the full MR-GPTQ explanation.

Directional accuracy summary for Llama-class models:

Configuration	vs BF16 on conversational tasks	vs BF16 on complex reasoning/math
MXFP4 (no MR-GPTQ)	Moderate gap, task-dependent	Noticeable gap
MXFP4 (MR-GPTQ)	Small gap, similar to FP8	Small-to-moderate gap
NVFP4 (MR-GPTQ)	Very small gap	Smaller gap than MXFP4 MR-GPTQ

Exact perplexity figures vary by model, calibration dataset, and task. The numbers above are directional. For confirmed accuracy benchmarks, consult NVIDIA's published ModelOpt results and the MLPerf Inference reports. Do not rely on invented specific figures from any source, including this post, for production quality decisions. Run your own evaluation on your model and task before committing to a format.

Hardware Support Matrix

FP4 requires Blackwell hardware. No pre-Blackwell NVIDIA GPU (H100, H200, A100) has FP4 tensor cores. On those GPUs, AWQ INT4 or GPTQ INT4 are the practical 4-bit options. See the AWQ quantization guide for INT4 on non-Blackwell hardware.

GPU	NVFP4	MXFP4	VRAM	Architecture
B200 on Spheron	Yes	Yes	192 GB HBM3e	Blackwell SM100
B300 on Spheron	Yes	Yes	288 GB HBM3e	Blackwell Ultra SM100
RTX 5090	Yes	Yes	32 GB GDDR7	Blackwell SM100 consumer
RTX PRO 6000 on Spheron	Yes	Yes	96 GB GDDR7	Blackwell SM100 workstation
AMD Instinct MI355X	No	Yes (MFMA)	288 GB HBM3e	CDNA4, ROCm 7.x
NVIDIA H100 / H200	No	No	80-141 GB HBM3	Hopper, FP8 max
NVIDIA A100	No	No	80 GB HBM2e	Ampere, INT8 max

NVFP4 is NVIDIA Blackwell-specific. Any AMD hardware, including the MI355X, runs the MXFP4 OCP standard via MFMA instructions in ROCm 7.x with a different calibration toolchain. If cross-platform portability matters for your deployment, MXFP4 is the only path.

On NVIDIA Blackwell, both formats execute through the same FP4 tensor core hardware. The checkpoint format differs, but the hardware path is identical at runtime. A B200 runs NVFP4 and MXFP4 checkpoints with the same underlying compute units.

Framework Support

Framework	NVFP4	MXFP4 (OCP)	Notes
TensorRT-LLM	Full (v0.17+)	Full	Recommended for datacenter B200/B300 maximum throughput
vLLM	Pre-quantized HF checkpoints	Full	MoE requires `VLLM_USE_FLASHINFER_MOE_FP4=1`
SGLang	Full	Full	Including MoE kernels on Blackwell
nvidia-modelopt	`NVFP4_DEFAULT_CFG`	MXFP4 scheme	Both PTQ workflows supported
llm-compressor	Full	Partial	Used for Mistral Large 3 NVFP4 checkpoint
bitsandbytes	No	No	NF4 is a different, unrelated format

For TensorRT-LLM deployment specifics, including engine build commands and multi-GPU tensor parallelism, see the TensorRT-LLM deployment guide.

vLLM NVFP4 notes. Dense NVFP4 models (e.g., nvidia/Llama-3.1-8B-Instruct-NVFP4, nvidia/Llama-3.3-70B-Instruct-NVFP4) load directly on Blackwell without extra flags. MoE NVFP4 models (e.g., nvidia/Llama-4-Scout-17B-16E-Instruct-FP4, nvidia/DeepSeek-R1-NVFP4) require VLLM_USE_FLASHINFER_MOE_FP4=1 as an environment variable. Dense vs MoE is the only flag distinction in vLLM; the checkpoint format (NVFP4 vs MXFP4 OCP) is handled automatically based on the quantization metadata in the saved checkpoint.

llm-compressor is vllm-project's standalone quantization tool tightly integrated with vLLM. It supports NVFP4 and was used to produce some of the Mistral NVFP4 checkpoints on Hugging Face. If your workflow is vLLM-centric and you do not want the full ModelOpt dependency, llm-compressor is a lighter alternative for NVFP4 calibration.

Throughput and Cost-Per-Token

NVFP4 and MXFP4 reach the same hardware throughput ceiling on Blackwell. Both execute through the same FP4 tensor cores, and the format difference does not change the per-operation compute cost. The choice between them is an accuracy and toolchain decision, not a throughput decision.

For Llama 70B-class models on a single B200, derived from MLPerf Inference v5.1 8-GPU results (102,725 tok/s divided by 8):

Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000

GPU	Format	$/hr (on-demand)	$/hr (spot)	Est. tok/s (Llama 70B)	Cost/M tokens (on-demand)	Cost/M tokens (spot)
B200 SXM6	FP4	$7.37	$2.71	~12,841*	~$0.159*	~$0.059*
B300 SXM6	FP4	$9.02	$3.29	~12,841*	~$0.195*	~$0.071*

The throughput figure is the same for NVFP4 and MXFP4 on the same GPU since both run through identical hardware tensor cores.

RTX 5090 and RTX PRO 6000 on-demand availability fluctuates; check the pricing page for live rates on those SKUs.

Pricing fluctuates based on GPU availability. The prices above are based on 13 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

B300 throughput benchmarks are still maturing in MLPerf reports. The ~12,841 tok/s estimate above uses the B200 per-GPU figure as a proxy, since B300 shares the same SM100 compute architecture. Actual B300 throughput may differ for large models that benefit from its additional 96 GB HBM3e. Run your own benchmarks before making production cost decisions.

When to Use NVFP4 vs MXFP4

Scenario	Recommended format
NVIDIA Blackwell only, maximum output quality	NVFP4
NVIDIA Blackwell, checkpoint needs AMD compatibility	MXFP4
AMD MI355X target	MXFP4 only
Pre-quantized `nvidia/` HuggingFace checkpoint exists	NVFP4 (already packaged)
DIY quantization via ModelOpt on Blackwell	NVFP4 recommended
MoE model (Llama 4, DeepSeek R1) on vLLM	NVFP4 (`VLLM_USE_FLASHINFER_MOE_FP4=1`)
Complex reasoning / math workload	NVFP4 or FP8, run task eval before committing
On-the-fly calibration with llm-compressor	NVFP4

In most NVIDIA-only deployments, NVFP4 is the default choice. The quality advantage is real, pre-quantized checkpoints are available for popular models under the nvidia/ namespace, and the toolchain (ModelOpt, TRT-LLM, vLLM) supports it first-class.

The main reason to pick MXFP4 over NVFP4 is cross-platform portability. If your organization runs a mix of NVIDIA Blackwell and AMD MI355X hardware, MXFP4 is the only format supported by both. MXFP4 with MR-GPTQ calibration closes most of the accuracy gap, so the cost of choosing the cross-platform format is small in most scenarios.

For reasoning-heavy tasks (complex math, multi-step logic, code generation), neither format is a guaranteed substitute for FP8. The additional quantization noise from 4-bit precision compounds through long reasoning chains. Benchmark your specific model and task set before assuming FP4 is acceptable for production.

Migration Steps

Step 1: Provision a Blackwell GPU on Spheron

Go to app.spheron.ai and provision a B200, B300, RTX 5090, or RTX PRO 6000 instance. All four support native FP4 tensor core acceleration. B200 and B300 are the datacenter options with the highest throughput per GPU.

Step 2: Check for a pre-quantized NVFP4 checkpoint

Search Hugging Face for your model under the nvidia/ namespace:

bash

# Examples of available NVFP4 checkpoints as of June 2026:
# nvidia/Llama-3.1-8B-Instruct-NVFP4
# nvidia/Llama-3.3-70B-Instruct-NVFP4
# nvidia/DeepSeek-R1-NVFP4
# nvidia/Llama-4-Scout-17B-16E-Instruct-FP4  (MoE)

Pre-quantized checkpoints skip calibration, are production-validated by NVIDIA, and can load directly into vLLM or TRT-LLM on Blackwell hardware without additional steps.

Step 3: Quantize with ModelOpt if no pre-built checkpoint exists

bash

pip install "nvidia-modelopt[all]>=0.17"

python

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct", torch_dtype="auto")

# For NVFP4 (Blackwell recommended):
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibration_dataloader)

# For MXFP4 OCP (cross-platform):
# mtq.quantize(model, mtq.MXFP4_DEFAULT_CFG, forward_loop=calibration_dataloader)

mtq.export_hf_checkpoint(model, output_dir="./quantized-nvfp4")

Use 512-1024 domain-matched calibration samples. See the MXFP4 microscaling guide for the full MR-GPTQ workflow and calibration dataset guidance.

Step 4: Serve with vLLM on Blackwell

bash

# Dense NVFP4 model (no extra flags needed on Blackwell):
vllm serve nvidia/Llama-3.3-70B-Instruct-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90

# MoE NVFP4 model (requires FlashInfer MoE kernel flag):
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90

Step 5: Validate before production rollout

Run lm-evaluation-harness benchmarks (MMLU, GSM8K, HumanEval) on the quantized checkpoint against your BF16 baseline. Accept the deployment if MMLU degradation is under 1.5%. For higher regression, switch to FP8 or use QAT to recover quality on sensitive layers. For reasoning-heavy tasks, set a tighter threshold.

NVFP4 and MXFP4 both reach their throughput ceiling on Blackwell hardware. B200, B300, RTX 5090, and RTX PRO 6000 all support native FP4 tensor cores. Spheron provides on-demand and spot access to all four with per-minute billing and no contracts.
View GPU pricing → | Get started on Spheron →

STEPS / 04

Quick Setup Guide

Identify your target GPU and runtime
Confirm you are on a Blackwell GPU (B200, B300, RTX 5090, or RTX PRO 6000). Check nvidia-smi for compute capability 10.0 (Blackwell SM100). Choose your serving framework: TensorRT-LLM for maximum throughput on datacenter B200/B300, vLLM for flexibility and Hugging Face model compatibility, SGLang for multi-turn serving.
Check if a pre-quantized NVFP4 checkpoint exists
Search Hugging Face for your model under the nvidia/ namespace with -NVFP4 or -FP4 suffix (e.g., nvidia/Llama-3.3-70B-Instruct-NVFP4, nvidia/DeepSeek-R1-NVFP4). Pre-quantized checkpoints skip the calibration step and are production-validated by NVIDIA. If one exists, prefer it over DIY quantization for accuracy and reliability.
Quantize with ModelOpt if no pre-built checkpoint exists
Install nvidia-modelopt: pip install nvidia-modelopt[all]. Load the BF16 base model. Run modelopt.torch.quantization.quantize() with NVFP4_DEFAULT_CFG for NVFP4 or the MXFP4 scheme for OCP-compliant output. Pass 512-1024 domain-matched calibration samples. Export to TensorRT-LLM or vLLM format.
Deploy on Spheron Blackwell GPU
Provision a B200, B300, RTX 5090, or RTX PRO 6000 instance on Spheron. For vLLM, run vllm serve with the quantized checkpoint path. Dense NVFP4 models load automatically on Blackwell. MoE NVFP4 models require VLLM_USE_FLASHINFER_MOE_FP4=1. Verify with a curl request to the OpenAI-compatible endpoint.

FAQ / 05

Frequently Asked Questions

MXFP4 is the open OCP standard for 4-bit microscaling quantization using 32-element blocks with an E8M0 (8-bit exponent, 0 mantissa) shared scale per block. NVFP4 is NVIDIA's Blackwell tensor core implementation: it uses E2M1 values (1 sign, 2 exponent, 1 mantissa) per weight, but with a 16-element block and an FP8 (E4M3) scale instead of E8M0. Finer block granularity and higher-precision scale give NVFP4 better per-block accuracy at the cost of 2x more scale overhead compared to the OCP MXFP4 standard.

NVFP4 generally achieves lower perplexity and higher task scores than standard MXFP4 for the same model and calibration dataset. The advantage comes from the smaller block size (16 vs 32) and the higher-precision FP8 scale (E4M3 vs E8M0). With MR-GPTQ calibration, MXFP4 closes most of the gap and both formats achieve results close to FP8 for most conversational and instruction-following tasks. For complex reasoning and math, the quality difference is more pronounced.

Yes. vLLM on Blackwell hardware supports pre-quantized NVFP4 model checkpoints from Hugging Face (e.g., nvidia/Llama-3.1-8B-Instruct-NVFP4 for dense models, nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 for MoE with VLLM_USE_FLASHINFER_MOE_FP4=1). MXFP4 OCP-format checkpoints produced by ModelOpt or llm-compressor are also compatible since NVIDIA's Blackwell hardware executes both through the same FP4 tensor core path.

NVIDIA Blackwell GPUs support both: B200 (192 GB HBM3e), B300 Blackwell Ultra (288 GB HBM3e), RTX 5090 (32 GB GDDR7), and RTX PRO 6000 (96 GB GDDR7). AMD Instinct MI355X supports MXFP4 via MFMA instructions under ROCm 7.x but does not support NVFP4. No pre-Blackwell NVIDIA GPU (H100, H200, A100) has FP4 tensor cores.

NVIDIA TensorRT Model Optimizer (nvidia-modelopt) is the recommended tool for both. Use NVFP4_DEFAULT_CFG for NVFP4 checkpoints targeting Blackwell inference via TensorRT-LLM or vLLM. Use the MXFP4 (OCP) scheme for cross-platform or AMD-compatible checkpoints. llm-compressor (vllm-project/llm-compressor) also supports NVFP4 and is tightly integrated with vLLM.

The Two Standards: Format Spec Comparison

Accuracy Head-to-Head

Hardware Support Matrix

Framework Support

Throughput and Cost-Per-Token

When to Use NVFP4 vs MXFP4

Migration Steps

Step 1: Provision a Blackwell GPU on Spheron

Step 2: Check for a pre-quantized NVFP4 checkpoint

Step 3: Quantize with ModelOpt if no pre-built checkpoint exists

Step 4: Serve with vLLM on Blackwell

Step 5: Validate before production rollout

Quick Setup Guide

Identify your target GPU and runtime

Check if a pre-quantized NVFP4 checkpoint exists

Quantize with ModelOpt if no pre-built checkpoint exists

Deploy on Spheron Blackwell GPU

Frequently Asked Questions

01What is the difference between NVFP4 and MXFP4?

02Which is more accurate, NVFP4 or MXFP4?

03Does vLLM support both NVFP4 and MXFP4?

04Which GPUs support NVFP4 and MXFP4 in 2026?

05What quantization toolkit should I use to produce NVFP4 or MXFP4 checkpoints?

Try It on Real GPUs