Comparison

NVFP4 vs MXFP4: 4-Bit Quantization Format Decision Guide for LLM Inference (2026)

nvfp4 vs mxfp4NVFP4 QuantizationMXFP4 QuantizationFP4 InferenceBlackwellLLM InferenceTensorRT-LLMvLLM
NVFP4 vs MXFP4: 4-Bit Quantization Format Decision Guide for LLM Inference (2026)

NVFP4 and MXFP4 are not the same format. Both use E2M1 four-bit values for weights, but they differ in two places that matter for accuracy: block size and scale precision. NVFP4 uses 16-element blocks with an FP8 (E4M3) scale. MXFP4 uses 32-element blocks with an E8M0 scale. On paper, these look like minor technical differences. In practice, they affect which toolchain you use, which hardware supports your checkpoint, and how much quality you lose at inference time.

If you're already running 4-bit inference on Blackwell and want the cost and throughput picture, the FP4 quantization guide covers the full hardware context. For the complete MXFP4 microscaling standard and MR-GPTQ workflow, see the MXFP4 deep dive. This post focuses on the direct format comparison: where the two differ, when each is the better choice, and how to deploy both on Spheron Blackwell GPUs.

The Two Standards: Format Spec Comparison

Both formats store weights in E2M1 encoding: 1 sign bit, 2 exponent bits, 1 mantissa bit. That part is identical. The difference is how each format handles the block-level scaling that makes 4-bit inference work.

FormatBlock sizeScale typeScale bits per blockHardware native support
OCP MXFP432 elementsE8M08 bitsNVIDIA Blackwell, AMD MI355X
NVFP416 elementsFP8 (E4M3)8 bitsNVIDIA Blackwell only

The overhead per weight value is identical for both: 8 scale bits divided across the block size. For OCP MXFP4, that's 8 bits / 32 = 0.25 bits of overhead per weight. For NVFP4, it's 8 bits / 16 = 0.5 bits per weight. So NVFP4 carries double the scale storage overhead of MXFP4.

Where NVFP4 pays more in overhead, it gains in accuracy. Smaller blocks mean each group of 16 values shares a scale that fits their local distribution more tightly, whereas a 32-element block in MXFP4 has to accommodate the full range of values across double the elements. When weight distributions have outliers or high within-block variance, NVFP4's smaller blocks preserve more precision.

The scale format difference also matters. E8M0 (8 bits, all exponent, no mantissa) gives a wide dynamic range with 255 distinct scale values but no fine-grained mantissa precision. FP8 E4M3 (4-bit exponent, 3-bit mantissa) trades some dynamic range for 127 distinct scale values with finer mantissa coverage. For typical transformer weight distributions, the higher mantissa bits in FP8 scale tend to outweigh E8M0's wider range.

When discussing quantization tooling for producing either format, see the TensorRT Model Optimizer guide for the full calibration workflow.

Accuracy Head-to-Head

Three factors drive the quality difference between NVFP4 and MXFP4.

Block size. Smaller blocks capture more local distribution shape. In transformer weight matrices, rows often contain 2-5 outlier values with magnitudes much larger than the rest of the block. In a 32-element MXFP4 block, a single outlier pulls the shared scale up, causing the non-outlier values to be rounded aggressively. In a 16-element NVFP4 block, that same outlier affects half as many neighbors. For attention projection layers and certain feedforward weight matrices with high row variance, this difference is measurable in downstream task benchmarks.

Scale precision. FP8 E4M3 scale in NVFP4 provides finer mantissa resolution than E8M0's pure-exponent scale. Most weight distributions do not need the extreme dynamic range of E8M0. They benefit more from finer scale resolution at their working range. This is why NVFP4 tends to outperform MXFP4 at equivalent calibration quality, even controlling for block size.

Calibration interaction. With MR-GPTQ (ICLR 2026), MXFP4 closes most of the gap. MR-GPTQ applies block-wise Hadamard rotation before quantization, distributing outliers across all channels in each block. After rotation, the 32-element blocks no longer have the worst-case outlier concentration, and the E8M0 scale disadvantage shrinks. See the MXFP4 microscaling guide for the full MR-GPTQ explanation.

Directional accuracy summary for Llama-class models:

Configurationvs BF16 on conversational tasksvs BF16 on complex reasoning/math
MXFP4 (no MR-GPTQ)Moderate gap, task-dependentNoticeable gap
MXFP4 (MR-GPTQ)Small gap, similar to FP8Small-to-moderate gap
NVFP4 (MR-GPTQ)Very small gapSmaller gap than MXFP4 MR-GPTQ

Exact perplexity figures vary by model, calibration dataset, and task. The numbers above are directional. For confirmed accuracy benchmarks, consult NVIDIA's published ModelOpt results and the MLPerf Inference reports. Do not rely on invented specific figures from any source, including this post, for production quality decisions. Run your own evaluation on your model and task before committing to a format.

Hardware Support Matrix

FP4 requires Blackwell hardware. No pre-Blackwell NVIDIA GPU (H100, H200, A100) has FP4 tensor cores. On those GPUs, AWQ INT4 or GPTQ INT4 are the practical 4-bit options. See the AWQ quantization guide for INT4 on non-Blackwell hardware.

GPUNVFP4MXFP4VRAMArchitecture
B200 on SpheronYesYes192 GB HBM3eBlackwell SM100
B300 on SpheronYesYes288 GB HBM3eBlackwell Ultra SM100
RTX 5090YesYes32 GB GDDR7Blackwell SM100 consumer
RTX PRO 6000 on SpheronYesYes96 GB GDDR7Blackwell SM100 workstation
AMD Instinct MI355XNoYes (MFMA)288 GB HBM3eCDNA4, ROCm 7.x
NVIDIA H100 / H200NoNo80-141 GB HBM3Hopper, FP8 max
NVIDIA A100NoNo80 GB HBM2eAmpere, INT8 max

NVFP4 is NVIDIA Blackwell-specific. Any AMD hardware, including the MI355X, runs the MXFP4 OCP standard via MFMA instructions in ROCm 7.x with a different calibration toolchain. If cross-platform portability matters for your deployment, MXFP4 is the only path.

On NVIDIA Blackwell, both formats execute through the same FP4 tensor core hardware. The checkpoint format differs, but the hardware path is identical at runtime. A B200 runs NVFP4 and MXFP4 checkpoints with the same underlying compute units.

Framework Support

FrameworkNVFP4MXFP4 (OCP)Notes
TensorRT-LLMFull (v0.17+)FullRecommended for datacenter B200/B300 maximum throughput
vLLMPre-quantized HF checkpointsFullMoE requires VLLM_USE_FLASHINFER_MOE_FP4=1
SGLangFullFullIncluding MoE kernels on Blackwell
nvidia-modeloptNVFP4_DEFAULT_CFGMXFP4 schemeBoth PTQ workflows supported
llm-compressorFullPartialUsed for Mistral Large 3 NVFP4 checkpoint
bitsandbytesNoNoNF4 is a different, unrelated format

For TensorRT-LLM deployment specifics, including engine build commands and multi-GPU tensor parallelism, see the TensorRT-LLM deployment guide.

vLLM NVFP4 notes. Dense NVFP4 models (e.g., nvidia/Llama-3.1-8B-Instruct-NVFP4, nvidia/Llama-3.3-70B-Instruct-NVFP4) load directly on Blackwell without extra flags. MoE NVFP4 models (e.g., nvidia/Llama-4-Scout-17B-16E-Instruct-FP4, nvidia/DeepSeek-R1-NVFP4) require VLLM_USE_FLASHINFER_MOE_FP4=1 as an environment variable. Dense vs MoE is the only flag distinction in vLLM; the checkpoint format (NVFP4 vs MXFP4 OCP) is handled automatically based on the quantization metadata in the saved checkpoint.

llm-compressor is vllm-project's standalone quantization tool tightly integrated with vLLM. It supports NVFP4 and was used to produce some of the Mistral NVFP4 checkpoints on Hugging Face. If your workflow is vLLM-centric and you do not want the full ModelOpt dependency, llm-compressor is a lighter alternative for NVFP4 calibration.

Throughput and Cost-Per-Token

NVFP4 and MXFP4 reach the same hardware throughput ceiling on Blackwell. Both execute through the same FP4 tensor cores, and the format difference does not change the per-operation compute cost. The choice between them is an accuracy and toolchain decision, not a throughput decision.

For Llama 70B-class models on a single B200, derived from MLPerf Inference v5.1 8-GPU results (102,725 tok/s divided by 8):

Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000
GPUFormat$/hr (on-demand)$/hr (spot)Est. tok/s (Llama 70B)Cost/M tokens (on-demand)Cost/M tokens (spot)
B200 SXM6FP4$7.37$2.71~12,841*~$0.159*~$0.059*
B300 SXM6FP4$9.02$3.29~12,841*~$0.195*~$0.071*

The throughput figure is the same for NVFP4 and MXFP4 on the same GPU since both run through identical hardware tensor cores.

RTX 5090 and RTX PRO 6000 on-demand availability fluctuates; check the pricing page for live rates on those SKUs.

Pricing fluctuates based on GPU availability. The prices above are based on 13 Jun 2026 and may have changed. Check current GPU pricing → for live rates.

B300 throughput benchmarks are still maturing in MLPerf reports. The ~12,841 tok/s estimate above uses the B200 per-GPU figure as a proxy, since B300 shares the same SM100 compute architecture. Actual B300 throughput may differ for large models that benefit from its additional 96 GB HBM3e. Run your own benchmarks before making production cost decisions.

When to Use NVFP4 vs MXFP4

ScenarioRecommended format
NVIDIA Blackwell only, maximum output qualityNVFP4
NVIDIA Blackwell, checkpoint needs AMD compatibilityMXFP4
AMD MI355X targetMXFP4 only
Pre-quantized nvidia/ HuggingFace checkpoint existsNVFP4 (already packaged)
DIY quantization via ModelOpt on BlackwellNVFP4 recommended
MoE model (Llama 4, DeepSeek R1) on vLLMNVFP4 (VLLM_USE_FLASHINFER_MOE_FP4=1)
Complex reasoning / math workloadNVFP4 or FP8, run task eval before committing
On-the-fly calibration with llm-compressorNVFP4

In most NVIDIA-only deployments, NVFP4 is the default choice. The quality advantage is real, pre-quantized checkpoints are available for popular models under the nvidia/ namespace, and the toolchain (ModelOpt, TRT-LLM, vLLM) supports it first-class.

The main reason to pick MXFP4 over NVFP4 is cross-platform portability. If your organization runs a mix of NVIDIA Blackwell and AMD MI355X hardware, MXFP4 is the only format supported by both. MXFP4 with MR-GPTQ calibration closes most of the accuracy gap, so the cost of choosing the cross-platform format is small in most scenarios.

For reasoning-heavy tasks (complex math, multi-step logic, code generation), neither format is a guaranteed substitute for FP8. The additional quantization noise from 4-bit precision compounds through long reasoning chains. Benchmark your specific model and task set before assuming FP4 is acceptable for production.

Migration Steps

Step 1: Provision a Blackwell GPU on Spheron

Go to app.spheron.ai and provision a B200, B300, RTX 5090, or RTX PRO 6000 instance. All four support native FP4 tensor core acceleration. B200 and B300 are the datacenter options with the highest throughput per GPU.

Step 2: Check for a pre-quantized NVFP4 checkpoint

Search Hugging Face for your model under the nvidia/ namespace:

bash
# Examples of available NVFP4 checkpoints as of June 2026:
# nvidia/Llama-3.1-8B-Instruct-NVFP4
# nvidia/Llama-3.3-70B-Instruct-NVFP4
# nvidia/DeepSeek-R1-NVFP4
# nvidia/Llama-4-Scout-17B-16E-Instruct-FP4  (MoE)

Pre-quantized checkpoints skip calibration, are production-validated by NVIDIA, and can load directly into vLLM or TRT-LLM on Blackwell hardware without additional steps.

Step 3: Quantize with ModelOpt if no pre-built checkpoint exists

bash
pip install "nvidia-modelopt[all]>=0.17"
python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct", torch_dtype="auto")

# For NVFP4 (Blackwell recommended):
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibration_dataloader)

# For MXFP4 OCP (cross-platform):
# mtq.quantize(model, mtq.MXFP4_DEFAULT_CFG, forward_loop=calibration_dataloader)

mtq.export_hf_checkpoint(model, output_dir="./quantized-nvfp4")

Use 512-1024 domain-matched calibration samples. See the MXFP4 microscaling guide for the full MR-GPTQ workflow and calibration dataset guidance.

Step 4: Serve with vLLM on Blackwell

bash
# Dense NVFP4 model (no extra flags needed on Blackwell):
vllm serve nvidia/Llama-3.3-70B-Instruct-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90

# MoE NVFP4 model (requires FlashInfer MoE kernel flag):
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90

Step 5: Validate before production rollout

Run lm-evaluation-harness benchmarks (MMLU, GSM8K, HumanEval) on the quantized checkpoint against your BF16 baseline. Accept the deployment if MMLU degradation is under 1.5%. For higher regression, switch to FP8 or use QAT to recover quality on sensitive layers. For reasoning-heavy tasks, set a tighter threshold.


NVFP4 and MXFP4 both reach their throughput ceiling on Blackwell hardware. B200, B300, RTX 5090, and RTX PRO 6000 all support native FP4 tensor cores. Spheron provides on-demand and spot access to all four with per-minute billing and no contracts.

View GPU pricing → | Get started on Spheron →

STEPS / 04

Quick Setup Guide

  1. Identify your target GPU and runtime

    Confirm you are on a Blackwell GPU (B200, B300, RTX 5090, or RTX PRO 6000). Check nvidia-smi for compute capability 10.0 (Blackwell SM100). Choose your serving framework: TensorRT-LLM for maximum throughput on datacenter B200/B300, vLLM for flexibility and Hugging Face model compatibility, SGLang for multi-turn serving.

  2. Check if a pre-quantized NVFP4 checkpoint exists

    Search Hugging Face for your model under the nvidia/ namespace with -NVFP4 or -FP4 suffix (e.g., nvidia/Llama-3.3-70B-Instruct-NVFP4, nvidia/DeepSeek-R1-NVFP4). Pre-quantized checkpoints skip the calibration step and are production-validated by NVIDIA. If one exists, prefer it over DIY quantization for accuracy and reliability.

  3. Quantize with ModelOpt if no pre-built checkpoint exists

    Install nvidia-modelopt: pip install nvidia-modelopt[all]. Load the BF16 base model. Run modelopt.torch.quantization.quantize() with NVFP4_DEFAULT_CFG for NVFP4 or the MXFP4 scheme for OCP-compliant output. Pass 512-1024 domain-matched calibration samples. Export to TensorRT-LLM or vLLM format.

  4. Deploy on Spheron Blackwell GPU

    Provision a B200, B300, RTX 5090, or RTX PRO 6000 instance on Spheron. For vLLM, run vllm serve with the quantized checkpoint path. Dense NVFP4 models load automatically on Blackwell. MoE NVFP4 models require VLLM_USE_FLASHINFER_MOE_FP4=1. Verify with a curl request to the OpenAI-compatible endpoint.

FAQ / 05

Frequently Asked Questions

MXFP4 is the open OCP standard for 4-bit microscaling quantization using 32-element blocks with an E8M0 (8-bit exponent, 0 mantissa) shared scale per block. NVFP4 is NVIDIA's Blackwell tensor core implementation: it uses E2M1 values (1 sign, 2 exponent, 1 mantissa) per weight, but with a 16-element block and an FP8 (E4M3) scale instead of E8M0. Finer block granularity and higher-precision scale give NVFP4 better per-block accuracy at the cost of 2x more scale overhead compared to the OCP MXFP4 standard.

NVFP4 generally achieves lower perplexity and higher task scores than standard MXFP4 for the same model and calibration dataset. The advantage comes from the smaller block size (16 vs 32) and the higher-precision FP8 scale (E4M3 vs E8M0). With MR-GPTQ calibration, MXFP4 closes most of the gap and both formats achieve results close to FP8 for most conversational and instruction-following tasks. For complex reasoning and math, the quality difference is more pronounced.

Yes. vLLM on Blackwell hardware supports pre-quantized NVFP4 model checkpoints from Hugging Face (e.g., nvidia/Llama-3.1-8B-Instruct-NVFP4 for dense models, nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 for MoE with VLLM_USE_FLASHINFER_MOE_FP4=1). MXFP4 OCP-format checkpoints produced by ModelOpt or llm-compressor are also compatible since NVIDIA's Blackwell hardware executes both through the same FP4 tensor core path.

NVIDIA Blackwell GPUs support both: B200 (192 GB HBM3e), B300 Blackwell Ultra (288 GB HBM3e), RTX 5090 (32 GB GDDR7), and RTX PRO 6000 (96 GB GDDR7). AMD Instinct MI355X supports MXFP4 via MFMA instructions under ROCm 7.x but does not support NVFP4. No pre-Blackwell NVIDIA GPU (H100, H200, A100) has FP4 tensor cores.

NVIDIA TensorRT Model Optimizer (nvidia-modelopt) is the recommended tool for both. Use NVFP4_DEFAULT_CFG for NVFP4 checkpoints targeting Blackwell inference via TensorRT-LLM or vLLM. Use the MXFP4 (OCP) scheme for cross-platform or AMD-compatible checkpoints. llm-compressor (vllm-project/llm-compressor) also supports NVFP4 and is tightly integrated with vLLM.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.