NVFP4 and MXFP4 are not the same format. Both use E2M1 four-bit values for weights, but they differ in two places that matter for accuracy: block size and scale precision. NVFP4 uses 16-element blocks with an FP8 (E4M3) scale. MXFP4 uses 32-element blocks with an E8M0 scale. On paper, these look like minor technical differences. In practice, they affect which toolchain you use, which hardware supports your checkpoint, and how much quality you lose at inference time.
If you're already running 4-bit inference on Blackwell and want the cost and throughput picture, the FP4 quantization guide covers the full hardware context. For the complete MXFP4 microscaling standard and MR-GPTQ workflow, see the MXFP4 deep dive. This post focuses on the direct format comparison: where the two differ, when each is the better choice, and how to deploy both on Spheron Blackwell GPUs.
The Two Standards: Format Spec Comparison
Both formats store weights in E2M1 encoding: 1 sign bit, 2 exponent bits, 1 mantissa bit. That part is identical. The difference is how each format handles the block-level scaling that makes 4-bit inference work.
| Format | Block size | Scale type | Scale bits per block | Hardware native support |
|---|---|---|---|---|
| OCP MXFP4 | 32 elements | E8M0 | 8 bits | NVIDIA Blackwell, AMD MI355X |
| NVFP4 | 16 elements | FP8 (E4M3) | 8 bits | NVIDIA Blackwell only |
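To make the shared E2M1 encoding concrete, here is a minimal sketch that decodes every 4-bit code into its real value. It assumes the usual E2M1 layout (exponent bias of 1, subnormals when the exponent field is zero, no Inf or NaN codes); the function name `decode_e2m1` is illustrative and not taken from any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) into its real value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 and an exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# The eight non-negative magnitudes; the other eight codes mirror them with a sign.
E2M1_POSITIVE = [decode_e2m1(code) for code in range(8)]
print(E2M1_POSITIVE)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

With only eight magnitudes available per value, almost all of the usable range comes from the per-block scale, which is why the scale design is where the two formats diverge.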
Both formats spend 8 scale bits per block, but the overhead per weight differs because the block sizes differ. For OCP MXFP4, that's 8 bits / 32 = 0.25 bits of overhead per weight. For NVFP4, it's 8 bits / 16 = 0.5 bits per weight. So NVFP4 carries double the scale storage overhead of MXFP4.
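The arithmetic above can be turned into a rough weight-memory estimate. This is a back-of-the-envelope sketch: it counts only the 4-bit values and the 8-bit block scales, and ignores any per-tensor scales and any layers kept in higher precision. The 70-billion-parameter count is an illustrative assumption, not a measured checkpoint size.

```python
def bits_per_weight(block_size: int, scale_bits: int = 8, value_bits: int = 4) -> float:
    """Effective storage per weight: the value itself plus its share of the block scale."""
    return value_bits + scale_bits / block_size


def weight_gigabytes(num_params: float, block_size: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight(block_size) / 8 / 1e9


mxfp4_bpw = bits_per_weight(32)  # 4.25 bits per weight
nvfp4_bpw = bits_per_weight(16)  # 4.5 bits per weight

params = 70e9  # illustrative parameter count
print(f"MXFP4: {mxfp4_bpw} bits/weight, ~{weight_gigabytes(params, 32):.1f} GB")
print(f"NVFP4: {nvfp4_bpw} bits/weight, ~{weight_gigabytes(params, 16):.1f} GB")
```

Doubling the scale overhead only moves the total from 4.25 to 4.5 bits per weight, roughly a 6% increase in weight storage.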
Where NVFP4 pays more in overhead, it gains in accuracy. Smaller blocks mean each group of 16 values shares a scale that fits their local distribution more tightly, whereas a 32-element block in MXFP4 has to accommodate the full range of values across double the elements. When weight distributions have outliers or high within-block variance, NVFP4's smaller blocks preserve more precision.
The scale format difference also matters. E8M0 (8 bits, all exponent, no mantissa) covers a very wide dynamic range with 255 distinct scale values, but every scale is a power of two. FP8 E4M3 (4-bit exponent, 3-bit mantissa) trades some dynamic range for 127 distinct scale values with eight mantissa steps per power of two. For typical transformer weight distributions, the finer scale resolution of E4M3 tends to matter more than E8M0's wider range.
For quantization tooling that produces either format, see the TensorRT Model Optimizer guide for the full calibration workflow.
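To illustrate the resolution difference, the sketch below enumerates the positive values of each scale format and finds the closest representable scale to an arbitrary target. It assumes E8M0 encodes 2^(code - 127) with code 255 reserved for NaN, and the common E4M3 variant with bias 7, subnormals, and a single NaN mantissa pattern at the top exponent. Real quantizers choose scales with their own rounding rules, so treat this as an illustration of resolution only.

```python
def e8m0_values():
    """Positive E8M0 scales: pure powers of two (code 255 is NaN and is skipped)."""
    return [2.0 ** (code - 127) for code in range(255)]


def e4m3_positive_values():
    """Positive, non-zero E4M3 values (bias 7; exponent 15 with mantissa 7 is NaN)."""
    values = []
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # NaN encoding
            if exp == 0:
                value = (man / 8) * 2.0 ** -6          # subnormal
            else:
                value = (1 + man / 8) * 2.0 ** (exp - 7)  # normal
            if value > 0:
                values.append(value)
    return values


def nearest(target, candidates):
    """Closest representable value to the target."""
    return min(candidates, key=lambda v: abs(v - target))


target = 0.3  # arbitrary "ideal" block scale
for name, values in (("E8M0", e8m0_values()), ("E4M3", e4m3_positive_values())):
    chosen = nearest(target, values)
    print(f"{name}: nearest scale {chosen}, relative error {abs(chosen - target) / target:.3f}")
```

For this target, E8M0 has to fall back to 0.25 (about 17% off), while E4M3 can pick 0.3125 (about 4% off).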
Accuracy Head-to-Head
Three factors drive the quality difference between NVFP4 and MXFP4.
Block size. Smaller blocks capture more local distribution shape. In transformer weight matrices, rows often contain 2-5 outlier values with magnitudes much larger than the rest of the block. In a 32-element MXFP4 block, a single outlier pulls the shared scale up, causing the non-outlier values to be rounded aggressively. In a 16-element NVFP4 block, that same outlier affects half as many neighbors. For attention projection layers and certain feedforward weight matrices with high row variance, this difference is measurable in downstream task benchmarks.
Scale precision. FP8 E4M3 scale in NVFP4 provides finer mantissa resolution than E8M0's pure-exponent scale. Most weight distributions do not need the extreme dynamic range of E8M0. They benefit more from finer scale resolution at their working range. This is why NVFP4 tends to outperform MXFP4 at equivalent calibration quality, even controlling for block size.
Calibration interaction. With MR-GPTQ (ICLR 2026), MXFP4 closes most of the gap. MR-GPTQ applies block-wise Hadamard rotation before quantization, distributing outliers across all channels in each block. After rotation, the 32-element blocks no longer have the worst-case outlier concentration, and the E8M0 scale disadvantage shrinks. See the MXFP4 microscaling guide for the full MR-GPTQ explanation.
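The block-size and rotation effects described above can be sketched on a toy row of weights. This is not MR-GPTQ and not either format's real quantizer: it uses an unquantized real-valued scale (block maximum mapped to 6) to isolate the block-size effect, and a plain normalized Hadamard transform to show how a rotation spreads one outlier across a block. The weights are synthetic and all function names are illustrative.

```python
import math

# Non-negative magnitudes representable in E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Fake-quantize one block with a shared scale chosen so the block max maps to 6."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return list(block)
    scale = max_abs / 6.0
    out = []
    for v in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(nearest * scale, v))
    return out


def quantize(weights, block_size):
    """Quantize a row block by block."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(quantize_block(weights[start:start + block_size]))
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hadamard_rotate(values):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]


# Synthetic row: 31 small weights plus one outlier at index 5.
weights = [0.01 * ((i % 7) - 3) for i in range(32)]
weights[5] = 1.0

err_32 = squared_error(weights, quantize(weights, 32))  # one 32-element block
err_16 = squared_error(weights, quantize(weights, 16))  # two 16-element blocks
peak_before = max(abs(v) for v in weights)
peak_after = max(abs(v) for v in hadamard_rotate(weights))

print(f"squared error, 32-element block:  {err_32:.6f}")
print(f"squared error, 16-element blocks: {err_16:.6f}")
print(f"largest magnitude before / after rotation: {peak_before:.3f} / {peak_after:.3f}")
```

With 16-element blocks, the second half of the row gets its own scale and is no longer flattened by the outlier. The rotation attacks the same problem from the other side by shrinking the largest magnitude in the block, so the shared scale no longer has to stretch to cover a single extreme value.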
Directional accuracy summary for Llama-class models:
| Configuration | vs BF16 on conversational tasks | vs BF16 on complex reasoning/math |
|---|---|---|
| MXFP4 (no MR-GPTQ) | Moderate gap, task-dependent | Noticeable gap |
| MXFP4 (MR-GPTQ) | Small gap, similar to FP8 | Small-to-moderate gap |
| NVFP4 (MR-GPTQ) | Very small gap | Smaller gap than MXFP4 MR-GPTQ |
Exact perplexity figures vary by model, calibration dataset, and task, so the ratings above are directional only. For confirmed accuracy benchmarks, consult NVIDIA's published ModelOpt results and the MLPerf Inference reports. Do not base production quality decisions on unverified figures from any source, including this post. Run your own evaluation on your model and task before committing to a format.
Hardware Support Matrix
FP4 requires Blackwell hardware. No pre-Blackwell NVIDIA GPU (H100, H200, A100) has FP4 tensor cores. On those GPUs, AWQ INT4 or GPTQ INT4 are the practical 4-bit options. See the AWQ quantization guide for INT4 on non-Blackwell hardware.
| GPU | NVFP4 | MXFP4 | VRAM | Architecture |
|---|---|---|---|---|
| B200 on Spheron | Yes | Yes | 192 GB HBM3e | Blackwell SM100 |
| B300 on Spheron | Yes | Yes | 288 GB HBM3e | Blackwell Ultra SM100 |
| RTX 5090 | Yes | Yes | 32 GB GDDR7 | Blackwell SM120 consumer |
| RTX PRO 6000 on Spheron | Yes | Yes | 96 GB GDDR7 | Blackwell SM120 workstation |
| AMD Instinct MI355X | No | Yes (MFMA) | 288 GB HBM3e | CDNA4, ROCm 7.x |
| NVIDIA H100 / H200 | No | No | 80-141 GB HBM3 / HBM3e | Hopper, FP8 max |
| NVIDIA A100 | No | No | 80 GB HBM2e | Ampere, INT8 max |
NVFP4 is NVIDIA Blackwell-specific. AMD's MI355X runs the OCP MXFP4 standard via MFMA instructions in ROCm 7.x, with a different calibration toolchain. If cross-platform portability matters for your deployment, MXFP4 is the only path.
On NVIDIA Blackwell, both formats execute through the same FP4 tensor core hardware. The checkpoint format differs, but the hardware path is identical at runtime. A B200 runs NVFP4 and MXFP4 checkpoints with the same underlying compute units.
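Before loading an FP4 checkpoint, it can help to confirm the GPU generation programmatically. The sketch below parses the output of `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, which assumes a driver recent enough to expose the `compute_cap` query field. Treating a compute capability major version of 10 or higher as Blackwell-class is a simplifying assumption of this example (Hopper reports 9.0), not an official rule.

```python
import subprocess


def parse_compute_caps(nvidia_smi_output: str):
    """Parse lines such as '10.0' into (major, minor) tuples, one per GPU."""
    caps = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if not line:
            continue
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def has_fp4_tensor_cores(cap) -> bool:
    """Assumption for this sketch: major version 10 or higher means Blackwell-class."""
    return cap[0] >= 10


def query_compute_caps():
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_caps(result.stdout)


# Usage on a GPU host:
#   caps = query_compute_caps()
#   print(all(has_fp4_tensor_cores(cap) for cap in caps))
```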
Framework Support
| Framework | NVFP4 | MXFP4 (OCP) | Notes |
|---|---|---|---|
| TensorRT-LLM | Full (v0.17+) | Full | Recommended for datacenter B200/B300 maximum throughput |
| vLLM | Pre-quantized HF checkpoints | Full | MoE requires VLLM_USE_FLASHINFER_MOE_FP4=1 |
| SGLang | Full | Full | Including MoE kernels on Blackwell |
| nvidia-modelopt | NVFP4_DEFAULT_CFG | MXFP4 scheme | Both PTQ workflows supported |
| llm-compressor | Full | Partial | Used for Mistral Large 3 NVFP4 checkpoint |
| bitsandbytes | No | No | NF4 is a different, unrelated format |
For TensorRT-LLM deployment specifics, including engine build commands and multi-GPU tensor parallelism, see the TensorRT-LLM deployment guide.
vLLM NVFP4 notes. Dense NVFP4 models (e.g., nvidia/Llama-3.1-8B-Instruct-NVFP4, nvidia/Llama-3.3-70B-Instruct-NVFP4) load directly on Blackwell without extra flags. MoE NVFP4 models (e.g., nvidia/Llama-4-Scout-17B-16E-Instruct-FP4, nvidia/DeepSeek-R1-NVFP4) require VLLM_USE_FLASHINFER_MOE_FP4=1 as an environment variable. Dense vs MoE is the only flag distinction in vLLM; the checkpoint format (NVFP4 vs MXFP4 OCP) is handled automatically based on the quantization metadata in the saved checkpoint.
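If you launch vLLM from a script, the dense-versus-MoE distinction can be captured in one place. This is a minimal sketch: the helper name `vllm_launch_spec` is hypothetical, and the command-line flags are the ones used elsewhere in this post rather than a complete or recommended configuration.

```python
def vllm_launch_spec(model: str, is_moe: bool, tensor_parallel_size: int = 1):
    """Return (command, extra_env) for serving an NVFP4 checkpoint with vLLM."""
    command = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", "0.90",
    ]
    # Only MoE NVFP4 checkpoints need the FlashInfer MoE FP4 flag.
    extra_env = {"VLLM_USE_FLASHINFER_MOE_FP4": "1"} if is_moe else {}
    return command, extra_env


dense_cmd, dense_env = vllm_launch_spec("nvidia/Llama-3.3-70B-Instruct-NVFP4", is_moe=False)
moe_cmd, moe_env = vllm_launch_spec("nvidia/Llama-4-Scout-17B-16E-Instruct-FP4", is_moe=True)

# To actually start the server on a Blackwell host:
#   import os, subprocess
#   subprocess.run(moe_cmd, env={**os.environ, **moe_env}, check=True)
```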
llm-compressor is vllm-project's standalone quantization tool tightly integrated with vLLM. It supports NVFP4 and was used to produce some of the Mistral NVFP4 checkpoints on Hugging Face. If your workflow is vLLM-centric and you do not want the full ModelOpt dependency, llm-compressor is a lighter alternative for NVFP4 calibration.
Throughput and Cost-Per-Token
NVFP4 and MXFP4 reach the same hardware throughput ceiling on Blackwell. Both execute through the same FP4 tensor cores, and the format difference does not change the per-operation compute cost. The choice between them is an accuracy and toolchain decision, not a throughput decision.
The estimates below are for Llama 70B-class models on a single B200. Per-GPU throughput is derived from MLPerf Inference v5.1 8-GPU results (102,725 tok/s divided by 8), and cost is computed as:
Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000
| GPU | Format | $/hr (on-demand) | $/hr (spot) | Est. tok/s (Llama 70B) | Cost/M tokens (on-demand) | Cost/M tokens (spot) |
|---|---|---|---|---|---|---|
| B200 SXM6 | FP4 | $7.37 | $2.71 | ~12,841* | ~$0.159* | ~$0.059* |
| B300 SXM6 | FP4 | $9.02 | $3.29 | ~12,841* | ~$0.195* | ~$0.071* |
The throughput figure is the same for NVFP4 and MXFP4 on the same GPU since both run through identical hardware tensor cores.
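The cost formula is easy to rerun with your own measured throughput and current prices. The sketch below reproduces the table's figures from the hourly rates and the per-GPU throughput estimate quoted in this post; swap in your own numbers before drawing conclusions.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Cost per 1M tokens = ($/hr) / (tokens/sec * 3,600) * 1,000,000."""
    return dollars_per_hour / (tokens_per_second * 3600) * 1_000_000


# Per-GPU throughput estimate: 8-GPU MLPerf figure divided by 8, rounded.
per_gpu_tok_s = round(102_725 / 8)  # 12841

# Hourly rates as quoted in the table above: (on-demand, spot).
rates = {
    "B200 SXM6": (7.37, 2.71),
    "B300 SXM6": (9.02, 3.29),
}
for gpu, (on_demand, spot) in rates.items():
    print(
        f"{gpu}: ${cost_per_million_tokens(on_demand, per_gpu_tok_s):.3f}/M on-demand, "
        f"${cost_per_million_tokens(spot, per_gpu_tok_s):.3f}/M spot"
    )
```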
RTX 5090 and RTX PRO 6000 on-demand availability fluctuates; check the pricing page for live rates on those SKUs.
Pricing fluctuates with GPU availability. The prices above are as of 13 Jun 2026 and may have changed; check the current GPU pricing page for live rates.
B300 throughput benchmarks are still maturing in MLPerf reports. The ~12,841 tok/s estimate above uses the B200 per-GPU figure as a proxy, since B300 shares the same SM100 compute architecture. Actual B300 throughput may differ for large models that benefit from its additional 96 GB HBM3e. Run your own benchmarks before making production cost decisions.
When to Use NVFP4 vs MXFP4
| Scenario | Recommended format |
|---|---|
| NVIDIA Blackwell only, maximum output quality | NVFP4 |
| NVIDIA Blackwell, checkpoint needs AMD compatibility | MXFP4 |
| AMD MI355X target | MXFP4 only |
| Pre-quantized nvidia/ Hugging Face checkpoint exists | NVFP4 (already packaged) |
| DIY quantization via ModelOpt on Blackwell | NVFP4 recommended |
| MoE model (Llama 4, DeepSeek R1) on vLLM | NVFP4 (VLLM_USE_FLASHINFER_MOE_FP4=1) |
| Complex reasoning / math workload | NVFP4 or FP8, run task eval before committing |
| On-the-fly calibration with llm-compressor | NVFP4 |
In most NVIDIA-only deployments, NVFP4 is the default choice. The quality advantage is real, pre-quantized checkpoints are available for popular models under the nvidia/ namespace, and the toolchain (ModelOpt, TRT-LLM, vLLM) supports it first-class.
The main reason to pick MXFP4 over NVFP4 is cross-platform portability. If your organization runs a mix of NVIDIA Blackwell and AMD MI355X hardware, MXFP4 is the only format supported by both. MXFP4 with MR-GPTQ calibration closes most of the accuracy gap, so the cost of choosing the cross-platform format is small in most scenarios.
For reasoning-heavy tasks (complex math, multi-step logic, code generation), neither format is a guaranteed substitute for FP8. The additional quantization noise from 4-bit precision compounds through long reasoning chains. Benchmark your specific model and task set before assuming FP4 is acceptable for production.
Migration Steps
Step 1: Provision a Blackwell GPU on Spheron
Go to app.spheron.ai and provision a B200, B300, RTX 5090, or RTX PRO 6000 instance. All four support native FP4 tensor core acceleration. B200 and B300 are the datacenter options with the highest throughput per GPU.
Step 2: Check for a pre-quantized NVFP4 checkpoint
Search Hugging Face for your model under the nvidia/ namespace:
# Examples of available NVFP4 checkpoints as of June 2026:
# nvidia/Llama-3.1-8B-Instruct-NVFP4
# nvidia/Llama-3.3-70B-Instruct-NVFP4
# nvidia/DeepSeek-R1-NVFP4
# nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 (MoE)
Pre-quantized checkpoints skip calibration, are production-validated by NVIDIA, and can load directly into vLLM or TRT-LLM on Blackwell hardware without additional steps.
Step 3: Quantize with ModelOpt if no pre-built checkpoint exists
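pip install "nvidia-modelopt[all]>=0.17"
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct", torch_dtype="auto")
# mtq.quantize expects forward_loop to be a callable that runs calibration data through the model.
# calibration_dataloader is assumed to yield dicts of tokenized tensors already on the model's device.
def forward_loop(model):
    for batch in calibration_dataloader:
        model(**batch)
# For NVFP4 (Blackwell recommended):
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
# For MXFP4 OCP (cross-platform):
# model = mtq.quantize(model, mtq.MXFP4_DEFAULT_CFG, forward_loop=forward_loop)
# Export a Hugging Face-style checkpoint that vLLM or TRT-LLM can load.
export_hf_checkpoint(model, export_dir="./quantized-nvfp4")
Use 512-1024 domain-matched calibration samples. See the MXFP4 microscaling guide for the full MR-GPTQ workflow and calibration dataset guidance.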
Step 4: Serve with vLLM on Blackwell
# Dense NVFP4 model (no extra flags needed on Blackwell):
vllm serve nvidia/Llama-3.3-70B-Instruct-NVFP4 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90
# MoE NVFP4 model (requires FlashInfer MoE kernel flag):
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90
Step 5: Validate before production rollout
Run lm-evaluation-harness benchmarks (MMLU, GSM8K, HumanEval) on the quantized checkpoint against your BF16 baseline. Accept the deployment if MMLU degradation is under 1.5%. For higher regression, switch to FP8 or use QAT to recover quality on sensitive layers. For reasoning-heavy tasks, set a tighter threshold.
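The acceptance rule above can be written as a small gate that runs after your evaluation jobs finish. This sketch assumes scores are expressed in percent and reads the 1.5% MMLU threshold as 1.5 percentage points, which is an interpretation rather than something the threshold itself specifies. The scores in the usage lines are made-up placeholders, not measured results.

```python
def regression_report(baseline: dict, quantized: dict, max_drop: dict):
    """Compare quantized scores to the BF16 baseline.

    Returns (passed, drops), where drops[task] is the baseline score minus the
    quantized score, and passed is True only if every drop is under its threshold.
    """
    drops = {task: baseline[task] - quantized[task] for task in max_drop}
    passed = all(drops[task] < max_drop[task] for task in max_drop)
    return passed, drops


# Placeholder scores for illustration only; substitute your lm-evaluation-harness output.
baseline_scores = {"mmlu": 80.0, "gsm8k": 90.0}
quantized_scores = {"mmlu": 79.2, "gsm8k": 88.9}

# Tighter threshold on the reasoning task, as suggested for reasoning-heavy workloads.
thresholds = {"mmlu": 1.5, "gsm8k": 1.0}

passed, drops = regression_report(baseline_scores, quantized_scores, thresholds)
print("accept" if passed else "reject", drops)
```

In this placeholder run, MMLU stays within its threshold but the reasoning task does not, so the gate rejects the checkpoint.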
NVFP4 and MXFP4 both reach their throughput ceiling on Blackwell hardware. B200, B300, RTX 5090, and RTX PRO 6000 all support native FP4 tensor cores. Spheron provides on-demand and spot access to all four with per-minute billing and no contracts.
Quick Setup Guide
Confirm you are on a Blackwell GPU (B200, B300, RTX 5090, or RTX PRO 6000). Check the compute capability reported by nvidia-smi: datacenter Blackwell (B200, B300) reports 10.x, while RTX 5090 and RTX PRO 6000 report 12.0. Choose your serving framework: TensorRT-LLM for maximum throughput on datacenter B200/B300, vLLM for flexibility and Hugging Face model compatibility, SGLang for multi-turn serving.
Search Hugging Face for your model under the nvidia/ namespace with -NVFP4 or -FP4 suffix (e.g., nvidia/Llama-3.3-70B-Instruct-NVFP4, nvidia/DeepSeek-R1-NVFP4). Pre-quantized checkpoints skip the calibration step and are production-validated by NVIDIA. If one exists, prefer it over DIY quantization for accuracy and reliability.
Install nvidia-modelopt: pip install nvidia-modelopt[all]. Load the BF16 base model. Run modelopt.torch.quantization.quantize() with NVFP4_DEFAULT_CFG for NVFP4 or the MXFP4 scheme for OCP-compliant output. Pass 512-1024 domain-matched calibration samples. Export to TensorRT-LLM or vLLM format.
Provision a B200, B300, RTX 5090, or RTX PRO 6000 instance on Spheron. For vLLM, run vllm serve with the quantized checkpoint path. Dense NVFP4 models load automatically on Blackwell. MoE NVFP4 models require VLLM_USE_FLASHINFER_MOE_FP4=1. Verify with a curl request to the OpenAI-compatible endpoint.
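The final verification step can also be scripted instead of run with curl. The sketch below builds a request for the OpenAI-compatible `/v1/completions` route using only the standard library. The base URL `http://localhost:8000` assumes vLLM's default port, and the helper names are illustrative.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str, max_tokens: int = 32):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str, model: str, prompt: str = "Hello") -> str:
    """Send one short completion and return the generated text."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage once the server is running:
#   print(smoke_test("http://localhost:8000", "nvidia/Llama-3.3-70B-Instruct-NVFP4"))
```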
Frequently Asked Questions
What is the difference between NVFP4 and MXFP4?
MXFP4 is the open OCP standard for 4-bit microscaling quantization using 32-element blocks with an E8M0 (8-bit exponent, 0 mantissa) shared scale per block. NVFP4 is NVIDIA's Blackwell tensor core implementation: it uses the same E2M1 values (1 sign, 2 exponent, 1 mantissa) per weight, but with a 16-element block and an FP8 (E4M3) scale instead of E8M0. Finer block granularity and a higher-precision scale give NVFP4 better per-block accuracy at the cost of 2x more scale overhead compared to the OCP MXFP4 standard.
Which format is more accurate?
NVFP4 generally achieves lower perplexity and higher task scores than standard MXFP4 for the same model and calibration dataset. The advantage comes from the smaller block size (16 vs 32) and the higher-precision FP8 scale (E4M3 vs E8M0). With MR-GPTQ calibration, MXFP4 closes most of the gap and both formats achieve results close to FP8 for most conversational and instruction-following tasks. For complex reasoning and math, the quality difference is more pronounced.
Does vLLM support NVFP4 and MXFP4 checkpoints?
Yes. vLLM on Blackwell hardware supports pre-quantized NVFP4 model checkpoints from Hugging Face (e.g., nvidia/Llama-3.1-8B-Instruct-NVFP4 for dense models, nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 for MoE with VLLM_USE_FLASHINFER_MOE_FP4=1). MXFP4 OCP-format checkpoints produced by ModelOpt or llm-compressor are also compatible since NVIDIA's Blackwell hardware executes both through the same FP4 tensor core path.
Which GPUs support NVFP4 and MXFP4?
NVIDIA Blackwell GPUs support both: B200 (192 GB HBM3e), B300 Blackwell Ultra (288 GB HBM3e), RTX 5090 (32 GB GDDR7), and RTX PRO 6000 (96 GB GDDR7). AMD Instinct MI355X supports MXFP4 via MFMA instructions under ROCm 7.x but does not support NVFP4. No pre-Blackwell NVIDIA GPU (H100, H200, A100) has FP4 tensor cores.
Which tool should I use to quantize a model to NVFP4 or MXFP4?
NVIDIA TensorRT Model Optimizer (nvidia-modelopt) is the recommended tool for both. Use NVFP4_DEFAULT_CFG for NVFP4 checkpoints targeting Blackwell inference via TensorRT-LLM or vLLM. Use the MXFP4 (OCP) scheme for cross-platform or AMD-compatible checkpoints. llm-compressor (vllm-project/llm-compressor) also supports NVFP4 and is tightly integrated with vLLM.
