
MXFP4 Quantization on GPU Cloud: Deploy LLMs at 4-Bit Precision (2026)

Written by Mitrasish, Co-founder · Apr 14, 2026

Tags: MXFP4 Quantization · FP4 Inference · Blackwell · LLM Inference · vLLM · TensorRT-LLM · GPU Cloud · Cost Optimization

MXFP4 is the first 4-bit floating-point format with a viable path to production. Not because the format is new, but because Blackwell hardware now executes it natively, and MR-GPTQ (ICLR 2026) solved the calibration quality problem that made earlier FP4 methods impractical. If you're already running FP4 inference on Blackwell and want the cost and throughput context, start with the FP4 quantization guide. This post focuses specifically on the MXFP4 microscaling standard, the distinction from NVFP4, and the full quantization workflow using TensorRT Model Optimizer.

What Is MXFP4 Microscaling

Standard 4-bit quantization (INT4) assigns a fixed range to all values in a tensor. That works acceptably for weights but breaks down for activations, which can have outlier values spanning several orders of magnitude. MXFP4 addresses this with block scaling: instead of one scale factor per tensor, there is one shared 8-bit exponent per block of 32 values, each value encoded in E2M1 format (1 sign bit, 2 exponent bits, 1 mantissa bit).

The result is a much wider dynamic range per block without paying per-value exponent overhead. Each 32-value block gets to "zoom in" on its own range, rather than sharing a single scale with the entire tensor.
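The block-scaling idea is easy to see in a few lines of NumPy. The sketch below is a software simulation, not the hardware bit layout: the real format stores a shared 8-bit exponent per block, which is approximated here by a power-of-two scale, and `mxfp4_quantize` is an illustrative name.

```python
import numpy as np

# The 8 non-negative magnitudes representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(values, block_size=32):
    """Fake-quantize a 1-D tensor: one power-of-two scale per block of 32,
    each value snapped to the nearest E2M1 magnitude."""
    values = np.asarray(values, dtype=np.float64)
    out = np.empty(len(values), dtype=np.float64)
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = np.abs(block).max()
        if amax == 0.0:
            out[start:start + block_size] = 0.0
            continue
        # Shared scale: smallest power of two that maps the block max
        # onto [0, 6], since 6.0 is the largest E2M1 magnitude.
        scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
        scaled = np.abs(block) / scale
        idx = np.abs(scaled[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(block) * E2M1_GRID[idx] * scale
    return out
```

Because each block picks its own scale, a block of small activations and a neighboring block containing an outlier are quantized against different ranges, which is exactly what a single per-tensor scale cannot do.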

| Format | Bits | Bytes/param | 70B model size | GPU support | Quality vs BF16 |
|---|---|---|---|---|---|
| BF16 | 16 | 2 bytes | ~140 GB | All modern GPUs | Reference |
| FP8 | 8 | 1 byte | ~70 GB | H100, H200, Blackwell | Negligible diff |
| AWQ INT4 | 4 | 0.5 bytes | ~35 GB | All CUDA GPUs | Small diff on most tasks |
| GPTQ INT4 | 4 | 0.5 bytes | ~35 GB | All CUDA GPUs | Slightly worse than AWQ |
| MXFP4 / NVFP4 | 4 | 0.5 bytes | ~35 GB | Blackwell, AMD MI355 | Small diff with MR-GPTQ PTQ |

For a full comparison of AWQ, GPTQ, and GGUF on non-Blackwell hardware, see the AWQ quantization guide and the GGUF deployment guide.

MXFP4 vs NVFP4 terminology. MXFP4 is the OCP open standard. NVFP4 is NVIDIA's hardware implementation of a compatible E2M1 format on Blackwell tensor cores. When NVIDIA publishes checkpoints under the nvidia/ namespace on Hugging Face, the weights are in a format compatible with both. AMD's MI355X implements the same MXFP4 block-scaling concept via MFMA instructions in ROCm 7.x, but requires a different toolchain. Throughout this post, "MXFP4" refers to the quantization method and checkpoint format; "NVFP4" refers to NVIDIA's specific Blackwell tensor core execution.

Hardware Support: Blackwell, AMD MI355, and RTX 5090

| GPU | FP4 support | VRAM | Notes |
|---|---|---|---|
| B200 SXM6 | Native NVFP4 | 192 GB HBM3e | Highest FP4 throughput per GPU |
| B300 (Blackwell Ultra) | Native NVFP4 | 288 GB HBM3e | Same architecture, more memory |
| RTX 5090 | Native NVFP4 | 32 GB GDDR7 | Accessible for smaller models |
| RTX PRO 6000 | Native NVFP4 | 96 GB GDDR7 | Workstation Blackwell |
| AMD MI355X | MXFP4 (MFMA) | 288 GB HBM3e | ROCm 7.x, different toolchain |
| H100 SXM / PCIe | No FP4 | 80 GB HBM3 | FP8 maximum |
| H200 SXM | No FP4 | 141 GB HBM3e | FP8 maximum |
| A100 | No FP4 | 80 GB HBM2e | INT8 maximum |
| L40S | No FP4 | 48 GB GDDR6 | AWQ INT4 for 4-bit |

The B200 delivers approximately 18,000 sparse FP4 TFLOPS vs 9,000 sparse FP8 TFLOPS on the same chip. That 2x hardware ratio is why FP4 on Blackwell produces real throughput gains, not just memory savings. For full B200 specs, benchmarks, and pricing context, see the NVIDIA B200 complete guide.

For AMD MI355X: the MFMA instructions support MXFP4 block-scaled computation, but the quantization toolchain (ROCm, HIP) differs from the NVIDIA path described in this post. Check AMD ROCm documentation for MI355X-specific setup. For a broader comparison of ROCm and CUDA toolchains for GPU cloud workloads, see the ROCm vs CUDA guide.

For workloads on H100, H200, or A100 where FP4 is not an option, see the GPU memory requirements guide for VRAM planning with INT8 and INT4 formats.

MR-GPTQ: The ICLR 2026 Technique That Makes FP4 Work

The core problem with naive FP4 quantization is weight outliers. In transformer models, a small percentage of weight channels have values much larger than the rest. When you apply a single block scale to 32 values that includes an outlier, the non-outlier values get rounded aggressively, and quantization error spikes.

MR-GPTQ (Micro-Rotated-GPTQ) addresses this with block-wise Hadamard transforms. Before quantization, it rotates the weight matrix basis using a structured Hadamard matrix so outlier values are distributed across all channels. After the rotation, the distribution within each 32-value block is much more uniform, and the block scale can represent all values accurately. At inference time, fused kernels apply the inverse transform, so the output is equivalent to the original weights.
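A toy example makes the effect of the rotation concrete. This is a sketch of the general Hadamard-rotation idea, not MR-GPTQ's fused per-block kernels: one outlier channel dominates the original vector, and after an orthonormal Hadamard rotation its energy is spread across all positions, so no single value forces a large block scale.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = hadamard(32)

# A 32-value block with one outlier channel, as seen in transformer weights
w = np.full(32, 0.1)
w[7] = 50.0

rotated = H @ w       # outlier energy is spread across all 32 positions

# The rotation is orthonormal, so it is exactly invertible; fused
# inference kernels apply the transpose to undo it after the matmul.
restored = H.T @ rotated
```

After the rotation the largest magnitude drops from 50 to under 10, so the shared block scale no longer crushes the small values, and `H.T @ rotated` recovers the original weights exactly.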

The key results from the ICLR 2026 paper:

  • MMLU scores for MR-GPTQ FP4 match or exceed AWQ INT4 on Llama-class 70B models
  • B200 FP4 layer-wise speedup with MR-GPTQ is approximately 3.6x over BF16; end-to-end throughput for 70B Llama-class models is approximately 2.0-2.2x over BF16
  • The quality gap between MR-GPTQ FP4 and FP8 is smaller than the gap between naive FP4 and FP8

NVIDIA's TensorRT Model Optimizer (modelopt) supports FP4 quantization workflows for MXFP4/NVFP4 checkpoints targeting Blackwell hardware. Check the current modelopt release notes for which PTQ algorithms and versions are available in your installed version.

Step-by-Step: Quantize a 70B Model to MXFP4 with TensorRT Model Optimizer

Most production deployments should start with pre-quantized NVIDIA checkpoints (see the next section) rather than running quantization from scratch. If your model doesn't have an existing NVFP4 checkpoint, here's the modelopt workflow.

Installation:

```bash
pip install "nvidia-modelopt[torch]"
```

Prepare calibration data. You need 128-512 representative samples from your target domain. For general models, the Pile or C4 subsets work. Domain-specific deployments benefit from domain-matched calibration.
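As a sketch of what that preparation can look like, the hypothetical helper below tokenizes a list of domain texts into per-sample batches (it matches the `build_your_dataloader` placeholder in the script that follows; check the modelopt documentation for the exact calibration format your version expects):

```python
def build_calib_batches(tokenizer, texts, num_samples=128, max_length=512):
    """Tokenize up to num_samples representative texts for calibration.

    tokenizer: any callable with a Hugging Face-style signature
    (text, truncation=, max_length=, return_tensors=).
    Returns a list of tokenized batches to drive the calibration forward pass.
    """
    batches = []
    for text in texts[:num_samples]:
        batches.append(
            tokenizer(
                text,
                truncation=True,
                max_length=max_length,
                return_tensors="pt",
            )
        )
    return batches
```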

Quantization script (pattern, not pinned API):

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.3-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a calibration dataloader (implement your own or use a dataset utility):
# calib_dataloader = build_your_dataloader(tokenizer, num_samples=128)

# Apply MR-GPTQ with the NVFP4 scheme and save
# (uncomment all three lines after implementing calib_dataloader above):
# mtq.quantize(model, config=mtq.NVFP4_DEFAULT_CFG, forward_loop=calib_dataloader)
# model.save_pretrained("Llama-3.3-70B-Instruct-NVFP4")
# tokenizer.save_pretrained("Llama-3.3-70B-Instruct-NVFP4")
```

Note: Always check the TensorRT Model Optimizer repository for the current API. The quantize() signature and config names have changed between versions 0.17 and 0.21. The pattern above is illustrative; verify against the version you install.

Hardware requirements for the quantization step:

  • 7B-13B models: RTX 4090 (24 GB) or larger
  • 70B models: A100 80G or H100 (model must fit in VRAM during the calibration forward pass)
  • Estimated time: 45-90 minutes for 70B on a single A100 80G

Building the TensorRT-LLM engine (optional, for maximum throughput):

```bash
# Build a TensorRT-LLM engine from the NVFP4 checkpoint
trtllm-build \
  --checkpoint-dir ./Llama-3.3-70B-Instruct-NVFP4 \
  --output-dir ./trt-engine-fp4 \
  --gemm-plugin fp4 \
  --tp-size 1 \
  --max-batch-size 16
```

Note: Engine build targets Blackwell SM100 architecture. Building on non-Blackwell hardware is not supported for FP4 engines. Check the TensorRT-LLM releases for current FP4 support flags, as --gemm-plugin and related options change per release.

Engine build time: approximately 5-20 minutes for a 70B model on a B200.

Deploy MXFP4 Models on GPU Cloud with vLLM

The fastest path to production uses pre-quantized NVIDIA checkpoints from Hugging Face. NVIDIA publishes NVFP4 checkpoints under the nvidia/ namespace; search for -NVFP4 or -FP4 suffix variants. Confirmed releases include nvidia/Llama-3.3-70B-Instruct-NVFP4, nvidia/Llama-3.1-8B-Instruct-NVFP4, nvidia/Llama-4-Scout-17B-16E-Instruct-FP4, nvidia/DeepSeek-R1-NVFP4, and others. These model IDs may be updated or supplemented after the publish date of this post.

Path A: Pre-quantized dense NVFP4 model (no extra flags needed):

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model nvidia/Llama-3.3-70B-Instruct-NVFP4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192
```

Path B: MoE NVFP4 model (requires FlashInfer FP4 kernel):

```bash
docker run --gpus all -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm/vllm-openai:latest \
  --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4
```

Path C: Your own quantized checkpoint:

```bash
docker run --gpus all -p 8000:8000 \
  -v /path/to/your-nvfp4-checkpoint:/model \
  vllm/vllm-openai:latest \
  --model /model \
  --gpu-memory-utilization 0.92
```

Verify the endpoint:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-NVFP4",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'
```
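The same smoke test from Python, using only the standard library (a minimal sketch; the `model` field must match the checkpoint the server was launched with, and `chat` is an illustrative helper name):

```python
import json
import urllib.request

def build_chat_payload(model, prompt, max_tokens=50):
    """Assemble an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, prompt):
    """POST the payload to a vLLM OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# reply = chat("http://localhost:8000",
#              "nvidia/Llama-3.3-70B-Instruct-NVFP4", "Hello")
# print(reply["choices"][0]["message"]["content"])
```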

For multi-GPU setups and load balancing configuration, see the vLLM production deployment guide.

Deploy with TensorRT-LLM (Maximum Throughput Path)

TensorRT-LLM delivers higher throughput than vLLM for NVFP4 on Blackwell, at the cost of an engine build step and less flexibility for model swapping.

```bash
# Serve with TensorRT-LLM after building the engine
python -m tensorrt_llm.serve \
  --engine-dir ./trt-engine-fp4 \
  --port 8000
```

TensorRT-LLM v0.17.0 and later support NVFP4. Check the TensorRT-LLM releases for current FP4 support status and updated build flags.

For a throughput comparison across vLLM, TensorRT-LLM, and SGLang on GPU cloud hardware, see inference framework benchmarks.

Benchmarks: MXFP4 vs AWQ vs FP8

| GPU | Precision | Tokens/sec (Llama 3.3 70B) | VRAM (weights) | $/hr (Spheron) | Cost/1M tokens |
|---|---|---|---|---|---|
| B200 SXM6 | NVFP4 | ~12,841* | ~35 GB | $7.43 | ~$0.161* |
| B200 SXM6 | FP8 | ~6,972* | ~70 GB | $7.43 | ~$0.296* |
| H100 SXM | FP8 | ~3,066* | ~70 GB | $2.90 | ~$0.263* |
| A100 80G SXM4 | AWQ INT4 | ~1,800* | ~35 GB | $1.64 | ~$0.253* |

Formula: Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000

B200 FP4 throughput is derived from MLPerf Inference v5.1 (8-GPU result of ~102,725 tok/s divided by 8). For the most recent MLPerf data, see the MLPerf Inference v6 results guide.
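The cost column is reproducible from that formula and the throughput figures in the table (the throughput inputs are the estimates marked with *):

```python
def cost_per_million_tokens(price_per_hour, tokens_per_sec):
    """Cost to generate 1M tokens at a sustained generation rate."""
    return price_per_hour / (tokens_per_sec * 3600) * 1_000_000

# B200 NVFP4: per-GPU rate derived from the 8-GPU MLPerf result
b200_fp4_tps = 102_725 / 8  # ~12,841 tokens/sec

print(round(cost_per_million_tokens(7.43, b200_fp4_tps), 3))  # 0.161
print(round(cost_per_million_tokens(7.43, 6_972), 3))         # 0.296
print(round(cost_per_million_tokens(2.90, 3_066), 3))         # 0.263
print(round(cost_per_million_tokens(1.64, 1_800), 3))         # 0.253
```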

**Estimated values:** Throughput figures are derived from MLPerf benchmarks and extrapolations (see the FP4 quantization guide for methodology). A100 AWQ throughput is estimated from vLLM benchmark data. Real throughput depends on batch size, sequence length, and serving framework. Run your own benchmarks before making production decisions.

Pricing fluctuates based on GPU availability. The prices above are based on 14 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

GPU Memory Savings: How Many Users Can You Serve at FP4 vs FP16

| Model | BF16 VRAM | NVFP4 VRAM | Savings | Single-GPU fit | Concurrent users (2K ctx, BF16) | Concurrent users (2K ctx, FP4) |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | ~16 GB | ~4 GB | 75% | RTX 5090 (FP4) | ~8 (RTX 5090) | ~28 (RTX 5090) |
| Llama 3.3 70B | ~140 GB | ~35 GB | 75% | B200 (FP4 only) | Needs 2x A100 | Single B200 |
| Llama 3.1 405B | ~810 GB | ~200 GB | 75% | 2x B200 (TP) | Not single-GPU | 2-3 B200 (TP) |
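The weight-memory column is simple arithmetic: parameter count times bytes per parameter. The sketch below ignores the small per-block scale overhead of NVFP4 (roughly 3%) and the KV cache, which dominates at long contexts:

```python
BYTES_PER_PARAM = {
    "bf16": 2.0,
    "fp8": 1.0,
    "int4": 0.5,
    "nvfp4": 0.5,  # plus ~1 byte per 32 values for block scales, ignored here
}

def weight_vram_gb(params_billions, fmt):
    """Approximate weight memory: 1e9 params at 1 byte/param is ~1 GB."""
    return params_billions * BYTES_PER_PARAM[fmt]

print(weight_vram_gb(70, "bf16"))   # 140.0
print(weight_vram_gb(70, "nvfp4"))  # 35.0
print(weight_vram_gb(8, "nvfp4"))   # 4.0
```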

For KV cache sizing methodology and full VRAM calculations, see the GPU memory requirements guide.

A 70B model that needed 2x A100 80G SXM4 in BF16 fits on a single B200 with NVFP4. Consolidating from 2x A100 ($3.28/hr combined) to a single B200 ($7.43/hr) yields roughly 7x the throughput at roughly 36% lower cost per million tokens.

Spheron GPU Pricing for MXFP4 Workloads

| GPU | VRAM | FP4 support | On-demand $/hr | Spot $/hr | Best for |
|---|---|---|---|---|---|
| B200 | 192 GB | Native NVFP4 | $7.43 | $1.71 | 70B+ MXFP4 production |
| A100 80G | 80 GB | AWQ INT4 only | $1.64 | $0.45 | 70B AWQ; no FP4 |
| L40S 48G | 48 GB | AWQ INT4 only | $0.72 | $0.32 | 34B AWQ; no FP4 |
| RTX 5090 | 32 GB | Native NVFP4 | $0.86 | - | 8B-13B MXFP4 dev |

Pricing fluctuates based on GPU availability. The prices above are based on 14 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Common Pitfalls

Confusing MXFP4 with bitsandbytes NF4. NF4 is the 4-bit format used in QLoRA fine-tuning. NVFP4 is NVIDIA's native tensor core format. They are not interchangeable, and bitsandbytes does not support NVFP4 for inference.

Skipping calibration. Running naive FP4 without MR-GPTQ or equivalent PTQ produces much larger quantization errors than a calibrated checkpoint. If calibrated weights don't exist for your model, budget 45-90 minutes for the modelopt calibration pass before assuming FP4 quality is acceptable.

Forgetting VLLM_USE_FLASHINFER_MOE_FP4=1 for MoE models. Dense NVFP4 models load without it. MoE NVFP4 models (Llama 4 Scout, DeepSeek MoE variants) require this flag for the FlashInfer FP4 kernel to activate.

Expecting FP4 acceleration on H100 or A100. vLLM can load NVFP4 weights on H100/A100 via a software fallback (Marlin FP4), which saves memory but provides no throughput improvement over FP8. Hardware FP4 tensor core operations only run on Blackwell.

Wrong block_size during quantization. The MXFP4 standard specifies 32 values per block. Using a different block size produces a checkpoint that is incompatible with Blackwell's FP4 tensor core expectations. The modelopt default (NVFP4_DEFAULT_CFG) sets this correctly.

Not validating on reasoning or math tasks. FP4 errors compound through multi-step reasoning chains. A model that passes general instruction-following benchmarks can still fail on complex math or code generation. Test on your actual task before going to production.

Building TensorRT-LLM engines on non-Blackwell hardware. The SM100 compilation target requires a physical Blackwell GPU. You cannot cross-compile FP4 TensorRT engines on an H100 or A100 instance.


Blackwell B200 GPUs with native MXFP4 support are available on Spheron. Start with a pre-quantized checkpoint from Hugging Face, deploy with vLLM, and validate quality on your task before committing.

Rent B200 → | Rent RTX 5090 → | View all GPU pricing →
