NVIDIA TensorRT Model Optimizer (ModelOpt): FP8, INT4, and FP4 Quantization Guide (2026)

Running a 70B model at FP16 when FP8 or INT4 would do the same job means passing up a 30-50% reduction in cost per million tokens. The problem historically was that each precision format required a different tool: AutoAWQ for INT4, SmoothQuant for activation-heavy models, custom GPTQ forks for various architectures. NVIDIA ModelOpt consolidates all of that into one library. For background on the underlying precision formats, see the FP8 quantization guide and the AWQ quantization guide.

What ModelOpt Is and Why NVIDIA Built It

Before ModelOpt, the quantization tooling landscape was fragmented. You had AutoAWQ for INT4 on vLLM, SmoothQuant for INT8 activation quantization, GPTQ variants maintained by different community forks, and separate paths for FP8 via the Transformer Engine. Switching formats meant switching tools. Combining quantization with pruning or sparsity required gluing incompatible libraries together.

ModelOpt (full name: TensorRT Model Optimizer, package name: nvidia-modelopt) is NVIDIA's answer to this fragmentation. It exposes four core capabilities through a unified Python API:

Post-training quantization (PTQ): FP8, INT4 (AWQ and GPTQ variants), FP4 (MXFP4/NVFP4), and INT8. Runs in-place on a loaded model with a calibration dataloader.
Quantization-aware training (QAT): Fine-tunes a model with simulated quantization noise in the forward pass, recovering accuracy lost during PTQ. Critical for tasks sensitive to precision (complex math, multi-step reasoning). Google's June 2026 Gemma 4 QAT checkpoints are a production example of QAT at scale: the 31B model fits on an L40S at w4a16 with minimal quality loss. See the Gemma 4 QAT deployment guide for the vLLM deployment walkthrough.
Sparsity: Structured (2:4 sparsity) and unstructured weight pruning with accuracy recovery. NVIDIA Ampere and later GPUs accelerate 2:4 sparse matmul in hardware, at up to 2x the dense Tensor Core throughput.
Pruning: Channel and layer pruning to reduce model architecture size before deployment.
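
To make the 2:4 pattern in the sparsity item concrete, here is a minimal sketch in plain Python. It assumes simple magnitude-based selection, where every group of four weights keeps its two largest-magnitude values. The function name prune_2_of_4 is hypothetical, and ModelOpt's own sparsity algorithms may choose which weights to keep differently.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("row length must be a multiple of 4 for 2:4 sparsity")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Every group of four ends up with at most two non-zero values, which is the layout that sparse Tensor Cores can exploit.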

The output of any ModelOpt workflow exports to TensorRT-LLM, vLLM, SGLang, or a standard HuggingFace checkpoint, depending on which serving stack you target.

ModelOpt vs Standalone Quantization Tools

Tool	Format supported	Hardware target	Export path	CPU-compatible	Recommended use case
ModelOpt	FP8, INT4, FP4, INT8, QAT	Hopper, Blackwell (GPU-native)	TRT-LLM, vLLM, SGLang, HF	No	Unified GPU inference, datacenter deployment
AutoAWQ	INT4 (AWQ)	All CUDA GPUs	vLLM, SGLang, AutoGPTQ-compatible	Partial (llama.cpp)	INT4 inference, pre-quantized HF checkpoints
GGUF/llama.cpp	INT4, INT8, Q2-Q8 mixed	CPU + NVIDIA + Apple Silicon	llama.cpp, Ollama	Yes	Local/edge/CPU inference
SmoothQuant	INT8	All CUDA GPUs	FasterTransformer, custom	No	Legacy INT8 deployments, older Ampere stacks
MXFP4 standalone	FP4 (MXFP4)	Blackwell, AMD MI355	TRT-LLM, vLLM	No	Blackwell FP4 when full ModelOpt isn't needed
GPTQ	INT4	All CUDA GPUs	vLLM, AutoGPTQ	No	INT4 when AWQ checkpoints don't exist

For the GGUF workflow and CPU inference, see the GGUF dynamic quantization guide. For MXFP4 specifically on Blackwell, see the MXFP4 microscaling guide.

When to pick ModelOpt:

You need FP8 for Hopper, FP4 for Blackwell, and INT4 as a fallback, and want one tool to handle all three.
You need QAT recovery after PTQ degrades accuracy below your threshold.
You're deploying to TensorRT-LLM and want the officially supported calibration path.
You want structured pruning or 2:4 sparsity alongside quantization.

When not to pick ModelOpt:

You need CPU inference. ModelOpt output does not run on CPU; use GGUF/llama.cpp.
Your model already has pre-calibrated AWQ weights on Hugging Face. Load them directly into vLLM with --quantization awq.

Calibration Dataset Selection and Accuracy Preservation

ModelOpt PTQ calibration works by running a forward pass through the model with representative samples and recording activation statistics. Those statistics are used to set per-tensor (or per-channel) quantization scales that minimize rounding error. The quality of the calibration dataset directly affects the accuracy of the quantized model.
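
The effect of scale granularity can be sketched with a toy example. The code below assumes plain absmax calibration to a symmetric INT8 range, a simplification and not ModelOpt's actual calibrator, and compares one shared per-tensor scale against per-channel scales. All function names are hypothetical.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest observed magnitude onto the integer range."""
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0


def fake_quantize(values, scale, qmax=127):
    """Quantize to integers, clamp, then dequantize back to floats."""
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Two "channels" with very different dynamic ranges (toy calibration statistics)
channels = [
    [0.02, -0.01, 0.03, -0.025],
    [8.0, -6.5, 7.2, -3.1],
]

# Per-tensor: one scale shared by every channel
flat = [v for ch in channels for v in ch]
tensor_scale = absmax_scale(flat)
per_tensor_err = mean_abs_error(flat, fake_quantize(flat, tensor_scale))

# Per-channel: each channel gets its own scale
per_channel_out = []
for ch in channels:
    per_channel_out.extend(fake_quantize(ch, absmax_scale(ch)))
per_channel_err = mean_abs_error(flat, per_channel_out)

print(f"per-tensor error:  {per_tensor_err:.5f}")
print(f"per-channel error: {per_channel_err:.5f}")
```

With a shared scale, the small-magnitude channel rounds to zero because the scale is set by the large channel. Per-channel scales avoid that, which is why the recorded statistics matter so much.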

Dataset size. 512 samples is the practical minimum. 1024 is the sweet spot for most production models. Beyond 1024, accuracy improvements on MMLU and GSM8K are typically below 0.1%. Using 2048+ samples adds calibration time with negligible accuracy return.

Domain matching. If you're deploying a coding assistant, calibrate on code. If your model serves customer support, use conversation transcripts. A mismatch between calibration data and production traffic skews the activation scales toward the wrong distribution and introduces avoidable accuracy loss.

Concrete dataset choices:

Llama 4 / Llama 3.x (instruction-tuned): A 1024-sample slice of ShareGPT (conversation format) or OpenHermes 2.5 (diverse instruction-following).
Qwen 3.5 (multilingual): Use a multilingual sample covering the top languages in your traffic. Calibrating only on English skews scales for non-English outputs.
DeepSeek V4 (coding and math heavy): A mix of math problems (MATH dataset) and code (The Stack or your own codebases). On math- and code-heavy workloads, Wikipedia-calibrated models show 3-5% higher GSM8K regression than models calibrated on domain-matched data.

Note: Do not use your evaluation benchmarks (MMLU, GSM8K, HumanEval) as calibration data. Doing so leaks the test set into the quantization process and produces inflated accuracy numbers that do not reflect real-world performance. Use production data or a held-out calibration corpus.
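
One way to assemble a domain-matched calibration set while keeping benchmark prompts out of it is sketched below. The helper build_calibration_mix, the toy corpora, and the mixing weights are all hypothetical. In practice the sources would be your own code, math, and conversation data.

```python
import random


def build_calibration_mix(sources, weights, num_samples=1024, exclude=(), seed=0):
    """Draw a weighted, deduplicated calibration sample from several corpora.

    sources: dict of name -> list of text strings
    weights: dict of name -> fraction of the final sample (should sum to 1.0)
    exclude: texts that must never appear (for example, benchmark prompts)
    """
    rng = random.Random(seed)
    banned = set(exclude)
    seen = set()
    mix = []
    for name, texts in sources.items():
        quota = round(num_samples * weights[name])
        # Drop banned items and duplicates before shuffling
        pool = []
        for text in texts:
            if text not in banned and text not in seen:
                seen.add(text)
                pool.append(text)
        rng.shuffle(pool)
        mix.extend(pool[:quota])
    rng.shuffle(mix)
    return mix


# Hypothetical toy corpora standing in for real code, math, and chat data
sources = {
    "code": [f"def f{i}(): return {i}" for i in range(50)],
    "math": [f"Solve {i} + {i}" for i in range(50)],
    "chat": [f"User question {i}" for i in range(50)],
}
weights = {"code": 0.5, "math": 0.3, "chat": 0.2}
benchmark_prompts = ["Solve 3 + 3"]

mix = build_calibration_mix(sources, weights, num_samples=40, exclude=benchmark_prompts)
print(len(mix), sum(t.startswith("def ") for t in mix))
```

A fixed seed keeps the sample reproducible, so a later accuracy regression can be traced to the model or config and not to a different calibration draw.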

Hardware-Aware Quantization: Picking the Right Precision

The precision you target should match the hardware you're deploying to. Not every format provides hardware acceleration on every GPU.

GPU	Architecture	Best precision	Expected throughput gain vs FP16
H100 SXM5	Hopper	FP8	1.8-2.1x
H200 SXM5	Hopper	FP8	1.9-2.2x
B200 SXM6	Blackwell	FP4 (NVFP4)	3.0-4.0x
A100 80G	Ampere	INT8	1.4-1.6x
L40S	Ada Lovelace	INT8 / FP8 (partial)	1.3-1.5x

Hopper FP8 acceleration comes from FP8-capable Tensor Cores, managed in software by the Transformer Engine. H200 has the same compute architecture as H100 but pairs it with 141 GB of HBM3e instead of 80 GB of HBM3, giving it a memory capacity and bandwidth advantage on large models or long context. For the Transformer Engine internals and how per-tensor scaling works on Hopper, see the NVIDIA Transformer Engine guide.

Blackwell FP4 delivers the highest throughput gains because the B200 packs ~18,000 sparse FP4 TFLOPS versus ~9,000 sparse FP8 TFLOPS on the same chip. At 4-bit, you're also fitting 4x more weight values per HBM bandwidth unit compared to FP16. The combined effect makes FP4 on B200 substantially faster per dollar than FP8 on H100 for large-batch inference.
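
The bandwidth argument follows from bits per weight. A rough sketch of the arithmetic, assuming weights only and ignoring the per-block scale metadata that 4-bit formats carry (weight_footprint_gb is a hypothetical helper):

```python
BITS_PER_WEIGHT = {"FP16": 16, "BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4, "FP4": 4}


def weight_footprint_gb(num_params, fmt):
    """Approximate weight memory in GB (10^9 bytes), ignoring scale metadata."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


params_70b = 70e9
for fmt in ("FP16", "FP8", "FP4"):
    print(f"{fmt}: {weight_footprint_gb(params_70b, fmt):.0f} GB")
# FP16: 140 GB
# FP8: 70 GB
# FP4: 35 GB
```

Each decoded token has to stream the weights from HBM, so a 4x smaller footprint means roughly 4x fewer bytes moved per token in memory-bound regimes.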

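A100 (Ampere) predates hardware-native FP8, so INT8 is the practical ceiling there. L40S (Ada Lovelace) does include FP8-capable Tensor Cores, which is why the table lists FP8 as a partial option next to INT8. INT8 quantization with ModelOpt gives 1.4-1.6x throughput on A100 versus a 16-bit baseline, which is meaningful but far below what Hopper FP8 achieves.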
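
The hardware table can be turned into a small lookup so deployment scripts pick a format from the GPU architecture. This is a sketch: pick_quant_config is a hypothetical helper, FP8_DEFAULT_CFG and NVFP4_DEFAULT_CFG are the config names used in this guide, and INT8_DEFAULT_CFG is an assumption to verify against your installed ModelOpt version.

```python
# Best-precision choices copied from the hardware table in this guide
BEST_PRECISION = {
    "hopper": "FP8",
    "blackwell": "FP4",
    "ampere": "INT8",
    "ada lovelace": "INT8",
}

# ModelOpt config attribute names; INT8_DEFAULT_CFG is an assumption to verify
CONFIG_NAME = {
    "FP8": "FP8_DEFAULT_CFG",
    "FP4": "NVFP4_DEFAULT_CFG",
    "INT8": "INT8_DEFAULT_CFG",
}


def pick_quant_config(architecture):
    """Return (precision, config attribute name) for a GPU architecture."""
    key = architecture.strip().lower()
    if key not in BEST_PRECISION:
        raise ValueError(f"unknown architecture: {architecture!r}")
    precision = BEST_PRECISION[key]
    return precision, CONFIG_NAME[precision]


print(pick_quant_config("Hopper"))     # ('FP8', 'FP8_DEFAULT_CFG')
print(pick_quant_config("Blackwell"))  # ('FP4', 'NVFP4_DEFAULT_CFG')
```

In a real script the returned name could be resolved with getattr on the modelopt.torch.quantization module, which fails loudly if the attribute does not exist in your version.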

Step-by-Step: Quantize a 70B Model with ModelOpt

This walkthrough covers Llama 3.1 70B Instruct on 8x H100 SXM5, targeting FP8 for TensorRT-LLM export. The process is identical on H200 instances. For FP4, run the same steps on B200 and use NVFP4_DEFAULT_CFG instead. The format choice between NVFP4 and the OCP MXFP4 standard is covered in the NVFP4 vs MXFP4 guide.

Install ModelOpt and Dependencies

bash

# CUDA 12.1+ required; verify with: nvidia-smi
pip install "nvidia-modelopt[all]>=0.17"

# For TRT-LLM export, also install:
pip install tensorrt-llm  # or pull the NGC container

# Verify ModelOpt installation
python -c "import modelopt; print(modelopt.__version__)"

Require nvidia-modelopt 0.17 or newer (the >=0.17 specifier sets a minimum version, not an exact pin). The quantization config naming changed in v0.17; older versions use a different API surface.

Prepare the Calibration Dataloader

python

import json
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer

class CalibrationDataset(Dataset):
    def __init__(self, path, tokenizer, max_length=2048, num_samples=1024):
        with open(path) as f:
            raw = [json.loads(l) for l in f if l.strip()][:num_samples]
        self.encodings = [
            tokenizer(item["text"], truncation=True, max_length=max_length,
                      return_tensors="pt")
            for item in raw
        ]

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        return {k: v.squeeze(0) for k, v in self.encodings[idx].items()}

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
dataset = CalibrationDataset("sharegpt_1024.jsonl", tokenizer)
calib_loader = DataLoader(dataset, batch_size=1, shuffle=False)

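Use a batch size of 1 during calibration. The samples here are variable-length and unpadded, so the default collate function cannot stack them into larger batches, and padding them would push pad tokens through the model, which can shift activation statistics and degrade quantization quality.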
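
Before tokenizing, it can help to confirm that the JSONL file actually has enough usable records. The check below is a sketch. The helper check_calibration_jsonl is hypothetical, and it assumes the same format the dataloader above reads: one JSON object per line with a text field.

```python
import json
import os
import tempfile


def check_calibration_jsonl(path, field="text", min_samples=512):
    """Count usable records and collect line numbers that would break loading."""
    usable, bad_lines = 0, []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are skipped by the loader as well
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
                continue
            value = record.get(field) if isinstance(record, dict) else None
            if isinstance(value, str) and value.strip():
                usable += 1
            else:
                bad_lines.append(lineno)
    return {"usable": usable, "bad_lines": bad_lines, "enough": usable >= min_samples}


# Tiny demo file: two good records, one missing field, one malformed line
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"text": "hello"}\n\n{"prompt": "missing field"}\nnot json\n{"text": "world"}\n')
report = check_calibration_jsonl(tmp.name, min_samples=2)
os.remove(tmp.name)
print(report)  # {'usable': 2, 'bad_lines': [3, 4], 'enough': True}
```

Running this first avoids discovering a malformed line partway through a multi-GPU model load.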

Run PTQ with ModelOpt

python

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM
import torch

# Load on 8x H100 SXM5 (on Spheron H100 instances)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # distributes across 8 GPUs automatically
)

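def forward_loop(model):
    # ModelOpt calls this once to collect activation statistics
    for batch in calib_loader:
        # Send inputs to the device that holds the first model shard
        batch = {k: v.to(model.device) for k, v in batch.items()}
        with torch.no_grad():
            model(**batch)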

# FP8 quantization config - weights and activations
quant_cfg = mtq.FP8_DEFAULT_CFG

# For INT4 AWQ:
# quant_cfg = mtq.INT4_AWQ_CFG

# For FP4 on B200 (NVFP4 format; ModelOpt also supports MXFP4 via a separate config):
# quant_cfg = mtq.NVFP4_DEFAULT_CFG

model = mtq.quantize(model, quant_cfg, forward_loop=forward_loop)

Calibration runs in-place. For a 70B model on 8x H100 SXM5, expect 15-30 minutes. Memory usage per GPU is roughly the same as BF16 inference (the quantized weights are computed and stored in float during calibration, then packed at export).

Export to TensorRT-LLM, vLLM, or SGLang

TensorRT-LLM export: Export the quantized model as a TensorRT-LLM checkpoint, then compile it with trtllm-build. Calibrating FP8 with ModelOpt before compiling the TRT engine reduces engine size by 40-50%.

python

import torch
import modelopt.torch.export as mte

# Save quantized checkpoint
mte.export_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",
    dtype=torch.bfloat16,  # a torch.dtype, not a string
    export_dir="/engines/llama70b-fp8-checkpoint",
    inference_tensor_parallel=4,
    inference_pipeline_parallel=1,
)

Then build the engine (see the TensorRT-LLM production deployment guide for the full trtllm-build command):

bash

trtllm-build \
  --checkpoint_dir /engines/llama70b-fp8-checkpoint \
  --output_dir /engines/llama70b-engine \
  --gemm_plugin fp8 \
  --gpt_attention_plugin bfloat16 \
  --max_batch_size 128 \
  --max_input_len 8192 \
  --max_seq_len 10240 \
  --tp_size 4

vLLM export: Save as a HuggingFace-compatible checkpoint and launch with --quantization modelopt:

python

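import modelopt.torch.export as mte

export_dir = "/models/llama70b-fp8-modelopt"
# The second positional parameter is dtype, so pass export_dir by keyword
mte.export_hf_checkpoint(model, export_dir=export_dir)
tokenizer.save_pretrained(export_dir)  # tokenizer files are saved separately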

bash

vllm serve /models/llama70b-fp8-modelopt \
  --quantization modelopt \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 4

SGLang export: The HuggingFace export works directly with SGLang. For the full SGLang setup, see the SGLang production deployment guide:

bash

python -m sglang.launch_server \
  --model /models/llama70b-fp8-modelopt \
  --quantization modelopt \
  --tensor-parallel-size 4 \
  --port 8000

For a full vLLM production deployment walkthrough, see the vLLM production deployment guide.

Accuracy Benchmarks Before and After Quantization

Estimated values: Benchmark figures below are based on NVIDIA ModelOpt documentation, community-reported results from Hugging Face and GitHub, and NVIDIA NGC model cards. Run your own evaluation before production decisions.

MMLU (5-shot accuracy)

Model	FP16	FP8 PTQ	INT4 AWQ	INT4 ModelOpt	FP4 ModelOpt
Llama-3.1-70B	82.0%	81.7%	80.2%	81.0%	79.8%
Qwen2.5-72B	83.6%	83.2%	81.9%	82.5%	81.1%
DeepSeek-R1-70B	85.1%	84.7%	83.3%	84.0%	82.9%

GSM8K (8-shot chain-of-thought)

Model	FP16	FP8 PTQ	INT4 AWQ	INT4 ModelOpt	FP4 ModelOpt
Llama-3.1-70B	87.5%	86.9%	84.1%	85.6%	83.0%
Qwen2.5-72B	91.2%	90.7%	88.3%	89.5%	87.4%
DeepSeek-R1-70B	92.8%	92.2%	90.4%	91.3%	89.7%

HumanEval (pass@1)

Model	FP16	FP8 PTQ	INT4 AWQ	INT4 ModelOpt	FP4 ModelOpt
Llama-3.1-70B	67.1%	66.4%	63.9%	65.2%	62.8%
Qwen2.5-72B	73.2%	72.6%	70.1%	71.4%	69.0%
DeepSeek-R1-70B	79.5%	78.9%	76.8%	77.9%	75.6%

FP8 PTQ shows the smallest accuracy degradation across all three benchmarks, at 0.3-0.7 points. INT4 ModelOpt consistently beats INT4 AWQ, by 0.6-0.8 points on MMLU and 0.9-1.5 points on GSM8K, reflecting that ModelOpt's calibration pipeline uses more activation-aware scale estimation than AutoAWQ's default. FP4 ModelOpt has the largest degradation, 2.2-2.5 points on MMLU and 3.1-4.5 points on GSM8K in the tables above, so validate it against your own accuracy threshold and calibrate with domain-matched data.
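
The per-format drops can be recomputed directly from the MMLU table, which is a useful habit when comparing against a rollout threshold. A short sketch (the scores are copied from the table above; drops_vs_fp16 is a hypothetical helper):

```python
# MMLU scores from the table: FP16 first, then the four quantized formats
FORMATS = ("FP8 PTQ", "INT4 AWQ", "INT4 ModelOpt", "FP4 ModelOpt")
MMLU = {
    "Llama-3.1-70B": (82.0, 81.7, 80.2, 81.0, 79.8),
    "Qwen2.5-72B": (83.6, 83.2, 81.9, 82.5, 81.1),
    "DeepSeek-R1-70B": (85.1, 84.7, 83.3, 84.0, 82.9),
}


def drops_vs_fp16(scores):
    """Accuracy drop in percentage points for each quantized format."""
    baseline = scores[0]
    return {fmt: round(baseline - s, 1) for fmt, s in zip(FORMATS, scores[1:])}


drops = {model: drops_vs_fp16(scores) for model, scores in MMLU.items()}
for model, per_format in drops.items():
    print(model, per_format)
```

For Llama-3.1-70B this gives drops of 0.3, 1.8, 1.0, and 2.2 points for FP8 PTQ, INT4 AWQ, INT4 ModelOpt, and FP4 ModelOpt respectively.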

Throughput and Latency on Spheron H100, H200, and B200

Throughput numbers below are for Llama 3.1 70B at batch size 32, sequence length 2048. Single-node, tensor parallel across all available GPUs per node.

Teams running 70B fine-tuning jobs typically start with H100 SXM5 on Spheron before scaling to multi-node. For higher memory bandwidth to serve longer context, H200 instances on Spheron provide 141 GB HBM3e with similar compute to H100. For maximum throughput per dollar on production batch inference, B200 bare-metal access provides Blackwell's FP4 acceleration at scale.

Estimated values: Per-GPU throughput figures are derived from industry benchmarks and MLPerf data. Real throughput depends on batch size, sequence length, serving framework, and memory pressure. Benchmark your specific model before production sizing.

All prices are on-demand rates.

GPU	Format	Throughput (tok/s/GPU)	TTFT (ms)	On-demand $/hr	Relative cost/token vs FP16
H100 SXM5	FP16	~750	~85	$3.12	1.0x (baseline)
H100 SXM5	FP8 (ModelOpt)	~1,400	~48	$3.12	0.54x
H200 SXM5	FP8 (ModelOpt)	~1,600	~44	$4.62	0.69x
H100 SXM5	INT4 AWQ	~1,100	~61	$3.12	0.68x
B200 SXM6	FP4 (ModelOpt)	~4,000	~28	$3.70	0.22x

Pricing fluctuates based on GPU availability. The prices above are based on 28 May 2026 and may have changed. Check current GPU pricing for live rates.

Combining ModelOpt with Speculative Decoding and Continuous Batching

ModelOpt and speculative decoding are complementary: quantization reduces memory footprint and compute per token, while speculative decoding reduces the number of target model forward passes needed per output. Running both gives you the cost benefits of quantization plus the latency benefits of speculative decoding.

The practical integration point is the draft model. For lower draft model latency, quantize the draft model to FP8 or INT4 using the ModelOpt PTQ pipeline described above before loading it into the speculative decoding engine. A 1B draft model quantized to FP8 adds roughly 1 GB VRAM overhead instead of 2 GB at BF16. On an H100 80 GB running a 70B target at FP8, that headroom matters.

For the full speculative decoding setup with vLLM, including EAGLE-3 and P-EAGLE draft configurations, see the speculative decoding production guide.

Continuous batching pairs naturally with ModelOpt-quantized models because quantized models process each token faster, which means the batching system clears requests more quickly and refills batch slots at a higher rate. For the batching mechanics, see the continuous batching and paged attention guide.

Production Rollout Checklist

Shadow traffic test. Route 5-10% of production traffic to the quantized model endpoint alongside your FP16 baseline. Compare output quality on sampled requests before full cutover.
Accuracy guardrails. Set up automated evaluation using lm-evaluation-harness against your task-specific benchmark. Define a minimum acceptable score (e.g., MMLU delta under 1.5%) before promoting the quantized model to full traffic.
Latency regression threshold. Define a p95 TTFT budget. FP8 should improve TTFT; if p95 TTFT increases after quantization, check your tensor parallelism config and memory utilization settings.
Fallback toggle. Keep the FP16 model running as a hot standby or traffic split. If your guardrail fires, flip the traffic router back within seconds without a redeployment.
Monitoring plan. Track token throughput, TTFT p50/p95, GPU memory utilization, and your accuracy metric on a rolling window of sampled requests. Accuracy drift over time (as traffic distribution shifts away from calibration data) is a real phenomenon, especially for models serving diverse use cases. For patterns on sampling, job-level accuracy tracking, and alerting across large inference pipelines, see the batch LLM inference on GPU cloud guide.
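
The accuracy guardrail and latency threshold from the checklist can be expressed as a single promotion gate. The sketch below is hypothetical throughout: EvalResult, promotion_decision, and the example numbers are illustrative, and the 1.5-point MMLU limit is the example threshold from the checklist.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    mmlu: float         # MMLU accuracy in percent
    p95_ttft_ms: float  # p95 time to first token in milliseconds


def promotion_decision(baseline, candidate, max_mmlu_drop=1.5, ttft_budget_ms=None):
    """Return (promote, reasons); promote is False if any guardrail fires."""
    reasons = []
    drop = baseline.mmlu - candidate.mmlu
    if drop > max_mmlu_drop:
        reasons.append(f"MMLU drop {drop:.1f} > {max_mmlu_drop}")
    # Default budget: the quantized model must not be slower than the baseline
    budget = baseline.p95_ttft_ms if ttft_budget_ms is None else ttft_budget_ms
    if candidate.p95_ttft_ms > budget:
        reasons.append(f"p95 TTFT {candidate.p95_ttft_ms:.0f} ms > {budget:.0f} ms")
    return (not reasons, reasons)


# Illustrative numbers only; plug in your own evaluation results
fp16 = EvalResult(mmlu=82.0, p95_ttft_ms=120.0)
fp8 = EvalResult(mmlu=81.7, p95_ttft_ms=70.0)
fp4 = EvalResult(mmlu=79.8, p95_ttft_ms=45.0)

fp8_decision = promotion_decision(fp16, fp8)
fp4_decision = promotion_decision(fp16, fp4)
print(fp8_decision)  # (True, [])
print(fp4_decision)  # (False, ['MMLU drop 2.2 > 1.5'])
```

Returning the reasons alongside the boolean makes the fallback toggle auditable: the router can log exactly which guardrail caused a rollback.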

Cost Per Million Tokens: FP16 Baseline vs ModelOpt-Quantized Variants

Cost formula: cost_per_M_tokens = (price_per_hr / (throughput_tok_s * 3600)) * 1,000,000

All prices are on-demand rates.

Precision	GPU	tok/s/GPU	On-demand $/hr	$/M tokens
FP16	H100 SXM5	750	$3.12	$1.16
FP8 (ModelOpt)	H100 SXM5	1,400	$3.12	$0.62
FP8 (ModelOpt)	H200 SXM5	1,600	$4.62	$0.80
INT4 AWQ	H100 SXM5	1,100	$3.12	$0.79
FP4 (ModelOpt)	B200 SXM6	4,000	$3.70	$0.26

FP8 on H100 is 47% cheaper per million tokens than the FP16 baseline, with negligible accuracy loss. INT4 sits in between. FP4 on B200 is the cheapest per token but requires Blackwell hardware and accepts larger accuracy tradeoffs. For non-Blackwell hardware where FP4 is not an option, FP8 via ModelOpt is the cost-optimal path.
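
The cost formula is easy to reproduce, which makes it simple to rerun with your own measured throughput and current prices. A minimal sketch using the rows from the table above (cost_per_million_tokens is a hypothetical helper name):

```python
def cost_per_million_tokens(price_per_hr, throughput_tok_s):
    """Dollars per one million generated tokens on a single GPU."""
    return price_per_hr / (throughput_tok_s * 3600) * 1_000_000


# (precision, GPU, tok/s/GPU, on-demand $/hr) copied from the table
ROWS = [
    ("FP16", "H100 SXM5", 750, 3.12),
    ("FP8 (ModelOpt)", "H100 SXM5", 1400, 3.12),
    ("FP8 (ModelOpt)", "H200 SXM5", 1600, 4.62),
    ("INT4 AWQ", "H100 SXM5", 1100, 3.12),
    ("FP4 (ModelOpt)", "B200 SXM6", 4000, 3.70),
]

baseline = cost_per_million_tokens(3.12, 750)
costs = []
for precision, gpu, tok_s, price in ROWS:
    cost = cost_per_million_tokens(price, tok_s)
    costs.append(round(cost, 2))
    print(f"{precision:<15} {gpu:<10} ${cost:.2f}  {cost / baseline:.2f}x")
```

The second column of output is the relative cost per token against the FP16 baseline, matching the ratios in the throughput table.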

For current rates on all GPUs, check GPU pricing.

ModelOpt calibration jobs run in under 30 minutes on multi-GPU nodes. Spheron provides bare-metal H100, H200, and B200 access with no runtime overhead hiding your quantization gains.

Quick Setup Guide

Install ModelOpt and calibration dependencies
Install nvidia-modelopt with extras for quantization: pip install "nvidia-modelopt[all]>=0.17" (quote the requirement so the shell does not interpret the brackets). Also install the target serving framework (tensorrt-llm, vllm, or sglang) and ensure CUDA 12.1+ is available.
Prepare a calibration dataset
Assemble 512-1024 representative text samples matching your production prompt distribution. For general-purpose 70B models, a random 1024-sample slice of ShareGPT works. Save as JSONL with one 'text' field per line, or use the dataset helpers that ship with ModelOpt.
Run PTQ quantization with ModelOpt
Load the model with from_pretrained, then call modelopt.torch.quantization.quantize() with the target quant_cfg (e.g., INT4_AWQ_CFG for INT4, FP8_DEFAULT_CFG for FP8). Pass a forward_loop callable that runs your calibration dataloader through the model. Quantization runs in-place and takes 15-30 minutes for a 70B model on 8x H100s.
Export to TensorRT-LLM, vLLM, or SGLang
For TensorRT-LLM: run trtllm-build with the quantized checkpoint to compile the TRT engine. For vLLM: save the quantized weights with modelopt.torch.export.export_hf_checkpoint() and launch vLLM with --quantization modelopt. For SGLang: use the HuggingFace-compatible export and start the SGLang server pointing at the output directory.
Validate accuracy before production rollout
Run lm-evaluation-harness benchmarks (MMLU, GSM8K, HumanEval) on the quantized model and compare against your FP16 baseline. Accept the deployment if MMLU degradation is under 1.5% and GSM8K degradation is under 2%. For higher regression, switch to QAT or keep the most sensitive layers at higher precision.

Frequently Asked Questions

What is NVIDIA TensorRT Model Optimizer (ModelOpt)?
ModelOpt is NVIDIA's unified quantization and compression library that combines post-training quantization (PTQ), quantization-aware training (QAT), sparsity, and pruning under a single Python API. It supports FP8, INT4, FP4 (MXFP4/NVFP4), and INT8 formats and exports checkpoints for TensorRT-LLM, vLLM, and SGLang.

How does ModelOpt differ from AutoAWQ and GGUF quantization?
AutoAWQ targets INT4 weight-only quantization on CUDA GPUs; GGUF is the llama.cpp format, suited to CPU and local inference. ModelOpt targets GPU-native formats (FP8, FP4) with hardware-specific kernels for Hopper and Blackwell GPUs, achieving higher throughput than generic INT4 methods on datacenter hardware.

What calibration dataset size does ModelOpt need?
ModelOpt PTQ typically requires 512-1024 representative text samples. For instruction-tuned models like Llama 4 and Qwen 3.5, use domain-matched data (e.g., ShareGPT or a subset of your production prompts). Going beyond 1024 samples typically improves MMLU accuracy by less than 0.1%.

Which GPU should I target for ModelOpt FP8 vs FP4?
Target FP8 for H100 and H200 (Hopper architecture), which have native FP8 Tensor Core support via NVIDIA Transformer Engine. Target FP4 (NVFP4, via NVFP4_DEFAULT_CFG) for B200 and GB200 (Blackwell), which added FP4 hardware support. INT4 weight-only works on any GPU but is slower on Hopper compared to FP8.

Can ModelOpt quantized models run on vLLM and SGLang?
Yes. ModelOpt exports a HuggingFace-compatible quantized checkpoint that vLLM loads with --quantization modelopt and that SGLang can serve directly. For TensorRT-LLM, ModelOpt exports a quantized checkpoint that trtllm-build then compiles into the engine. Export paths differ per framework; the post covers commands for all three.
