Tutorial

NVIDIA TensorRT Model Optimizer (ModelOpt): FP8, INT4, and FP4 Quantization Guide (2026)

Written by Mitrasish, Co-founder. May 28, 2026

Running a 70B model at FP16 when FP8 or INT4 would do the same job means giving up a 30-50% reduction in cost per million tokens. The problem historically was that each precision format required a different tool: AutoAWQ for INT4, SmoothQuant for activation-heavy models, custom GPTQ forks for various architectures. NVIDIA ModelOpt consolidates all of that into one library. For background on the underlying precision formats, see the FP8 quantization guide and the AWQ quantization guide.

What ModelOpt Is and Why NVIDIA Built It

Before ModelOpt, the quantization tooling landscape was fragmented. You had AutoAWQ for INT4 on vLLM, SmoothQuant for INT8 activation quantization, GPTQ variants maintained by different community forks, and separate paths for FP8 via the Transformer Engine. Switching formats meant switching tools. Combining quantization with pruning or sparsity required gluing incompatible libraries together.

ModelOpt (full name: TensorRT Model Optimizer, package name: nvidia-modelopt) is NVIDIA's answer to this fragmentation. It exposes four core capabilities through a unified Python API:

  • Post-training quantization (PTQ): FP8, INT4 (AWQ and GPTQ variants), FP4 (MXFP4/NVFP4), and INT8. Runs in-place on a loaded model with a calibration dataloader.
  • Quantization-aware training (QAT): Fine-tunes a model with simulated quantization noise in the forward pass, recovering accuracy lost during PTQ. Critical for tasks sensitive to precision (complex math, multi-step reasoning).
  • Sparsity: Structured (2:4 sparsity) and unstructured weight pruning with accuracy recovery. NVIDIA Ampere and later hardware accelerates 2:4 sparse matmul, at up to twice the dense Tensor Core throughput.
  • Pruning: Channel and layer pruning to reduce model architecture size before deployment.
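The 2:4 pattern mentioned in the list above means that at most two of every four consecutive weights are non-zero. As a minimal sketch, assuming simple magnitude-based selection (an illustrative choice, not necessarily what ModelOpt does internally), the hypothetical helper `prune_2_to_4` below shows the pattern on a flat list of weights:

```python
def prune_2_to_4(weights):
    """Zero the two smallest-magnitude values in every group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_to_4(row)
print(sparse_row)
```

Every group of four in `sparse_row` keeps exactly two values, which is the layout that sparse Tensor Cores can exploit. Accuracy recovery after pruning is a separate step and is not shown here.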

The output of any ModelOpt workflow exports to TensorRT-LLM, vLLM, SGLang, or a standard HuggingFace checkpoint, depending on which serving stack you target.

ModelOpt vs Standalone Quantization Tools

| Tool | Format supported | Hardware target | Export path | CPU-compatible | Recommended use case |
|---|---|---|---|---|---|
| ModelOpt | FP8, INT4, FP4, INT8, QAT | Hopper, Blackwell (GPU-native) | TRT-LLM, vLLM, SGLang, HF | No | Unified GPU inference, datacenter deployment |
| AutoAWQ | INT4 (AWQ) | All CUDA GPUs | vLLM, SGLang, AutoGPTQ-compatible | Partial (llama.cpp) | INT4 inference, pre-quantized HF checkpoints |
| GGUF/llama.cpp | INT4, INT8, Q2-Q8 mixed | CPU + NVIDIA + Apple Silicon | llama.cpp, Ollama | Yes | Local/edge/CPU inference |
| SmoothQuant | INT8 | All CUDA GPUs | FasterTransformer, custom | No | Legacy INT8 deployments, older Ampere stacks |
| MXFP4 standalone | FP4 (MXFP4) | Blackwell, AMD MI355 | TRT-LLM, vLLM | No | Blackwell FP4 when full ModelOpt isn't needed |
| GPTQ | INT4 | All CUDA GPUs | vLLM, AutoGPTQ | No | INT4 when AWQ checkpoints don't exist |

For the GGUF workflow and CPU inference, see the GGUF dynamic quantization guide. For MXFP4 specifically on Blackwell, see the MXFP4 microscaling guide.

When to pick ModelOpt:

  • You need FP8 for Hopper, FP4 for Blackwell, and INT4 as a fallback, and want one tool to handle all three.
  • You need QAT recovery after PTQ degrades accuracy below your threshold.
  • You're deploying to TensorRT-LLM and want the officially supported calibration path.
  • You want structured pruning or 2:4 sparsity alongside quantization.

When not to pick ModelOpt:

  • You need CPU inference. ModelOpt output does not run on CPU; use GGUF/llama.cpp.
  • Your model already has pre-calibrated AWQ weights on Hugging Face. Load them directly into vLLM with --quantization awq.

Calibration Dataset Selection and Accuracy Preservation

ModelOpt PTQ calibration works by running a forward pass through the model with representative samples and recording activation statistics. Those statistics are used to set per-tensor (or per-channel) quantization scales that minimize rounding error. The quality of the calibration dataset directly affects the accuracy of the quantized model.
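To make the idea of recording activation statistics concrete, here is a minimal sketch of max calibration for a per-tensor FP8 scale. It assumes the simplest scheme (track the running absolute maximum, then map it onto the largest finite FP8 E4M3 value, 448); the `MaxCalibrator` class is hypothetical and only illustrates the principle, not ModelOpt's internal calibrators.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


class MaxCalibrator:
    """Tracks the running absolute maximum of a tensor across calibration batches."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, activations):
        # Update the running maximum with the largest magnitude in this batch.
        self.amax = max(self.amax, max(abs(x) for x in activations))

    def scale(self, qmax=E4M3_MAX):
        # Dividing a value by this scale maps the observed range onto [-qmax, qmax].
        return self.amax / qmax


calib = MaxCalibrator()
for batch in ([0.5, -2.0, 1.25], [3.5, -0.75], [-7.0, 0.1]):
    calib.observe(batch)

fp8_scale = calib.scale()
print(calib.amax, fp8_scale)
```

If the calibration batches never contain the large activations that production traffic produces, `amax` is underestimated and real inputs are clipped, which is why the dataset choice matters.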

Dataset size. 512 samples is the practical minimum. 1024 is the sweet spot for most production models. Beyond 1024, accuracy improvements on MMLU and GSM8K are typically below 0.1%. Using 2048+ samples adds calibration time with negligible accuracy return.

Domain matching. If you're deploying a coding assistant, calibrate on code. If your model serves customer support, use conversation transcripts. A mismatch between calibration data and production traffic skews the activation scales toward the wrong distribution and introduces avoidable accuracy loss.

Concrete dataset choices:

  • Llama 4 / Llama 3.x (instruction-tuned): A 1024-sample slice of ShareGPT (conversation format) or OpenHermes 2.5 (diverse instruction-following).
  • Qwen 3.5 (multilingual): Use a multilingual sample covering the top languages in your traffic. Calibrating only on English skews scales for non-English outputs.
  • DeepSeek V4 (coding and math heavy): A mix of math problems (MATH dataset) and code (The Stack or your own codebases). Math- and code-heavy workloads served by Wikipedia-calibrated models show a 3-5% higher GSM8K regression than with domain-matched calibration.

Note: Do not use your evaluation benchmarks (MMLU, GSM8K, HumanEval) as calibration data. Doing so leaks the test set into the quantization process and produces inflated accuracy numbers that do not reflect real-world performance. Use production data or a held-out calibration corpus.
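One way to assemble a domain-matched calibration file is to sample from each source in proportion to your traffic and write the result as JSONL with a "text" field. The sketch below is illustrative: `build_calibration_mix`, `write_jsonl`, the toy sources, and the 3:1 weighting are all hypothetical, and per-source quotas are rounded, so the total can differ slightly from `num_samples` for uneven weights.

```python
import json
import random


def build_calibration_mix(sources, weights, num_samples=1024, seed=0):
    """Sample texts from several domains in proportion to `weights`."""
    rng = random.Random(seed)
    total = sum(weights.values())
    mix = []
    for name, texts in sources.items():
        quota = round(num_samples * weights[name] / total)
        # Never request more samples than a source actually contains.
        mix.extend(rng.sample(texts, min(quota, len(texts))))
    rng.shuffle(mix)
    return mix


def write_jsonl(texts, path):
    """Write one JSON object per line, each with a single "text" field."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


# Toy stand-ins for real corpora; replace with production samples.
sources = {
    "code": [f"def f{i}(): return {i}" for i in range(100)],
    "math": [f"What is {i} + {i}?" for i in range(100)],
}
mix = build_calibration_mix(sources, {"code": 3, "math": 1}, num_samples=40)
print(len(mix))
```

Keep the sources disjoint from any benchmark you evaluate on, for the reason given in the note above.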

Hardware-Aware Quantization: Picking the Right Precision

The precision you target should match the hardware you're deploying to. Not every format provides hardware acceleration on every GPU.

| GPU | Architecture | Best precision | Expected throughput gain vs FP16 |
|---|---|---|---|
| H100 SXM5 | Hopper | FP8 | 1.8-2.1x |
| H200 SXM5 | Hopper | FP8 | 1.9-2.2x |
| B200 SXM6 | Blackwell | FP4 (NVFP4) | 3.0-4.0x |
| A100 80G | Ampere | INT8 | 1.4-1.6x |
| L40S | Ada Lovelace | INT8 / FP8 (partial) | 1.3-1.5x |

Hopper FP8 acceleration comes from FP8-capable Tensor Cores, managed by the Transformer Engine. H200 has the same compute architecture as H100 but with 141 GB HBM3e vs 80 GB, giving it a memory capacity and bandwidth advantage on large models or long context. For the Transformer Engine internals and how per-tensor scaling works on Hopper, see the NVIDIA Transformer Engine guide.

Blackwell FP4 delivers the highest throughput gains because the B200 packs ~18,000 sparse FP4 TFLOPS versus ~9,000 sparse FP8 TFLOPS on the same chip. At 4-bit, you're also fitting 4x more weight values per HBM bandwidth unit compared to FP16. The combined effect makes FP4 on B200 substantially faster per dollar than FP8 on H100 for large-batch inference.

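The A100 (Ampere) predates hardware-native FP8, so INT8 is the practical ceiling there. The L40S (Ada Lovelace) does have FP8 Tensor Cores, which is why the table lists FP8 as a partial option alongside INT8. INT8 quantization with ModelOpt gives 1.4-1.6x throughput on A100 vs the BF16 baseline, which is meaningful but far below what Hopper FP8 achieves.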
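The hardware table can be turned into a small lookup keyed on CUDA compute capability. This is a sketch under stated assumptions: the function name `recommended_precision` is hypothetical, the returned strings are copied from the table rather than being ModelOpt config names, and the capability values used are 8.0 for A100, 8.9 for L40S, 9.0 for H100/H200, and 10.x for B200-class Blackwell.

```python
def recommended_precision(major, minor):
    """Map a CUDA compute capability to the precision recommended in the table."""
    if major == 10:
        return "FP4 (NVFP4)"           # Blackwell datacenter parts, e.g. B200
    if (major, minor) == (9, 0):
        return "FP8"                   # Hopper: H100, H200
    if (major, minor) == (8, 9):
        return "INT8 / FP8 (partial)"  # Ada Lovelace, e.g. L40S
    if major == 8:
        return "INT8"                  # Ampere, e.g. A100
    raise ValueError(f"no recommendation for compute capability {major}.{minor}")


for name, cap in {"A100": (8, 0), "L40S": (8, 9), "H100": (9, 0), "B200": (10, 0)}.items():
    print(name, recommended_precision(*cap))
```

On a machine with PyTorch and a GPU, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair to pass in.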

Step-by-Step: Quantize a 70B Model with ModelOpt

This walkthrough covers Llama 3.1 70B Instruct on 8x H100 SXM5, targeting FP8 for TensorRT-LLM export. The process is identical on H200 instances. To target FP4 instead, run on B200 and use NVFP4_DEFAULT_CFG.

Install ModelOpt and Dependencies

bash
# CUDA 12.1+ required; verify with: nvidia-smi
pip install "nvidia-modelopt[all]>=0.17"

# For TRT-LLM export, also install:
pip install tensorrt-llm  # or pull the NGC container

# Verify ModelOpt installation
python -c "import modelopt; print(modelopt.__version__)"

Require nvidia-modelopt>=0.17, as in the install command above (the specifier is a lower bound, not an exact pin). The quantization config naming changed in v0.17; older versions use a different API surface.

Prepare the Calibration Dataloader

python
import json
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer

class CalibrationDataset(Dataset):
    def __init__(self, path, tokenizer, max_length=2048, num_samples=1024):
        with open(path) as f:
            raw = [json.loads(l) for l in f if l.strip()][:num_samples]
        self.encodings = [
            tokenizer(item["text"], truncation=True, max_length=max_length,
                      return_tensors="pt")
            for item in raw
        ]

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        return {k: v.squeeze(0) for k, v in self.encodings[idx].items()}

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
dataset = CalibrationDataset("sharegpt_1024.jsonl", tokenizer)
calib_loader = DataLoader(dataset, batch_size=1, shuffle=False)

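Use a batch size of 1 during calibration. The dataset above returns unpadded sequences of different lengths, which the default collate function cannot stack into larger batches, and padding them would feed pad tokens into the activation statistics, which can degrade quantization quality.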

Run PTQ with ModelOpt

python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM
import torch

# Load on 8x H100 SXM5 (on Spheron H100 instances)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # distributes across 8 GPUs automatically
)

def forward_loop(model):
    for batch in calib_loader:
        batch = {k: v.cuda() for k, v in batch.items()}
        with torch.no_grad():
            model(**batch)

# FP8 quantization config - weights and activations
quant_cfg = mtq.FP8_DEFAULT_CFG

# For INT4 AWQ:
# quant_cfg = mtq.INT4_AWQ_CFG

# For FP4 on B200 (NVFP4 format; ModelOpt also supports MXFP4 via a separate config):
# quant_cfg = mtq.NVFP4_DEFAULT_CFG

model = mtq.quantize(model, quant_cfg, forward_loop=forward_loop)

Calibration runs in-place. For a 70B model on 8x H100 SXM5, expect 15-30 minutes. Memory usage per GPU is roughly the same as BF16 inference (the quantized weights are computed and stored in float during calibration, then packed at export).

Export to TensorRT-LLM, vLLM, or SGLang

TensorRT-LLM export: Save the quantized checkpoint, then compile it with trtllm-build. Calibrating to FP8 with ModelOpt before compilation reduces engine size by 40-50% relative to an unquantized build.

python
import torch
import modelopt.torch.export as mte

# Save quantized checkpoint
mte.export_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",
    dtype=torch.bfloat16,  # torch dtype used for layers left unquantized
    export_dir="/engines/llama70b-fp8-checkpoint",
    inference_tensor_parallel=4,
    inference_pipeline_parallel=1,
)

Then build the engine (see the TensorRT-LLM production deployment guide for the full trtllm-build command):

bash
trtllm-build \
  --checkpoint_dir /engines/llama70b-fp8-checkpoint \
  --output_dir /engines/llama70b-engine \
  --gemm_plugin fp8 \
  --gpt_attention_plugin fp8 \
  --max_batch_size 128 \
  --max_input_len 8192 \
  --max_seq_len 10240 \
  --tp_size 4

vLLM export: Save as a HuggingFace-compatible checkpoint and launch with --quantization modelopt:

python
import modelopt.torch.export as mte

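mte.export_hf_checkpoint(model, export_dir="/models/llama70b-fp8-modelopt")
# Save the tokenizer next to the weights so the directory is a complete checkpoint
tokenizer.save_pretrained("/models/llama70b-fp8-modelopt")

bash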
vllm serve /models/llama70b-fp8-modelopt \
  --quantization modelopt \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 4

SGLang export: The HuggingFace export works directly with SGLang. For the full SGLang setup, see the SGLang production deployment guide:

bash
python -m sglang.launch_server \
  --model /models/llama70b-fp8-modelopt \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --port 8000

For a full vLLM production deployment walkthrough, see the vLLM production deployment guide.

Accuracy Benchmarks Before and After Quantization

Estimated values: Benchmark figures below are based on NVIDIA ModelOpt documentation, community-reported results from Hugging Face and GitHub, and NVIDIA NGC model cards. Run your own evaluation before production decisions.

MMLU (5-shot accuracy)

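| Model | FP16 | FP8 PTQ | INT4 AWQ | INT4 ModelOpt | FP4 ModelOpt |
|---|---|---|---|---|---|
| Llama-3.1-70B | 82.0% | 81.7% | 80.2% | 81.0% | 79.8% |
| Qwen2.5-72B | 83.6% | 83.2% | 81.9% | 82.5% | 81.1% |
| DeepSeek-R1-70B | 85.1% | 84.7% | 83.3% | 84.0% | 82.9% |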

GSM8K (8-shot chain-of-thought)

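| Model | FP16 | FP8 PTQ | INT4 AWQ | INT4 ModelOpt | FP4 ModelOpt |
|---|---|---|---|---|---|
| Llama-3.1-70B | 87.5% | 86.9% | 84.1% | 85.6% | 83.0% |
| Qwen2.5-72B | 91.2% | 90.7% | 88.3% | 89.5% | 87.4% |
| DeepSeek-R1-70B | 92.8% | 92.2% | 90.4% | 91.3% | 89.7% |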

HumanEval (pass@1)

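| Model | FP16 | FP8 PTQ | INT4 AWQ | INT4 ModelOpt | FP4 ModelOpt |
|---|---|---|---|---|---|
| Llama-3.1-70B | 67.1% | 66.4% | 63.9% | 65.2% | 62.8% |
| Qwen2.5-72B | 73.2% | 72.6% | 70.1% | 71.4% | 69.0% |
| DeepSeek-R1-70B | 79.5% | 78.9% | 76.8% | 77.9% | 75.6% |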

FP8 PTQ shows the smallest accuracy degradation across all three benchmarks (0.3-0.4 points on MMLU). INT4 ModelOpt consistently beats INT4 AWQ, by 0.6-0.8 points on MMLU and 0.9-1.5 points on GSM8K and HumanEval, reflecting that ModelOpt's calibration pipeline uses more activation-aware scale estimation than AutoAWQ's default. FP4 ModelOpt has the largest degradation: 2.2-2.5 points on MMLU and 3.1-4.5 points on GSM8K in the tables above. Check it against your own accuracy threshold and calibrate with domain-matched data.
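The per-format drops can be recomputed directly from the MMLU table. The snippet below is a small sketch that copies the estimated MMLU values from above into a dictionary; the helper name `drops_vs_baseline` is hypothetical.

```python
MMLU = {
    "Llama-3.1-70B": {"FP16": 82.0, "FP8 PTQ": 81.7, "INT4 AWQ": 80.2,
                      "INT4 ModelOpt": 81.0, "FP4 ModelOpt": 79.8},
    "Qwen2.5-72B": {"FP16": 83.6, "FP8 PTQ": 83.2, "INT4 AWQ": 81.9,
                    "INT4 ModelOpt": 82.5, "FP4 ModelOpt": 81.1},
    "DeepSeek-R1-70B": {"FP16": 85.1, "FP8 PTQ": 84.7, "INT4 AWQ": 83.3,
                        "INT4 ModelOpt": 84.0, "FP4 ModelOpt": 82.9},
}


def drops_vs_baseline(scores, baseline="FP16"):
    """Return the accuracy drop in percentage points for each quantized format."""
    base = scores[baseline]
    return {fmt: round(base - acc, 1) for fmt, acc in scores.items() if fmt != baseline}


for model, scores in MMLU.items():
    print(model, drops_vs_baseline(scores))
```

Running the same calculation on your own evaluation results gives the deltas to compare against whatever regression threshold you set for rollout.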

Throughput and Latency on Spheron H100, H200, and B200

Throughput numbers below are for Llama 3.1 70B at batch size 32, sequence length 2048. Single-node, tensor parallel across all available GPUs per node.

Teams starting with 70B fine-tuning jobs typically start with H100 SXM5 on Spheron before scaling to multi-node. For higher memory bandwidth to serve longer context, H200 instances on Spheron provide 141 GB HBM3e with similar compute to H100. For maximum throughput per dollar on production batch inference, B200 bare-metal access provides Blackwell's FP4 acceleration at scale.

Estimated values: Per-GPU throughput figures are derived from industry benchmarks and MLPerf data. Real throughput depends on batch size, sequence length, serving framework, and memory pressure. Benchmark your specific model before production sizing.

All prices are on-demand rates.

| GPU | Format | Throughput (tok/s/GPU) | TTFT (ms) | On-demand $/hr | Relative cost/token vs FP16 |
|---|---|---|---|---|---|
| H100 SXM5 | FP16 | ~750 | ~85 | $3.12 | 1.0x (baseline) |
| H100 SXM5 | FP8 (ModelOpt) | ~1,400 | ~48 | $3.12 | 0.54x |
| H200 SXM5 | FP8 (ModelOpt) | ~1,600 | ~44 | $4.62 | 0.69x |
| H100 SXM5 | INT4 AWQ | ~1,100 | ~61 | $3.12 | 0.68x |
| B200 SXM6 | FP4 (ModelOpt) | ~4,000 | ~28 | $3.70 | 0.22x |

Pricing fluctuates based on GPU availability. The prices above are based on 28 May 2026 and may have changed. Check current GPU pricing for live rates.

Combining ModelOpt with Speculative Decoding and Continuous Batching

ModelOpt and speculative decoding are complementary: quantization reduces memory footprint and compute per token, while speculative decoding reduces the number of target model forward passes needed per output. Running both gives you the cost benefits of quantization plus the latency benefits of speculative decoding.

The practical integration point is the draft model. For lower draft model latency, quantize the draft model to FP8 or INT4 using the ModelOpt PTQ pipeline described above before loading it into the speculative decoding engine. A 1B draft model quantized to FP8 adds roughly 1 GB VRAM overhead instead of 2 GB at BF16. On an H100 80 GB running a 70B target at FP8, that headroom matters.
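The VRAM figures above follow from bits per weight. A rough sketch, assuming weights dominate and ignoring scale factors, KV cache, and activations (the helper `weight_memory_gb` is hypothetical):

```python
BITS_PER_WEIGHT = {"BF16": 16, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4, "FP4": 4}


def weight_memory_gb(num_params, precision):
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization scales."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


draft_bf16 = weight_memory_gb(1e9, "BF16")   # 1B draft model at BF16
draft_fp8 = weight_memory_gb(1e9, "FP8")     # same draft model at FP8
target_fp8 = weight_memory_gb(70e9, "FP8")   # 70B target model at FP8
print(draft_bf16, draft_fp8, target_fp8)
```

Real footprints are somewhat higher because block-scaled formats store per-group scales and the serving engine reserves memory for the KV cache.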

For the full speculative decoding setup with vLLM, including EAGLE-3 and P-EAGLE draft configurations, see the speculative decoding production guide.

Continuous batching pairs naturally with ModelOpt-quantized models because quantized models process each token faster, which means the batching system clears requests more quickly and refills batch slots at a higher rate. For the batching mechanics, see the continuous batching and paged attention guide.

Production Rollout Checklist

  1. Shadow traffic test. Route 5-10% of production traffic to the quantized model endpoint alongside your FP16 baseline. Compare output quality on sampled requests before full cutover.
  2. Accuracy guardrails. Set up automated evaluation using lm-evaluation-harness against your task-specific benchmark. Define a minimum acceptable score (e.g., MMLU delta under 1.5%) before promoting the quantized model to full traffic.
  3. Latency regression threshold. Define a p95 TTFT budget. FP8 should improve TTFT; if p95 TTFT increases after quantization, check your tensor parallelism config and memory utilization settings.
  4. Fallback toggle. Keep the FP16 model running as a hot standby or traffic split. If your guardrail fires, flip the traffic router back within seconds without a redeployment.
  5. Monitoring plan. Track token throughput, TTFT p50/p95, GPU memory utilization, and your accuracy metric on a rolling window of sampled requests. Accuracy drift over time (as traffic distribution shifts away from calibration data) is a real phenomenon, especially for models serving diverse use cases. For patterns on sampling, job-level accuracy tracking, and alerting across large inference pipelines, see the batch LLM inference on GPU cloud guide.
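The accuracy guardrail and latency threshold from the checklist can be combined into one promotion gate. This is a minimal sketch: the function names, the nearest-rank percentile method, the 100 ms TTFT budget, and the sample latencies are all illustrative assumptions. Only the 1.5-point MMLU delta comes from the checklist above.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a list of measurements."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def should_promote(baseline_acc, quantized_acc, ttft_ms,
                   max_acc_drop=1.5, ttft_budget_ms=100.0):
    """Return True only if both the accuracy and latency guardrails pass."""
    acc_ok = (baseline_acc - quantized_acc) <= max_acc_drop
    latency_ok = p95(ttft_ms) <= ttft_budget_ms
    return acc_ok and latency_ok


ttft_samples = [44, 48, 52, 47, 60, 49, 51, 46, 95, 50]  # hypothetical TTFT samples in ms
print(should_promote(82.0, 81.7, ttft_samples))  # 0.3-point drop
print(should_promote(82.0, 79.8, ttft_samples))  # 2.2-point drop
```

Wiring a check like this into the traffic router is what makes the fallback toggle in step 4 automatic rather than manual.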

Cost Per Million Tokens: FP16 Baseline vs ModelOpt-Quantized Variants

Cost formula: cost_per_M_tokens = (price_per_hr / (throughput_tok_s * 3600)) * 1,000,000

All prices are on-demand rates.

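| Precision | GPU | tok/s/GPU | On-demand $/hr | $/M tokens |
|---|---|---|---|---|
| FP16 | H100 SXM5 | 750 | $3.12 | $1.16 |
| FP8 (ModelOpt) | H100 SXM5 | 1,400 | $3.12 | $0.62 |
| FP8 (ModelOpt) | H200 SXM5 | 1,600 | $4.62 | $0.80 |
| INT4 AWQ | H100 SXM5 | 1,100 | $3.12 | $0.79 |
| FP4 (ModelOpt) | B200 SXM6 | 4,000 | $3.70 | $0.26 |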

FP8 on H100 is about 46% cheaper per million tokens than the FP16 baseline (0.54x the cost), with negligible accuracy loss. INT4 sits in between. FP4 on B200 is the cheapest per token but requires Blackwell hardware and accepts larger accuracy tradeoffs. For non-Blackwell hardware where FP4 is not an option, FP8 via ModelOpt is the cost-optimal path.
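The $/M tokens column can be reproduced from the cost formula with a few lines of Python. The sketch below uses the throughput and on-demand prices from the table; the function name `cost_per_million_tokens` mirrors the formula and is otherwise an arbitrary choice.

```python
def cost_per_million_tokens(price_per_hr, throughput_tok_s):
    """Dollars per one million generated tokens for a single GPU."""
    return price_per_hr / (throughput_tok_s * 3600) * 1_000_000


# (precision, GPU, tok/s/GPU, on-demand $/hr) copied from the table above
rows = [
    ("FP16", "H100 SXM5", 750, 3.12),
    ("FP8 (ModelOpt)", "H100 SXM5", 1400, 3.12),
    ("FP8 (ModelOpt)", "H200 SXM5", 1600, 4.62),
    ("INT4 AWQ", "H100 SXM5", 1100, 3.12),
    ("FP4 (ModelOpt)", "B200 SXM6", 4000, 3.70),
]

for precision, gpu, tok_s, price in rows:
    cost = cost_per_million_tokens(price, tok_s)
    print(f"{precision:<16}{gpu:<11}${cost:.2f}")
```

Substituting your own measured throughput and current hourly rate gives a like-for-like comparison for your workload.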

For current rates on all GPUs, check GPU pricing.


ModelOpt calibration jobs run in under 30 minutes on multi-GPU nodes. Spheron provides bare-metal H100, H200, and B200 access with no runtime overhead hiding your quantization gains.

Quick Setup Guide

  1. Install ModelOpt and calibration dependencies

    Install nvidia-modelopt with extras for quantization: pip install "nvidia-modelopt[all]>=0.17". Also install the target serving framework (tensorrt-llm, vllm, or sglang) and ensure CUDA 12.1+ is available.

  2. Prepare a calibration dataset

    Assemble 512-1024 representative text samples matching your production prompt distribution. For general-purpose 70B models, a random 1024-sample slice of ShareGPT works. Save as a JSONL file with a 'text' field, or use the dataset helpers in modelopt.torch.utils.dataset_utils.

  3. Run PTQ quantization with ModelOpt

    Load the model with from_pretrained, then call modelopt.torch.quantization.quantize() with the target quant_cfg (e.g., INT4_AWQ_CFG for INT4, FP8_DEFAULT_CFG for FP8). Pass a forward_loop callable that runs your calibration dataloader through the model. Quantization runs in-place and takes 15-30 minutes for a 70B model on 8x H100s.

  4. Export to TensorRT-LLM, vLLM, or SGLang

    For TensorRT-LLM: run trtllm-build with the quantized checkpoint to compile the TRT engine. For vLLM: save the quantized weights with modelopt.torch.export.export_hf_checkpoint() and launch vLLM with --quantization modelopt. For SGLang: use the HuggingFace-compatible export and start the SGLang server pointing at the output directory.

  5. Validate accuracy before production rollout

    Run lm-evaluation-harness benchmarks (MMLU, GSM8K, HumanEval) on the quantized model and compare against your FP16 baseline. Accept the deployment if MMLU degradation is under 1.5% and GSM8K degradation is under 2%. For higher regression, switch to QAT or keep the most sensitive layers in higher precision.

Frequently Asked Questions

What is NVIDIA ModelOpt?

ModelOpt is NVIDIA's unified quantization and compression library that combines post-training quantization (PTQ), quantization-aware training (QAT), sparsity, and pruning under a single Python API. It supports FP8, INT4, FP4 (MXFP4/NVFP4), and INT8 formats and exports checkpoints for TensorRT-LLM, vLLM, and SGLang.

How does ModelOpt differ from AutoAWQ and GGUF?

AutoAWQ targets INT4 weight-only quantization on CUDA GPUs; GGUF is the llama.cpp format optimized for CPU and local inference. ModelOpt targets GPU-native formats (FP8, FP4) with hardware-specific kernels for Hopper and Blackwell GPUs, achieving higher throughput than generic INT4 methods on datacenter hardware.

How many calibration samples does ModelOpt PTQ need?

ModelOpt PTQ typically requires 512-1024 representative text samples. For instruction-tuned models like Llama 4 and Qwen 3.5, use domain-matched data (e.g., ShareGPT or a subset of your production prompts). Going beyond 1024 samples typically improves MMLU accuracy by less than 0.1%.

Which precision should I target on each GPU?

Target FP8 for H100 and H200 (Hopper architecture), which have native FP8 Tensor Core support via NVIDIA Transformer Engine. Target FP4 (NVFP4, via NVFP4_DEFAULT_CFG) for B200 and GB200 (Blackwell), which added FP4 hardware support. INT4 weight-only works on any GPU but is slower on Hopper compared to FP8.

Can ModelOpt-quantized models be served with vLLM and SGLang?

Yes. ModelOpt exports quantized checkpoints in a HuggingFace-compatible format that vLLM (--quantization modelopt) and SGLang (--quantization fp8, as in the example above) can load. For TensorRT-LLM, ModelOpt produces a quantized checkpoint that trtllm-build then compiles into the engine. Export paths differ per framework; the post covers commands for all three.
