PyTorch-native serving is fast to set up but leaves 20-40% throughput on the table. TensorRT-LLM closes that gap through compiled execution plans, fused CUDA kernels, and an in-flight batching executor purpose-built for autoregressive decoding.
This guide covers the full production path: engine build pipeline, FP8/INT4/FP4 quantization, tensor and pipeline parallelism for 70B+ models, Triton backend wiring, and cost benchmarks on H200 and B200.
TensorRT-LLM vs PyTorch-Native Engines: When TRT-LLM Wins
TensorRT-LLM compiles your model weights and architecture into a CUDA kernel graph optimized for a specific GPU SKU, batch size range, and sequence length. At inference time, the runtime executes this precompiled graph directly - no Python overhead, no dynamic dispatch, no op-level JIT. The upside is hardware utilization that matches or beats hand-tuned C++ code. The downside is that every change to the model, batch size ceiling, or sequence length forces a rebuild. If your model changes weekly, the 25-45 minute build cycle is a real operational cost.
TRT-LLM is the right call when your model is stable, you are at or near capacity, and you need to extract every token per second from the hardware. vLLM is the better starting point when you need model flexibility or fast iteration.
Hopper (H100/H200) vs Blackwell (B200/GB200) Acceleration
On Hopper, TRT-LLM exploits the Transformer Engine's FP8 tensor cores natively. The FP8 attention and MLP kernels are fused, and the runtime handles per-tensor scaling factors automatically. Hopper FP8 typically gives a 1.3-1.5x throughput improvement over FP16 on the same hardware.
Blackwell changes the economics again. The B200 SXM6 adds FP4 tensor core support and raises memory bandwidth from the H200's 4.8 TB/s to 8 TB/s. TRT-LLM 0.19.0+ includes FP4 kernels for Blackwell, though FP4 requires TensorRT 10.9+ and CUDA 12.8 or later. At FP4, a 70B model fits in roughly 35GB, leaving the rest of the B200's 192GB for KV cache and sustaining significantly larger batch sizes than FP8 on H200.
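To make the memory arithmetic concrete, here is a back-of-the-envelope sketch covering weights only; KV cache, activations, and runtime overhead come on top:

```python
# Rough model-weight footprint per precision. Weights only --
# KV cache, activations, and runtime overhead are extra.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5, "fp4": 0.5}

def weight_gb(num_params: float, fmt: str) -> float:
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in ("fp16", "fp8", "fp4"):
    print(f"70B @ {fmt}: ~{weight_gb(70e9, fmt):.0f} GB")
# 70B @ fp16: ~140 GB
# 70B @ fp8:  ~70 GB
# 70B @ fp4:  ~35 GB -> ~157 GB of a B200's 192 GB left for KV cache
```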
Engine Build Pipeline
Prerequisites and Docker Setup
Use the official NGC container. Trying to install TRT-LLM via pip and managing the CUDA/cuDNN/TensorRT dependency chain by hand will cost you a day. The NGC containers ship with the right versions already pinned.
```bash
# Verify your driver meets the minimum for CUDA 13.1
nvidia-smi | grep "Driver Version"
# Required: 590.44.01+ on Linux

docker pull nvcr.io/nvidia/tensorrt-llm/release:1.2.0
```

The container tag here (1.2.0) is based on CUDA 13.1.0 and requires driver 590.44.01 or later. If you are on an older driver, pull the 1.0.0 container (CUDA 12.6) instead. CUDA version mismatches produce cryptic build errors, so verify before starting.
Checkpoint Conversion
TRT-LLM does not read HuggingFace weights directly at engine build time. You first convert to TRT-LLM checkpoint format using convert_checkpoint.py from the examples directory. This step is architecture-specific.
```bash
docker run --gpus all --ipc=host \
  -v /path/to/hf-model:/models \
  -v /path/to/output:/output \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
  python examples/llama/convert_checkpoint.py \
    --model_dir /models/Llama-3.1-70B-Instruct \
    --output_dir /output/llama70b-checkpoint \
    --dtype float16 \
    --tp_size 4 \
    --pp_size 1
```

Supported architectures include Llama, Mistral, Mixtral (MoE), Falcon, Gemma, Phi, and Qwen. Check the `examples/` directory in the NGC container for the full list. The `--tp_size` and `--pp_size` flags set the parallelism strategy and must match what you pass to `trtllm-build` later.
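Before moving on, it is worth asserting that the converted checkpoint carries the parallelism you intend to build with. A minimal sketch, assuming a recent TRT-LLM version that records the mapping in the checkpoint's config.json (paths and expected values here are hypothetical; adjust the key names if your version differs):

```python
import json
from pathlib import Path

# Hypothetical paths/values -- match them to your own pipeline.
CKPT_DIR = Path("/output/llama70b-checkpoint")
EXPECTED = {"tp_size": 4, "pp_size": 1}

# Recent TRT-LLM versions record the parallelism mapping in the
# checkpoint's config.json; adjust key names if yours differs.
mapping = json.loads((CKPT_DIR / "config.json").read_text()).get("mapping", {})
for key, want in EXPECTED.items():
    got = mapping.get(key)
    assert got == want, f"checkpoint has {key}={got}, build expects {want}"
print("checkpoint mapping matches the planned trtllm-build flags")
```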
Quantization Calibration
FP8 post-training quantization (PTQ) is the most common production path. The calibration step runs a small dataset through the model to compute the activation scales needed for accurate FP8 inference.
```bash
docker run --gpus all --ipc=host \
  -v /models:/models \
  -v /engines:/engines \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
  python examples/quantization/quantize.py \
    --model_dir /models/Llama-3.1-70B-Instruct \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /engines/fp8-checkpoint \
    --calib_size 512 \
    --tp_size 4
```

512 calibration samples is a reasonable default. Going above 1024 gives diminishing accuracy returns for most models. The calibration run itself takes 5-10 minutes on a single H200. Use data that matches your real traffic distribution - calibrating on Wikipedia text when you are serving code will skew the activation scales.
trtllm-build Configuration
```bash
docker run --gpus all --ipc=host \
  -v /engines:/engines \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
  trtllm-build \
    --checkpoint_dir /engines/fp8-checkpoint \
    --output_dir /engines/llama70b-engine \
    --gemm_plugin fp8 \
    --gpt_attention_plugin fp8 \
    --max_batch_size 128 \
    --max_input_len 8192 \
    --max_seq_len 10240 \
    --workers 4 \
    --tp_size 4
```

Key flags:

- `--max_batch_size 128`: increase if you expect high concurrent request volume. Higher values also increase VRAM usage during build; if you hit OOM during build, drop to 64 or 32.
- `--max_input_len` and `--max_seq_len`: the engine rejects requests exceeding these limits. Size them based on your P99 traffic, not the theoretical model maximum.
- `--workers 4`: parallel compilation workers. Match to your CPU core count.
- `--tp_size 4`: must match the `--tp_size` passed to `quantize.py` (or `convert_checkpoint.py` in the non-FP8 path). The engine is compiled for a specific tensor parallelism degree; mismatching this with the launch command causes an immediate error.
Build time for a 70B FP8 model on a single H200 is 25-45 minutes. For a spot-optimized workflow: run the build on a spot instance at roughly $2/hr (for B200 spot) and store the engine artifacts in object storage. The engine is reusable until your TRT version changes.
In-Flight Batching, Chunked Prefill, and Paged KV Cache
TRT-LLM's executor API handles continuous batching natively. When a request finishes generation, the executor immediately fills its slot with a new request without waiting for the entire batch to complete. This is the core mechanism behind the throughput advantage over static batching approaches.
Executor API Config
```json
{
  "executor_config": {
    "max_batch_size": 128,
    "max_num_tokens": 16384,
    "enable_chunked_prefill": true,
    "kv_cache_config": {
      "free_gpu_memory_fraction": 0.85,
      "enable_block_reuse": true
    }
  }
}
```

What each parameter does:

- `max_batch_size`: maximum concurrent requests in a batch. Set based on your peak QPS estimate.
- `max_num_tokens`: total token budget across all active requests per iteration. This bounds memory and compute per step. Start at 8192-16384 and increase if you see GPU idle time.
- `enable_chunked_prefill`: splits long prefills into chunks so they do not block decode steps. This reduces TTFT variance at long context (8K+ input length). The cost is a marginal throughput reduction at low batch sizes.
- `free_gpu_memory_fraction`: fraction of the VRAM remaining after model weights that goes to KV cache. Default is 0.9; back it off to 0.7-0.8 if you see runtime OOM errors.
- `enable_block_reuse`: caches KV blocks for shared prefixes (like system prompts). The same optimization as vLLM's prefix caching, just under a different name. See the KV cache optimization guide for a deeper look at block reuse strategies.
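Sizing `max_num_tokens` and `free_gpu_memory_fraction` is easier once you know what one token of KV cache costs. A rough sketch for Llama 3.1 70B with an FP8 KV cache on a single H200, using the model's published dimensions (80 layers, 8 KV heads via GQA, head_dim 128); treat the outputs as estimates, not guarantees:

```python
# Back-of-the-envelope KV cache capacity: single H200, Llama 3.1 70B, FP8.
layers, kv_heads, head_dim = 80, 8, 128            # published Llama 3.1 70B dims
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 1  # K+V, 1 byte/elem (FP8)

hbm_gb, weights_gb, frac = 141, 70, 0.85  # H200 HBM, FP8 weights, memory fraction
kv_budget = (hbm_gb - weights_gb) * 1e9 * frac

print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")     # ~160 KiB
print(f"~{kv_budget / kv_bytes_per_token / 1e3:.0f}K cacheable tokens")  # ~368K
```

At roughly 160 KiB per token, a `max_num_tokens` of 16384 consumes about 2.6 GB per iteration, comfortably inside the ~60 GB KV budget in this configuration.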
Multi-GPU Deployment: TP, PP, and EP Configs
For models too large to fit on a single GPU, TRT-LLM uses MPI for inter-GPU coordination. You set the parallelism strategy at checkpoint conversion time, build the engine with matching flags, and launch with mpirun.
70B+ Model Parallelism Strategies
| Model | Params | TP | PP | EP | Min GPUs | Recommended |
|---|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | 4 | 1 | 1 | 4x H200 | 4x H200 SXM5 |
| Llama 3.1 405B | 405B | 8 | 2 | 1 | 16x H200 | 16x H200 SXM5 |
| Mixtral 8x7B | ~47B | 2 | 1 | 8 | 16x H200 | 16x H200 SXM5 |
| Mixtral 8x22B | ~141B | 4 | 1 | 8 | 32x H200 | 32x H200 SXM5 |
For the Llama 3.1 70B case, a 4-GPU H200 instance on Spheron provides enough VRAM and NVLink bandwidth to keep tensor parallel communication under 10% of total compute time. At 4-way TP, each GPU holds 35GB of model weights plus KV cache, well within H200's 141GB HBM3e.
MPI Launch Commands for TP and PP
```bash
# 4-way tensor parallel (single node, 4x H200)
mpirun -n 4 --allow-run-as-root \
  python -m tensorrt_llm.commands.run_api_server \
    --engine_dir /engines/llama70b-engine \
    --tp_size 4 \
    --pp_size 1 \
    --port 8000

# 8 TP + 2 PP (16 GPUs, 2 nodes)
mpirun -n 16 --allow-run-as-root \
  --hostfile /etc/mpi/hostfile \
  -x NCCL_DEBUG=INFO \
  python -m tensorrt_llm.commands.run_api_server \
    --engine_dir /engines/llama405b-engine \
    --tp_size 8 \
    --pp_size 2 \
    --port 8000
```

For MoE models (Mixtral), set `--ep_size` to match the number of experts being distributed. Expert parallelism is orthogonal to TP and PP; you can combine all three. A common Mixtral 8x7B config is TP=2, EP=8, PP=1 on 16x H200s (world_size = TP × PP × EP = 2 × 1 × 8 = 16).
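A wrong `-n` is the most common multi-node launch mistake, so a trivial guard that derives the MPI world size from the parallelism degrees is worth keeping in your launch scripts. A sketch mirroring the configs above:

```python
# Derive mpirun -n from the engine's parallelism degrees.
def world_size(tp: int, pp: int, ep: int = 1) -> int:
    return tp * pp * ep

configs = {
    "llama70b":    dict(tp=4, pp=1, ep=1),  # -> mpirun -n 4
    "llama405b":   dict(tp=8, pp=2, ep=1),  # -> mpirun -n 16
    "mixtral8x7b": dict(tp=2, pp=1, ep=8),  # -> mpirun -n 16
}
for name, c in configs.items():
    print(f"{name}: mpirun -n {world_size(**c)}")
```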
Triton + TRT-LLM Backend Deployment
For production-grade serving with metrics, model versioning, and multi-model routing, wire the TRT-LLM engine to Triton Inference Server via the Triton TensorRT-LLM backend. The Triton Inference Server deployment guide covers the full Triton setup; this section focuses on the TRT-LLM-specific wiring.
Model Repository Structure
```
model_repo/
├── preprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
├── tensorrt_llm/
│   ├── 1/
│   └── config.pbtxt      # points engine_dir to your compiled engine
├── postprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
└── ensemble/
    ├── 1/
    └── config.pbtxt      # chains pre -> trtllm -> post
```

Clone tensorrtllm_backend from the Triton GitHub org to get the baseline model repository templates and the preprocessing/postprocessing Python models.
The key config.pbtxt for the tensorrt_llm model:
```
backend: "tensorrtllm"
max_batch_size: 128
parameters {
  key: "engine_dir"
  value { string_value: "/engines/llama70b-engine" }
}
parameters {
  key: "executor_worker_path"
  value { string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" }
}
model_transaction_policy {
  decoupled: true
}
```

`decoupled: true` is required for streaming/async responses. Without it, Triton waits for the full generation before returning, which kills TTFT for long generations.
Launch and Health Check
```bash
docker run --gpus all --ipc=host \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /model_repo:/model_repo \
  -v /engines:/engines \
  nvcr.io/nvidia/tritonserver:26.04-trtllm-python-py3 \
  tritonserver \
    --model-repository /model_repo \
    --disable-auto-complete-config

# Readiness check
curl http://localhost:8000/v2/health/ready
```

Use the readiness probe at /v2/health/ready in your Kubernetes pod spec. Triton is not ready until all models are loaded and the TRT-LLM engine is initialized. Expect 30-90 seconds of cold start depending on model size.
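Outside Kubernetes - in CI smoke tests or deploy scripts - the same endpoint can gate traffic with a simple polling loop. A sketch; the 180-second timeout is an assumption to tune against your observed cold starts:

```python
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 180.0, interval_s: float = 2.0) -> None:
    """Poll Triton's readiness endpoint until all models are loaded."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/v2/health/ready", timeout=5).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # server still starting
        time.sleep(interval_s)
    raise TimeoutError(f"Triton not ready after {timeout_s}s")

wait_until_ready()
```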
Quantization Recipes: FP8, INT4 AWQ, and FP4
FP8 on H200
FP8 is the standard production quantization for H200. On MMLU benchmarks, FP8 PTQ degrades accuracy by 0.5-1.5% relative to FP16. That is acceptable for most chat and completion workloads.
Per-tensor FP8 (the default) is simpler and faster to calibrate. Per-channel FP8 gives slightly better accuracy on models with high weight variance across channels, like Mixtral's FFN layers, at the cost of a longer calibration run.
Use 512-1024 calibration samples from a dataset that represents your real traffic. Calibrating on Wikipedia text when you are serving code will skew the activation scales and push accuracy degradation toward the upper end of that range.
INT4 AWQ
AWQ (Activation-Aware Weight Quantization) gives better accuracy than GPTQ at the same INT4 bit width, by scaling weights based on activation magnitudes. The pipeline uses AutoAWQ first, then TRT-LLM's quantizer.
| Method | MMLU Accuracy Delta (vs FP16) | Model Size vs FP16 | Throughput Gain |
|---|---|---|---|
| FP8 PTQ | -0.5 to -1.5% | 50% | 1.3-1.5x |
| INT4 AWQ | -1.0 to -2.5% | 25% | 1.6-2.0x |
| GPTQ INT4 | -2.0 to -4.0% | 25% | 1.6-2.0x |
AWQ consistently outperforms GPTQ by 0.5-1.5% on MMLU at the same bit width, and the gap widens on reasoning benchmarks. For most teams, FP8 is the right default. INT4 AWQ is the fallback when VRAM is the binding constraint and you cannot add more GPUs.
FP4 on Blackwell (B200/GB200)
FP4 is a Blackwell-exclusive format requiring TensorRT 10.9+ and CUDA 12.8.1+. It halves VRAM usage relative to FP8: a 70B model fits in roughly 35GB, and a 405B model fits in around 200GB (within a 4-GPU B200 node). For a detailed look at the MXFP4 microscaling format and its calibration mechanics, see the companion post.
The accuracy trade-off is steeper than FP8. Without calibration, FP4 shows 5-8% MMLU degradation. With 1024+ calibration samples, you can bring this down to 2-3%. For accuracy-sensitive workloads, measure on a held-out test set before shipping to production.
For maximum throughput per dollar on Blackwell, run FP4 inference on a Spheron B200 SXM6 instance. At FP4, the B200's 8 TB/s memory bandwidth and 192GB VRAM give you headroom for batch sizes that would OOM on H200.
Benchmarks: TRT-LLM on H200 and B200
The figures below are for Llama 3.1 70B FP8 with TRT-LLM, projected from H100 baselines in the vLLM vs TensorRT-LLM vs SGLang benchmark using GPU memory bandwidth ratios (H200: 4.8 TB/s vs H100: 3.35 TB/s; B200: 8.0 TB/s vs H100: 3.35 TB/s). Actual numbers vary with batch size distribution, context length, and driver version.
Throughput (tokens/sec), TRT-LLM FP8, Llama 3.1 70B, Single GPU
| GPU | Batch 1 | Batch 8 | Batch 32 | Batch 128 |
|---|---|---|---|---|
| H200 SXM5 | ~180 | ~950 | ~2,900 | ~4,000 |
| B200 SXM6 | ~310 | ~1,600 | ~5,000 | ~6,700 |
TTFT (ms), TRT-LLM FP8
| GPU | Batch 1 | Batch 8 | Batch 32 | Batch 128 |
|---|---|---|---|---|
| H200 SXM5 | ~75 | ~85 | ~120 | ~220 |
| B200 SXM6 | ~45 | ~55 | ~85 | ~155 |
These are single-GPU figures for the 70B FP8 model. The H100 baseline methodology is in the companion benchmark post.
Cost Per 1M Tokens on Spheron: Spot vs On-Demand
Formula: cost_per_1M = (hourly_rate / throughput_tokens_per_sec) * 1,000,000 / 3,600
Using throughput at batch 128 (sustained production load) from the table above:
| GPU | Mode | $/hr | Throughput (tok/s) | Cost/1M tokens |
|---|---|---|---|---|
| H200 SXM5 | On-demand | $5.58 | ~4,000 | ~$0.39 |
| H200 SXM5 | Spot | $1.19 | ~4,000 | ~$0.08 |
| B200 SXM6 | Spot | $2.06 | ~6,700 | ~$0.09 |
B200 SXM6 on-demand pricing is not currently listed via the Spheron API. Both H200 and B200 spot options are available and show substantial savings for batch workloads.
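The cost column is just the formula above applied to the table's inputs; a quick sketch to recompute it as spot rates move:

```python
def cost_per_million(hourly_rate: float, tokens_per_sec: float) -> float:
    """$ per 1M generated tokens at sustained throughput."""
    return hourly_rate / (tokens_per_sec * 3600) * 1_000_000

# Inputs from the table above (batch 128, Llama 3.1 70B FP8):
print(f"H200 on-demand: ${cost_per_million(5.58, 4000):.2f}")  # ~$0.39
print(f"H200 spot:      ${cost_per_million(1.19, 4000):.2f}")  # ~$0.08
print(f"B200 spot:      ${cost_per_million(2.06, 6700):.2f}")  # ~$0.09
```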
Pricing fluctuates based on GPU availability. The prices above were captured on 26 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
The spot-for-build, on-demand-for-serving pattern works well here. Run the 25-45 minute engine build on a B200 spot instance at $2.06/hr, for a total build cost under $2 per engine. Store artifacts in object storage and load them directly for serving without rebuilding, unless the TRT version changes.
Migration Checklist: Moving a vLLM Deployment to TRT-LLM
See our vLLM production deployment guide for the baseline setup you are migrating from.
- Confirm your model architecture is in the TRT-LLM supported list (check `examples/` in the NGC container)
- Map your current quantization format (FP16/FP8/AWQ) to the TRT-LLM equivalent
- Benchmark peak throughput on a representative traffic profile before migration - this is your baseline number
- Convert weights and run calibration on a spot instance to minimize cost
- Build the engine and validate accuracy vs the FP16 baseline; a delta under 1% on MMLU is acceptable for most chat workloads
- Load-test the TRT-LLM endpoint at P99 with synthetic traffic; use patterns from your vLLM traffic logs
- Update your client BASE_URL: the Triton TensorRT-LLM backend serves at `/v2/models/{model_name}/generate`; the `trtllm-serve` OpenAI-compatible mode serves at `/v1/chat/completions` (see the client sketch below)
- Monitor TTFT and throughput for 24h; compare against your vLLM baseline before routing production traffic. If TRT-LLM is not the right fit for your architecture, SGLang production deployment is another migration target worth evaluating.
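For the trtllm-serve path, the client change is usually just a base-URL swap if your code already speaks the OpenAI API. A minimal sketch - the port and served-model name are hypothetical and deployment-specific:

```python
from openai import OpenAI

# trtllm-serve exposes an OpenAI-compatible API; only the base_url changes
# relative to a vLLM deployment. The model name is deployment-specific.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # hypothetical served-model name
    messages=[{"role": "user", "content": "Health check: reply OK."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```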
Troubleshooting: OOM, Kernel Launch Failures, and Engine Rebuild Loops
OOM During Engine Build
```
RuntimeError: [TRT] Engine build failed. Out of GPU memory.
```

Try these in order:

- Drop `--max_batch_size` from 128 to 64 or 32
- Drop `--max_input_len` to the actual P99 input length of your traffic (not the theoretical model maximum)
- Add `--strongly_typed` to reduce memory usage during graph optimization
OOM at Runtime
Lower kv_cache_config.free_gpu_memory_fraction from 0.9 to 0.7. This reserves less VRAM for KV cache and leaves a larger safety margin for the executor. If you are still hitting OOM, enable enable_block_reuse to reuse KV cache blocks across requests with shared prefixes.
Kernel Launch Failures
```
CudaException(cuLaunchKernel) failed: invalid device function
```

This is almost always a compute capability mismatch: the engine was built targeting one GPU architecture and is running on a different one. Check the `--gpt_attention_plugin` flag in your build command and confirm it matches the target GPU. Rebuild with the correct architecture.
Also verify your driver version: nvidia-smi | grep "Driver Version". CUDA 13.1 (TRT-LLM 1.2.0) requires 590.44.01+.
Engine Rebuild Loops
TRT-LLM engine artifacts are tightly coupled to the TensorRT version. Any TRT upgrade invalidates the cached engine and forces a rebuild. Keep engine artifacts versioned alongside the TRT version tag used to build them. A naming convention like llama70b-fp8-trt1.2.0-h200/ in object storage avoids confusion when managing multiple engine versions across a fleet.
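A sketch of that convention in practice, assuming S3-compatible object storage via boto3 - the bucket name and key layout are hypothetical, and the script is meant to run inside the build container so the recorded version matches the build:

```python
from pathlib import Path
import boto3
import tensorrt_llm  # run inside the NGC build container

# Hypothetical bucket and layout -- the point is the version-tagged prefix.
BUCKET = "llm-engine-artifacts"
ENGINE_DIR = Path("/engines/llama70b-engine")
prefix = f"llama70b-fp8-trt{tensorrt_llm.__version__}-h200/"

s3 = boto3.client("s3")
for f in ENGINE_DIR.rglob("*"):
    if f.is_file():
        s3.upload_file(str(f), BUCKET, prefix + str(f.relative_to(ENGINE_DIR)))
print(f"uploaded engine to s3://{BUCKET}/{prefix}")
```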
TRT-LLM engine builds run best on short-burst GPU instances. Launch a spot H200 or B200 for the build phase and switch to on-demand for the serving endpoint, all from the same Spheron dashboard.
