PyTorch-native serving is fast to set up but leaves 20-40% throughput on the table. TensorRT-LLM closes that gap through compiled execution plans, fused CUDA kernels, and an in-flight batching executor purpose-built for autoregressive decoding.
This guide covers the full production path: engine build pipeline, FP8/INT4/FP4 quantization, tensor and pipeline parallelism for 70B+ models, Triton backend wiring, and cost benchmarks on H200 and B200.
TensorRT-LLM vs PyTorch-Native Engines: When TRT-LLM Wins
TensorRT-LLM compiles your model weights and architecture into a CUDA kernel graph optimized for a specific GPU SKU, batch size range, and sequence length. At inference time, the runtime executes this precompiled graph directly - no Python overhead, no dynamic dispatch, no op-level JIT. The upside is hardware utilization that matches or beats hand-tuned C++ code. The downside is that every change to the model, batch size ceiling, or sequence length forces a rebuild. If your model changes weekly, the 25-45 minute build cycle is a real operational cost.
TRT-LLM is the right call when your model is stable, you are at or near capacity, and you need to extract every token per second from the hardware. vLLM is the better starting point when you need model flexibility or fast iteration.
Hopper (H100/H200) vs Blackwell (B200/GB200) Acceleration
On Hopper, TRT-LLM exploits the Transformer Engine's FP8 tensor cores natively. The FP8 attention and MLP kernels are fused, and the runtime handles per-tensor scaling factors automatically. Hopper FP8 typically gives a 1.3-1.5x throughput improvement over FP16 on the same hardware.
Blackwell changes the economics again. The B200 SXM6 adds FP4 tensor core support and raises memory bandwidth from the H200's 4.8 TB/s to 8 TB/s. TRT-LLM 0.19.0+ includes FP4 kernels for Blackwell, though FP4 requires TensorRT 10.9+ and CUDA 12.8 or later. At FP4, a 70B model fits in roughly 35GB, leaving the rest of the B200's 192GB for KV cache and sustaining significantly larger batch sizes than FP8 on H200.
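To make the memory arithmetic concrete, here is a back-of-the-envelope sketch covering weights only; KV cache, activations, and runtime overhead come on top:

```python
# Rough model-weight footprint per precision. Weights only --
# KV cache, activations, and runtime overhead are extra.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5, "fp4": 0.5}

def weight_gb(num_params: float, fmt: str) -> float:
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in ("fp16", "fp8", "fp4"):
    print(f"70B @ {fmt}: ~{weight_gb(70e9, fmt):.0f} GB")
# 70B @ fp16: ~140 GB
# 70B @ fp8:  ~70 GB
# 70B @ fp4:  ~35 GB -> ~157 GB of a B200's 192 GB left for KV cache
```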
Engine Build Pipeline
Prerequisites and Docker Setup
Use the official NGC container. Trying to install TRT-LLM via pip and managing the CUDA/cuDNN/TensorRT dependency chain by hand will cost you a day. The NGC containers ship with the right versions already pinned.
```bash
# Verify your driver meets the minimum for CUDA 13.1
nvidia-smi | grep "Driver Version"
# Required: 590.44.01+ on Linux

docker pull nvcr.io/nvidia/tensorrt-llm/release:1.2.0
```

The container tag here (1.2.0) is based on CUDA 13.1.0 and requires driver 590.44.01 or later. If you are on an older driver, pull the 1.0.0 container (CUDA 12.6) instead. CUDA version mismatches produce cryptic build errors, so verify before starting.
Checkpoint Conversion
TRT-LLM does not read HuggingFace weights directly at engine build time. You first convert to TRT-LLM checkpoint format using convert_checkpoint.py from the examples directory. This step is architecture-specific.
```bash
docker run --gpus all --ipc=host \
  -v /path/to/hf-model:/models \
  -v /path/to/output:/output \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
  python examples/llama/convert_checkpoint.py \
    --model_dir /models/Llama-3.1-70B-Instruct \
    --output_dir /output/llama70b-checkpoint \
    --dtype float16 \
    --tp_size 4 \
    --pp_size 1
```

Supported architectures include Llama, Mistral, Mixtral (MoE), Falcon, Gemma, Phi, and Qwen. Check the `examples/` directory in the NGC container for the full list. The `--tp_size` and `--pp_size` flags set the parallelism strategy and must match what you pass to `trtllm-build` later.
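Before moving on, it is worth asserting that the converted checkpoint carries the parallelism you intend to build with. A minimal sketch, assuming a recent TRT-LLM version that records the mapping in the checkpoint's config.json (paths and expected values here are hypothetical; adjust the key names if your version differs):

```python
import json
from pathlib import Path

# Hypothetical paths/values -- match them to your own pipeline.
CKPT_DIR = Path("/output/llama70b-checkpoint")
EXPECTED = {"tp_size": 4, "pp_size": 1}

# Recent TRT-LLM versions record the parallelism mapping in the
# checkpoint's config.json; adjust key names if yours differs.
mapping = json.loads((CKPT_DIR / "config.json").read_text()).get("mapping", {})
for key, want in EXPECTED.items():
    got = mapping.get(key)
    assert got == want, f"checkpoint has {key}={got}, build expects {want}"
print("checkpoint mapping matches the planned trtllm-build flags")
```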
Quantization Calibration
FP8 post-training quantization (PTQ) is the most common production path. The calibration step runs a small dataset through the model to compute the activation scales needed for accurate FP8 inference.
```bash
docker run --gpus all --ipc=host \
  -v /models:/models \
  -v /engines:/engines \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
  python examples/quantization/quantize.py \
    --model_dir /models/Llama-3.1-70B-Instruct \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /engines/fp8-checkpoint \
    --calib_size 512 \
    --tp_size 4
```

512 calibration samples is a reasonable default. Going above 1024 gives diminishing accuracy returns for most models. The calibration run itself takes 5-10 minutes on a single H200. Use data that matches your real traffic distribution - calibrating on Wikipedia text when you are serving code will skew the activation scales.
trtllm-build Configuration
```bash
docker run --gpus all --ipc=host \
  -v /engines:/engines \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
  trtllm-build \
    --checkpoint_dir /engines/fp8-checkpoint \
    --output_dir /engines/llama70b-engine \
    --gemm_plugin fp8 \
    --gpt_attention_plugin fp8 \
    --max_batch_size 128 \
    --max_input_len 8192 \
    --max_seq_len 10240 \
    --workers 4 \
    --tp_size 4
```

Key flags:

- `--max_batch_size 128`: increase if you expect high concurrent request volume. Higher values also increase VRAM usage during build; if you hit OOM during build, drop to 64 or 32.
- `--max_input_len` and `--max_seq_len`: the engine rejects requests exceeding these limits. Size them based on your P99 traffic, not the theoretical model maximum.
- `--workers 4`: parallel compilation workers. Match to your CPU core count.
- `--tp_size 4`: must match the `--tp_size` passed to `quantize.py` (or `convert_checkpoint.py` in the non-FP8 path). The engine is compiled for a specific tensor parallelism degree; mismatching this with the launch command causes an immediate error.
Build time for a 70B FP8 model on a single H200 is 25-45 minutes. For a spot-optimized workflow: run the build on a spot instance at roughly $2/hr (for B200 spot) and store the engine artifacts in object storage. The engine is reusable until your TRT version changes.
In-Flight Batching, Chunked Prefill, and Paged KV Cache
TRT-LLM's executor API handles continuous batching natively. When a request finishes generation, the executor immediately fills its slot with a new request without waiting for the entire batch to complete. This is the core mechanism behind the throughput advantage over static batching approaches.
Executor API Config
```json
{
  "executor_config": {
    "max_batch_size": 128,
    "max_num_tokens": 16384,
    "enable_chunked_prefill": true,
    "kv_cache_config": {
      "free_gpu_memory_fraction": 0.85,
      "enable_block_reuse": true
    }
  }
}
```

What each parameter does:

- `max_batch_size`: maximum concurrent requests in a batch. Set based on your peak QPS estimate.
- `max_num_tokens`: total token budget across all active requests per iteration. This bounds memory and compute per step. Start at 8192-16384 and increase if you see GPU idle time.
- `enable_chunked_prefill`: splits long prefills into chunks so they do not block decode steps. This reduces TTFT variance at long context (8K+ input length). The cost is a marginal throughput reduction at low batch sizes.
- `free_gpu_memory_fraction`: fraction of the VRAM remaining after model weights that goes to KV cache. Default is 0.9; back it off to 0.7-0.8 if you see runtime OOM errors.
- `enable_block_reuse`: caches KV blocks for shared prefixes (like system prompts). The same optimization as vLLM's prefix caching, just under a different name. See the KV cache optimization guide for a deeper look at block reuse strategies.
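Sizing `max_num_tokens` and `free_gpu_memory_fraction` is easier once you know what one token of KV cache costs. A rough sketch for Llama 3.1 70B with an FP8 KV cache on a single H200, using the model's published dimensions (80 layers, 8 KV heads via GQA, head_dim 128); treat the outputs as estimates, not guarantees:

```python
# Back-of-the-envelope KV cache capacity: single H200, Llama 3.1 70B, FP8.
layers, kv_heads, head_dim = 80, 8, 128            # published Llama 3.1 70B dims
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 1  # K+V, 1 byte/elem (FP8)

hbm_gb, weights_gb, frac = 141, 70, 0.85  # H200 HBM, FP8 weights, memory fraction
kv_budget = (hbm_gb - weights_gb) * 1e9 * frac

print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")     # ~160 KiB
print(f"~{kv_budget / kv_bytes_per_token / 1e3:.0f}K cacheable tokens")  # ~368K
```

At roughly 160 KiB per token, a `max_num_tokens` of 16384 consumes about 2.6 GB per iteration, comfortably inside the ~60 GB KV budget in this configuration.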
Multi-GPU Deployment: TP, PP, and EP Configs
For models too large to fit on a single GPU, TRT-LLM uses MPI for inter-GPU coordination. You set the parallelism strategy at checkpoint conversion time, build the engine with matching flags, and launch with mpirun.
70B+ Model Parallelism Strategies
| Model | Params | TP | PP | EP | Min GPUs | Recommended |
|---|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | 4 | 1 | 1 | 4x H200 | 4x H200 SXM5 |
| Llama 3.1 405B | 405B | 8 | 2 | 1 | 16x H200 | 16x H200 SXM5 |
| Mixtral 8x7B | ~47B | 2 | 1 | 8 | 16x H200 | 16x H200 SXM5 |
| Mixtral 8x22B | ~141B | 4 | 1 | 8 | 32x H200 | 32x H200 SXM5 |
For the Llama 3.1 70B case, a 4-GPU H200 instance on Spheron provides enough VRAM and NVLink bandwidth to keep tensor parallel communication under 10% of total compute time. At 4-way TP, each GPU holds 35GB of model weights plus KV cache, well within H200's 141GB HBM3e.
MPI Launch Commands for TP and PP
```bash
# 4-way tensor parallel (single node, 4x H200)
mpirun -n 4 --allow-run-as-root \
  python -m tensorrt_llm.commands.run_api_server \
    --engine_dir /engines/llama70b-engine \
    --tp_size 4 \
    --pp_size 1 \
    --port 8000

# 8 TP + 2 PP (16 GPUs, 2 nodes)
mpirun -n 16 --allow-run-as-root \
  --hostfile /etc/mpi/hostfile \
  -x NCCL_DEBUG=INFO \
  python -m tensorrt_llm.commands.run_api_server \
    --engine_dir /engines/llama405b-engine \
    --tp_size 8 \
    --pp_size 2 \
    --port 8000
```

For MoE models (Mixtral), set `--ep_size` to match the number of experts being distributed. Expert parallelism is orthogonal to TP and PP; you can combine all three. A common Mixtral 8x7B config is TP=2, EP=8, PP=1 on 16x H200s (world_size = TP × PP × EP = 2 × 1 × 8 = 16).
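A wrong `-n` is the most common multi-node launch mistake, so a trivial guard that derives the MPI world size from the parallelism degrees is worth keeping in your launch scripts. A sketch mirroring the configs above:

```python
# Derive mpirun -n from the engine's parallelism degrees.
def world_size(tp: int, pp: int, ep: int = 1) -> int:
    return tp * pp * ep

configs = {
    "llama70b":    dict(tp=4, pp=1, ep=1),  # -> mpirun -n 4
    "llama405b":   dict(tp=8, pp=2, ep=1),  # -> mpirun -n 16
    "mixtral8x7b": dict(tp=2, pp=1, ep=8),  # -> mpirun -n 16
}
for name, c in configs.items():
    print(f"{name}: mpirun -n {world_size(**c)}")
```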
Triton + TRT-LLM Backend Deployment
For production-grade serving with metrics, model versioning, and multi-model routing, wire the TRT-LLM engine to Triton Inference Server via the Triton TensorRT-LLM backend. The Triton Inference Server deployment guide covers the full Triton setup; this section focuses on the TRT-LLM-specific wiring.
Model Repository Structure
```
model_repo/
├── preprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
├── tensorrt_llm/
│   ├── 1/
│   └── config.pbtxt      # points engine_dir to your compiled engine
├── postprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
└── ensemble/
    ├── 1/
    └── config.pbtxt      # chains pre -> trtllm -> post
```

Clone tensorrtllm_backend from the Triton GitHub org to get the baseline model repository templates and the preprocessing/postprocessing Python models.
The key config.pbtxt for the tensorrt_llm model:
```
backend: "tensorrtllm"
max_batch_size: 128
parameters {
  key: "engine_dir"
  value { string_value: "/engines/llama70b-engine" }
}
parameters {
  key: "executor_worker_path"
  value { string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" }
}
model_transaction_policy {
  decoupled: true
}
```

`decoupled: true` is required for streaming/async responses. Without it, Triton waits for the full generation before returning, which kills TTFT for long generations.
Launch and Health Check
```bash
docker run --gpus all --ipc=host \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /model_repo:/model_repo \
  -v /engines:/engines \
  nvcr.io/nvidia/tritonserver:26.04-trtllm-python-py3 \
  tritonserver \
    --model-repository /model_repo \
    --disable-auto-complete-config

# Readiness check
curl http://localhost:8000/v2/health/ready
```

Use the readiness probe at /v2/health/ready in your Kubernetes pod spec. Triton is not ready until all models are loaded and the TRT-LLM engine is initialized. Expect 30-90 seconds of cold start depending on model size.
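Outside Kubernetes - in CI smoke tests or deploy scripts - the same endpoint can gate traffic with a simple polling loop. A sketch; the 180-second timeout is an assumption to tune against your observed cold starts:

```python
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 180.0, interval_s: float = 2.0) -> None:
    """Poll Triton's readiness endpoint until all models are loaded."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/v2/health/ready", timeout=5).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # server still starting
        time.sleep(interval_s)
    raise TimeoutError(f"Triton not ready after {timeout_s}s")

wait_until_ready()
```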
Quantization Recipes: FP8, INT4 AWQ, and FP4
FP8 on H200
FP8 is the standard production quantization for H200. On MMLU benchmarks, FP8 PTQ degrades accuracy by 0.5-1.5% relative to FP16. That is acceptable for most chat and completion workloads.
Per-tensor FP8 (the default) is simpler and faster to calibrate. Per-channel FP8 gives slightly better accuracy on models with high weight variance across channels, like Mixtral's FFN layers, at the cost of a longer calibration run.
Use 512-1024 calibration samples from a dataset that represents your real traffic. Calibrating on Wikipedia text when you are serving code will skew the activation scales and push accuracy degradation toward the upper end of that range.
INT4 AWQ
AWQ (Activation-Aware Weight Quantization) gives better accuracy than GPTQ at the same INT4 bit width, by scaling weights based on activation magnitudes. The pipeline uses AutoAWQ first, then TRT-LLM's quantizer.
| Method | MMLU Accuracy Delta (vs FP16) | Model Size vs FP16 | Throughput Gain |
|---|---|---|---|
| FP8 PTQ | -0.5 to -1.5% | 50% | 1.3-1.5x |
| INT4 AWQ | -1.0 to -2.5% | 25% | 1.6-2.0x |
| GPTQ INT4 | -2.0 to -4.0% | 25% | 1.6-2.0x |
AWQ consistently outperforms GPTQ by 0.5-1.5% on MMLU at the same bit width, and the gap widens on reasoning benchmarks. For most teams, FP8 is the right default. INT4 AWQ is the fallback when VRAM is the binding constraint and you cannot add more GPUs.
FP4 on Blackwell (B200/GB200)
FP4 is a Blackwell-exclusive format requiring TensorRT 10.9+ and CUDA 12.8.1+. It halves VRAM usage relative to FP8: a 70B model fits in roughly 35GB, and a 405B model fits in around 200GB (within a 4-GPU B200 node). For a detailed look at the MXFP4 microscaling format and its calibration mechanics, see the companion post.
The accuracy trade-off is steeper than FP8. Without calibration, FP4 shows 5-8% MMLU degradation. With 1024+ calibration samples, you can bring this down to 2-3%. For accuracy-sensitive workloads, measure on a held-out test set before shipping to production.
For maximum throughput per dollar on Blackwell, run FP4 inference on a Spheron B200 SXM6 instance. At FP4, the B200's 8 TB/s memory bandwidth and 192GB VRAM give you headroom for batch sizes that would OOM on H200.
Benchmarks: TRT-LLM on H200 and B200
The figures below are for Llama 3.1 70B FP8 with TRT-LLM, projected from H100 baselines in the vLLM vs TensorRT-LLM vs SGLang benchmark using GPU memory bandwidth ratios (H200: 4.8 TB/s vs H100: 3.35 TB/s; B200: 8.0 TB/s vs H100: 3.35 TB/s). Actual numbers vary with batch size distribution, context length, and driver version.
Throughput (tokens/sec), TRT-LLM FP8, Llama 3.1 70B, Single GPU
| GPU | Batch 1 | Batch 8 | Batch 32 | Batch 128 |
|---|---|---|---|---|
| H200 SXM5 | ~180 | ~950 | ~2,900 | ~4,000 |
| B200 SXM6 | ~310 | ~1,600 | ~5,000 | ~6,700 |
TTFT (ms), TRT-LLM FP8
| GPU | Batch 1 | Batch 8 | Batch 32 | Batch 128 |
|---|---|---|---|---|
| H200 SXM5 | ~75 | ~85 | ~120 | ~220 |
| B200 SXM6 | ~45 | ~55 | ~85 | ~155 |
These are single-GPU figures for the 70B FP8 model. The H100 baseline methodology is in the companion benchmark post.
Cost Per 1M Tokens on Spheron: Spot vs On-Demand
Formula: cost_per_1M = (hourly_rate / throughput_tokens_per_sec) * 1,000,000 / 3,600
Using throughput at batch 128 (sustained production load) from the table above:
| GPU | Mode | $/hr | Throughput (tok/s) | Cost/1M tokens |
|---|---|---|---|---|
| H200 SXM5 | On-demand | $5.58 | ~4,000 | ~$0.39 |
| H200 SXM5 | Spot | $1.19 | ~4,000 | ~$0.08 |
| B200 SXM6 | Spot | $2.06 | ~6,700 | ~$0.09 |
B200 SXM6 on-demand pricing is not currently listed via the Spheron API. Both H200 and B200 spot options are available and show substantial savings for batch workloads.
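The cost column is just the formula above applied to the table's inputs; a quick sketch to recompute it as spot rates move:

```python
def cost_per_million(hourly_rate: float, tokens_per_sec: float) -> float:
    """$ per 1M generated tokens at sustained throughput."""
    return hourly_rate / (tokens_per_sec * 3600) * 1_000_000

# Inputs from the table above (batch 128, Llama 3.1 70B FP8):
print(f"H200 on-demand: ${cost_per_million(5.58, 4000):.2f}")  # ~$0.39
print(f"H200 spot:      ${cost_per_million(1.19, 4000):.2f}")  # ~$0.08
print(f"B200 spot:      ${cost_per_million(2.06, 6700):.2f}")  # ~$0.09
```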
Pricing fluctuates based on GPU availability. The prices above were captured on 26 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
The spot-for-build, on-demand-for-serving pattern works well here. Run the 25-45 minute engine build on a B200 spot instance at $2.06/hr, for a total build cost under $2 per engine. Store artifacts in object storage and load them directly for serving without rebuilding, unless the TRT version changes.
Migration Checklist: Moving a vLLM Deployment to TRT-LLM
See our vLLM production deployment guide for the baseline setup you are migrating from.
- Confirm your model architecture is in the TRT-LLM supported list (check `examples/` in the NGC container)
- Map your current quantization format (FP16/FP8/AWQ) to the TRT-LLM equivalent
- Benchmark peak throughput on a representative traffic profile before migration - this is your baseline number
- Convert weights and run calibration on a spot instance to minimize cost
- Build the engine and validate accuracy vs the FP16 baseline; a delta under 1% on MMLU is acceptable for most chat workloads
- Load-test the TRT-LLM endpoint at P99 with synthetic traffic; use patterns from your vLLM traffic logs
- Update your client BASE_URL: the Triton TensorRT-LLM backend serves at `/v2/models/{model_name}/generate`; the `trtllm-serve` OpenAI-compatible mode serves at `/v1/chat/completions` (see the client sketch below)
- Monitor TTFT and throughput for 24h; compare against your vLLM baseline before routing production traffic. If TRT-LLM is not the right fit for your architecture, SGLang production deployment is another migration target worth evaluating.
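For the trtllm-serve path, the client change is usually just a base-URL swap if your code already speaks the OpenAI API. A minimal sketch - the port and served-model name are hypothetical and deployment-specific:

```python
from openai import OpenAI

# trtllm-serve exposes an OpenAI-compatible API; only the base_url changes
# relative to a vLLM deployment. The model name is deployment-specific.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # hypothetical served-model name
    messages=[{"role": "user", "content": "Health check: reply OK."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```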
Troubleshooting: OOM, Kernel Launch Failures, and Engine Rebuild Loops
OOM During Engine Build
```
RuntimeError: [TRT] Engine build failed. Out of GPU memory.
```

Try these in order:

- Drop `--max_batch_size` from 128 to 64 or 32
- Drop `--max_input_len` to the actual P99 input length of your traffic (not the theoretical model maximum)
- Add `--strongly_typed` to reduce memory usage during graph optimization
OOM at Runtime
Lower kv_cache_config.free_gpu_memory_fraction from 0.9 to 0.7. This reserves less VRAM for KV cache and leaves a larger safety margin for the executor. If you are still hitting OOM, enable enable_block_reuse to reuse KV cache blocks across requests with shared prefixes.
Kernel Launch Failures
```
CudaException(cuLaunchKernel) failed: invalid device function
```

This is almost always a compute capability mismatch: the engine was built targeting one GPU architecture and is running on a different one. Check the `--gpt_attention_plugin` flag in your build command and confirm it matches the target GPU. Rebuild with the correct architecture.
Also verify your driver version: nvidia-smi | grep "Driver Version". CUDA 13.1 (TRT-LLM 1.2.0) requires 590.44.01+.
Engine Rebuild Loops
TRT-LLM engine artifacts are tightly coupled to the TensorRT version. Any TRT upgrade invalidates the cached engine and forces a rebuild. Keep engine artifacts versioned alongside the TRT version tag used to build them. A naming convention like llama70b-fp8-trt1.2.0-h200/ in object storage avoids confusion when managing multiple engine versions across a fleet.
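A sketch of that convention in practice, assuming S3-compatible object storage via boto3 - the bucket name and key layout are hypothetical, and the script is meant to run inside the build container so the recorded version matches the build:

```python
from pathlib import Path
import boto3
import tensorrt_llm  # run inside the NGC build container

# Hypothetical bucket and layout -- the point is the version-tagged prefix.
BUCKET = "llm-engine-artifacts"
ENGINE_DIR = Path("/engines/llama70b-engine")
prefix = f"llama70b-fp8-trt{tensorrt_llm.__version__}-h200/"

s3 = boto3.client("s3")
for f in ENGINE_DIR.rglob("*"):
    if f.is_file():
        s3.upload_file(str(f), BUCKET, prefix + str(f.relative_to(ENGINE_DIR)))
print(f"uploaded engine to s3://{BUCKET}/{prefix}")
```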
TRT-LLM engine builds run best on short-burst GPU instances. Launch a spot H200 or B200 for the build phase and switch to on-demand for the serving endpoint, all from the same Spheron dashboard.
