Most teams reach for vLLM or SGLang when they need to serve an LLM in production. That works fine when you have one model type. When your fleet needs to serve an LLM, an embedding model, and a vision encoder simultaneously on the same GPUs, NVIDIA Triton Inference Server is the production standard. It runs multiple backends in a single server process, handles concurrent requests across all of them, and gives you the request routing and batching controls that purpose-built LLM servers skip entirely.
Why Triton Inference Server: Multi-Framework, Multi-Model Serving
Triton's core distinction from vLLM or SGLang is that it's backend-agnostic. A single Triton process can simultaneously serve:
- An LLM via the vLLM backend (PagedAttention, continuous batching)
- A CLIP image encoder via the ONNX Runtime backend
- A BERT-based reranker via the TensorRT backend
- A custom preprocessing step via the Python backend
Practical example: you're building a multimodal RAG pipeline. Your query comes in, hits a CLIP encoder to generate embeddings, retrieves documents, runs a reranker, and sends the combined context to Llama 3.3 70B for synthesis. With Triton, all four models sit in the same server, and you define the routing in an ensemble config. Without Triton, you're running four separate servers and writing the inter-service coordination yourself.
Triton exposes both HTTP/REST on port 8000 and gRPC on port 8001. Port 8002 serves Prometheus metrics for GPU utilization, queue depth, and per-model inference latency. All three ports are active from a single docker run command.
For background on vLLM's production deployment patterns, see the vLLM production deployment guide.
Triton Architecture: Model Repository, Schedulers, and Backends
Model Repository
The model repository is a filesystem directory that Triton watches at startup and, optionally, at runtime. Layout:
```
model_repository/
  my_model/
    config.pbtxt          # Model configuration
    1/                    # Version directory
      model.onnx          # Model weights (or model.pt, model.plan, etc.)
    2/                    # Optional: second version
      model.onnx
```

The version number is a directory name. Triton loads the latest version by default and can serve multiple versions simultaneously if you configure it. Every model directory requires a config.pbtxt.
Schedulers
Triton has three scheduling modes:
Default scheduler: One request at a time, no batching. Used for stateful models or when you want predictable latency with no batch overhead.
Dynamic batching: Groups incoming requests into batches automatically. You set preferred_batch_size and max_queue_delay_microseconds. Requests wait up to the delay limit for the batch to fill, then execute. Good for most inference workloads where throughput matters more than single-request latency.
Sequence batching: For stateful models where requests belong to a session (RNNs, stateful decoders). Triton routes all requests from the same sequence ID to the same model instance.
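As an illustration of the sequence-batching mode, here is a hedged config sketch for a hypothetical stateful decoder (the model name, backend, and idle timeout are placeholders; the control_input blocks follow Triton's sequence_batching schema, which injects START/END signals into the model on each request):

```
name: "stateful_decoder"
backend: "pytorch"
max_batch_size: 8
sequence_batching {
  max_sequence_idle_microseconds: 5000000   # reclaim a slot after 5s of inactivity
  control_input [
    {
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "END"
      control [
        {
          kind: CONTROL_SEQUENCE_END
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
}
```

Clients attach a sequence ID to each request; Triton pins that sequence to one model instance so the model's internal state stays consistent across the session.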
Backends
| Backend | Use case |
|---|---|
| tensorrt | TensorRT-compiled engines, highest throughput on NVIDIA GPUs |
| pytorch | TorchScript models |
| onnxruntime | ONNX models, broad framework support |
| python | Custom Python inference code, preprocessing, postprocessing |
| vllm | LLMs with PagedAttention and continuous batching |
| openvino | Intel CPU inference (rarely used on GPU fleets) |
Minimal config.pbtxt (ONNX backend)
```
name: "clip_encoder"
backend: "onnxruntime"
max_batch_size: 32
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "image_embeds"
    data_type: TYPE_FP32
    dims: [ 512 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000
}
```

GPU Cloud Requirements: VRAM Sizing for Concurrent Model Serving
Multi-model serving changes how you size GPU memory. You're not sizing for one model's peak; you're sizing for the sum of all resident models plus batching buffers.
| Setup | Models | GPU Recommendation | VRAM Needed |
|---|---|---|---|
| LLM only | Llama 3.3 70B FP8 | H100 80GB | ~40GB |
| LLM + embeddings | Llama 3.1 8B + BGE-M3 | L40S 48GB | ~18GB |
| LLM + vision | Llama 3.2 11B VLM + CLIP | A100 80GB | ~30GB |
| Full stack | Llama 3.3 70B + CLIP + BERT | H100 80GB | ~55GB |
Rule of thumb: total VRAM = sum of all resident model sizes (in their quantized form) + 15% headroom for dynamic batching buffers and KV cache. If a single model exceeds 40GB, you need an 80GB card. For more on managing KV cache memory, see the KV Cache Optimization Guide.
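The rule of thumb above reduces to a few lines of arithmetic. A minimal sketch (the model sizes below are illustrative; use whatever your quantized checkpoints actually occupy on disk):

```python
def required_vram_gb(model_sizes_gb, headroom=0.15):
    """Sum all resident model sizes, then add headroom for dynamic
    batching buffers, KV cache, and CUDA overhead."""
    return sum(model_sizes_gb) * (1 + headroom)

# Full-stack example: quantized 70B LLM (~40GB) + CLIP (~0.6GB) + BERT reranker (~1GB)
needed = required_vram_gb([40, 0.6, 1])
print(f"{needed:.1f} GB")  # ~47.8 GB -> needs an 80GB card, not a 48GB one
```

Run this against your own checkpoint sizes before picking a GPU; if the result lands within a few GB of a card's capacity, size up rather than rely on gpu_memory_utilization tuning.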
Current GPU pricing on Spheron:
| GPU | VRAM | On-Demand (lowest) | Spot (lowest) |
|---|---|---|---|
| L40S | 48GB | $0.72/hr | - |
| A100 80GB PCIe | 80GB | $1.07/hr | - |
| H100 80GB PCIe | 80GB | $2.01/hr | - |
| H100 SXM5 | 80GB | $4.41/hr | - |
| B200 SXM6 | 192GB | $7.43/hr | $1.71/hr |
Pricing fluctuates based on GPU availability. The prices above are a 12 Apr 2026 snapshot and may have changed. Check current GPU pricing → for live rates.
Step-by-Step: Deploy Triton on GPU Cloud with Docker
Prerequisites
- NVIDIA GPU instance on Spheron with Docker and the NVIDIA Container Toolkit installed
- Docker image access to nvcr.io/nvidia/tritonserver:24.12-py3
- Model weights in a supported format (ONNX, TorchScript, TensorRT engine, or a Hugging Face model name for the vLLM backend)
Step 1: Pull the Triton image
```bash
# For non-vLLM backends (TensorRT, ONNX, PyTorch, Python)
docker pull nvcr.io/nvidia/tritonserver:24.12-py3

# For the vLLM backend (LLM serving with PagedAttention)
docker pull nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3
```

The base py3 image is ~9.9GB. Pull the appropriate image while you set up your model repository.
Step 2: Create the model repository
```bash
# Create model repository structure
mkdir -p model_repository/my_model/1

# Place your model file (example: ONNX)
cp /path/to/model.onnx model_repository/my_model/1/model.onnx
```

Add a config.pbtxt for a PyTorch (TorchScript) model:
```
name: "my_model"
backend: "pytorch"
max_batch_size: 16
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 128 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 10000
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```

Step 3: Launch Triton
```bash
docker run --gpus all --rm \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.12-py3 \
  tritonserver --model-repository=/models
```

Watch the startup logs. You'll see each model load with its backend, version, and status. A successful load looks like:
```
I0101 tritonserver.cc] Started HTTPService at 0.0.0.0:8000
I0101 tritonserver.cc] Started GRPCInferenceService at 0.0.0.0:8001
I0101 tritonserver.cc] Started Metrics Service at 0.0.0.0:8002
```

Step 4: Verify the server is ready
```bash
# Server health
curl http://localhost:8000/v2/health/ready

# Model status
curl http://localhost:8000/v2/models/my_model/ready
```

Both return HTTP 200 when the server and model are ready.
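In deployment scripts you'll usually want to poll these endpoints rather than check once. A small stdlib-only helper (illustrative; it only hits the health endpoints shown above) that gates traffic on readiness:

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(base_url="http://localhost:8000", model=None, timeout_s=60):
    """Poll Triton's readiness endpoint until it returns 200 or the timeout expires."""
    path = f"/v2/models/{model}/ready" if model else "/v2/health/ready"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + path, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry until deadline
        time.sleep(1)
    return False
```

Call wait_until_ready(model="my_model") after docker run and before routing production traffic; it returns False if the model never loads within the timeout.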
Step 5: Send an inference request via Python
```python
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build input tensor
input_data = np.random.rand(1, 128).astype(np.float32)
inputs = [httpclient.InferInput("input__0", input_data.shape, "FP32")]
inputs[0].set_data_from_numpy(input_data)

# Define expected output
outputs = [httpclient.InferRequestedOutput("output__0")]

# Send request
result = client.infer("my_model", inputs, outputs=outputs)
output = result.as_numpy("output__0")
print(output.shape)  # (1, 10)
```

Install the client: pip install tritonclient[http].
Serving LLMs with Triton's vLLM Backend
Triton 24.12 includes a vLLM backend, but it is only available in the nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3 image, not the standard 24.12-py3 image. Use the vllm-python-py3 image when you need LLM serving alongside other model types in one server process.
Model repository layout for the vLLM backend
```
model_repository/
  llama3_70b/
    config.pbtxt
    1/
      model.json   # vLLM engine args
```

config.pbtxt for the vLLM backend
```
name: "llama3_70b"
backend: "vllm"
max_batch_size: 0
model_transaction_policy {
  decoupled: true
}
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
  },
  {
    name: "sampling_parameters"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
```

model.json (vLLM engine args)
```json
{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
  "quantization": "fp8",
  "max_num_seqs": 256,
  "gpu_memory_utilization": 0.90,
  "tensor_parallel_size": 1
}
```

FP8 is requested via the quantization engine arg (vLLM's dtype field controls activation precision, not weight quantization). For multi-GPU setups, increase tensor_parallel_size to match your GPU count.
Performance note: Triton's vLLM backend adds some overhead per request compared to standalone vLLM, due to the extra routing layer between the HTTP server and the vLLM engine. For pure single-LLM serving where latency matters, standalone vLLM is simpler. Use Triton's vLLM backend when you need multi-model routing on one server, not when squeezing the last few milliseconds from a dedicated LLM server.
For benchmark comparisons between vLLM, TensorRT-LLM, and SGLang on the same hardware, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.
Dynamic Batching and Model Ensembles
Dynamic Batching Configuration
Dynamic batching is the primary tool for improving GPU throughput when your request rate is bursty or when individual requests are small. The key parameters:
```
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 10000   # 10ms max wait
  preserve_ordering: false
}
```

preferred_batch_size tells Triton which batch sizes to target. Triton will dispatch a batch as soon as one of these sizes is reached, or when max_queue_delay_microseconds expires, whichever comes first. Lower the delay for latency-sensitive workloads; increase it for throughput-oriented batch jobs.
For a deeper look at continuous batching and paged attention, which underpin how modern LLM backends handle dynamic batching, see LLM Serving Optimization: Continuous Batching and Paged Attention.
Model Ensembles
An ensemble lets you chain models into a pipeline defined entirely in Triton config, with no client-side coordination. Each model's output tensors map to the next model's input tensors.
Example: a two-step pipeline where a preprocessing model tokenizes text, then an inference model runs the tokens.
```
name: "text_pipeline"
platform: "ensemble"
max_batch_size: 32
input [
  {
    name: "raw_text"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "text" value: "raw_text" }
      output_map { key: "token_ids" value: "tokenized" }
    },
    {
      model_name: "bert_classifier"
      model_version: -1
      input_map { key: "input_ids" value: "tokenized" }
      output_map { key: "logits" value: "logits" }
    }
  ]
}
```

The input_map and output_map define tensor name translations between ensemble and model-local names.
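A mistyped tensor name in these maps only surfaces when Triton tries to load the ensemble. A quick offline sanity check of the step graph can catch it earlier (an illustrative sketch, not part of Triton; each step dict mirrors the input_map/output_map fields of the config):

```python
def validate_ensemble(ensemble_inputs, steps):
    """Check that every step consumes only tensors produced by an earlier
    step's output_map or declared as an ensemble input."""
    available = set(ensemble_inputs)
    for step in steps:
        missing = set(step["input_map"].values()) - available
        if missing:
            raise ValueError(f"{step['model_name']} reads undefined tensors: {missing}")
        available |= set(step["output_map"].values())
    return True

# Mirrors the text_pipeline config above
steps = [
    {"model_name": "tokenizer",
     "input_map": {"text": "raw_text"}, "output_map": {"token_ids": "tokenized"}},
    {"model_name": "bert_classifier",
     "input_map": {"input_ids": "tokenized"}, "output_map": {"logits": "logits"}},
]
print(validate_ensemble(["raw_text"], steps))  # True
```

Running this in CI against a parsed copy of your ensemble config turns a runtime load failure into a test failure.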
BLS (Business Logic Scripting) is the Python-based alternative for pipelines with conditional routing. Instead of a static ensemble config, you write a Python backend that calls other Triton models programmatically. Use BLS when you need if/else logic or variable-length model chains; use ensemble configs when the pipeline is fixed.
Triton vs vLLM vs TensorRT-LLM vs SGLang: 2026 Decision Matrix
| Criterion | Triton | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|
| Multi-framework support | Yes (all) | LLMs only | LLMs + TRT | LLMs only |
| Multi-model concurrent serving | Yes | No | No | No |
| LLM throughput | Via vLLM backend | High | Highest | High |
| Setup complexity | High | Low | Very high | Medium |
| Dynamic batching | Built-in | PagedAttention | Built-in | RadixAttention |
| Best for | Diverse model types, pipelines | Single-LLM serving | Max throughput TRT | Agentic, multi-turn |
| Avoid when | Single LLM, simple needs | Multi-framework fleet | No TRT expertise | Stateless batch workloads |
The short version: if you're serving one LLM and nothing else, Triton's overhead isn't worth it. vLLM is simpler and faster for that case. If you need maximum token throughput and can tolerate TensorRT's compilation step, TensorRT-LLM wins on raw numbers. SGLang wins on agentic and multi-turn workloads where shared prefixes are common. Triton wins when you have a mixed fleet: LLMs alongside embedding models, vision encoders, classifiers, or custom preprocessing steps.
For deeper benchmark data, see vLLM vs TensorRT-LLM vs SGLang Benchmarks. For SGLang's specific advantages on agentic workloads with RadixAttention, see SGLang Production Deployment Guide.
Monitoring Triton in Production: Prometheus Metrics
Triton exposes Prometheus metrics at http://localhost:8002/metrics. No configuration required; the endpoint is active whenever Triton starts.
Key metrics to watch:
| Metric | What it tells you |
|---|---|
| nv_inference_request_success | Total successful inference requests per model |
| nv_inference_queue_duration_us | Time requests spend waiting in the queue before execution |
| nv_gpu_utilization | GPU compute utilization as a percentage |
| nv_gpu_memory_used_bytes | GPU memory used by Triton and loaded models |
| nv_inference_exec_count | Number of inference executions (batches, not individual requests) |
Minimal Prometheus scrape config:
```yaml
scrape_configs:
  - job_name: triton
    static_configs:
      - targets: [ "localhost:8002" ]
    metrics_path: /metrics
    scrape_interval: 15s
```

To verify metrics are flowing:
```bash
curl http://localhost:8002/metrics | grep nv_gpu
```

You'll see nv_gpu_utilization and nv_gpu_memory_used_bytes for each GPU device. If both read 0 with no active requests, that's expected. Send a request and re-check; GPU utilization should spike during inference.
nv_inference_queue_duration_us is your main signal for batching tuning. If queue times are consistently low (under 1ms) and throughput is low, your batch sizes are too small. Increase preferred_batch_size. If queue times are high and requests are timing out, your model instances can't keep up with load; add more GPUs or scale horizontally.
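For ad-hoc checks outside Prometheus, the metrics endpoint returns plain text in Prometheus exposition format, which a few lines of stdlib code can parse (a sketch; the sample string below mimics the shape of Triton's output, with made-up values):

```python
import re

def parse_metrics(text, prefix="nv_"):
    """Extract {metric_name{labels}: value} pairs from Prometheus exposition text."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.startswith(prefix):
            continue  # skip HELP/TYPE comments and non-Triton metrics
        m = re.match(r'^(\w+(?:\{[^}]*\})?)\s+([\d.eE+-]+)$', line.strip())
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

# Sample shaped like Triton's /metrics output (values are illustrative)
sample = """# HELP nv_gpu_utilization GPU utilization rate
nv_gpu_utilization{gpu_uuid="GPU-abc"} 0.87
nv_gpu_memory_used_bytes{gpu_uuid="GPU-abc"} 42949672960
"""
metrics = parse_metrics(sample)
print(metrics['nv_gpu_utilization{gpu_uuid="GPU-abc"}'])  # 0.87
```

Point the same parser at the body of a GET to localhost:8002/metrics to script alerts without a full Prometheus stack.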
Cost Optimization: Right-Sizing GPU Instances for Triton
Multi-model serving makes right-sizing more complex than single-LLM deployments. You can't just pick the cheapest GPU that fits your largest model; you need to fit all resident models simultaneously.
Sizing formula: total VRAM needed = sum of all resident model sizes (in their deployment quantization) + 15% headroom for batching buffers, KV cache, and CUDA overhead.
L40S at $0.72/hr on-demand: Right-sized for mixed lightweight workloads. A BERT reranker (~1GB) + a CLIP encoder (~600MB) + Llama 3.1 8B FP8 (~8GB) fits well under 48GB. Good starting point for most production pipelines that don't need 70B+ LLMs.
A100 80GB at $1.07/hr on-demand: Solid value when you need 80GB but the rest of your workload doesn't justify an H100's price. Works well for Llama 3.2 11B VLM + CLIP + a few small classifiers.
H100 80GB at $2.01/hr on-demand: Use when any single model in your stack exceeds 40GB, or when you're serving a 70B+ LLM alongside other models. The H100 PCIe's memory bandwidth (2.0 TB/s) makes it faster than A100 at the same VRAM size, which matters when you're context-switching between models under concurrent load.
Spot instances: The B200's spot price of $1.71/hr (vs $7.43/hr on-demand) makes it viable for batch inference pipelines where interruption is acceptable (indexing jobs, offline embedding generation, batch reranking). Don't use spot for real-time serving where a preemption causes a 200+ second restart.
B200 at $7.43/hr: Reserve this for the largest multi-model stacks where your 70B LLM, vision encoder, and embedding model combined exceed 130GB, or when you're running multiple concurrent 70B models at scale.
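The tradeoffs above come down to simple arithmetic on the hourly rates. A quick monthly-cost comparison using the pricing table's 12 Apr 2026 snapshot (assumes 24/7 utilization; rerun with live rates):

```python
def monthly_cost(hourly_rate, hours_per_day=24, days=30):
    """Monthly cost in USD at a given hourly GPU rate."""
    return hourly_rate * hours_per_day * days

# Rates from the pricing table above (USD/hr)
for name, rate in [("L40S on-demand", 0.72), ("A100 80GB on-demand", 1.07),
                   ("H100 PCIe on-demand", 2.01), ("B200 spot", 1.71),
                   ("B200 on-demand", 7.43)]:
    print(f"{name}: ${monthly_cost(rate):,.0f}/mo")
```

Note that B200 spot lands below H100 PCIe on-demand per month, which is why interruption-tolerant batch pipelines on spot B200s can undercut a dedicated H100.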
For full GPU selection guidance across model sizes, see AI Inference GPU Guide 2026 and GPU pricing →.
Triton's multi-model serving fits naturally with Spheron's per-hour GPU pricing. Spin up an H100 or L40S, load your model repository, and scale per request volume without overprovisioning.
