Most teams reach for vLLM or SGLang when they need to serve an LLM in production. That works fine when you have one model type. When your fleet needs to serve an LLM, an embedding model, and a vision encoder simultaneously on the same GPUs, NVIDIA Triton Inference Server is the production standard. It runs multiple backends in a single server process, handles concurrent requests across all of them, and gives you the request routing and batching controls that purpose-built LLM servers skip entirely.
Why Triton Inference Server: Multi-Framework, Multi-Model Serving
Triton's core distinction from vLLM or SGLang is that it's backend-agnostic. A single Triton process can simultaneously serve:
- An LLM via the vLLM backend (PagedAttention, continuous batching)
- A CLIP image encoder via the ONNX Runtime backend
- A BERT-based reranker via the TensorRT backend
- A custom preprocessing step via the Python backend
Practical example: you're building a multimodal RAG pipeline. Your query comes in, hits a CLIP encoder to generate embeddings, retrieves documents, runs a reranker, and sends the combined context to Llama 3.3 70B for synthesis. With Triton, all four models sit in the same server, and you define the routing in an ensemble config. Without Triton, you're running four separate servers and writing the inter-service coordination yourself.
Triton exposes both HTTP/REST on port 8000 and gRPC on port 8001. Port 8002 serves Prometheus metrics for GPU utilization, queue depth, and per-model inference latency. All three ports are active from a single docker run command.
For background on vLLM's production deployment patterns, see the vLLM production deployment guide.
Triton Architecture: Model Repository, Schedulers, and Backends
Model Repository
The model repository is a filesystem directory that Triton watches at startup and, optionally, at runtime. Layout:
```
model_repository/
  my_model/
    config.pbtxt          # Model configuration
    1/                    # Version directory
      model.onnx          # Model weights (or model.pt, model.plan, etc.)
    2/                    # Optional: second version
      model.onnx
```

The version number is a directory name. Triton loads the latest version by default and can serve multiple versions simultaneously if you configure it. Every model directory requires a config.pbtxt.
Schedulers
Triton has three scheduling modes:
Default scheduler: One request at a time, no batching. Used for stateful models or when you want predictable latency with no batch overhead.
Dynamic batching: Groups incoming requests into batches automatically. You set preferred_batch_size and max_queue_delay_microseconds. Requests wait up to the delay limit for the batch to fill, then execute. Good for most inference workloads where throughput matters more than single-request latency.
Sequence batching: For stateful models where requests belong to a session (RNNs, stateful decoders). Triton routes all requests from the same sequence ID to the same model instance.
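As an illustration of the sequence-batching mode, here is a hedged config sketch for a hypothetical stateful decoder (the model name, backend, and idle timeout are placeholders; the control_input blocks follow Triton's sequence_batching schema, which injects START/END signals into the model on each request):

```
name: "stateful_decoder"
backend: "pytorch"
max_batch_size: 8
sequence_batching {
  max_sequence_idle_microseconds: 5000000   # reclaim a slot after 5s of inactivity
  control_input [
    {
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "END"
      control [
        {
          kind: CONTROL_SEQUENCE_END
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
}
```

Clients attach a sequence ID to each request; Triton pins that sequence to one model instance so the model's internal state stays consistent across the session.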
Backends
| Backend | Use case |
|---|---|
| tensorrt | TensorRT-compiled engines, highest throughput on NVIDIA GPUs |
| pytorch | TorchScript models |
| onnxruntime | ONNX models, broad framework support |
| python | Custom Python inference code, preprocessing, postprocessing |
| vllm | LLMs with PagedAttention and continuous batching |
| openvino | Intel CPU inference (rarely used on GPU fleets) |
Minimal config.pbtxt (ONNX backend)
```
name: "clip_encoder"
backend: "onnxruntime"
max_batch_size: 32
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "image_embeds"
    data_type: TYPE_FP32
    dims: [ 512 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000
}
```

GPU Cloud Requirements: VRAM Sizing for Concurrent Model Serving
Multi-model serving changes how you size GPU memory. You're not sizing for one model's peak; you're sizing for the sum of all resident models plus batching buffers.
| Setup | Models | GPU Recommendation | VRAM Needed |
|---|---|---|---|
| LLM only | Llama 3.3 70B FP8 | H100 80GB | ~40GB |
| LLM + embeddings | Llama 3.1 8B + BGE-M3 | L40S 48GB | ~18GB |
| LLM + vision | Llama 3.2 11B VLM + CLIP | A100 80GB | ~30GB |
| Full stack | Llama 3.3 70B + CLIP + BERT | H100 80GB | ~55GB |
Rule of thumb: total VRAM = sum of all resident model sizes (in their quantized form) + 15% headroom for dynamic batching buffers and KV cache. If a single model exceeds 40GB, you need an 80GB card. For more on managing KV cache memory, see the KV Cache Optimization Guide.
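The rule of thumb above reduces to a few lines of arithmetic. A minimal sketch (the model sizes below are illustrative; use whatever your quantized checkpoints actually occupy on disk):

```python
def required_vram_gb(model_sizes_gb, headroom=0.15):
    """Sum all resident model sizes, then add headroom for dynamic
    batching buffers, KV cache, and CUDA overhead."""
    return sum(model_sizes_gb) * (1 + headroom)

# Full-stack example: quantized 70B LLM (~40GB) + CLIP (~0.6GB) + BERT reranker (~1GB)
needed = required_vram_gb([40, 0.6, 1])
print(f"{needed:.1f} GB")  # ~47.8 GB -> needs an 80GB card, not a 48GB one
```

Run this against your own checkpoint sizes before picking a GPU; if the result lands within a few GB of a card's capacity, size up rather than rely on gpu_memory_utilization tuning.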
Current GPU pricing on Spheron:
| GPU | VRAM | On-Demand (lowest) | Spot (lowest) |
|---|---|---|---|
| L40S | 48GB | $0.72/hr | - |
| A100 80GB PCIe | 80GB | $1.07/hr | - |
| H100 80GB PCIe | 80GB | $2.01/hr | - |
| H100 SXM5 | 80GB | $4.41/hr | - |
| B200 SXM6 | 192GB | $7.43/hr | $1.71/hr |
Pricing fluctuates based on GPU availability. The prices above are a 12 Apr 2026 snapshot and may have changed. Check current GPU pricing → for live rates.
Step-by-Step: Deploy Triton on GPU Cloud with Docker
Prerequisites
- NVIDIA GPU instance on Spheron with Docker and the NVIDIA Container Toolkit installed
- Docker image access to nvcr.io/nvidia/tritonserver:24.12-py3
- Model weights in a supported format (ONNX, TorchScript, TensorRT engine, or a Hugging Face model name for the vLLM backend)
Step 1: Pull the Triton image
```bash
# For non-vLLM backends (TensorRT, ONNX, PyTorch, Python)
docker pull nvcr.io/nvidia/tritonserver:24.12-py3

# For the vLLM backend (LLM serving with PagedAttention)
docker pull nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3
```

The base py3 image is ~9.9GB. Pull the appropriate image while you set up your model repository.
Step 2: Create the model repository
```bash
# Create model repository structure
mkdir -p model_repository/my_model/1

# Place your model file (example: ONNX)
cp /path/to/model.onnx model_repository/my_model/1/model.onnx
```

Add a config.pbtxt for a PyTorch (TorchScript) model:
```
name: "my_model"
backend: "pytorch"
max_batch_size: 16
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 128 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 10000
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```

Step 3: Launch Triton
```bash
docker run --gpus all --rm \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.12-py3 \
  tritonserver --model-repository=/models
```

Watch the startup logs. You'll see each model load with its backend, version, and status. A successful load looks like:
```
I0101 tritonserver.cc] Started HTTPService at 0.0.0.0:8000
I0101 tritonserver.cc] Started GRPCInferenceService at 0.0.0.0:8001
I0101 tritonserver.cc] Started Metrics Service at 0.0.0.0:8002
```

Step 4: Verify the server is ready
```bash
# Server health
curl http://localhost:8000/v2/health/ready

# Model status
curl http://localhost:8000/v2/models/my_model/ready
```

Both return HTTP 200 when the server and model are ready.
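In deployment scripts you'll usually want to poll these endpoints rather than check once. A small stdlib-only helper (illustrative; it only hits the health endpoints shown above) that gates traffic on readiness:

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(base_url="http://localhost:8000", model=None, timeout_s=60):
    """Poll Triton's readiness endpoint until it returns 200 or the timeout expires."""
    path = f"/v2/models/{model}/ready" if model else "/v2/health/ready"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + path, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry until deadline
        time.sleep(1)
    return False
```

Call wait_until_ready(model="my_model") after docker run and before routing production traffic; it returns False if the model never loads within the timeout.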
Step 5: Send an inference request via Python
```python
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build input tensor
input_data = np.random.rand(1, 128).astype(np.float32)
inputs = [httpclient.InferInput("input__0", input_data.shape, "FP32")]
inputs[0].set_data_from_numpy(input_data)

# Define expected output
outputs = [httpclient.InferRequestedOutput("output__0")]

# Send request
result = client.infer("my_model", inputs, outputs=outputs)
output = result.as_numpy("output__0")
print(output.shape)  # (1, 10)
```

Install the client: pip install tritonclient[http].
Serving LLMs with Triton's vLLM Backend
Triton 24.12 includes a vLLM backend, but it is only available in the nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3 image, not the standard 24.12-py3 image. Use the vllm-python-py3 image when you need LLM serving alongside other model types in one server process.
Model repository layout for the vLLM backend
```
model_repository/
  llama3_70b/
    config.pbtxt
    1/
      model.json   # vLLM engine args
```

config.pbtxt for the vLLM backend
```
name: "llama3_70b"
backend: "vllm"
max_batch_size: 0
model_transaction_policy {
  decoupled: true
}
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
  },
  {
    name: "sampling_parameters"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
```

model.json (vLLM engine args)
```json
{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
  "quantization": "fp8",
  "max_num_seqs": 256,
  "gpu_memory_utilization": 0.90,
  "tensor_parallel_size": 1
}
```

FP8 is requested via the quantization engine arg (vLLM's dtype field controls activation precision, not weight quantization). For multi-GPU setups, increase tensor_parallel_size to match your GPU count.
Performance note: Triton's vLLM backend adds some overhead per request compared to standalone vLLM, due to the extra routing layer between the HTTP server and the vLLM engine. For pure single-LLM serving where latency matters, standalone vLLM is simpler. Use Triton's vLLM backend when you need multi-model routing on one server, not when squeezing the last few milliseconds from a dedicated LLM server.
For benchmark comparisons between vLLM, TensorRT-LLM, and SGLang on the same hardware, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.
Dynamic Batching and Model Ensembles
Dynamic Batching Configuration
Dynamic batching is the primary tool for improving GPU throughput when your request rate is bursty or when individual requests are small. The key parameters:
```
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 10000   # 10ms max wait
  preserve_ordering: false
}
```

preferred_batch_size tells Triton which batch sizes to target. Triton will dispatch a batch as soon as one of these sizes is reached, or when max_queue_delay_microseconds expires, whichever comes first. Lower the delay for latency-sensitive workloads; increase it for throughput-oriented batch jobs.
For a deeper look at continuous batching and paged attention, which underpin how modern LLM backends handle dynamic batching, see LLM Serving Optimization: Continuous Batching and Paged Attention.
Model Ensembles
An ensemble lets you chain models into a pipeline defined entirely in Triton config, with no client-side coordination. Each model's output tensors map to the next model's input tensors.
Example: a two-step pipeline where a preprocessing model tokenizes text, then an inference model runs the tokens.
```
name: "text_pipeline"
platform: "ensemble"
max_batch_size: 32
input [
  {
    name: "raw_text"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "text" value: "raw_text" }
      output_map { key: "token_ids" value: "tokenized" }
    },
    {
      model_name: "bert_classifier"
      model_version: -1
      input_map { key: "input_ids" value: "tokenized" }
      output_map { key: "logits" value: "logits" }
    }
  ]
}
```

The input_map and output_map define tensor name translations between ensemble and model-local names.
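A mistyped tensor name in these maps only surfaces when Triton tries to load the ensemble. A quick offline sanity check of the step graph can catch it earlier (an illustrative sketch, not part of Triton; each step dict mirrors the input_map/output_map fields of the config):

```python
def validate_ensemble(ensemble_inputs, steps):
    """Check that every step consumes only tensors produced by an earlier
    step's output_map or declared as an ensemble input."""
    available = set(ensemble_inputs)
    for step in steps:
        missing = set(step["input_map"].values()) - available
        if missing:
            raise ValueError(f"{step['model_name']} reads undefined tensors: {missing}")
        available |= set(step["output_map"].values())
    return True

# Mirrors the text_pipeline config above
steps = [
    {"model_name": "tokenizer",
     "input_map": {"text": "raw_text"}, "output_map": {"token_ids": "tokenized"}},
    {"model_name": "bert_classifier",
     "input_map": {"input_ids": "tokenized"}, "output_map": {"logits": "logits"}},
]
print(validate_ensemble(["raw_text"], steps))  # True
```

Running this in CI against a parsed copy of your ensemble config turns a runtime load failure into a test failure.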
BLS (Business Logic Scripting) is the Python-based alternative for pipelines with conditional routing. Instead of a static ensemble config, you write a Python backend that calls other Triton models programmatically. Use BLS when you need if/else logic or variable-length model chains; use ensemble configs when the pipeline is fixed.
Triton vs vLLM vs TensorRT-LLM vs SGLang: 2026 Decision Matrix
| Criterion | Triton | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|
| Multi-framework support | Yes (all) | LLMs only | LLMs + TRT | LLMs only |
| Multi-model concurrent serving | Yes | No | No | No |
| LLM throughput | Via vLLM backend | High | Highest | High |
| Setup complexity | High | Low | Very high | Medium |
| Dynamic batching | Built-in | PagedAttention | Built-in | RadixAttention |
| Best for | Diverse model types, pipelines | Single-LLM serving | Max throughput TRT | Agentic, multi-turn |
| Avoid when | Single LLM, simple needs | Multi-framework fleet | No TRT expertise | Stateless batch workloads |
The short version: if you're serving one LLM and nothing else, Triton's overhead isn't worth it. vLLM is simpler and faster for that case. If you need maximum token throughput and can tolerate TensorRT's compilation step, TensorRT-LLM wins on raw numbers. SGLang wins on agentic and multi-turn workloads where shared prefixes are common. Triton wins when you have a mixed fleet: LLMs alongside embedding models, vision encoders, classifiers, or custom preprocessing steps.
For deeper benchmark data, see vLLM vs TensorRT-LLM vs SGLang Benchmarks. For SGLang's specific advantages on agentic workloads with RadixAttention, see SGLang Production Deployment Guide.
Monitoring Triton in Production: Prometheus Metrics
Triton exposes Prometheus metrics at http://localhost:8002/metrics. No configuration required; the endpoint is active whenever Triton starts.
Key metrics to watch:
| Metric | What it tells you |
|---|---|
| nv_inference_request_success | Total successful inference requests per model |
| nv_inference_queue_duration_us | Time requests spend waiting in the queue before execution |
| nv_gpu_utilization | GPU compute utilization as a percentage |
| nv_gpu_memory_used_bytes | GPU memory used by Triton and loaded models |
| nv_inference_exec_count | Number of inference executions (batches, not individual requests) |
Minimal Prometheus scrape config:
```yaml
scrape_configs:
  - job_name: triton
    static_configs:
      - targets: [ "localhost:8002" ]
    metrics_path: /metrics
    scrape_interval: 15s
```

To verify metrics are flowing:
```bash
curl http://localhost:8002/metrics | grep nv_gpu
```

You'll see nv_gpu_utilization and nv_gpu_memory_used_bytes for each GPU device. If both read 0 with no active requests, that's expected. Send a request and re-check; GPU utilization should spike during inference.
nv_inference_queue_duration_us is your main signal for batching tuning. If queue times are consistently low (under 1ms) and throughput is low, your batch sizes are too small. Increase preferred_batch_size. If queue times are high and requests are timing out, your model instances can't keep up with load; add more GPUs or scale horizontally.
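For ad-hoc checks outside Prometheus, the metrics endpoint returns plain text in Prometheus exposition format, which a few lines of stdlib code can parse (a sketch; the sample string below mimics the shape of Triton's output, with made-up values):

```python
import re

def parse_metrics(text, prefix="nv_"):
    """Extract {metric_name{labels}: value} pairs from Prometheus exposition text."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.startswith(prefix):
            continue  # skip HELP/TYPE comments and non-Triton metrics
        m = re.match(r'^(\w+(?:\{[^}]*\})?)\s+([\d.eE+-]+)$', line.strip())
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

# Sample shaped like Triton's /metrics output (values are illustrative)
sample = """# HELP nv_gpu_utilization GPU utilization rate
nv_gpu_utilization{gpu_uuid="GPU-abc"} 0.87
nv_gpu_memory_used_bytes{gpu_uuid="GPU-abc"} 42949672960
"""
metrics = parse_metrics(sample)
print(metrics['nv_gpu_utilization{gpu_uuid="GPU-abc"}'])  # 0.87
```

Point the same parser at the body of a GET to localhost:8002/metrics to script alerts without a full Prometheus stack.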
Cost Optimization: Right-Sizing GPU Instances for Triton
Multi-model serving makes right-sizing more complex than single-LLM deployments. You can't just pick the cheapest GPU that fits your largest model; you need to fit all resident models simultaneously.
Sizing formula: total VRAM needed = sum of all resident model sizes (in their deployment quantization) + 15% headroom for batching buffers, KV cache, and CUDA overhead.
L40S at $0.72/hr on-demand: Right-sized for mixed lightweight workloads. A BERT reranker (~1GB) + a CLIP encoder (~600MB) + Llama 3.1 8B FP8 (~8GB) fits well under 48GB. Good starting point for most production pipelines that don't need 70B+ LLMs.
A100 80GB at $1.07/hr on-demand: Solid value when you need 80GB but the rest of your workload doesn't justify an H100's price. Works well for Llama 3.2 11B VLM + CLIP + a few small classifiers.
H100 80GB at $2.01/hr on-demand: Use when any single model in your stack exceeds 40GB, or when you're serving a 70B+ LLM alongside other models. The H100 PCIe's memory bandwidth (2.0 TB/s) makes it faster than A100 at the same VRAM size, which matters when you're context-switching between models under concurrent load.
Spot instances: The B200's spot price of $1.71/hr (vs $7.43/hr on-demand) makes it viable for batch inference pipelines where interruption is acceptable (indexing jobs, offline embedding generation, batch reranking). Don't use spot for real-time serving where a preemption causes a 200+ second restart.
B200 at $7.43/hr: Reserve this for the largest multi-model stacks where your 70B LLM, vision encoder, and embedding model combined exceed 130GB, or when you're running multiple concurrent 70B models at scale.
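The tradeoffs above come down to simple arithmetic on the hourly rates. A quick monthly-cost comparison using the pricing table's 12 Apr 2026 snapshot (assumes 24/7 utilization; rerun with live rates):

```python
def monthly_cost(hourly_rate, hours_per_day=24, days=30):
    """Monthly cost in USD at a given hourly GPU rate."""
    return hourly_rate * hours_per_day * days

# Rates from the pricing table above (USD/hr)
for name, rate in [("L40S on-demand", 0.72), ("A100 80GB on-demand", 1.07),
                   ("H100 PCIe on-demand", 2.01), ("B200 spot", 1.71),
                   ("B200 on-demand", 7.43)]:
    print(f"{name}: ${monthly_cost(rate):,.0f}/mo")
```

Note that B200 spot lands below H100 PCIe on-demand per month, which is why interruption-tolerant batch pipelines on spot B200s can undercut a dedicated H100.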
For full GPU selection guidance across model sizes, see AI Inference GPU Guide 2026 and GPU pricing →.
Triton's multi-model serving fits naturally with Spheron's per-hour GPU pricing. Spin up an H100 or L40S, load your model repository, and scale per request volume without overprovisioning.
