Tutorial

Deploy NVIDIA Triton Inference Server on GPU Cloud: Production Multi-Model Serving (2026)

Written by Mitrasish, Co-founder · Apr 12, 2026
GPU Cloud · Triton Inference Server · Model Serving · NVIDIA · LLM Inference · Production ML

Most teams reach for vLLM or SGLang when they need to serve an LLM in production. That works fine when you have one model type. When your fleet needs to serve an LLM, an embedding model, and a vision encoder simultaneously on the same GPUs, NVIDIA Triton Inference Server is the production standard. It runs multiple backends in a single server process, handles concurrent requests across all of them, and gives you the request routing and batching controls that purpose-built LLM servers skip entirely.

Why Triton Inference Server: Multi-Framework, Multi-Model Serving

Triton's core distinction from vLLM or SGLang is that it's backend-agnostic. A single Triton process can simultaneously serve:

  • An LLM via the vLLM backend (PagedAttention, continuous batching)
  • A CLIP image encoder via the ONNX Runtime backend
  • A BERT-based reranker via the TensorRT backend
  • A custom preprocessing step via the Python backend

Practical example: you're building a multimodal RAG pipeline. Your query comes in, hits a CLIP encoder to generate embeddings, retrieves documents, runs a reranker, and sends the combined context to Llama 3.3 70B for synthesis. With Triton, all four models sit in the same server, and you define the routing in an ensemble config. Without Triton, you're running four separate servers and writing the inter-service coordination yourself.

Triton exposes both HTTP/REST on port 8000 and gRPC on port 8001. Port 8002 serves Prometheus metrics for GPU utilization, queue depth, and per-model inference latency. All three ports are active from a single docker run command.

For background on vLLM's production deployment patterns, see the vLLM production deployment guide.

Triton Architecture: Model Repository, Schedulers, and Backends

Model Repository

The model repository is a filesystem directory that Triton watches at startup and, optionally, at runtime. Layout:

```
model_repository/
  my_model/
    config.pbtxt          # Model configuration
    1/                    # Version directory
      model.onnx          # Model weights (or model.pt, model.plan, etc.)
    2/                    # Optional: second version
      model.onnx
```

The version number is a directory name. Triton loads the latest version by default and can serve multiple versions simultaneously if you configure it. Every model directory requires a config.pbtxt.
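
If you want more than the latest version resident, a version_policy block in config.pbtxt controls this. A sketch of the two common policies, based on Triton's documented latest and specific options:

```protobuf
# Keep the two most recent versions loaded at once
version_policy: { latest { num_versions: 2 } }

# Or pin exact versions instead:
# version_policy: { specific { versions: [ 1, 2 ] } }
```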

Schedulers

Triton has three scheduling modes:

Default scheduler: One request at a time, no batching. Used for stateful models or when you want predictable latency with no batch overhead.

Dynamic batching: Groups incoming requests into batches automatically. You set preferred_batch_size and max_queue_delay_microseconds. Requests wait up to the delay limit for the batch to fill, then execute. Good for most inference workloads where throughput matters more than single-request latency.

Sequence batching: For stateful models where requests belong to a session (RNNs, stateful decoders). Triton routes all requests from the same sequence ID to the same model instance.
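
A sequence_batching block for direct scheduling looks roughly like this. This is a hedged sketch based on Triton's documented control-input mechanism; the START and END tensor names are conventions you choose, and your model must accept them as inputs:

```protobuf
sequence_batching {
  max_sequence_idle_microseconds: 5000000
  direct { }
  control_input [
    {
      name: "START"
      control [ { kind: CONTROL_SEQUENCE_START, fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "END"
      control [ { kind: CONTROL_SEQUENCE_END, fp32_false_true: [ 0, 1 ] } ]
    }
  ]
}
```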

Backends

| Backend | Use case |
| --- | --- |
| tensorrt | TensorRT-compiled engines, highest throughput on NVIDIA |
| pytorch | TorchScript models |
| onnxruntime | ONNX models, broad framework support |
| python | Custom Python inference code, preprocessing, postprocessing |
| vllm | LLMs with PagedAttention and continuous batching |
| openvino | Intel CPU inference (rarely used on GPU fleets) |

Minimal config.pbtxt (ONNX backend)

```protobuf
name: "clip_encoder"
backend: "onnxruntime"
max_batch_size: 32

input [
  {
    name: "pixel_values"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "image_embeds"
    data_type: TYPE_FP32
    dims: [ 512 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000
}
```

GPU Cloud Requirements: VRAM Sizing for Concurrent Model Serving

Multi-model serving changes how you size GPU memory. You're not sizing for one model's peak; you're sizing for the sum of all resident models plus batching buffers.

| Setup | Models | GPU Recommendation | VRAM Needed |
| --- | --- | --- | --- |
| LLM only | Llama 3.3 70B FP8 | H100 80GB | ~40GB |
| LLM + embeddings | Llama 3.1 8B + BGE-M3 | L40S 48GB | ~18GB |
| LLM + vision | Llama 3.2 11B VLM + CLIP | A100 80GB | ~30GB |
| Full stack | Llama 3.3 70B + CLIP + BERT | H100 80GB | ~55GB |

Rule of thumb: total VRAM = sum of all resident model sizes (in their quantized form) + 15% headroom for dynamic batching buffers and KV cache. If a single model exceeds 40GB, you need an 80GB card. For more on managing KV cache memory, see the KV Cache Optimization Guide.
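
The rule of thumb is easy to turn into a quick sizing check. A minimal sketch; the model sizes below are the illustrative figures used in this guide, not measured values:

```python
def required_vram_gb(model_sizes_gb, headroom=0.15):
    """Total VRAM = sum of resident model sizes (in their deployment
    quantization) plus headroom for batching buffers and KV cache."""
    return sum(model_sizes_gb) * (1 + headroom)

# Full stack: Llama 3.3 70B FP8 (~40GB) + CLIP (~0.6GB) + BERT reranker (~1GB)
total = required_vram_gb([40, 0.6, 1.0])
print(f"{total:.1f} GB needed")  # fits an 80GB card
```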

Current GPU pricing on Spheron:

| GPU | VRAM | On-Demand (lowest) | Spot (lowest) |
| --- | --- | --- | --- |
| L40S | 48GB | $0.72/hr | - |
| A100 80GB PCIe | 80GB | $1.07/hr | - |
| H100 80GB PCIe | 80GB | $2.01/hr | - |
| H100 SXM5 | 80GB | $4.41/hr | - |
| B200 SXM6 | 192GB | $7.43/hr | $1.71/hr |

Pricing fluctuates based on GPU availability. The prices above are as of 12 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Step-by-Step: Deploy Triton on GPU Cloud with Docker

Prerequisites

  • NVIDIA GPU instance on Spheron with Docker and the NVIDIA Container Toolkit installed
  • Docker image access to nvcr.io/nvidia/tritonserver:24.12-py3
  • Model weights in a supported format (ONNX, TorchScript, TensorRT engine, or a Hugging Face model name for the vLLM backend)

Step 1: Pull the Triton image

```bash
# For non-vLLM backends (TensorRT, ONNX, PyTorch, Python)
docker pull nvcr.io/nvidia/tritonserver:24.12-py3

# For the vLLM backend (LLM serving with PagedAttention)
docker pull nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3
```

The base py3 image is ~9.9GB. Pull the appropriate image while you set up your model repository.

Step 2: Create the model repository

```bash
# Create model repository structure
mkdir -p model_repository/my_model/1

# Place your model file (example: ONNX)
cp /path/to/model.onnx model_repository/my_model/1/model.onnx
```

Add a config.pbtxt for a PyTorch (TorchScript) model:

```protobuf
name: "my_model"
backend: "pytorch"
max_batch_size: 16

input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 128 ]
  }
]

output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 10000
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```

Step 3: Launch Triton

```bash
docker run --gpus all --rm \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.12-py3 \
  tritonserver --model-repository=/models
```

Watch the startup logs. You'll see each model load with its backend, version, and status. A successful load looks like:

```
I0101 tritonserver.cc] Started HTTPService at 0.0.0.0:8000
I0101 tritonserver.cc] Started GRPCInferenceService at 0.0.0.0:8001
I0101 tritonserver.cc] Started Metrics Service at 0.0.0.0:8002
```

Step 4: Verify the server is ready

```bash
# Server health
curl http://localhost:8000/v2/health/ready

# Model status
curl http://localhost:8000/v2/models/my_model/ready
```

Both return HTTP 200 when the server and model are ready.
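
In deployment scripts you'll usually poll these endpoints before routing traffic. A stdlib-only sketch; the timeout values are arbitrary, and the _probe hook exists only so the wait loop can be exercised without a live server:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url, timeout_s=120.0, interval_s=2.0, _probe=None):
    """Poll a Triton readiness endpoint until it returns HTTP 200
    or the timeout expires."""
    probe = _probe or _http_ok
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False

def _http_ok(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Call it with http://localhost:8000/v2/health/ready (or a per-model /ready URL) before pointing your load balancer at the instance.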

Step 5: Send an inference request via Python

```python
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build input tensor
input_data = np.random.rand(1, 128).astype(np.float32)
inputs = [httpclient.InferInput("input__0", input_data.shape, "FP32")]
inputs[0].set_data_from_numpy(input_data)

# Define expected output
outputs = [httpclient.InferRequestedOutput("output__0")]

# Send request
result = client.infer("my_model", inputs, outputs=outputs)
output = result.as_numpy("output__0")
print(output.shape)  # (1, 10)
```

Install the client: pip install tritonclient[http].

Serving LLMs with Triton's vLLM Backend

Triton 24.12 includes a vLLM backend, but it is only available in the nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3 image, not the standard 24.12-py3 image. Use the vllm-python-py3 image when you need LLM serving alongside other model types in one server process.

Model repository layout for the vLLM backend

```
model_repository/
  llama3_70b/
    config.pbtxt
    1/
      model.json      # vLLM engine args
```

config.pbtxt for vLLM backend

```protobuf
name: "llama3_70b"
backend: "vllm"
max_batch_size: 0

model_transaction_policy {
  decoupled: true
}

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
  },
  {
    name: "sampling_parameters"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
```

model.json (vLLM engine args)

```json
{
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "tokenizer": "meta-llama/Llama-3.3-70B-Instruct",
  "dtype": "fp8",
  "max_num_seqs": 256,
  "gpu_memory_utilization": 0.90,
  "tensor_parallel_size": 1
}
```

For multi-GPU setups, increase tensor_parallel_size to match your GPU count.

Performance note: Triton's vLLM backend adds some overhead per request compared to standalone vLLM, due to the extra routing layer between the HTTP server and the vLLM engine. For pure single-LLM serving where latency matters, standalone vLLM is simpler. Use Triton's vLLM backend when you need multi-model routing on one server, not when squeezing the last few milliseconds from a dedicated LLM server.

For benchmark comparisons between vLLM, TensorRT-LLM, and SGLang on the same hardware, see vLLM vs TensorRT-LLM vs SGLang Benchmarks.

Dynamic Batching and Model Ensembles

Dynamic Batching Configuration

Dynamic batching is the primary tool for improving GPU throughput when your request rate is bursty or when individual requests are small. The key parameters:

```protobuf
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 10000   # 10ms max wait
  preserve_ordering: false
}
```

preferred_batch_size tells Triton which batch sizes to target. Triton will dispatch a batch as soon as one of these sizes is reached, or when max_queue_delay_microseconds expires, whichever comes first. Lower the delay for latency-sensitive workloads; increase it for throughput-oriented batch jobs.
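
The interaction between batch size and queue delay is easier to internalize with a toy simulation. This is a deliberately simplified model of the dispatch rule, not Triton's actual scheduler: it flushes when the queue hits the largest preferred size, or when the oldest queued request would exceed the delay budget:

```python
def dispatch_batches(arrivals_us, preferred=(4, 8, 16), max_delay_us=10_000):
    """Group request arrival timestamps (microseconds) into batch sizes
    using a simplified size-or-deadline dispatch rule."""
    batches, queue = [], []
    for t in arrivals_us:
        # Oldest queued request has waited past the delay budget: flush.
        if queue and t - queue[0] > max_delay_us:
            batches.append(len(queue))
            queue = []
        queue.append(t)
        # Queue reached the largest preferred batch size: flush.
        if len(queue) == max(preferred):
            batches.append(len(queue))
            queue = []
    if queue:
        batches.append(len(queue))
    return batches

# A burst of 16 simultaneous requests fills one full batch; a trickle
# arriving every 3ms flushes a batch of 4 once the 10ms deadline passes.
print(dispatch_batches([0] * 16))                      # [16]
print(dispatch_batches([0, 3000, 6000, 9000, 12000]))  # [4, 1]
```

Playing with max_delay_us here mirrors the real tradeoff: a tighter deadline produces smaller batches and lower latency; a looser one produces fuller batches and higher throughput.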

For a deeper look at continuous batching and paged attention, which underpin how modern LLM backends handle dynamic batching, see LLM Serving Optimization: Continuous Batching and Paged Attention.

Model Ensembles

An ensemble lets you chain models into a pipeline defined entirely in Triton config, with no client-side coordination. Each model's output tensors map to the next model's input tensors.

Example: a two-step pipeline where a preprocessing model tokenizes text, then an inference model runs the tokens.

```protobuf
name: "text_pipeline"
platform: "ensemble"
max_batch_size: 32

input [
  {
    name: "raw_text"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "text" value: "raw_text" }
      output_map { key: "token_ids" value: "tokenized" }
    },
    {
      model_name: "bert_classifier"
      model_version: -1
      input_map { key: "input_ids" value: "tokenized" }
      output_map { key: "logits" value: "logits" }
    }
  ]
}
```

The input_map and output_map define tensor name translations between ensemble and model-local names.
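
The renaming scheme can be mimicked in a few lines of Python, which helps when debugging tensor-name mismatches in ensemble configs. This is a toy resolver, not Triton code; the tokenizer and classifier stand-ins are invented for illustration:

```python
def run_ensemble(steps, ensemble_inputs, models):
    """Route tensors through a chain of steps: input_map pulls from the
    shared pool by ensemble-local name, output_map publishes back."""
    pool = dict(ensemble_inputs)
    for step in steps:
        local_in = {k: pool[v] for k, v in step["input_map"].items()}
        local_out = models[step["model_name"]](**local_in)
        for local_name, pool_name in step["output_map"].items():
            pool[pool_name] = local_out[local_name]
    return pool

steps = [
    {"model_name": "tokenizer",
     "input_map": {"text": "raw_text"}, "output_map": {"token_ids": "tokenized"}},
    {"model_name": "bert_classifier",
     "input_map": {"input_ids": "tokenized"}, "output_map": {"logits": "logits"}},
]
models = {
    "tokenizer": lambda text: {"token_ids": [ord(c) for c in text]},
    "bert_classifier": lambda input_ids: {"logits": [sum(input_ids)]},
}
print(run_ensemble(steps, {"raw_text": "hi"}, models)["logits"])  # [209]
```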

BLS (Business Logic Scripting) is the Python-based alternative for pipelines with conditional routing. Instead of a static ensemble config, you write a Python backend that calls other Triton models programmatically. Use BLS when you need if/else logic or variable-length model chains; use ensemble configs when the pipeline is fixed.

Triton vs vLLM vs TensorRT-LLM vs SGLang: 2026 Decision Matrix

| Criterion | Triton | vLLM | TensorRT-LLM | SGLang |
| --- | --- | --- | --- | --- |
| Multi-framework support | Yes (all) | LLMs only | LLMs + TRT | LLMs only |
| Multi-model concurrent serving | Yes | No | No | No |
| LLM throughput | Via vLLM backend | High | Highest | High |
| Setup complexity | High | Low | Very high | Medium |
| Dynamic batching | Built-in | PagedAttention | Built-in | RadixAttention |
| Best for | Diverse model types, pipelines | Single-LLM serving | Max throughput TRT | Agentic, multi-turn |
| Avoid when | Single LLM, simple needs | Multi-framework fleet | No TRT expertise | Stateless batch workloads |

The short version: if you're serving one LLM and nothing else, Triton's overhead isn't worth it. vLLM is simpler and faster for that case. If you need maximum token throughput and can tolerate TensorRT's compilation step, TensorRT-LLM wins on raw numbers. SGLang wins on agentic and multi-turn workloads where shared prefixes are common. Triton wins when you have a mixed fleet: LLMs alongside embedding models, vision encoders, classifiers, or custom preprocessing steps.

For deeper benchmark data, see vLLM vs TensorRT-LLM vs SGLang Benchmarks. For SGLang's specific advantages on agentic workloads with RadixAttention, see SGLang Production Deployment Guide.

Monitoring Triton in Production: Prometheus Metrics

Triton exposes Prometheus metrics at http://localhost:8002/metrics. No configuration required; the endpoint is active whenever Triton starts.

Key metrics to watch:

| Metric | What it tells you |
| --- | --- |
| nv_inference_request_success | Total successful inference requests per model |
| nv_inference_queue_duration_us | Time requests spend waiting in the queue before execution |
| nv_gpu_utilization | GPU compute utilization as a percentage |
| nv_gpu_memory_used_bytes | GPU memory used by Triton and loaded models |
| nv_inference_exec_count | Number of inference executions (batches, not individual requests) |

Minimal Prometheus scrape config:

```yaml
scrape_configs:
  - job_name: triton
    static_configs:
      - targets: [ "localhost:8002" ]
    metrics_path: /metrics
    scrape_interval: 15s
```

To verify metrics are flowing:

```bash
curl http://localhost:8002/metrics | grep nv_gpu
```

You'll see nv_gpu_utilization and nv_gpu_memory_used_bytes for each GPU device. If both read 0 with no active requests, that's expected. Send a request and re-check; GPU utilization should spike during inference.

nv_inference_queue_duration_us is your main signal for batching tuning. If queue times are consistently low (under 1ms) and throughput is low, your batch sizes are too small. Increase preferred_batch_size. If queue times are high and requests are timing out, your model instances can't keep up with load; add more GPUs or scale horizontally.
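
For ad-hoc checks outside Prometheus, the text format is simple enough to parse directly. A minimal stdlib sketch; the sample payload in any test is fabricated to mirror the shape of Triton's output, not captured from a real server:

```python
import re

def parse_metric(metrics_text, name):
    """Return {label_string: value} for one metric family parsed from
    Prometheus text-format exposition output."""
    pattern = re.compile(rf'^{re.escape(name)}(\{{[^}}]*\}})?\s+([0-9.eE+-]+)$')
    values = {}
    for line in metrics_text.splitlines():
        m = pattern.match(line.strip())
        if m:
            values[m.group(1) or ""] = float(m.group(2))
    return values
```

Point it at the body of http://localhost:8002/metrics and pull out nv_inference_queue_duration_us per model to drive the tuning loop described above.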

Cost Optimization: Right-Sizing GPU Instances for Triton

Multi-model serving makes right-sizing more complex than single-LLM deployments. You can't just pick the cheapest GPU that fits your largest model; you need to fit all resident models simultaneously.

Sizing formula: total VRAM needed = sum of all resident model sizes (in their deployment quantization) + 15% headroom for batching buffers, KV cache, and CUDA overhead.

L40S at $0.72/hr on-demand: Right-sized for mixed lightweight workloads. A BERT reranker (~1GB) + a CLIP encoder (~600MB) + Llama 3.1 8B FP8 (~8GB) fits well under 48GB. Good starting point for most production pipelines that don't need 70B+ LLMs.

A100 80GB at $1.07/hr on-demand: Solid value when you need 80GB but the rest of your workload doesn't justify an H100's price. Works well for Llama 3.2 11B VLM + CLIP + a few small classifiers.

H100 80GB at $2.01/hr on-demand: Use when any single model in your stack exceeds 40GB, or when you're serving a 70B+ LLM alongside other models. The H100 PCIe's memory bandwidth (2.0 TB/s) makes it faster than A100 at the same VRAM size, which matters when you're context-switching between models under concurrent load.

Spot instances: The B200's spot price of $1.71/hr (vs $7.43/hr on-demand) makes it viable for batch inference pipelines where interruption is acceptable (indexing jobs, offline embedding generation, batch reranking). Don't use spot for real-time serving where a preemption causes a 200+ second restart.

B200 at $7.43/hr: Reserve this for the largest multi-model stacks where your 70B LLM, vision encoder, and embedding model combined exceed 130GB, or when you're running multiple concurrent 70B models at scale.

For full GPU selection guidance across model sizes, see AI Inference GPU Guide 2026 and GPU pricing →.


Triton's multi-model serving fits naturally with Spheron's per-hour GPU pricing. Spin up an H100 or L40S, load your model repository, and scale with request volume without overprovisioning.

Rent H100 → | Rent L40S → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.