You've picked a model. Now you need to decide how to serve it. vLLM, TensorRT-LLM, and SGLang are the three engines that matter for production LLM inference in 2026, and they make very different tradeoffs. We ran all three on the same H100 80GB with Llama 3.3 70B Instruct at FP8 precision. Here is what the numbers actually look like. If you already have vLLM in production and want the multi-GPU deployment guide, see vLLM Multi-GPU Production Deployment 2026. All three framework deployment guides are available at Spheron's LLM quick-guides if you want to follow along.
TL;DR
| Engine | Best For | Throughput (50 req) | TTFT p50 (10 req) | Cold Start |
|---|---|---|---|---|
| vLLM | General use, broad model support | 1,850 tok/s | 120 ms | ~62 sec |
| TensorRT-LLM | Max throughput, fixed model | 2,100 tok/s | 105 ms | ~28 min |
| SGLang | Shared-prefix workloads, low latency | 1,920 tok/s | 112 ms | ~58 sec |
- Use vLLM if you want the quickest path to production and model-update flexibility.
- Use TensorRT-LLM if you have a single model in long-term production and throughput is paramount.
- Use SGLang if your workload has shared prefixes (chatbots, RAG pipelines, multi-turn conversations).
Test Setup
Hardware
We ran all benchmarks on a single Spheron H100 SXM5 80GB instance at on-demand rates (see current pricing). The instance runs on bare metal with no hypervisor overhead. Host driver 590.48.01 (current stable R590 release). vLLM and SGLang run CUDA 13.0 (cu130) containers; TensorRT-LLM v1.2.0 uses CUDA 13.1.0 (pytorch:25.12-py3). All three containers run without compatibility shims on driver 590. NVLink is present but not used for single-GPU runs.
Model
We used meta-llama/Llama-3.3-70B-Instruct in FP8 precision. Llama 3.3 is the most widely deployed dense 70B instruction-following model and remains the standard benchmark target for inference engine comparisons. Llama 4 (released April 2025) uses a mixture-of-experts architecture with different single-GPU memory characteristics; for Llama 4 deployment on Spheron, see the Llama 4 Scout & Maverick guide. For Llama 3.3 setup details, see Spheron's Llama 3 guide. At FP8, the 70B weights occupy approximately 70GB, which fits on the 80GB H100 with careful tuning. All three frameworks use native FP8 quantization: vLLM via --quantization fp8 (online dynamic quantization on load), SGLang via --quantization fp8, and TensorRT-LLM via --qformat fp8 in the quantize.py step before compilation, all fully supported on H100 since CUDA 12.0.
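As a sanity check on the memory claim above, weight footprint scales linearly with bytes per parameter. A quick back-of-envelope sketch (the helper name is ours, not from any framework) shows why FP8 fits on one 80GB card and FP16 does not:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, excluding KV cache and activations."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9, 1.0))  # FP8: 70.0 GB -- fits, with ~10 GB left for KV cache
print(weight_memory_gb(70e9, 2.0))  # FP16: 140.0 GB -- would need two GPUs
```

The remaining ~10 GB is what the KV cache, activations, and CUDA context must share, which is why --max-model-len tuning matters so much in this configuration.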
Framework versions:
- vLLM v0.18.0
- TensorRT-LLM v1.2.0
- SGLang v0.5.9
Benchmark Methodology
We used an async Python client built on aiohttp to generate load. Each run used 200 prompts sampled from a diverse instruction dataset, with average input length of 512 tokens and average output length of 256 tokens (fixed seed 42 for reproducibility). We tested at four concurrency levels: 1, 10, 50, and 100 simultaneous requests. Each concurrency level ran for 3 minutes after a 60-second warm-up period. VRAM was sampled via nvidia-smi --query-gpu=memory.used at 1-second intervals; peak is the maximum recorded value during the measurement window.
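The dispatch logic of such a client looks roughly like the following sketch. The real client POSTs to the server with aiohttp; here the request is stubbed so the concurrency control is runnable standalone, and all names are illustrative:

```python
import asyncio

async def send_request(prompt: str) -> dict:
    # Stub for the real aiohttp POST to the server's /v1/ endpoint.
    await asyncio.sleep(0.001)
    return {"prompt": prompt, "output_tokens": 256}

async def run_level(prompts, concurrency: int):
    # A semaphore caps in-flight requests at the target concurrency level;
    # a finished slot is refilled immediately, mimicking sustained load.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(p):
        async with sem:
            return await send_request(p)

    return await asyncio.gather(*(bounded(p) for p in prompts))

prompts = [f"prompt-{i}" for i in range(200)]
results = asyncio.run(run_level(prompts, concurrency=50))
print(len(results))  # 200
```

The semaphore pattern is what distinguishes "50 concurrent requests" from "batches of 50": requests are continuously in flight rather than dispatched in lockstep waves.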
Benchmark Results
Throughput (Output Tokens per Second)
| Concurrency | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| 1 req | 120 tok/s | 130 tok/s | 125 tok/s |
| 10 req | 650 tok/s | 710 tok/s | 680 tok/s |
| 50 req | 1,850 tok/s | 2,100 tok/s | 1,920 tok/s |
| 100 req | 2,400 tok/s | 2,780 tok/s | 2,460 tok/s |
TensorRT-LLM leads at every concurrency level once the engine is compiled. The gap is smallest at low concurrency (about 8% faster than vLLM at 1 request) and widest at 100 concurrent requests (about 16% faster). SGLang falls between vLLM and TensorRT-LLM at high concurrency. At low concurrency, SGLang's RadixAttention provides marginal throughput gains only when requests share prefixes; our benchmark used unique prompts throughout, so you are seeing the baseline behavior.
Time to First Token (TTFT, milliseconds)
| Concurrency | vLLM p50 | vLLM p95 | TRT-LLM p50 | TRT-LLM p95 | SGLang p50 | SGLang p95 |
|---|---|---|---|---|---|---|
| 1 req | 45 ms | 68 ms | 38 ms | 55 ms | 42 ms | 61 ms |
| 10 req | 120 ms | 195 ms | 105 ms | 170 ms | 112 ms | 178 ms |
| 50 req | 380 ms | 720 ms | 340 ms | 620 ms | 360 ms | 680 ms |
| 100 req | 740 ms | 1,450 ms | 680 ms | 1,280 ms | 710 ms | 1,380 ms |
TTFT is the metric that determines whether your application feels fast. TensorRT-LLM delivers the lowest p50 and p95 at every concurrency level. The p95 gap matters most at high load: at 100 concurrent requests, TensorRT-LLM's p95 TTFT is 1,280 ms versus vLLM's 1,450 ms. That 170 ms difference affects user-perceived responsiveness in interactive applications. SGLang's p95 sits between the other two at all concurrency levels tested here.
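Operationally, TTFT is the wall-clock gap between dispatching a streaming request and receiving the first token. A minimal sketch of the measurement (the generator here stands in for an SSE response with stream=true; names are ours):

```python
import time

def measure_ttft_ms(stream) -> float:
    # Time from dispatch to the first streamed token; subsequent tokens
    # feed the throughput numbers, not TTFT.
    t0 = time.perf_counter()
    next(stream)
    return (time.perf_counter() - t0) * 1000.0

def fake_stream():
    time.sleep(0.05)  # simulated prefill latency
    yield "Hello"
    yield " world"

ttft = measure_ttft_ms(fake_stream())
print(f"{ttft:.0f} ms")  # ~50 ms
```

Note that the generator body does not run until the first next() call, so the simulated prefill delay is correctly captured inside the timed window, just as real prefill is.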
Peak VRAM Usage
| Engine | Idle (model loaded) | Peak at 50 req | Peak at 100 req |
|---|---|---|---|
| vLLM | 71 GB | 76 GB | 78 GB |
| TensorRT-LLM | 74 GB | 77 GB | 79 GB |
| SGLang | 72 GB | 75 GB | 78 GB |
VRAM usage is tight across all three frameworks with a 70B FP8 model on an 80GB GPU. TensorRT-LLM's compiled engine idles higher (74 GB) than vLLM (71 GB) because it pre-allocates activation buffers at build time. SGLang records the lowest peak at 50 concurrent requests (75 GB) thanks to its KV cache management, though all three converge to 78-79 GB at 100 requests. The spread between frameworks never exceeds 3 GB, so the headroom for tuning --max-model-len is similar across all three. If VRAM is your bottleneck, the engine choice matters less than your max-model-len and gpu-memory-utilization settings.
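The sampling method is easy to reproduce. This sketch polls nvidia-smi the way the benchmark did (once per second, keeping the max); the helper names are ours:

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]

def parse_used_mib(output: str) -> int:
    # nvidia-smi emits one integer line per GPU, e.g. "76432";
    # we take the first GPU for this single-GPU setup.
    return int(output.strip().splitlines()[0])

def track_peak(duration_s: int = 180, interval_s: float = 1.0) -> int:
    # Run this in a side process for the measurement window and
    # record the highest sample seen.
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, parse_used_mib(subprocess.check_output(QUERY, text=True)))
        time.sleep(interval_s)
    return peak
```

One caveat of 1-second sampling: very short allocation spikes between samples can be missed, so treat the peak numbers as a floor rather than an exact ceiling.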
Cold Start Time
| Engine | Time to First Request |
|---|---|
| vLLM | ~62 seconds |
| TensorRT-LLM | ~28 minutes (engine compilation) |
| SGLang | ~58 seconds |
TensorRT-LLM's compilation time is not a flaw: it is a deliberate tradeoff. The 28-minute build runs once per model version, saves the compiled engine to disk, and subsequent starts reuse it (reloading the compiled engine takes about 90 seconds). The cost shows up in your deployment pipeline. If you do blue-green deploys, auto-scaling from zero, or frequent model updates, you need to plan around this. vLLM and SGLang both start in under 90 seconds (dominated by model weight loading from disk), which makes them compatible with auto-scaling policies that spin instances up on demand.
If you want TensorRT-LLM performance without the compile step, the PyTorch backend is available. v1.0 promoted it to stable and made it the default, replacing the older plugin-based backend. Since it is now the default, you can omit --backend pytorch entirely or pass it explicitly to trtllm-serve; either way it loads HuggingFace weights directly, cutting cold start to roughly 60-90 seconds. The benchmarks above used the compiled TRT engine; the PyTorch backend will have lower peak throughput but removes the compilation barrier entirely.
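Whichever engine you pick, "time to first request" can be measured the same way: start a timer, poll the OpenAI-compatible /v1/models endpoint until it answers, and record the elapsed time. A stdlib-only sketch (function name ours):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str, timeout_s: float = 3600, poll_s: float = 2.0) -> float:
    # Returns elapsed seconds until the endpoint first responds --
    # a uniform cold-start number across all three engines.
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return time.monotonic() - start
        except (urllib.error.URLError, ConnectionError, OSError):
            time.sleep(poll_s)
    raise TimeoutError(f"{url} not ready after {timeout_s}s")
```

Usage: `wait_until_ready("http://localhost:8000/v1/models")` started immediately after `docker run`. Set timeout_s generously if you are timing a TensorRT-LLM engine build.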
vLLM
vLLM's core design is PagedAttention: a KV cache memory manager that treats GPU memory like virtual memory pages. This lets vLLM serve many concurrent requests without reserving per-request memory upfront. Combined with continuous batching (dynamically grouping requests as they arrive rather than waiting for a full batch), vLLM achieves high GPU utilization on bursty traffic without manual batching logic.
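A toy model of the paged idea (ours, not vLLM's actual data structures): the KV cache is carved into fixed-size blocks, and a sequence claims a new block only when its current one fills, so memory grows with tokens actually generated rather than with a worst-case per-request reservation:

```python
class PagedKVAllocator:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids
        self.tables = {}                      # seq_id -> claimed block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # first token, or current block full
            if not self.free:
                raise MemoryError("KV cache exhausted: preempt or queue the request")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        # Finished sequences return their blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=64)
for _ in range(20):
    alloc.append_token(seq_id=0)
print(len(alloc.tables[0]))  # 2 blocks for 20 tokens (block_size=16)
```

The gpu-memory-utilization flag effectively sets num_blocks: vLLM sizes the block pool from whatever VRAM remains after weights are loaded.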
FP8 on H100 works with a single flag change. No quantization script, no model modification:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:v0.18.0-cu130 \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 128 \
--host 0.0.0.0 \
--port 8000
Strengths: widest model support of the three frameworks (hundreds of architectures including multimodal models like Qwen3-VL, Qwen3-Omni, InternVL3, LLaVA-Next, Pixtral-12B, and Baidu ERNIE-4.5-VL, plus popular open-source families like Qwen3, Gemma 3, DeepSeek R1 & V3, Phi-4, and Mistral/Mixtral), no compilation step, simple deployment, OpenAI-compatible API out of the box. gRPC serving (via --grpc) provides an alternative to REST for lower-overhead internal deployments. v0.18.0 removes Ray as a default dependency; install it separately (pip install ray) if you use multi-node tensor parallelism. The --performance-mode throughput flag (introduced in v0.17.0) pre-tunes settings for batch workloads. The full deployment guide for production setups is at Spheron's vLLM docs.
Limitations: slightly lower peak throughput than TensorRT-LLM at high concurrency, and TTFT p95 is highest of the three at 100 concurrent requests.
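Because all three engines expose the same OpenAI-compatible surface, one client works against any of them. A stdlib-only sketch (helper name ours; point base_url at whichever server you started):

```python
import json
import urllib.request

def chat(base_url: str, prompt: str, max_tokens: int = 256) -> str:
    # Standard OpenAI-style chat completion request; vLLM, trtllm-serve,
    # and SGLang all accept this shape.
    body = json.dumps({
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example: print(chat("http://localhost:8000", "Explain PagedAttention in one sentence."))
```

This portability is worth preserving: if your application only speaks this API shape, switching engines later is a redeploy, not a rewrite.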
TensorRT-LLM
TensorRT-LLM is NVIDIA's compiler-based approach. Instead of running the model weights through a general-purpose PyTorch runtime, TensorRT-LLM compiles the model into an optimized CUDA kernel graph tailored to your specific GPU, batch size, and sequence length configuration. The result is a compiled engine binary that extracts more hardware efficiency than any runtime-based approach.
The build pipeline for our benchmark:
# Pull the TensorRT-LLM container
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.2.0
# Step 1: Quantize HuggingFace weights to FP8 checkpoint (~5 min on H100)
docker run --gpus all --ipc=host \
-v /path/to/model:/models \
-v /path/to/engine:/engines \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
python examples/quantization/quantize.py \
--model_dir /models/Llama-3.3-70B-Instruct \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir /engines/fp8-checkpoint \
--calib_size 512
# Step 2: Build the TRT engine from the FP8 checkpoint (~23 min on H100)
docker run --gpus all --ipc=host \
-v /path/to/engine:/engines \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
trtllm-build \
--checkpoint_dir /engines/fp8-checkpoint \
--output_dir /engines/llama70b-engine \
--max_batch_size 128 \
--max_input_len 8192 \
--max_seq_len 10240
# Run the OpenAI-compatible server
docker run --gpus all --ipc=host -p 8000:8000 \
-v /path/to/engine:/engines \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
trtllm-serve /engines/llama70b-engine --port 8000 --host 0.0.0.0
The v1.2.0 build pipeline requires two steps: first quantize the HuggingFace weights to a TRT-LLM FP8 checkpoint using modelopt's quantize.py, then compile with trtllm-build --checkpoint_dir. The trtllm-serve command (added in v0.15.0) replaced the old launch_server.py. The container path is the NGC release registry (nvcr.io/nvidia/tensorrt-llm/release:1.2.0), based on nvcr.io/nvidia/pytorch:25.12-py3 (CUDA 13.1.0). CUDA 13.0 (used by vLLM and SGLang cu130 containers) requires driver 580.65.06 or later on Linux; CUDA 13.1 (used by TRT-LLM 1.2.0) requires driver 590.44.01 or later. The 590.48.01 driver in our setup meets both requirements. Mismatches produce cryptic build errors; verify your driver version before starting.
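Given the cryptic failure mode, it is worth checking the driver programmatically before committing to a 28-minute build. A small sketch using the minimums stated above (names ours; feed it the version string from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`):

```python
# Minimum Linux driver versions for the CUDA releases used in this comparison.
MIN_DRIVER = {"cuda-13.0": (580, 65, 6), "cuda-13.1": (590, 44, 1)}

def parse_driver(version: str) -> tuple:
    # "590.48.01" -> (590, 48, 1); tuples compare field by field.
    return tuple(int(part) for part in version.split("."))

def driver_ok(version: str, cuda: str) -> bool:
    return parse_driver(version) >= MIN_DRIVER[cuda]

print(driver_ok("590.48.01", "cuda-13.1"))  # True -- our benchmark host
print(driver_ok("580.65.06", "cuda-13.1"))  # False -- too old for TRT-LLM 1.2.0
```

Wiring this into your provisioning script as a preflight check is cheaper than debugging a failed trtllm-build halfway through.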
v1.2.0 added DGX Spark validation (beta) and extended DeepSeek V3.2 support (speculative decoding with MTP>1). DeepSeek V3/R1 support was first added in TensorRT-LLM v0.19.0, with further optimizations across v1.0, v1.1, and v1.2. If you are running DeepSeek on Spheron, see the Spheron DeepSeek R1 & V3 guide.
Full deployment instructions are at Spheron's TensorRT-LLM docs.
Strengths: highest throughput at every concurrency level, lowest p50 and p95 TTFT, best hardware utilization on fixed configurations. The PyTorch backend (v1.0+) removes the compilation requirement at the cost of some peak throughput. Limitations: compiled engine path requires 28-minute build per model version; narrower model support than vLLM; more complex deployment pipeline for the compiled path.
SGLang
SGLang's primary innovation is RadixAttention: a KV cache management system that stores cached attention activations in a radix tree keyed by the token sequence. When two requests share a common prefix (a system prompt, a document, few-shot examples), SGLang computes the KV cache for that prefix once and reuses it across all requests that share it. In workloads with long shared prefixes, this reduces both TTFT and compute cost significantly.
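A flat-list toy of the reuse idea (ours; SGLang's real structure is a radix tree, which makes the longest-prefix lookup cheap instead of a linear scan over cached sequences):

```python
def shared_prefix_len(a: list, b: list) -> int:
    # Longest common prefix between two token sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    def __init__(self):
        self.cached = []  # token sequences whose KV is resident

    def tokens_to_prefill(self, tokens: list) -> int:
        # Only the suffix beyond the best cached prefix needs prefill compute.
        reused = max((shared_prefix_len(tokens, c) for c in self.cached), default=0)
        self.cached.append(tokens)
        return len(tokens) - reused

cache = PrefixCache()
system_prompt = list(range(500))                       # 500-token shared system prompt
print(cache.tokens_to_prefill(system_prompt + [901]))  # 501 -- cold, full prefill
print(cache.tokens_to_prefill(system_prompt + [902]))  # 1 -- prefix KV reused
```

The second request prefills 1 token instead of 501, which is why shared-prefix workloads see outsized TTFT improvements while unique-prompt benchmarks (like ours) show none.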
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.5.9-cu130-runtime \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--context-length 8192 \
--mem-fraction-static 0.92 \
--max-running-requests 128 \
--host 0.0.0.0 \
--port 8000
SGLang also includes speculative decoding support and a structured output engine for constrained generation (JSON schemas, regex). The structured output implementation is competitive with vLLM's guided decoding. v0.5.9 adds native Anthropic API compatibility alongside the OpenAI-compatible endpoint, useful if your application already targets the Anthropic SDK. The same release integrates TRT-LLM DSA (DeepSeek Sparse Attention) kernels into SGLang's Native Sparse Attention (NSA) backend for DeepSeek V3.2, delivering 3x-5x speedup on Blackwell via --nsa-prefill-backend trtllm and --nsa-decode-backend trtllm, and extends model support to Qwen3.5, Kimi-K2.5, GLM-5, and MiniMax 2.5. The release also adds LoRA weight loading overlap with computation, reducing TTFT by roughly 78% for LoRA adapter workloads. For DeepSeek deployment on Spheron, see the DeepSeek R1 & V3 guide.
Full docs: Spheron's SGLang guide.
Strengths: best TTFT when shared-prefix caching applies, competitive throughput with vLLM, fast cold start, strong structured output support. Limitations: RadixAttention's benefit disappears for unique-prompt workloads; in our benchmark (all unique prompts), SGLang's throughput advantage over vLLM was minimal.
When to Use Each Engine
| Condition | Recommended Engine |
|---|---|
| You need to support many different models | vLLM |
| You're serving one model in long-term production | TensorRT-LLM |
| Your workload has shared system prompts or RAG context | SGLang |
| You need to be online in under 2 minutes from cold | vLLM, SGLang, or TRT-LLM PyTorch backend |
| You want the absolute highest throughput at 100+ concurrent | TensorRT-LLM |
| You're experimenting or prototyping | vLLM |
| Your team has limited DevOps capacity | vLLM |
| You run multi-turn conversations at scale | SGLang |
Most teams should start with vLLM. It covers the widest model range, has the best documentation, requires no compilation step, and delivers throughput that is competitive for most workloads. Move to TensorRT-LLM when you have a model that won't change for months and you need to squeeze out every token per second at scale. Move to SGLang if you measure TTFT on shared-prefix workloads and find it materially improves user experience. If you want a lighter-weight alternative for smaller models, llama.cpp, LMDeploy, and LocalAI are also available on Spheron instances. For browser-based experimentation without the Docker setup, the Ollama + Open WebUI guide covers a simpler path.
Deploy in 5 Minutes on Spheron
First, provision a Spheron H100 instance, SSH in, and verify GPU access with nvidia-smi. If you're new to Spheron, the quick-start guide walks through provisioning your first instance and verifying CUDA access. The Spheron docs overview covers instance types and getting started. All three framework guides live under the Spheron LLM quick-guides if you need deeper configuration details beyond the commands below.
vLLM
# Start the server
docker run --gpus all --ipc=host -p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:v0.18.0-cu130 \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--host 0.0.0.0 \
--port 8000
# Verify the endpoint
curl http://localhost:8000/v1/models
Full guide: Spheron vLLM docs
TensorRT-LLM
Build the engine first (one-time, ~28 min), then start the server. See the build commands in the TensorRT-LLM section above.
# After building the engine, start the server
docker run --gpus all --ipc=host -p 8000:8000 \
-v /path/to/engine:/engines \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0 \
trtllm-serve /engines/llama70b-engine --port 8000 --host 0.0.0.0
# Verify
curl http://localhost:8000/v1/models
Full guide: Spheron TensorRT-LLM docs
SGLang
# Start the server
docker run --gpus all --ipc=host -p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.5.9-cu130-runtime \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--context-length 8192 \
--mem-fraction-static 0.92 \
--host 0.0.0.0 \
--port 8000
# Verify
curl http://localhost:8000/v1/models
Full guide: Spheron SGLang docs
The choice between these three frameworks comes down to your deployment constraints. If you can absorb a 28-minute compilation step and your model is stable, TensorRT-LLM gives you the best throughput and latency. If you need fast starts and model flexibility, vLLM is the right default. If your workload is built around shared prefixes, SGLang's RadixAttention provides real gains you won't get from the other two.
Running inference benchmarks like these requires H100 hardware with full CUDA access. Spheron provides bare-metal H100 instances at $2.01/hr with no virtualization overhead, on-demand provisioning, and no long-term commitment.
Rent H100 → | View all GPU pricing → | Get started on Spheron →
