You've already gotten vLLM working locally. Now you need to serve a model in production: maybe 70B parameters across multiple GPUs, maybe handling thousands of concurrent users, maybe with specific latency requirements. This guide covers the gap between "vLLM works on my laptop" and "vLLM is running reliably in production on bare metal."
Prerequisites: Docker, basic Linux CLI, an account on Spheron, and a model you want to serve (Hugging Face model name or local checkpoint). We'll go from instance setup through single-GPU deployment, multi-GPU tensor parallelism, FP8 quantization, load balancing, and production monitoring: with working code for every step.
vLLM in 2026: Key Features to Know
vLLM has become the default serving engine for production LLM inference. The current stable release (v0.17.1, released March 11 2026) includes several features you need to understand before deploying:
- FP8 inference support: significant throughput improvement on H100 and Blackwell GPUs; enable with a single flag
- Continuous batching: the default now; dynamically groups incoming requests for maximum GPU utilization without your intervention
- Streaming support: built-in server-sent events for real-time token streaming; important for chat applications and voice AI
- Multi-modal support: images + text for models like LLaVA, Qwen2.5-VL, and Llama 4 (Scout 17B-16E, Maverick 17B-128E, both are natively multi-modal MoE models)
- Structured outputs: JSON schema enforcement via guided decoding; critical for agent/tool-calling workloads
- Speculative decoding: faster generation using a small draft model; effective for specific model pairs where latency matters
- FlashAttention 4 backend: new in v0.17.0; the default attention backend on Blackwell (SM100+) GPUs (B200, RTX 50 series); improves prefill throughput and speculative decode performance with no configuration required. On Hopper GPUs (H100, H200), FlashAttention 3 remains the default
- `--performance-mode` flag: new in v0.17.0; choose from `balanced`, `interactivity`, or `throughput` to pre-tune vLLM for your workload type without hand-tuning individual flags
For the full changelog, check vLLM's releases page. What's in this guide is what you actually need to configure for production: not every feature, just the ones that matter.
Choosing Your GPU Configuration
Your GPU choice depends on model size and whether you prioritize throughput or cost. Here's the practical decision table:
| Model Size | Recommended GPU | Parallelism Strategy | On-Demand Price | Spot Price |
|---|---|---|---|---|
| 7B–13B | 1x RTX 4090 (24GB) | None: single GPU | $0.57/hr | N/A |
| 7B–13B | 1x RTX 5090 (32GB) | None: single GPU | $0.76/hr | N/A |
| 13B–30B | 1x L40S (48GB) | None: single GPU | $0.72/hr | N/A |
| 30B–70B (FP8/Q4) | 1x RTX PRO 6000 (96GB) | None: single GPU | $1.65/hr | $0.72/hr |
| 30B–40B (FP16) | 1x A100 80GB (SXM4) | None: single GPU | $1.14/hr | N/A |
| 70B (FP8) | 1x H100 SXM5 80GB | None: just fits (tight ~70GB) | $2.50/hr | N/A |
| 70B (FP16) | 2x H100 SXM5 80GB | Tensor parallel (2) | $4.89/hr | N/A |
| 70B (FP16) | 4x H100 SXM5 80GB | Tensor parallel (4) | $9.67/hr | N/A |
| 100B+ | 8x H100 SXM5 | Tensor + pipeline parallel | $19.22/hr+ | varies |
Note on A100 and FP8: The A100 does not have hardware FP8 Tensor Cores. Running `--dtype fp8` on A100 will either error or silently fall back to FP16. In FP16, each parameter takes 2 bytes, so an 80GB A100 can hold at most ~40B parameters (80GB ÷ 2 bytes), with some headroom needed for KV cache. For 70B models, use H100 with FP8 (fits on one 80GB card at ~70GB) or 2× H100 with FP16.
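The capacity arithmetic above generalizes to any model. A minimal sketch (the `weights_gb` helper is ours, not a vLLM tool) for back-of-envelope weight sizing:

```shell
# Weight VRAM in GB ~= parameters (billions) x bytes per parameter.
# KV cache and activations come on top, so leave 10-25% headroom beyond this.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b }'
}
weights_gb 40 2   # 40B, FP16 (2 bytes) -> 80: fills an A100 80GB with no KV cache room
weights_gb 70 2   # 70B, FP16 -> 140: needs 2x H100
weights_gb 70 1   # 70B, FP8 (1 byte) -> 70: fits one H100 80GB, tightly
```

This is why the table above jumps from one GPU to two at 70B FP16: the weights alone exceed any single card.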
GPU pricing fluctuates over time based on availability and demand. Prices above are live on-demand (dedicated) and spot rates fetched from Spheron's GPU catalog as of 12 Mar 2026. Spot instances offer significant savings where available, but may not always be in stock. Your actual cost may differ - always check current GPU pricing before committing to a configuration.
For smaller models (7B-13B), see our RTX 5090 rental guide for benchmark data on common inference workloads. For the RTX PRO 6000 as a single-GPU 30B–70B option (96GB GDDR7), see our RTX PRO 6000 guide.
Setting Up Your Spheron Instance
Step 1: Launch the Instance
Select your GPU, region, and provider from Spheron's GPU catalog. For a 70B FP16 model with tensor parallelism, select a 2x H100 configuration. For FP8, a single H100 is sufficient.
Step 2: Verify GPU Access
```shell
nvidia-smi
# Should show all GPUs with expected VRAM
# For a 2x H100 instance, you'll see two rows with 80GB each
```
Step 3: Install Docker with NVIDIA Support
Most Spheron GPU instances come with the NVIDIA Container Toolkit pre-installed. If not:
```shell
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
Step 4: Verify Docker GPU Access
```shell
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
If this outputs the same nvidia-smi table as the host, you're ready. If it errors, the container toolkit configuration didn't take: restart Docker and try again.
Single-GPU Deployment: The Starting Point
Start here even if you plan to run multi-GPU. Validate that the model loads, the API works, and your system is healthy before scaling.
```shell
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --host 0.0.0.0 \
  --port 8000
```
Every flag here is deliberate:
- `--gpus all`: expose all GPUs to the container; required for CUDA to work
- `--ipc=host`: use the host's shared memory namespace; do not skip this: vLLM uses shared memory for inter-process communication, and without this flag you'll hit cryptic CUDA errors under load
- `--dtype float16`: FP16 precision; switch to `fp8` on H100/Blackwell for ~2x throughput (covered below)
- `--max-model-len 8192`: maximum context length; lower values mean less KV cache VRAM, which means more room for concurrent requests
- `--gpu-memory-utilization 0.90`: reserve 90% of GPU VRAM for the model and KV cache; keep 10% headroom to avoid OOM on unexpected spikes
- `--max-num-seqs 256`: maximum concurrent sequences; increase for high-throughput workloads if VRAM allows
Test the deployment immediately after it starts:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, are you online?"}],
    "max_tokens": 50
  }'
```
You should get a JSON response with a choices array. If you get a connection error, the model is still loading: vLLM downloads and loads weights before accepting requests, which can take 5–15 minutes for large models on first run.
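Rather than retrying chat requests by hand during that window, poll vLLM's `/health` liveness route. A sketch (the `wait_for_vllm` helper and its defaults are ours):

```shell
# Poll /health until the server answers HTTP 200, then it's safe to send traffic.
# args: base_url [tries] [interval_seconds]
wait_for_vllm() {
  url=$1; tries=${2:-90}; interval=${3:-10}
  for _ in $(seq "$tries"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url/health" || true)
    if [ "$code" = "200" ]; then echo "ready"; return 0; fi
    sleep "$interval"
  done
  echo "timed out"
  return 1
}
# Usage: wait_for_vllm http://localhost:8000 && echo "start sending requests"
```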
Multi-GPU Tensor Parallelism
When a model doesn't fit on one GPU, or when you need lower time-to-first-token (TTFT) by distributing compute, use tensor parallelism.
```shell
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 64
```
What `--tensor-parallel-size 2` does: splits every transformer layer across 2 GPUs. Each GPU processes half the attention heads and half the MLP feed-forward computation. The results are synchronized at the end of each layer via an all-reduce operation across the GPUs.
The performance of tensor parallelism depends heavily on the interconnect between GPUs:
- NVLink (SXM H100/H200): ~900 GB/s bidirectional bandwidth; all-reduce is fast; tensor parallelism scales well
- PCIe (A100 PCIe, non-NVLink setups): ~64 GB/s bidirectional via PCIe 4.0 x16; all-reduce is slower; each layer boundary adds latency
If you're running SXM H100s with NVLink, tensor parallelism at 2x or 4x is very efficient. On PCIe-only multi-GPU setups, expect more overhead between layers: still worth it for models that don't fit, but don't expect linear throughput scaling.
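You can check which interconnect you actually have with `nvidia-smi topo -m`. A small parser sketch (the `link_type` helper is ours) that classifies the GPU0–GPU1 link from that matrix:

```shell
# nvidia-smi topo -m prints a matrix; on the GPU0 row, the GPU1 column reads
# NV# for NVLink, or PIX/PXB/PHB/NODE/SYS for PCIe and system paths.
link_type() {
  awk '$1 == "GPU0" && $2 == "X" { if ($3 ~ /^NV/) print "nvlink"; else print "pcie"; exit }'
}
# Usage on a live system:
#   nvidia-smi topo -m | link_type
```

If this prints `pcie`, temper your tensor-parallel throughput expectations as described above.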
When tensor parallelism helps:
- Model is too large for a single GPU (70B FP16 needs ~140GB, beyond any single GPU)
- TTFT is too high: spreading prefill across more GPUs reduces time to first token
- You have NVLink-connected GPUs and want to push throughput higher
When it doesn't help:
- Running a 7B model across 4 GPUs: the communication overhead outweighs the benefit; just run 4 separate single-GPU instances instead
- PCIe-only systems with a model that barely fits on one GPU: the communication cost may not be worth the marginal VRAM headroom
For detailed GPU interconnect benchmarks and when multi-GPU actually pays off, see our production GPU cloud architecture guide.
FP8 Quantization: 2x Throughput on H100 and Blackwell
FP8 is the single most impactful configuration change available on H100 and Blackwell GPUs. It requires no quantization scripts, no model modifications, and no additional setup: just one flag change:
```shell
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 128
```
What you gain with FP8:
- ~1.5–2x throughput improvement vs FP16 on H100: the H100's FP8 Tensor Cores run at twice the FLOP rate of FP16 Tensor Cores
- ~50% VRAM reduction: a 70B model in FP8 uses roughly 70GB vs 140GB in FP16 - it fits on a single H100 80GB, though it is tight; tune `--gpu-memory-utilization` (0.92+) and `--max-model-len` carefully to avoid OOM
- Marginal quality loss: typically less than 1-2% on standard benchmarks; acceptable for production inference on most tasks
Model compatibility: Most major open-source models (Llama 3.x, Mistral, Qwen 2.5, Phi-4) have been validated with vLLM FP8. For models without pre-quantized FP8 weights available, vLLM performs dynamic quantization on the fly using the original weights. Verify FP8 compatibility in vLLM's supported models documentation.
FP8 requires hardware support: H100, H200, NVIDIA Ada Lovelace GPUs (RTX 4090, L40S), and NVIDIA Blackwell GPUs (B200, B100, and consumer RTX 50 series including the RTX 5090) have dedicated FP8 Tensor Cores. On A100 or older Ampere hardware, --dtype fp8 will either fail or fall back to FP16 automatically: check your vLLM logs to confirm which mode is active.
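A quick preflight sketch (the `fp8_capable` helper is ours; `compute_cap` is a query field in recent nvidia-smi drivers) to confirm FP8 hardware before flipping the flag:

```shell
# FP8 Tensor Cores ship with compute capability 8.9 (Ada Lovelace) and above:
# 9.0 (Hopper) and Blackwell qualify; 8.0 (A100) and 8.6 (RTX 30 series) do not.
fp8_capable() {
  awk -v cc="$1" 'BEGIN { exit !(cc + 0 >= 8.9) }'
}
# On a live system:
#   cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1)
#   if fp8_capable "$cc"; then echo "use --dtype fp8"; else echo "stay on float16"; fi
```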
Production Configuration for High Throughput
Once the model is running, tune these parameters for production workloads handling hundreds of concurrent requests:
```shell
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 65536 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8
```
What each production setting does:
- `--max-num-seqs 512`: increase from the default 256 when you have many concurrent users and your VRAM can support it; monitor `vllm:kv_cache_usage_perc` to see if you're running out of KV cache space
- `--max-num-batched-tokens 65536`: maximum tokens processed per forward pass iteration; increase for throughput-optimized workloads, decrease if you see out-of-memory errors under bursty load
- `--enable-chunked-prefill`: breaks long prefill sequences into smaller chunks and interleaves them with ongoing decode steps; reduces latency spikes when your traffic mix includes both long prompts and short responses
- `--kv-cache-dtype fp8`: stores the KV cache in FP8 format; saves ~50% VRAM on cached activations (FP8 uses 1 byte vs 2 bytes per element in FP16), allowing more concurrent requests with the same GPU; quality impact is minimal for most workloads
These settings are not universal: tune --max-num-seqs and --max-num-batched-tokens based on your actual traffic patterns. A workload with many short requests benefits from higher --max-num-seqs. A batch inference workload with long documents needs higher --max-num-batched-tokens. Start with these values and adjust based on the metrics in the monitoring section.
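To reason about these limits concretely, estimate the per-sequence KV cache footprint from the model architecture. A sketch (the `kv_gb` helper is ours; the Llama 3.3 70B figures - 80 layers, 8 KV heads via GQA, head dimension 128 - are the model's published architecture):

```shell
# Per-token KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x bytes/elem.
# Multiply by context length for the worst-case footprint of one sequence.
kv_gb() {  # args: layers kv_heads head_dim bytes_per_elem tokens
  awk -v L="$1" -v H="$2" -v D="$3" -v B="$4" -v T="$5" \
    'BEGIN { printf "%.1f\n", 2 * L * H * D * B * T / 1e9 }'
}
kv_gb 80 8 128 2 16384   # Llama 3.3 70B, FP16 cache, 16K context -> 5.4 GB per sequence
kv_gb 80 8 128 1 16384   # same with --kv-cache-dtype fp8       -> 2.7 GB per sequence
```

At over 5GB per worst-case sequence, even 160GB of HBM sustains only a few dozen full-length requests, which is why lowering `--max-model-len` or the KV cache dtype buys so much concurrency.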
Shortcut: `--performance-mode` (new in v0.17.0): If you don't want to tune individual flags, use `--performance-mode throughput` for batch workloads or `--performance-mode interactivity` for chat/real-time applications. The `balanced` mode (default) is a reasonable starting point for mixed traffic. This flag configures a curated set of defaults for each scenario; you can still override individual flags on top of it.
Load Balancing Multiple vLLM Instances
Tensor parallelism makes one vLLM instance use more GPUs. Load balancing makes multiple vLLM instances handle more total traffic. Use both when your traffic exceeds what one instance can handle.
The simplest horizontal scaling approach: run separate vLLM instances on separate GPU devices, then load balance across them with NGINX.
```shell
# Instance 1: pinned to GPU 0, listening on port 8000
docker run --gpus '"device=0"' --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --max-num-seqs 256

# Instance 2: pinned to GPU 1, listening on port 8001
docker run --gpus '"device=1"' --ipc=host -p 8001:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --max-num-seqs 256
```
Note: This example uses `--dtype float16` for broad hardware compatibility: it works on RTX 4090, RTX 5090, A100, and all other CUDA GPUs. If you are running on H100, H200, NVIDIA Blackwell (B200), RTX 5090, or RTX 4090 (Ada Lovelace), you can replace `float16` with `fp8` to leverage hardware FP8 Tensor Cores for ~2x throughput. FP8 Tensor Core support requires Ada Lovelace (SM89, RTX 40 series) or newer; do not use `--dtype fp8` on A100, RTX 3090, or other Ampere (SM80/SM86) and older hardware, as those GPUs lack FP8 Tensor Cores and will fall back to FP16 silently or error. The RTX 5090 (Blackwell GB202) includes FP8 Tensor Cores and is fully supported in vLLM v0.17.0+, which ships a dedicated SM120 FP8 GEMM optimization for higher FP8 throughput on Blackwell consumer GPUs.
NGINX configuration for load balancing across both instances:
```nginx
upstream vllm_backend {
    least_conn;
    server localhost:8000;
    server localhost:8001;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_connect_timeout 10s;
        proxy_buffering off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Use `least_conn` (least connections), not round-robin. vLLM requests vary dramatically in duration: a short request completes in 100ms, a long generation can take 30 seconds. Least-connections routing sends new requests to whichever backend currently has fewer active connections, naturally load-balancing based on actual utilization rather than request count.
Install NGINX and load the configuration:
```shell
sudo apt-get install -y nginx
sudo cp vllm-nginx.conf /etc/nginx/sites-available/vllm
sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```
For multi-node deployments (separate physical servers), replace localhost:8000 and localhost:8001 with the actual IP addresses of each node. The rest of the NGINX configuration remains the same.
Monitoring in Production
Running vLLM in production without monitoring is running blind. Two systems to set up from day one:
GPU-Level Monitoring
```shell
# Real-time GPU stats: VRAM usage, utilization, temperature
watch -n 2 nvidia-smi

# Persistent logging to file
nvidia-smi dmon -s pum -d 10 >> gpu-metrics.log &
```
For production, see our GPU monitoring guide for setting up DCGM with Prometheus and Grafana dashboards: nvidia-smi polling works for development but doesn't scale to multi-GPU production systems.
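For a lightweight alert in the meantime, you can parse nvidia-smi's CSV output. A sketch (the `vram_alert` helper is ours):

```shell
# Reads "used, total" MiB pairs (one line per GPU) and flags any GPU over 90% full.
vram_alert() {
  awk -F', ' '$2 > 0 && $1 / $2 > 0.90 { print "GPU " NR-1 ": VRAM above 90%" }'
}
# Usage (cron-able):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_alert
```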
vLLM Metrics Endpoint
vLLM exposes a Prometheus-compatible metrics endpoint at /metrics:
```shell
curl http://localhost:8000/metrics | grep vllm
```
The metrics you need to watch:
| Metric | What it Tells You | Action if High |
|---|---|---|
| `vllm:num_requests_running` | Active requests being processed | Normal: reflects traffic |
| `vllm:num_requests_waiting` | Requests queued, waiting for a free slot | Scale out or increase `--max-num-seqs` |
| `vllm:kv_cache_usage_perc` | KV cache fill percentage | If >95%, reduce `--max-num-seqs` or `--max-model-len` |
| `vllm:time_to_first_token_seconds` | Latency before first token is generated | Your primary user-facing latency metric |
| `vllm:e2e_request_latency_seconds` | Total request duration | Track p95 and p99, not just mean |
A rising num_requests_waiting with high kv_cache_usage_perc means you're KV-cache-bound: the bottleneck is memory, not compute. Reduce --max-model-len or add --kv-cache-dtype fp8 to free up space. A rising num_requests_waiting with low kv_cache_usage_perc means you're compute-bound: add more GPUs or instances.
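A small extraction sketch for exactly those two gauges (metric names as in the table above; the `vllm_queue_watch` wrapper is ours):

```shell
# Print queue depth and KV cache fill from Prometheus text format on stdin.
vllm_queue_watch() {
  awk '/^vllm:(num_requests_waiting|kv_cache_usage_perc)/ { print $1, $2 }'
}
# Usage: curl -s http://localhost:8000/metrics | vllm_queue_watch
```

Reading both together tells you in one glance whether a growing queue is a memory problem or a compute problem.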
Common Issues and Fixes
OOM (Out of Memory) Error on Startup
```
CUDA out of memory. Tried to allocate X GiB
```
Diagnosis steps in order:
- Reduce `--gpu-memory-utilization` from 0.90 to 0.85: give the GPU more headroom
- Reduce `--max-model-len`: every 1024 tokens of context length requires additional KV cache VRAM
- Switch from `--dtype float16` to `--dtype fp8`: cuts model VRAM by ~50% on H100
- Increase `--tensor-parallel-size` to spread the model across more GPUs
Slow TTFT (Time to First Token)
If your time-to-first-token is consistently above 2-3 seconds for normal-length prompts:
- Enable tensor parallelism: spreading prefill computation across 2-4 GPUs cuts TTFT proportionally
- Enable chunked prefill (`--enable-chunked-prefill`): interleaves prefill with ongoing decode, preventing long prefill jobs from blocking short ones
- Reduce `--max-model-len`: if you don't need 128K context, set it to 8K or 16K; shorter configured max length = smaller KV cache allocation per sequence = more room for batching
CUDA Error: Device-Side Assert Triggered
```
RuntimeError: CUDA error: device-side assert triggered
```
This is almost always a tokenizer mismatch. Causes:
- Input text that tokenizes to more tokens than `--max-model-len`
- Using the wrong model name in the API request (model field must match the model vLLM loaded)
- Special tokens or unicode sequences the tokenizer doesn't handle cleanly
Fix: verify that your input length (in tokens, not characters) is below `--max-model-len`. Use the tokenizer's encode method to check before sending.
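A client-side guard sketch (the `fits_context` helper is ours; recent vLLM OpenAI servers also expose a `/tokenize` route for server-side counting, but verify its response shape against your deployed version):

```shell
# Reject a request whose prompt tokens plus generation budget exceed --max-model-len.
fits_context() {  # args: prompt_tokens max_model_len max_new_tokens
  [ $(( $1 + $3 )) -le "$2" ]
}
fits_context 7900 8192 512 && echo "ok" || echo "too long"   # 8412 > 8192 -> too long
# Getting prompt_tokens server-side (hedged: field names vary by vLLM version):
#   curl -s http://localhost:8000/tokenize -H 'Content-Type: application/json' \
#     -d '{"model": "<model>", "prompt": "<text>"}'
```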
Low GPU Utilization (Consistently Below 70%)
If nvidia-smi shows GPU utilization below 70% but requests are queuing:
- Increase `--max-num-seqs`: you're not saturating the GPU with enough concurrent sequences
- Increase `--max-num-batched-tokens`: each forward pass is processing too few tokens
- Check your client: if you're sending requests one at a time and waiting for each to complete, you're not actually generating concurrent load; vLLM's continuous batching requires concurrent requests to be effective
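To verify concurrency is actually reaching the server, fire a parallel burst rather than a sequential loop. A sketch (the `load_burst` helper is ours; the model name matches the single-GPU example above):

```shell
# Launch n requests in parallel (backgrounded), then wait for all of them.
# A sequential curl loop can never exercise vLLM's continuous batching.
load_burst() {  # args: base_url n_requests
  url=$1; n=$2
  for _ in $(seq "$n"); do
    curl -s -o /dev/null "$url/v1/completions" \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Hi", "max_tokens": 32}' &
  done
  wait
}
# Usage: load_burst http://localhost:8000 64, then check vllm:num_requests_running
```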
Model Download Fails at Startup
```
OSError: Hugging Face Hub is not reachable
```
- Verify your `HUGGING_FACE_HUB_TOKEN` environment variable is set correctly
- Check that the model name matches exactly (case-sensitive) the Hugging Face repo path
- Pre-download the model with `huggingface-cli download model-name` and mount the local directory instead, to decouple model download from container startup
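The pre-download-and-mount pattern in full, as a sketch (paths follow the Hugging Face default cache layout; adjust for your setup):

```shell
# 1) Download weights on the host, outside the container lifecycle
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct

# 2) Mount the host cache into the container so startup never hits the network
docker run --gpus all --ipc=host -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16
```

This also makes container restarts fast, since the weights are already on local disk.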
For more context on overall GPU infrastructure architecture for production ML, see our production GPU cloud architecture guide and how to fine-tune LLMs in 2026 for the training side of the stack.
Deploy vLLM on Spheron's bare-metal H100s, RTX 5090s, or B200s: full CUDA access, no virtualization overhead. Your models run at native GPU performance.
