You've already gotten vLLM working locally. Now you need to serve a model in production: maybe 70B parameters across multiple GPUs, maybe handling thousands of concurrent users, maybe with specific latency requirements. This guide covers the gap between "vLLM works on my laptop" and "vLLM is running reliably in production on bare metal."
Prerequisites: Docker, basic Linux CLI, an account on Spheron, and a model you want to serve (Hugging Face model name or local checkpoint). We'll go from instance setup through single-GPU deployment, multi-GPU tensor parallelism, FP8 quantization, load balancing, and production monitoring: with working code for every step.
vLLM in 2026: Key Features to Know
vLLM has become the default serving engine for production LLM inference. The current stable release (v0.17.1, released March 11 2026) includes several features you need to understand before deploying:
- FP8 inference support: significant throughput improvement on H100 and Blackwell GPUs; enable with a single flag
- Continuous batching: the default now; dynamically groups incoming requests for maximum GPU utilization without your intervention
- Streaming support: built-in server-sent events for real-time token streaming; important for chat applications and voice AI
- Multi-modal support: images + text for models like LLaVA, Qwen2.5-VL, and Llama 4 (Scout 17B-16E, Maverick 17B-128E, both are natively multi-modal MoE models)
- Structured outputs: JSON schema enforcement via guided decoding; critical for agent/tool-calling workloads
- Speculative decoding: faster generation using a small draft model; effective for specific model pairs where latency matters
- FlashAttention 4 backend: new in v0.17.0; the default attention backend on Blackwell (SM100+) GPUs (B200, RTX 50 series); improves prefill throughput and speculative decode performance with no configuration required. On Hopper GPUs (H100, H200), FlashAttention 3 remains the default
- `--performance-mode` flag: new in v0.17.0; choose from `balanced`, `interactivity`, or `throughput` to pre-tune vLLM for your workload type without hand-tuning individual flags
For the full changelog, check vLLM's releases page. What's in this guide is what you actually need to configure for production: not every feature, just the ones that matter.
Choosing Your GPU Configuration
Your GPU choice depends on model size and whether you prioritize throughput or cost. Here's the practical decision table:
| Model Size | Recommended GPU | Parallelism Strategy | On-Demand Price | Spot Price |
|---|---|---|---|---|
| 7B–13B | 1x RTX 4090 (24GB) | None: single GPU | $0.57/hr | N/A |
| 7B–13B | 1x RTX 5090 (32GB) | None: single GPU | $0.76/hr | N/A |
| 13B–30B | 1x L40S (48GB) | None: single GPU | $0.72/hr | N/A |
| 30B–70B (FP8/Q4) | 1x RTX PRO 6000 (96GB) | None: single GPU | $1.65/hr | $0.72/hr |
| 30B–40B (FP16) | 1x A100 80GB (SXM4) | None: single GPU | $1.14/hr | N/A |
| 70B (FP8) | 1x H100 SXM5 80GB | None: just fits (tight ~70GB) | $2.50/hr | N/A |
| 70B (FP16) | 2x H100 SXM5 80GB | Tensor parallel (2) | $4.89/hr | N/A |
| 70B (FP16) | 4x H100 SXM5 80GB | Tensor parallel (4) | $9.67/hr | N/A |
| 100B+ | 8x H100 SXM5 | Tensor + pipeline parallel | $19.22/hr+ | varies |
Note on A100 and FP8: The A100 does not have hardware FP8 Tensor Cores. Running `--dtype fp8` on A100 will either error or silently fall back to FP16. In FP16, each parameter takes 2 bytes, so an 80GB A100 can hold at most ~40B parameters (80GB ÷ 2 bytes), with some headroom needed for KV cache. For 70B models, use H100 with FP8 (fits on one 80GB card at ~70GB) or 2× H100 with FP16.
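The capacity arithmetic above generalizes to any model. A minimal sketch (the `weights_gb` helper is ours, not a vLLM tool) for back-of-envelope weight sizing:

```shell
# Weight VRAM in GB ~= parameters (billions) x bytes per parameter.
# KV cache and activations come on top, so leave 10-25% headroom beyond this.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b }'
}
weights_gb 40 2   # 40B, FP16 (2 bytes) -> 80: fills an A100 80GB with no KV cache room
weights_gb 70 2   # 70B, FP16 -> 140: needs 2x H100
weights_gb 70 1   # 70B, FP8 (1 byte) -> 70: fits one H100 80GB, tightly
```

This is why the table above jumps from one GPU to two at 70B FP16: the weights alone exceed any single card.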
GPU pricing fluctuates over time based on availability and demand. Prices above are live on-demand (dedicated) and spot rates fetched from Spheron's GPU catalog as of 12 Mar 2026. Spot instances offer significant savings where available, but may not always be in stock. Your actual cost may differ - always check current GPU pricing before committing to a configuration.
For smaller models (7B-13B), see our RTX 5090 rental guide for benchmark data on common inference workloads. For the RTX PRO 6000 as a single-GPU 30B–70B option (96GB GDDR7), see our RTX PRO 6000 guide.
Setting Up Your Spheron Instance
Step 1: Launch the Instance
Select your GPU, region, and provider from Spheron's GPU catalog. For a 70B FP16 model with tensor parallelism, select a 2x H100 configuration. For FP8, a single H100 is sufficient.
Step 2: Verify GPU Access
```shell
nvidia-smi
# Should show all GPUs with expected VRAM
# For a 2x H100 instance, you'll see two rows with 80GB each
```
Step 3: Install Docker with NVIDIA Support
Most Spheron GPU instances come with the NVIDIA Container Toolkit pre-installed. If not:
```shell
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
Step 4: Verify Docker GPU Access
```shell
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
If this outputs the same nvidia-smi table as the host, you're ready. If it errors, the container toolkit configuration didn't take: restart Docker and try again.
Single-GPU Deployment: The Starting Point
Start here even if you plan to run multi-GPU. Validate that the model loads, the API works, and your system is healthy before scaling.
```shell
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --host 0.0.0.0 \
  --port 8000
```
Every flag here is deliberate:
- `--gpus all`: expose all GPUs to the container; required for CUDA to work
- `--ipc=host`: use the host's shared memory namespace; do not skip this: vLLM uses shared memory for inter-process communication, and without this flag you'll hit cryptic CUDA errors under load
- `--dtype float16`: FP16 precision; switch to `fp8` on H100/Blackwell for ~2x throughput (covered below)
- `--max-model-len 8192`: maximum context length; lower values mean less KV cache VRAM, which means more room for concurrent requests
- `--gpu-memory-utilization 0.90`: reserve 90% of GPU VRAM for the model and KV cache; keep 10% headroom to avoid OOM on unexpected spikes
- `--max-num-seqs 256`: maximum concurrent sequences; increase for high-throughput workloads if VRAM allows
Test the deployment immediately after it starts:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, are you online?"}],
    "max_tokens": 50
  }'
```
You should get a JSON response with a choices array. If you get a connection error, the model is still loading: vLLM downloads and loads weights before accepting requests, which can take 5–15 minutes for large models on first run.
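Rather than retrying chat requests by hand during that window, poll vLLM's `/health` liveness route. A sketch (the `wait_for_vllm` helper and its defaults are ours):

```shell
# Poll /health until the server answers HTTP 200, then it's safe to send traffic.
# args: base_url [tries] [interval_seconds]
wait_for_vllm() {
  url=$1; tries=${2:-90}; interval=${3:-10}
  for _ in $(seq "$tries"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url/health" || true)
    if [ "$code" = "200" ]; then echo "ready"; return 0; fi
    sleep "$interval"
  done
  echo "timed out"
  return 1
}
# Usage: wait_for_vllm http://localhost:8000 && echo "start sending requests"
```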
Multi-GPU Tensor Parallelism
When a model doesn't fit on one GPU, or when you need lower time-to-first-token (TTFT) by distributing compute, use tensor parallelism.
```shell
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 64
```
What `--tensor-parallel-size 2` does: splits every transformer layer across 2 GPUs. Each GPU processes half the attention heads and half the MLP feed-forward computation. The results are synchronized at the end of each layer via an all-reduce operation across the GPUs.
The performance of tensor parallelism depends heavily on the interconnect between GPUs:
- NVLink (SXM H100/H200): ~900 GB/s bidirectional bandwidth; all-reduce is fast; tensor parallelism scales well
- PCIe (A100 PCIe, non-NVLink setups): ~64 GB/s bidirectional via PCIe 4.0 x16; all-reduce is slower; each layer boundary adds latency
If you're running SXM H100s with NVLink, tensor parallelism at 2x or 4x is very efficient. On PCIe-only multi-GPU setups, expect more overhead between layers: still worth it for models that don't fit, but don't expect linear throughput scaling.
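You can check which interconnect you actually have with `nvidia-smi topo -m`. A small parser sketch (the `link_type` helper is ours) that classifies the GPU0–GPU1 link from that matrix:

```shell
# nvidia-smi topo -m prints a matrix; on the GPU0 row, the GPU1 column reads
# NV# for NVLink, or PIX/PXB/PHB/NODE/SYS for PCIe and system paths.
link_type() {
  awk '$1 == "GPU0" && $2 == "X" { if ($3 ~ /^NV/) print "nvlink"; else print "pcie"; exit }'
}
# Usage on a live system:
#   nvidia-smi topo -m | link_type
```

If this prints `pcie`, temper your tensor-parallel throughput expectations as described above.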
When tensor parallelism helps:
- Model is too large for a single GPU (70B FP16 needs ~140GB, beyond any single GPU)
- TTFT is too high: spreading prefill across more GPUs reduces time to first token
- You have NVLink-connected GPUs and want to push throughput higher
When it doesn't help:
- Running a 7B model across 4 GPUs: the communication overhead outweighs the benefit; just run 4 separate single-GPU instances instead
- PCIe-only systems with a model that barely fits on one GPU: the communication cost may not be worth the marginal VRAM headroom
For detailed GPU interconnect benchmarks and when multi-GPU actually pays off, see our production GPU cloud architecture guide.
FP8 Quantization: 2x Throughput on H100 and Blackwell
FP8 is the single most impactful configuration change available on H100 and Blackwell GPUs. It requires no quantization scripts, no model modifications, and no additional setup: just one flag change:
```shell
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 128
```
What you gain with FP8:
- ~1.5–2x throughput improvement vs FP16 on H100: the H100's FP8 Tensor Cores run at twice the FLOP rate of FP16 Tensor Cores
- ~50% VRAM reduction: a 70B model in FP8 uses roughly 70GB vs 140GB in FP16 - it fits on a single H100 80GB, though it is tight; tune `--gpu-memory-utilization` (0.92+) and `--max-model-len` carefully to avoid OOM
- Marginal quality loss: typically less than 1-2% on standard benchmarks; acceptable for production inference on most tasks
Model compatibility: Most major open-source models (Llama 3.x, Mistral, Qwen 2.5, Phi-4) have been validated with vLLM FP8. For models without pre-quantized FP8 weights available, vLLM performs dynamic quantization on the fly using the original weights. Verify FP8 compatibility in vLLM's supported models documentation.
FP8 requires hardware support: H100, H200, NVIDIA Ada Lovelace GPUs (RTX 4090, L40S), and NVIDIA Blackwell GPUs (B200, B100, and consumer RTX 50 series including the RTX 5090) have dedicated FP8 Tensor Cores. On A100 or older Ampere hardware, --dtype fp8 will either fail or fall back to FP16 automatically: check your vLLM logs to confirm which mode is active.
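A quick preflight sketch (the `fp8_capable` helper is ours; `compute_cap` is a query field in recent nvidia-smi drivers) to confirm FP8 hardware before flipping the flag:

```shell
# FP8 Tensor Cores ship with compute capability 8.9 (Ada Lovelace) and above:
# 9.0 (Hopper) and Blackwell qualify; 8.0 (A100) and 8.6 (RTX 30 series) do not.
fp8_capable() {
  awk -v cc="$1" 'BEGIN { exit !(cc + 0 >= 8.9) }'
}
# On a live system:
#   cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1)
#   if fp8_capable "$cc"; then echo "use --dtype fp8"; else echo "stay on float16"; fi
```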
Production Configuration for High Throughput
Once the model is running, tune these parameters for production workloads handling hundreds of concurrent requests:
```shell
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 65536 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8
```
What each production setting does:
- `--max-num-seqs 512`: increase from the default 256 when you have many concurrent users and your VRAM can support it; monitor `vllm:kv_cache_usage_perc` to see if you're running out of KV cache space
- `--max-num-batched-tokens 65536`: maximum tokens processed per forward pass iteration; increase for throughput-optimized workloads, decrease if you see out-of-memory errors under bursty load
- `--enable-chunked-prefill`: breaks long prefill sequences into smaller chunks and interleaves them with ongoing decode steps; reduces latency spikes when your traffic mix includes both long prompts and short responses
- `--kv-cache-dtype fp8`: stores the KV cache in FP8 format; saves ~50% VRAM on cached activations (FP8 uses 1 byte vs 2 bytes per element in FP16), allowing more concurrent requests with the same GPU; quality impact is minimal for most workloads
These settings are not universal: tune --max-num-seqs and --max-num-batched-tokens based on your actual traffic patterns. A workload with many short requests benefits from higher --max-num-seqs. A batch inference workload with long documents needs higher --max-num-batched-tokens. Start with these values and adjust based on the metrics in the monitoring section.
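To reason about these limits concretely, estimate the per-sequence KV cache footprint from the model architecture. A sketch (the `kv_gb` helper is ours; the Llama 3.3 70B figures - 80 layers, 8 KV heads via GQA, head dimension 128 - are the model's published architecture):

```shell
# Per-token KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x bytes/elem.
# Multiply by context length for the worst-case footprint of one sequence.
kv_gb() {  # args: layers kv_heads head_dim bytes_per_elem tokens
  awk -v L="$1" -v H="$2" -v D="$3" -v B="$4" -v T="$5" \
    'BEGIN { printf "%.1f\n", 2 * L * H * D * B * T / 1e9 }'
}
kv_gb 80 8 128 2 16384   # Llama 3.3 70B, FP16 cache, 16K context -> 5.4 GB per sequence
kv_gb 80 8 128 1 16384   # same with --kv-cache-dtype fp8       -> 2.7 GB per sequence
```

At over 5GB per worst-case sequence, even 160GB of HBM sustains only a few dozen full-length requests, which is why lowering `--max-model-len` or the KV cache dtype buys so much concurrency.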
Shortcut: `--performance-mode` (new in v0.17.0): If you don't want to tune individual flags, use `--performance-mode throughput` for batch workloads or `--performance-mode interactivity` for chat/real-time applications. The `balanced` mode (default) is a reasonable starting point for mixed traffic. This flag configures a curated set of defaults for each scenario; you can still override individual flags on top of it.
Load Balancing Multiple vLLM Instances
Tensor parallelism makes one vLLM instance use more GPUs. Load balancing makes multiple vLLM instances handle more total traffic. Use both when your traffic exceeds what one instance can handle.
The simplest horizontal scaling approach: run separate vLLM instances on separate GPU devices, then load balance across them with NGINX.
```shell
# Instance 1: pinned to GPU 0, listening on port 8000
docker run --gpus '"device=0"' --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --max-num-seqs 256

# Instance 2: pinned to GPU 1, listening on port 8001
docker run --gpus '"device=1"' --ipc=host -p 8001:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --max-num-seqs 256
```
Note: This example uses `--dtype float16` for broad hardware compatibility: it works on RTX 4090, RTX 5090, A100, and all other CUDA GPUs. If you are running on H100, H200, NVIDIA Blackwell (B200), RTX 5090, or RTX 4090 (Ada Lovelace), you can replace `float16` with `fp8` to leverage hardware FP8 Tensor Cores for ~2x throughput. FP8 Tensor Core support requires Ada Lovelace (SM89, RTX 40 series) or newer; do not use `--dtype fp8` on A100, RTX 3090, or other Ampere (SM80/SM86) and older hardware, as those GPUs lack FP8 Tensor Cores and will fall back to FP16 silently or error. The RTX 5090 (Blackwell GB202) includes FP8 Tensor Cores and is fully supported in vLLM v0.17.0+, which ships a dedicated SM120 FP8 GEMM optimization for higher FP8 throughput on Blackwell consumer GPUs.
NGINX configuration for load balancing across both instances:
```nginx
upstream vllm_backend {
    least_conn;
    server localhost:8000;
    server localhost:8001;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_connect_timeout 10s;
        proxy_buffering off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Use `least_conn` (least connections), not round-robin. vLLM requests vary dramatically in duration: a short request completes in 100ms, a long generation can take 30 seconds. Least-connections routing sends new requests to whichever backend currently has fewer active connections, naturally load-balancing based on actual utilization rather than request count.
Install NGINX and load the configuration:
```shell
sudo apt-get install -y nginx
sudo cp vllm-nginx.conf /etc/nginx/sites-available/vllm
sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```
For multi-node deployments (separate physical servers), replace localhost:8000 and localhost:8001 with the actual IP addresses of each node. The rest of the NGINX configuration remains the same.
Monitoring in Production
Running vLLM in production without monitoring is running blind. Two systems to set up from day one:
GPU-Level Monitoring
```shell
# Real-time GPU stats: VRAM usage, utilization, temperature
watch -n 2 nvidia-smi

# Persistent logging to file
nvidia-smi dmon -s pum -d 10 >> gpu-metrics.log &
```
For production, see our GPU monitoring guide for setting up DCGM with Prometheus and Grafana dashboards: nvidia-smi polling works for development but doesn't scale to multi-GPU production systems.
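For a lightweight alert in the meantime, you can parse nvidia-smi's CSV output. A sketch (the `vram_alert` helper is ours):

```shell
# Reads "used, total" MiB pairs (one line per GPU) and flags any GPU over 90% full.
vram_alert() {
  awk -F', ' '$2 > 0 && $1 / $2 > 0.90 { print "GPU " NR-1 ": VRAM above 90%" }'
}
# Usage (cron-able):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_alert
```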
vLLM Metrics Endpoint
vLLM exposes a Prometheus-compatible metrics endpoint at /metrics:
```shell
curl http://localhost:8000/metrics | grep vllm
```
The metrics you need to watch:
| Metric | What it Tells You | Action if High |
|---|---|---|
| `vllm:num_requests_running` | Active requests being processed | Normal: reflects traffic |
| `vllm:num_requests_waiting` | Requests queued, waiting for a free slot | Scale out or increase `--max-num-seqs` |
| `vllm:kv_cache_usage_perc` | KV cache fill percentage | If >95%, reduce `--max-num-seqs` or `--max-model-len` |
| `vllm:time_to_first_token_seconds` | Latency before first token is generated | Your primary user-facing latency metric |
| `vllm:e2e_request_latency_seconds` | Total request duration | Track p95 and p99, not just mean |
A rising num_requests_waiting with high kv_cache_usage_perc means you're KV-cache-bound: the bottleneck is memory, not compute. Reduce --max-model-len or add --kv-cache-dtype fp8 to free up space. A rising num_requests_waiting with low kv_cache_usage_perc means you're compute-bound: add more GPUs or instances.
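A small extraction sketch for exactly those two gauges (metric names as in the table above; the `vllm_queue_watch` wrapper is ours):

```shell
# Print queue depth and KV cache fill from Prometheus text format on stdin.
vllm_queue_watch() {
  awk '/^vllm:(num_requests_waiting|kv_cache_usage_perc)/ { print $1, $2 }'
}
# Usage: curl -s http://localhost:8000/metrics | vllm_queue_watch
```

Reading both together tells you in one glance whether a growing queue is a memory problem or a compute problem.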
Common Issues and Fixes
OOM (Out of Memory) Error on Startup
```
CUDA out of memory. Tried to allocate X GiB
```
Diagnosis steps in order:
- Reduce `--gpu-memory-utilization` from 0.90 to 0.85: give the GPU more headroom
- Reduce `--max-model-len`: every 1024 tokens of context length requires additional KV cache VRAM
- Switch from `--dtype float16` to `--dtype fp8`: cuts model VRAM by ~50% on H100
- Increase `--tensor-parallel-size` to spread the model across more GPUs
Slow TTFT (Time to First Token)
If your time-to-first-token is consistently above 2-3 seconds for normal-length prompts:
- Enable tensor parallelism: spreading prefill computation across 2-4 GPUs cuts TTFT proportionally
- Enable chunked prefill (`--enable-chunked-prefill`): interleaves prefill with ongoing decode, preventing long prefill jobs from blocking short ones
- Reduce `--max-model-len`: if you don't need 128K context, set it to 8K or 16K; shorter configured max length = smaller KV cache allocation per sequence = more room for batching
CUDA Error: Device-Side Assert Triggered
```
RuntimeError: CUDA error: device-side assert triggered
```
This is almost always a tokenizer mismatch. Causes:
- Input text that tokenizes to more tokens than `--max-model-len`
- Using the wrong model name in the API request (model field must match the model vLLM loaded)
- Special tokens or unicode sequences the tokenizer doesn't handle cleanly
Fix: verify that your input length (in tokens, not characters) is below `--max-model-len`. Use the tokenizer's encode method to check before sending.
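A client-side guard sketch (the `fits_context` helper is ours; recent vLLM OpenAI servers also expose a `/tokenize` route for server-side counting, but verify its response shape against your deployed version):

```shell
# Reject a request whose prompt tokens plus generation budget exceed --max-model-len.
fits_context() {  # args: prompt_tokens max_model_len max_new_tokens
  [ $(( $1 + $3 )) -le "$2" ]
}
fits_context 7900 8192 512 && echo "ok" || echo "too long"   # 8412 > 8192 -> too long
# Getting prompt_tokens server-side (hedged: field names vary by vLLM version):
#   curl -s http://localhost:8000/tokenize -H 'Content-Type: application/json' \
#     -d '{"model": "<model>", "prompt": "<text>"}'
```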
Low GPU Utilization (Consistently Below 70%)
If nvidia-smi shows GPU utilization below 70% but requests are queuing:
- Increase `--max-num-seqs`: you're not saturating the GPU with enough concurrent sequences
- Increase `--max-num-batched-tokens`: each forward pass is processing too few tokens
- Check your client: if you're sending requests one at a time and waiting for each to complete, you're not actually generating concurrent load; vLLM's continuous batching requires concurrent requests to be effective
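To verify concurrency is actually reaching the server, fire a parallel burst rather than a sequential loop. A sketch (the `load_burst` helper is ours; the model name matches the single-GPU example above):

```shell
# Launch n requests in parallel (backgrounded), then wait for all of them.
# A sequential curl loop can never exercise vLLM's continuous batching.
load_burst() {  # args: base_url n_requests
  url=$1; n=$2
  for _ in $(seq "$n"); do
    curl -s -o /dev/null "$url/v1/completions" \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Hi", "max_tokens": 32}' &
  done
  wait
}
# Usage: load_burst http://localhost:8000 64, then check vllm:num_requests_running
```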
Model Download Fails at Startup
```
OSError: Hugging Face Hub is not reachable
```
- Verify your `HUGGING_FACE_HUB_TOKEN` environment variable is set correctly
- Check that the model name matches exactly (case-sensitive) the Hugging Face repo path
- Pre-download the model with `huggingface-cli download model-name` and mount the local directory instead, to decouple model download from container startup
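The pre-download-and-mount pattern in full, as a sketch (paths follow the Hugging Face default cache layout; adjust for your setup):

```shell
# 1) Download weights on the host, outside the container lifecycle
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct

# 2) Mount the host cache into the container so startup never hits the network
docker run --gpus all --ipc=host -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16
```

This also makes container restarts fast, since the weights are already on local disk.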
For more context on overall GPU infrastructure architecture for production ML, see our production GPU cloud architecture guide and how to fine-tune LLMs in 2026 for the training side of the stack.
Deploy vLLM on Spheron's bare-metal H100s, RTX 5090s, or B200s: full CUDA access, no virtualization overhead. Your models run at native GPU performance.
