Most LLM deployments fail at the same choke points: developers get a model working locally, spin up a cloud instance, and then spend weeks firefighting latency spikes, OOM crashes, and unbounded GPU bills. This guide maps the full path from first ollama run to a load-balanced, monitored, auto-recovering production service.
TL;DR
| Phase | What You Do | Tool | Cost Ballpark |
|---|---|---|---|
| Prototype | Run model locally | Ollama | $0 (local GPU/CPU) |
| Validate | Cloud GPU, real traffic test | vLLM + Spheron | ~$2/hr (H100 on-demand) |
| Optimize | Benchmark engines, tune flags | vLLM, llama.cpp | Same instance, no extra cost |
| Production | Systemd, health checks, monitoring | systemd + Prometheus | +$0 tooling on existing instance |
| Scale | Multi-instance, spot, sharding | Nginx LB + spot GPUs | 60-70% cost reduction vs on-demand |
Phase 1: Prototype Locally with Ollama
Before spending on cloud compute, validate that your chosen model actually meets your quality requirements. This phase costs nothing if you have a local GPU or CPU with enough RAM.
For a deeper look at running models locally, see our guide on running LLMs locally with Ollama.
Pick Your Model
Model size determines VRAM requirements and inference speed. Here is the practical decision table. If you are targeting newer large models like Llama 4, see deploying Llama 4 on GPU cloud for model-specific setup.
| Model Size | VRAM Required (FP16) | VRAM Required (Q4) | Best For |
|---|---|---|---|
| 7B | ~14GB | ~5GB | Fast iteration, cost-sensitive API |
| 13B | ~26GB | ~8GB | Better quality, still single-GPU |
| 30B | ~60GB | ~18GB | High-quality outputs, multi-GPU or Q4 |
| 70B | ~140GB | ~35-40GB | Near-GPT-4 quality, needs H100 or multi-GPU |
If you are targeting a 7B or 13B model, a local machine with a 24GB GPU (RTX 3090 or 4090) handles it in FP16. For 30B+ you are looking at quantization locally, or moving to cloud for unquantized serving.
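The VRAM columns above come from a simple rule of thumb: parameter count times bytes per weight. A minimal sketch of that math (weights only — KV cache and runtime overhead add more on top, which is why the Q4 column runs a little above the raw numbers):

```python
# Weights-only VRAM estimate: parameter count x bytes per weight.
# Real usage is higher once the KV cache and CUDA overhead are added.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "q4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Approximate GB of VRAM needed just to hold the weights."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_vram_gb(7, "fp16"))   # 14.0 -> the ~14GB in the table
print(weight_vram_gb(70, "fp16"))  # 140.0
print(weight_vram_gb(70, "q4"))    # 35.0 -> low end of the ~35-40GB range
```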
Run It with Ollama
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model (e.g. Llama 3.1 8B)
ollama pull llama3.1:8b
# Run interactively
ollama run llama3.1:8b
# Or test via API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Explain how KV cache works in transformer inference.",
"stream": false
}'

What to Test in Phase 1
Work through this checklist before moving on:
- Output quality: Does the model answer your specific domain questions correctly? Run 20-30 prompts representative of your actual workload.
- Latency on your hardware: Note time-to-first-token (TTFT) and tokens/second. This is your baseline before GPU cloud.
- Context window limits: Test with your longest expected prompts. Does quality degrade at 4K tokens? 8K? 16K?
- Model behavior with your prompts: Does it follow your system prompt? Does it hallucinate on your domain?
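To put numbers on the latency item, a small script against Ollama's streaming API can record TTFT and decode speed. A sketch, assuming Ollama's default port and the llama3.1:8b tag from above; `summarize` and `measure` are hypothetical helper names:

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port

def summarize(first_chunk_at, start, chunk_times):
    """Pure helper: TTFT and decode tokens/sec from chunk arrival timestamps."""
    ttft = first_chunk_at - start
    decode_window = chunk_times[-1] - first_chunk_at
    tps = (len(chunk_times) - 1) / decode_window if decode_window > 0 else 0.0
    return ttft, tps

def measure(prompt, model="llama3.1:8b"):
    """Stream a generation and time each chunk (requires a running Ollama)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    chunk_times = []
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # Ollama streams one JSON object per line
            if not line.strip():
                continue
            chunk = json.loads(line)
            chunk_times.append(time.monotonic())
            if chunk.get("done"):
                break
    return summarize(chunk_times[0], start, chunk_times)
```

Run `measure(...)` with a handful of representative prompts and keep the numbers — they are the baseline you will compare against the GPU cloud in Phase 2.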
When to Move On
Move to Phase 2 when any of these are true:
- Your p50 TTFT exceeds 1 second under single-user load.
- You need to handle more than 5 concurrent users.
- Your target model requires more VRAM than your local GPU has.
- You need 24/7 availability without tying up your laptop.
Phase 2: Validate on a Cloud GPU with vLLM
Phase 2 is about getting a realistic read on throughput and cost before committing to production architecture. Pick a GPU, deploy vLLM, and run a real load test.
For a full breakdown of inference GPUs by cost-per-token, see the AI inference GPU comparison.
Choose Your GPU
Match your model size to the right GPU. Overpaying for a bigger GPU than you need is the most common Phase 2 mistake.
| Model Size | Recommended GPU | VRAM | On-Demand Price | Spot Price |
|---|---|---|---|---|
| 7B (FP16) | L40S | 48GB | $1.80/hr | ~$0.30-0.35/hr |
| 13B (FP16) | L40S | 48GB | $1.80/hr | ~$0.30-0.35/hr |
| 30B (FP16) | A100 SXM4 80GB | 80GB | $1.05/hr | ~$0.40-0.55/hr |
| 70B (FP8) | H100 SXM5 80GB | 80GB | $2.40/hr | ~$0.80-1.00/hr |
| 70B (FP16) | 2x H100 SXM5 80GB | 160GB | $4.80/hr | ~$1.60-2.00/hr |
Prices as of 22 Mar 2026 on Spheron GPU rental; pricing fluctuates based on GPU availability. Check current GPU pricing → for live rates.
Spin Up a Spheron Instance
- Log in to app.spheron.ai and select your GPU model from the catalog. See the Spheron GPU pricing for a full list of available GPU instances and regions.
- Choose your region and provider, then launch the instance.
- SSH into the instance and verify the GPU is visible:
nvidia-smi

You should see your GPU with the expected VRAM (e.g., 80034MiB for an H100 80GB). If you provisioned multiple GPUs, all should appear.
- Confirm Docker and the NVIDIA Container Toolkit are installed:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Deploy with vLLM
For a complete guide on replacing OpenAI API calls with your own vLLM server, see self-hosted OpenAI-compatible API with vLLM.
Run the vLLM OpenAI-compatible server. This single command covers the common case for a 7B model on an L40S:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--max-num-seqs 256

Key flags:
- --ipc=host: required for shared memory between GPU processes. Skipping it causes CUDA errors under load.
- --gpu-memory-utilization 0.90: leaves 10% headroom for CUDA overhead. Go to 0.92-0.95 if you need more KV cache space.
- --max-num-seqs 256: maximum concurrent sequences in a batch. Raise this for high-throughput workloads.
- --max-model-len 8192: context window limit. Lower this to reduce KV cache memory pressure if you do not need long contexts.
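The KV cache math behind these flags is worth sketching. Assuming Llama 3.1 8B's architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache):

```python
# Per-token KV cache cost: one K and one V tensor across every layer.
def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K + V

per_token = kv_cache_bytes_per_token()   # 131072 bytes = 128 KiB per token
per_seq_gib = per_token * 8192 / 2**30   # a full 8K-token sequence: 1 GiB
```

So at --max-model-len 8192, every sequence that fills its context can consume about 1 GiB of KV cache, which is why lowering the limit frees so much room for concurrency.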
For a 70B model in FP8 on H100:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
--gpu-memory-utilization 0.92 \
--max-model-len 4096 \
--max-num-seqs 64

Test the deployment:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello, are you running?"}]
}'

Load Test with Real Traffic
Single-request latency does not predict production performance. Run a load test to measure throughput under concurrency:
# Install locust
pip install locust
# locustfile.py
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def chat(self):
        self.client.post("/v1/chat/completions", json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "Write a two-sentence summary of transformer architecture."}],
            "max_tokens": 100
        }, timeout=120)

# Run with 50 concurrent users
locust -f locustfile.py --headless -u 50 -r 5 --host http://localhost:8000 --run-time 60s

Capture these numbers before moving to Phase 3:
- Throughput: requests/second and tokens/second at steady state
- p50 and p95 TTFT: time-to-first-token at median and 95th percentile
- p95 end-to-end latency: total request time including generation
- Error rate: any 429 (queue full) or 500 errors at your target concurrency
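If you record raw per-request samples yourself rather than reading locust's summary, the percentiles are easy to compute. A minimal nearest-rank sketch (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# TTFT samples in milliseconds from a load-test run
ttft_ms = [210, 180, 950, 240, 200, 1900, 230, 220, 260, 250]
p50 = percentile(ttft_ms, 50)  # 230
p95 = percentile(ttft_ms, 95)  # 1900
```

Note how a couple of slow outliers leave the p50 untouched but dominate the p95 — which is exactly why the pass criteria below are stated in p95 terms.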
Pass criteria: if your p95 TTFT is under your SLA at your target concurrency, you have the right GPU. If not, either increase --max-num-seqs, upgrade the GPU, or add a second instance.
Phase 2 Cost Snapshot
| Configuration | Tokens/sec | On-Demand $/hr | Cost per 1M Tokens |
|---|---|---|---|
| 7B on L40S | 2,500-4,000 | $1.80 | ~$0.13-0.20 |
| 30B on A100 80GB | 800-1,500 | $1.05 | ~$0.19-0.37 |
| 70B (FP8) on H100 | 400-700 | $2.40 | ~$0.95-1.67 |
| 70B (FP16) on 2x H100 | 600-1,000 | $4.80 | ~$1.33-2.22 |
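The cost-per-token column follows directly from the hourly rate and measured throughput; a quick sketch you can reuse with your own load-test numbers:

```python
# Hourly rate divided by tokens generated per hour, scaled to a million tokens
def cost_per_million_tokens(usd_per_hour, tokens_per_sec):
    return usd_per_hour / (tokens_per_sec * 3600) * 1_000_000

# 7B on L40S at both ends of its throughput range
print(round(cost_per_million_tokens(1.80, 4000), 3))  # 0.125 -> the ~$0.13 low end
print(round(cost_per_million_tokens(1.80, 2500), 2))  # 0.2   -> the ~$0.20 high end
```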
Phase 3: Optimize Your Inference Engine
Phase 2 confirmed your GPU size and baseline throughput. Phase 3 is about getting more out of that hardware before you harden it for production.
For a detailed comparison of inference frameworks including benchmark numbers, see the full vLLM production deployment guide.
vLLM vs llama.cpp vs TGI: A Practical Comparison
| Factor | vLLM | llama.cpp | TGI (Text Generation Inference) |
|---|---|---|---|
| Throughput (concurrent) | Highest | Medium | High |
| p50 latency | Low | Low | Low |
| Multi-GPU support | Yes (tensor + pipeline) | Limited | Yes |
| Quantization | FP8, INT4, GPTQ, AWQ | GGUF (Q2-Q8) | GPTQ, AWQ, FP8 |
| Ops complexity | Medium | Low | Medium |
| Best for | Production batch/API serving | Single-user, CPU inference | Hugging Face-native stack |
The right answer for most production deployments is vLLM. llama.cpp is worth considering for CPU-only or very low-concurrency use cases where its GGUF quantization formats are a better fit. TGI is a viable alternative if you are already on the Hugging Face stack and want tighter integration with their ecosystem.
Tuning vLLM for Your Workload
Start with the defaults, then tune based on your Phase 2 load test results:
# High-throughput batch workload
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 512 \
--max-model-len 4096 \
--performance-mode throughput
# Low-latency interactive workload
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 64 \
--max-model-len 8192 \
--performance-mode interactivity

Key flags to tune:
- --gpu-memory-utilization: higher value means more KV cache space and a higher concurrency ceiling, at the risk of OOM. Tune in 0.02 increments.
- --max-num-seqs: directly caps concurrent sequences. Set to 2-3x your expected peak concurrency.
- --quantization fp8: use on H100 and Blackwell GPUs. Gives ~1.5-2x throughput improvement. Not available on older GPU architectures.
- --performance-mode: new in vLLM v0.17.0. throughput favors batching, interactivity favors TTFT, balanced is the default.
- --kv-cache-dtype fp8: also stores the KV cache in FP8 on H100. Saves additional VRAM, enabling longer context windows or more concurrent sequences.
Quantization Trade-offs
| Precision | VRAM for 70B | Throughput (relative) | Quality Impact |
|---|---|---|---|
| FP16 | ~140GB | 1x baseline | No loss |
| FP8 | ~70GB | 1.5-2x | Less than 1-2% on benchmarks |
| INT4 (AWQ) | ~40GB | 1.2-1.5x | 2-5% on benchmarks, varies by model |
| GGUF Q4_K_M | ~38GB | 0.6-1x | 3-6% on benchmarks, CPU-friendly |
FP8 is the practical choice on H100 and Blackwell. The throughput gain is real and the quality loss is marginal for most use cases. INT4 is worth considering if you need a 70B model on hardware with less than 70GB VRAM (e.g., A100 80GB with tight fit).
Benchmark Your Setup
After tuning, capture baseline numbers you can compare against later:
# GPU-level metrics (run in a separate terminal while vLLM is under load)
nvidia-smi dmon -s pucvmet -d 5
# vLLM internal metrics (raw Prometheus format)
curl http://localhost:8000/metrics | grep -E "num_requests_waiting|gpu_cache_usage|time_to_first_token"

Numbers to record before Phase 4:
- GPU compute utilization at peak load
- GPU memory bandwidth utilization at peak load
- vllm:gpu_cache_usage_perc at peak load
- vllm:time_to_first_token_seconds p50 and p95
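If you want to script against those numbers rather than eyeball grep output, the Prometheus text format is simple to parse. A minimal sketch (the sample payload is illustrative; a real /metrics response carries many more series):

```python
# Pull named gauges out of raw Prometheus text without a full Prometheus stack.
# Assumes label values contain no spaces; keeps the last sample per metric name.
def parse_metrics(text, wanted):
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue                       # skip HELP/TYPE comments and blanks
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # drop the {label="..."} block
        if name in wanted:
            out[name] = float(value)
    return out

sample = """# HELP vllm:num_requests_waiting queued requests
vllm:num_requests_waiting{model_name="llama"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama"} 0.42"""

metrics = parse_metrics(
    sample, {"vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc"}
)
```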
Phase 4: Harden for Production
You have a working, tuned vLLM deployment. Phase 4 is about making it survive a long weekend without manual intervention.
For architecture patterns that apply to any GPU production workload, see our production GPU cloud architecture patterns guide.
Systemd Service Unit
Wrap the Docker container in a systemd service so it restarts automatically on crashes, reboots, and OOM kills:
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI-compatible inference server
After=docker.service
Requires=docker.service
[Service]
Type=simple
Restart=always
RestartSec=10
ExecStartPre=-/usr/bin/docker stop vllm-server
ExecStartPre=-/usr/bin/docker rm vllm-server
ExecStart=/usr/bin/docker run \
--name vllm-server \
--gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 256 \
--max-model-len 8192
ExecStop=/usr/bin/docker stop vllm-server
[Install]
WantedBy=multi-user.target

Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo systemctl status vllm

Health Checks
vLLM exposes a /health endpoint. Poll it and trigger a service restart if it stops responding:
#!/bin/bash
# /usr/local/bin/vllm-healthcheck.sh

# Skip restart if the service has been active for less than 15 minutes (model loading grace period)
active_since=$(systemctl show vllm --property=ActiveEnterTimestamp --value 2>/dev/null)
if [ -n "$active_since" ]; then
    start_epoch=$(date -d "$active_since" +%s 2>/dev/null); [ -z "$start_epoch" ] && exit 0
    now_epoch=$(date +%s)
    uptime_seconds=$((now_epoch - start_epoch))
    if [ "$uptime_seconds" -lt 900 ]; then
        echo "vLLM has been up for ${uptime_seconds}s, within grace period, skipping health check"
        exit 0
    fi
fi

# Also skip if service is still in activating state (i.e., loading)
service_state=$(systemctl is-active vllm 2>/dev/null)
if [ "$service_state" = "activating" ]; then
    echo "vLLM is still activating, skipping health check"
    exit 0
fi

response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 --connect-timeout 5 http://localhost:8000/health)
if [ "$response" != "200" ]; then
    echo "vLLM health check failed (HTTP $response), restarting service"
    systemctl restart vllm
fi

Add a cron job to run this every minute:
chmod +x /usr/local/bin/vllm-healthcheck.sh
echo "* * * * * root /usr/local/bin/vllm-healthcheck.sh >> /var/log/vllm-healthcheck.log 2>&1" \
| sudo tee /etc/cron.d/vllm-healthcheck

Monitoring Setup
For GPU monitoring details including DCGM and Grafana setup, see GPU monitoring with Prometheus and Grafana.
Add a Prometheus scrape job for vLLM metrics:
# prometheus.yml (add to scrape_configs)
scrape_configs:
  - job_name: 'vllm'
    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8000']

Three alerts to configure from the start:
# vllm_alerts.yml
groups:
  - name: vllm
    rules:
      - alert: VLLMQueueDepth
        expr: vllm:num_requests_waiting > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "vLLM request queue is backing up"
      - alert: VLLMKVCachePressure
        expr: vllm:gpu_cache_usage_perc > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM KV cache above 90%, consider scaling"
      - alert: VLLMHighTTFT
        expr: histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2.0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "vLLM p95 TTFT above 2s SLA"

Load Balancing Two Instances
For redundancy or throughput beyond a single GPU, add an Nginx upstream block:
# /etc/nginx/sites-available/vllm
upstream vllm_backend {
    least_conn;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_buffering off;
    }
}

least_conn is the right load balancing strategy for LLM inference because request duration varies significantly. Round-robin can pile long requests onto one backend while the other sits idle.
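The difference is easy to see in a toy simulation: eight requests, two of them long, assigned to two backends. This is a simplification that assigns each request to the backend with the least accumulated work — the effect least_conn approximates — not a model of Nginx internals:

```python
from itertools import cycle

# Eight requests, two of them long (durations in seconds)
durations = [30, 1, 1, 1, 30, 1, 1, 1]

def round_robin(reqs, n_backends=2):
    """Alternate backends regardless of how busy each one is."""
    loads = [0.0] * n_backends
    for d, b in zip(reqs, cycle(range(n_backends))):
        loads[b] += d
    return loads

def least_loaded(reqs, n_backends=2):
    """Always send the next request to the least-busy backend."""
    loads = [0.0] * n_backends
    for d in reqs:
        loads[loads.index(min(loads))] += d
    return loads

print(max(round_robin(durations)))   # 62.0 -- one backend absorbs both long requests
print(max(least_loaded(durations)))  # 33.0 -- work stays balanced
```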
Phase 4 Architecture Diagram
Client
  |
  v
Nginx (least_conn load balancer, port 80)
  |           |
  v           v
vLLM-1      vLLM-2
(port 8000) (port 8000)
  |           |
  v           v
GPU-1       GPU-2
  |           |
  v           v
Prometheus  Prometheus
(scrape /metrics every 10s)

Phase 5: Scale
Single-instance is fine for development and low-traffic production. Phase 5 is for when your traffic outgrows it.
For GPU cost reduction strategies that apply across all phases, see GPU cost optimization strategies.
Horizontal Scaling: When and How
Calculate how many instances you need:
instances_needed = ceil(peak_rps / single_instance_rps)

Example: if your load test showed a single L40S handles 12 requests/second at p95 TTFT under 500ms, and your peak traffic is 60 requests/second, you need:
ceil(60 / 12) = 5 L40S instances

Add 20-30% buffer for traffic spikes: plan for 6-7 instances.
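The same arithmetic with the spike buffer folded in:

```python
import math

# Instance count for a target peak load, with a safety buffer for spikes
def instances_needed(peak_rps, per_instance_rps, buffer=0.25):
    return math.ceil(peak_rps * (1 + buffer) / per_instance_rps)

print(instances_needed(60, 12, buffer=0.0))   # 5 -- the raw ceil(60 / 12)
print(instances_needed(60, 12, buffer=0.25))  # 7 -- with a 25% spike buffer
```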
For stateless LLM inference (no session affinity), horizontal scaling is straightforward. Each instance runs an independent vLLM server. Nginx distributes load across all of them. No shared state to coordinate.
Model Sharding for 70B+ Models
When a model does not fit on a single GPU even in FP8, use tensor parallelism to split it across multiple GPUs on the same host:
# 70B model in FP16 across 2x H100 80GB
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--dtype float16 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128

When to use tensor parallelism vs adding more single-GPU instances:
- Tensor parallelism: when the model does not fit on one GPU, or when you want to reduce TTFT (prefill is parallelized across GPUs).
- More instances: when throughput is the bottleneck and the model fits on one GPU. Multiple independent instances scale throughput linearly with no NVLink overhead.
For 140B+ models, consider pipeline parallelism (--pipeline-parallel-size) in addition to tensor parallelism. Pipeline parallelism assigns different transformer layers to different GPUs, which works better when NVLink bandwidth is a bottleneck.
Spot Instances for Cost Reduction
For production inference on a stable traffic pattern, spot instances cut costs significantly. On Spheron, spot pricing for H100 and L40S varies with availability, but savings typically run 50-70% versus on-demand:
| GPU | On-Demand $/hr | Spot $/hr (approx.) | Monthly Savings (1 instance) |
|---|---|---|---|
| H100 SXM5 80GB | $2.40 | ~$0.80-1.00 | ~$1,008-1,152 |
| L40S 48GB | $1.80 | ~$0.30-0.35 | ~$1,044-1,080 |
| A100 SXM4 80GB | $1.05 | ~$0.40-0.55 | ~$360-468 |
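The savings column assumes a 720-hour (30-day) month; a sketch for plugging in your own rates:

```python
HOURS_PER_MONTH = 720  # 30-day month, the basis for the savings column above

def monthly_savings(on_demand_hr, spot_hr, n_instances=1):
    return (on_demand_hr - spot_hr) * HOURS_PER_MONTH * n_instances

# H100 at $2.40 on-demand vs $0.80-1.00 spot: ~$1,008-1,152 saved per month
print(round(monthly_savings(2.40, 1.00)))  # 1008
print(round(monthly_savings(2.40, 0.80)))  # 1152
```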
To use spot with vLLM safely, the service must handle interruptions gracefully. LLM inference keeps no durable state between requests, so the only real concern is the requests in flight at the moment of preemption. A reasonable approach:
- Run spot instances behind the Nginx load balancer.
- Set a short connection drain timeout (10-30 seconds) so Nginx stops sending new requests to a preempting instance.
- Keep at least one on-demand instance in the pool for reliability. Blend spot and on-demand based on your availability tolerance.
Cost at Scale: Projections
| Tier | Configuration | On-Demand $/month | Spot $/month (approx.) |
|---|---|---|---|
| Dev / Low traffic | 1x L40S | $1,296 | ~$216-252 |
| Small production | 2x H100 SXM5 | $3,456 | ~$1,152-1,440 |
| Scale production | 4x H100 SXM5 (spot blend) | $6,912 | ~$2,304-2,880 |
See H100 GPU rental and View all GPU pricing for current rates before budgeting.
Pricing fluctuates based on GPU availability. The prices above are based on 22 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
Related Resources
The five phases above each have deeper coverage in related posts:
- Running LLMs locally with Ollama covers Phase 1 model selection and local testing in more depth.
- Full vLLM production deployment guide covers multi-GPU tensor parallelism, FP8, and production monitoring in much more detail than Phase 2-3 above.
- AI inference GPU comparison has cost-per-token benchmarks across GPU models for Phase 2 GPU selection.
- GPU monitoring with Prometheus and Grafana covers the full Prometheus and Grafana setup referenced in Phase 4.
- Production GPU cloud architecture patterns covers failover, checkpointing, and multi-provider redundancy beyond what Phase 4 covers.
- GPU cost optimization strategies covers reserved instances, spot strategies, and idle GPU elimination in detail for Phase 5.
Every phase in this guide runs on Spheron, from a single L40S for vLLM validation to multi-GPU H100 configurations for production. Per-minute billing means you only pay for the phases you are in, not idle capacity between them.
Rent H100 → | Rent L40S → | View all GPU pricing → | Get started on Spheron →
