- What this does: vLLM's built-in OpenAI-compatible server gives you an endpoint your existing OpenAI SDK code hits without changes; the --served-model-name flag even lets you keep hardcoded model names.
- Cost: ~$1.67/1M tokens for a 70B model on one H100 vs $10/1M from OpenAI GPT-4o.
- Time to set up: 15 minutes from provisioning to first request.
- What you need: A Spheron account, a Hugging Face token, and a GPU.
Why Self-Host Your LLM API
The OpenAI API is convenient until you look at the bill. At scale, the cost gap between cloud API calls and self-hosted inference is not marginal; it is an order of magnitude.
Here's the math for a 70B model. A single H100 80GB running Llama 3.3 70B in FP8 delivers roughly 400 tokens/sec at typical concurrency with vLLM's continuous batching. That's 1.44 million tokens per hour. At $2.40/hr for the H100 SXM5 on Spheron (as of March 2026), you're paying about $1.67 per million output tokens. OpenAI charges $10/1M for GPT-4o output. For a 7B model on an A100 80GB at $1.05/hr, throughput climbs to ~3,300 tokens/sec (11.88M tokens/hr), putting you at roughly $0.09/1M tokens.
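You can sanity-check this arithmetic with a few lines; the throughput figures are the estimates above, not guarantees for your workload:

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float) -> float:
    """Dollars per 1M generated tokens for a GPU billed hourly."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# H100 SXM5 at $2.40/hr, ~400 tok/s for Llama 3.3 70B FP8
print(round(cost_per_million_tokens(2.40, 400), 2))   # ~1.67
# A100 80GB at $1.05/hr, ~3,300 tok/s for a 7B model
print(round(cost_per_million_tokens(1.05, 3300), 2))  # ~0.09
```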
Cost is not the only reason. Your prompts contain your data. Every API call to OpenAI sends that data to a third party. For healthcare, legal, and financial workloads, that is not acceptable. Self-hosting keeps inference entirely on your infrastructure. If you're also looking to move off hyperscaler clouds, see the AWS, GCP, and Azure GPU alternatives guide.
Latency is another factor. A shared cloud API has rate limits, queue wait times, and cold starts. Your own GPU endpoint has none of those. Burst as hard as your hardware allows.
| Comparison | OpenAI GPT-4o | OpenAI GPT-4o-mini | Self-hosted 70B (H100) | Self-hosted 7B (A100) |
|---|---|---|---|---|
| Cost per 1M output tokens | ~$10 | ~$0.60 | ~$1.67 | ~$0.09 |
| Rate limits | Yes (TPM/RPM) | Yes | None | None |
| Data privacy | Sent to OpenAI | Sent to OpenAI | Stays on your GPU | Stays on your GPU |
| Latency (first token) | 200-600ms | 100-400ms | 50-150ms (local) | 30-100ms (local) |
| GPU hardware cost/hr | N/A | N/A | $2.40/hr (H100 SXM5) | $1.05/hr (A100 80GB SXM4) |
Pricing as of 24 Mar 2026. Check current GPU pricing for live rates. The GPU Cost Optimization Playbook covers how to push per-token costs even lower with spot instances and reserved capacity.
Architecture: How It Works
The key insight is that vLLM ships with a built-in HTTP server that speaks the same API protocol as OpenAI. Your code does not need to know it is talking to a different backend.
```
Your app (Python/JS)
  │  openai.chat.completions.create(base_url="http://YOUR_IP:8000/v1")
  ▼
vLLM OpenAI-compatible server (:8000)
  │  /v1/chat/completions  /v1/completions  /v1/models
  ▼
GPU (H100 / A100 / RTX 4090)
  │  Continuous batching, KV cache, FP8 quantization
  ▼
Model weights (Llama 4, Mistral, Qwen 3, DeepSeek V3.2)
```

The vLLM server layer handles request parsing, batching, tokenization, and response formatting. It exposes /v1/chat/completions, /v1/completions, and /v1/models, the same paths the OpenAI API uses. The GPU layer runs the actual model with continuous batching (requests are grouped dynamically, not in fixed batches) and FP8 quantization where supported.
For multi-GPU setups, tensor parallelism, and running vLLM on 2-8 H100s with NVLink, see the full vLLM multi-GPU production guide. This post focuses on the single-GPU case and the OpenAI API drop-in replacement.
Step 1: Spin Up a Spheron GPU Instance
Pick the GPU based on the model you want to run. Bigger models need more VRAM. For a complete model-to-GPU mapping, see the GPU requirements cheat sheet.
| Model Size | Recommended GPU | Spheron On-Demand | Notes |
|---|---|---|---|
| 7B-13B (FP16) | RTX 4090 24GB | ~$0.52/hr* | Best cost/token for small models |
| 13B-30B (FP8) | L40S 48GB | ~$0.72/hr* | Single GPU, good throughput |
| 30B-70B (FP8) | H100 80GB SXM5 | $2.40/hr | Runs Llama 3.3 70B in FP8 |
| 30B-70B (FP16) | 2x H100 80GB | ~$4.80/hr | Full precision, tensor parallel |
| 70B-236B (FP8) | 4-8x H100 80GB | ~$9.60-$19.20/hr | Qwen 235B, Llama 4 Maverick |
Prices as of 24 Mar 2026. *RTX 4090 and L40S pricing varies by region and availability. Check current GPU pricing for live rates. Note: A100 is not listed for 30B+ models because it lacks hardware FP8 support. A100 80GB can handle 30B models in FP16, but 70B in FP8 is an H100-only capability. Use --dtype fp16 on A100.
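As a rough rule of thumb behind this table, model weights alone need about one gigabyte per billion parameters per byte of precision; KV cache and activations come on top. A quick sketch:

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough VRAM needed for model weights alone (KV cache is extra)."""
    # 1e9 params * N bytes/param ~= N GB per billion params
    return params_billions * bytes_per_param

print(weight_vram_gb(70, 1))  # FP8:  ~70 GB  -> fits one H100 80GB
print(weight_vram_gb(70, 2))  # FP16: ~140 GB -> needs 2x 80GB GPUs
print(weight_vram_gb(7, 2))   # FP16: ~14 GB  -> fits an RTX 4090 24GB
```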
Rent an H100 on Spheron or an A100 on Spheron from the GPU catalog.
Once your instance is running (the Spheron quick-start guide walks through the provisioning steps if this is your first deployment), SSH in and verify the GPUs:
```bash
nvidia-smi
# You should see your GPU(s) with the expected VRAM
# For H100 80GB SXM5: "80GB HBM3" or similar
```

Step 2: Install vLLM and Start the Server
Install vLLM via pip (this post uses the pip CLI path; for the Spheron docs walkthrough of vLLM deployment, see the vLLM server guide in Spheron docs or the vLLM multi-GPU production guide linked above):
```bash
pip install vllm
```

Start the OpenAI-compatible server:
```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --host 0.0.0.0 \
  --port 8000
```

What each flag does:
- --dtype fp8: use FP8 quantization; this is what makes a 70B model fit on one H100 80GB (weights ~70GB in FP8 vs ~140GB in FP16). Only works on H100, H200, Ada Lovelace, and Blackwell GPUs. A100 does not have hardware FP8 tensor cores; vLLM will error or fall back to FP16 on A100. For A100, use --dtype fp16 (or omit --dtype entirely). To fit larger models on A100, use weight-only quantization like AWQ or GPTQ instead.
- --max-model-len 16384: maximum context window in tokens. Lower values use less KV cache VRAM, leaving more room for concurrent requests.
- --gpu-memory-utilization 0.90: allocate 90% of GPU VRAM for the model and KV cache. Keep 10% headroom to avoid OOM under bursty load.
- --max-num-seqs 256: maximum concurrent sequences. Increase this if you have VRAM headroom and high concurrency.
- --host 0.0.0.0: bind to all network interfaces. Required to accept connections from outside localhost. See the security note below.
- --port 8000: the port your clients will connect to.
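To see why --max-model-len directly affects VRAM, you can estimate KV cache cost per token. The sketch below assumes Llama-3.3-70B-like dimensions (80 layers, 8 KV heads with GQA, head dim 128) and a 1-byte FP8 KV cache; check your model's config.json for the real values:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Assumed Llama-3.3-70B-like dimensions with an FP8 (1-byte) KV cache
per_token = kv_cache_bytes_per_token(80, 8, 128, 1)
per_seq_gb = per_token * 16384 / 1024**3  # one full 16K-token sequence
print(per_token)             # 163840 bytes, ~160 KB per token
print(round(per_seq_gb, 2))  # 2.5 -> ~2.5 GB of KV cache per max-length sequence
```

Halving --max-model-len roughly halves the worst-case KV cache per sequence, which is why it frees room for more concurrent requests.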
vLLM downloads model weights from Hugging Face on first run. For gated models like Llama, set your token first:
```bash
export HF_TOKEN=your_token_here
```

The server is ready when you see INFO: Application startup complete. in the logs. While it's loading (can take 5-15 minutes for large models), test the models endpoint:
```bash
# Check available models
curl http://127.0.0.1:8000/v1/models

# Test inference
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello, are you online?"}],
    "max_tokens": 50
  }'
```

Security note: --host 0.0.0.0 exposes port 8000 to the internet. For production, firewall the port and route external traffic through NGINX with auth (covered in Step 4). Or use vLLM's built-in key validation: set VLLM_API_KEY=your-secret in your shell before running vllm serve (or store it in a secrets file and load it via EnvironmentFile= in systemd), and vLLM will reject requests without a matching Authorization: Bearer your-secret header. Do not pass the key as --api-key on the command line, as it will be visible to other local users via /proc/<PID>/cmdline.
Model name mismatch warning: The model field in API requests must exactly match the Hugging Face model ID you loaded (e.g., meta-llama/Llama-3.3-70B-Instruct, not llama-70b or llama). A mismatch returns a 404 "model not found" error. Use --served-model-name to set a custom alias (covered in the multi-model section).
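If you want to fail fast instead of debugging 404s, a small helper can check the requested model against the /v1/models response before you send traffic (the helper name and sample payload below are illustrative):

```python
def served_model_ids(models_response: dict) -> set:
    """Extract model IDs from a /v1/models JSON response."""
    return {m["id"] for m in models_response.get("data", [])}

# Shape of the JSON that GET /v1/models returns
resp = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.3-70B-Instruct", "object": "model"}],
}
ids = served_model_ids(resp)
print("meta-llama/Llama-3.3-70B-Instruct" in ids)  # True
print("llama-70b" in ids)                          # False -> this request would 404
```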
Other Models to Run
Substitute the model ID in the vllm serve command for any of these:
- mistralai/Mistral-7B-Instruct-v0.3: fast 7B model, fits on RTX 4090
- Qwen/Qwen3-32B: strong 32B model, fits on L40S or single H100. See the Qwen 3 GPU deployment guide for Qwen-specific configuration.
- meta-llama/Llama-3.3-70B-Instruct: 70B model, needs H100 80GB with FP8
- Llama 4 Scout (109B total params with 16 experts, 17B active per token) and Maverick (400B total params with 128 experts, 17B active per token): both use a Mixture-of-Experts (MoE) architecture, meaning only a fraction of parameters are activated for each token. See the Llama 4 deployment guide for the specific configuration needed.
- For DeepSeek V3.2 Speciale, note that it requires 8-16x H100/H200 GPUs depending on precision and context length. Do not attempt it on a single GPU. See the DeepSeek V3.2 deployment guide for the multi-GPU setup.
Step 3: Drop-In Replacement, Zero Code Changes
This is the part that makes vLLM practical for teams already using OpenAI. You change two values and nothing else.
Before (OpenAI):
```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this contract."}],
    max_tokens=500,
)
print(response.choices[0].message.content)
```

After (vLLM on Spheron, only 2 lines change):
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_SPHERON_IP:8000/v1",  # change this
    api_key="token-abc123",  # any non-empty string
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # match vLLM model name
    messages=[{"role": "user", "content": "Summarize this contract."}],
    max_tokens=500,
)
print(response.choices[0].message.content)
```

The same pattern works with the openai npm package in Node.js/TypeScript:
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://YOUR_SPHERON_IP:8000/v1", // change this
  apiKey: "token-abc123", // any non-empty string
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.3-70B-Instruct", // match vLLM model name
  messages: [{ role: "user", content: "Summarize this contract." }],
  max_tokens: 500,
});
console.log(response.choices[0].message.content);
```

Streaming works without any changes as well:
```python
for chunk in client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a short poem."}],
    max_tokens=200,
    stream=True,
):
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

vLLM handles streaming via server-sent events, same as OpenAI. No client-side changes needed.
Step 4: Production Hardening
Running vllm serve in a terminal session is fine for testing. For production, you need the process to survive reboots and failures, with traffic gated behind auth.
Systemd Service
Create a dedicated non-root user for running vLLM. This user needs access to GPU devices via the render and video groups:
```bash
sudo useradd --system --no-create-home --shell /usr/sbin/nologin vllm
sudo usermod -aG render,video vllm

# Give the vllm user ownership of the model weights directory
sudo chown -R vllm:vllm /path/to/model/weights
```

Create a secrets file to store your Hugging Face token and API key. Systemd unit files under /etc/systemd/system/ are world-readable (mode 644), so embedding secrets directly in Environment= directives exposes them to any local user via systemctl cat or direct file reads. Use a separate file with restricted permissions instead:
```bash
sudo mkdir -p /etc/vllm
sudo sh -c 'umask 077; printf "HF_TOKEN=your_token_here\nVLLM_API_KEY=your-secret-token\n" > /etc/vllm/secrets'
sudo chown root:root /etc/vllm/secrets
```

vLLM does not emit sd_notify(READY=1) natively, so you need a small wrapper script that starts vLLM, polls /health, notifies systemd once the server is ready, and then waits for the vLLM process to exit. Create /usr/local/bin/vllm-start:
```bash
#!/bin/bash
set -euo pipefail

/usr/local/bin/vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --host 127.0.0.1 \
  --port 8000 &
VLLM_PID=$!

until curl -sf http://127.0.0.1:8000/health > /dev/null 2>&1; do
  if ! kill -0 "$VLLM_PID" 2>/dev/null; then
    echo "vLLM process exited during startup" >&2
    exit 1
  fi
  sleep 5
done

systemd-notify --ready
wait "$VLLM_PID"
```

Make it executable:

```bash
sudo chmod +x /usr/local/bin/vllm-start
```

Create /etc/systemd/system/vllm.service:
```ini
[Unit]
Description=vLLM OpenAI-compatible inference server
After=network.target

[Service]
Type=notify
NotifyAccess=all
User=vllm
EnvironmentFile=/etc/vllm/secrets
Environment=HF_HOME=/path/to/model/weights
ExecStart=/usr/local/bin/vllm-start
TimeoutStartSec=900
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Type=notify tells systemd to wait for an sd_notify(READY=1) signal before marking the unit active. The wrapper script sends that signal (via systemd-notify --ready) only after the /health poll succeeds, so any service declaring After=vllm.service is held until the model is fully loaded and accepting requests. With Type=simple and ExecStartPost, systemd marks the unit active as soon as ExecStart is forked; ExecStartPost cannot delay that transition and therefore cannot gate dependent services.
TimeoutStartSec=900 gives the wrapper script up to 15 minutes to send READY=1 before systemd marks the start as failed. This covers the 5-15 minutes a 70B model can take to load weights on first run. The default is 90 seconds, which is not enough.
Note --host 127.0.0.1 instead of 0.0.0.0 here. This binds vLLM to localhost only; NGINX handles external traffic (below). Running vLLM as a dedicated non-root user limits the blast radius if the process or a loaded model weight has a vulnerability. CUDA and nvidia drivers work fine under non-root users with the correct group membership.
Environment=HF_HOME=/path/to/model/weights tells vLLM where to read (or download) model weights. Set this to the directory you chowned to the vllm user above. Without it, the system user has no home directory, so $HOME/.cache/huggingface/ resolves to a path the vllm user cannot write to, and the service will fail to load weights.
VLLM_API_KEY in /etc/vllm/secrets (mode 600) enables vLLM's built-in Bearer token validation. vLLM reads VLLM_API_KEY directly from its process environment, so the secret does not need to appear in ExecStart as a command-line argument at all. This matters because systemd expands environment variables from EnvironmentFile= before calling execve(), so any secret passed as --api-key $VLLM_API_KEY would be expanded into argv and become readable to any local user via /proc/<PID>/cmdline or ps aux. By omitting --api-key from ExecStart entirely, the secret stays in the process environment block, readable only by the process owner and root via /proc/<PID>/environ (mode 400). The value does not appear in the unit file or systemctl cat output as plaintext. The OpenAI Python and JavaScript SDKs send Authorization: Bearer <api_key> with every request, and vLLM validates that token natively. Replace your-secret-token in /etc/vllm/secrets with a strong random value and use the same value as your api_key in your SDK clients.
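A quick way to generate a suitably strong value for VLLM_API_KEY is Python's secrets module:

```python
import secrets

# 32 bytes of entropy, URL-safe base64 without padding (43 characters)
token = secrets.token_urlsafe(32)
print(token)
```

Paste the output into /etc/vllm/secrets and into your SDK clients' api_key.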
Enable and start:
```bash
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo journalctl -u vllm -f   # tail the logs
```

Health Check
vLLM exposes a /health endpoint:
```bash
curl http://127.0.0.1:8000/health
# returns HTTP 200 with empty body when ready
```

Use this as a readiness probe before routing traffic:
```bash
#!/bin/bash
until curl -sf http://127.0.0.1:8000/health > /dev/null; do
  echo "Waiting for vLLM to be ready..."
  sleep 5
done
echo "vLLM is ready"
```

NGINX Reverse Proxy with TLS
Install NGINX and obtain a TLS certificate before configuring the proxy. Certbot is the easiest way to get a free Let's Encrypt certificate:
```bash
sudo apt-get install -y nginx certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.com
```

Authentication is handled by vLLM's built-in Bearer token validation via the VLLM_API_KEY environment variable (set in /etc/vllm/secrets as shown above). The OpenAI Python and JavaScript SDKs send Authorization: Bearer <api_key> with every request, and vLLM validates that token natively. Do not use NGINX auth_basic here: HTTP Basic Auth uses a different scheme (Authorization: Basic <base64>) and the OpenAI SDKs have no way to send it, so every request would get a 401 before reaching vLLM.
Create a config at /etc/nginx/sites-available/vllm:
```nginx
# Redirect all HTTP traffic to HTTPS
server {
    listen 80;
    server_name your-domain.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name your-domain.com;

    ssl_certificate /etc/letsencrypt/live/your-domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-domain.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    client_max_body_size 50m;

    location /v1/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_connect_timeout 10s;
        proxy_buffering off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

Enable the site:
```bash
sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```

For load balancing across multiple vLLM instances, see the NGINX upstream config in the vLLM production guide (linked in the multi-model section below).
Monitoring
vLLM exposes a Prometheus-compatible /metrics endpoint:
```bash
curl http://127.0.0.1:8000/metrics | grep vllm
```

Key metrics to watch:
| Metric | What it tells you |
|---|---|
| vllm:num_requests_waiting | Request queue depth |
| vllm:kv_cache_usage_perc (v1+; vllm:gpu_cache_usage_perc in older versions) | GPU KV cache fill rate |
| vllm:time_to_first_token_seconds | TTFT latency |
| vllm:num_requests_running | Active requests |
| vllm:e2e_request_latency_seconds | Total request duration |
For the full Prometheus/Grafana setup, GPU-level DCGM monitoring, and alerting configuration, see the GPU monitoring for ML guide.
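The /metrics endpoint returns Prometheus text exposition format. For a quick look without a Prometheus server, a naive parser like this sketch handles simple gauge lines (it ignores HELP/TYPE comments and does not handle label values containing spaces or full histogram families):

```python
def parse_prom(text: str) -> dict:
    """Parse 'name{labels} value' lines from Prometheus text exposition."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip the label set
        out[name] = float(value)
    return out

# Sample of what `curl .../metrics | grep vllm` might return
sample = """\
# HELP vllm:num_requests_waiting Number of requests waiting.
vllm:num_requests_waiting{model_name="llama"} 3.0
vllm:kv_cache_usage_perc{model_name="llama"} 0.42
"""
m = parse_prom(sample)
print(m["vllm:num_requests_waiting"])  # 3.0
print(m["vllm:kv_cache_usage_perc"])   # 0.42
```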
Load Testing: How Many Concurrent Requests Can One H100 Handle?
Continuous batching means throughput increases under concurrent load. Single requests do not saturate the GPU. These are approximate figures consistent with published vLLM benchmarks; benchmark on your specific model and workload:
| Metric | 7B (FP16) on H100 | 70B (FP8) on H100 |
|---|---|---|
| Max concurrent requests (steady state) | ~400 | ~100 |
| Aggregate throughput (tokens/sec) | ~8,000-12,000 | ~300-500 |
| Median TTFT at 50 concurrent | ~80ms | ~180ms |
| P95 TTFT at 100 concurrent | ~250ms | ~600ms |
| Cost per 1M tokens (at max throughput) | ~$0.06 | ~$1.33 |
To generate load for benchmarking, use locust or vegeta:
```bash
# Install vegeta
go install github.com/tsenart/vegeta@latest

# Create the request body file
cat > payload.json <<'EOF'
{"model": "meta-llama/Llama-3.3-70B-Instruct", "messages": [{"role": "user", "content": "Summarize this in one sentence: The quick brown fox jumps over the lazy dog."}], "max_tokens": 50}
EOF

# Send 50 requests/sec for 30 seconds (1,500 total requests)
# Note: -rate controls requests per second, not concurrency. Use -workers to set worker count.
echo "POST http://127.0.0.1:8000/v1/chat/completions
Content-Type: application/json
Authorization: Bearer ${VLLM_API_KEY}" > targets.txt

vegeta attack -rate=50 -duration=30s -body payload.json -targets targets.txt | vegeta report
```

The key thing to measure is throughput under concurrency, not single-request latency. A low single-request TTFT can still collapse under 100 concurrent users if you haven't tuned --max-num-seqs and --max-num-batched-tokens.
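Whatever load generator you use, summarize the raw per-request TTFT samples into a median and p95 rather than eyeballing averages. A sketch using Python's statistics module (the sample values below are made up):

```python
import statistics

def ttft_summary(samples_ms: list) -> dict:
    """Median and p95 from per-request TTFT samples in milliseconds."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": statistics.median(samples_ms), "p95": qs[94]}

# e.g. TTFT samples collected from a vegeta or locust run
samples = [80, 85, 90, 95, 100, 110, 120, 150, 200, 400]
print(ttft_summary(samples))
```

The long tail (the 400 ms outlier here) barely moves the median but dominates the p95, which is exactly what your users feel under load.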
Pricing fluctuates based on GPU availability. The prices above are based on 24 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Multi-Model Setup: Serve Multiple Models from One Instance
Two approaches depending on your hardware.
Separate Ports (One Model Per GPU)
If you have a multi-GPU instance, pin one vLLM process per GPU on different ports:
```bash
# /etc/vllm/secrets is mode 600 owned by root. 'source' is a shell builtin and
# cannot be prefixed with sudo, so a non-root user will get Permission denied.
# You have two options:
#
# Option A: open a root shell first, then run the commands below:
#   sudo -s
#
# Option B: export the variables individually (works from any user shell):
#   export VLLM_API_KEY=$(sudo grep ^VLLM_API_KEY= /etc/vllm/secrets | cut -d= -f2-)
#   export HF_TOKEN=$(sudo grep ^HF_TOKEN= /etc/vllm/secrets | cut -d= -f2-)
source /etc/vllm/secrets || { echo "Failed to load secrets from /etc/vllm/secrets; check file permissions and ownership"; exit 1; }

# GPU 0 - 7B model on port 8000
CUDA_VISIBLE_DEVICES=0 vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --dtype auto \
  --host 127.0.0.1 \
  --port 8000 &

# GPU 1 - 70B model on port 8001
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --dtype auto \
  --host 127.0.0.1 \
  --port 8001 &
```

--dtype auto lets vLLM pick the best precision for the GPU it detects. Use --dtype fp8 only on H100 or H200 instances (which have native FP8 tensor cores). A100 GPUs lack hardware FP8 support and vLLM will error if you force --dtype fp8 on them. See the FP8 note in Step 2 for details.
Both processes bind to 127.0.0.1 (localhost only), consistent with the production hardening advice above. External traffic goes through NGINX.
Because each backend serves a different model, do not pool them in an upstream block with least_conn or round-robin. vLLM validates the model field in every request against the model it loaded, so routing requests randomly across backends will return a "model not found" 404 error roughly half the time. Use separate location blocks instead:
```nginx
# Mistral 7B - clients set base_url to "https://your-domain.com/mistral/v1"
location /mistral/v1/ {
    client_max_body_size 50m;
    proxy_pass http://127.0.0.1:8000/v1/;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_read_timeout 300s;
    proxy_buffering off;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

# Llama 70B - clients set base_url to "https://your-domain.com/llama/v1"
location /llama/v1/ {
    client_max_body_size 50m;
    proxy_pass http://127.0.0.1:8001/v1/;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_read_timeout 300s;
    proxy_buffering off;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
```

A client targeting Mistral sets base_url="https://your-domain.com/mistral/v1". A client targeting Llama sets base_url="https://your-domain.com/llama/v1". The upstream pool with least_conn is only appropriate when all backends serve the same model, such as running two identical Mistral 7B instances for redundancy or throughput scaling.
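On the client side, the same path-based routing can be mirrored with a small lookup so a wrong model name fails loudly before any HTTP call is made (the routing table below is illustrative; use your own domain and model IDs):

```python
# Hypothetical client-side routing table mirroring the NGINX location blocks
ROUTES = {
    "mistralai/Mistral-7B-Instruct-v0.3": "https://your-domain.com/mistral/v1",
    "meta-llama/Llama-3.3-70B-Instruct": "https://your-domain.com/llama/v1",
}

def base_url_for(model: str) -> str:
    """Return the backend base_url for a model, or raise before any request is sent."""
    try:
        return ROUTES[model]
    except KeyError:
        raise ValueError(f"No backend serves model {model!r}") from None

print(base_url_for("meta-llama/Llama-3.3-70B-Instruct"))  # .../llama/v1
```

Pass the result as base_url when constructing the OpenAI client.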
Model Name Alias with --served-model-name
If you have a legacy integration that sends gpt-4 as the model name, you can load Llama and present it as gpt-4:
```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --served-model-name gpt-4 \
  --dtype fp8 \
  --host 127.0.0.1 \
  --port 8000
```

Now requests with "model": "gpt-4" in the body will route to your Llama instance. Useful for migrating applications that have the model name hardcoded. The /v1/models endpoint will also return gpt-4 as the available model ID.
Security note: Use --host 127.0.0.1 to bind to localhost only. Route external traffic through NGINX with auth as shown in Step 4, or set the VLLM_API_KEY environment variable to reject unauthenticated requests. Do not pass the key via --api-key on the command line.
Running a self-hosted OpenAI-compatible API means you own the endpoint, the data, and the cost curve. Spheron's bare-metal H100s and A100s give you the GPU without the hyperscaler markup.
