Most LoRA guides end at trainer.train(). What happens after, serving those adapters to real users at scale, is where most teams get stuck. The economics are brutal without a good answer: if you spin up a dedicated model instance per customer, you need 100 GPUs for 100 customers. With multi-adapter serving, you need one. The math is that simple, and this guide walks through everything you need to make it work in production.
If you haven't built your adapters yet, start with our LLM fine-tuning guide first. If you already have adapters and want to serve them, keep reading.
The Production Gap: Why LoRA Guides Stop at Training
Fine-tuning is well-documented because the tooling is mature and the problem is self-contained. You have a dataset, you run training, you get an adapter file. Done. The problem comes next: you now have dozens or hundreds of these adapter files, one per customer or use case, and you need to serve them under production load.
The naive answer is to run a separate inference server for each adapter. At small scale, that works. At 10 customers it's expensive. At 100 customers it's financially unsustainable. A dedicated Llama 3.1 8B instance per customer means 100 copies of a 16GB model, sitting in VRAM across 100 GPUs, most of them idle most of the time.
Multi-adapter serving is the pattern that breaks this equation. One base model instance in VRAM, adapters loaded on demand per request, customers isolated by the model routing layer. This is what vLLM's LoRA support was designed for.
All adapters must share the same base model architecture and dtype. A mixed-base setup (some adapters on Llama 3.1 8B, others on Llama 3.1 70B) requires separate vLLM instances. Plan your fine-tuning pipeline around a single base model if you want to consolidate serving.
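A cheap guard against mixed-base mistakes is to validate each adapter before registering it. PEFT writes the training base model into `adapter_config.json`, so a startup check can reject mismatched adapters early. A minimal sketch, assuming PEFT-format adapter directories:

```python
import json
from pathlib import Path

def check_adapter_base(adapter_dir: str, expected_base: str) -> str:
    """Verify a PEFT adapter was trained against the base model this server runs."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    trained_base = config.get("base_model_name_or_path", "")
    if trained_base != expected_base:
        raise ValueError(
            f"{adapter_dir}: trained on {trained_base!r}, server runs {expected_base!r}"
        )
    return trained_base
```

Run this over every adapter directory before building your `--lora-modules` list; a mismatch caught here is a config error, while a mismatch caught at load time is a customer-facing outage.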
Multi-Adapter Architecture: One Base Model, Hundreds of Adapters
The architecture is conceptually simple. The base model weights sit in GPU HBM, frozen. Each LoRA adapter is a small set of low-rank matrices (the A and B matrices per attention layer) that represent the fine-tuned delta. At inference time:
- A request arrives with a `model` field specifying the adapter alias (e.g., `customer-a`).
- vLLM's LoRAManager checks whether `customer-a`'s weights are in GPU memory. If yes, they are merged into the base model's computation for this request. If not, they are loaded from CPU RAM or disk.
- The request runs through the base model with those delta weights applied as residual additions to the attention outputs.
- After the request completes, the adapter stays cached (LRU eviction when capacity is exceeded).
The key architectural point: adapters share the base model's KV cache and all attention layers. Only the delta weights are per-customer. This is why the VRAM budget is so favorable.
```
Base Model (frozen, ~16 GB for 8B FP16)
 |
 +-- LoRA Delta: customer-a (~60 MB, rank 16)
 +-- LoRA Delta: customer-b (~60 MB, rank 16)
 +-- LoRA Delta: customer-c (~60 MB, rank 16)
 +-- ... (up to hundreds more in CPU RAM)
 |
KV Cache (shared, ~4-8 GB for typical batch sizes)
```

vLLM pre-allocates GPU memory buffers sized to `--max-lora-rank` at startup, not per-adapter. If any adapter has rank 64 but most are rank 16, all adapters get rank-64 buffers. Set `--max-lora-rank` to the actual maximum across your adapter set, not higher. Oversizing it wastes significant VRAM.
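The per-adapter sizes can be sanity-checked with a back-of-envelope parameter count: each adapted projection of shape `(d_in, d_out)` gets an A matrix of `d_in x rank` and a B matrix of `rank x d_out`. A sketch assuming adapters target only the four attention projections (actual sizes depend on which modules were adapted during fine-tuning; adapters that also target the MLP projections come out roughly 3x larger, which is consistent with the ~60 MB figures above):

```python
def lora_param_count(rank: int, layers: int, shapes: list) -> int:
    """Total LoRA parameters: each (d_in, d_out) projection contributes
    rank * (d_in + d_out) parameters per layer (A and B matrices)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * layers

# Llama 3.1 8B attention projections: hidden dim 4096, GQA k/v dim 1024
# (attention-only targets assumed here)
ATTN_SHAPES = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]
params = lora_param_count(16, 32, ATTN_SHAPES)
print(f"{params:,} params, ~{params * 2 / 1024**2:.0f} MB in FP16")
# -> 13,631,488 params, ~26 MB in FP16
```

Either way, an adapter is three orders of magnitude smaller than the 16 GB base model, which is the whole reason the tiered cache works.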
GPU Memory Math: How 100 Adapters Fit on One H100
Here's the VRAM breakdown for a typical production setup with Llama 3.1 8B and 100 LoRA adapters:
| Component | Size |
|---|---|
| Llama 3.1 8B base (FP16) | ~16 GB |
| 8 x LoRA r=16 adapters in GPU | ~0.5 GB total |
| 92 x LoRA adapters in CPU RAM | ~6 GB total |
| KV cache (batch of 64, 8K context) | ~4 GB |
| Activations and buffers | ~2 GB |
| Total GPU VRAM | ~22.5 GB |
Compare that to the alternative: 100 separate Llama 3.1 8B instances at 16GB each = 1,600 GB across 20 H100s. The multi-adapter setup fits on a single A100 80GB with room to spare.
VRAM scales with adapter rank. Higher rank means more parameters per adapter and more VRAM per adapter in GPU cache:
| Adapter Rank | Size per Adapter (8B model) | Max Adapters on H100 80GB (w/ base model) |
|---|---|---|
| r=8 | ~30 MB | 200+ (GPU cache) |
| r=16 | ~60 MB | 100+ (GPU cache) |
| r=32 | ~120 MB | 50+ (GPU cache) |
| r=64 | ~240 MB | 25+ (GPU cache) |
In practice, keep only the most frequently requested adapters in GPU cache (--max-loras). Less-frequent adapters live in CPU RAM (--max-cpu-loras) and load on demand. This tiered approach handles hundreds of adapters with minimal VRAM overhead.
For a deeper breakdown of how VRAM is split between weights, KV cache, and activations, see our GPU memory requirements for LLMs guide.
vLLM LoRA Serving: Configuration and Flags
vLLM has supported LoRA since v0.3.0. Dynamic adapter loading via REST API is available in recent releases (v0.6.2+). Use the latest release.
Full production launch command:
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v /path/to/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype fp16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a \
    customer-b=/adapters/customer-b \
    customer-c=/adapters/customer-c \
  --max-num-seqs 64
```

What each LoRA flag does:

- `--enable-lora`: activates LoRA support in vLLM's scheduler and LoRAManager.
- `--max-loras`: number of adapters that can be in GPU memory at once. Higher means more VRAM usage; lower means more adapter swaps (and latency on cache misses).
- `--max-lora-rank`: maximum rank across all registered adapters. vLLM pre-allocates buffers for this rank at startup. Set it to your actual max rank, not higher.
- `--max-cpu-loras`: adapters held in CPU RAM as an intermediate cache tier before eviction to disk. Acts as a buffer between GPU cache and disk/S3.
- `--lora-modules`: space-separated `alias=path` pairs. Paths can be local directories or S3 URIs (`s3://bucket/path/to/adapter`).
Sending a request to a specific adapter uses the standard OpenAI API, with the model field set to the adapter alias:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="customer-a",  # routes to the customer-a LoRA adapter
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(response.choices[0].message.content)
```

To add adapters dynamically at runtime without restarting the server, use the REST API (vLLM v0.6.2+). This requires the `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` environment variable to be set when the server starts:
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -e VLLM_ALLOW_RUNTIME_LORA_UPDATING=True \
  -v /path/to/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-cpu-loras 64
```

Then register new adapters without restarting:
```bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "customer-d", "lora_path": "/adapters/customer-d"}'
```

This is how you handle new customer onboarding without downtime: upload the adapter, call the endpoint, start serving.
SGLang LoRA Serving: The Weight Loading Overlap Advantage
SGLang v0.5.9 introduced weight loading overlap for LoRA, reducing time-to-first-token by up to 78% vs sequential loading. The feature overlaps the computation of the first few layers with loading remaining LoRA weights onto the GPU, hiding most of the adapter load latency. Full framework comparison in our vLLM vs TensorRT-LLM vs SGLang benchmarks.
SGLang launch command for LoRA serving:
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /path/to/adapters:/adapters \
  lmsysorg/sglang:v0.5.9-cu124-runtime \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --lora-paths customer-a=/adapters/customer-a \
    customer-b=/adapters/customer-b \
  --max-loras-per-batch 4 \
  --port 8000
```

When to choose SGLang over vLLM for LoRA:
- Your workload is latency-sensitive (TTFT matters more than throughput).
- You have a small number of active adapters with high request concurrency per adapter.
- You're already using SGLang for its RadixAttention prefix caching.
When to stick with vLLM:
- You need dynamic adapter loading via REST API (SGLang's runtime adapter API is less mature).
- You have a large adapter registry (50+) that needs LRU eviction and CPU offload.
- You need the widest model compatibility.
Step-by-Step: Deploy on Spheron GPU Cloud
1. Provision a GPU Instance
Go to app.spheron.ai, select H100 SXM5 or A100 80GB from the GPU catalog. SSH into the instance and verify:
```bash
nvidia-smi
# Should show your GPU with full VRAM available
```

For a 7-8B base model with up to 50 adapters, the A100 80GB at $1.05/hr is the right choice. For 13B+ models or adapters at rank 32+, use the H100 SXM5 at $2.40/hr.
2. Install vLLM
```bash
pip install vllm
```

LoRA support is built-in from v0.3.0+. No extras needed.
3. Upload Your LoRA Adapters
Three options:
```bash
# Option 1: scp from local machine
scp -r ./my-adapter user@your-instance:/home/user/adapters/customer-a

# Option 2: from Hugging Face Hub
huggingface-cli download my-org/customer-a-adapter --local-dir /adapters/customer-a

# Option 3: from S3 (use the S3 URI directly in --lora-modules)
# s3://your-bucket/adapters/customer-a
```

4. Launch the Multi-Adapter Server
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v /home/user/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype fp16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a \
    customer-b=/adapters/customer-b \
  --max-num-seqs 64
```

5. Test Per-Adapter Routing
```bash
# Test customer-a adapter
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "customer-a",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```

Or using the Python SDK:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
for adapter in ["customer-a", "customer-b"]:
    response = client.chat.completions.create(
        model=adapter,
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(f"{adapter}: {response.choices[0].message.content}")
```

6. Systemd Service for Persistence
Create /etc/systemd/system/vllm-lora.service so the server survives SSH disconnection:
```ini
[Unit]
Description=vLLM LoRA Multi-Adapter Server
After=network.target

[Service]
Type=simple
Restart=on-failure
RestartSec=5
EnvironmentFile=/etc/vllm-lora.env
ExecStart=/usr/bin/docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} \
  -v /home/user/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype fp16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a

[Install]
WantedBy=multi-user.target
```

Before enabling the service, create /etc/vllm-lora.env with restricted permissions to keep your token out of the unit file:
```bash
sudo bash -c 'echo "HUGGING_FACE_HUB_TOKEN=your_actual_token" > /etc/vllm-lora.env'
sudo chmod 600 /etc/vllm-lora.env
```

Then enable and start:
```bash
sudo systemctl daemon-reload
sudo systemctl enable vllm-lora
sudo systemctl start vllm-lora
sudo journalctl -u vllm-lora -f  # follow logs
```

`Restart=on-failure` means systemd restarts the service if it crashes, but `systemctl stop vllm-lora` performs a clean shutdown without an immediate restart.
Adapter Hot-Swapping: Request-Level Routing
vLLM's LoRAManager handles adapter selection at the request level. When a request arrives:
- The `model` field maps to an adapter alias registered at startup (or added via `/v1/load_lora_adapter`).
- LoRAManager checks the GPU cache. If the adapter is there, it proceeds; cache-hit latency is sub-millisecond.
- On a cache miss, it loads from CPU RAM first (fast, tens of milliseconds), then from disk or S3 if not in CPU RAM (hundreds of milliseconds for S3).
- GPU cache eviction is LRU: if all 8 slots are full and a 9th adapter is needed, the least recently used adapter is evicted to CPU RAM.
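The eviction policy can be illustrated with a toy model (this is not vLLM's actual implementation, just the LRU behavior it exhibits): slots fill up, hits refresh recency, and a miss against a full cache pushes the stalest adapter down a tier.

```python
from collections import OrderedDict

class ToyLoraCache:
    """Toy LRU model of a GPU adapter tier with fixed slots (--max-loras)."""
    def __init__(self, max_loras: int):
        self.max_loras = max_loras
        self.gpu = OrderedDict()   # alias -> loaded flag, ordered by recency
        self.evicted = []          # aliases pushed back to the CPU tier

    def request(self, alias: str) -> str:
        if alias in self.gpu:
            self.gpu.move_to_end(alias)  # cache hit: refresh recency
            return "hit"
        if len(self.gpu) >= self.max_loras:
            victim, _ = self.gpu.popitem(last=False)  # evict least recently used
            self.evicted.append(victim)
        self.gpu[alias] = True
        return "miss"

cache = ToyLoraCache(max_loras=2)
print([cache.request(a) for a in ["a", "b", "a", "c"]])
# -> ['miss', 'miss', 'hit', 'miss']
print(cache.evicted)  # -> ['b']  ("b" was least recently used when "c" arrived)
```

The practical takeaway: if your traffic concentrates on a handful of adapters, a small `--max-loras` serves almost everything from GPU cache; a flat access pattern across many adapters is what drives eviction churn.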
Pre-warming: to avoid cold-adapter latency for important customers, send a dummy request per adapter at startup:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
adapters = ["customer-a", "customer-b", "customer-c"]
for adapter in adapters:
    client.chat.completions.create(
        model=adapter,
        messages=[{"role": "user", "content": "warmup"}],
        max_tokens=1,
    )
    print(f"Pre-warmed: {adapter}")
```

For production deployments, build an adapter registry: a mapping from customer ID to adapter path, stored in Postgres or Redis. A thin FastAPI wrapper resolves `customer_id` to an adapter alias before forwarding to vLLM:
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import openai

app = FastAPI()
client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

# In practice, load this from your database
ADAPTER_REGISTRY = {
    "cust_abc123": "customer-a",
    "cust_def456": "customer-b",
}

class ChatRequest(BaseModel):
    customer_id: str
    messages: list

@app.post("/v1/chat")
async def chat(request: ChatRequest):
    adapter_alias = ADAPTER_REGISTRY.get(request.customer_id)
    if not adapter_alias:
        raise HTTPException(status_code=404, detail="unknown customer")
    response = await client.chat.completions.create(
        model=adapter_alias,
        messages=request.messages,
    )
    return response.model_dump()
```

Use namespaced aliases (`tenant-{id}-v{version}`) to avoid collisions across customers and model versions.
Cost Comparison: Multi-Adapter vs Per-Instance
Here's what the math looks like at 100 customers, running 24/7 on-demand on Spheron:
| Approach | GPU Setup | Monthly Cost (100 customers) |
|---|---|---|
| 1 model instance per customer | 100x A100 80GB | $75,600/month |
| Multi-adapter on 1 H100 SXM5 | 1x H100 SXM5 | $1,728/month |
| Multi-adapter on 2 H100 SXM5 (HA) | 2x H100 SXM5 | $3,456/month |
The per-instance approach costs $75,600/month (100 A100s at $1.05/hr). One H100 with multi-adapter serving costs $1,728/month. That's a 97.7% cost reduction for the same 100 customers.
In practice, one H100 handles 100 customers at moderate concurrency (a few dozen simultaneous requests). If your traffic is higher, two H100s behind a load balancer gives you high availability and headroom for growth at $3,456/month, still 95.4% cheaper than the per-instance approach. For more ways to reduce GPU spend across your entire stack, see the GPU cost optimization playbook.
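The table's figures follow from a 720-hour (30-day) billing month and the per-GPU rates quoted above. A quick check of the arithmetic:

```python
HOURS_PER_MONTH = 720  # 30-day month, matching the figures above

def monthly_cost(gpu_count: int, hourly_rate: float) -> float:
    return gpu_count * hourly_rate * HOURS_PER_MONTH

per_instance = monthly_cost(100, 1.05)  # 100x A100 80GB at $1.05/hr
multi_adapter = monthly_cost(1, 2.40)   # 1x H100 SXM5 at $2.40/hr
savings = 1 - multi_adapter / per_instance
print(f"${per_instance:,.0f} vs ${multi_adapter:,.0f} ({savings:.1%} cheaper)")
# -> $75,600 vs $1,728 (97.7% cheaper)
```

Swap in your own rates and customer count; the conclusion holds as long as adapters stay small relative to the base model.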
Rent H100 → | Rent A100 → | View all pricing →
Pricing fluctuates based on GPU availability. The prices above are based on 28 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
Production Patterns: Scaling, Registries, and A/B Testing
Auto-Scaling
vLLM exposes a Prometheus-compatible /metrics endpoint. Key metrics for scaling decisions:
- `vllm:num_requests_waiting`: queue depth. If this is consistently above 0, you need more capacity.
- `vllm:gpu_cache_usage_perc`: KV cache fill rate. Above 90% means you're hitting memory limits.
- `vllm:time_to_first_token_seconds`: latency by percentile.
Scale horizontally (multiple H100 instances behind a load balancer) when queue depth exceeds a threshold. A simple rule: if num_requests_waiting averages above 5 for more than 60 seconds, add another instance.
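A scale-up check against the `/metrics` endpoint can be a few lines. A simplified sketch: it parses the Prometheus text format naively (real metric lines may carry labels, and production code should average `num_requests_waiting` over a window rather than act on one sample):

```python
def parse_metric(metrics_text: str, name: str) -> float:
    """Pull a single gauge value out of Prometheus text exposition format
    (simplified: assumes unlabeled '<name> <value>' lines)."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)

def should_scale_up(metrics_text: str, queue_threshold: float = 5.0) -> bool:
    # Matches the rule above: queue depth persistently over 5 means add capacity
    return parse_metric(metrics_text, "vllm:num_requests_waiting") > queue_threshold

sample = "vllm:num_requests_waiting 7.0\nvllm:gpu_cache_usage_perc 0.82"
print(should_scale_up(sample))  # -> True
```

In practice you would fetch the text from `http://<instance>:8000/metrics` on a timer and require the threshold to hold for 60 seconds before provisioning a new instance.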
For full Prometheus and Grafana monitoring setup, see the vLLM production deployment guide.
Adapter Registries
Store adapter metadata in a database. Minimal Postgres schema:
```sql
CREATE TABLE lora_adapters (
    id SERIAL PRIMARY KEY,
    customer_id VARCHAR(128) NOT NULL,
    alias VARCHAR(256) NOT NULL UNIQUE,
    path TEXT NOT NULL,  -- local path or s3:// URI
    base_model VARCHAR(256) NOT NULL,
    rank INTEGER NOT NULL,
    version INTEGER DEFAULT 1,
    created_at TIMESTAMP DEFAULT NOW()
);
```

When a customer onboards: upload their adapter to S3, insert a row, and call `/v1/load_lora_adapter` to register the adapter with the running vLLM instance. No restart needed.
When the server restarts, reload all adapters from the registry at startup using the --lora-modules flag or the REST API in a startup script.
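One way to sketch that startup step: build the `--lora-modules` argument list from registry rows (the `alias` and `path` column names follow the schema above; in a real script the rows would come from a `SELECT` against `lora_adapters`):

```python
def lora_module_args(rows: list) -> list:
    """Build vLLM --lora-modules launch arguments from adapter registry rows."""
    pairs = [f"{row['alias']}={row['path']}" for row in rows]
    return ["--lora-modules", *pairs] if pairs else []

# Example rows as they might come back from the lora_adapters table
rows = [
    {"alias": "customer-a", "path": "s3://adapters/customer-a"},
    {"alias": "customer-b", "path": "/adapters/customer-b"},
]
print(" ".join(lora_module_args(rows)))
# -> --lora-modules customer-a=s3://adapters/customer-a customer-b=/adapters/customer-b
```

The same rows can drive the REST path instead: loop over them and POST each to `/v1/load_lora_adapter` after the server reports healthy.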
A/B Testing Fine-Tuned Models
Register two adapter aliases for the same customer (tenant-abc-v1, tenant-abc-v2). Split traffic at the API gateway layer using a request header or simple percentage split:
```python
import random

def get_adapter_alias(customer_id: str, ab_config: dict) -> str:
    """Route a customer to an adapter version based on an A/B split."""
    split = ab_config.get(customer_id, {"v1": 1.0})  # default: 100% v1
    total_weight = sum(split.values())
    if abs(total_weight - 1.0) > 1e-6:
        raise ValueError(
            f"A/B weights for customer {customer_id!r} sum to {total_weight:.4f}, must equal 1.0"
        )
    r = random.random()
    cumulative = 0.0
    for version, weight in split.items():
        cumulative += weight
        if r < cumulative:
            return f"tenant-{customer_id}-{version}"
    # Unreachable after weight validation, but return the last version as a safety net
    last_version = list(split.keys())[-1]
    return f"tenant-{customer_id}-{last_version}"
```

No model redeployment needed. Both adapters coexist in the same vLLM instance. Switch the split by updating the config, not the server.
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| OOM on adapter load | --max-lora-rank too high | Set to max rank across all adapters, not higher |
| High TTFT on cold adapter | Adapter not in GPU cache | Increase --max-loras or pre-warm with dummy requests at startup |
| Adapter alias not found | Not registered at startup | Use /v1/load_lora_adapter REST endpoint at runtime |
| Wrong adapter served | Alias collision | Use namespaced aliases like tenant-{id}-v{version} |
| S3 cold load latency | Network round trip on first request | Pre-warm adapters at startup; set --max-cpu-loras to buffer in CPU RAM |
| Adapters incompatible with base model | Different base model or dtype used during training | All adapters must match base model architecture and dtype exactly |
Multi-adapter LoRA serving makes per-customer fine-tuning economically viable: one H100 on Spheron replaces 100 separate model instances. Provision an H100 or A100 in minutes and start serving your adapters.
