Most LoRA guides end at trainer.train(). What happens after, serving those adapters to real users at scale, is where most teams get stuck. The economics are brutal without a good answer: if you spin up a dedicated model instance per customer, you need 100 GPUs for 100 customers. With multi-adapter serving, you need one. The math is that simple, and this guide walks through everything you need to make it work in production.
If you haven't built your adapters yet, start with our LLM fine-tuning guide first. If you already have adapters and want to serve them, keep reading.
The Production Gap: Why LoRA Guides Stop at Training
Fine-tuning is well-documented because the tooling is mature and the problem is self-contained. You have a dataset, you run training, you get an adapter file. Done. The problem comes next: you now have dozens or hundreds of these adapter files, one per customer or use case, and you need to serve them under production load.
The naive answer is to run a separate inference server for each adapter. At small scale, that works. At 10 customers it's expensive. At 100 customers it's financially unsustainable. A dedicated Llama 3.1 8B instance per customer means 100 copies of a 16GB model, sitting in VRAM across 100 GPUs, most of them idle most of the time.
Multi-adapter serving is the pattern that breaks this equation. One base model instance in VRAM, adapters loaded on demand per request, customers isolated by the model routing layer. This is what vLLM's LoRA support was designed for.
All adapters must share the same base model architecture and dtype. A mixed-base setup (some adapters on Llama 3.1 8B, others on Llama 3.1 70B) requires separate vLLM instances. Plan your fine-tuning pipeline around a single base model if you want to consolidate serving.
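A cheap guard against mixed-base mistakes is to validate each adapter before registering it. PEFT writes the training base model into `adapter_config.json`, so a startup check can reject mismatched adapters early. A minimal sketch, assuming PEFT-format adapter directories:

```python
import json
from pathlib import Path

def check_adapter_base(adapter_dir: str, expected_base: str) -> str:
    """Verify a PEFT adapter was trained against the base model this server runs."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    trained_base = config.get("base_model_name_or_path", "")
    if trained_base != expected_base:
        raise ValueError(
            f"{adapter_dir}: trained on {trained_base!r}, server runs {expected_base!r}"
        )
    return trained_base
```

Run this over every adapter directory before building your `--lora-modules` list; a mismatch caught here is a config error, while a mismatch caught at load time is a customer-facing outage.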
Multi-Adapter Architecture: One Base Model, Hundreds of Adapters
The architecture is conceptually simple. The base model weights sit in GPU HBM, frozen. Each LoRA adapter is a small set of low-rank matrices (the A and B matrices per attention layer) that represent the fine-tuned delta. At inference time:
- A request arrives with a `model` field specifying the adapter alias (e.g., `customer-a`).
- vLLM's LoRAManager checks whether `customer-a`'s weights are in GPU memory. If yes, they are merged into the base model's computation for this request. If not, they are loaded from CPU RAM or disk.
- The request runs through the base model with those delta weights applied as residual additions to the attention outputs.
- After the request completes, the adapter stays cached (LRU eviction when capacity is exceeded).
The key architectural point: adapters share the base model's KV cache and all attention layers. Only the delta weights are per-customer. This is why the VRAM budget is so favorable.
```
Base Model (frozen, ~16 GB for 8B FP16)
 |
 +-- LoRA Delta: customer-a (~60 MB, rank 16)
 +-- LoRA Delta: customer-b (~60 MB, rank 16)
 +-- LoRA Delta: customer-c (~60 MB, rank 16)
 +-- ... (up to hundreds more in CPU RAM)
 |
KV Cache (shared, ~4-8 GB for typical batch sizes)
```

vLLM pre-allocates GPU memory buffers sized to `--max-lora-rank` at startup, not per-adapter. If any adapter has rank 64 but most are rank 16, all adapters get rank-64 buffers. Set `--max-lora-rank` to the actual maximum across your adapter set, not higher. Oversizing it wastes significant VRAM.
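The per-adapter sizes can be sanity-checked with a back-of-envelope parameter count: each adapted projection of shape `(d_in, d_out)` gets an A matrix of `d_in x rank` and a B matrix of `rank x d_out`. A sketch assuming adapters target only the four attention projections (actual sizes depend on which modules were adapted during fine-tuning; adapters that also target the MLP projections come out roughly 3x larger, which is consistent with the ~60 MB figures above):

```python
def lora_param_count(rank: int, layers: int, shapes: list) -> int:
    """Total LoRA parameters: each (d_in, d_out) projection contributes
    rank * (d_in + d_out) parameters per layer (A and B matrices)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * layers

# Llama 3.1 8B attention projections: hidden dim 4096, GQA k/v dim 1024
# (attention-only targets assumed here)
ATTN_SHAPES = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]
params = lora_param_count(16, 32, ATTN_SHAPES)
print(f"{params:,} params, ~{params * 2 / 1024**2:.0f} MB in FP16")
# -> 13,631,488 params, ~26 MB in FP16
```

Either way, an adapter is three orders of magnitude smaller than the 16 GB base model, which is the whole reason the tiered cache works.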
GPU Memory Math: How 100 Adapters Fit on One H100
Here's the VRAM breakdown for a typical production setup with Llama 3.1 8B and 100 LoRA adapters:
| Component | Size |
|---|---|
| Llama 3.1 8B base (FP16) | ~16 GB |
| 8 x LoRA r=16 adapters in GPU | ~0.5 GB total |
| 92 x LoRA adapters in CPU RAM | ~6 GB total |
| KV cache (batch of 64, 8K context) | ~4 GB |
| Activations and buffers | ~2 GB |
| Total GPU VRAM | ~22.5 GB |
Compare that to the alternative: 100 separate Llama 3.1 8B instances at 16GB each = 1,600 GB across 20 H100s. The multi-adapter setup fits on a single A100 80GB with room to spare.
VRAM scales with adapter rank. Higher rank means more parameters per adapter and more VRAM per adapter in GPU cache:
| Adapter Rank | Size per Adapter (8B model) | Max Adapters on H100 80GB (w/ base model) |
|---|---|---|
| r=8 | ~30 MB | 200+ (GPU cache) |
| r=16 | ~60 MB | 100+ (GPU cache) |
| r=32 | ~120 MB | 50+ (GPU cache) |
| r=64 | ~240 MB | 25+ (GPU cache) |
In practice, keep only the most frequently requested adapters in GPU cache (--max-loras). Less-frequent adapters live in CPU RAM (--max-cpu-loras) and load on demand. This tiered approach handles hundreds of adapters with minimal VRAM overhead.
For a deeper breakdown of how VRAM is split between weights, KV cache, and activations, see our GPU memory requirements for LLMs guide.
vLLM LoRA Serving: Configuration and Flags
vLLM has supported LoRA since v0.3.0. Dynamic adapter loading via REST API is available in recent releases (v0.6.2+). Use the latest release.
Full production launch command:
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v /path/to/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype fp16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a \
    customer-b=/adapters/customer-b \
    customer-c=/adapters/customer-c \
  --max-num-seqs 64
```

What each LoRA flag does:

- `--enable-lora`: activates LoRA support in vLLM's scheduler and LoRAManager.
- `--max-loras`: number of adapters that can be in GPU memory at once. Higher means more VRAM usage; lower means more adapter swaps (and latency on cache misses).
- `--max-lora-rank`: maximum rank across all registered adapters. vLLM pre-allocates buffers for this rank at startup. Set it to your actual max rank, not higher.
- `--max-cpu-loras`: adapters held in CPU RAM as an intermediate cache tier before eviction to disk. Acts as a buffer between GPU cache and disk/S3.
- `--lora-modules`: space-separated `alias=path` pairs. Paths can be local directories or S3 URIs (`s3://bucket/path/to/adapter`).
Sending a request to a specific adapter uses the standard OpenAI API, with the model field set to the adapter alias:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="customer-a",  # routes to the customer-a LoRA adapter
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(response.choices[0].message.content)
```

To add adapters dynamically at runtime without restarting the server, use the REST API (vLLM v0.6.2+). This requires the `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` environment variable to be set when the server starts:
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -e VLLM_ALLOW_RUNTIME_LORA_UPDATING=True \
  -v /path/to/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-cpu-loras 64
```

Then register new adapters without restarting:
```bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "customer-d", "lora_path": "/adapters/customer-d"}'
```

This is how you handle new customer onboarding without downtime: upload the adapter, call the endpoint, start serving.
SGLang LoRA Serving: The Weight Loading Overlap Advantage
SGLang v0.5.9 introduced weight loading overlap for LoRA, reducing time-to-first-token by up to 78% vs sequential loading. The feature overlaps the computation of the first few layers with loading remaining LoRA weights onto the GPU, hiding most of the adapter load latency. Full framework comparison in our vLLM vs TensorRT-LLM vs SGLang benchmarks.
SGLang launch command for LoRA serving:
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /path/to/adapters:/adapters \
  lmsysorg/sglang:v0.5.9-cu124-runtime \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --lora-paths customer-a=/adapters/customer-a \
    customer-b=/adapters/customer-b \
  --max-loras-per-batch 4 \
  --port 8000
```

When to choose SGLang over vLLM for LoRA:
- Your workload is latency-sensitive (TTFT matters more than throughput).
- You have a small number of active adapters with high request concurrency per adapter.
- You're already using SGLang for its RadixAttention prefix caching.
When to stick with vLLM:
- You need dynamic adapter loading via REST API (SGLang's runtime adapter API is less mature).
- You have a large adapter registry (50+) that needs LRU eviction and CPU offload.
- You need the widest model compatibility.
Step-by-Step: Deploy on Spheron GPU Cloud
1. Provision a GPU Instance
Go to app.spheron.ai, select H100 SXM5 or A100 80GB from the GPU catalog. SSH into the instance and verify:
```bash
nvidia-smi
# Should show your GPU with full VRAM available
```

For a 7-8B base model with up to 50 adapters, the A100 80GB at $1.05/hr is the right choice. For 13B+ models or adapters at rank 32+, use the H100 SXM5 at $2.40/hr.
2. Install vLLM
```bash
pip install vllm
```

LoRA support is built-in from v0.3.0+. No extras needed.
3. Upload Your LoRA Adapters
Three options:
```bash
# Option 1: scp from local machine
scp -r ./my-adapter user@your-instance:/home/user/adapters/customer-a

# Option 2: from Hugging Face Hub
huggingface-cli download my-org/customer-a-adapter --local-dir /adapters/customer-a

# Option 3: from S3 (use the S3 URI directly in --lora-modules)
# s3://your-bucket/adapters/customer-a
```

4. Launch the Multi-Adapter Server
```bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v /home/user/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype fp16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a \
    customer-b=/adapters/customer-b \
  --max-num-seqs 64
```

5. Test Per-Adapter Routing
```bash
# Test customer-a adapter
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "customer-a",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```

Or using the Python SDK:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
for adapter in ["customer-a", "customer-b"]:
    response = client.chat.completions.create(
        model=adapter,
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(f"{adapter}: {response.choices[0].message.content}")
```

6. Systemd Service for Persistence
Create /etc/systemd/system/vllm-lora.service so the server survives SSH disconnection:
```ini
[Unit]
Description=vLLM LoRA Multi-Adapter Server
After=network.target

[Service]
Type=simple
Restart=on-failure
RestartSec=5
EnvironmentFile=/etc/vllm-lora.env
ExecStart=/usr/bin/docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} \
  -v /home/user/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype fp16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a

[Install]
WantedBy=multi-user.target
```

Before enabling the service, create /etc/vllm-lora.env with restricted permissions to keep your token out of the unit file:
```bash
sudo bash -c 'echo "HUGGING_FACE_HUB_TOKEN=your_actual_token" > /etc/vllm-lora.env'
sudo chmod 600 /etc/vllm-lora.env
```

Then enable and start:
```bash
sudo systemctl daemon-reload
sudo systemctl enable vllm-lora
sudo systemctl start vllm-lora
sudo journalctl -u vllm-lora -f  # follow logs
```

`Restart=on-failure` means systemd restarts the service if it crashes, but `systemctl stop vllm-lora` performs a clean shutdown without an immediate restart.
Adapter Hot-Swapping: Request-Level Routing
vLLM's LoRAManager handles adapter selection at the request level. When a request arrives:
- The `model` field maps to an adapter alias registered at startup (or added via `/v1/load_lora_adapter`).
- LoRAManager checks the GPU cache. If the adapter is there, it proceeds; cache-hit latency is sub-millisecond.
- On a cache miss, it loads from CPU RAM first (fast, tens of milliseconds), then from disk or S3 if not in CPU RAM (hundreds of milliseconds for S3).
- GPU cache eviction is LRU: if all 8 slots are full and a 9th adapter is needed, the least recently used adapter is evicted to CPU RAM.
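The eviction policy can be illustrated with a toy model (this is not vLLM's actual implementation, just the LRU behavior it exhibits): slots fill up, hits refresh recency, and a miss against a full cache pushes the stalest adapter down a tier.

```python
from collections import OrderedDict

class ToyLoraCache:
    """Toy LRU model of a GPU adapter tier with fixed slots (--max-loras)."""
    def __init__(self, max_loras: int):
        self.max_loras = max_loras
        self.gpu = OrderedDict()   # alias -> loaded flag, ordered by recency
        self.evicted = []          # aliases pushed back to the CPU tier

    def request(self, alias: str) -> str:
        if alias in self.gpu:
            self.gpu.move_to_end(alias)  # cache hit: refresh recency
            return "hit"
        if len(self.gpu) >= self.max_loras:
            victim, _ = self.gpu.popitem(last=False)  # evict least recently used
            self.evicted.append(victim)
        self.gpu[alias] = True
        return "miss"

cache = ToyLoraCache(max_loras=2)
print([cache.request(a) for a in ["a", "b", "a", "c"]])
# -> ['miss', 'miss', 'hit', 'miss']
print(cache.evicted)  # -> ['b']  ("b" was least recently used when "c" arrived)
```

The practical takeaway: if your traffic concentrates on a handful of adapters, a small `--max-loras` serves almost everything from GPU cache; a flat access pattern across many adapters is what drives eviction churn.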
Pre-warming: to avoid cold-adapter latency for important customers, send a dummy request per adapter at startup:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
adapters = ["customer-a", "customer-b", "customer-c"]
for adapter in adapters:
    client.chat.completions.create(
        model=adapter,
        messages=[{"role": "user", "content": "warmup"}],
        max_tokens=1,
    )
    print(f"Pre-warmed: {adapter}")
```

For production deployments, build an adapter registry: a mapping from customer ID to adapter path, stored in Postgres or Redis. A thin FastAPI wrapper resolves `customer_id` to an adapter alias before forwarding to vLLM:
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import openai

app = FastAPI()
client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

# In practice, load this from your database
ADAPTER_REGISTRY = {
    "cust_abc123": "customer-a",
    "cust_def456": "customer-b",
}

class ChatRequest(BaseModel):
    customer_id: str
    messages: list

@app.post("/v1/chat")
async def chat(request: ChatRequest):
    adapter_alias = ADAPTER_REGISTRY.get(request.customer_id)
    if not adapter_alias:
        raise HTTPException(status_code=404, detail="unknown customer")
    response = await client.chat.completions.create(
        model=adapter_alias,
        messages=request.messages,
    )
    return response.model_dump()
```

Use namespaced aliases (`tenant-{id}-v{version}`) to avoid collisions across customers and model versions.
Cost Comparison: Multi-Adapter vs Per-Instance
Here's what the math looks like at 100 customers, running 24/7 on-demand on Spheron:
| Approach | GPU Setup | Monthly Cost (100 customers) |
|---|---|---|
| 1 model instance per customer | 100x A100 80GB | $75,600/month |
| Multi-adapter on 1 H100 SXM5 | 1x H100 SXM5 | $1,728/month |
| Multi-adapter on 2 H100 SXM5 (HA) | 2x H100 SXM5 | $3,456/month |
The per-instance approach costs $75,600/month (100 A100s at $1.05/hr). One H100 with multi-adapter serving costs $1,728/month. That's a 97.7% cost reduction for the same 100 customers.
In practice, one H100 handles 100 customers at moderate concurrency (a few dozen simultaneous requests). If your traffic is higher, two H100s behind a load balancer gives you high availability and headroom for growth at $3,456/month, still 95.4% cheaper than the per-instance approach. For more ways to reduce GPU spend across your entire stack, see the GPU cost optimization playbook.
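The table's figures follow from a 720-hour (30-day) billing month and the per-GPU rates quoted above. A quick check of the arithmetic:

```python
HOURS_PER_MONTH = 720  # 30-day month, matching the figures above

def monthly_cost(gpu_count: int, hourly_rate: float) -> float:
    return gpu_count * hourly_rate * HOURS_PER_MONTH

per_instance = monthly_cost(100, 1.05)  # 100x A100 80GB at $1.05/hr
multi_adapter = monthly_cost(1, 2.40)   # 1x H100 SXM5 at $2.40/hr
savings = 1 - multi_adapter / per_instance
print(f"${per_instance:,.0f} vs ${multi_adapter:,.0f} ({savings:.1%} cheaper)")
# -> $75,600 vs $1,728 (97.7% cheaper)
```

Swap in your own rates and customer count; the conclusion holds as long as adapters stay small relative to the base model.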
Rent H100 → | Rent A100 → | View all pricing →
Pricing fluctuates based on GPU availability. The prices above are based on 28 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
Production Patterns: Scaling, Registries, and A/B Testing
Auto-Scaling
vLLM exposes a Prometheus-compatible /metrics endpoint. Key metrics for scaling decisions:
- `vllm:num_requests_waiting`: queue depth. If this is consistently above 0, you need more capacity.
- `vllm:gpu_cache_usage_perc`: KV cache fill rate. Above 90% means you're hitting memory limits.
- `vllm:time_to_first_token_seconds`: latency by percentile.
Scale horizontally (multiple H100 instances behind a load balancer) when queue depth exceeds a threshold. A simple rule: if num_requests_waiting averages above 5 for more than 60 seconds, add another instance.
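A scale-up check against the `/metrics` endpoint can be a few lines. A simplified sketch: it parses the Prometheus text format naively (real metric lines may carry labels, and production code should average `num_requests_waiting` over a window rather than act on one sample):

```python
def parse_metric(metrics_text: str, name: str) -> float:
    """Pull a single gauge value out of Prometheus text exposition format
    (simplified: assumes unlabeled '<name> <value>' lines)."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)

def should_scale_up(metrics_text: str, queue_threshold: float = 5.0) -> bool:
    # Matches the rule above: queue depth persistently over 5 means add capacity
    return parse_metric(metrics_text, "vllm:num_requests_waiting") > queue_threshold

sample = "vllm:num_requests_waiting 7.0\nvllm:gpu_cache_usage_perc 0.82"
print(should_scale_up(sample))  # -> True
```

In practice you would fetch the text from `http://<instance>:8000/metrics` on a timer and require the threshold to hold for 60 seconds before provisioning a new instance.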
For full Prometheus and Grafana monitoring setup, see the vLLM production deployment guide.
Adapter Registries
Store adapter metadata in a database. Minimal Postgres schema:
```sql
CREATE TABLE lora_adapters (
    id SERIAL PRIMARY KEY,
    customer_id VARCHAR(128) NOT NULL,
    alias VARCHAR(256) NOT NULL UNIQUE,
    path TEXT NOT NULL,  -- local path or s3:// URI
    base_model VARCHAR(256) NOT NULL,
    rank INTEGER NOT NULL,
    version INTEGER DEFAULT 1,
    created_at TIMESTAMP DEFAULT NOW()
);
```

When a customer onboards: upload their adapter to S3, insert a row, and call `/v1/load_lora_adapter` to register the adapter with the running vLLM instance. No restart needed.
When the server restarts, reload all adapters from the registry at startup using the --lora-modules flag or the REST API in a startup script.
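One way to sketch that startup step: build the `--lora-modules` argument list from registry rows (the `alias` and `path` column names follow the schema above; in a real script the rows would come from a `SELECT` against `lora_adapters`):

```python
def lora_module_args(rows: list) -> list:
    """Build vLLM --lora-modules launch arguments from adapter registry rows."""
    pairs = [f"{row['alias']}={row['path']}" for row in rows]
    return ["--lora-modules", *pairs] if pairs else []

# Example rows as they might come back from the lora_adapters table
rows = [
    {"alias": "customer-a", "path": "s3://adapters/customer-a"},
    {"alias": "customer-b", "path": "/adapters/customer-b"},
]
print(" ".join(lora_module_args(rows)))
# -> --lora-modules customer-a=s3://adapters/customer-a customer-b=/adapters/customer-b
```

The same rows can drive the REST path instead: loop over them and POST each to `/v1/load_lora_adapter` after the server reports healthy.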
A/B Testing Fine-Tuned Models
Register two adapter aliases for the same customer (tenant-abc-v1, tenant-abc-v2). Split traffic at the API gateway layer using a request header or simple percentage split:
```python
import random

def get_adapter_alias(customer_id: str, ab_config: dict) -> str:
    """Route a customer to an adapter version based on an A/B split."""
    split = ab_config.get(customer_id, {"v1": 1.0})  # default: 100% v1
    total_weight = sum(split.values())
    if abs(total_weight - 1.0) > 1e-6:
        raise ValueError(
            f"A/B weights for customer {customer_id!r} sum to {total_weight:.4f}, must equal 1.0"
        )
    r = random.random()
    cumulative = 0.0
    for version, weight in split.items():
        cumulative += weight
        if r < cumulative:
            return f"tenant-{customer_id}-{version}"
    # Unreachable after weight validation, but return the last version as a safety net
    last_version = list(split.keys())[-1]
    return f"tenant-{customer_id}-{last_version}"
```

No model redeployment needed. Both adapters coexist in the same vLLM instance. Switch the split by updating the config, not the server.
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| OOM on adapter load | --max-lora-rank too high | Set to max rank across all adapters, not higher |
| High TTFT on cold adapter | Adapter not in GPU cache | Increase --max-loras or pre-warm with dummy requests at startup |
| Adapter alias not found | Not registered at startup | Use /v1/load_lora_adapter REST endpoint at runtime |
| Wrong adapter served | Alias collision | Use namespaced aliases like tenant-{id}-v{version} |
| S3 cold load latency | Network round trip on first request | Pre-warm adapters at startup; set --max-cpu-loras to buffer in CPU RAM |
| Adapters incompatible with base model | Different base model or dtype used during training | All adapters must match base model architecture and dtype exactly |
Multi-adapter LoRA serving makes per-customer fine-tuning economically viable: one H100 on Spheron replaces 100 separate model instances. Provision an H100 or A100 in minutes and start serving your adapters.
