Tutorial

LoRA Multi-Adapter Serving: Fine-Tune Once, Serve 100 Customers on One GPU

Written by Mitrasish, Co-founder · Mar 28, 2026

Tags: LoRA, LLM Serving, vLLM, SGLang, GPU Cloud, Fine-Tuning, Multi-Tenant AI, H100

Most LoRA guides end at trainer.train(). What happens after, serving those adapters to real users at scale, is where most teams get stuck. The economics are brutal without a good answer: if you spin up a dedicated model instance per customer, you need 100 GPUs for 100 customers. With multi-adapter serving, you need one. The math is that simple, and this guide walks through everything you need to make it work in production.

If you haven't built your adapters yet, start with our LLM fine-tuning guide first. If you already have adapters and want to serve them, keep reading.

The Production Gap: Why LoRA Guides Stop at Training

Fine-tuning is well-documented because the tooling is mature and the problem is self-contained. You have a dataset, you run training, you get an adapter file. Done. The problem comes next: you now have dozens or hundreds of these adapter files, one per customer or use case, and you need to serve them under production load.

The naive answer is to run a separate inference server for each adapter. At small scale, that works. At 10 customers it's expensive. At 100 customers it's financially unsustainable. A dedicated Llama 3.1 8B instance per customer means 100 copies of a 16GB model, sitting in VRAM across 100 GPUs, most of them idle most of the time.

Multi-adapter serving is the pattern that breaks this equation. One base model instance in VRAM, adapters loaded on demand per request, customers isolated by the model routing layer. This is what vLLM's LoRA support was designed for.

All adapters must share the same base model architecture and dtype. A mixed-base setup (some adapters on Llama 3.1 8B, others on Llama 3.1 70B) requires separate vLLM instances. Plan your fine-tuning pipeline around a single base model if you want to consolidate serving.
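This constraint is worth checking programmatically before you register a batch of adapters. A minimal sketch, assuming your adapters were saved with PEFT (which writes an adapter_config.json containing base_model_name_or_path and the rank r alongside the weights):

```python
import json
from pathlib import Path

def check_adapter_compat(adapter_dirs, expected_base):
    """Verify every PEFT adapter targets the same base model.

    Reads adapter_config.json (written by PEFT at save time) and compares
    base_model_name_or_path against the expected base. Returns a list of
    (adapter_name, problem) tuples; an empty list means all adapters match.
    """
    problems = []
    for d in map(Path, adapter_dirs):
        cfg = json.loads((d / "adapter_config.json").read_text())
        base = cfg.get("base_model_name_or_path")
        if base != expected_base:
            problems.append((d.name, f"trained on {base}, expected {expected_base}"))
    return problems
```

Run this as a gate in your onboarding pipeline; a mismatched adapter caught here is far cheaper than one caught by garbage output in production.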

Multi-Adapter Architecture: One Base Model, Hundreds of Adapters

The architecture is conceptually simple. The base model weights sit in GPU HBM, frozen. Each LoRA adapter is a small set of low-rank matrices (the A and B matrices per attention layer) that represent the fine-tuned delta. At inference time:

  1. A request arrives with a model field specifying the adapter alias (e.g., customer-a).
  2. vLLM's LoRAManager checks whether customer-a's weights are already in GPU memory. If they are, the low-rank deltas are applied to the base model's computation for this request. If not, they are first loaded from CPU RAM or disk.
  3. The request runs through the base model with those delta weights applied as residual additions to the attention outputs.
  4. After the request completes, the adapter stays cached (LRU eviction when capacity is exceeded).

The key architectural point: adapters share the base model's KV cache and all attention layers. Only the delta weights are per-customer. This is why the VRAM budget is so favorable.

Base Model (frozen, ~16 GB for 8B FP16)
     |
     +-- LoRA Delta: customer-a (~60 MB, rank 16)
     +-- LoRA Delta: customer-b (~60 MB, rank 16)
     +-- LoRA Delta: customer-c (~60 MB, rank 16)
     +-- ... (up to hundreds more in CPU RAM)
     |
KV Cache (shared, ~4-8 GB for typical batch sizes)

vLLM pre-allocates GPU memory buffers sized to --max-lora-rank at startup, not per-adapter. If any adapter has rank 64 but most are rank 16, all adapters get rank-64 buffers. Set --max-lora-rank to the actual maximum across your adapter set, not higher. Oversizing this wastes significant VRAM.
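A back-of-envelope estimator makes the per-adapter sizes concrete. This is a sketch, not a measurement: real sizes depend on which modules the fine-tune targeted (attention-only lands near ~27 MB at r=16, while also targeting the MLP projections roughly triples that, which is why figures such as the ~60 MB used in this guide vary by setup). The Llama 3.1 8B geometry below (32 layers, hidden dim 4096, GQA key/value width 1024) is assumed:

```python
def lora_params(rank, layers, module_dims):
    """Total LoRA parameter count: each targeted linear layer of shape
    (d_in, d_out) gains A (d_in x rank) and B (rank x d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_dims)
    return per_layer * layers

# Assumed Llama 3.1 8B geometry; q, k, v, o projections only (k/v are GQA)
ATTN = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

params = lora_params(rank=16, layers=32, module_dims=ATTN)
print(f"{params:,} params, ~{params * 2 / 1e6:.0f} MB in FP16")  # ~27 MB, attention-only
```

Doubling the rank doubles every term, which is exactly the linear scaling the rank table below shows.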

GPU Memory Math: How 100 Adapters Fit on One H100

Here's the VRAM breakdown for a typical production setup with Llama 3.1 8B and 100 LoRA adapters:

Component                          | Size
Llama 3.1 8B base (FP16)           | ~16 GB
8 x LoRA r=16 adapters in GPU      | ~0.5 GB total
92 x LoRA adapters in CPU RAM      | ~6 GB total (system RAM, not VRAM)
KV cache (batch of 64, 8K context) | ~4 GB
Activations and buffers            | ~2 GB
Total GPU VRAM                     | ~22.5 GB

Compare that to the alternative: 100 separate Llama 3.1 8B instances at 16GB each = 1,600 GB across 20 H100s. The multi-adapter setup fits on a single A100 80GB with room to spare.

VRAM scales with adapter rank. Higher rank means more parameters per adapter and more VRAM per adapter in GPU cache:

Adapter Rank | Size per Adapter (8B model) | Max Adapters on H100 80GB (w/ base model)
r=8          | ~30 MB                      | 200+ (GPU cache)
r=16         | ~60 MB                      | 100+ (GPU cache)
r=32         | ~120 MB                     | 50+ (GPU cache)
r=64         | ~240 MB                     | 25+ (GPU cache)

In practice, keep only the most frequently requested adapters in GPU cache (--max-loras). Less-frequent adapters live in CPU RAM (--max-cpu-loras) and load on demand. This tiered approach handles hundreds of adapters with minimal VRAM overhead.

For a deeper breakdown of how VRAM is split between weights, KV cache, and activations, see our GPU memory requirements for LLMs guide.

vLLM LoRA Serving: Configuration and Flags

vLLM has supported LoRA since v0.3.0. Dynamic adapter loading via REST API is available in recent releases (v0.6.2+). Use the latest release.

Full production launch command:

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v /path/to/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a \
                 customer-b=/adapters/customer-b \
                 customer-c=/adapters/customer-c \
  --max-num-seqs 64

What each LoRA flag does:

  • --enable-lora: activates LoRA support in vLLM's scheduler and LoRAManager.
  • --max-loras: number of adapters that can be in GPU memory at once. Higher means more VRAM usage, lower means more adapter swaps (and latency on cache misses).
  • --max-lora-rank: maximum rank across all registered adapters. vLLM pre-allocates buffers for this rank at startup. Set it to your actual max rank, not higher.
  • --max-cpu-loras: adapters held in CPU RAM as an intermediate cache tier before eviction to disk. Acts as a buffer between GPU cache and disk/S3.
  • --lora-modules: space-separated alias=path pairs. Paths can be local directories or S3 URIs (s3://bucket/path/to/adapter).

Sending a request to a specific adapter uses the standard OpenAI API, with the model field set to the adapter alias:

python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="customer-a",  # routes to the customer-a LoRA adapter
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(response.choices[0].message.content)

To add adapters dynamically at runtime without restarting the server, use the REST API (vLLM v0.6.2+). This requires the VLLM_ALLOW_RUNTIME_LORA_UPDATING=True environment variable to be set when the server starts:

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -e VLLM_ALLOW_RUNTIME_LORA_UPDATING=True \
  -v /path/to/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-cpu-loras 64

Then register new adapters without restarting:

bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "customer-d", "lora_path": "/adapters/customer-d"}'

This is how you handle new customer onboarding without downtime: upload the adapter, call the endpoint, start serving.
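The onboarding call can be scripted with just the standard library. A minimal helper (the endpoint and payload are the ones shown above; the server URL and adapter path are placeholders for your deployment):

```python
import json
import urllib.request

def register_adapter(server, alias, path):
    """Register a LoRA adapter with a running vLLM server (v0.6.2+,
    started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True). Returns the
    HTTP status code from /v1/load_lora_adapter."""
    req = urllib.request.Request(
        f"{server}/v1/load_lora_adapter",
        data=json.dumps({"lora_name": alias, "lora_path": path}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# register_adapter("http://localhost:8000", "customer-d", "/adapters/customer-d")
```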

SGLang LoRA Serving: The Weight Loading Overlap Advantage

SGLang v0.5.9 introduced weight loading overlap for LoRA, reducing time-to-first-token by up to 78% vs sequential loading. The feature overlaps the computation of the first few layers with loading remaining LoRA weights onto the GPU, hiding most of the adapter load latency. Full framework comparison in our vLLM vs TensorRT-LLM vs SGLang benchmarks.

SGLang launch command for LoRA serving:

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -v /path/to/adapters:/adapters \
  lmsysorg/sglang:v0.5.9-cu124-runtime \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --lora-paths customer-a=/adapters/customer-a \
               customer-b=/adapters/customer-b \
  --max-loras-per-batch 4 \
  --port 8000

When to choose SGLang over vLLM for LoRA:

  • Your workload is latency-sensitive (TTFT matters more than throughput).
  • You have a small number of active adapters with high request concurrency per adapter.
  • You're already using SGLang for its RadixAttention prefix caching.

When to stick with vLLM:

  • You need dynamic adapter loading via REST API (SGLang's runtime adapter API is less mature).
  • You have a large adapter registry (50+) that needs LRU eviction and CPU offload.
  • You need the widest model compatibility.

Step-by-Step: Deploy on Spheron GPU Cloud

1. Provision a GPU Instance

Go to app.spheron.ai and select an H100 SXM5 or A100 80GB from the GPU catalog. SSH into the instance and verify the GPU is visible:

bash
nvidia-smi
# Should show your GPU with full VRAM available

For a 7-8B base model with up to 50 adapters, the A100 80GB at $1.05/hr is the right choice. For 13B+ models or adapters at rank 32+, use the H100 SXM5 at $2.40/hr.

2. Install vLLM

bash
pip install vllm

LoRA support is built-in from v0.3.0+. No extras needed.

3. Upload Your LoRA Adapters

Three options:

bash
# Option 1: scp from local machine
scp -r ./my-adapter user@your-instance:/home/user/adapters/customer-a

# Option 2: from Hugging Face Hub
huggingface-cli download my-org/customer-a-adapter --local-dir /adapters/customer-a

# Option 3: from S3 (use the S3 URI directly in --lora-modules)
# s3://your-bucket/adapters/customer-a

4. Launch the Multi-Adapter Server

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v /home/user/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a \
                 customer-b=/adapters/customer-b \
  --max-num-seqs 64

5. Test Per-Adapter Routing

bash
# Test customer-a adapter
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "customer-a",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'

Or using the Python SDK:

python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

for adapter in ["customer-a", "customer-b"]:
    response = client.chat.completions.create(
        model=adapter,
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(f"{adapter}: {response.choices[0].message.content}")

6. Systemd Service for Persistence

Create /etc/systemd/system/vllm-lora.service so the server survives SSH disconnection:

ini
[Unit]
Description=vLLM LoRA Multi-Adapter Server
After=network.target

[Service]
Type=simple
Restart=on-failure
RestartSec=5
EnvironmentFile=/etc/vllm-lora.env
ExecStart=/usr/bin/docker run --rm --name vllm-lora --gpus all --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} \
  -v /home/user/adapters:/adapters \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 64 \
  --lora-modules customer-a=/adapters/customer-a
ExecStop=/usr/bin/docker stop vllm-lora

[Install]
WantedBy=multi-user.target

Before enabling the service, create /etc/vllm-lora.env with restricted permissions to keep your token out of the unit file:

bash
sudo bash -c 'echo "HUGGING_FACE_HUB_TOKEN=your_actual_token" > /etc/vllm-lora.env'
sudo chmod 600 /etc/vllm-lora.env

Then enable and start:

bash
sudo systemctl daemon-reload
sudo systemctl enable vllm-lora
sudo systemctl start vllm-lora
sudo journalctl -u vllm-lora -f  # follow logs

Restart=on-failure means systemd restarts the service if it crashes, but systemctl stop vllm-lora performs a clean shutdown without an immediate restart.

Adapter Hot-Swapping: Request-Level Routing

vLLM's LoRAManager handles adapter selection at the request level. When a request arrives:

  1. The model field maps to an adapter alias registered at startup (or added via /v1/load_lora_adapter).
  2. LoRAManager checks GPU cache. If the adapter is there, it proceeds. Cache hit latency is sub-millisecond.
  3. On a cache miss, it loads from CPU RAM first (fast, tens of milliseconds), then from disk or S3 if not in CPU RAM (hundreds of milliseconds for S3).
  4. GPU cache eviction is LRU. If 8 slots are full and a 9th adapter is needed, the least recently used adapter gets evicted to CPU RAM.

Pre-warming: to avoid cold-adapter latency for important customers, send a dummy request per adapter at startup:

python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
adapters = ["customer-a", "customer-b", "customer-c"]

for adapter in adapters:
    client.chat.completions.create(
        model=adapter,
        messages=[{"role": "user", "content": "warmup"}],
        max_tokens=1,
    )
    print(f"Pre-warmed: {adapter}")

For production deployments, build an adapter registry: a mapping from customer ID to adapter path, stored in Postgres or Redis. A thin FastAPI wrapper resolves customer_id to adapter alias before forwarding to vLLM:

python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import openai

app = FastAPI()
client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

# In practice, load this from your database
ADAPTER_REGISTRY = {
    "cust_abc123": "customer-a",
    "cust_def456": "customer-b",
}

class ChatRequest(BaseModel):
    customer_id: str
    messages: list

@app.post("/v1/chat")
async def chat(request: ChatRequest):
    adapter_alias = ADAPTER_REGISTRY.get(request.customer_id)
    if not adapter_alias:
        raise HTTPException(status_code=404, detail="unknown customer")

    response = await client.chat.completions.create(
        model=adapter_alias,
        messages=request.messages,
    )
    return response.model_dump()

Use namespaced aliases (tenant-{id}-v{version}) to avoid collisions across customers and model versions.

Cost Comparison: Multi-Adapter vs Per-Instance

Here's what the math looks like at 100 customers, running 24/7 on-demand on Spheron:

Approach                          | GPU Setup      | Monthly Cost (100 customers)
1 model instance per customer     | 100x A100 80GB | $75,600/month
Multi-adapter on 1 H100 SXM5      | 1x H100 SXM5   | $1,728/month
Multi-adapter on 2 H100 SXM5 (HA) | 2x H100 SXM5   | $3,456/month

The per-instance approach costs $75,600/month (100 A100s at $1.05/hr). One H100 with multi-adapter serving costs $1,728/month. That's a 97.7% cost reduction for the same 100 customers.

In practice, one H100 handles 100 customers at moderate concurrency (a few dozen simultaneous requests). If your traffic is higher, two H100s behind a load balancer give you high availability and headroom for growth at $3,456/month, still 95.4% cheaper than the per-instance approach. For more ways to reduce GPU spend across your entire stack, see the GPU cost optimization playbook.
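The table reduces to straightforward arithmetic (rates taken from the pricing above, assuming a 720-hour month):

```python
HOURS = 720  # 30-day month, running 24/7

per_instance = 100 * 1.05 * HOURS      # 100 dedicated A100s at $1.05/hr
multi_adapter = 1 * 2.40 * HOURS       # one H100 SXM5 at $2.40/hr
multi_adapter_ha = 2 * 2.40 * HOURS    # two H100s for high availability

savings = 1 - multi_adapter / per_instance
print(f"per-instance:  ${per_instance:,.0f}/mo")   # $75,600/mo
print(f"multi-adapter: ${multi_adapter:,.0f}/mo")  # $1,728/mo
print(f"savings: {savings:.1%}")                   # 97.7%
```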

Rent H100 → | Rent A100 → | View all pricing →

Pricing fluctuates based on GPU availability. The prices above are current as of 28 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Production Patterns: Scaling, Registries, and A/B Testing

Auto-Scaling

vLLM exposes a Prometheus-compatible /metrics endpoint. Key metrics for scaling decisions:

  • vllm:num_requests_waiting: queue depth. If this is consistently above 0, you need more capacity.
  • vllm:gpu_cache_usage_perc: KV cache fill rate. Above 90% means you're hitting memory limits.
  • vllm:time_to_first_token_seconds: latency by percentile.

Scale horizontally (multiple H100 instances behind a load balancer) when queue depth exceeds a threshold. A simple rule: if num_requests_waiting averages above 5 for more than 60 seconds, add another instance.
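That rule can be implemented as a small polling loop over /metrics, sketched here with only the standard library. The metric name is the one listed above; the Prometheus parsing is deliberately naive, and the scale-up action itself is left as a stub since it depends on your orchestration layer:

```python
import urllib.request

def parse_gauge(metrics_text, name):
    """Pull a single gauge value out of Prometheus text format.
    Matches the metric with or without labels; returns 0.0 if absent."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

def queue_depth(metrics_url):
    text = urllib.request.urlopen(metrics_url).read().decode()
    return parse_gauge(text, "vllm:num_requests_waiting")

def should_scale_up(samples, threshold=5.0):
    """The rule from the text: average queue depth above `threshold`
    across the sampling window means it's time to add an instance."""
    return sum(samples) / len(samples) > threshold

# Example loop (scale_up() is a stub for your orchestration layer):
# samples = []
# for _ in range(60):
#     samples.append(queue_depth("http://localhost:8000/metrics"))
#     time.sleep(1)
# if should_scale_up(samples):
#     scale_up()
```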

For full Prometheus and Grafana monitoring setup, see the vLLM production deployment guide.

Adapter Registries

Store adapter metadata in a database. Minimal Postgres schema:

sql
CREATE TABLE lora_adapters (
    id          SERIAL PRIMARY KEY,
    customer_id VARCHAR(128) NOT NULL,
    alias       VARCHAR(256) NOT NULL UNIQUE,
    path        TEXT NOT NULL,          -- local path or s3:// URI
    base_model  VARCHAR(256) NOT NULL,
    rank        INTEGER NOT NULL,
    version     INTEGER DEFAULT 1,
    created_at  TIMESTAMP DEFAULT NOW()
);

When a customer onboards: upload their adapter to S3, insert a row, call /v1/load_lora_adapter to register the adapter with the running vLLM instance. No restart needed.

When the server restarts, reload all adapters from the registry at startup using the --lora-modules flag or the REST API in a startup script.
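A startup reload sketch, assuming a list of (alias, path) rows already fetched from the lora_adapters table above (the database query itself is omitted; the endpoint is the /v1/load_lora_adapter API described earlier):

```python
import json
import urllib.request

def reload_adapters(server, rows):
    """Re-register every adapter from the registry after a server restart.
    `rows` is an iterable of (alias, path) pairs, e.g. the result of:
    SELECT alias, path FROM lora_adapters. Returns the aliases loaded."""
    loaded = []
    for alias, path in rows:
        req = urllib.request.Request(
            f"{server}/v1/load_lora_adapter",
            data=json.dumps({"lora_name": alias, "lora_path": path}).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            if resp.status == 200:
                loaded.append(alias)
    return loaded
```

Wire this into the systemd unit as an ExecStartPost step or a small sidecar script so a restarted server comes back with its full adapter set.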

A/B Testing Fine-Tuned Models

Register two adapter aliases for the same customer (tenant-abc-v1, tenant-abc-v2). Split traffic at the API gateway layer using a request header or simple percentage split:

python
import random

def get_adapter_alias(customer_id: str, ab_config: dict) -> str:
    """Route customer to adapter version based on A/B split."""
    split = ab_config.get(customer_id, {"v1": 1.0})  # default: 100% v1
    total_weight = sum(split.values())
    if abs(total_weight - 1.0) > 1e-6:
        raise ValueError(
            f"A/B weights for customer {customer_id!r} sum to {total_weight:.4f}, must equal 1.0"
        )
    r = random.random()
    cumulative = 0.0
    for version, weight in split.items():
        cumulative += weight
        if r < cumulative:
            return f"tenant-{customer_id}-{version}"
    # Unreachable after weight validation, but return last version as a safety net
    last_version = list(split.keys())[-1]
    return f"tenant-{customer_id}-{last_version}"

No model redeployment needed. Both adapters coexist in the same vLLM instance. Switch the split by updating the config, not the server.

Common Issues

Issue                                 | Cause                                          | Fix
OOM on adapter load                   | --max-lora-rank set too high                   | Set it to the max rank across all adapters, not higher
High TTFT on cold adapter             | Adapter not in GPU cache                       | Increase --max-loras or pre-warm with dummy requests at startup
Adapter alias not found               | Not registered at startup                      | Use the /v1/load_lora_adapter REST endpoint at runtime
Wrong adapter served                  | Alias collision                                | Use namespaced aliases like tenant-{id}-v{version}
S3 cold load latency                  | Network round trip on first request            | Pre-warm adapters at startup; set --max-cpu-loras to buffer in CPU RAM
Adapters incompatible with base model | Different base model or dtype used in training | All adapters must match the base model architecture and dtype exactly

Multi-adapter LoRA serving makes per-customer fine-tuning economically viable: one H100 on Spheron replaces 100 separate model instances. Provision an H100 or A100 in minutes and start serving your adapters.
