
Multi-Tenant LLM Serving on GPU Cloud: Per-Customer Isolation, Token Quotas, and Production SaaS Architecture Guide (2026)


Most LLM SaaS startups bolt on multi-tenancy as an afterthought. They ship a shared endpoint with no quota enforcement, discover the noisy-neighbor problem at scale, and spend weeks retrofitting isolation that should have been in v1. If you are building a self-hosted, OpenAI-compatible API endpoint and plan to serve more than one external customer from it, multi-tenancy architecture is not optional. The AI gateway setup guide for LiteLLM, Portkey, and Kong covers the gateway layer for internal teams. This post goes deeper on the SaaS-specific problems: per-customer quota enforcement, billing-grade metering, noisy-neighbor mitigation, and the compliance controls (SOC 2 and EU AI Act) that enterprise customers will ask for before signing.

Why Multi-Tenancy Is the Hardest Problem in LLM SaaS Infrastructure

CPU-based SaaS is relatively forgiving. Processes are cheap, horizontal scaling is straightforward, and tenant isolation is handled mostly by the operating system. GPU inference is different in three ways that matter for multi-tenant design:

GPU batch slots are scarce and not divisible. A single H100 SXM5 running Llama 3.1 70B FP8 might handle 40-80 concurrent sequences at once via continuous batching and paged attention, depending on context length. Those batch slots are the resource. When one tenant sends 30 concurrent requests, the remaining 10-50 slots are available for everyone else. There's no OS-level isolation. One busy tenant genuinely degrades everyone else's latency.

Token consumption is a proxy for GPU time, not a perfect measure. A request that pairs 1,000 tokens of user input with a 4,000-token uncached system prompt costs far more GPU time than a 5,000-token request whose prefix is mostly served from cache. Billing by raw token count is simple but introduces attribution errors that compound at scale.

Prompt data is sensitive in ways storage is not. A tenant's prompt contains their user's inputs, often including personally identifiable information (PII). Cross-contamination in logs, traces, or debug output is not a compliance footnote. It is a contract violation and a potential regulatory event.

These three constraints together drive the architecture choices below.

Tenant Identity Propagation

Every multi-tenant system starts with identity: who is making this request, and how does that identity flow through the stack?

API Keys vs JWTs

Both work. The choice depends on what your auth system already looks like and whether you need stateless verification.

Opaque API keys are simpler to issue and rotate. The key is a random 32-byte token, stored hashed in your database, mapping to a tenant record. The gateway looks up the key on every request, which requires a Redis or Postgres round-trip. Latency impact is minimal (1-2ms) but the lookup is a required step.

JWTs with a sub claim enable stateless verification: the gateway validates the signature and reads tenant_id from the payload without a database round-trip. The downside is rotation. You cannot invalidate a JWT before it expires without a token revocation list, which brings back a lookup anyway. For SaaS with many small tenants, opaque keys are usually the right call. For internal service-to-service auth, JWTs are fine.
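As a rough illustration of the opaque-key option, the sketch below issues a random key, stores only its digest, and resolves a presented key back to a tenant. The use of SHA-256 and the in-memory `key_table` dict are assumptions made for the example; in practice the table would be the Redis or Postgres lookup described above.

```python
import hashlib
import secrets
from typing import Dict, Optional, Tuple

def issue_api_key() -> Tuple[str, str]:
    """Return (plaintext_key, digest). Only the digest should be persisted."""
    plaintext = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoded
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, digest

def resolve_tenant(presented_key: str, key_table: Dict[str, str]) -> Optional[str]:
    """Hash the presented key and look up its tenant; key_table maps digest -> tenant_id."""
    digest = hashlib.sha256(presented_key.encode()).hexdigest()
    return key_table.get(digest)

# Usage: show the plaintext to the tenant once, keep only the digest.
key, digest = issue_api_key()
key_table = {digest: "tenant-acme"}  # hypothetical stand-in for the database
```

Because only the digest is stored, a leaked database dump does not directly expose usable keys, and rotation is just a matter of adding a new digest and later deleting the old one.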

Header Propagation Through the Inference Stack

The tenant identity must follow the request through every layer of the stack. The pattern is straightforward:

  1. Client sends Authorization: Bearer <tenant-api-key> (or a JWT).
  2. Gateway validates the key, resolves tenant_id, and adds X-Tenant-ID: <id> to the forwarded request.
  3. vLLM receives the request with the tenant header. It doesn't use it directly, but the gateway can read it from the LiteLLM metadata field for logging.
  4. Langfuse / observability layer receives the completion event with the tenant header attached as a metadata attribute.

The critical rule: X-Tenant-ID must be set by the gateway, not trusted from the client. Clients should never be able to inject their own tenant headers.
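One way to enforce that rule is to strip any client-supplied tenant header before attaching the gateway's own value. The helper below is a minimal sketch operating on a plain dict of headers; the function name is hypothetical and a real gateway would apply the same logic to its framework's header object.

```python
from typing import Dict

def forwardable_headers(client_headers: Dict[str, str], tenant_id: str) -> Dict[str, str]:
    """Drop any client-supplied tenant header, then set the gateway-resolved one."""
    # Header names are case-insensitive, so compare in lowercase.
    cleaned = {k: v for k, v in client_headers.items() if k.lower() != "x-tenant-id"}
    cleaned["X-Tenant-ID"] = tenant_id  # value comes from key lookup, never from the client
    return cleaned
```

The Envoy filter later in this post follows the same remove-then-add order.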

Per-Customer Quota Architecture

Quota enforcement is where most teams make their first mistake. They set rate limits at the API gateway level but forget to account for token volume, leading to customers who stay under request limits while consuming 10x their expected compute.

Token Budgets and Daily Caps

The base data structure for per-tenant quotas is three Redis keys per tenant, two counters and one hash of limits:

quota:<tenant_id>:daily     # string counter: tokens used today, TTL = seconds until midnight UTC
quota:<tenant_id>:monthly   # string counter: tokens used this calendar month
quota:<tenant_id>:limits    # hash (HSET): daily_cap, monthly_cap, rpm, tpm

On every completed request, increment the appropriate counters and ensure the daily key has a TTL set to midnight UTC:

python
import math
from datetime import datetime, timezone, timedelta

def seconds_until_midnight_utc() -> int:
    now = datetime.now(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    return math.ceil((next_midnight - now).total_seconds())

pipe = redis.pipeline()
pipe.incrby(f"quota:{tenant_id}:daily", total_tokens)
pipe.incrby(f"quota:{tenant_id}:monthly", total_tokens)
pipe.ttl(f"quota:{tenant_id}:daily")
_, _, daily_ttl = pipe.execute()

if daily_ttl == -1:  # key exists but has no expiry; set it now
    redis.expire(f"quota:{tenant_id}:daily", seconds_until_midnight_utc())

Before forwarding a request, check the daily counter against the cap. If the tenant has consumed 80% of their daily budget, return a soft warning in the response headers. At 100%, return 429 Too Many Requests with a Retry-After header set to the number of seconds until midnight UTC.

For hard monthly caps (customers on fixed-price plans), check the monthly counter. Return 403 Forbidden with a clear error body when the cap is hit: {"error": "monthly_token_budget_exceeded", "reset_at": "2026-07-01T00:00:00Z"}.
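The `reset_at` value in that error body is the first instant of the next calendar month in UTC. A small sketch of how it might be computed follows; the function names are illustrative and not part of any library.

```python
from datetime import datetime, timezone

def next_month_start_utc(now: datetime) -> datetime:
    """First instant of the calendar month after `now`, in UTC."""
    if now.month == 12:
        return datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
    return datetime(now.year, now.month + 1, 1, tzinfo=timezone.utc)

def monthly_cap_error(now: datetime) -> dict:
    """Build the 403 error body returned when the monthly cap is hit."""
    reset = next_month_start_utc(now)
    return {
        "error": "monthly_token_budget_exceeded",
        "reset_at": reset.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Handling December explicitly avoids the common bug of constructing month 13.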

RPM Limits with Sliding-Window Counters

Token budgets are daily. RPM limits are per-minute, and they need to be accurate under burst conditions. The sliding window approach prevents gaming with burst-then-idle patterns. Use a Lua script so the ZREMRANGEBYSCORE, ZADD, and ZCOUNT execute atomically. A plain redis.pipeline() without MULTI/EXEC is a batched send, not a transaction: concurrent requests can interleave between the commands and both read a count below the limit before either is reflected, allowing bursts to exceed the cap.

python
import time

_RPM_LUA = """
local key   = KEYS[1]
local now   = tonumber(ARGV[1])
local win   = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local uid   = ARGV[4]
redis.call('ZREMRANGEBYSCORE', key, 0, win)
local count = tonumber(redis.call('ZCOUNT', key, win, now))
if count >= limit then return count + 1 end
redis.call('ZADD', key, now, tostring(now) .. '-' .. uid)
redis.call('EXPIRE', key, 120)
return count + 1
"""

def check_rpm(redis, tenant_id: str, limit: int) -> bool:
    import uuid
    now = time.time()
    window_start = now - 60  # 60-second window
    key = f"rpm:{tenant_id}"
    count = redis.eval(_RPM_LUA, 1, key, now, window_start, limit, str(uuid.uuid4()))
    return int(count) <= limit

If count exceeds the limit, reject the request immediately. The Lua script runs atomically on the Redis server, so no external locking is needed.

Soft vs Hard Enforcement

Not all limits should be hard cuts. A well-designed quota system has three layers:

| Layer | Threshold | Action |
| --- | --- | --- |
| Warning | 80% of daily budget consumed | Add X-Quota-Warning: 20% remaining header |
| Soft limit | 100% daily budget | 429 with Retry-After: midnight UTC |
| Hard limit | 100% monthly budget | 403 with reset timestamp |

Soft limits on daily budgets let customers with legitimate bursty use cases self-correct without hard failures. Hard limits on monthly budgets protect you from unexpected charges. The thresholds are configurable per tenant: premium plans get higher limits, not just higher caps.
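The three layers can be expressed as a single decision function that the gateway calls before forwarding. This is a sketch under stated assumptions: the counter values are already read from the quota store, `seconds_to_midnight` is supplied by the caller, and the `QuotaDecision` type and `warn_ratio` parameter are names invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class QuotaDecision:
    status: int                                  # 200 means "forward the request"
    headers: dict = field(default_factory=dict)  # extra response headers to attach

def evaluate_quota(daily_used: int, daily_cap: int,
                   monthly_used: int, monthly_cap: int,
                   seconds_to_midnight: int, warn_ratio: float = 0.8) -> QuotaDecision:
    # Hard limit: monthly budget exhausted.
    if monthly_used >= monthly_cap:
        return QuotaDecision(403)
    # Soft limit: daily budget exhausted; tell the client when to retry.
    if daily_used >= daily_cap:
        return QuotaDecision(429, {"Retry-After": str(seconds_to_midnight)})
    # Warning: request is still forwarded, but flagged in the headers.
    if daily_used >= warn_ratio * daily_cap:
        remaining_pct = round(100 * (daily_cap - daily_used) / daily_cap)
        return QuotaDecision(200, {"X-Quota-Warning": f"{remaining_pct}% remaining"})
    return QuotaDecision(200)
```

Checking the monthly cap first means a tenant who has exhausted both budgets sees the 403, which is the more informative of the two responses.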

Noisy-Neighbor Mitigation

Quota enforcement prevents billing overruns. Noisy-neighbor mitigation prevents one high-traffic tenant from degrading latency for everyone else, even within their quota.

Fair-Share Scheduling

The simplest approach is to cap each tenant at 1/N of available concurrency, where N is the number of active tenants. In practice, you implement this by tracking the active concurrent requests per tenant in Redis and rejecting (or queuing) requests that would exceed the per-tenant slot budget:

python
# Lua script makes the read-check-increment atomic. A plain GET followed by
# INCR has a TOCTOU race: two concurrent requests can both read below the limit
# before either increments, allowing both to proceed and exceed the slot budget.
#
# TTL is 1800 s (30 min). A 300 s TTL can expire mid-flight for long-context
# requests; the subsequent DECR then creates the key at -1, permanently
# disabling the guard. Use a TTL well above your expected max request duration.
_ACQUIRE_SLOT_LUA = """
local key   = KEYS[1]
local limit = tonumber(ARGV[1])
local ttl   = tonumber(ARGV[2])
local used  = tonumber(redis.call('GET', key) or 0)
if used >= limit then return 0 end
redis.call('INCR', key)
redis.call('EXPIRE', key, ttl)
return 1
"""

# Atomic DECR + conditional DEL. A plain decr() followed by a separate
# delete() has a TOCTOU race: the acquire Lua script can run between the two
# commands, increment the key to 1, and then the delete wipes that slot,
# leaving the counter at 0 even though a request is in-flight.
_RELEASE_SLOT_LUA = """
local remaining = redis.call('DECR', KEYS[1])
if tonumber(remaining) <= 0 then redis.call('DEL', KEYS[1]) end
return remaining
"""

active_key = f"active:{tenant_id}"
acquired = redis.eval(_ACQUIRE_SLOT_LUA, 1, active_key, per_tenant_slot_limit, 1800)

if not acquired:
    # JSONResponse (fastapi.responses) serializes the dict; a plain Response would not.
    return JSONResponse(status_code=429, content={"error": "concurrency_limit"})

try:
    result = forward_to_vllm(request)
finally:
    redis.eval(_RELEASE_SLOT_LUA, 1, active_key)

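Set per_tenant_slot_limit based on your total concurrency budget. With --max-num-seqs 64 in vLLM and 20 active tenants, a strict 1/N share is about 3 slots per tenant. A limit of 5-6 slots deliberately oversubscribes the pool on the assumption that tenants rarely peak at the same time, which leaves individual tenants room to burst.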
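For reference, the arithmetic behind a per-tenant slot limit can be sketched as below. The `oversubscription` factor is an assumption of this example, not a vLLM setting: 64 slots across 20 tenants is a strict share of about 3, so a 5-6 slot limit corresponds to a factor of roughly 1.75.

```python
import math

def per_tenant_slot_limit(max_num_seqs: int, active_tenants: int,
                          oversubscription: float = 1.0) -> int:
    """Slots each tenant may hold concurrently, clamped to [1, max_num_seqs]."""
    if active_tenants <= 0:
        return max_num_seqs  # no contention: a lone tenant may use the whole pool
    fair_share = max_num_seqs / active_tenants
    return max(1, min(max_num_seqs, math.floor(fair_share * oversubscription)))
```

Recompute the limit whenever the active-tenant count changes materially, and treat the factor as something to tune against observed peak overlap.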

Priority Lanes

Fair-share treats all tenants equally. For SaaS products with tiered plans, you want to differentiate. The pattern is two virtual queues in the gateway:

  • Priority queue: premium plan customers with guaranteed slots. Requests in this lane are forwarded to vLLM immediately, up to the lane's concurrency limit.
  • Standard queue: base plan customers sharing a pool of remaining slots. If the priority lane is full, excess premium requests overflow into this queue and compete for standard slots.

LiteLLM does not expose native queue priority as of v1.x. You implement this at the gateway layer: inspect the tenant's plan tier from your quota store and route to a dedicated LiteLLM proxy instance (or a separate vLLM endpoint) for priority tenants. Running two vLLM instances with separate --max-num-seqs budgets is the simplest physical isolation: one for premium (say, 32 slots) and one for standard (say, 32 slots).
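The routing step itself can be very small. In the sketch below the backend URLs and tier names are hypothetical placeholders; the only logic is a lookup with a safe default.

```python
# Hypothetical endpoints for the two vLLM instances described above.
BACKENDS = {
    "premium": "http://vllm-premium:8000/v1",
    "standard": "http://vllm-standard:8000/v1",
}

def backend_for(plan_tier: str) -> str:
    """Pick the vLLM endpoint for a tenant's plan; unknown tiers fall back to standard."""
    return BACKENDS.get(plan_tier, BACKENDS["standard"])
```

Defaulting unknown tiers to the standard pool keeps a misconfigured tenant record from landing in the premium lane.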

Dedicated Queue Isolation

Full isolation means each tenant gets a separate virtual queue in the inference server. vLLM itself doesn't expose per-tenant queue control, but you can approximate it with process-level isolation:

  • Separate vLLM processes (or Docker containers) per tier, sharing the same physical GPU via CUDA MPS.
  • Separate Kubernetes pods with GPU resource fractions (requires MIG-partitioned H100s or time-sliced GPUs).
  • Separate physical GPU instances per enterprise customer who needs SLA guarantees.

The cost of full isolation is high. Reserve it for customers who explicitly pay for it or whose data classification requires it.

GPU Pool Architecture Decision Tree

How you structure the GPU pool depends on your tenant mix, traffic patterns, and SLA requirements:

| Situation | Architecture | Rationale |
| --- | --- | --- |
| <10 tenants, similar load profiles | Single vLLM, shared pool, fair-share scheduling | Simple ops; enough headroom per tenant |
| 10-100 tenants, mixed tier plans | Two vLLM instances (premium / standard), Redis quota | Tier isolation without per-tenant infra |
| Any tenant requires data sovereignty | Dedicated vLLM instance per isolated tenant, shared infra for rest | Data never co-mingles |
| 100+ tenants, low average volume | Single shared pool, LoRA multi-adapter serving | LoRA multi-adapter serving consolidates per-customer fine-tunes without extra GPU cost |
| Any tenant with SLA uptime >99.5% | Dedicated instance + fallback spot instance | Spot gives cost; reserved gives uptime |

The most common mistake is prematurely moving to per-tenant dedicated instances. For most SaaS products, the shared pool with fair-share scheduling handles 50-100 customers comfortably on a single H100.

Metering for Billing

Token Accounting

vLLM returns token counts in every completion response:

json
{
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 89,
    "total_tokens": 401
  }
}

Record both prompt and completion tokens separately because you will likely price them differently (input tokens are cheaper than output tokens). Write to Redis atomically on every completion. For billing-grade accuracy, also write to Postgres as the durable audit log:

sql
INSERT INTO token_usage (
  tenant_id, request_id, model, prompt_tokens,
  completion_tokens, cache_hit, created_at
) VALUES ($1, $2, $3, $4, $5, $6, NOW());

Redis is your real-time quota check. Postgres is the billing source of truth.

Cache-Hit Attribution

vLLM prefix caching means some requests have large portions of their prompt served from KV cache. The GPU work done for a cache-hit prompt is a fraction of a cold request. If you charge identical rates for cache-hit and cold requests, you are overcharging tenants whose prompts benefit from caching.

The practical rule: if LiteLLM or your gateway signals cache_hit: true, apply a 40-60% discount on prompt tokens for billing purposes. The exact discount should reflect your actual GPU cost savings from caching (measure it in production; the savings are model and workload dependent).

python
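# Example rates from the pricing used later in this post, converted to USD per token.
INPUT_RATE = 0.0003 / 1000   # $0.0003 per 1k input tokens
OUTPUT_RATE = 0.001 / 1000   # $0.001 per 1k output tokens
CACHE_HIT_MULTIPLIER = 0.4   # bill 40% of prompt tokens on a cache hit (a 60% discount)

def compute_billing_tokens(prompt_tokens: int, completion_tokens: int, cache_hit: bool) -> dict:
    # Discount only the prompt side; completion tokens are always generated fresh.
    effective_prompt = prompt_tokens * (CACHE_HIT_MULTIPLIER if cache_hit else 1.0)
    return {
        "prompt_tokens_billed": effective_prompt,
        "completion_tokens_billed": completion_tokens,
        "total_cost_usd": (effective_prompt * INPUT_RATE) + (completion_tokens * OUTPUT_RATE),
    }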

Mixed-Precision Token Weight

FP8 and BF16 models produce the same logical token counts but consume different amounts of GPU time. An FP8 model running at 3,000 tokens/sec throughput vs a BF16 model at 1,500 tokens/sec means the same 1,000-token output costs half the GPU time in FP8. If you are running multiple model variants (FP8 for throughput, BF16 for quality-sensitive workloads), track the model variant in the billing record and apply different per-token cost weights accordingly.
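One simple way to derive those weights is from measured throughput. The sketch below uses the illustrative 3,000 and 1,500 tokens/sec figures from the paragraph above; the dictionary and function names are assumptions for the example, and real weights should come from your own benchmarks.

```python
# Example throughput figures from the text, in tokens per second.
THROUGHPUT_TPS = {"fp8": 3000.0, "bf16": 1500.0}

def gpu_seconds(tokens: int, variant: str) -> float:
    """Approximate GPU time consumed to produce `tokens` on a given variant."""
    return tokens / THROUGHPUT_TPS[variant]

def cost_weight(variant: str, baseline: str = "fp8") -> float:
    """Per-token cost multiplier relative to the baseline variant."""
    return THROUGHPUT_TPS[baseline] / THROUGHPUT_TPS[variant]
```

With these figures a BF16 token carries a weight of 2.0 relative to FP8, matching the "half the GPU time" statement above.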

Implementation Patterns: LiteLLM, Portkey, and Envoy

LiteLLM Virtual Keys for Per-Tenant Budgets

LiteLLM's virtual key system handles the budget enforcement layer without custom code. Create one virtual key per tenant through the LiteLLM API or UI:

bash
curl -X POST http://litellm:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "tenant-acme-corp",
    "max_budget": 50.0,
    "budget_duration": "30d",
    "rpm_limit": 60,
    "tpm_limit": 100000,
    "models": ["llama-70b", "llama-8b"]
  }'

When ACME Corp's virtual key hits $50/month, LiteLLM returns a 429 automatically. No custom middleware needed. The LiteLLM Postgres database tracks spend per key. The dashboard shows per-key usage over time.

The limitation: LiteLLM's budget tracking is in dollars (cost-based), not tokens (usage-based). If you need to expose token-based quotas to customers rather than dollar budgets, you need a custom middleware layer in front of LiteLLM that converts token counts to dollar estimates before calling LiteLLM's internal budget check, or that tracks tokens separately in Redis and checks them before forwarding to LiteLLM.

Portkey as a Multi-Tenant Gateway

Portkey adds semantic caching and prompt management on top of gateway routing. For multi-tenant SaaS, the relevant features are virtual keys (same concept as LiteLLM) and guardrails that can be applied per-tenant.

The tradeoff: Portkey's full feature set is a cloud SaaS product. If your customers have data residency requirements or you cannot send prompt data to third parties, check whether the Portkey enterprise on-premises tier covers your specific requirements before evaluating further.

Envoy with Lua Filters for Low-Latency Routing

If you want to minimize gateway overhead, Envoy's Lua filter lets you implement tenant routing and basic rate limiting without a separate gateway process. The filter runs in Envoy's hot path:

lua
function envoy_on_request(request_handle)
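  local auth_header = request_handle:headers():get("Authorization")
  -- validate_and_extract_tenant is a placeholder for your own key lookup; it is not an Envoy built-in.
  local tenant_id = validate_and_extract_tenant(auth_header)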
  
  if not tenant_id then
    request_handle:respond({[":status"] = "401"}, "Unauthorized")
    return
  end
  
  request_handle:headers():remove("x-tenant-id")
  request_handle:headers():add("x-tenant-id", tenant_id)
end

Envoy adds roughly 0.3-0.5ms per request vs LiteLLM's 2-5ms. For applications where latency is the primary constraint (real-time voice AI, sub-200ms TTFT requirements), the difference matters. For most SaaS inference workloads where the model itself takes 500ms+, the gateway overhead is irrelevant.

Production Deployment on Spheron

Below is a full Docker Compose configuration for a production multi-tenant inference stack: vLLM as the inference backend, LiteLLM as the gateway with virtual key enforcement, Redis for quota tracking, Langfuse for observability, and Postgres for durable billing records.

yaml
version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --quantization fp8
      --tensor-parallel-size 1
      --max-model-len 16384
      --max-num-seqs 64
      --gpu-memory-utilization 0.90
      --served-model-name llama-70b
      --disable-log-requests
      --port 8000
    ports:
      - "127.0.0.1:8000:8000"

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    depends_on:
      - vllm
      - redis
      - postgres
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      DATABASE_URL: postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm
      REDIS_URL: redis://redis:6379
      LANGFUSE_PUBLIC_KEY: ${LANGFUSE_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LANGFUSE_SECRET_KEY}
      LANGFUSE_HOST: http://langfuse:3000
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    command: --config /app/config.yaml
    ports:
      - "127.0.0.1:4000:4000"

  redis:
    image: redis:7-alpine
    ports:
      - "127.0.0.1:6379:6379"

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres-init.sql:/docker-entrypoint-initdb.d/init.sql

  langfuse:
    image: langfuse/langfuse:latest
    depends_on:
      - postgres
    environment:
      DATABASE_URL: postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/langfuse
      NEXTAUTH_SECRET: ${LANGFUSE_SECRET}
      ENCRYPTION_KEY: ${LANGFUSE_ENCRYPTION_KEY}
      SALT: ${LANGFUSE_SALT}
      NEXTAUTH_URL: http://langfuse:3000
    ports:
      - "127.0.0.1:3000:3000"

volumes:
  postgres_data:

The postgres-init.sql file creates the langfuse database on first startup (Postgres's POSTGRES_DB environment variable only creates one database automatically):

sql
CREATE DATABASE langfuse;
GRANT ALL PRIVILEGES ON DATABASE langfuse TO litellm;
\c langfuse
GRANT ALL ON SCHEMA public TO litellm;

LiteLLM config (litellm-config.yaml):

yaml
model_list:
  - model_name: llama-70b
    litellm_params:
      model: openai/llama-70b
      api_base: http://vllm:8000/v1
      api_key: "none"

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL

litellm_settings:
  cache: true
  cache_params:
    type: redis
    url: os.environ/REDIS_URL
  success_callback: ["langfuse"]

Provision an H100 instance on Spheron, SSH in, and run docker compose up -d. The full stack is live in under 5 minutes.

Compliance: SOC 2 and EU AI Act

Per-Tenant Data Isolation

The highest-risk cross-contamination vector is server logs. A vllm serve instance running without --disable-log-requests writes every prompt to stdout in plaintext. If you pipe logs to a shared logging service, every prompt from every tenant is visible to anyone with log access.

Fix: always run vLLM with --disable-log-requests in production. If you need request-level debugging, implement structured logging at the gateway layer where you control exactly which fields are emitted. Log the tenant ID, token counts, latency, and model name. Never log the prompt content.
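A simple way to guarantee that prompt content never reaches the logs is an allow-list: the gateway copies only named metadata fields into the log record and ignores everything else. The field list and function name below are assumptions for illustration.

```python
import json

# Only these fields may ever appear in a gateway log line.
ALLOWED_FIELDS = ("tenant_id", "request_id", "model",
                  "prompt_tokens", "completion_tokens", "latency_ms")

def build_log_line(event: dict) -> str:
    """Serialize allow-listed metadata; prompt and completion text are never copied."""
    record = {name: event[name] for name in ALLOWED_FIELDS if name in event}
    return json.dumps(record, sort_keys=True)
```

An allow-list fails safe: if a new field containing user text is later added to the event, it stays out of the logs until someone explicitly adds it to `ALLOWED_FIELDS`.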

For customers who process medical or financial data, a further step is per-tenant KV cache partitioning. vLLM does not currently expose per-tenant KV cache isolation at the process level. The practical approach is a dedicated vLLM instance per isolated tenant, even if that instance runs on shared hardware via CUDA MPS.

Audit Logs

SOC 2 Type II requires an immutable audit trail of who accessed what data and when. In the LLM serving context, this means:

  • Per-request log entries with tenant ID, timestamp, model, token counts, and a synthetic request ID (not prompt content).
  • Write access controlled: the billing/compliance system can write to the audit log but no one can delete records.
  • Retention: common retention windows for SOC 2 Type II audited entities are 12 months or more; HIPAA requires 6 years; finance regulations like SOX typically require 7 years.

Langfuse writes immutable trace records per completion. Export those records to an append-only S3 or GCS bucket (enable bucket versioning and object lock policies). That bucket becomes your audit trail.

Key Rotation

Tenant API keys should be rotated on a 90-day schedule. The rotation process:

  1. Generate a new key for the tenant.
  2. Set both the old and new key as valid in your quota store (overlap period: 7-14 days).
  3. Notify the tenant with the new key and the old key's expiry date.
  4. After the overlap period, invalidate the old key.
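The overlap logic in steps 2 and 4 reduces to one timestamp comparison per key. The sketch below assumes each key record carries a `superseded_at` timestamp that is set when its replacement is issued; that field name and the 14-day default are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def key_is_valid(now: datetime, superseded_at: Optional[datetime],
                 overlap_days: int = 14) -> bool:
    """A current key is always valid; a superseded key stays valid for the overlap window."""
    if superseded_at is None:
        return True
    return now < superseded_at + timedelta(days=overlap_days)
```

Deriving validity from a timestamp means step 4 needs no separate cleanup job to take effect; deleting expired records becomes housekeeping rather than a security control.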

For LiteLLM virtual keys, the API supports key updates:

bash
curl -X POST http://litellm:4000/key/update \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key": "old-key-hash", "duration": "14d"}'

To support EU AI Act Article 13 transparency principles, expose a per-tenant /usage endpoint that returns token counts, model names, and request timestamps for a given date range. Customers can download their own inference records without requiring support intervention.
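The aggregation behind such an endpoint might look like the sketch below. It assumes usage records shaped like the token_usage rows shown earlier (tenant_id, model, prompt_tokens, completion_tokens, created_at) and already loaded into memory; in production the same grouping would normally be a SQL query.

```python
from collections import defaultdict
from datetime import date

def usage_summary(records: list, tenant_id: str, start: date, end: date) -> dict:
    """Per-model token totals for one tenant over an inclusive date range."""
    totals = defaultdict(lambda: {"requests": 0, "prompt_tokens": 0, "completion_tokens": 0})
    for rec in records:
        if rec["tenant_id"] != tenant_id:
            continue  # never include another tenant's rows
        if not (start <= rec["created_at"].date() <= end):
            continue
        bucket = totals[rec["model"]]
        bucket["requests"] += 1
        bucket["prompt_tokens"] += rec["prompt_tokens"]
        bucket["completion_tokens"] += rec["completion_tokens"]
    return dict(totals)
```

The tenant filter should always be driven by the authenticated identity, not by a query parameter the caller controls.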

Unit Economics: 50 Customers on One H100

The key question for any multi-tenant LLM SaaS is whether the GPU cost pencils out. Here is the math for a realistic SaaS scenario: 50 customers, each averaging 100k tokens per day (mix of input and output), served from a single H100 SXM5 running Llama 3.1 70B FP8.

GPU cost (spot pricing):

  • H100 SXM5 spot on Spheron: $1.49/hr
  • Monthly cost (720 hours): $1,073/month
  • Cost per customer per month: $1,073 / 50 = $21.46/month in GPU cost

This example uses spot pricing. Spot trades cost for interruption risk; the current on-demand H100 SXM5 rate is $4.06/hr, so on-demand operators should multiply the GPU cost line by roughly 2.7x: $2,923/month GPU cost, or $58.46/month per customer at 50 customers ($29.23 at 100 customers).

Token throughput:

  • 50 customers x 100k tokens/day = 5M tokens/day total
  • Llama 3.1 70B FP8 on H100: roughly 3,000-4,000 tokens/second at moderate concurrency
  • 5M tokens/day at 3,500 tokens/second average = 1,429 seconds of GPU time per day = 23.8 minutes out of 1,440 minutes
  • GPU utilization: under 2%. There is headroom for 50x more volume before the H100 bottlenecks.
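The utilization figure is easy to recompute for other tenant counts or volumes. A minimal sketch, using the 3,500 tokens/second midpoint from the bullets above as an assumed average throughput:

```python
SECONDS_PER_DAY = 86_400

def busy_seconds_per_day(customers: int, tokens_per_customer_per_day: int,
                         tokens_per_second: float) -> float:
    """GPU-seconds needed per day to serve the aggregate token volume."""
    return customers * tokens_per_customer_per_day / tokens_per_second

def average_utilization(customers: int, tokens_per_customer_per_day: int,
                        tokens_per_second: float) -> float:
    """Fraction of the day the GPU is busy, assuming perfectly even load."""
    return busy_seconds_per_day(customers, tokens_per_customer_per_day,
                                tokens_per_second) / SECONDS_PER_DAY

# 50 customers at 100k tokens/day each: about 1,429 busy seconds, under 2% of the day.
utilization = average_utilization(50, 100_000, 3500.0)
```

This is an average. Real traffic is bursty, so peak-hour concurrency will hit limits well before average utilization approaches 100%.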

Revenue per customer (baseline pricing):

  • Input tokens: $0.0003/1k tokens
  • Output tokens: $0.001/1k tokens
  • Assuming 70% input, 30% output split: 70k input tokens + 30k output tokens
  • Revenue: (70 x $0.0003) + (30 x $0.001) = $0.021 + $0.030 = $0.051/day per customer
  • Monthly revenue per customer: $0.051 x 30 = $1.53/month at usage price

Note that this is usage revenue only. SaaS products typically charge a platform fee on top of usage. With a $49/month platform fee added to usage billing, the per-customer math is: $49 + $1.53 (usage revenue) - $21.46 (GPU cost share) = $29.07 gross margin per customer per month.

| Item | Per Customer Per Month |
| --- | --- |
| GPU cost share (1/50 of H100) | $21.46 |
| Token revenue ($0.0003/1k input, $0.001/1k output) | $1.53 |
| Platform subscription fee | $49.00 |
| Gross margin | $29.07 |
| Gross margin % | ~58% of $50.53 revenue |

At 100 customers (still well within what one H100 can serve at this volume), GPU cost per customer drops to $10.73/month and gross margin rises to roughly 79% ($39.80 on $50.53 of revenue).
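To rerun this math for a different customer count or GPU price, a small calculator is enough. The sketch below takes the margin percentage against total revenue (platform fee plus usage), which is one reasonable convention; dividing by the $49 fee alone gives a slightly higher figure. The function and parameter names are illustrative.

```python
def monthly_margin(customers: int, gpu_hourly_usd: float = 1.49,
                   platform_fee_usd: float = 49.0, usage_revenue_usd: float = 1.53,
                   hours_per_month: int = 720) -> dict:
    """Per-customer GPU cost share, gross margin, and margin % of total revenue."""
    gpu_share = gpu_hourly_usd * hours_per_month / customers
    revenue = platform_fee_usd + usage_revenue_usd
    margin = revenue - gpu_share
    return {
        "gpu_share_usd": round(gpu_share, 2),
        "margin_usd": round(margin, 2),
        "margin_pct": round(100 * margin / revenue, 1),
    }

spot_50 = monthly_margin(50)                              # spot pricing, 50 customers
on_demand_50 = monthly_margin(50, gpu_hourly_usd=4.06)    # on-demand pricing
```

Swapping in the $4.06/hr on-demand rate shows how quickly the margin turns negative at 50 customers, which is why the customer count per GPU matters more than the per-token price.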

Pricing fluctuates based on GPU availability. The prices above are based on 07 Jun 2026 and may have changed. Check current GPU pricing for live rates.

When do the economics change? At higher token volumes per customer (say, 1M tokens/day each, roughly 17% average utilization by the throughput math above), bursty peak-hour traffic can start to approach the H100's concurrency limits and you need a second GPU. That's the inflection point where the shared-pool architecture breaks and per-customer dedicated instances become worth the cost. The GPU FinOps and chargeback guide covers how to track which tenants are driving that inflection in your specific workload.


The architecture described here (Redis quota layer, LiteLLM virtual keys, vLLM with --disable-log-requests, Langfuse audit trail) is the minimum viable production setup for a compliant multi-tenant LLM SaaS. It runs on a single H100 and supports 50-100 customers without any custom distributed systems work. Scaling beyond that is largely horizontal replication of the same stack.

Spheron bare-metal GPU instances give AI SaaS startups a fixed, predictable cost foundation to build multi-tenant inference on. Rent at the hourly rate, mark up per-token to customers, and keep the spread.



Quick Setup Guide

  1. Design the tenant identity model

    Assign each SaaS customer an API key (opaque, rotatable) or JWT with a sub claim. The key maps to a tenant_id in your quota store (Redis). Add tenant_id as a custom header or extract from Authorization before the request reaches vLLM. LiteLLM virtual keys handle this natively; for custom stacks, a FastAPI middleware layer reads and validates the key, adds X-Tenant-ID, and forwards to vLLM.

  2. Set up the Redis quota store

    Use Redis with two data structures per tenant: a sliding-window counter for RPM limits (ZADD with timestamps, ZCOUNT for rate check) and a daily token counter (INCRBY on a key with TTL set to seconds until midnight UTC). Soft daily limits return 429 with Retry-After. Hard monthly caps return 403. Use Redis atomic INCRBY to avoid race conditions in concurrent request handling.

  3. Deploy the multi-tenant gateway

    Deploy LiteLLM proxy in front of vLLM. Create one virtual key per tenant, and set max_budget and budget_duration per key. For rate enforcement, set tpm_limit and rpm_limit per key. LiteLLM checks budgets before forwarding to the backend vLLM instance. Alternatively, use Envoy with a Lua filter for low-latency header-based tenant routing without a separate gateway process.

  4. Configure vLLM for multi-tenant workloads

    Start vLLM with --max-num-seqs tuned to your concurrency budget. For LoRA-based per-tenant fine-tunes, add --enable-lora --max-loras 8 --max-cpu-loras 64. Set --served-model-name to an alias so tenants see a stable model name regardless of backend quantization. Use --disable-log-requests in production to avoid logging prompt content from tenant A into logs readable by tenant B.

  5. Wire up billing-grade metering with Langfuse

    Configure LiteLLM's success_callback to langfuse. Each completion event writes prompt_tokens, completion_tokens, model, latency, tenant virtual key, and cost estimate to Langfuse. Query Langfuse Postgres daily to aggregate per-tenant token usage for invoicing. For cache-hit attribution, check cache_hit in the LiteLLM response metadata and apply a discounted token rate in the billing aggregation.

  6. Apply SOC 2 and EU AI Act isolation controls

    Enable --disable-log-requests on all vLLM instances to prevent prompt data cross-contamination in logs. Rotate tenant API keys on a 90-day schedule via an automated key rotation job. Maintain per-tenant audit logs in Langfuse with immutable write access for compliance export. To support EU AI Act Article 13 transparency principles, expose a per-tenant usage endpoint so customers can download their own inference logs.


Frequently Asked Questions

Multi-tenant LLM serving is the pattern of running one shared GPU-backed inference endpoint that serves multiple external customers (tenants), each with their own token quotas, rate limits, and data isolation guarantees. A single H100 can serve 50+ SaaS customers if token budgets are enforced upstream and tenant traffic is queued with fair-share scheduling.

The standard pattern is a Redis-backed quota store: each API key maps to a budget (tokens/day, RPM, monthly cap). The gateway (LiteLLM or a custom FastAPI middleware) checks Redis before forwarding to vLLM. On soft daily limits it returns a 429 with Retry-After until the quota resets at midnight UTC. On hard monthly caps it returns 403 until the next calendar month.

Noisy-neighbor mitigation prevents one high-traffic tenant from consuming all GPU batch slots. Approaches include dedicated virtual queues per tenant in the inference server, priority lanes (premium tenants get dedicated slots), fair-share scheduling (each tenant is capped at 1/N of available concurrency), and per-tenant Kubernetes namespace with separate vLLM deployments for complete isolation.

Use a shared pool when tenants have bursty, unpredictable traffic and average token volumes are low. Use dedicated instances when a tenant has SLA guarantees, processes sensitive data with regulatory isolation requirements, or when their sustained load would monopolize a shared pool. Hybrid works well: shared pool for the long tail, dedicated for top-tier customers.

Use the token counts from vLLM's completion response (prompt_tokens, completion_tokens). For cache hits, attribute at a discounted rate since the GPU did less work. Store per-tenant token counts atomically in Redis with TTL-based daily/monthly windows, and write audit records to Postgres via Langfuse callbacks. Mixed-precision models (FP8 vs BF16) produce the same logical token counts but different compute costs, so track model variant alongside token count.

At Spheron's current H100 SXM5 spot pricing ($1.49/hr), one H100 running Llama 3.1 70B FP8 has ample capacity for 50 customers averaging 100k tokens/day each. The GPU cost is a fixed hourly rate. Charging customers $0.001/1k output tokens and $0.0003/1k input tokens plus a platform fee, against a spot GPU cost of $1.49/hr, yields healthy per-customer margins at scale. On-demand pricing ($4.06/hr) narrows margins significantly, so on-demand operators should factor in a roughly 2.7x higher GPU cost line.
