Tutorial

AI Gateway Setup 2026: LiteLLM, Portkey, and Kong AI Gateway for Multi-Model LLM Traffic

Written by Mitrasish, Co-founder · Apr 23, 2026
Tags: AI Gateway, LLM Gateway, LiteLLM, Portkey AI Gateway, Kong AI Gateway, Multi-Model LLM, LLM Routing, OpenTelemetry, Virtual Keys, Self-Hosted LLM

Teams are migrating off single-provider SDKs fast. Multi-model stacks are now the norm: a self-hosted Llama 4 endpoint for cost-sensitive traffic, GPT-4o for complex tasks, Claude Sonnet as a quality backstop. But the routing, auth, and observability layer that ties all of this together is usually missing. If you have already built your own OpenAI-compatible self-hosted endpoint and layered on a complexity-based inference router, an AI gateway is the next piece: it adds virtual key auth, per-team budget caps, and automatic multi-provider failover above everything else you have already deployed.

This post compares three gateways, then walks through a full LiteLLM deployment fronting a Spheron vLLM backend with hybrid Spheron-first routing.

| Gateway | Best For | Self-Hosted | Open Source | Budget Enforcement |
|---|---|---|---|---|
| LiteLLM | Self-hosted + cloud mix, multi-provider routing | Yes | Yes | Yes (virtual keys) |
| Portkey | Cloud-only teams, semantic caching, guardrails | Limited | No | Yes (per-key) |
| Kong AI Gateway | Enterprise Kong mesh, SSO, plugin ecosystem | Yes | Community edition | Yes (plugin-based) |

Why AI Gateways Exist

A single vLLM endpoint handles one model and one team. Once you add a second model, a second team, or a second provider, you need a management layer. AI gateways solve five specific problems:

  • Virtual keys, not API keys. You issue a virtual key per team or app. The real provider key lives only inside the gateway. Rotate the real key without touching any client code.
  • Rate limiting per team, per model. A team can be capped at 100 RPM without affecting other teams sharing the same backend pool.
  • Hard budget caps. Set a monthly dollar limit per virtual key. When the team hits $500, requests return 429 immediately. No surprise bills.
  • Full request logging. Every request is logged with model name, token counts, latency, cost, and the virtual key that made it. Required for compliance and chargeback.
  • Automatic failover. If your primary Spheron vLLM instance returns 503, the gateway retries on OpenAI or another fallback without any change to client code.
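
The budget-cap behavior can be sketched in a few lines. This is an illustrative, in-memory model of per-key spend tracking, not LiteLLM's implementation (which persists spend to Postgres); the class and method names are made up for the example.

```python
# Minimal sketch of per-key budget enforcement. Illustrative only:
# a real gateway persists spend durably instead of using a dict.
class BudgetExceeded(Exception):
    """Maps to an HTTP 429 response at the gateway edge."""

class VirtualKeyLedger:
    def __init__(self):
        self.limits = {}   # virtual key -> monthly USD cap
        self.spend = {}    # virtual key -> USD spent this month

    def issue_key(self, key: str, monthly_budget_usd: float) -> None:
        self.limits[key] = monthly_budget_usd
        self.spend[key] = 0.0

    def charge(self, key: str, cost_usd: float) -> None:
        # Reject the request that would push the key past its cap.
        if self.spend[key] + cost_usd > self.limits[key]:
            raise BudgetExceeded(f"key {key} exceeded ${self.limits[key]:.2f} budget")
        self.spend[key] += cost_usd

ledger = VirtualKeyLedger()
ledger.issue_key("sk-team-a", monthly_budget_usd=500.0)
ledger.charge("sk-team-a", 499.0)    # under budget: accepted
try:
    ledger.charge("sk-team-a", 2.0)  # would pass $500: rejected
    blocked = False
except BudgetExceeded:
    blocked = True
print(blocked)  # True
```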

AI Gateway vs Inference Router vs Reverse Proxy

These three layers stack, but they do different jobs:

| Layer | Job | Tool |
|---|---|---|
| Load balancing | Distribute traffic across identical model replicas | NGINX, HAProxy |
| Complexity routing | Route simple queries to cheap models, complex ones to expensive models | Inference router (LLM-based classifier) |
| Multi-provider auth + observability | Auth, budgets, logging, failover across providers | AI gateway |

They stack together naturally: vLLM replicas sit behind NGINX, that cluster is fronted by an inference router that classifies query complexity, and the AI gateway sits above the router handling what NGINX cannot: virtual key auth, per-team spend, and failover to cloud APIs when your self-hosted capacity is full.

LiteLLM

LiteLLM is the most widely deployed open-source AI gateway: 40k+ GitHub stars, 100+ provider integrations, and it speaks the OpenAI protocol natively, so your existing application code needs zero changes.

What it does well. Every provider gets a unified interface: openai, anthropic, cohere, azure, bedrock, and your own openai-compatible vLLM endpoints all work behind a single /v1/chat/completions URL. Virtual keys are first-class: you create them in the LiteLLM UI or via API, attach budget limits and model restrictions, and LiteLLM tracks spend against those limits in Postgres.

Virtual keys and budgets. Create a key per team, set max_budget_usd=500 and budget_duration=monthly, and restrict it to specific model aliases. The key is what your app sends as the Bearer token. LiteLLM maintains a running spend counter in Postgres and returns a 429 with a descriptive error body when the budget is exhausted.
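
Key creation can be scripted against the gateway's management API. The sketch below uses only the standard library; the `/key/generate` path and the `max_budget`, `budget_duration`, and `models` field names follow LiteLLM's documented key-management API, but verify them against the version you deploy before relying on this.

```python
import json
import urllib.request

# Hedged sketch of creating a virtual key via the LiteLLM management API.
# Field names and endpoint path are assumptions based on LiteLLM's docs;
# check your deployed version.
def create_virtual_key(gateway_url: str, master_key: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"{gateway_url}/key/generate",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
    )
    # Network call: run this against a live gateway.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = {
    "models": ["llama-4-scout"],  # restrict the key to this alias
    "max_budget": 500,            # USD cap
    "budget_duration": "30d",     # spend counter resets every 30 days
}
print(json.dumps(payload))
```

The returned key is what the team ships in its app as the Bearer token.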

Model aliases. You define friendly names like llama-4-scout in config, and point them at backend URLs. Your app calls model="llama-4-scout" and never knows whether it is hitting Spheron or an OpenAI fallback.
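
From the application side, the alias is the only model identifier that ever appears. A sketch of the request an app would send (the gateway URL and virtual key value are placeholders):

```python
import json

# The app references only the alias; the gateway decides whether
# "llama-4-scout" resolves to the Spheron vLLM backend or a fallback.
GATEWAY_URL = "http://localhost:4000/v1/chat/completions"  # assumed local gateway

request_body = {
    "model": "llama-4-scout",  # alias from the gateway config, not a provider model ID
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
}
headers = {
    "Authorization": "Bearer sk-team-a-virtual-key",  # virtual key, not a provider key
    "Content-Type": "application/json",
}
print(json.dumps(request_body))
```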

Redis-backed caching. Identical prompts return cached responses from Redis. Cost and latency both drop for repeated queries.
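
The exact-match cache mechanism is simple to sketch: hash the normalized request, look it up before touching the backend. A dict stands in for Redis here, and the backend call is stubbed.

```python
import hashlib
import json

# Illustrative exact-match response cache, keyed on a hash of the
# normalized request. A dict stands in for Redis.
cache = {}

def cache_key(model: str, messages: list) -> str:
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def complete(model: str, messages: list):
    key = cache_key(model, messages)
    if key in cache:
        return cache[key], True           # cache hit: no backend call
    response = f"response-from-{model}"   # stand-in for the real backend call
    cache[key] = response
    return response, False

_, hit1 = complete("llama-4-scout", [{"role": "user", "content": "Hello"}])
_, hit2 = complete("llama-4-scout", [{"role": "user", "content": "Hello"}])
print(hit1, hit2)  # False True
```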

Minimal config to get started:

yaml
model_list:
  - model_name: llama-4-scout
    litellm_params:
      model: openai/meta-llama/Llama-4-Scout-17B-16E-Instruct
      api_base: http://VLLM_HOST:8000/v1
      api_key: dummy-key

  - model_name: gpt-4o-mini
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

Limitations. LiteLLM has no built-in guardrails (content filtering, topic restrictions), no prompt versioning, and no native A/B testing. For those features, you are writing your own middleware or switching to Portkey.

Portkey AI Gateway

Portkey is a cloud SaaS gateway focused on guardrails, semantic caching, and prompt management. It has an SDK-first developer experience rather than a config-file approach.

Guardrails. Portkey's guardrails system lets you define content policies (block specific topics, detect PII, enforce output formats) that run on every request before it hits your model. These run on Portkey's infrastructure, not yours.

Semantic caching. Rather than exact-match caching, Portkey uses embedding similarity to return cached responses for semantically equivalent prompts. This has higher cache hit rates than Redis exact-match but adds a ~10-30ms embedding lookup on each request.
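
The mechanism is worth seeing concretely. This toy sketch uses bag-of-words cosine similarity instead of a learned embedding model, so it illustrates the lookup logic (and the threshold tradeoff), not the hit-rate quality Portkey gets from real embeddings.

```python
import math
from collections import Counter

# Toy semantic cache: a bag-of-words "embedding" stands in for a real
# embedding model to show the mechanism, not the quality.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cache = []       # list of (embedding, cached_response)
THRESHOLD = 0.8  # higher = fewer false hits, lower hit rate

def lookup(prompt: str):
    vec = embed(prompt)
    for cached_vec, response in cache:
        if cosine(vec, cached_vec) >= THRESHOLD:
            return response
    return None

cache.append((embed("what is the capital of france"), "Paris"))
print(lookup("what is the capital of france ?"))  # near-identical phrasing: hit
print(lookup("how do I bake bread"))              # unrelated: miss (None)
```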

Prompt versioning and A/B testing. Portkey stores prompts with version history and lets you run traffic splits between prompt versions. If you iterate on system prompts frequently, this is useful.

python
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="TEAM_VIRTUAL_KEY",
    config="PORTKEY_CONFIG_ID"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

Limitations. Portkey's semantic caching, guardrails, and prompt versioning are primarily cloud SaaS features. As of April 2026, Portkey offers an enterprise on-premises option, but most capabilities require the hosted plan. Teams that cannot send data to a third party for compliance reasons should not consider Portkey unless they have confirmed the on-prem tier covers their requirements. There is also vendor lock-in: the caching and guardrails features only work within Portkey's ecosystem.

Kong AI Gateway

Kong AI Gateway extends Kong's mature API management platform with LLM-specific plugins. If your organization already operates a Kong mesh, the AI gateway is a natural add-on rather than a new system to operate.

Plugin architecture. Kong's AI plugins include rate limiting (per-model, per-consumer, per-route), request transformation, and response filtering. The plugin model means you compose capabilities rather than configure a monolithic gateway.

| Plugin | What It Does | Edition |
|---|---|---|
| AI Proxy | Route requests to any LLM backend | Community |
| AI Rate Limiting Advanced | Token-bucket rate limiting per consumer | Enterprise |
| AI PII Sanitizer | Strip PII before sending to LLM | Enterprise |
| AI Semantic Caching | Embedding-based cache for LLM responses | Enterprise |
| OpenID Connect | Enterprise SSO with any OIDC provider | Enterprise |

Enterprise SSO. Kong's OIDC plugin integrates with Okta, Azure AD, and any OIDC provider. This is the right choice if your security team requires identity-provider-backed auth for every API gateway.

Limitations. Kong is heavy to operate without existing Kong infrastructure. PII redaction and enterprise SSO are enterprise-only features, requiring a paid license. For teams not already on Kong, LiteLLM is faster to get running and has lower operational overhead. The community edition has basic AI routing but lacks the features that make Kong compelling for AI workloads specifically.

Deploy LiteLLM in Front of vLLM on Spheron

This is the full deployment: a vLLM backend on Spheron serving Llama 4 Scout, fronted by LiteLLM proxy with Postgres for spend tracking and Redis for caching. Everything runs in Docker. If you prefer SGLang as your backend instead of vLLM (it has advantages for multi-turn and agentic workloads), the SGLang production deployment guide covers the same provisioning steps and produces an OpenAI-compatible endpoint that LiteLLM routes to identically.

Provision the vLLM backend

Launch an H100 GPU rental on Spheron (80GB, starting at $2.01/hr as of 23 Apr 2026) for 70B models. For 7B-13B models, L40S instances at $0.72/hr are the cost-efficient choice. SSH into the instance and start vLLM:

bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization fp8 \
  --max-model-len 16384 \
  --port 8000 \
  --host 0.0.0.0

Note the instance's private IP address. This becomes VLLM_HOST in the LiteLLM config. For multi-GPU setup or production load balancing, see the vLLM production deployment guide. If you are still deciding between vLLM and Ollama for your backend, Ollama vs vLLM breaks down the tradeoffs: Ollama is simpler for local prototyping, vLLM is the right call for production throughput at scale.

Write the LiteLLM config

Create litellm-config.yaml on the LiteLLM host (this can be a separate small CPU instance or the same machine if you are running everything together):

yaml
model_list:
  - model_name: llama-4-scout
    litellm_params:
      model: openai/meta-llama/Llama-4-Scout-17B-16E-Instruct
      api_base: http://VLLM_HOST:8000/v1
      api_key: dummy-key

  - model_name: gpt-4o-mini
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: cost-based-routing
  fallbacks: [{"llama-4-scout": ["gpt-4o-mini"]}]
  redis_host: redis
  redis_port: 6379
  redis_password: os.environ/REDIS_PASSWORD
  cache_responses: true

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

Replace VLLM_HOST with the actual private IP of your Spheron instance. Never hardcode a real IP in a shared config file; use environment variable substitution for anything environment-specific.

docker-compose.yml

yaml
version: "3.8"

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    environment:
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD}
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

Set LITELLM_MASTER_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD, and OPENAI_API_KEY in a .env file (never commit it). Run with docker-compose up -d.

Smoke test

bash
# Health check
curl http://localhost:4000/health

# Request routed to Spheron vLLM (llama-4-scout)
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-4-scout", "messages": [{"role": "user", "content": "Hello"}]}'

# Request routed to cloud fallback
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'

The health endpoint returns per-model status. If llama-4-scout shows unhealthy, check that vLLM is up on VLLM_HOST:8000 and that the security group allows traffic from the LiteLLM host.
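
A small script can turn that health check into a pass/fail gate for CI or a deploy pipeline. The `healthy_endpoints`/`unhealthy_endpoints` field names below are assumptions based on LiteLLM's `/health` output; verify the response shape of your deployed version before depending on it.

```python
import json

# Hedged sketch: parse the gateway health response and list models that
# are down. The field names are assumed; check your LiteLLM version.
sample = json.loads("""
{
  "healthy_endpoints": [{"model": "openai/meta-llama/Llama-4-Scout-17B-16E-Instruct"}],
  "unhealthy_endpoints": [{"model": "gpt-4o-mini"}],
  "healthy_count": 1,
  "unhealthy_count": 1
}
""")

def unhealthy_models(health: dict) -> list:
    return [e.get("model", "unknown") for e in health.get("unhealthy_endpoints", [])]

down = unhealthy_models(sample)
print(down)  # ['gpt-4o-mini']
```

In practice you would fetch the JSON from `http://localhost:4000/health` and fail the pipeline if the list is non-empty.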

Hybrid Routing: Spheron-first with Cloud Overflow

The fallbacks config in router_settings defines the routing priority. When llama-4-scout (your Spheron vLLM endpoint) returns a 503 or times out, LiteLLM automatically retries on gpt-4o-mini (your cloud fallback). Your application sends the same request and gets back a response. The provider switch is invisible.

The routing flow:

  1. Request arrives at LiteLLM on port 4000
  2. LiteLLM checks Redis cache: if hit, return cached response immediately
  3. LiteLLM forwards request to llama-4-scout (Spheron vLLM endpoint)
  4. If vLLM returns 503 or the connection times out: retry on gpt-4o-mini
  5. Return response to caller, log cost and latency to Postgres
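
The failover step in that flow can be sketched with stub backends. This is illustrative control flow only; LiteLLM additionally applies retry counts, timeouts, and the cache check before reaching this point.

```python
# Minimal sketch of Spheron-first routing with cloud overflow.
# Stub functions stand in for real HTTP calls to the backends.
class BackendUnavailable(Exception):
    pass

def call_spheron_vllm(prompt: str) -> str:
    # Simulate the self-hosted endpoint being saturated.
    raise BackendUnavailable("503: capacity full")

def call_openai_fallback(prompt: str) -> str:
    return f"[gpt-4o-mini] {prompt}"

def route(prompt: str):
    try:
        return call_spheron_vllm(prompt), "llama-4-scout"
    except BackendUnavailable:
        # Transparent failover: the caller never sees the 503.
        return call_openai_fallback(prompt), "gpt-4o-mini"

response, served_by = route("Hello")
print(served_by)  # gpt-4o-mini
```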

The cost math at 100 RPS with Llama 4 Scout (~300 tokens/response, FP8 on H100): the vLLM backend processes roughly 30,000 tokens per second, or 108M tokens per hour. At $2.01/hr for the H100, that works out to about $0.02 per 1M tokens. If 10% of traffic overflows to OpenAI GPT-4o-mini at $0.60/1M output tokens, the blended rate stays well below $1/1M tokens. Compare this to routing all traffic to OpenAI at $10/1M output tokens for GPT-4o. The savings compound fast at scale. To push throughput further without adding GPUs, KV cache optimization covers techniques that can cut KV cache memory use by 80%+, freeing headroom for more concurrent requests on the same H100.
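
The arithmetic is easy to sanity-check, taking the throughput estimate above as given:

```python
# Worked version of the cost math above.
rps = 100
tokens_per_response = 300
h100_usd_per_hr = 2.01

tokens_per_hr = rps * tokens_per_response * 3600            # 108,000,000 tokens/hr
self_hosted_per_m = h100_usd_per_hr / tokens_per_hr * 1e6   # USD per 1M tokens
print(f"self-hosted: ${self_hosted_per_m:.4f}/1M tokens")   # roughly $0.02/1M

# Blended rate with 10% of traffic overflowing to GPT-4o-mini
# at $0.60 per 1M output tokens
openai_per_m = 0.60
blended = 0.9 * self_hosted_per_m + 0.1 * openai_per_m
print(f"blended:     ${blended:.4f}/1M tokens")             # well below $1/1M
```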

Observability: OpenTelemetry, Langfuse, and Helicone

OpenTelemetry

Add these to the LiteLLM environment block in docker-compose.yml:

yaml
- OTEL_EXPORTER=otlp_http
- OTEL_ENDPOINT=http://langfuse:3000/api/public/otel
- OTEL_SERVICE_NAME=litellm-proxy

LiteLLM emits one span per request with these attributes: model, virtual_key, total_tokens, prompt_tokens, completion_tokens, latency_ms, cost_usd, provider. Every request is traceable from the virtual key that made it through to the backend that served it.

Langfuse

Add Langfuse to the same compose file:

yaml
  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://langfuse:${LANGFUSE_POSTGRES_PASSWORD}@postgres_langfuse:5432/langfuse
      - NEXTAUTH_SECRET=${LANGFUSE_SECRET}
      - NEXTAUTH_URL=${LANGFUSE_URL:-http://localhost:3000}
      - ENCRYPTION_KEY=${LANGFUSE_ENCRYPTION_KEY}
      - SALT=${LANGFUSE_SALT}
    depends_on:
      postgres_langfuse:
        condition: service_healthy

  postgres_langfuse:
    image: postgres:15
    environment:
      - POSTGRES_USER=langfuse
      - POSTGRES_PASSWORD=${LANGFUSE_POSTGRES_PASSWORD}
      - POSTGRES_DB=langfuse
    volumes:
      - langfuse_postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U langfuse"]
      interval: 10s
      timeout: 5s
      retries: 5

When combining these snippets into a single compose file, two additions are needed:

  • Add langfuse_postgres_data: to the volumes: block at the bottom of your compose file, alongside postgres_data and redis_data.
  • Add - langfuse to the depends_on list of the litellm service. Without it, LiteLLM can start before Langfuse is ready on port 3000, dropping the first traces on slower systems or during Postgres initialization.

Then add LANGFUSE_POSTGRES_PASSWORD, LANGFUSE_SECRET, LANGFUSE_ENCRYPTION_KEY, and LANGFUSE_SALT to your .env file. LANGFUSE_ENCRYPTION_KEY must be a 64-character hex string (run openssl rand -hex 32 to generate one), and LANGFUSE_SALT must be a random string (run openssl rand -hex 16). Langfuse 3.x treats both as required and exits immediately at startup if either is missing.

Access the Langfuse UI at port 3000. Filter traces by virtual key to see spend per team. Filter by model to compare llama-4-scout vs gpt-4o-mini latency distributions. The p95 latency breakdown shows whether LiteLLM overhead or the backend model is your bottleneck.

Helicone

Helicone is an alternative to Langfuse with a simpler setup: a couple of lines in the LiteLLM config route observability data to Helicone's cloud.

yaml
general_settings:
  success_callback: ["helicone"]
  helicone_api_key: os.environ/HELICONE_API_KEY

No self-hosted Langfuse instance needed. The tradeoff: your request data goes to Helicone's cloud infrastructure.

| Tool | Self-Hosted | Setup Effort | Key Features |
|---|---|---|---|
| Langfuse | Yes | Medium (add to compose) | Traces, evals, user tracking, cost per team |
| Helicone | No (cloud) | Low (config callback) | Cost analytics, latency dashboards, prompt versioning |
| OTLP collector | Yes | High (custom stack) | Full flexibility, export to any backend |

Cost and Latency at 100 RPS

| Setup | Avg Latency Overhead | 8-Hour Run Cost | Provider Breakdown |
|---|---|---|---|
| Direct vLLM (no gateway) | 0ms | $16.08 (H100 only) | 100% Spheron |
| LiteLLM + Spheron vLLM | Single-digit ms | $16.08 + ~$0.50 (LiteLLM CPU instance) | 100% Spheron |
| LiteLLM + hybrid routing (90% Spheron, 10% OpenAI) | Single-digit ms primary, +50-100ms on overflow | ~$17.00 total | ~95% Spheron, ~5% OpenAI cost |

At the H100 PCIe rate of $2.01/hr on Spheron (as of 23 Apr 2026), an 8-hour run costs $16.08. LiteLLM proxy itself runs on a small CPU instance ($0.05-0.10/hr). The 10% overflow to OpenAI adds minimal cost because GPT-4o-mini is cheap and the volume is low.

Pricing fluctuates based on GPU availability. The prices above are based on 23 Apr 2026 and may have changed. Check current GPU pricing for live rates.

When NOT to Use an AI Gateway

  • Single model, single provider, no migration plans. A direct vLLM endpoint is simpler, faster, and has zero additional ops overhead. Don't add a gateway because it sounds like the right thing to do.
  • Ultra-low-latency workloads. Real-time voice AI under 150ms TTFT or high-frequency trading where 3ms overhead is material. Even the best gateway adds a round-trip.
  • Very small teams. If two engineers are running inference for an internal tool, the ops burden of another service (Postgres, Redis, the gateway itself) is not worth it.
  • Pure batch workloads. Async jobs processing documents overnight don't need request-level auth or per-team budgets. Run vLLM directly.

Spheron GPU cloud gives you the self-hosted cost anchor that makes hybrid routing economically rational. Use A100 for mid-tier inference on 13B-40B models, pair H100s for 70B workloads, and let LiteLLM handle the cloud overflow automatically.


Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.