Tutorial

AI Gateway Setup 2026: LiteLLM, Portkey, and Kong AI Gateway for Multi-Model LLM Traffic

Written by Mitrasish, Co-founder · Apr 23, 2026
Tags: AI Gateway, LLM Gateway, LiteLLM, Portkey AI Gateway, Kong AI Gateway, Multi-Model LLM, LLM Routing, OpenTelemetry, Virtual Keys, Self-Hosted LLM

Teams are migrating off single-provider SDKs fast. Multi-model stacks are now the norm: a self-hosted Llama 4 endpoint for cost-sensitive traffic, GPT-4o for complex tasks, Claude Sonnet as a quality backstop. But the routing, auth, and observability layer that ties all of this together is usually missing. If you have already built your own OpenAI-compatible self-hosted endpoint and layered on a complexity-based inference router, an AI gateway is the next piece: it adds virtual key auth, per-team budget caps, and automatic multi-provider failover above everything else you have already deployed.

This post compares three gateways, then walks through a full LiteLLM deployment fronting a Spheron vLLM backend with hybrid Spheron-first routing.

| Gateway | Best For | Self-Hosted | Open Source | Budget Enforcement |
|---|---|---|---|---|
| LiteLLM | Self-hosted + cloud mix, multi-provider routing | Yes | Yes | Yes (virtual keys) |
| Portkey | Cloud-only teams, semantic caching, guardrails | Limited | No | Yes (per-key) |
| Kong AI Gateway | Enterprise Kong mesh, SSO, plugin ecosystem | Yes | Community edition | Yes (plugin-based) |

Why AI Gateways Exist

A single vLLM endpoint handles one model and one team. Once you add a second model, a second team, or a second provider, you need a management layer. AI gateways solve five specific problems:

  • Virtual keys, not API keys. You issue a virtual key per team or app. The real provider key lives only inside the gateway. Rotate the real key without touching any client code.
  • Rate limiting per team, per model. A team can be capped at 100 RPM without affecting other teams sharing the same backend pool.
  • Hard budget caps. Set a monthly dollar limit per virtual key. When the team hits $500, requests return 429 immediately. No surprise bills.
  • Full request logging. Every request is logged with model name, token counts, latency, cost, and the virtual key that made it. Required for compliance and chargeback.
  • Automatic failover. If your primary Spheron vLLM instance returns 503, the gateway retries on OpenAI or another fallback without any change to client code.
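
The budget-cap behavior can be sketched in a few lines. This is an illustrative, in-memory model of per-key spend tracking, not LiteLLM's implementation (which persists spend to Postgres); the class and method names are made up for the example.

```python
# Minimal sketch of per-key budget enforcement. Illustrative only:
# a real gateway persists spend durably instead of using a dict.
class BudgetExceeded(Exception):
    """Maps to an HTTP 429 response at the gateway edge."""

class VirtualKeyLedger:
    def __init__(self):
        self.limits = {}   # virtual key -> monthly USD cap
        self.spend = {}    # virtual key -> USD spent this month

    def issue_key(self, key: str, monthly_budget_usd: float) -> None:
        self.limits[key] = monthly_budget_usd
        self.spend[key] = 0.0

    def charge(self, key: str, cost_usd: float) -> None:
        # Reject the request that would push the key past its cap.
        if self.spend[key] + cost_usd > self.limits[key]:
            raise BudgetExceeded(f"key {key} exceeded ${self.limits[key]:.2f} budget")
        self.spend[key] += cost_usd

ledger = VirtualKeyLedger()
ledger.issue_key("sk-team-a", monthly_budget_usd=500.0)
ledger.charge("sk-team-a", 499.0)    # under budget: accepted
try:
    ledger.charge("sk-team-a", 2.0)  # would pass $500: rejected
    blocked = False
except BudgetExceeded:
    blocked = True
print(blocked)  # True
```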

AI Gateway vs Inference Router vs Reverse Proxy

These three layers stack, but they do different jobs:

| Layer | Job | Tool |
|---|---|---|
| Load balancing | Distribute traffic across identical model replicas | NGINX, HAProxy |
| Complexity routing | Route simple queries to cheap models, complex ones to expensive models | Inference router (LLM-based classifier) |
| Multi-provider auth + observability | Auth, budgets, logging, failover across providers | AI gateway |

They stack together naturally: vLLM replicas sit behind NGINX, that cluster is fronted by an inference router that classifies query complexity, and the AI gateway sits above the router handling what NGINX cannot: virtual key auth, per-team spend, and failover to cloud APIs when your self-hosted capacity is full.

LiteLLM

LiteLLM is the most widely deployed open-source AI gateway: 40k+ GitHub stars, 100+ provider integrations, and it speaks the OpenAI protocol natively, so your existing application code needs zero changes.

What it does well. Every provider gets a unified interface: openai, anthropic, cohere, azure, bedrock, and your own openai-compatible vLLM endpoints all work behind a single /v1/chat/completions URL. Virtual keys are first-class: you create them in the LiteLLM UI or via API, attach budget limits and model restrictions, and LiteLLM tracks spend against those limits in Postgres.

Virtual keys and budgets. Create a key per team, set max_budget_usd=500 and budget_duration=monthly, and restrict it to specific model aliases. The key is what your app sends as the Bearer token. LiteLLM maintains a running spend counter in Postgres and returns a 429 with a descriptive error body when the budget is exhausted.
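
Key creation can be scripted against the gateway's management API. The sketch below uses only the standard library; the `/key/generate` path and the `max_budget`, `budget_duration`, and `models` field names follow LiteLLM's documented key-management API, but verify them against the version you deploy before relying on this.

```python
import json
import urllib.request

# Hedged sketch of creating a virtual key via the LiteLLM management API.
# Field names and endpoint path are assumptions based on LiteLLM's docs;
# check your deployed version.
def create_virtual_key(gateway_url: str, master_key: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"{gateway_url}/key/generate",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
    )
    # Network call: run this against a live gateway.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = {
    "models": ["llama-4-scout"],  # restrict the key to this alias
    "max_budget": 500,            # USD cap
    "budget_duration": "30d",     # spend counter resets every 30 days
}
print(json.dumps(payload))
```

The returned key is what the team ships in its app as the Bearer token.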

Model aliases. You define friendly names like llama-4-scout in config, and point them at backend URLs. Your app calls model="llama-4-scout" and never knows whether it is hitting Spheron or an OpenAI fallback.
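
From the application side, the alias is the only model identifier that ever appears. A sketch of the request an app would send (the gateway URL and virtual key value are placeholders):

```python
import json

# The app references only the alias; the gateway decides whether
# "llama-4-scout" resolves to the Spheron vLLM backend or a fallback.
GATEWAY_URL = "http://localhost:4000/v1/chat/completions"  # assumed local gateway

request_body = {
    "model": "llama-4-scout",  # alias from the gateway config, not a provider model ID
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
}
headers = {
    "Authorization": "Bearer sk-team-a-virtual-key",  # virtual key, not a provider key
    "Content-Type": "application/json",
}
print(json.dumps(request_body))
```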

Redis-backed caching. Identical prompts return cached responses from Redis. Cost and latency both drop for repeated queries.
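
The exact-match cache mechanism is simple to sketch: hash the normalized request, look it up before touching the backend. A dict stands in for Redis here, and the backend call is stubbed.

```python
import hashlib
import json

# Illustrative exact-match response cache, keyed on a hash of the
# normalized request. A dict stands in for Redis.
cache = {}

def cache_key(model: str, messages: list) -> str:
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def complete(model: str, messages: list):
    key = cache_key(model, messages)
    if key in cache:
        return cache[key], True           # cache hit: no backend call
    response = f"response-from-{model}"   # stand-in for the real backend call
    cache[key] = response
    return response, False

_, hit1 = complete("llama-4-scout", [{"role": "user", "content": "Hello"}])
_, hit2 = complete("llama-4-scout", [{"role": "user", "content": "Hello"}])
print(hit1, hit2)  # False True
```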

Minimal config to get started:

yaml
model_list:
  - model_name: llama-4-scout
    litellm_params:
      model: openai/meta-llama/Llama-4-Scout-17B-16E-Instruct
      api_base: http://VLLM_HOST:8000/v1
      api_key: dummy-key

  - model_name: gpt-4o-mini
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

Limitations. LiteLLM has no built-in guardrails (content filtering, topic restrictions), no prompt versioning, and no native A/B testing. For those features, you are writing your own middleware or switching to Portkey.

Portkey AI Gateway

Portkey is a cloud SaaS gateway focused on guardrails, semantic caching, and prompt management. It has an SDK-first developer experience rather than a config-file approach.

Guardrails. Portkey's guardrails system lets you define content policies (block specific topics, detect PII, enforce output formats) that run on every request before it hits your model. These run on Portkey's infrastructure, not yours.

Semantic caching. Rather than exact-match caching, Portkey uses embedding similarity to return cached responses for semantically equivalent prompts. This has higher cache hit rates than Redis exact-match but adds a ~10-30ms embedding lookup on each request.
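
The mechanism is worth seeing concretely. This toy sketch uses bag-of-words cosine similarity instead of a learned embedding model, so it illustrates the lookup logic (and the threshold tradeoff), not the hit-rate quality Portkey gets from real embeddings.

```python
import math
from collections import Counter

# Toy semantic cache: a bag-of-words "embedding" stands in for a real
# embedding model to show the mechanism, not the quality.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cache = []       # list of (embedding, cached_response)
THRESHOLD = 0.8  # higher = fewer false hits, lower hit rate

def lookup(prompt: str):
    vec = embed(prompt)
    for cached_vec, response in cache:
        if cosine(vec, cached_vec) >= THRESHOLD:
            return response
    return None

cache.append((embed("what is the capital of france"), "Paris"))
print(lookup("what is the capital of france ?"))  # near-identical phrasing: hit
print(lookup("how do I bake bread"))              # unrelated: miss (None)
```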

Prompt versioning and A/B testing. Portkey stores prompts with version history and lets you run traffic splits between prompt versions. If you iterate on system prompts frequently, this is useful.

python
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="TEAM_VIRTUAL_KEY",
    config="PORTKEY_CONFIG_ID"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

Limitations. Portkey's semantic caching, guardrails, and prompt versioning are primarily cloud SaaS features. As of April 2026, Portkey offers an enterprise on-premises option, but most capabilities require the hosted plan. Teams that cannot send data to a third party for compliance reasons should not consider Portkey unless they have confirmed the on-prem tier covers their requirements. There is also vendor lock-in: the caching and guardrails features only work within Portkey's ecosystem.

Kong AI Gateway

Kong AI Gateway extends Kong's mature API management platform with LLM-specific plugins. If your organization already operates a Kong mesh, the AI gateway is a natural add-on rather than a new system to operate.

Plugin architecture. Kong's AI plugins include rate limiting (per-model, per-consumer, per-route), request transformation, and response filtering. The plugin model means you compose capabilities rather than configure a monolithic gateway.

| Plugin | What It Does | Edition |
|---|---|---|
| AI Proxy | Route requests to any LLM backend | Community |
| AI Rate Limiting Advanced | Token-bucket rate limiting per consumer | Enterprise |
| AI PII Sanitizer | Strip PII before sending to LLM | Enterprise |
| AI Semantic Caching | Embedding-based cache for LLM responses | Enterprise |
| OpenID Connect | Enterprise SSO with any OIDC provider | Enterprise |

Enterprise SSO. Kong's OIDC plugin integrates with Okta, Azure AD, and any OIDC provider. This is the right choice if your security team requires identity-provider-backed auth for every API gateway.

Limitations. Kong is heavy to operate without existing Kong infrastructure. PII redaction and enterprise SSO are enterprise-only features, requiring a paid license. For teams not already on Kong, LiteLLM is faster to get running and has lower operational overhead. The community edition has basic AI routing but lacks the features that make Kong compelling for AI workloads specifically.

Deploy LiteLLM in Front of vLLM on Spheron

This is the full deployment: a vLLM backend on Spheron serving Llama 4 Scout, fronted by LiteLLM proxy with Postgres for spend tracking and Redis for caching. Everything runs in Docker. If you prefer SGLang as your backend instead of vLLM (it has advantages for multi-turn and agentic workloads), the SGLang production deployment guide covers the same provisioning steps and produces an OpenAI-compatible endpoint that LiteLLM routes to identically.

Provision the vLLM backend

Launch an H100 GPU rental on Spheron (80GB, starting at $2.01/hr as of 23 Apr 2026) for 70B models. For 7B-13B models, L40S instances at $0.72/hr are the cost-efficient choice. SSH into the instance and start vLLM:

bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization fp8 \
  --max-model-len 16384 \
  --port 8000 \
  --host 0.0.0.0

Note the instance's private IP address. This becomes VLLM_HOST in the LiteLLM config. For multi-GPU setup or production load balancing, see the vLLM production deployment guide. If you are still deciding between vLLM and Ollama for your backend, Ollama vs vLLM breaks down the tradeoffs: Ollama is simpler for local prototyping, vLLM is the right call for production throughput at scale.

Write the LiteLLM config

Create litellm-config.yaml on the LiteLLM host (this can be a separate small CPU instance or the same machine if you are running everything together):

yaml
model_list:
  - model_name: llama-4-scout
    litellm_params:
      model: openai/meta-llama/Llama-4-Scout-17B-16E-Instruct
      api_base: http://VLLM_HOST:8000/v1
      api_key: dummy-key

  - model_name: gpt-4o-mini
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: cost-based-routing
  fallbacks: [{"llama-4-scout": ["gpt-4o-mini"]}]
  redis_host: redis
  redis_port: 6379
  redis_password: os.environ/REDIS_PASSWORD
  cache_responses: true

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true

Replace VLLM_HOST with the actual private IP of your Spheron instance. Never hardcode a real IP in a shared config file; use environment variable substitution for anything environment-specific.

docker-compose.yml

yaml
version: "3.8"

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    environment:
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD}
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

Set LITELLM_MASTER_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD, and OPENAI_API_KEY in a .env file (never commit it). Run with docker-compose up -d.

Smoke test

bash
# Health check
curl http://localhost:4000/health

# Request routed to Spheron vLLM (llama-4-scout)
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-4-scout", "messages": [{"role": "user", "content": "Hello"}]}'

# Request routed to cloud fallback
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'

The health endpoint returns per-model status. If llama-4-scout shows unhealthy, check that vLLM is up on VLLM_HOST:8000 and that the security group allows traffic from the LiteLLM host.
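
A small script can turn that health check into a pass/fail gate for CI or a deploy pipeline. The `healthy_endpoints`/`unhealthy_endpoints` field names below are assumptions based on LiteLLM's `/health` output; verify the response shape of your deployed version before depending on it.

```python
import json

# Hedged sketch: parse the gateway health response and list models that
# are down. The field names are assumed; check your LiteLLM version.
sample = json.loads("""
{
  "healthy_endpoints": [{"model": "openai/meta-llama/Llama-4-Scout-17B-16E-Instruct"}],
  "unhealthy_endpoints": [{"model": "gpt-4o-mini"}],
  "healthy_count": 1,
  "unhealthy_count": 1
}
""")

def unhealthy_models(health: dict) -> list:
    return [e.get("model", "unknown") for e in health.get("unhealthy_endpoints", [])]

down = unhealthy_models(sample)
print(down)  # ['gpt-4o-mini']
```

In practice you would fetch the JSON from `http://localhost:4000/health` and fail the pipeline if the list is non-empty.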

Hybrid Routing: Spheron-first with Cloud Overflow

The fallbacks config in router_settings defines the routing priority. When llama-4-scout (your Spheron vLLM endpoint) returns a 503 or times out, LiteLLM automatically retries on gpt-4o-mini (your cloud fallback). Your application sends the same request and gets back a response. The provider switch is invisible.

The routing flow:

  1. Request arrives at LiteLLM on port 4000
  2. LiteLLM checks Redis cache: if hit, return cached response immediately
  3. LiteLLM forwards request to llama-4-scout (Spheron vLLM endpoint)
  4. If vLLM returns 503 or the connection times out: retry on gpt-4o-mini
  5. Return response to caller, log cost and latency to Postgres
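
The failover step in that flow can be sketched with stub backends. This is illustrative control flow only; LiteLLM additionally applies retry counts, timeouts, and the cache check before reaching this point.

```python
# Minimal sketch of Spheron-first routing with cloud overflow.
# Stub functions stand in for real HTTP calls to the backends.
class BackendUnavailable(Exception):
    pass

def call_spheron_vllm(prompt: str) -> str:
    # Simulate the self-hosted endpoint being saturated.
    raise BackendUnavailable("503: capacity full")

def call_openai_fallback(prompt: str) -> str:
    return f"[gpt-4o-mini] {prompt}"

def route(prompt: str):
    try:
        return call_spheron_vllm(prompt), "llama-4-scout"
    except BackendUnavailable:
        # Transparent failover: the caller never sees the 503.
        return call_openai_fallback(prompt), "gpt-4o-mini"

response, served_by = route("Hello")
print(served_by)  # gpt-4o-mini
```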

The cost math at 100 RPS with Llama 4 Scout (~300 tokens/response, FP8 on H100): the vLLM backend processes roughly 30,000 tokens per second, or 108M tokens per hour. At $2.01/hr for the H100, that works out to about $0.02 per 1M tokens. If 10% of traffic overflows to OpenAI GPT-4o-mini at $0.60/1M output tokens, the blended rate stays well below $1/1M tokens. Compare this to routing all traffic to OpenAI at $10/1M output tokens for GPT-4o. The savings compound fast at scale. To push throughput further without adding GPUs, KV cache optimization covers techniques that can cut KV cache memory use by 80%+, freeing headroom for more concurrent requests on the same H100.
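
The arithmetic is easy to sanity-check, taking the throughput estimate above as given:

```python
# Worked version of the cost math above.
rps = 100
tokens_per_response = 300
h100_usd_per_hr = 2.01

tokens_per_hr = rps * tokens_per_response * 3600            # 108,000,000 tokens/hr
self_hosted_per_m = h100_usd_per_hr / tokens_per_hr * 1e6   # USD per 1M tokens
print(f"self-hosted: ${self_hosted_per_m:.4f}/1M tokens")   # roughly $0.02/1M

# Blended rate with 10% of traffic overflowing to GPT-4o-mini
# at $0.60 per 1M output tokens
openai_per_m = 0.60
blended = 0.9 * self_hosted_per_m + 0.1 * openai_per_m
print(f"blended:     ${blended:.4f}/1M tokens")             # well below $1/1M
```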

Observability: OpenTelemetry, Langfuse, and Helicone

OpenTelemetry

Add these to the LiteLLM environment block in docker-compose.yml:

yaml
- OTEL_EXPORTER=otlp_http
- OTEL_ENDPOINT=http://langfuse:3000/api/public/otel
- OTEL_SERVICE_NAME=litellm-proxy

LiteLLM emits one span per request with these attributes: model, virtual_key, total_tokens, prompt_tokens, completion_tokens, latency_ms, cost_usd, provider. Every request is traceable from the virtual key that made it through to the backend that served it.

Langfuse

Add Langfuse to the same compose file:

yaml
  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://langfuse:${LANGFUSE_POSTGRES_PASSWORD}@postgres_langfuse:5432/langfuse
      - NEXTAUTH_SECRET=${LANGFUSE_SECRET}
      - NEXTAUTH_URL=${LANGFUSE_URL:-http://localhost:3000}
      - ENCRYPTION_KEY=${LANGFUSE_ENCRYPTION_KEY}
      - SALT=${LANGFUSE_SALT}
    depends_on:
      postgres_langfuse:
        condition: service_healthy

  postgres_langfuse:
    image: postgres:15
    environment:
      - POSTGRES_USER=langfuse
      - POSTGRES_PASSWORD=${LANGFUSE_POSTGRES_PASSWORD}
      - POSTGRES_DB=langfuse
    volumes:
      - langfuse_postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U langfuse"]
      interval: 10s
      timeout: 5s
      retries: 5

When combining these snippets into a single compose file, two additions are needed:

  • Add langfuse_postgres_data: to the volumes: block at the bottom of your compose file, alongside postgres_data and redis_data.
  • Add - langfuse to the depends_on list of the litellm service. Without it, LiteLLM can start before Langfuse is ready on port 3000, dropping the first traces on slower systems or during Postgres initialization.

Then add LANGFUSE_POSTGRES_PASSWORD, LANGFUSE_SECRET, LANGFUSE_ENCRYPTION_KEY, and LANGFUSE_SALT to your .env file. LANGFUSE_ENCRYPTION_KEY must be a 64-character hex string (run openssl rand -hex 32 to generate one), and LANGFUSE_SALT must be a random string (run openssl rand -hex 16). Langfuse 3.x treats both as required and exits immediately at startup if either is missing.

Access the Langfuse UI at port 3000. Filter traces by virtual key to see spend per team. Filter by model to compare llama-4-scout vs gpt-4o-mini latency distributions. The p95 latency breakdown shows whether LiteLLM overhead or the backend model is your bottleneck.

Helicone

Helicone is an alternative to Langfuse with a simpler setup: a couple of lines in the LiteLLM config route observability data to Helicone's cloud.

yaml
general_settings:
  success_callback: ["helicone"]
  helicone_api_key: os.environ/HELICONE_API_KEY

No self-hosted Langfuse instance needed. The tradeoff: your request data goes to Helicone's cloud infrastructure.

| Tool | Self-Hosted | Setup Effort | Key Features |
|---|---|---|---|
| Langfuse | Yes | Medium (add to compose) | Traces, evals, user tracking, cost per team |
| Helicone | No (cloud) | Low (config callback) | Cost analytics, latency dashboards, prompt versioning |
| OTLP collector | Yes | High (custom stack) | Full flexibility, export to any backend |

Cost and Latency at 100 RPS

| Setup | Avg Latency Overhead | 8-Hour Run Cost | Provider Breakdown |
|---|---|---|---|
| Direct vLLM (no gateway) | 0ms | $16.08 (H100 only) | 100% Spheron |
| LiteLLM + Spheron vLLM | Single-digit ms | $16.08 + ~$0.50 (LiteLLM CPU instance) | 100% Spheron |
| LiteLLM + hybrid routing (90% Spheron, 10% OpenAI) | Single-digit ms primary, +50-100ms on overflow | ~$17.00 total | ~95% Spheron, ~5% OpenAI cost |

At the H100 PCIe rate of $2.01/hr on Spheron (as of 23 Apr 2026), an 8-hour run costs $16.08. LiteLLM proxy itself runs on a small CPU instance ($0.05-0.10/hr). The 10% overflow to OpenAI adds minimal cost because GPT-4o-mini is cheap and the volume is low.

Pricing fluctuates based on GPU availability. The prices above are based on 23 Apr 2026 and may have changed. Check current GPU pricing for live rates.

When NOT to Use an AI Gateway

  • Single model, single provider, no migration plans. A direct vLLM endpoint is simpler, faster, and has zero additional ops overhead. Don't add a gateway because it sounds like the right thing to do.
  • Ultra-low-latency workloads. Real-time voice AI under 150ms TTFT or high-frequency trading where 3ms overhead is material. Even the best gateway adds a round-trip.
  • Very small teams. If two engineers are running inference for an internal tool, the ops burden of another service (Postgres, Redis, the gateway itself) is not worth it.
  • Pure batch workloads. Async jobs processing documents overnight don't need request-level auth or per-team budgets. Run vLLM directly.

Spheron GPU cloud gives you the self-hosted cost anchor that makes hybrid routing economically rational. Use A100 for mid-tier inference on 13B-40B models, pair H100s for 70B workloads, and let LiteLLM handle the cloud overflow automatically.


Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.