Most teams running LLMs in production are paying for a 70B model to answer "What is the capital of France?" Running every query through your largest model is the easiest architecture and the most expensive one. A 70B model on H100 costs $2.40/hr on-demand - sending every request there when 60% of them could be answered by a 7B model at $0.51/hr is burning money. The fix is a routing layer that classifies each query and sends it to the cheapest model that can handle it. This post covers how to build that router: classifier design, NGINX proxy setup, GPU tier selection, and the monitoring you need to catch quality drift before users notice. For background on why inference costs compound quickly at scale, see the reasoning model inference cost guide and the GPU cost optimization playbook. If you need auth, multi-provider failover, and per-team budget tracking on top of your routing layer, see the LiteLLM and AI gateway deployment guide - the gateway sits above the router to handle what NGINX cannot.
Why You Need an LLM Inference Router in 2026
The RunPod March 2026 State of AI report showed that Qwen overtook Llama as the most-deployed self-hosted model family. Teams are now routinely running Qwen2.5-7B alongside Qwen2.5-72B and Llama-3.3-70B simultaneously on the same infrastructure. That multi-model footprint makes routing economically rational in a way it wasn't two years ago.
The core math is simple. A typical production workload breaks down roughly like this:
- ~60% simple queries: factual lookups, short summaries, classification, simple Q&A
- ~25% medium queries: document analysis, multi-turn conversation, moderate reasoning
- ~15% complex queries: code generation, multi-step reasoning, long-form synthesis
If you send all of that to one large model, you pay the large model price for every request. If you route by complexity, your average cost drops to something close to a weighted average:
| Strategy | GPU | On-Demand ($/hr) | Handles |
|---|---|---|---|
| No routing (all queries) | H100 SXM5 | $2.40 | 100% of traffic |
| Tier 1 (routed) | RTX 4090 | $0.51 | ~60% of traffic |
| Tier 2 (routed) | A100 80GB PCIe | $1.43 | ~25% of traffic |
| Tier 3 (routed) | H100 SXM5 | $2.40 | ~15% of traffic |
The weighted average GPU cost with routing: (0.60 × $0.51) + (0.25 × $1.43) + (0.15 × $2.40) = $0.306 + $0.358 + $0.360 = $1.02/hr equivalent. That's a 57% cost reduction before you touch spot pricing.
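The arithmetic above is easy to re-run with your own traffic mix. The snippet below is a minimal sketch that reproduces the blended rate from the table; the `TIERS` dictionary layout and the function names are illustrative assumptions, not part of any library.

```python
# Hourly on-demand rates and traffic shares copied from the table above.
# The dictionary layout and names are illustrative assumptions.
TIERS = {
    "tier1_rtx4090": {"rate": 0.51, "share": 0.60},
    "tier2_a100": {"rate": 1.43, "share": 0.25},
    "tier3_h100": {"rate": 2.40, "share": 0.15},
}
BASELINE_RATE = 2.40  # every request on H100 SXM5


def blended_rate(tiers: dict) -> float:
    """Traffic-weighted average hourly GPU rate."""
    total_share = sum(t["share"] for t in tiers.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"traffic shares must sum to 1.0, got {total_share}")
    return sum(t["rate"] * t["share"] for t in tiers.values())


def savings_vs_baseline(tiers: dict, baseline: float) -> float:
    """Fractional reduction relative to the single-model baseline."""
    return 1.0 - blended_rate(tiers) / baseline


if __name__ == "__main__":
    print(f"blended rate: ${blended_rate(TIERS):.2f}/hr")
    print(f"savings: {savings_vs_baseline(TIERS, BASELINE_RATE):.1%}")
```

Swap in the shares you measure from your own logs; the result is only as good as the traffic split you feed it.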
Router Architecture Overview
The router sits between your clients and your model pool. Every request passes through it, gets classified, and gets forwarded to the appropriate model tier.
Client Request
      |
      v
[Router Proxy - NGINX]
      |
      v
[Classifier Microservice - FastAPI]
      |
      +---> Tier 1: [vLLM pool - RTX 4090]   (7B-13B models)
      |
      +---> Tier 2: [vLLM pool - A100 80GB]  (32B-70B models)
      |
      +---> Tier 3: [vLLM pool - H100 SXM5]  (70B-200B+ models)

Three components:
- Router proxy (NGINX or Envoy): receives all incoming requests, calls the classifier, proxies to the correct upstream. Handles connection pooling and retry logic.
- Classifier microservice (FastAPI): reads the request body, runs the complexity classification pipeline, returns a routing tier in a response header. Should complete in under 5ms.
- Model pool (vLLM per tier): each tier is one or more vLLM instances behind a load balancer. All expose OpenAI-compatible /v1/chat/completions endpoints so clients don't need to change anything.
For vLLM setup on each tier, see the vLLM production deployment guide. If you want to use SGLang for your Tier 3 (better throughput on long-context workloads), the SGLang production deployment guide covers that setup.
Model Tier Strategy: Small vs Medium vs Large
| Tier | Models | Params | Use Case | GPU | On-Demand | Spot |
|---|---|---|---|---|---|---|
| Tier 1 | Qwen2.5-7B, Llama-3.2-3B | 3B-13B | Simple Q&A, classification, short summaries | RTX 4090 PCIe | $0.51/hr | N/A |
| Tier 2 | Qwen2.5-32B, Llama-3.1-8B+ | 8B-32B | Document analysis, moderate reasoning, multi-turn | A100 80GB PCIe | $1.43/hr | $1.14/hr |
| Tier 3 | Llama-3.3-70B, Qwen2.5-72B, 200B+ | 70B+ | Code, complex reasoning, long-form synthesis | H100 SXM5 | $2.40/hr | $0.80/hr |
For detailed per-GPU memory requirements at each model size, see the GPU memory requirements guide and the best GPU for AI inference 2026 guide.
Pricing fluctuates with GPU availability. The prices above are as of 5 April 2026 and may have changed; check current GPU pricing for live rates.
Choosing Tier 1 Models
Qwen2.5-7B-Instruct is a strong default: it scores higher than many 13B models from 2024 on standard benchmarks, fits on a single RTX 4090 in FP16 (model weights ~14GB, ~17GB+ total with KV cache), and costs $0.51/hr. Llama-3.2-3B is a reasonable fallback if you need even lower latency at the cost of some quality headroom.
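The "~14GB of weights" figure follows from parameters times bytes per parameter. Here is a rough sketch of that check, useful when deciding which models can share a tier. The 3GB headroom default is an assumption chosen to match the "~17GB+ total" estimate above; real KV cache needs grow with context length and concurrency, so treat the result as a first filter only.

```python
# Approximate bytes per parameter for common serving precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Covers weights only; KV cache and activations are extra.
    """
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, vram_gb: float,
         headroom_gb: float = 3.0) -> bool:
    """True if weights plus an assumed KV-cache/runtime headroom fit in VRAM.

    headroom_gb is a hypothetical default, not a measured value.
    """
    return weight_gb(params_billion, precision) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for params, precision in [(7, "fp16"), (13, "fp16"), (13, "int4")]:
        verdict = "fits" if fits(params, precision, vram_gb=24) else "does not fit"
        print(f"{params}B {precision}: ~{weight_gb(params, precision):.1f} GB, {verdict} on 24 GB")
```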
For structured reasoning tasks specifically (constraint satisfaction, ARC-style problems, classification with iterative refinement), the Hierarchical Reasoning Model is a strong candidate for the small tier: a 27M-parameter model that handles structured reasoning on a single RTX 4090 before escalating to a larger backend, at a fraction of even 7B model costs.
Run your Tier 1 on spot instances where available. Tier 1 handles simple, stateless queries - an interruption just means retrying the request, which the router handles automatically in the fallback chain.
Choosing Tier 3 Models
For Tier 3, the choice between Llama-3.3-70B and Qwen2.5-72B depends on your workload. Llama-3.3-70B is stronger on English code generation. Qwen2.5-72B has better multilingual performance and math reasoning. Both fit on a single H100 80GB in FP8.
Deploying the Model Pool with vLLM
Provision one GPU node per tier on Spheron at app.spheron.ai. SSH into each node and run the appropriate vLLM server.
Tier 1 (RTX 4090, 7B model):
pip install vllm
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --quantization fp8 \
    --port 8001 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 256

Tier 2 (A100 80GB, 32B model):
vllm serve Qwen/Qwen2.5-32B-Instruct \
    --quantization fp8 \
    --port 8002 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 128

Tier 3 (H100 SXM5, 70B model):
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --quantization fp8 \
    --port 8003 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 64 \
    --max-model-len 32768

For detailed provisioning steps and Docker-based deployment, see the Spheron documentation. For multi-GPU tensor parallelism on Tier 3 (2x H100 for larger models), add --tensor-parallel-size 2.
| Tier | Model | GPU | VRAM Used | Est. Throughput (tokens/sec) |
|---|---|---|---|---|
| 1 | Qwen2.5-7B FP8 | RTX 4090 24GB | ~7GB | 800-1200 |
| 2 | Qwen2.5-32B FP8 | A100 80GB | ~32GB | 400-600 |
| 3 | Llama-3.3-70B FP8 | H100 SXM5 80GB | ~70GB | 300-500 |
Use spot instances for Tier 2 and Tier 3 where available on Spheron. Spot pricing brings A100 SXM4 down to $0.45/hr and H100 SXM5 down to $0.80/hr, cutting those tiers' costs significantly.
Query Classification: Three Approaches
1. Heuristic Rules (Fast, Free, Crude)
Heuristics are your baseline. They're deterministic, add zero latency, and handle obvious cases reliably. They fail on edge cases.
import re


def classify_heuristic(messages: list) -> int:
    """Returns tier 1, 2, or 3."""
    # Combine all message content
    full_text = " ".join(
        m.get("content", "") for m in messages if isinstance(m.get("content"), str)
    )
    token_estimate = len(full_text.split())

    # Hard signals for Tier 3
    has_code = bool(re.search(r'```|def |class |import |#include', full_text))
    has_math = bool(re.search(r'\$\$|\\\[|\\begin\{|integral|derivative|proof', full_text))
    is_multistep = bool(re.search(
        r'step by step|analyze|compare and contrast|write a (detailed|comprehensive)|'
        r'explain (how|why|the difference)', full_text, re.IGNORECASE
    ))
    is_long = token_estimate > 500
    if has_code or has_math or (is_multistep and is_long):
        return 3

    # Hard signals for Tier 1
    is_short = token_estimate < 100
    # Apply the ^ anchor to the last user message only.
    # full_text joins all turns, so ^ would anchor to the start of older
    # messages in multi-turn conversations and almost never match.
    last_user_msg = next(
        (m.get("content", "") for m in reversed(messages)
         if m.get("role") == "user" and isinstance(m.get("content"), str)),
        ""
    )
    is_factual = bool(re.search(
        r'^(what is|who is|when did|where is|how many|what are)',
        last_user_msg.strip().lower()
    ))
    if is_short and is_factual:
        return 1

    # Default: Tier 2 for everything in between
    return 2

Latency: under 0.1ms. Accuracy on clear cases: ~70-80%. Weak on moderate-complexity queries that don't match any pattern.
2. Embedding-Based Classifier (Balanced)
Train a lightweight logistic regression on top of sentence embeddings. This gives you a classifier that generalizes to your actual query distribution, not just the patterns you anticipated. If your router relies on query embeddings for classification, see the self-hosted TEI embedding guide for the embedding server setup.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import joblib


# Training (run once, offline)
def train_classifier(labeled_queries: list[dict]):
    """
    labeled_queries: [{"text": "...", "tier": 1|2|3}, ...]
    Needs ~500-1000 labeled examples for decent accuracy.
    """
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    texts = [q["text"] for q in labeled_queries]
    labels = [q["tier"] for q in labeled_queries]
    embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)
    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(embeddings, labels)
    # Persist the embedding model and the classifier together
    joblib.dump((model, clf), "router_classifier.pkl")
    return clf


# Inference (per request)
class EmbeddingClassifier:
    def __init__(self, model_path: str = "router_classifier.pkl"):
        self.embed_model, self.clf = joblib.load(model_path)

    def classify(self, text: str) -> tuple[int, float]:
        embedding = self.embed_model.encode([text])
        # Cast NumPy scalars to plain Python types to match the annotation
        tier = int(self.clf.predict(embedding)[0])
        confidence = float(self.clf.predict_proba(embedding).max())
        return tier, confidence

Latency on CPU: 1-5ms per request. Sub-1ms if batched. The all-MiniLM-L6-v2 model (22.7M parameters) loads in ~80MB of RAM and runs on CPU without a GPU allocation. Accuracy on your own query distribution: typically 85-92% after labeling 500-1000 examples.
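A single accuracy number hides the distinction that matters for a router: sending a hard query to a small model (under-routing) hurts quality, while sending an easy query to a large model (over-routing) only costs money. The sketch below splits held-out predictions into those buckets. `routing_report` is a hypothetical helper, and the sample labels are made up for illustration.

```python
from collections import Counter


def routing_report(true_tiers: list[int], predicted_tiers: list[int]) -> dict:
    """Split classifier outcomes into correct, under-routed, and over-routed shares.

    under_routed: predicted tier is lower than the labeled tier (quality risk).
    over_routed: predicted tier is higher than the labeled tier (cost risk).
    """
    if not true_tiers or len(true_tiers) != len(predicted_tiers):
        raise ValueError("need two non-empty lists of equal length")
    outcomes = Counter()
    for truth, pred in zip(true_tiers, predicted_tiers):
        if pred == truth:
            outcomes["correct"] += 1
        elif pred < truth:
            outcomes["under_routed"] += 1
        else:
            outcomes["over_routed"] += 1
    n = len(true_tiers)
    return {key: outcomes[key] / n for key in ("correct", "under_routed", "over_routed")}


if __name__ == "__main__":
    # Illustrative labels only; use a held-out slice of your own labeled queries.
    truth = [1, 1, 2, 3, 3, 2, 1, 3]
    preds = [1, 2, 2, 3, 2, 2, 1, 3]
    print(routing_report(truth, preds))
```

If the under-routed share is the larger of the two error buckets, raising the confidence threshold is usually a safer first move than retraining.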
3. LLM-Judge Routing (Accurate, Adds Latency)
For borderline queries that the embedding classifier is uncertain about (confidence below 0.7), escalate to an LLM judge. Use the Tier 1 model itself to classify the query before serving it.
async def llm_judge_classify(prompt: str, tier1_client) -> int:
    """Ask the Tier 1 model to rate the complexity. Adds 50-200ms."""
    system = (
        "Rate this query's complexity: "
        "1 (simple factual, one-sentence answer), "
        "2 (moderate, requires explanation), "
        "3 (complex reasoning, code, or multi-step analysis). "
        "Reply with only the digit."
    )
    response = await tier1_client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt[:500]}  # truncate to limit judge latency
        ],
        max_tokens=1,
        temperature=0
    )
    try:
        tier = int(response.choices[0].message.content.strip())
        return tier if tier in (1, 2, 3) else 2
    except (ValueError, IndexError):
        return 2  # fallback to Tier 2 if parse fails

Run the judge on the same Tier 1 GPU. It adds zero extra GPU cost. The latency overhead (50-200ms) is only paid for borderline queries, which are typically 15-20% of requests.
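The three approaches compose into a cascade: heuristics decide the obvious cases, the embedding classifier handles the rest when it is confident, and the judge only sees what is left. The following is a simplified synchronous sketch of that control flow. `classify_cascade` and its callable parameters are hypothetical stand-ins for the functions shown earlier (the real judge is async), and the 0.7 threshold is the one quoted above.

```python
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.7  # borderline cutoff quoted in the text


def classify_cascade(
    heuristic_tier: int,
    embed: Callable[[], tuple[int, float]],
    judge: Optional[Callable[[], int]] = None,
) -> tuple[int, str]:
    """Return (tier, stage_that_decided).

    embed and judge are passed as zero-argument callables so the more
    expensive stages only run when an earlier stage could not decide.
    """
    # Stage 1: trust heuristics when they fire a hard Tier 1 or Tier 3 signal.
    if heuristic_tier in (1, 3):
        return heuristic_tier, "heuristic"

    # Stage 2: embedding classifier, accepted only when confident.
    tier, confidence = embed()
    if confidence >= CONFIDENCE_THRESHOLD:
        return tier, "embedding"

    # Stage 3: optional LLM judge for borderline queries.
    if judge is not None:
        return judge(), "judge"

    # No judge configured: fall back to the middle tier.
    return 2, "default"
```

Returning the deciding stage alongside the tier makes it cheap to log how often each stage fires, which is the number you need to verify the 15-20% borderline estimate on your own traffic.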
Building the Routing Proxy with NGINX
The NGINX auth_request module can call the classifier before proxying, but it has a fundamental limitation: auth_request subrequests are always internal GET requests and never include the original request body, regardless of proxy_pass_request_body on. That directive only applies to regular proxy_pass locations. The result is that your FastAPI /classify endpoint always receives an empty body, json.loads(body) raises an exception, and the except block silently falls back to tier 2 for every request. The router never routes to Tier 1 or Tier 3.
The recommended approach is the Python httpx proxy shown after this section, which reads the body normally. The NGINX config below is included for reference if you need NGINX for TLS termination or connection pooling, but the classification logic should live in the Python layer, not as an auth_request subrequest.
Note: auth_request requires ngx_http_auth_request_module, which is compiled into most standard NGINX builds (check with nginx -V 2>&1 | grep auth_request).
upstream tier1 {
    server 10.0.0.10:8001;
    keepalive 32;
}
upstream tier2 {
    server 10.0.0.11:8002;
    keepalive 32;
}
upstream tier3 {
    server 10.0.0.12:8003;
    keepalive 32;
}
upstream classifier {
    server 127.0.0.1:9000;
    keepalive 16;
}

# Map the classifier's X-Route-Tier header to an upstream group.
# map is only valid at http level, so it sits outside the server block.
map $route_tier $upstream_backend {
    "1"     tier1;
    "2"     tier2;
    default tier3;
}

server {
    listen 8080;

    location /v1/chat/completions {
        auth_request /classify;
        auth_request_set $route_tier $upstream_http_x_route_tier;

        # Select the upstream through the map rather than with if blocks:
        # if runs in the rewrite phase, before auth_request has set $route_tier,
        # and proxy_set_header is not permitted inside if.
        proxy_pass http://$upstream_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Request-ID $request_id;
        proxy_read_timeout 120s;
    }

    location /classify {
        internal;
        proxy_pass http://classifier/classify;
        # NOTE: auth_request subrequests are always GET with no body.
        # proxy_pass_request_body has no effect here; the classifier
        # will always receive an empty body via this path.
        # Use the Python httpx proxy below for body-aware classification.
        proxy_set_header Content-Type application/json;
    }
}

Classifier microservice (FastAPI):
from fastapi import FastAPI, Request
from fastapi.responses import Response
import json

app = FastAPI()
embedding_clf = EmbeddingClassifier()


@app.post("/classify")
async def classify(request: Request):
    body = await request.body()
    try:
        data = json.loads(body)
        messages = data.get("messages", [])
        user_text = " ".join(
            m["content"] for m in messages
            if m.get("role") == "user" and isinstance(m.get("content"), str)
        )
        # Stage 1: heuristic
        h_tier = classify_heuristic(messages)
        # Stage 2: embedding classifier for uncertain cases
        if h_tier == 2:
            e_tier, confidence = embedding_clf.classify(user_text)
            tier = e_tier if confidence > 0.7 else 2
        else:
            tier = h_tier
    except Exception:
        tier = 2  # safe default
    response = Response(content="", status_code=200)
    response.headers["X-Route-Tier"] = str(tier)
    return response

For production deployments, use a Python proxy with httpx instead. It reads the request body directly and avoids the auth_request body-stripping issue described above:
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()

BACKENDS = {1: "http://10.0.0.10:8001", 2: "http://10.0.0.11:8002", 3: "http://10.0.0.12:8003"}


@app.post("/v1/chat/completions")
async def route(request: Request):
    body = await request.json()
    tier = classify_request(body)  # your classifier function
    backend = BACKENDS[tier]
    is_streaming = body.get("stream", False)

    if is_streaming:
        # Stream SSE chunks directly back to the client
        async def stream_response():
            async with httpx.AsyncClient(timeout=120.0) as client:
                async with client.stream(
                    "POST", f"{backend}/v1/chat/completions", json=body
                ) as response:
                    async for chunk in response.aiter_bytes():
                        yield chunk
        return StreamingResponse(stream_response(), media_type="text/event-stream")

    # Non-streaming: propagate the backend status code to the caller
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(f"{backend}/v1/chat/completions", json=body)
        return JSONResponse(content=response.json(), status_code=response.status_code)

Latency-Aware Routing: SLA Tiers and Fallback Chains
Define SLA contracts per tier upfront. These become the circuit-breaker thresholds that trigger automatic tier escalation.
| Tier | Target TTFT | Max TTFT (p95) | Trigger Escalation At |
|---|---|---|---|
| Tier 1 | <200ms | 500ms | KV cache >90% OR TTFT p95 >500ms |
| Tier 2 | <500ms | 1000ms | KV cache >90% OR TTFT p95 >1000ms |
| Tier 3 | <2000ms | 4000ms | Scale out (add GPU) |
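The TTFT side of those triggers needs a rolling percentile per tier. One possible shape is sketched below, using a fixed-size window and a nearest-rank p95. `TTFTWindow`, the window size, and the minimum sample count are all assumptions for illustration; in production you would more likely read this from your metrics backend.

```python
from collections import deque


class TTFTWindow:
    """Rolling window of recent time-to-first-token samples for one tier."""

    def __init__(self, max_p95_ms: float, window: int = 200, min_samples: int = 20):
        self.max_p95_ms = max_p95_ms
        self.min_samples = min_samples  # avoid tripping on a handful of requests
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically

    def record(self, ttft_ms: float) -> None:
        self.samples.append(ttft_ms)

    def p95(self) -> float:
        """Nearest-rank 95th percentile; 0.0 when the window is empty."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def should_escalate(self) -> bool:
        """True once enough samples exist and p95 exceeds the tier's limit."""
        return len(self.samples) >= self.min_samples and self.p95() > self.max_p95_ms


# Max TTFT (p95) limits from the table above; Tier 3 scales out instead of escalating.
WINDOWS = {1: TTFTWindow(max_p95_ms=500), 2: TTFTWindow(max_p95_ms=1000)}
```

Call `WINDOWS[tier].record(...)` when the first token arrives and check `should_escalate()` alongside the KV cache signal before routing the next request.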
Poll vLLM's /metrics endpoint every 10 seconds to get KV cache utilization:
import httpx
import asyncio

TIER_ENDPOINTS = {
    1: "http://10.0.0.10:8001/metrics",
    2: "http://10.0.0.11:8002/metrics",
    3: "http://10.0.0.12:8003/metrics",
}


async def get_kv_usage(tier: int) -> float:
    """Returns KV cache utilization as a float 0.0-1.0. Returns 0.0 on any network error."""
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.get(TIER_ENDPOINTS[tier], timeout=2.0)
            for line in resp.text.splitlines():
                if line.startswith("vllm:kv_cache_usage_perc"):
                    return float(line.split()[-1])
    except (httpx.RequestError, httpx.TimeoutException, ValueError):
        return 0.0
    return 0.0


# In your classifier: check saturation before routing
async def route_with_saturation_check(base_tier: int) -> int:
    try:
        usage = await get_kv_usage(base_tier)
    except Exception:
        usage = 0.0
    if usage > 0.90 and base_tier < 3:
        return base_tier + 1  # escalate to next tier
    return base_tier

For a deep dive on KV cache monitoring and optimization, see the KV cache optimization guide.
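Saturation checks choose a tier before the request is sent; a fallback chain covers what happens after a tier errors out or returns a weak answer. Below is a minimal synchronous sketch of that chain. `run_with_fallback`, `call_tier`, and `is_acceptable` are hypothetical names: `call_tier` stands in for the HTTP call to a tier's backend, and `is_acceptable` for whatever confidence scorer you use.

```python
from typing import Callable

MAX_TIER = 3


def run_with_fallback(
    start_tier: int,
    call_tier: Callable[[int], str],
    is_acceptable: Callable[[str], bool],
) -> dict:
    """Try start_tier, moving up one tier at a time on error or rejected output.

    Returns the first acceptable answer, or the highest tier's answer if
    nothing passes the check. Raises only if every attempted tier fails.
    """
    best = None
    for tier in range(start_tier, MAX_TIER + 1):
        try:
            answer = call_tier(tier)
        except Exception:
            continue  # backend failed (e.g. an interrupted node): try the next tier
        best = {"answer": answer, "final_tier": tier}
        if is_acceptable(answer):
            break
    if best is None:
        raise RuntimeError("all tiers failed")
    return {
        **best,
        "original_tier": start_tier,
        "escalated": best["final_tier"] != start_tier,
    }
```

The returned `original_tier`, `final_tier`, and `escalated` fields line up with the values worth logging on every request so that escalation rates can be audited later.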
GPU Allocation Strategy: Matching GPU Tier to Model Tier
| Model Tier | GPU | VRAM | Models That Fit | On-Demand | Spot | Best For |
|---|---|---|---|---|---|---|
| Tier 1 | RTX 4090 PCIe | 24GB | 7B FP16, 13B INT4 | $0.51/hr | N/A | Simple queries, stateless |
| Tier 2 | A100 80GB PCIe | 80GB | 32B FP16, 70B FP8 | $1.43/hr | $1.14/hr | Medium reasoning, conversation |
| Tier 2 | A100 80GB SXM4 | 80GB | 32B FP16, 70B FP8 | $1.05/hr | $0.45/hr | Same but lower spot price |
| Tier 3 | H100 SXM5 | 80GB | 70B FP8, 100B+ (2x) | $2.40/hr | $0.80/hr | Complex reasoning, code |
| Tier 3 (XL) | H200 SXM5 | 141GB | 72B FP16, 200B+ | $4.50/hr | $1.19/hr | Very long context, 200B+ models |
Spheron bills per second. That means you can scale Tier 1 GPUs up during peak hours and terminate them during off-peak without paying for idle capacity. A node that runs for 4 peak hours a day instead of staying provisioned for all 24 costs 83% less.
For MIG (Multi-Instance GPU) as an alternative to separate GPU instances for Tier 1 - partitioning a single A100 to run multiple smaller models - see the MIG and GPU time-slicing guide. A practical example of this pattern is pairing Ministral 3 3B at the edge with Ministral 3 14B Reasoning on Spheron cloud for complex query escalation.
Cost Savings Analysis: Router vs Single Large Model
Scenario: 100,000 requests/day, average 500 input + 200 output tokens per request.
Baseline: all requests to H100 SXM5 at $2.40/hr
At 100,000 requests/day with ~700 tokens average, and assuming ~300 tokens/sec throughput on H100 SXM5, you need roughly:
- Tokens per day: 100,000 × 700 = 70M tokens
- GPU-seconds needed: 70M / 300 = ~233,000 seconds = ~65 GPU-hours
- Daily GPU cost: 65 × $2.40 = $156/day
Routed: 60% Tier 1, 25% Tier 2, 15% Tier 3
Tier 1 (60,000 requests on RTX 4090, ~1,000 tokens/sec):
- Tokens: 60,000 × 700 = 42M tokens
- GPU-seconds: 42M / 1,000 = 42,000 seconds = ~12 GPU-hours
- Cost: 12 × $0.51 = $6.12/day
Tier 2 (25,000 requests on A100 80GB, ~500 tokens/sec):
- Tokens: 25,000 × 700 = 17.5M tokens
- GPU-seconds: 17.5M / 500 = 35,000 seconds = ~10 GPU-hours
- Cost: 10 × $1.43 = $14.30/day
Tier 3 (15,000 requests on H100 SXM5, ~400 tokens/sec):
- Tokens: 15,000 × 700 = 10.5M tokens
- GPU-seconds: 10.5M / 400 = 26,250 seconds = ~7 GPU-hours
- Cost: 7 × $2.40 = $16.80/day
Total routed cost: $6.12 + $14.30 + $16.80 = $37.22/day
Compared to $156/day baseline, that's a 76% cost reduction.
Using spot pricing where available (A100 SXM4 spot at $0.45/hr for Tier 2, H100 SXM5 spot at $0.80/hr for Tier 3) brings the total down further to roughly $16-18/day, an 88-90% reduction from baseline.
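To rerun this scenario with your own request volume, throughput, or prices, the calculation can be wrapped in a small model. The sketch below uses the same inputs as the walkthrough above; `TierLoad` is a hypothetical helper, and it does not round GPU-hours to whole numbers, so its totals land a few cents away from the hand-rounded figures.

```python
from dataclasses import dataclass


@dataclass
class TierLoad:
    name: str
    requests_per_day: int
    tokens_per_request: int
    tokens_per_sec: float   # assumed sustained throughput for the tier
    rate_per_hour: float    # GPU price in $/hr

    @property
    def gpu_hours(self) -> float:
        total_tokens = self.requests_per_day * self.tokens_per_request
        return total_tokens / self.tokens_per_sec / 3600

    @property
    def daily_cost(self) -> float:
        return self.gpu_hours * self.rate_per_hour


# Inputs copied from the scenario above.
BASELINE = TierLoad("all-on-H100", 100_000, 700, 300, 2.40)
ROUTED = [
    TierLoad("tier1-rtx4090", 60_000, 700, 1_000, 0.51),
    TierLoad("tier2-a100", 25_000, 700, 500, 1.43),
    TierLoad("tier3-h100", 15_000, 700, 400, 2.40),
]


def total_cost(tiers: list[TierLoad]) -> float:
    return sum(t.daily_cost for t in tiers)


if __name__ == "__main__":
    routed = total_cost(ROUTED)
    print(f"baseline: ${BASELINE.daily_cost:.2f}/day")
    print(f"routed:   ${routed:.2f}/day")
    print(f"savings:  {1 - routed / BASELINE.daily_cost:.0%}")
```

The throughput figures dominate the result, so measure them on your own hardware and prompts before relying on the output.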
For more on reducing per-token reasoning costs, see the reasoning model inference cost guide.
If you need higher output quality rather than lower cost per query, a Mixture of Agents architecture - running multiple proposer models in parallel and aggregating their outputs - is a complementary approach to routing.
Production Monitoring: Tracking Route Decisions
Emit these Prometheus labels on every request:
from prometheus_client import Counter, Histogram

routing_requests = Counter(
    "llm_router_requests_total",
    "Total routed requests",
    ["routing_tier", "escalated", "original_tier", "model_id"]
)
routing_latency = Histogram(
    "llm_router_latency_ms",
    "End-to-end latency per routed request",
    ["routing_tier"],
    buckets=[50, 100, 200, 500, 1000, 2000, 5000]
)

Three dashboards to track:
- Tier distribution over time: if Tier 1 drops below 50% of traffic, your classifier is over-routing to the larger tiers. Check whether the query distribution has shifted or the embedding classifier is degrading.
- Escalation rate per tier: rising escalation rates (>5% from baseline) indicate quality issues with the source tier. Either the model degraded, the tier thresholds are wrong, or your query distribution shifted.
- p50/p95/p99 latency per tier: a latency spike on Tier 1 usually means GPU memory pressure. Check vllm:kv_cache_usage_perc. A spike on Tier 3 means you need to scale out.
Quality drift detection: run a secondary LLM judge on a random 1% sample of Tier 1 and Tier 2 outputs. Score each on a 1-5 scale. If average quality drops below your threshold (typically 3.5/5), tighten the routing thresholds - send fewer queries to the lower tier until quality recovers.
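A rough sketch of the sampling and threshold logic follows. It swaps true random sampling for hash-based sampling on the request ID, so the same request always gets the same decision and sampled requests can be traced in logs. The function names and the `min_scores` guard are assumptions; the 1% rate and 3.5/5 threshold are the ones quoted above.

```python
import hashlib
from statistics import mean

SAMPLE_RATE = 0.01        # judge roughly 1% of Tier 1 and Tier 2 outputs
QUALITY_THRESHOLD = 3.5   # average judge score on the 1-5 scale


def should_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the request ID into [0, 1) and compare to rate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def drift_detected(
    scores: list[float],
    threshold: float = QUALITY_THRESHOLD,
    min_scores: int = 30,
) -> bool:
    """True when enough judge scores exist and their mean is below the threshold.

    min_scores is a hypothetical guard against alerting on a tiny sample.
    """
    if len(scores) < min_scores:
        return False
    return mean(scores) < threshold
```

Keep a separate score list per tier; a blended average can hide a Tier 1 regression behind healthy Tier 2 scores.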
For MoE model deployment and cost optimization, see the MoE inference optimization guide.
Case Study: 61% Cost Reduction with Smart Routing
A team running a customer support LLM was processing 150,000 queries/day, routing everything through a Llama-3.3-70B on H100 SXM5. The GPU was running at ~60% average utilization - not terrible, but still paying H100 prices for every "What are your business hours?" query.
After classifying 1,000 historical queries from their logs, the breakdown was:
- 65%: simple factual queries (product specs, hours, FAQ, return policies)
- 25%: moderate queries (account questions, order status, policy lookups)
- 10%: complex queries (billing disputes, refund edge cases, escalations)
They deployed a 3-tier router:
- Qwen2.5-7B on RTX 4090 for Tier 1 ($0.51/hr)
- Llama-3.1-8B on A100 80GB for Tier 2 ($1.43/hr)
- Llama-3.3-70B on H100 SXM5 for Tier 3 ($2.40/hr)
Before: one H100 SXM5 at $2.40/hr handling all 150k queries/day. At ~6,250 queries/hr and ~400 tokens/sec throughput, they needed roughly 65-70 GPU-hours/day.
Daily cost before: ~68 GPU-hours × $2.40 = $163/day
After routing (65% Tier 1, 25% Tier 2, 10% Tier 3):
- Tier 1: 41 GPU-hours × $0.51 = $20.91/day
- Tier 2: 12 GPU-hours × $1.43 = $17.16/day
- Tier 3: 11 GPU-hours × $2.40 = $26.40/day
- Total: $64.47/day
Result: 61% cost reduction with escalation rate on Tier 1 staying below 3% (measured via the confidence fallback chain). Customer satisfaction scores were unchanged across a 2-week A/B test.
The team then added A100 SXM4 spot instances for Tier 2, bringing that cost down to $0.45/hr and pushing the total daily cost to ~$53, a 68% reduction from the original baseline.
Spheron's GPU marketplace lets you rent RTX 4090, A100, and H100 GPUs from the same platform, billed per second. That per-second billing is what makes dynamic routing practical: scale Tier 1 up during peak hours and back down at night without paying for idle capacity.
Quick Setup Guide
Classify queries into 3 tiers: Tier 1 (simple, <500 tokens) routed to a 7B model on RTX 4090 ($0.51/hr); Tier 2 (medium complexity, 500-2000 tokens or domain-specific) routed to a 32B-70B model on A100 80GB ($1.43/hr); Tier 3 (complex reasoning, code, multi-step) routed to a 70B-200B+ model on H100 SXM5 ($2.40/hr on-demand, $0.80/hr spot). Measure your query mix first - run 1,000 sample queries through a complexity classifier before deploying the router in production.
Provision one GPU node per tier on Spheron. For each node, run vLLM with its model: 'vllm serve Qwen/Qwen2.5-7B-Instruct --quantization fp8 --port 8001' (Tier 1), 'vllm serve Qwen/Qwen2.5-32B-Instruct --quantization fp8 --port 8002' (Tier 2), 'vllm serve meta-llama/Llama-3.3-70B-Instruct --quantization fp8 --port 8003' (Tier 3). All expose OpenAI-compatible /v1/chat/completions endpoints. Document each node's internal IP for the router config.
Start with a heuristic classifier as a baseline: route short queries (under ~100 tokens) that open with a factual question pattern and carry no code/math markers to Tier 1; queries with code blocks, math notation, or long multi-step instructions to Tier 3; everything else to Tier 2. Then layer an embedding classifier using a 22.7M-parameter model (e.g., sentence-transformers/all-MiniLM-L6-v2) trained on a labeled sample of your own queries. The embedding classifier adds 1-5ms of CPU latency per request and is far more accurate than pure heuristics for edge cases.
Use a Python httpx proxy as your routing layer: read the request body, call classify_request(), look up the backend in BACKENDS = {1: ..., 2: ..., 3: ...}, and forward the request. This approach handles body-based classification correctly. If you need NGINX for TLS termination, put it in front but keep classification in the Python layer — NGINX auth_request subrequests strip the request body by design, so the classifier will never receive the query via that path. Set upstream keepalive connections to avoid TCP overhead per request. Add request_id logging so every route decision is auditable.
In your classifier microservice, implement a fallback chain: if the Tier 1 model's response includes low_confidence=true (detected by a secondary scoring pass or a short confidence prompt), re-route the request to Tier 2. If Tier 2 escalates similarly, route to Tier 3. Log all escalations with the original tier, final tier, and latency overhead. A healthy router should escalate <5% of initially classified Tier 1 requests.
Emit a routing_tier label on every request to your metrics system (Prometheus or Datadog). Track: (1) tier distribution over time - if Tier 1 drops below 50% of traffic, your classifier may be over-routing; (2) escalation rate per tier - rising escalation rates signal quality issues with lower tiers; (3) p50/p95 latency per tier - a latency spike on Tier 1 may indicate GPU memory pressure and a need to scale that tier. Set alerts on escalation_rate > 10% above baseline and on p95 latency > SLA threshold.
Frequently Asked Questions
An LLM inference router is a proxy layer that classifies incoming queries by complexity and forwards them to the most cost-appropriate model. Simple queries go to a cheap 7B model; complex ones go to a large 70B or 200B+ model. The goal is to maintain output quality while reducing average cost per token.
In practice, 60-70% of queries to a large model are simple enough to be handled by a 7B model with no noticeable quality drop. If your large model costs $2.40/hr and your small model costs $0.51/hr, routing 65% of traffic to the small model reduces your average per-query cost by over 50%. The exact savings depend on your query mix and the quality threshold you set.
Three approaches work in practice: heuristic rules (token count, keyword patterns) are fast and free but crude; embedding-based classifiers (a small BERT-class model) are more accurate and add only 1-5ms on CPU; LLM-judge routing (asking a small model to classify the query) is most accurate but adds 50-200ms overhead. Most production routers combine heuristics for obvious cases with an embedding classifier for borderline queries.
For small models (7B-13B), RTX 4090 at $0.51/hr is the most cost-efficient option on Spheron. For medium models (32B-70B), A100 80GB at $1.43/hr handles FP16 serving and FP8 allows larger models. For large models (70B-200B+), H100 SXM5 at $0.80/hr spot or $2.40/hr on-demand provides the throughput needed for complex queries. Use spot instances for your small model tier since they handle non-critical traffic.
Implement a fallback chain: if the small model returns a low-confidence output (detected via a secondary confidence scorer or by monitoring user feedback), escalate to the next tier. Track routing decisions with request IDs in your logs so you can audit whether escalations are happening at the expected rate. Set a quality drift alert that fires if escalation rate exceeds your baseline by more than 10 percentage points.
