Most teams running LLMs in production are paying for a 70B model to answer "What is the capital of France?" Running every query through your largest model is the easiest architecture and the most expensive one. A 70B model on an H100 costs $2.40/hr on-demand - sending every request there when 60% of them could be answered by a 7B model at $0.51/hr is burning money. The fix is a routing layer that classifies each query and sends it to the cheapest model that can handle it. This post covers how to build that router: classifier design, NGINX proxy setup, GPU tier selection, and the monitoring you need to catch quality drift before users notice. For background on why inference costs compound quickly at scale, see the reasoning model inference cost guide and the GPU cost optimization playbook.
Why You Need an LLM Inference Router in 2026
The RunPod March 2026 State of AI report showed that Qwen overtook Llama as the most-deployed self-hosted model family. Teams now routinely run Qwen2.5-7B alongside Qwen2.5-72B and Llama-3.3-70B on the same infrastructure. That multi-model footprint makes routing economically rational in a way it wasn't two years ago.
The core math is simple. A typical production workload breaks down roughly like this:
- ~60% simple queries: factual lookups, short summaries, classification, simple Q&A
- ~25% medium queries: document analysis, multi-turn conversation, moderate reasoning
- ~15% complex queries: code generation, multi-step reasoning, long-form synthesis
If you send all of that to one large model, you pay the large model price for every request. If you route by complexity, your average cost drops to something close to a weighted average:
| Strategy | GPU | On-Demand ($/hr) | Handles |
|---|---|---|---|
| No routing (all queries) | H100 SXM5 | $2.40 | 100% of traffic |
| Tier 1 (routed) | RTX 4090 | $0.51 | ~60% of traffic |
| Tier 2 (routed) | A100 80GB PCIe | $1.43 | ~25% of traffic |
| Tier 3 (routed) | H100 SXM5 | $2.40 | ~15% of traffic |
The weighted average GPU cost with routing: (0.60 × $0.51) + (0.25 × $1.43) + (0.15 × $2.40) = $0.306 + $0.358 + $0.360 = $1.02/hr equivalent. That's a 57% cost reduction before you touch spot pricing.
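If you want to sanity-check the blended rate against your own traffic mix, it's just a share-weighted sum of hourly prices. A quick Python sketch using the shares and prices from the table above (swap in your own numbers; this ignores per-tier throughput differences, which the fuller cost analysis later in the post accounts for):
# Blended hourly GPU cost under routing vs. sending everything to the largest tier.
# Shares and prices come from the table above - replace with your own measurements.
tiers = {
    "tier1_rtx4090": {"share": 0.60, "price_hr": 0.51},
    "tier2_a100":    {"share": 0.25, "price_hr": 1.43},
    "tier3_h100":    {"share": 0.15, "price_hr": 2.40},
}
blended = sum(t["share"] * t["price_hr"] for t in tiers.values())
baseline = 2.40  # all traffic on H100 SXM5
print(f"blended: ${blended:.2f}/hr equivalent")             # ~$1.02
print(f"reduction: {(1 - blended / baseline) * 100:.0f}%")  # ~57%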
Router Architecture Overview
The router sits between your clients and your model pool. Every request passes through it, gets classified, and gets forwarded to the appropriate model tier.
Client Request
|
v
[Router Proxy - NGINX]
|
v
[Classifier Microservice - FastAPI]
|
+---> Tier 1: [vLLM pool - RTX 4090] (7B-13B models)
|
+---> Tier 2: [vLLM pool - A100 80GB] (32B-70B models)
|
+---> Tier 3: [vLLM pool - H100 SXM5] (70B-200B+ models)
Three components:
- Router proxy (NGINX or Envoy): receives all incoming requests, calls the classifier, proxies to the correct upstream. Handles connection pooling and retry logic.
- Classifier microservice (FastAPI): reads the request body, runs the complexity classification pipeline, returns a routing tier in a response header. Should complete in under 5ms.
- Model pool (vLLM per tier): each tier is one or more vLLM instances behind a load balancer. All expose OpenAI-compatible /v1/chat/completions endpoints, so clients don't need to change anything - they just point at the router (a minimal client sketch follows).
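Because every tier speaks the same OpenAI-compatible API, a client only changes its base URL. A minimal sketch with the openai Python SDK - the router address is a placeholder, and it assumes either the router rewrites the model field per tier or each vLLM instance is started with a shared --served-model-name alias:
from openai import OpenAI

# Talk to the router exactly as you would talk to a single vLLM server.
client = OpenAI(
    base_url="http://router.internal:8080/v1",  # placeholder router address
    api_key="unused",                           # self-hosted vLLM ignores this unless --api-key is set
)
resp = client.chat.completions.create(
    model="llm-router",  # hypothetical shared alias across tiers
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)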
For vLLM setup on each tier, see the vLLM production deployment guide. If you want to use SGLang for your Tier 3 (better throughput on long-context workloads), the SGLang production deployment guide covers that setup.
Model Tier Strategy: Small vs Medium vs Large
| Tier | Models | Params | Use Case | GPU | On-Demand | Spot |
|---|---|---|---|---|---|---|
| Tier 1 | Qwen2.5-7B, Llama-3.2-3B | 3B-13B | Simple Q&A, classification, short summaries | RTX 4090 PCIe | $0.51/hr | N/A |
| Tier 2 | Qwen2.5-32B, Llama-3.1-8B+ | 8B-32B | Document analysis, moderate reasoning, multi-turn | A100 80GB PCIe | $1.43/hr | $1.14/hr |
| Tier 3 | Llama-3.3-70B, Qwen2.5-72B, 200B+ | 70B+ | Code, complex reasoning, long-form synthesis | H100 SXM5 | $2.40/hr | $0.80/hr |
For detailed per-GPU memory requirements at each model size, see the GPU memory requirements guide and the best GPU for AI inference 2026 guide.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 05 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Choosing Tier 1 Models
Qwen2.5-7B-Instruct is a strong default: it scores higher than many 13B models from 2024 on standard benchmarks, fits on a single RTX 4090 in FP16 (model weights ~14GB, ~17GB+ total with KV cache), and costs $0.51/hr. Llama-3.2-3B is a reasonable fallback if you need even lower latency at the cost of some quality headroom.
Run your Tier 1 on spot instances where available. Tier 1 handles simple, stateless queries - an interruption just means retrying the request, which the router handles automatically in the fallback chain.
Choosing Tier 3 Models
For Tier 3, the choice between Llama-3.3-70B and Qwen2.5-72B depends on your workload. Llama-3.3-70B is stronger on English code generation. Qwen2.5-72B has better multilingual performance and math reasoning. Both fit on a single H100 80GB in FP8.
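For a quick feasibility check before provisioning, weight memory is roughly parameter count × bytes per parameter, and whatever VRAM remains goes to KV cache and runtime overhead. A rough sketch - the "needs ~5GB+ free" cutoff is an assumption for a usable KV cache, not a vLLM requirement:
# Back-of-the-envelope VRAM check: weights vs. what's left for KV cache.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def vram_breakdown(params_b: float, precision: str, vram_gb: float) -> None:
    weights_gb = params_b * BYTES_PER_PARAM[precision]
    leftover = vram_gb - weights_gb
    verdict = "fits" if leftover > 5 else "does not fit comfortably"
    print(f"{params_b}B @ {precision}: ~{weights_gb:.0f}GB weights, "
          f"~{leftover:.0f}GB left for KV cache -> {verdict}")

vram_breakdown(7, "fp8", 24)    # Tier 1: Qwen2.5-7B on RTX 4090 24GB
vram_breakdown(70, "fp8", 80)   # Tier 3: Llama-3.3-70B on H100 80GB
vram_breakdown(72, "fp16", 80)  # Qwen2.5-72B FP16 needs FP8 or 2 GPUs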
Deploying the Model Pool with vLLM
Provision one GPU node per tier on Spheron at app.spheron.ai. SSH into each node and run the appropriate vLLM server.
Tier 1 (RTX 4090, 7B model):
pip install vllm
vllm serve Qwen/Qwen2.5-7B-Instruct \
--quantization fp8 \
--port 8001 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 256
Tier 2 (A100 80GB, 32B model):
vllm serve Qwen/Qwen2.5-32B-Instruct \
--quantization fp8 \
--port 8002 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128
Tier 3 (H100 SXM5, 70B model):
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--port 8003 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 64 \
--max-model-len 32768
For detailed provisioning steps and Docker-based deployment, see the Spheron documentation. For multi-GPU tensor parallelism on Tier 3 (2x H100 for larger models), add --tensor-parallel-size 2.
| Tier | Model | GPU | VRAM Used | Est. Throughput (tokens/sec) |
|---|---|---|---|---|
| 1 | Qwen2.5-7B FP8 | RTX 4090 24GB | ~7GB | 800-1200 |
| 2 | Qwen2.5-32B FP8 | A100 80GB | ~32GB | 400-600 |
| 3 | Llama-3.3-70B FP8 | H100 SXM5 80GB | ~70GB | 300-500 |
Use spot instances for Tier 1 and Tier 2 where available on Spheron. Spot pricing brings A100 SXM4 down to $0.45/hr and H100 SXM5 down to $0.80/hr, cutting tier costs significantly.
Query Classification: Three Approaches
1. Heuristic Rules (Fast, Free, Crude)
Heuristics are your baseline. They're deterministic, add zero latency, and handle obvious cases reliably. They fail on edge cases.
import re
def classify_heuristic(messages: list) -> int:
"""Returns tier 1, 2, or 3."""
# Combine all message content
full_text = " ".join(m.get("content", "") for m in messages if isinstance(m.get("content"), str))
token_estimate = len(full_text.split())
# Hard signals for Tier 3
has_code = bool(re.search(r'```|def |class |import |#include', full_text))
has_math = bool(re.search(r'\$\$|\\\[|\\begin\{|integral|derivative|proof', full_text))
is_multistep = bool(re.search(
r'step by step|analyze|compare and contrast|write a (detailed|comprehensive)|'
r'explain (how|why|the difference)', full_text, re.IGNORECASE
))
is_long = token_estimate > 500
if has_code or has_math or (is_multistep and is_long):
return 3
# Hard signals for Tier 1
is_short = token_estimate < 100
# Apply the ^ anchor to the last user message only.
# full_text joins all turns, so ^ would anchor to the start of older
# messages in multi-turn conversations and almost never match.
last_user_msg = next(
(m.get("content", "") for m in reversed(messages)
if m.get("role") == "user" and isinstance(m.get("content"), str)),
""
)
is_factual = bool(re.search(
r'^(what is|who is|when did|where is|how many|what are)',
last_user_msg.strip().lower()
))
if is_short and is_factual:
return 1
# Default: Tier 2 for everything in between
return 2
Latency: under 0.1ms. Accuracy on clear cases: ~70-80%. Weak on moderate-complexity queries that don't match any pattern.
2. Embedding-Based Classifier (Balanced)
Train a lightweight logistic regression on top of sentence embeddings. This gives you a classifier that generalizes to your actual query distribution, not just the patterns you anticipated.
from sentence_transformers import SentenceTransformer
import numpy as np
import joblib
# Training (run once, offline)
def train_classifier(labeled_queries: list[dict]):
"""
labeled_queries: [{"text": "...", "tier": 1|2|3}, ...]
Needs ~500-1000 labeled examples for decent accuracy.
"""
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [q["text"] for q in labeled_queries]
labels = [q["tier"] for q in labeled_queries]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(embeddings, labels)
joblib.dump((model, clf), "router_classifier.pkl")
return clf
# Inference (per request)
class EmbeddingClassifier:
def __init__(self, model_path: str = "router_classifier.pkl"):
self.embed_model, self.clf = joblib.load(model_path)
def classify(self, text: str) -> tuple[int, float]:
embedding = self.embed_model.encode([text])
tier = self.clf.predict(embedding)[0]
confidence = self.clf.predict_proba(embedding).max()
return tier, confidence
Latency on CPU: 1-5ms per request. Sub-1ms if batched. The all-MiniLM-L6-v2 model (22.7M parameters) loads in ~80MB of RAM and runs on CPU without a GPU allocation. Accuracy on your own query distribution: typically 85-92% after labeling 500-1000 examples.
3. LLM-Judge Routing (Accurate, Adds Latency)
For borderline queries that the embedding classifier is uncertain about (confidence below 0.7), escalate to an LLM judge. Use the Tier 1 model itself to classify the query before serving it.
async def llm_judge_classify(prompt: str, tier1_client) -> int:
"""Ask the Tier 1 model to rate the complexity. Adds 50-200ms."""
system = (
"Rate this query's complexity: "
"1 (simple factual, one-sentence answer), "
"2 (moderate, requires explanation), "
"3 (complex reasoning, code, or multi-step analysis). "
"Reply with only the digit."
)
response = await tier1_client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": prompt[:500]} # truncate to limit judge latency
],
max_tokens=1,
temperature=0
)
try:
tier = int(response.choices[0].message.content.strip())
return tier if tier in (1, 2, 3) else 2
except (ValueError, IndexError):
return 2 # fallback to Tier 2 if parse fails
Run the judge on the same Tier 1 GPU. It adds zero extra GPU cost. The latency overhead (50-200ms) is only paid for borderline queries, which are typically 15-20% of requests.
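Putting the three approaches together, the cascade only pays the judge's latency when the cheaper stages are unsure. A sketch of the combined pipeline, reusing the functions defined above (the 0.7 confidence threshold matches the one used in the classifier service later in this post):
async def classify_cascade(messages: list, embedding_clf, tier1_client) -> int:
    """Heuristic -> embedding classifier -> LLM judge, cheapest stage first."""
    # Stage 1: heuristics catch the obvious cases at ~0ms.
    h_tier = classify_heuristic(messages)
    if h_tier != 2:
        return h_tier
    # Stage 2: embedding classifier on the user turns (1-5ms on CPU).
    user_text = " ".join(
        m["content"] for m in messages
        if m.get("role") == "user" and isinstance(m.get("content"), str)
    )
    e_tier, confidence = embedding_clf.classify(user_text)
    if confidence >= 0.7:
        return int(e_tier)
    # Stage 3: only borderline queries pay the 50-200ms judge round trip.
    return await llm_judge_classify(user_text, tier1_client)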
Building the Routing Proxy with NGINX
The NGINX auth_request module can call the classifier before proxying, but it has a fundamental limitation: auth_request subrequests are always internal GET requests and never include the original request body, regardless of proxy_pass_request_body on. That directive only applies to regular proxy_pass locations. The result is that your FastAPI /classify endpoint always receives an empty body, json.loads(body) raises an exception, and the except block silently falls back to tier 2 for every request. The router never routes to Tier 1 or Tier 3.
The recommended approach is the Python httpx proxy shown after this section, which reads the body normally. The NGINX config below is included for reference if you need NGINX for TLS termination or connection pooling, but the classification logic should live in the Python layer, not as an auth_request subrequest.
Note: auth_request requires ngx_http_auth_request_module, which is compiled into most standard NGINX builds (check with nginx -V 2>&1 | grep auth_request).
upstream tier1 {
server 10.0.0.10:8001;
keepalive 32;
}
upstream tier2 {
server 10.0.0.11:8002;
keepalive 32;
}
upstream tier3 {
server 10.0.0.12:8003;
keepalive 32;
}
upstream classifier {
server 127.0.0.1:9000;
keepalive 16;
}
server {
listen 8080;
# Map classifier header to upstream backend
# (map block belongs at http level in a real config; shown here for clarity)
# map $route_tier $upstream_backend { "1" http://tier1; "2" http://tier2; default http://tier3; }
location /v1/chat/completions {
auth_request /classify;
auth_request_set $route_tier $upstream_http_x_route_tier;
# Route based on classifier header.
# proxy_set_header and proxy_read_timeout are repeated in every if block because
# NGINX if blocks create implicit inner contexts that do not inherit array-type
# directives from the surrounding location block.
if ($route_tier = "1") {
proxy_pass http://tier1;
proxy_set_header Host $host;
proxy_set_header X-Request-ID $request_id;
proxy_read_timeout 120s;
}
if ($route_tier = "2") {
proxy_pass http://tier2;
proxy_set_header Host $host;
proxy_set_header X-Request-ID $request_id;
proxy_read_timeout 120s;
}
# Default to tier3
proxy_pass http://tier3;
proxy_set_header Host $host;
proxy_set_header X-Request-ID $request_id;
proxy_read_timeout 120s;
}
location /classify {
internal;
proxy_pass http://classifier/classify;
# NOTE: auth_request subrequests are always GET with no body.
# proxy_pass_request_body has no effect here — the classifier
# will always receive an empty body via this path.
# Use the Python httpx proxy below for body-aware classification.
proxy_set_header Content-Type application/json;
}
}
Classifier microservice (FastAPI):
from fastapi import FastAPI, Request
from fastapi.responses import Response
import json
app = FastAPI()
embedding_clf = EmbeddingClassifier()
@app.post("/classify")
async def classify(request: Request):
body = await request.body()
try:
data = json.loads(body)
messages = data.get("messages", [])
user_text = " ".join(
m["content"] for m in messages
if m.get("role") == "user" and isinstance(m.get("content"), str)
)
# Stage 1: heuristic
h_tier = classify_heuristic(messages)
# Stage 2: embedding classifier for uncertain cases
if h_tier == 2:
e_tier, confidence = embedding_clf.classify(user_text)
tier = e_tier if confidence > 0.7 else 2
else:
tier = h_tier
except Exception:
tier = 2 # safe default
response = Response(content="", status_code=200)
response.headers["X-Route-Tier"] = str(tier)
return response
For production deployments, use a Python proxy with httpx instead. It reads the request body directly and avoids the auth_request body-stripping issue described above:
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
app = FastAPI()
BACKENDS = {1: "http://10.0.0.10:8001", 2: "http://10.0.0.11:8002", 3: "http://10.0.0.12:8003"}
@app.post("/v1/chat/completions")
async def route(request: Request):
body = await request.json()
tier = classify_request(body) # your classifier function
backend = BACKENDS[tier]
is_streaming = body.get("stream", False)
if is_streaming:
# Stream SSE chunks directly back to the client
async def stream_response():
async with httpx.AsyncClient(timeout=120.0) as client:
async with client.stream(
"POST", f"{backend}/v1/chat/completions", json=body
) as response:
async for chunk in response.aiter_bytes():
yield chunk
return StreamingResponse(stream_response(), media_type="text/event-stream")
# Non-streaming: propagate the backend status code to the caller
async with httpx.AsyncClient(timeout=120.0) as client:
response = await client.post(f"{backend}/v1/chat/completions", json=body)
return JSONResponse(content=response.json(), status_code=response.status_code)
Latency-Aware Routing: SLA Tiers and Fallback Chains
Define SLA contracts per tier upfront. These become the circuit-breaker thresholds that trigger automatic tier escalation.
| Tier | Target TTFT | Max TTFT (p95) | Trigger Escalation At |
|---|---|---|---|
| Tier 1 | <200ms | 500ms | KV cache >90% OR TTFT p95 >500ms |
| Tier 2 | <500ms | 1000ms | KV cache >90% OR TTFT p95 >1000ms |
| Tier 3 | <2000ms | 4000ms | Scale out (add GPU) |
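Saturation is one trigger; the other half of the fallback chain is failure handling. If a tier errors out or times out (a spot interruption on Tier 1, say), retry the same request one tier up instead of surfacing the error. A minimal sketch for non-streaming requests, assuming the BACKENDS map from the proxy code above:
import httpx

async def forward_with_fallback(body: dict, start_tier: int) -> httpx.Response:
    """Try the chosen tier, then escalate one tier at a time on failure."""
    last_error: Exception | None = None
    for tier in range(start_tier, 4):  # e.g. 1 -> 2 -> 3
        try:
            async with httpx.AsyncClient(timeout=120.0) as client:
                resp = await client.post(f"{BACKENDS[tier]}/v1/chat/completions", json=body)
                if resp.status_code < 500:
                    return resp  # success, or a client error worth returning as-is
        except (httpx.RequestError, httpx.TimeoutException) as exc:
            last_error = exc  # node unreachable or timed out - escalate
    raise RuntimeError(f"all tiers failed for this request: {last_error}")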
Poll vLLM's /metrics endpoint every 10 seconds to get KV cache utilization:
import httpx
import asyncio
TIER_ENDPOINTS = {
1: "http://10.0.0.10:8001/metrics",
2: "http://10.0.0.11:8002/metrics",
3: "http://10.0.0.12:8003/metrics",
}
async def get_kv_usage(tier: int) -> float:
"""Returns KV cache utilization as a float 0.0-1.0. Returns 0.0 on any network error."""
try:
async with httpx.AsyncClient() as client:
resp = await client.get(TIER_ENDPOINTS[tier], timeout=2.0)
for line in resp.text.splitlines():
if line.startswith("vllm:kv_cache_usage_perc"):
return float(line.split()[-1])
except (httpx.RequestError, httpx.TimeoutException, ValueError):
return 0.0
return 0.0
# In your classifier: check saturation before routing
async def route_with_saturation_check(base_tier: int) -> int:
try:
usage = await get_kv_usage(base_tier)
except Exception:
usage = 0.0
if usage > 0.90 and base_tier < 3:
return base_tier + 1 # escalate to next tier
return base_tier
For a deep dive on KV cache monitoring and optimization, see the KV cache optimization guide.
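The get_kv_usage call above makes a network round trip per routing decision. To match the poll-every-10-seconds pattern, run a small background task that refreshes a cached value and have the routing hot path read the cache instead. A sketch using asyncio, reusing TIER_ENDPOINTS and get_kv_usage from above:
import asyncio

# KV cache utilization per tier, refreshed in the background so routing never blocks on a scrape.
kv_usage_cache: dict[int, float] = {1: 0.0, 2: 0.0, 3: 0.0}

async def poll_kv_usage(interval_s: float = 10.0) -> None:
    while True:
        for tier in TIER_ENDPOINTS:
            kv_usage_cache[tier] = await get_kv_usage(tier)
        await asyncio.sleep(interval_s)

def route_with_cached_saturation(base_tier: int) -> int:
    """Same escalation rule as above, but reads the cache instead of scraping."""
    if kv_usage_cache.get(base_tier, 0.0) > 0.90 and base_tier < 3:
        return base_tier + 1
    return base_tier

# In the FastAPI app, start the poller at startup, e.g.
#   asyncio.create_task(poll_kv_usage())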
GPU Allocation Strategy: Matching GPU Tier to Model Tier
| Model Tier | GPU | VRAM | Models That Fit | On-Demand | Spot | Best For |
|---|---|---|---|---|---|---|
| Tier 1 | RTX 4090 PCIe | 24GB | 7B FP16, 13B INT4 | $0.51/hr | N/A | Simple queries, stateless |
| Tier 2 | A100 80GB PCIe | 80GB | 32B FP16, 70B FP8 | $1.43/hr | $1.14/hr | Medium reasoning, conversation |
| Tier 2 | A100 80GB SXM4 | 80GB | 32B FP16, 70B FP8 | $1.05/hr | $0.45/hr | Same but lower spot price |
| Tier 3 | H100 SXM5 | 80GB | 70B FP8, 100B+ (2x) | $2.40/hr | $0.80/hr | Complex reasoning, code |
| Tier 3 (XL) | H200 SXM5 | 141GB | 72B FP16, 200B+ | $4.50/hr | $1.19/hr | Very long context, 200B+ models |
Spheron bills per second. That means you can scale Tier 1 GPUs up during peak hours and terminate them during off-peak without paying for idle capacity. A GPU that runs 4 hours at peak instead of sitting provisioned for 24 costs 83% less.
For MIG (Multi-Instance GPU) as an alternative to separate GPU instances for Tier 1 - partitioning a single A100 to run multiple smaller models - see the MIG and GPU time-slicing guide.
Cost Savings Analysis: Router vs Single Large Model
Scenario: 100,000 requests/day, average 500 input + 200 output tokens per request.
Baseline: all requests to H100 SXM5 at $2.40/hr
At 100,000 requests/day with ~700 tokens average, and assuming ~300 tokens/sec throughput on H100 SXM5, you need roughly:
- Tokens per day: 100,000 × 700 = 70M tokens
- GPU-seconds needed: 70M / 300 = ~233,000 seconds = ~65 GPU-hours
- Daily GPU cost: 65 × $2.40 = $156/day
Routed: 60% Tier 1, 25% Tier 2, 15% Tier 3
Tier 1 (60,000 requests on RTX 4090, ~1,000 tokens/sec):
- Tokens: 60,000 × 700 = 42M tokens
- GPU-seconds: 42M / 1,000 = 42,000 seconds = ~12 GPU-hours
- Cost: 12 × $0.51 = $6.12/day
Tier 2 (25,000 requests on A100 80GB, ~500 tokens/sec):
- Tokens: 25,000 × 700 = 17.5M tokens
- GPU-seconds: 17.5M / 500 = 35,000 seconds = ~10 GPU-hours
- Cost: 10 × $1.43 = $14.30/day
Tier 3 (15,000 requests on H100 SXM5, ~400 tokens/sec):
- Tokens: 15,000 × 700 = 10.5M tokens
- GPU-seconds: 10.5M / 400 = 26,250 seconds = ~7 GPU-hours
- Cost: 7 × $2.40 = $16.80/day
Total routed cost: $6.12 + $14.30 + $16.80 = $37.22/day
Compared to $156/day baseline, that's a 76% cost reduction.
Using spot pricing where available (A100 SXM4 spot at $0.45/hr for Tier 2, H100 SXM5 spot at $0.80/hr for Tier 3) brings the total down further to roughly $16-18/day, an 88-90% reduction from baseline.
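The same arithmetic, wrapped in a few lines so you can plug in your own traffic mix, token counts, and measured throughputs. The figures below are the assumptions from the worked example above (exact totals differ slightly from the rounded GPU-hour figures):
# Daily GPU cost per tier: tokens handled / throughput = GPU-seconds, then hours x hourly rate.
REQUESTS_PER_DAY = 100_000
TOKENS_PER_REQUEST = 700  # ~500 input + ~200 output

tiers = [
    # (name, share of traffic, tokens/sec, $/hr) - substitute your own measurements
    ("tier1_rtx4090", 0.60, 1_000, 0.51),
    ("tier2_a100",    0.25,   500, 1.43),
    ("tier3_h100",    0.15,   400, 2.40),
]

routed_total = 0.0
for name, share, tps, price in tiers:
    tokens = REQUESTS_PER_DAY * share * TOKENS_PER_REQUEST
    gpu_hours = tokens / tps / 3600
    routed_total += gpu_hours * price
    print(f"{name}: {gpu_hours:.1f} GPU-hours -> ${gpu_hours * price:.2f}/day")

baseline = REQUESTS_PER_DAY * TOKENS_PER_REQUEST / 300 / 3600 * 2.40  # everything on H100
print(f"routed: ${routed_total:.2f}/day vs baseline ${baseline:.2f}/day")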
For more on reducing per-token reasoning costs, see the reasoning model inference cost guide.
Production Monitoring: Tracking Route Decisions
Emit these Prometheus labels on every request:
from prometheus_client import Counter, Histogram
routing_requests = Counter(
"llm_router_requests_total",
"Total routed requests",
["routing_tier", "escalated", "original_tier", "model_id"]
)
routing_latency = Histogram(
"llm_router_latency_ms",
"End-to-end latency per routed request",
["routing_tier"],
buckets=[50, 100, 200, 500, 1000, 2000, 5000]
)
Three dashboards to track:
- Tier distribution over time: if Tier 1 drops below 50% of traffic, your classifier is over-routing. Check if query distribution has shifted or if the embedding classifier is degrading.
- Escalation rate per tier: rising escalation rates (>5% from baseline) indicate quality issues with the source tier. Either the model degraded, the tier thresholds are wrong, or your query distribution shifted.
- p50/p95/p99 latency per tier: a latency spike on Tier 1 usually means GPU memory pressure. Check vllm:kv_cache_usage_perc. A spike on Tier 3 means you need to scale out.
Quality drift detection: run a secondary LLM judge on a random 1% sample of Tier 1 and Tier 2 outputs. Score each on a 1-5 scale. If average quality drops below your threshold (typically 3.5/5), tighten the routing thresholds - send fewer queries to the lower tier until quality recovers.
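A sketch of that sampling loop: score roughly 1% of Tier 1 and Tier 2 responses with an LLM judge on a 1-5 scale and flag drift when the rolling average dips below the threshold. The judge client and alerting hook are placeholders; the prompt and the 3.5 cutoff follow the text above:
import random
from collections import deque

QUALITY_THRESHOLD = 3.5
recent_scores: dict[int, deque] = {1: deque(maxlen=500), 2: deque(maxlen=500)}

async def maybe_score_response(tier: int, prompt: str, answer: str, judge_client) -> None:
    """Judge a ~1% sample of Tier 1/2 outputs on a 1-5 scale and track a rolling average."""
    if tier not in recent_scores or random.random() > 0.01:
        return
    resp = await judge_client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # or a dedicated judge model
        messages=[
            {"role": "system", "content": "Rate the answer's quality from 1 (unusable) to 5 (excellent). Reply with only the digit."},
            {"role": "user", "content": f"Question:\n{prompt[:1000]}\n\nAnswer:\n{answer[:2000]}"},
        ],
        max_tokens=1,
        temperature=0,
    )
    try:
        score = int(resp.choices[0].message.content.strip())
    except (ValueError, IndexError):
        return
    recent_scores[tier].append(score)
    avg = sum(recent_scores[tier]) / len(recent_scores[tier])
    if len(recent_scores[tier]) >= 50 and avg < QUALITY_THRESHOLD:
        # Hook alerting here (Slack, PagerDuty) and tighten the routing thresholds.
        print(f"quality drift on tier {tier}: rolling average {avg:.2f} < {QUALITY_THRESHOLD}")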
For MoE model deployment and cost optimization, see the MoE inference optimization guide.
Case Study: 61% Cost Reduction with Smart Routing
A team running a customer support LLM was processing 150,000 queries/day, routing everything through a Llama-3.3-70B on H100 SXM5. The GPU was running at ~60% average utilization - not terrible, but still paying H100 prices for every "What are your business hours?" query.
After classifying 1,000 historical queries from their logs, the breakdown was:
- 65%: simple factual queries (product specs, hours, FAQ, return policies)
- 25%: moderate queries (account questions, order status, policy lookups)
- 10%: complex queries (billing disputes, refund edge cases, escalations)
They deployed a 3-tier router:
- Qwen2.5-7B on RTX 4090 for Tier 1 ($0.51/hr)
- Llama-3.1-8B on A100 80GB for Tier 2 ($1.43/hr)
- Llama-3.3-70B on H100 SXM5 for Tier 3 ($2.40/hr)
Before: one H100 SXM5 at $2.40/hr handling all 150k queries/day. At ~6,250 queries/hr and ~400 tokens/sec throughput, they needed roughly 65-70 GPU-hours/day.
Daily cost before: ~68 GPU-hours × $2.40 = $163/day
After routing (65% Tier 1, 25% Tier 2, 10% Tier 3):
- Tier 1: 41 GPU-hours × $0.51 = $20.91/day
- Tier 2: 12 GPU-hours × $1.43 = $17.16/day
- Tier 3: 11 GPU-hours × $2.40 = $26.40/day
- Total: $64.47/day
Result: 61% cost reduction with escalation rate on Tier 1 staying below 3% (measured via the confidence fallback chain). Customer satisfaction scores were unchanged across a 2-week A/B test.
The team then added A100 SXM4 spot instances for Tier 2, bringing that cost down to $0.45/hr and pushing the total daily cost to ~$53, a 68% reduction from the original baseline.
Spheron's GPU marketplace lets you rent RTX 4090, A100, and H100 GPUs from the same platform, billed per second. That per-second billing is what makes dynamic routing practical: scale Tier 1 up during peak hours and back down at night without paying for idle capacity.
Rent RTX 4090 → | Rent A100 → | Rent H100 → | View all pricing →
