LLM Inference Router on GPU Cloud: Smart Model Routing for Cost and Latency (2026)

Written by Mitrasish, Co-founder | Apr 5, 2026
Tags: LLM Inference Router, GPU Cloud, Model Routing, vLLM, SGLang, Cost Optimization, H100, A100
Most teams running LLMs in production are paying for a 70B model to answer "What is the capital of France?" Running every query through your largest model is the easiest architecture and the most expensive one. A 70B model on H100 costs $2.40/hr on-demand - sending every request there when 60% of them could be answered by a 7B model at $0.51/hr is burning money. The fix is a routing layer that classifies each query and sends it to the cheapest model that can handle it. This post covers how to build that router: classifier design, NGINX proxy setup, GPU tier selection, and the monitoring you need to catch quality drift before users notice. For background on why inference costs compound quickly at scale, see the reasoning model inference cost guide and the GPU cost optimization playbook. If you need auth, multi-provider failover, and per-team budget tracking on top of your routing layer, see the LiteLLM and AI gateway deployment guide - the gateway sits above the router to handle what NGINX cannot.

Why You Need an LLM Inference Router in 2026

The RunPod March 2026 State of AI report showed that Qwen overtook Llama as the most-deployed self-hosted model family. Teams are now routinely running Qwen2.5-7B alongside Qwen2.5-72B and Llama-3.3-70B simultaneously on the same infrastructure. That multi-model footprint makes routing economically rational in a way it wasn't two years ago.

The core math is simple. A typical production workload breaks down roughly like this:

  • ~60% simple queries: factual lookups, short summaries, classification, simple Q&A
  • ~25% medium queries: document analysis, multi-turn conversation, moderate reasoning
  • ~15% complex queries: code generation, multi-step reasoning, long-form synthesis

If you send all of that to one large model, you pay the large model price for every request. If you route by complexity, your average cost drops to something close to a weighted average:

| Strategy | GPU | On-Demand ($/hr) | Handles |
| --- | --- | --- | --- |
| No routing (all queries) | H100 SXM5 | $2.40 | 100% of traffic |
| Tier 1 (routed) | RTX 4090 | $0.51 | ~60% of traffic |
| Tier 2 (routed) | A100 80GB PCIe | $1.43 | ~25% of traffic |
| Tier 3 (routed) | H100 SXM5 | $2.40 | ~15% of traffic |

The weighted average GPU cost with routing: (0.60 × $0.51) + (0.25 × $1.43) + (0.15 × $2.40) = $0.306 + $0.358 + $0.360 = $1.02/hr equivalent. That's a 57% cost reduction before you touch spot pricing.
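
The arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the traffic shares and on-demand prices from the table; the name `blended_hourly_cost` is introduced here for illustration and is not part of any library.

```python
# (traffic share, on-demand $/hr) per tier, taken from the table above.
TIERS = {
    "tier1_rtx4090": (0.60, 0.51),
    "tier2_a100": (0.25, 1.43),
    "tier3_h100": (0.15, 2.40),
}

def blended_hourly_cost(tiers: dict[str, tuple[float, float]]) -> float:
    """Traffic-weighted average of the per-tier hourly prices."""
    return sum(share * price for share, price in tiers.values())

blended = blended_hourly_cost(TIERS)
savings = 1 - blended / 2.40  # versus sending every request to the H100
print(f"${blended:.2f}/hr equivalent, {savings:.0%} lower")
```

Swapping in your own measured traffic shares is the quickest way to see whether routing is worth building for your workload.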

Router Architecture Overview

The router sits between your clients and your model pool. Every request passes through it, gets classified, and gets forwarded to the appropriate model tier.

Client Request
     |
     v
[Router Proxy - NGINX]
     |
     v
[Classifier Microservice - FastAPI]
     |
     +---> Tier 1: [vLLM pool - RTX 4090]   (7B-13B models)
     |
     +---> Tier 2: [vLLM pool - A100 80GB]  (32B-70B models)
     |
     +---> Tier 3: [vLLM pool - H100 SXM5]  (70B-200B+ models)

Three components:

  1. Router proxy (NGINX or Envoy): receives all incoming requests, calls the classifier, proxies to the correct upstream. Handles connection pooling and retry logic.
  2. Classifier microservice (FastAPI): reads the request body, runs the complexity classification pipeline, returns a routing tier in a response header. Should complete in under 5ms.
  3. Model pool (vLLM per tier): each tier is one or more vLLM instances behind a load balancer. All expose OpenAI-compatible /v1/chat/completions endpoints so clients don't need to change anything.

For vLLM setup on each tier, see the vLLM production deployment guide. If you want to use SGLang for your Tier 3 (better throughput on long-context workloads), the SGLang production deployment guide covers that setup.

Model Tier Strategy: Small vs Medium vs Large

| Tier | Models | Params | Use Case | GPU | On-Demand | Spot |
| --- | --- | --- | --- | --- | --- | --- |
| Tier 1 | Qwen2.5-7B, Llama-3.2-3B | 3B-13B | Simple Q&A, classification, short summaries | RTX 4090 PCIe | $0.51/hr | N/A |
| Tier 2 | Qwen2.5-32B, Llama-3.1-8B+ | 8B-32B | Document analysis, moderate reasoning, multi-turn | A100 80GB PCIe | $1.43/hr | $1.14/hr |
| Tier 3 | Llama-3.3-70B, Qwen2.5-72B, 200B+ | 70B+ | Code, complex reasoning, long-form synthesis | H100 SXM5 | $2.40/hr | $0.80/hr |

For detailed per-GPU memory requirements at each model size, see the GPU memory requirements guide and the best GPU for AI inference 2026 guide.

Pricing fluctuates based on GPU availability. The prices above are as of 05 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Choosing Tier 1 Models

Qwen2.5-7B-Instruct is a strong default: it scores higher than many 13B models from 2024 on standard benchmarks, fits on a single RTX 4090 in FP16 (model weights ~14GB, ~17GB+ total with KV cache), and costs $0.51/hr. Llama-3.2-3B is a reasonable fallback if you need even lower latency at the cost of some quality headroom.

For structured reasoning tasks specifically (constraint satisfaction, ARC-style problems, classification with iterative refinement), the Hierarchical Reasoning Model is a strong candidate for the small tier: a 27M-parameter model that handles structured reasoning on a single RTX 4090 before escalating to a larger backend, at a fraction of even 7B model costs.

Run your Tier 1 on spot instances where available. Tier 1 handles simple, stateless queries - an interruption just means retrying the request, which the router handles automatically in the fallback chain.

Choosing Tier 3 Models

For Tier 3, the choice between Llama-3.3-70B and Qwen2.5-72B depends on your workload. Llama-3.3-70B is stronger on English code generation. Qwen2.5-72B has better multilingual performance and math reasoning. Both fit on a single H100 80GB in FP8.

Deploying the Model Pool with vLLM

Provision one GPU node per tier on Spheron at app.spheron.ai. SSH into each node and run the appropriate vLLM server.

Tier 1 (RTX 4090, 7B model):

bash
pip install vllm
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --quantization fp8 \
  --port 8001 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 256

Tier 2 (A100 80GB, 32B model):

bash
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --quantization fp8 \
  --port 8002 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 128

Tier 3 (H100 SXM5, 70B model):

bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --port 8003 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 64 \
  --max-model-len 32768

For detailed provisioning steps and Docker-based deployment, see the Spheron documentation. For multi-GPU tensor parallelism on Tier 3 (2x H100 for larger models), add --tensor-parallel-size 2.

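| Tier | Model | GPU | Weight VRAM (approx.) | Est. Throughput (tokens/sec) |
| --- | --- | --- | --- | --- |
| 1 | Qwen2.5-7B FP8 | RTX 4090 24GB | ~7GB | 800-1200 |
| 2 | Qwen2.5-32B FP8 | A100 80GB | ~32GB | 400-600 |
| 3 | Llama-3.3-70B FP8 | H100 SXM5 80GB | ~70GB | 300-500 |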
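
The weight figures in this table follow from parameter count times bytes per parameter. The sketch below is a rough estimator under that assumption only: it ignores activation memory, CUDA context overhead, and the difference between GB and GiB, and the helper names `weight_memory_gb` and `kv_headroom_gb` are hypothetical.

```python
# Approximate storage per parameter for common serving precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

def kv_headroom_gb(params_billions: float, precision: str,
                   vram_gb: float, gpu_memory_utilization: float) -> float:
    """Memory left for KV cache inside the --gpu-memory-utilization budget.

    A negative result means the weights alone do not fit in the budget.
    """
    budget = vram_gb * gpu_memory_utilization
    return budget - weight_memory_gb(params_billions, precision)

# The three tiers as configured in the vllm serve commands above.
for name, params, vram, util in [
    ("Tier 1 Qwen2.5-7B", 7, 24, 0.85),
    ("Tier 2 Qwen2.5-32B", 32, 80, 0.90),
    ("Tier 3 Llama-3.3-70B", 70, 80, 0.92),
]:
    headroom = kv_headroom_gb(params, "fp8", vram, util)
    print(f"{name}: ~{weight_memory_gb(params, 'fp8'):.0f}GB weights, "
          f"~{headroom:.1f}GB left for KV cache")
```

By this estimate Tier 3 has only a few GB of KV cache headroom on a single 80GB card, so it is worth checking the KV cache size vLLM reports at startup before relying on long contexts or high concurrency there.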

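Use spot instances where available on Spheron. Spot pricing brings A100 SXM4 (Tier 2) down to $0.45/hr and H100 SXM5 (Tier 3) down to $0.80/hr, cutting tier costs significantly. The tier table above lists no spot rate for the RTX 4090, so Tier 1 runs on-demand until spot capacity is offered.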

Query Classification: Three Approaches

1. Heuristic Rules (Fast, Free, Crude)

Heuristics are your baseline. They're deterministic, add negligible latency, and handle obvious cases reliably. They fail on edge cases.

python
import re

def classify_heuristic(messages: list) -> int:
    """Returns tier 1, 2, or 3."""
    # Combine all message content
    full_text = " ".join(m.get("content", "") for m in messages if isinstance(m.get("content"), str))
    token_estimate = len(full_text.split())

    # Hard signals for Tier 3
    has_code = bool(re.search(r'```|def |class |import |#include', full_text))
    has_math = bool(re.search(r'\$\$|\\\[|\\begin\{|integral|derivative|proof', full_text))
    is_multistep = bool(re.search(
        r'step by step|analyze|compare and contrast|write a (detailed|comprehensive)|'
        r'explain (how|why|the difference)', full_text, re.IGNORECASE
    ))
    is_long = token_estimate > 500

    if has_code or has_math or (is_multistep and is_long):
        return 3

    # Hard signals for Tier 1
    is_short = token_estimate < 100
    # Apply the ^ anchor to the last user message only.
    # full_text joins all turns, so ^ would anchor to the start of older
    # messages in multi-turn conversations and almost never match.
    last_user_msg = next(
        (m.get("content", "") for m in reversed(messages)
         if m.get("role") == "user" and isinstance(m.get("content"), str)),
        ""
    )
    is_factual = bool(re.search(
        r'^(what is|who is|when did|where is|how many|what are)',
        last_user_msg.strip().lower()
    ))

    if is_short and is_factual:
        return 1

    # Default: Tier 2 for everything in between
    return 2

Latency: under 0.1ms. Accuracy on clear cases: ~70-80%. Weak on moderate-complexity queries that don't match any pattern.

2. Embedding-Based Classifier (Balanced)

Train a lightweight logistic regression on top of sentence embeddings. This gives you a classifier that generalizes to your actual query distribution, not just the patterns you anticipated. If your router relies on query embeddings for classification, see the self-hosted TEI embedding guide for the embedding server setup.

python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import joblib

# Training (run once, offline)
def train_classifier(labeled_queries: list[dict]):
    """
    labeled_queries: [{"text": "...", "tier": 1|2|3}, ...]
    Needs ~500-1000 labeled examples for decent accuracy.
    """
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    texts = [q["text"] for q in labeled_queries]
    labels = [q["tier"] for q in labeled_queries]

    embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(embeddings, labels)

    # Persist the embedding model together with the classifier head
    joblib.dump((model, clf), "router_classifier.pkl")
    return clf

# Inference (per request)
class EmbeddingClassifier:
    def __init__(self, model_path: str = "router_classifier.pkl"):
        self.embed_model, self.clf = joblib.load(model_path)

    def classify(self, text: str) -> tuple[int, float]:
        embedding = self.embed_model.encode([text])
        # One predict_proba call gives both the winning tier and its confidence
        probs = self.clf.predict_proba(embedding)[0]
        best = probs.argmax()
        # Cast NumPy scalars to plain Python types so they serialize cleanly
        return int(self.clf.classes_[best]), float(probs[best])

Latency on CPU: 1-5ms per request. Sub-1ms if batched. The all-MiniLM-L6-v2 model (22.7M parameters) loads in ~80MB of RAM and runs on CPU without a GPU allocation. Accuracy on your own query distribution: typically 85-92% after labeling 500-1000 examples.

3. LLM-Judge Routing (Accurate, Adds Latency)

For borderline queries that the embedding classifier is uncertain about (confidence below 0.7), escalate to an LLM judge. Use the Tier 1 model itself to classify the query before serving it.

python
async def llm_judge_classify(prompt: str, tier1_client) -> int:
    """Ask the Tier 1 model to rate the complexity. Adds 50-200ms."""
    system = (
        "Rate this query's complexity: "
        "1 (simple factual, one-sentence answer), "
        "2 (moderate, requires explanation), "
        "3 (complex reasoning, code, or multi-step analysis). "
        "Reply with only the digit."
    )
    response = await tier1_client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt[:500]}  # truncate to limit judge latency
        ],
        max_tokens=1,
        temperature=0
    )
    try:
        # content can be None on an empty completion; treat that as a parse failure
        tier = int((response.choices[0].message.content or "").strip())
        return tier if tier in (1, 2, 3) else 2
    except (ValueError, IndexError):
        return 2  # fallback to Tier 2 if parse fails

Run the judge on the same Tier 1 GPU. It adds zero extra GPU cost. The latency overhead (50-200ms) is only paid for borderline queries, which are typically 15-20% of requests.
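
The three approaches compose naturally into a cascade where each stage runs only when the previous one is unsure. The sketch below shows that control flow with pluggable stages. It is an illustration rather than the article's classifier: `cascade_classify` and the two stand-in stages are hypothetical, the stages are synchronous for brevity (the LLM judge above is async), and the 0.7 threshold is the one quoted in the text.

```python
from typing import Callable, Optional

# A stage returns (tier, confidence), or None when it has no opinion.
Stage = Callable[[str], Optional[tuple[int, float]]]

def cascade_classify(text: str, stages: list[Stage],
                     min_confidence: float = 0.7,
                     default_tier: int = 2) -> tuple[int, str]:
    """Return (tier, name of the stage that decided)."""
    for stage in stages:
        result = stage(text)
        if result is None:
            continue  # stage abstained, try the next one
        tier, confidence = result
        if tier in (1, 2, 3) and confidence >= min_confidence:
            return tier, stage.__name__
    # Nobody was confident enough: use the safe middle tier
    return default_tier, "default"

# Stand-in stages for illustration only; swap in the real classifiers.
def heuristic_stage(text: str) -> Optional[tuple[int, float]]:
    return (3, 1.0) if "```" in text else None

def embedding_stage(text: str) -> Optional[tuple[int, float]]:
    return (1, 0.9) if len(text.split()) < 8 else (2, 0.55)

stages = [heuristic_stage, embedding_stage]
print(cascade_classify("What is the capital of France?", stages))
```

Returning the deciding stage's name alongside the tier makes it easy to log which layer of the cascade is doing the work, which helps when tuning the confidence threshold.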

Building the Routing Proxy with NGINX

The NGINX auth_request module can call the classifier before proxying, but it has a fundamental limitation: auth_request subrequests are always internal GET requests and never include the original request body, regardless of proxy_pass_request_body on. That directive only applies to regular proxy_pass locations. The result is that your FastAPI /classify endpoint always receives an empty body, json.loads(body) raises an exception, and the except block silently falls back to tier 2 for every request. The router never routes to Tier 1 or Tier 3.

The recommended approach is the Python httpx proxy shown after this section, which reads the body normally. The NGINX config below is included for reference if you need NGINX for TLS termination or connection pooling, but the classification logic should live in the Python layer, not as an auth_request subrequest.

Note: auth_request requires ngx_http_auth_request_module, which is compiled into most standard NGINX builds (check with nginx -V 2>&1 | grep auth_request).

nginx
upstream tier1 {
    server 10.0.0.10:8001;
    keepalive 32;
}

upstream tier2 {
    server 10.0.0.11:8002;
    keepalive 32;
}

upstream tier3 {
    server 10.0.0.12:8003;
    keepalive 32;
}

upstream classifier {
    server 127.0.0.1:9000;
    keepalive 16;
}

# Map the classifier header to an upstream group.
# map is only valid at http level, alongside the upstream blocks.
map $route_tier $upstream_backend {
    "1"     tier1;
    "2"     tier2;
    default tier3;
}

server {
    listen 8080;

    location /v1/chat/completions {
        auth_request /classify;
        auth_request_set $route_tier $upstream_http_x_route_tier;

        # Route through a variable instead of if blocks: if is evaluated in the
        # rewrite phase, before auth_request has set $route_tier, and
        # proxy_set_header / proxy_read_timeout are not allowed inside if.
        proxy_pass http://$upstream_backend;
        proxy_http_version 1.1;          # needed for upstream keepalive
        proxy_set_header Connection "";  # needed for upstream keepalive
        proxy_set_header Host $host;
        proxy_set_header X-Request-ID $request_id;
        proxy_read_timeout 120s;
    }

    location /classify {
        internal;
        proxy_pass http://classifier/classify;
        # NOTE: auth_request subrequests are always GET with no body.
        # The classifier will always receive an empty body via this path.
        # Use the Python httpx proxy below for body-aware classification.
        proxy_pass_request_body off;
        proxy_set_header Content-Length "";
        proxy_set_header Content-Type application/json;
    }
}

Classifier microservice (FastAPI):

python
from fastapi import FastAPI, Request
from fastapi.responses import Response
import json

app = FastAPI()
embedding_clf = EmbeddingClassifier()

@app.post("/classify")
async def classify(request: Request):
    body = await request.body()
    try:
        data = json.loads(body)
        messages = data.get("messages", [])
        user_text = " ".join(
            m["content"] for m in messages
            if m.get("role") == "user" and isinstance(m.get("content"), str)
        )

        # Stage 1: heuristic
        h_tier = classify_heuristic(messages)

        # Stage 2: embedding classifier for uncertain cases
        if h_tier == 2:
            e_tier, confidence = embedding_clf.classify(user_text)
            tier = e_tier if confidence > 0.7 else 2
        else:
            tier = h_tier

    except Exception:
        tier = 2  # safe default

    response = Response(content="", status_code=200)
    response.headers["X-Route-Tier"] = str(tier)
    return response

For production deployments, use a Python proxy with httpx instead. It reads the request body directly and avoids the auth_request body-stripping issue described above:

python
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import Response, StreamingResponse

app = FastAPI()

BACKENDS = {1: "http://10.0.0.10:8001", 2: "http://10.0.0.11:8002", 3: "http://10.0.0.12:8003"}

# One shared client so connections to each backend are pooled and reused
client = httpx.AsyncClient(timeout=120.0)

@app.post("/v1/chat/completions")
async def route(request: Request):
    body = await request.json()
    tier = classify_request(body)  # your classifier function
    backend = BACKENDS.get(tier, BACKENDS[2])  # unknown tier: fall back to Tier 2
    url = f"{backend}/v1/chat/completions"

    if body.get("stream", False):
        # Stream SSE chunks directly back to the client
        async def stream_response():
            async with client.stream("POST", url, json=body) as response:
                async for chunk in response.aiter_bytes():
                    yield chunk

        return StreamingResponse(stream_response(), media_type="text/event-stream")

    # Non-streaming: pass the backend status code and raw body through,
    # so non-JSON error pages from the backend do not raise in the proxy
    response = await client.post(url, json=body)
    return Response(
        content=response.content,
        status_code=response.status_code,
        media_type=response.headers.get("content-type", "application/json"),
    )

Latency-Aware Routing: SLA Tiers and Fallback Chains

Define SLA contracts per tier upfront. These become the circuit-breaker thresholds that trigger automatic tier escalation.

| Tier | Target TTFT | Max TTFT (p95) | Trigger Escalation At |
| --- | --- | --- | --- |
| Tier 1 | <200ms | 500ms | KV cache >90% OR TTFT p95 >500ms |
| Tier 2 | <500ms | 1000ms | KV cache >90% OR TTFT p95 >1000ms |
| Tier 3 | <2000ms | 4000ms | Scale out (add GPU) |
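
One way to turn the table into a circuit breaker is to keep a rolling window of TTFT samples per tier and compare its p95 against the tier's limit. The sketch below assumes a nearest-rank percentile over the last 200 samples; the class name `TtftBreaker`, the window size, and the minimum sample count are illustrative choices, not values from the article.

```python
import math
from collections import deque

# Max p95 TTFT per tier in milliseconds, from the SLA table above.
MAX_P95_TTFT_MS = {1: 500, 2: 1000, 3: 4000}

class TtftBreaker:
    """Tracks recent TTFT samples for one tier and reports SLA breaches."""

    def __init__(self, tier: int, window: int = 200, min_samples: int = 20):
        self.tier = tier
        self.samples: deque[float] = deque(maxlen=window)
        self.min_samples = min_samples

    def record(self, ttft_ms: float) -> None:
        self.samples.append(ttft_ms)

    def p95(self) -> float:
        """Nearest-rank 95th percentile of the window (0.0 when empty)."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        rank = math.ceil(0.95 * len(ordered))
        return ordered[rank - 1]

    def should_escalate(self, kv_usage: float = 0.0) -> bool:
        if self.tier >= 3:
            return False  # Tier 3 scales out instead of escalating
        if kv_usage > 0.90:
            return True
        if len(self.samples) < self.min_samples:
            return False  # not enough data to trust the percentile
        return self.p95() > MAX_P95_TTFT_MS[self.tier]

breaker = TtftBreaker(tier=1)
for _ in range(100):
    breaker.record(100.0)   # healthy traffic
for _ in range(10):
    breaker.record(900.0)   # a burst of slow first tokens
print(breaker.p95(), breaker.should_escalate())
```

Requiring a minimum number of samples before tripping avoids escalating on the first slow request after a cold start.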

Poll vLLM's /metrics endpoint every 10 seconds to get KV cache utilization:

python
import httpx
import asyncio

TIER_ENDPOINTS = {
    1: "http://10.0.0.10:8001/metrics",
    2: "http://10.0.0.11:8002/metrics",
    3: "http://10.0.0.12:8003/metrics",
}

async def get_kv_usage(tier: int) -> float:
    """Returns KV cache utilization as a float 0.0-1.0. Returns 0.0 on any network error."""
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.get(TIER_ENDPOINTS[tier], timeout=2.0)
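        for line in resp.text.splitlines():
            # Older vLLM releases expose this gauge as vllm:gpu_cache_usage_perc
            if line.startswith(("vllm:kv_cache_usage_perc", "vllm:gpu_cache_usage_perc")):
                return float(line.split()[-1])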
    except (httpx.RequestError, httpx.TimeoutException, ValueError):
        return 0.0
    return 0.0

# In your classifier: check saturation before routing
async def route_with_saturation_check(base_tier: int) -> int:
    try:
        usage = await get_kv_usage(base_tier)
    except Exception:
        usage = 0.0
    if usage > 0.90 and base_tier < 3:
        return base_tier + 1  # escalate to next tier
    return base_tier

For a deep dive on KV cache monitoring and optimization, see the KV cache optimization guide.
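
Calling the metrics endpoint on every request would add a network round trip to the routing path. A background poller that refreshes a cache on the 10-second interval keeps the routing decision a plain dictionary lookup. This is a sketch under that assumption: `KvUsageCache` and `pick_tier` are hypothetical names, and the fetch function is injected so that a function like `get_kv_usage` can be passed in.

```python
import asyncio
from typing import Awaitable, Callable

class KvUsageCache:
    """Caches per-tier KV cache utilization so routing never waits on /metrics."""

    def __init__(self, fetch: Callable[[int], Awaitable[float]],
                 tiers: tuple[int, ...] = (1, 2, 3), interval_s: float = 10.0):
        self._fetch = fetch
        self._tiers = tiers
        self._interval_s = interval_s
        self._usage = {tier: 0.0 for tier in tiers}

    def get(self, tier: int) -> float:
        return self._usage.get(tier, 0.0)

    async def refresh_once(self) -> None:
        results = await asyncio.gather(
            *(self._fetch(tier) for tier in self._tiers), return_exceptions=True
        )
        for tier, result in zip(self._tiers, results):
            # On failure keep the last good value rather than resetting to 0.0
            if not isinstance(result, BaseException):
                self._usage[tier] = float(result)

    async def run_forever(self) -> None:
        while True:
            await self.refresh_once()
            await asyncio.sleep(self._interval_s)

def pick_tier(base_tier: int, cache: KvUsageCache, threshold: float = 0.90) -> int:
    """Escalate one tier when the base tier's KV cache is saturated."""
    if base_tier < 3 and cache.get(base_tier) > threshold:
        return base_tier + 1
    return base_tier

# Demo with a fake fetcher standing in for the real /metrics call.
async def fake_fetch(tier: int) -> float:
    return {1: 0.95, 2: 0.40, 3: 0.10}[tier]

cache = KvUsageCache(fake_fetch)
asyncio.run(cache.refresh_once())
print(pick_tier(1, cache), pick_tier(2, cache))
```

In a FastAPI service the poller would typically be started once at startup with `asyncio.create_task(cache.run_forever())`.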

GPU Allocation Strategy: Matching GPU Tier to Model Tier

| Model Tier | GPU | VRAM | Models That Fit | On-Demand | Spot | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| Tier 1 | RTX 4090 PCIe | 24GB | 7B FP16, 13B INT4 | $0.51/hr | N/A | Simple queries, stateless |
| Tier 2 | A100 80GB PCIe | 80GB | 32B FP16, 70B FP8 | $1.43/hr | $1.14/hr | Medium reasoning, conversation |
| Tier 2 | A100 80GB SXM4 | 80GB | 32B FP16, 70B FP8 | $1.05/hr | $0.45/hr | Same but lower spot price |
| Tier 3 | H100 SXM5 | 80GB | 70B FP8, 100B+ (2x) | $2.40/hr | $0.80/hr | Complex reasoning, code |
| Tier 3 (XL) | H200 SXM5 | 141GB | 72B FP16, 200B+ | $4.50/hr | $1.19/hr | Very long context, 200B+ models |

Spheron bills per second. That means you can scale Tier 1 GPUs up during peak hours and terminate them during off-peak without paying for idle capacity. A workload that runs for 4 hours at peak instead of 24 hours idle costs 83% less.

For MIG (Multi-Instance GPU) as an alternative to separate GPU instances for Tier 1 - partitioning a single A100 to run multiple smaller models - see the MIG and GPU time-slicing guide. A practical example of this pattern is pairing Ministral 3 3B at the edge with Ministral 3 14B Reasoning on Spheron cloud for complex query escalation.

Cost Savings Analysis: Router vs Single Large Model

Scenario: 100,000 requests/day, average 500 input + 200 output tokens per request.

Baseline: all requests to H100 SXM5 at $2.40/hr

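At 100,000 requests/day with ~700 tokens average, and assuming ~300 tokens/sec throughput on H100 SXM5 (the low end of the 300-500 tokens/sec estimate above; the Tier 3 figure below assumes ~400, so the savings are somewhat sensitive to this assumption), you need roughly: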

  • Tokens per day: 100,000 × 700 = 70M tokens
  • GPU-seconds needed: 70M / 300 = ~233,000 seconds = ~65 GPU-hours
  • Daily GPU cost: 65 × $2.40 = $156/day

Routed: 60% Tier 1, 25% Tier 2, 15% Tier 3

Tier 1 (60,000 requests on RTX 4090, ~1,000 tokens/sec):

  • Tokens: 60,000 × 700 = 42M tokens
  • GPU-seconds: 42M / 1,000 = 42,000 seconds = ~12 GPU-hours
  • Cost: 12 × $0.51 = $6.12/day

Tier 2 (25,000 requests on A100 80GB, ~500 tokens/sec):

  • Tokens: 25,000 × 700 = 17.5M tokens
  • GPU-seconds: 17.5M / 500 = 35,000 seconds = ~10 GPU-hours
  • Cost: 10 × $1.43 = $14.30/day

Tier 3 (15,000 requests on H100 SXM5, ~400 tokens/sec):

  • Tokens: 15,000 × 700 = 10.5M tokens
  • GPU-seconds: 10.5M / 400 = 26,250 seconds = ~7 GPU-hours
  • Cost: 7 × $2.40 = $16.80/day

Total routed cost: $6.12 + $14.30 + $16.80 = $37.22/day

Compared to $156/day baseline, that's a 76% cost reduction.

Using spot pricing where available (A100 SXM4 spot at $0.45/hr for Tier 2, H100 SXM5 spot at $0.80/hr for Tier 3) brings the total down further to roughly $16-18/day, an 88-90% reduction from baseline.
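
The on-demand scenario can be recomputed directly from its inputs. The sketch below uses the request counts, throughput assumptions, and prices stated in this section; the helper names are illustrative. It does not round GPU-hours to whole numbers, so the per-tier figures differ by a few cents from the ones above.

```python
TOKENS_PER_REQUEST = 700  # 500 input + 200 output, as in the scenario

def gpu_hours(requests: int, tokens_per_sec: float) -> float:
    """GPU-hours needed to process the given requests at a steady throughput."""
    return requests * TOKENS_PER_REQUEST / tokens_per_sec / 3600

def daily_cost(plan: list[tuple[int, float, float]]) -> float:
    """plan entries are (requests/day, tokens/sec, $/hr)."""
    return sum(gpu_hours(reqs, tps) * price for reqs, tps, price in plan)

baseline = daily_cost([(100_000, 300, 2.40)])
routed = daily_cost([
    (60_000, 1000, 0.51),  # Tier 1, RTX 4090
    (25_000, 500, 1.43),   # Tier 2, A100 80GB
    (15_000, 400, 2.40),   # Tier 3, H100 SXM5
])
print(f"baseline ${baseline:.0f}/day, routed ${routed:.0f}/day, "
      f"{1 - routed / baseline:.0%} reduction")
```

Because the throughput numbers drive the result, it is worth replacing them with measured tokens/sec from your own deployment before quoting a savings figure.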

For more on reducing per-token reasoning costs, see the reasoning model inference cost guide.

If you need higher output quality rather than lower cost per query, a Mixture of Agents architecture - running multiple proposer models in parallel and aggregating their outputs - is a complementary approach to routing.

Production Monitoring: Tracking Route Decisions

Emit these Prometheus metrics on every request:

python
from prometheus_client import Counter, Histogram

routing_requests = Counter(
    "llm_router_requests_total",
    "Total routed requests",
    ["routing_tier", "escalated", "original_tier", "model_id"]
)

routing_latency = Histogram(
    "llm_router_latency_ms",
    "End-to-end latency per routed request",
    ["routing_tier"],
    buckets=[50, 100, 200, 500, 1000, 2000, 5000]
)

Three dashboards to track:

  1. Tier distribution over time: if Tier 1 drops below 50% of traffic, your classifier is over-routing to the higher tiers. Check if query distribution has shifted or if the embedding classifier is degrading.
  2. Escalation rate per tier: rising escalation rates (>5% from baseline) indicate quality issues with the source tier. Either the model degraded, the tier thresholds are wrong, or your query distribution shifted.
  3. p50/p95/p99 latency per tier: a latency spike on Tier 1 usually means GPU memory pressure. Check vllm:kv_cache_usage_perc. A spike on Tier 3 means you need to scale out.

Quality drift detection: run a secondary LLM judge on a random 1% sample of Tier 1 and Tier 2 outputs. Score each on a 1-5 scale. If average quality drops below your threshold (typically 3.5/5), tighten the routing thresholds - send fewer queries to the lower tier until quality recovers.
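
The sampling and threshold logic described here can be sketched as a small monitor. This is an illustration under stated assumptions: `QualityDriftMonitor` is a hypothetical class, the judge call that produces the 1-5 score is left out, and the rolling window and minimum score count are arbitrary choices. The 1% sample rate and 3.5 threshold come from the text.

```python
import random
from collections import deque
from typing import Optional

class QualityDriftMonitor:
    """Samples a fraction of responses and tracks a rolling mean judge score."""

    def __init__(self, sample_rate: float = 0.01, threshold: float = 3.5,
                 window: int = 500, min_scores: int = 30,
                 rng: Optional[random.Random] = None):
        self.sample_rate = sample_rate
        self.threshold = threshold
        self.min_scores = min_scores
        self._scores: deque[float] = deque(maxlen=window)
        self._rng = rng or random.Random()

    def should_sample(self) -> bool:
        """Decide whether this response goes to the secondary judge."""
        return self._rng.random() < self.sample_rate

    def record_score(self, score: float) -> None:
        if not 1 <= score <= 5:
            raise ValueError("judge score must be on the 1-5 scale")
        self._scores.append(score)

    def mean_score(self) -> Optional[float]:
        if not self._scores:
            return None
        return sum(self._scores) / len(self._scores)

    def is_drifting(self) -> bool:
        """True once enough scores exist and their mean is below the threshold."""
        if len(self._scores) < self.min_scores:
            return False
        return self.mean_score() < self.threshold

# min_scores is lowered here only to keep the demo short.
monitor = QualityDriftMonitor(min_scores=3)
for score in (3, 3, 4):
    monitor.record_score(score)
print(monitor.mean_score(), monitor.is_drifting())
```

Keeping one monitor per tier lets you tighten the routing threshold for the tier that is drifting without affecting the others.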

For MoE model deployment and cost optimization, see the MoE inference optimization guide.

Case Study: 61% Cost Reduction with Smart Routing

A team running a customer support LLM was processing 150,000 queries/day, routing everything through a Llama-3.3-70B on H100 SXM5. The GPU was running at ~60% average utilization - not terrible, but still paying H100 prices for every "What are your business hours?" query.

After classifying 1,000 historical queries from their logs, the breakdown was:

  • 65%: simple factual queries (product specs, hours, FAQ, return policies)
  • 25%: moderate queries (account questions, order status, policy lookups)
  • 10%: complex queries (billing disputes, refund edge cases, escalations)

They deployed a 3-tier router:

  • Qwen2.5-7B on RTX 4090 for Tier 1 ($0.51/hr)
  • Llama-3.1-8B on A100 80GB for Tier 2 ($1.43/hr)
  • Llama-3.3-70B on H100 SXM5 for Tier 3 ($2.40/hr)

Before: H100 SXM5 capacity at $2.40/hr handling all 150k queries/day. At ~6,250 queries/hr and ~400 tokens/sec throughput, they needed roughly 65-70 GPU-hours/day, which is about three GPUs' worth of round-the-clock capacity.

Daily cost before: ~68 GPU-hours × $2.40 = $163/day

After routing (65% Tier 1, 25% Tier 2, 10% Tier 3):

  • Tier 1: 41 GPU-hours × $0.51 = $20.91/day
  • Tier 2: 12 GPU-hours × $1.43 = $17.16/day
  • Tier 3: 11 GPU-hours × $2.40 = $26.40/day
  • Total: $64.47/day

Result: 61% cost reduction with escalation rate on Tier 1 staying below 3% (measured via the confidence fallback chain). Customer satisfaction scores were unchanged across a 2-week A/B test.

The team then added A100 SXM4 spot instances for Tier 2, bringing that cost down to $0.45/hr and pushing the total daily cost to ~$53, a 68% reduction from the original baseline.


Spheron's GPU marketplace lets you rent RTX 4090, A100, and H100 GPUs from the same platform, billed per second. That per-second billing is what makes dynamic routing practical: scale Tier 1 up during peak hours and back down at night without paying for idle capacity.

Quick Setup Guide

  1. Define your model tier strategy

    Classify queries into 3 tiers: Tier 1 (simple, <500 tokens) routed to a 7B model on RTX 4090 ($0.51/hr); Tier 2 (medium complexity, 500-2000 tokens or domain-specific) routed to a 32B-70B model on A100 80GB ($1.43/hr); Tier 3 (complex reasoning, code, multi-step) routed to a 70B-200B+ model on H100 SXM5 ($2.40/hr on-demand, $0.80/hr spot). Measure your query mix first - run 1,000 sample queries through a complexity classifier before deploying the router in production.

  2. Deploy model pool with vLLM

    Provision one GPU node per tier on Spheron. For each node, run vLLM with its model: 'vllm serve Qwen/Qwen2.5-7B-Instruct --quantization fp8 --port 8001' (Tier 1), 'vllm serve Qwen/Qwen2.5-32B-Instruct --quantization fp8 --port 8002' (Tier 2), 'vllm serve meta-llama/Llama-3.3-70B-Instruct --quantization fp8 --port 8003' (Tier 3). All expose OpenAI-compatible /v1/chat/completions endpoints. Document each node's internal IP for the router config.

  3. Build the complexity classifier

    Start with a heuristic classifier as a baseline: route short factual queries (under ~100 words, no code/math markers) to Tier 1; queries with code blocks, math notation, or long multi-step instructions to Tier 3; everything else to Tier 2. Then layer an embedding classifier using a 22.7M-parameter model (e.g., sentence-transformers/all-MiniLM-L6-v2) trained on a labeled sample of your own queries. The embedding classifier adds roughly 1-5ms on CPU and is far more accurate than pure heuristics for edge cases.

  4. Configure the routing proxy

    Use a Python httpx proxy as your routing layer: read the request body, call classify_request(), look up the backend in BACKENDS = {1: ..., 2: ..., 3: ...}, and forward the request. This approach handles body-based classification correctly. If you need NGINX for TLS termination, put it in front but keep classification in the Python layer. NGINX auth_request subrequests strip the request body by design, so the classifier will never receive the query via that path. Set upstream keepalive connections to avoid TCP overhead per request. Add request_id logging so every route decision is auditable.

  5. Implement the fallback chain

    In your classifier microservice, implement a fallback chain: if the Tier 1 model's response includes low_confidence=true (detected by a secondary scoring pass or a short confidence prompt), re-route the request to Tier 2. If Tier 2 escalates similarly, route to Tier 3. Log all escalations with the original tier, final tier, and latency overhead. A healthy router should escalate <5% of initially classified Tier 1 requests.

  6. Monitor routing decisions and quality drift

    Emit a routing_tier label on every request to your metrics system (Prometheus or Datadog). Track: (1) tier distribution over time - if Tier 1 drops below 50% of traffic, your classifier may be over-routing; (2) escalation rate per tier - rising escalation rates signal quality issues with lower tiers; (3) p50/p95 latency per tier - a latency spike on Tier 1 may indicate GPU memory pressure and a need to scale that tier. Set alerts on escalation_rate > 10% above baseline and on p95 latency > SLA threshold.
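
The fallback chain from step 5 can be sketched as a loop that climbs tiers until a response is not flagged low-confidence. Everything named here is hypothetical: `run_with_fallback`, the `Backend` signature, and the stub backends stand in for real model calls and for whatever confidence scorer you use.

```python
import time
from typing import Callable, Optional

# A backend call returns (answer, low_confidence).
Backend = Callable[[str], tuple[str, bool]]

def run_with_fallback(prompt: str, backends: dict[int, Backend],
                      start_tier: int, max_tier: int = 3) -> dict:
    """Try start_tier, escalating while the response is low-confidence or the call fails."""
    started = time.perf_counter()
    tier = start_tier
    answer: Optional[str] = None
    while True:
        try:
            answer, low_confidence = backends[tier](prompt)
        except Exception:
            # Treat a failed call (for example a spot interruption) like low confidence
            answer, low_confidence = None, True
        if not low_confidence or tier >= max_tier:
            break
        tier += 1
    # These fields mirror what step 5 says to log for every escalation
    return {
        "answer": answer,
        "original_tier": start_tier,
        "final_tier": tier,
        "escalated": tier != start_tier,
        "latency_ms": (time.perf_counter() - started) * 1000,
    }

# Stub backends for illustration only.
def tier1_stub(prompt: str) -> tuple[str, bool]:
    return "not sure", True

def tier2_stub(prompt: str) -> tuple[str, bool]:
    return "a confident answer", False

def tier3_stub(prompt: str) -> tuple[str, bool]:
    return "a very thorough answer", False

result = run_with_fallback(
    "Explain my refund options", {1: tier1_stub, 2: tier2_stub, 3: tier3_stub}, start_tier=1
)
print(result["final_tier"], result["escalated"])
```

The returned dictionary maps directly onto the `routing_tier`, `original_tier`, and `escalated` labels used in the monitoring section, so each call can feed the escalation-rate dashboard.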

Frequently Asked Questions

What is an LLM inference router?

An LLM inference router is a proxy layer that classifies incoming queries by complexity and forwards them to the most cost-appropriate model. Simple queries go to a cheap 7B model; complex ones go to a large 70B or 200B+ model. The goal is to maintain output quality while reducing average cost per token.

How much can routing save?

In practice, 60-70% of queries to a large model are simple enough to be handled by a 7B model with no noticeable quality drop. If your large model costs $2.40/hr and your small model costs $0.51/hr, routing 65% of traffic to the small model reduces your average per-query cost by over 50%. The exact savings depend on your query mix and the quality threshold you set.

How do you classify query complexity?

Three approaches work in practice: heuristic rules (token count, keyword patterns) are fast and free but crude; embedding-based classifiers (a small BERT-class model) are more accurate and add roughly 1-5ms on CPU; LLM-judge routing (asking a small model to classify the query) is most accurate but adds 50-200ms overhead. Most production routers combine heuristics for obvious cases with an embedding classifier for borderline queries.

Which GPU should each tier run on?

For small models (7B-13B), RTX 4090 at $0.51/hr is the most cost-efficient option on Spheron. For medium models (32B-70B), A100 80GB at $1.43/hr handles FP16 serving and FP8 allows larger models. For large models (70B-200B+), H100 SXM5 at $0.80/hr spot or $2.40/hr on-demand provides the throughput needed for complex queries. Where spot capacity is offered, use it for the small model tier first, since that tier handles simple, stateless traffic that can be retried.

How do you handle queries the small model gets wrong?

Implement a fallback chain: if the small model returns a low-confidence output (detected via a secondary confidence scorer or by monitoring user feedback), escalate to the next tier. Track routing decisions with request IDs in your logs so you can audit whether escalations are happening at the expected rate. Set a quality drift alert that fires if escalation rate exceeds your baseline by more than 10 percentage points.
