Engineering

LLM Inference Router on GPU Cloud: Smart Model Routing for Cost and Latency (2026)

Written by Mitrasish, Co-founder | Apr 5, 2026
LLM Inference, Model Routing, vLLM, SGLang, GPU Cloud, Cost Optimization, H100, A100, Multi-Model Inference, Inference Gateway
Most teams running LLMs in production are paying for a 70B model to answer "What is the capital of France?" Running every query through your largest model is the easiest architecture and the most expensive one. A 70B model on an H100 costs $2.40/hr on-demand - sending every request there when 60% of them could be answered by a 7B model at $0.51/hr is burning money. The fix is a routing layer that classifies each query and sends it to the cheapest model that can handle it. This post covers how to build that router: classifier design, NGINX proxy setup, GPU tier selection, and the monitoring you need to catch quality drift before users notice. For background on why inference costs compound quickly at scale, see the reasoning model inference cost guide and the GPU cost optimization playbook.

Why You Need an LLM Inference Router in 2026

The RunPod March 2026 State of AI report showed Qwen overtaking Llama as the most-deployed self-hosted model family. Teams now routinely run Qwen2.5-7B alongside Qwen2.5-72B and Llama-3.3-70B on the same infrastructure. That multi-model footprint makes routing economically rational in a way it wasn't two years ago.

The core math is simple. A typical production workload breaks down roughly like this:

  • ~60% simple queries: factual lookups, short summaries, classification, simple Q&A
  • ~25% medium queries: document analysis, multi-turn conversation, moderate reasoning
  • ~15% complex queries: code generation, multi-step reasoning, long-form synthesis

If you send all of that to one large model, you pay the large model price for every request. If you route by complexity, your average cost drops to something close to a weighted average:

Strategy                   GPU              On-Demand ($/hr)   Handles
No routing (all queries)   H100 SXM5        $2.40              100% of traffic
Tier 1 (routed)            RTX 4090         $0.51              ~60% of traffic
Tier 2 (routed)            A100 80GB PCIe   $1.43              ~25% of traffic
Tier 3 (routed)            H100 SXM5        $2.40              ~15% of traffic

The weighted average GPU cost with routing: (0.60 × $0.51) + (0.25 × $1.43) + (0.15 × $2.40) = $0.306 + $0.358 + $0.360 = $1.02/hr equivalent. That's a 57% cost reduction before you touch spot pricing.
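The blended rate is easy to sanity-check in a few lines (the shares and hourly rates are the ones from the table above):

```python
# Weighted-average GPU cost under the 60/25/15 routed split above.
TIERS = {
    1: {"rate": 0.51, "share": 0.60},  # RTX 4090
    2: {"rate": 1.43, "share": 0.25},  # A100 80GB PCIe
    3: {"rate": 2.40, "share": 0.15},  # H100 SXM5
}

blended = sum(t["rate"] * t["share"] for t in TIERS.values())
baseline = 2.40  # everything-on-H100 rate
savings = 1 - blended / baseline

print(f"blended: ${blended:.2f}/hr, savings: {savings:.0%}")
# blended: $1.02/hr, savings: 57%
```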

Router Architecture Overview

The router sits between your clients and your model pool. Every request passes through it, gets classified, and gets forwarded to the appropriate model tier.

Client Request
     |
     v
[Router Proxy - NGINX]
     |
     v
[Classifier Microservice - FastAPI]
     |
     +---> Tier 1: [vLLM pool - RTX 4090]   (7B-13B models)
     |
     +---> Tier 2: [vLLM pool - A100 80GB]  (32B-70B models)
     |
     +---> Tier 3: [vLLM pool - H100 SXM5]  (70B-200B+ models)

Three components:

  1. Router proxy (NGINX or Envoy): receives all incoming requests, calls the classifier, proxies to the correct upstream. Handles connection pooling and retry logic.
  2. Classifier microservice (FastAPI): reads the request body, runs the complexity classification pipeline, returns a routing tier in a response header. Should complete in under 5ms.
  3. Model pool (vLLM per tier): each tier is one or more vLLM instances behind a load balancer. All expose OpenAI-compatible /v1/chat/completions endpoints so clients don't need to change anything.

For vLLM setup on each tier, see the vLLM production deployment guide. If you want to use SGLang for your Tier 3 (better throughput on long-context workloads), the SGLang production deployment guide covers that setup.

Model Tier Strategy: Small vs Medium vs Large

Tier     Models                              Params   Use Case                                            GPU              On-Demand   Spot
Tier 1   Qwen2.5-7B, Llama-3.2-3B            3B-13B   Simple Q&A, classification, short summaries         RTX 4090 PCIe    $0.51/hr    N/A
Tier 2   Qwen2.5-32B, Llama-3.1-8B+          8B-32B   Document analysis, moderate reasoning, multi-turn   A100 80GB PCIe   $1.43/hr    $1.14/hr
Tier 3   Llama-3.3-70B, Qwen2.5-72B, 200B+   70B+     Code, complex reasoning, long-form synthesis        H100 SXM5        $2.40/hr    $0.80/hr

For detailed per-GPU memory requirements at each model size, see the GPU memory requirements guide and the best GPU for AI inference 2026 guide.

Pricing fluctuates based on GPU availability. The prices above are as of April 5, 2026 and may have changed. Check current GPU pricing → for live rates.

Choosing Tier 1 Models

Qwen2.5-7B-Instruct is a strong default: it scores higher than many 13B models from 2024 on standard benchmarks, fits on a single RTX 4090 in FP16 (model weights ~14GB, ~17GB+ total with KV cache), and costs $0.51/hr. Llama-3.2-3B is a reasonable fallback if you need even lower latency at the cost of some quality headroom.

Run your Tier 1 on spot instances where available. Tier 1 handles simple, stateless queries - an interruption just means retrying the request, which the router handles automatically in the fallback chain.

Choosing Tier 3 Models

For Tier 3, the choice between Llama-3.3-70B and Qwen2.5-72B depends on your workload. Llama-3.3-70B is stronger on English code generation. Qwen2.5-72B has better multilingual performance and math reasoning. Both fit on a single H100 80GB in FP8.

Deploying the Model Pool with vLLM

Provision one GPU node per tier on Spheron at app.spheron.ai. SSH into each node and run the appropriate vLLM server.

Tier 1 (RTX 4090, 7B model):

bash
pip install vllm
# FP8 runs via --quantization; vLLM's --dtype flag takes float types
# (auto, float16, bfloat16, float32), not "fp8"
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --quantization fp8 \
  --port 8001 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 256

Tier 2 (A100 80GB, 32B model):

bash
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --quantization fp8 \
  --port 8002 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 128

Tier 3 (H100 SXM5, 70B model):

bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --port 8003 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 64 \
  --max-model-len 32768

For detailed provisioning steps and Docker-based deployment, see the Spheron documentation. For multi-GPU tensor parallelism on Tier 3 (2x H100 for larger models), add --tensor-parallel-size 2.

Tier   Model               GPU              VRAM Used   Est. Throughput (tokens/sec)
1      Qwen2.5-7B FP8      RTX 4090 24GB    ~7GB        800-1200
2      Qwen2.5-32B FP8     A100 80GB        ~32GB       400-600
3      Llama-3.3-70B FP8   H100 SXM5 80GB   ~70GB       300-500

Use spot instances for Tier 2 and Tier 3 where available on Spheron. Spot pricing brings A100 SXM4 down to $0.45/hr and H100 SXM5 down to $0.80/hr, cutting tier costs significantly.

Query Classification: Three Approaches

1. Heuristic Rules (Fast, Free, Crude)

Heuristics are your baseline. They're deterministic, add zero latency, and handle obvious cases reliably. They fail on edge cases.

python
import re

def classify_heuristic(messages: list) -> int:
    """Returns tier 1, 2, or 3."""
    # Combine all message content
    full_text = " ".join(m.get("content", "") for m in messages if isinstance(m.get("content"), str))
    token_estimate = len(full_text.split())

    # Hard signals for Tier 3
    has_code = bool(re.search(r'```|def |class |import |#include', full_text))
    has_math = bool(re.search(r'\$\$|\\\[|\\begin\{|integral|derivative|proof', full_text))
    is_multistep = bool(re.search(
        r'step by step|analyze|compare and contrast|write a (detailed|comprehensive)|'
        r'explain (how|why|the difference)', full_text, re.IGNORECASE
    ))
    is_long = token_estimate > 500

    if has_code or has_math or (is_multistep and is_long):
        return 3

    # Hard signals for Tier 1
    is_short = token_estimate < 100
    # Apply the ^ anchor to the last user message only.
    # full_text joins all turns, so ^ would anchor to the start of older
    # messages in multi-turn conversations and almost never match.
    last_user_msg = next(
        (m.get("content", "") for m in reversed(messages)
         if m.get("role") == "user" and isinstance(m.get("content"), str)),
        ""
    )
    is_factual = bool(re.search(
        r'^(what is|who is|when did|where is|how many|what are)',
        last_user_msg.strip().lower()
    ))

    if is_short and is_factual:
        return 1

    # Default: Tier 2 for everything in between
    return 2

Latency: under 0.1ms. Accuracy on clear cases: ~70-80%. Weak on moderate-complexity queries that don't match any pattern.

2. Embedding-Based Classifier (Balanced)

Train a lightweight logistic regression on top of sentence embeddings. This gives you a classifier that generalizes to your actual query distribution, not just the patterns you anticipated.

python
from sentence_transformers import SentenceTransformer
import numpy as np
import joblib

# Training (run once, offline)
def train_classifier(labeled_queries: list[dict]):
    """
    labeled_queries: [{"text": "...", "tier": 1|2|3}, ...]
    Needs ~500-1000 labeled examples for decent accuracy.
    """
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    texts = [q["text"] for q in labeled_queries]
    labels = [q["tier"] for q in labeled_queries]

    embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(embeddings, labels)

    joblib.dump((model, clf), "router_classifier.pkl")
    return clf

# Inference (per request)
class EmbeddingClassifier:
    def __init__(self, model_path: str = "router_classifier.pkl"):
        self.embed_model, self.clf = joblib.load(model_path)

    def classify(self, text: str) -> tuple[int, float]:
        embedding = self.embed_model.encode([text])
        # Cast numpy scalars to native Python types for JSON-safe returns
        tier = int(self.clf.predict(embedding)[0])
        confidence = float(self.clf.predict_proba(embedding).max())
        return tier, confidence

Latency on CPU: 1-5ms per request. Sub-1ms if batched. The all-MiniLM-L6-v2 model (22.7M parameters) loads in ~80MB of RAM and runs on CPU without a GPU allocation. Accuracy on your own query distribution: typically 85-92% after labeling 500-1000 examples.

3. LLM-Judge Routing (Accurate, Adds Latency)

For borderline queries that the embedding classifier is uncertain about (confidence below 0.7), escalate to an LLM judge. Use the Tier 1 model itself to classify the query before serving it.

python
async def llm_judge_classify(prompt: str, tier1_client) -> int:
    """Ask the Tier 1 model to rate the complexity. Adds 50-200ms."""
    system = (
        "Rate this query's complexity: "
        "1 (simple factual, one-sentence answer), "
        "2 (moderate, requires explanation), "
        "3 (complex reasoning, code, or multi-step analysis). "
        "Reply with only the digit."
    )
    response = await tier1_client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt[:500]}  # truncate to limit judge latency
        ],
        max_tokens=1,
        temperature=0
    )
    try:
        tier = int(response.choices[0].message.content.strip())
        return tier if tier in (1, 2, 3) else 2
    except (ValueError, IndexError):
        return 2  # fallback to Tier 2 if parse fails

Run the judge on the same Tier 1 GPU. It adds zero extra GPU cost. The latency overhead (50-200ms) is only paid for borderline queries, which are typically 15-20% of requests.

Building the Routing Proxy with NGINX

The NGINX auth_request module can call the classifier before proxying, but it has a fundamental limitation: auth_request subrequests are always internal GET requests and never include the original request body, regardless of proxy_pass_request_body on. That directive only applies to regular proxy_pass locations. The result is that your FastAPI /classify endpoint always receives an empty body, json.loads(body) raises an exception, and the except block silently falls back to tier 2 for every request. The router never routes to Tier 1 or Tier 3.

The recommended approach is the Python httpx proxy shown after this section, which reads the body normally. The NGINX config below is included for reference if you need NGINX for TLS termination or connection pooling, but the classification logic should live in the Python layer, not as an auth_request subrequest.

Note: auth_request requires ngx_http_auth_request_module, which is compiled into most standard NGINX builds (check with nginx -V 2>&1 | grep auth_request).

nginx
upstream tier1 {
    server 10.0.0.10:8001;
    keepalive 32;
}

upstream tier2 {
    server 10.0.0.11:8002;
    keepalive 32;
}

upstream tier3 {
    server 10.0.0.12:8003;
    keepalive 32;
}

upstream classifier {
    server 127.0.0.1:9000;
    keepalive 16;
}

server {
    listen 8080;

    # Map classifier header to upstream backend
    # (map block belongs at http level in a real config; shown here for clarity)
    # map $route_tier $upstream_backend { "1" http://tier1; "2" http://tier2; default http://tier3; }

    location /v1/chat/completions {
        auth_request /classify;
        auth_request_set $route_tier $upstream_http_x_route_tier;

        # Route based on classifier header.
        # proxy_set_header and proxy_read_timeout are repeated in every if block:
        # NGINX inherits array-type directives all-or-nothing, so once any are
        # declared in an inner context, none from the outer level apply.
        if ($route_tier = "1") {
            proxy_pass http://tier1;
            proxy_set_header Host $host;
            proxy_set_header X-Request-ID $request_id;
            proxy_read_timeout 120s;
        }
        if ($route_tier = "2") {
            proxy_pass http://tier2;
            proxy_set_header Host $host;
            proxy_set_header X-Request-ID $request_id;
            proxy_read_timeout 120s;
        }
        # Default to tier3
        proxy_pass http://tier3;
        proxy_set_header Host $host;
        proxy_set_header X-Request-ID $request_id;
        proxy_read_timeout 120s;
    }

    location /classify {
        internal;
        proxy_pass http://classifier/classify;
        # NOTE: auth_request subrequests are always GET with no body.
        # proxy_pass_request_body has no effect here — the classifier
        # will always receive an empty body via this path.
        # Use the Python httpx proxy below for body-aware classification.
        proxy_set_header Content-Type application/json;
    }
}

Classifier microservice (FastAPI):

python
from fastapi import FastAPI, Request
from fastapi.responses import Response
import json

app = FastAPI()
embedding_clf = EmbeddingClassifier()

@app.post("/classify")
async def classify(request: Request):
    body = await request.body()
    try:
        data = json.loads(body)
        messages = data.get("messages", [])
        user_text = " ".join(
            m["content"] for m in messages
            if m.get("role") == "user" and isinstance(m.get("content"), str)
        )

        # Stage 1: heuristic
        h_tier = classify_heuristic(messages)

        # Stage 2: embedding classifier for uncertain cases
        if h_tier == 2:
            e_tier, confidence = embedding_clf.classify(user_text)
            tier = e_tier if confidence > 0.7 else 2
        else:
            tier = h_tier

    except Exception:
        tier = 2  # safe default

    response = Response(content="", status_code=200)
    response.headers["X-Route-Tier"] = str(tier)
    return response

For production deployments, use a Python proxy with httpx instead. It reads the request body directly and avoids the auth_request body-stripping issue described above:

python
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()

BACKENDS = {1: "http://10.0.0.10:8001", 2: "http://10.0.0.11:8002", 3: "http://10.0.0.12:8003"}

@app.post("/v1/chat/completions")
async def route(request: Request):
    body = await request.json()
    tier = classify_request(body)  # your classifier function
    backend = BACKENDS[tier]

    is_streaming = body.get("stream", False)

    if is_streaming:
        # Stream SSE chunks directly back to the client
        async def stream_response():
            async with httpx.AsyncClient(timeout=120.0) as client:
                async with client.stream(
                    "POST", f"{backend}/v1/chat/completions", json=body
                ) as response:
                    async for chunk in response.aiter_bytes():
                        yield chunk

        return StreamingResponse(stream_response(), media_type="text/event-stream")

    # Non-streaming: propagate the backend status code to the caller
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(f"{backend}/v1/chat/completions", json=body)
    return JSONResponse(content=response.json(), status_code=response.status_code)

Latency-Aware Routing: SLA Tiers and Fallback Chains

Define SLA contracts per tier upfront. These become the circuit-breaker thresholds that trigger automatic tier escalation.

Tier     Target TTFT   Max TTFT (p95)   Trigger Escalation At
Tier 1   <200ms        500ms            KV cache >90% OR TTFT p95 >500ms
Tier 2   <500ms        1000ms           KV cache >90% OR TTFT p95 >1000ms
Tier 3   <2000ms       4000ms           Scale out (add GPU)

Poll vLLM's /metrics endpoint every 10 seconds to get KV cache utilization:

python
import httpx
import asyncio

TIER_ENDPOINTS = {
    1: "http://10.0.0.10:8001/metrics",
    2: "http://10.0.0.11:8002/metrics",
    3: "http://10.0.0.12:8003/metrics",
}

async def get_kv_usage(tier: int) -> float:
    """Returns KV cache utilization as a float 0.0-1.0. Returns 0.0 on any network error."""
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.get(TIER_ENDPOINTS[tier], timeout=2.0)
        for line in resp.text.splitlines():
            if line.startswith("vllm:kv_cache_usage_perc"):
                return float(line.split()[-1])
    except (httpx.RequestError, httpx.TimeoutException, ValueError):
        return 0.0
    return 0.0

# In your classifier: check saturation before routing
async def route_with_saturation_check(base_tier: int) -> int:
    try:
        usage = await get_kv_usage(base_tier)
    except Exception:
        usage = 0.0
    if usage > 0.90 and base_tier < 3:
        return base_tier + 1  # escalate to next tier
    return base_tier

For a deep dive on KV cache monitoring and optimization, see the KV cache optimization guide.
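get_kv_usage above opens a connection per call, so it shouldn't sit on the hot routing path. A sketch of a background poller at the 10-second cadence mentioned, with routing decisions made against the cached value (assumes get_kv_usage from above):

```python
import asyncio

# Cached per-tier KV utilization, refreshed by a background task so the
# hot routing path never blocks on a /metrics call.
KV_USAGE: dict[int, float] = {1: 0.0, 2: 0.0, 3: 0.0}

async def poll_kv_usage(interval_s: float = 10.0):
    while True:
        for tier in KV_USAGE:
            KV_USAGE[tier] = await get_kv_usage(tier)
        await asyncio.sleep(interval_s)

def route_with_cached_saturation(base_tier: int) -> int:
    """Escalate one tier when the cached KV usage is above 90%."""
    if KV_USAGE.get(base_tier, 0.0) > 0.90 and base_tier < 3:
        return base_tier + 1
    return base_tier

# At app startup: asyncio.create_task(poll_kv_usage())
```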

GPU Allocation Strategy: Matching GPU Tier to Model Tier

Model Tier    GPU              VRAM    Models That Fit       On-Demand   Spot       Best For
Tier 1        RTX 4090 PCIe    24GB    7B FP16, 13B INT4     $0.51/hr    N/A        Simple queries, stateless
Tier 2        A100 80GB PCIe   80GB    32B FP16, 70B FP8     $1.43/hr    $1.14/hr   Medium reasoning, conversation
Tier 2        A100 80GB SXM4   80GB    32B FP16, 70B FP8     $1.05/hr    $0.45/hr   Same, lower spot price
Tier 3        H100 SXM5        80GB    70B FP8, 100B+ (2x)   $2.40/hr    $0.80/hr   Complex reasoning, code
Tier 3 (XL)   H200 SXM5        141GB   72B FP16, 200B+       $4.50/hr    $1.19/hr   Very long context, 200B+ models

Spheron bills per second. That means you can scale Tier 1 GPUs up during peak hours and terminate them during off-peak without paying for idle capacity. A workload that runs for 4 hours at peak instead of 24 hours idle costs 83% less.

For MIG (Multi-Instance GPU) as an alternative to separate GPU instances for Tier 1 - partitioning a single A100 to run multiple smaller models - see the MIG and GPU time-slicing guide.

Cost Savings Analysis: Router vs Single Large Model

Scenario: 100,000 requests/day, average 500 input + 200 output tokens per request.

Baseline: all requests to H100 SXM5 at $2.40/hr

At 100,000 requests/day with ~700 tokens average, and assuming ~300 tokens/sec throughput on H100 SXM5, you need roughly:

  • Tokens per day: 100,000 × 700 = 70M tokens
  • GPU-seconds needed: 70M / 300 = ~233,000 seconds = ~65 GPU-hours
  • Daily GPU cost: 65 × $2.40 = $156/day

Routed: 60% Tier 1, 25% Tier 2, 15% Tier 3

Tier 1 (60,000 requests on RTX 4090, ~1,000 tokens/sec):

  • Tokens: 60,000 × 700 = 42M tokens
  • GPU-seconds: 42M / 1,000 = 42,000 seconds = ~12 GPU-hours
  • Cost: 12 × $0.51 = $6.12/day

Tier 2 (25,000 requests on A100 80GB, ~500 tokens/sec):

  • Tokens: 25,000 × 700 = 17.5M tokens
  • GPU-seconds: 17.5M / 500 = 35,000 seconds = ~10 GPU-hours
  • Cost: 10 × $1.43 = $14.30/day

Tier 3 (15,000 requests on H100 SXM5, ~400 tokens/sec):

  • Tokens: 15,000 × 700 = 10.5M tokens
  • GPU-seconds: 10.5M / 400 = 26,250 seconds = ~7 GPU-hours
  • Cost: 7 × $2.40 = $16.80/day

Total routed cost: $6.12 + $14.30 + $16.80 = $37.22/day

Compared to $156/day baseline, that's a 76% cost reduction.

Using spot pricing where available (A100 SXM4 spot at $0.45/hr for Tier 2, H100 SXM5 spot at $0.80/hr for Tier 3) brings the total down further to roughly $16-18/day, an 88-90% reduction from baseline.
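The arithmetic above can be reproduced directly (exact figures differ by a few cents from the text, which rounds GPU-hours to whole numbers first):

```python
def daily_gpu_cost(requests: int, tokens_per_req: int,
                   tokens_per_sec: float, rate_per_hr: float) -> float:
    """Daily cost = total tokens / throughput, converted to GPU-hours."""
    gpu_hours = requests * tokens_per_req / tokens_per_sec / 3600
    return gpu_hours * rate_per_hr

baseline = daily_gpu_cost(100_000, 700, 300, 2.40)          # all on H100
routed = (daily_gpu_cost(60_000, 700, 1_000, 0.51)          # Tier 1, RTX 4090
          + daily_gpu_cost(25_000, 700, 500, 1.43)          # Tier 2, A100 80GB
          + daily_gpu_cost(15_000, 700, 400, 2.40))         # Tier 3, H100

print(f"baseline ${baseline:.0f}/day, routed ${routed:.0f}/day, "
      f"savings {1 - routed / baseline:.0%}")
# baseline $156/day, routed $37/day, savings 76%
```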

For more on reducing per-token reasoning costs, see the reasoning model inference cost guide.

Production Monitoring: Tracking Route Decisions

Emit these Prometheus labels on every request:

python
from prometheus_client import Counter, Histogram

routing_requests = Counter(
    "llm_router_requests_total",
    "Total routed requests",
    ["routing_tier", "escalated", "original_tier", "model_id"]
)

routing_latency = Histogram(
    "llm_router_latency_ms",
    "End-to-end latency per routed request",
    ["routing_tier"],
    buckets=[50, 100, 200, 500, 1000, 2000, 5000]
)

Three dashboards to track:

  1. Tier distribution over time: if Tier 1 drops below 50% of traffic, your classifier is over-routing. Check if query distribution has shifted or if the embedding classifier is degrading.
  2. Escalation rate per tier: rising escalation rates (>5% from baseline) indicate quality issues with the source tier. Either the model degraded, the tier thresholds are wrong, or your query distribution shifted.
  3. p50/p95/p99 latency per tier: a latency spike on Tier 1 usually means GPU memory pressure. Check vllm:kv_cache_usage_perc. A spike on Tier 3 means you need to scale out.

Quality drift detection: run a secondary LLM judge on a random 1% sample of Tier 1 and Tier 2 outputs. Score each on a 1-5 scale. If average quality drops below your threshold (typically 3.5/5), tighten the routing thresholds - send fewer queries to the lower tier until quality recovers.
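The 1% sampling and the 3.5/5 floor can be sketched as follows; the judge scoring itself is whatever secondary LLM you run, so only the sampling and threshold logic is shown:

```python
import random

SAMPLE_RATE = 0.01      # judge ~1% of lower-tier outputs
QUALITY_FLOOR = 3.5     # tighten routing when mean score drops below this

def maybe_sample(tier: int) -> bool:
    """Only Tier 1 and Tier 2 outputs are sampled for quality judging."""
    return tier in (1, 2) and random.random() < SAMPLE_RATE

def drift_detected(scores: list[float]) -> bool:
    """True when the rolling mean judge score falls below the floor."""
    return bool(scores) and sum(scores) / len(scores) < QUALITY_FLOOR
```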

For MoE model deployment and cost optimization, see the MoE inference optimization guide.

Case Study: 61% Cost Reduction with Smart Routing

A team running a customer support LLM was processing 150,000 queries/day, routing everything through a Llama-3.3-70B on H100 SXM5. The GPU was running at ~60% average utilization - not terrible, but still paying H100 prices for every "What are your business hours?" query.

After classifying 1,000 historical queries from their logs, the breakdown was:

  • 65%: simple factual queries (product specs, hours, FAQ, return policies)
  • 25%: moderate queries (account questions, order status, policy lookups)
  • 10%: complex queries (billing disputes, refund edge cases, escalations)

They deployed a 3-tier router:

  • Qwen2.5-7B on RTX 4090 for Tier 1 ($0.51/hr)
  • Llama-3.1-8B on A100 80GB for Tier 2 ($1.43/hr)
  • Llama-3.3-70B on H100 SXM5 for Tier 3 ($2.40/hr)

Before: one H100 SXM5 at $2.40/hr handling all 150k queries/day. At ~6,250 queries/hr and ~400 tokens/sec throughput, they needed roughly 65-70 GPU-hours/day.

Daily cost before: ~68 GPU-hours × $2.40 = $163/day

After routing (65% Tier 1, 25% Tier 2, 10% Tier 3):

  • Tier 1: 41 GPU-hours × $0.51 = $20.91/day
  • Tier 2: 12 GPU-hours × $1.43 = $17.16/day
  • Tier 3: 11 GPU-hours × $2.40 = $26.40/day
  • Total: $64.47/day

Result: 61% cost reduction with escalation rate on Tier 1 staying below 3% (measured via the confidence fallback chain). Customer satisfaction scores were unchanged across a 2-week A/B test.

The team then added A100 SXM4 spot instances for Tier 2, bringing that cost down to $0.45/hr and pushing the total daily cost to ~$53, a 68% reduction from the original baseline.


Spheron's GPU marketplace lets you rent RTX 4090, A100, and H100 GPUs from the same platform, billed per second. That per-second billing is what makes dynamic routing practical: scale Tier 1 up during peak hours and back down at night without paying for idle capacity.

Rent RTX 4090 → | Rent A100 → | Rent H100 → | View all pricing →

Start building on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.