Engineering

Cloud vs Edge AI Inference: 2026 Hybrid Decision Guide

Written by Mitrasish, Co-founder · Apr 4, 2026
AI Inference · Edge AI · GPU Cloud · LLM Deployment · Hybrid Architecture · AI Infrastructure · On-Device AI · MLOps

Quantized 70B models now run on consumer GPUs. Jetson edge devices hit sub-100ms for vision tasks. Yet H100-class cloud GPUs remain the only practical option for frontier model inference. The decision of where to run inference is no longer binary, and the right answer depends on four variables that most teams get wrong: latency, cost, privacy, and model quality.

This guide gives you a decision framework for choosing between GPU cloud, on-device models, and hybrid architectures. For GPU benchmark context, see best GPU for AI inference in 2026. For model size and VRAM requirements, see the GPU memory requirements guide for LLMs.

The 2026 Inference Landscape

Three tiers of inference hardware have emerged, each with a different cost and capability profile:

Tier 1: Consumer and workstation GPUs. The RTX 5090 (32GB VRAM) and RTX 4090 (24GB VRAM) can run quantized models locally. A 7B Q4 model runs at 150-260 tokens/second on the RTX 5090 (throughput varies by quantization format and framework). A 70B Q4_K_M model requires 38-42GB VRAM, which exceeds the RTX 5090's 32GB; running it requires partial CPU offloading via llama.cpp, which drops throughput to 3-8 tokens/second. For 70B at usable speeds, a 48GB+ GPU is more practical. These are personal and workstation-class GPUs with no cloud billing, but their throughput and model size limits cap what you can serve.
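
The VRAM figures above follow from a simple weights-only estimate. A minimal sketch (the 4.5 bits/weight figure is a rough average for Q4_K_M's mixed quantization, and KV cache and activations add more on top):

```python
def model_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone.

    KV cache, activations, and framework overhead add several GB
    on top of this, so treat it as a lower bound.
    """
    return params_billion * bits_per_weight / 8

print(model_weight_gb(70, 16))    # 70B FP16: 140 GB
print(model_weight_gb(70, 4.5))   # 70B ~Q4_K_M: ~39 GB (38-42GB in practice)
print(model_weight_gb(7, 4.5))    # 7B Q4: ~4 GB
```

This is why a 70B Q4 model overflows a 32GB card but fits comfortably on 48GB+, and why 70B FP16 needs multi-GPU or an H200.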

Tier 2: Edge accelerators. NVIDIA Jetson AGX Orin (64GB unified memory) and Apple M4 Max (128GB unified memory) are purpose-built for on-device inference. Unified memory means the GPU and CPU share the same pool, so models larger than any consumer discrete GPU's VRAM can fit entirely in memory without CPU offloading. Jetson targets industrial and robotics deployments. Apple Silicon is increasingly common for developer-side inference.

Tier 3: Cloud GPU. H100 SXM5, H100 PCIe, A100, and H200 on demand. These handle frontier models, FP16/BF16 precision at scale, and multi-tenant workloads. No upfront cost, per-second billing available, and linear scaling.

| Tier | Example Hardware | VRAM | Cost (hardware or hourly) | 7B Tokens/sec | Model Size Limit (practical) |
|---|---|---|---|---|---|
| Consumer GPU | RTX 5090 | 32GB | ~$2,000 | 150-260 tok/s | 13B Q4 (70B Q4 with CPU offload, 3-8 tok/s) |
| Edge accelerator | Jetson AGX Orin | 64GB unified | ~$2,000 | 60-120 tok/s | 70B Q4 |
| Edge accelerator | Apple M4 Max | 128GB unified | ~$3,000-4,000 | 80-150 tok/s | 70B FP16 |
| Cloud GPU (on-demand) | H100 PCIe | 80GB | $2.01/hr | 3,000+ tok/s (batched) | 70B FP16 |
| Cloud GPU (on-demand) | H200 SXM5 | 141GB | $3.69/hr | 5,000+ tok/s (batched) | 405B Q4 |

The key insight: cloud GPU wins on raw throughput per dollar at high utilization. Edge hardware wins on latency and cost at low or moderate utilization. Hybrid architectures let you capture both.

Decision Framework: Four Variables That Actually Matter

1. Latency

Time-to-first-token (TTFT) requirements vary by use case:

  • Voice AI: Under 150ms total for the LLM stage. Any higher and the spoken response feels broken.
  • Interactive chat: Under 500ms is acceptable. Users notice delays above 1s.
  • Batch processing: No real-time constraint. Latency is irrelevant; throughput and cost per token matter.
  • Background agents: Seconds to minutes per task. Cloud burst capacity dominates.

Network round-trip to cloud adds 20-80ms depending on data center proximity. For voice and real-time applications, that budget is often already consumed before the LLM runs. Edge inference eliminates this overhead entirely.

2. Cost

On-device inference amortizes hardware cost over the device lifetime. Cloud GPU charges per second of compute. The crossover depends on utilization rate.

At high utilization (70%+ of hours), cloud GPU on-demand is cost-competitive with on-device hardware. At low or moderate utilization, on-device wins because you are not paying for idle capacity.

Spot cloud GPU changes the math: H100 spot at $0.80/hr on Spheron can undercut on-device hardware cost even at sustained load, but with interruption risk.

3. Privacy

On-device inference guarantees data never leaves the device. For medical records, legal documents, PII, and regulated industries, on-device is often the only compliant option. Cloud inference requires evaluating your vendor's data processing terms and your jurisdiction's data residency rules.

Some workloads are mixed: the metadata is sensitive (patient ID, user ID) but the actual inference payload is not. A hybrid router can strip PII before cloud escalation in some cases.
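
As a sketch of that idea, a minimal regex-based scrubber. The patterns and placeholder labels here are illustrative assumptions, not a complete PII solution; production routers should use a dedicated PII detection library tuned to their data:

```python
import re

# Illustrative patterns only -- real deployments need broader coverage
# (names, addresses, medical record numbers, etc.).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def strip_pii(text: str) -> tuple[str, bool]:
    """Replace matched PII with placeholders; report whether any was found."""
    found = False
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        found = found or n > 0
    return text, found
```

If `found` is True and the scrub is trusted, the cleaned payload can escalate to cloud; if the scrub cannot be trusted for the data class, force the edge path instead.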

4. Model Quality

Not every task requires a 70B model. For most classification, short Q&A, and simple lookups, a 7B or 13B model passes human evaluation. Tasks where model size materially affects quality include multi-step reasoning, code generation (especially across files), long-context synthesis, and anything requiring broad world knowledge.

If your edge model passes your quality eval on 85% of queries, you can route the other 15% to cloud and still save significant compute cost compared to running everything on cloud GPUs.


Decision matrix:

| Variable | Edge Wins | Cloud Wins | Hybrid |
|---|---|---|---|
| Latency | Real-time voice, sub-200ms TTFT | Batch jobs, async processing | Route latency-sensitive to edge |
| Cost | Low utilization, always-on ambient | Burst, high concurrency, 70B+ | Route simple queries to edge |
| Privacy | PII, regulated data, offline environments | Non-sensitive multi-tenant APIs | Strip PII, route clean traffic to cloud |
| Model quality | 7B/13B sufficient for task | 70B+ required, frontier reasoning | Route by complexity and task type |

Workloads That Belong on GPU Cloud

Large models that require FP16/BF16 precision. Any model above 30B parameters in full precision needs cloud GPUs. A 70B model in FP16 takes 140GB VRAM; that requires two H100 PCIe cards or a single H200. You cannot run this on consumer or edge hardware without significant quality-degrading quantization.

Burst traffic and batch inference. Overnight document processing, weekly batch embeddings, and variable-traffic APIs all fit the cloud-on-demand pattern. You pay only for the hours you use. For the billing model analysis, see serverless vs on-demand vs reserved GPU billing.

Multi-tenant APIs. If you serve inference to many users, a shared cloud inference server amortizes the per-GPU cost across all requests. Running a dedicated edge GPU per user is impractical at any meaningful scale. vLLM's continuous batching on a single H100 handles hundreds of concurrent requests that no edge device can match.

Fine-tuned and LoRA adapter serving. Fine-tuning and serving multiple LoRA adapters on the same base model requires high-end GPU memory and tensor parallelism. For LoRA serving patterns, see the LoRA multi-adapter serving guide.

Workloads That Belong on Edge

Real-time voice and audio pipelines. Round-trip cloud latency (20-80ms) plus LLM TTFT (150-300ms on cloud) often exceeds the 500ms budget for voice AI. Running Whisper ASR and a 7B model locally eliminates the network variable entirely. The voice AI GPU infrastructure guide covers VRAM budgets and GPU selection for full voice pipelines.

For a broader decision framework covering when to keep each stage of a voice pipeline on-device vs. cloud, see the case study section below on Voice AI with Local Whisper + Cloud LLM.

Always-on ambient AI. Wearables, home assistants, and embedded devices that run inference continuously cannot afford cloud GPU rates at 24/7 usage. An H100 at $2.01/hr running continuously costs roughly $1,450/month. An edge device at $2,000 hardware cost amortizes over 3 years.
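
The arithmetic behind that comparison, ignoring power cost (small at these rates):

```python
# Always-on cloud GPU: $2.01/hr, 24 hours/day, ~30 days/month
cloud_monthly = 2.01 * 24 * 30          # ~$1,447/month

# Edge device: $2,000 hardware amortized over 3 years (36 months)
edge_monthly = 2000 / 36                # ~$56/month

print(f"cloud: ${cloud_monthly:,.0f}/mo, edge: ${edge_monthly:,.0f}/mo")
```

For a continuously running workload, the cloud GPU costs roughly 26x the amortized edge device per month.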

Privacy-sensitive processing. Medical transcription, legal document review, enterprise chat with PII, and any workload where data residency is a compliance requirement. On-device inference keeps data off third-party infrastructure entirely.

Offline and unreliable network environments. Field deployments, aircraft, manufacturing floors, and anywhere network connectivity is intermittent or unavailable. Edge inference has no external dependency.

The Hybrid Architecture Pattern

The hybrid pattern uses two inference tiers:

Tier 1 (Edge): A small quantized model (7B Q4 or 13B Q4) running on local hardware. This tier handles classification, simple Q&A, short completions, and all latency-sensitive tasks. On an RTX 5090, a 7B Q4 model runs at 150-260 tokens/second (varies by quantization format and framework) with no network overhead.

Tier 2 (Cloud): A large model (70B+ or frontier) on a cloud GPU. This tier handles escalated queries: long-context tasks, reasoning-heavy requests, and anything the edge model fails on quality evaluation. Spheron H100 on-demand at $2.01/hr or spot at $0.80/hr.

The routing layer sits between the client and both inference tiers:

Request → [Router] → Edge Model (7B Q4, local)
                   ↘ Cloud GPU (70B, H100 on Spheron) [if escalated]

The router checks these signals, in priority order:

  1. PII detection flag: If the request contains sensitive data, force edge regardless of complexity.
  2. Token count threshold: If the prompt plus expected response exceeds the edge model's practical context budget (typically 512-2048 tokens), route to cloud.
  3. Task type classification: Code generation, multi-step reasoning, and long-form content go to cloud. Classification, short Q&A, and simple lookups go to edge.
  4. Edge model confidence score: Some architectures run the edge model first and escalate if confidence is below a threshold. This improves routing accuracy but adds one extra inference round-trip.

A well-tuned router handling 85% of traffic on edge reduces cloud GPU spend by 60-80% compared to sending all traffic to cloud.
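
A back-of-envelope check on that savings range. The model below is a simplification, and the 1.5x length multiplier for escalated queries is an assumption (escalated queries tend to be longer, which shrinks the savings):

```python
def cloud_spend_reduction(edge_fraction: float,
                          escalated_length_mult: float = 1.0) -> float:
    """Fraction of cloud token spend avoided, relative to an
    all-cloud baseline, when edge_fraction of queries stay local."""
    cloud_fraction = 1.0 - edge_fraction
    # Tokens still hitting cloud, normalized against all-cloud = 1.0
    cloud_tokens = cloud_fraction * escalated_length_mult
    return 1.0 - cloud_tokens

# 85% of queries on edge, escalated queries average length: ~85% reduction
print(cloud_spend_reduction(0.85))
# Escalated queries 1.5x longer than average: ~77.5% reduction
print(cloud_spend_reduction(0.85, 1.5))
```

Both figures land inside the 60-80% range once routing overhead and imperfect classification eat into the ideal case.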

Cost Comparison: On-Device Amortized vs Cloud Per-Token

Three scenarios for serving a 7B model:

Scenario 1: On-device (RTX 5090)

  • Hardware: ~$2,000 MSRP
  • Amortization: 3 years = $0.076/hr
  • Throughput: ~100 tokens/second for a 7B Q4 model
  • Power: ~500W average inference draw (TDP is 575W; sustained full-load draw is typically 558-594W, while inference workloads average closer to 500W), approximately $0.05/hr at $0.10/kWh
  • Cost per 1M tokens: ($0.076 + $0.05) / 100 tok/s / 3600 s/hr x 1,000,000 = ~$0.35/1M tokens at sustained load

Scenario 2: Cloud GPU on-demand (H100, Spheron)

  • Rate: $2.01/hr on-demand
  • Throughput: ~3,000 tokens/second (vLLM with continuous batching, 7B model, moderate concurrency)
  • Cost per 1M tokens: $2.01 / 3,000 tok/s / 3,600 s/hr x 1,000,000 = ~$0.19/1M tokens at full utilization

Scenario 3: Cloud GPU spot (H100, Spheron)

  • Rate: $0.80/hr spot
  • Same throughput as on-demand
  • Cost per 1M tokens: $0.80 / 3,000 tok/s / 3,600 s/hr x 1,000,000 = ~$0.07/1M tokens at full utilization

| Scenario | Rate | Throughput | Cost/1M tokens | Notes |
|---|---|---|---|---|
| On-device (RTX 5090) | ~$0.13/hr (amortized + power) | 100 tok/s | ~$0.35 | No network overhead, full utilization assumed |
| Cloud on-demand (H100) | $2.01/hr | 3,000 tok/s | ~$0.19 | Best at high concurrency |
| Cloud spot (H100) | $0.80/hr | 3,000 tok/s | ~$0.07 | Cheapest, but subject to interruption |
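
All three cost-per-token figures come from the same formula; a quick sketch:

```python
def cost_per_million_tokens(rate_per_hr: float, tokens_per_sec: float) -> float:
    """Dollars per 1M tokens at full utilization."""
    tokens_per_hr = tokens_per_sec * 3600
    return rate_per_hr / tokens_per_hr * 1_000_000

print(cost_per_million_tokens(0.126, 100))    # on-device: ~$0.35
print(cost_per_million_tokens(2.01, 3000))    # cloud on-demand: ~$0.19
print(cost_per_million_tokens(0.80, 3000))    # cloud spot: ~$0.07
```

Plugging in your own measured throughput and rates gives the crossover for your workload directly.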

The crossover: cloud GPU on-demand beats on-device cost per token only at high utilization (full concurrency). At 10% utilization (brief inference bursts throughout the day), on-demand cloud cost rises 10x because you pay for idle time. On-device has no idle cost. Spot cloud GPU beats on-device even at full utilization, but interruptions can disrupt real-time applications.

Pricing fluctuates based on GPU availability. The prices above reflect rates as of April 4, 2026 and may have changed. Check current GPU pricing → for live rates.

For broader cost optimization strategies, see the GPU cost optimization playbook.

Implementation Guide: Routing Logic

A minimal Python router that handles the three main routing signals:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferenceRequest:
    prompt: str
    task_type: str  # "chat", "code", "reasoning", "classify", "embed"
    pii_detected: bool = False
    token_count: int = 0
    expected_response_tokens: int = 128

class HybridRouter:
    def __init__(
        self,
        edge_token_limit: int = 512,
        cloud_tasks: Optional[set] = None,
    ):
        # Requests with prompt + response above this threshold go to cloud
        self.edge_token_limit = edge_token_limit
        # Task types that always require cloud (larger model)
        self.cloud_tasks = cloud_tasks if cloud_tasks is not None else {"code", "reasoning", "long-form"}

    def route(self, req: InferenceRequest) -> str:
        # Priority 1: PII forces edge regardless of complexity
        if req.pii_detected:
            return "edge"

        # Priority 2: Token budget check
        # Fall back to word count estimate when token_count is not provided
        prompt_tokens = req.token_count if req.token_count > 0 else len(req.prompt.split())
        total_tokens = prompt_tokens + req.expected_response_tokens
        if total_tokens > self.edge_token_limit:
            return "cloud"

        # Priority 3: Task type classification
        if req.task_type in self.cloud_tasks:
            return "cloud"

        # Default: handle on edge
        return "edge"


# Usage with OpenAI-compatible clients
import openai

router = HybridRouter(edge_token_limit=512)

# Local Ollama or llama.cpp endpoint
edge_client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Spheron-hosted vLLM endpoint
cloud_client = openai.OpenAI(
    base_url="https://your-spheron-instance/v1",
    api_key="your-api-key",
)

def infer(req: InferenceRequest) -> str:
    destination = router.route(req)
    client = edge_client if destination == "edge" else cloud_client
    model = "llama3.1:8b" if destination == "edge" else "meta-llama/Llama-3.3-70B-Instruct"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": req.prompt}],
    )
    if not response.choices:
        return ""
    return response.choices[0].message.content or ""
```

The router is stateless and adds less than 1ms overhead per request. Wire it in front of your inference clients as shown above.

For setting up the cloud vLLM endpoint referenced above, see the LLM deployment guide. For Spheron-specific deployment docs, see docs.spheron.ai.

Case Study: Voice AI with Local Whisper + Cloud LLM

A production voice AI system built on the hybrid pattern has three components:

Component 1: Edge ASR (Whisper Large v3 via FasterWhisper)

  • VRAM: ~4GB (INT8 quantized via FasterWhisper; standard FP16 uses ~4-5GB)
  • Latency: 30-50ms for a 5-second audio clip
  • Hardware requirement: any GPU with 6GB+ VRAM
  • Runs locally on every deployment node, no cloud dependency

Component 2: Routing decision

  • Input: transcribed text from ASR
  • If the query is under 50 tokens and the task type is "classify" or "simple Q&A": route to local 7B model
  • If the query exceeds 50 tokens or requires multi-step reasoning: escalate to cloud 70B model
  • Routing adds less than 1ms

Component 3: Cloud LLM (70B model on H100)

  • Spheron on-demand: $2.01/hr
  • Handles escalated queries requiring full model quality
  • Latency: 150-300ms TTFT for a 70B model

Latency budget comparison:

| Stage | Edge Path | Cloud Path |
|---|---|---|
| ASR (Whisper Large v3) | 30-50ms | 30-50ms |
| Routing decision | ~1ms | ~1ms |
| LLM TTFT | 80-120ms (7B local) | 150-300ms (70B cloud) |
| TTS (local Kokoro) | 50ms | 50ms |
| Total | ~170ms | ~400ms |

The edge path stays well under 200ms. The cloud path hits 350-400ms, which is still within the 500ms voice AI budget. About 70-80% of voice queries (simple lookups, short commands, contextual follow-ups) stay on the edge path. The remaining 20-30% that require the 70B model go to cloud and still respond within the latency budget.

For the NeuTTS Air voice AI example running on Spheron, see the NeuTTS Air deployment guide. For the full voice AI GPU infrastructure breakdown, see the voice AI GPU guide.

When the Hybrid Pattern Breaks Down

Router overhead at low request volumes. At 10 requests/day, the complexity of maintaining a router, two inference endpoints, and monitoring is not worth the cost savings. Below a threshold of roughly 100 requests/day, just run everything on cloud or everything on edge, depending on your latency and privacy requirements.

Edge model quality gaps. If your 7B edge model fails on 40% of queries instead of 15%, the user experience degrades visibly: responses will vary in quality based on routing decisions that are invisible to the user. Run quality evaluations on your edge model against your actual query distribution before committing to a hybrid architecture.

Cloud escalation failures on network interruption. If the cloud tier becomes unavailable (network outage, instance interruption on spot), requests that route to cloud will fail. Implement a fallback that routes to edge-only on cloud failure, accepting lower quality rather than no response. This requires your edge model to handle all task types at degraded quality.
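
One way to sketch that fallback. The inference calls are stubbed as plain callables here; a real implementation would wrap the edge and cloud clients and catch the specific exception types those clients raise (the OpenAI client, for instance, raises its own connection error types):

```python
def infer_with_fallback(prompt: str, destination: str,
                        call_edge, call_cloud) -> str:
    """Route to cloud when asked, but degrade to edge-only
    if the cloud tier is unreachable."""
    if destination == "cloud":
        try:
            return call_cloud(prompt)
        except (ConnectionError, TimeoutError):
            # Cloud outage or spot interruption: accept lower
            # quality from the edge model rather than no response.
            pass
    return call_edge(prompt)
```

This pattern only works if the edge model can handle every task type, however poorly; tasks the edge model cannot attempt at all need a queue-and-retry path instead.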

Cold-start latency on cloud GPU. If you deprovision your cloud instance during low-traffic periods to save cost, the first escalated request after reprovisioning will see 60-120 seconds of cold-start latency. For workloads with variable but predictable daily traffic, keep the cloud instance warm during business hours and deprovision overnight.
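
A minimal sketch of the keep-warm decision; the 8:00-20:00 business window is an assumed example, and a real scheduler would drive provisioning API calls from this predicate on a timer:

```python
def should_keep_warm(hour: int,
                     business_start: int = 8,
                     business_end: int = 20) -> bool:
    """Keep the cloud instance provisioned during business hours;
    deprovision overnight to save on-demand spend.

    The first escalated request after an overnight deprovision
    still pays the 60-120s cold start.
    """
    return business_start <= hour < business_end
```

For traffic that is variable but unpredictable, this schedule-based approach breaks down, and the choice is between paying for idle warm capacity or accepting cold starts.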

Summary: Which Pattern Fits Your Workload

| Workload | Recommended Pattern | Reasoning |
|---|---|---|
| Real-time voice agent | Hybrid (edge ASR + cloud LLM) | Latency plus quality balance |
| Privacy-first enterprise | Edge-only | Data never leaves network |
| Batch document processing | Cloud-only | Cost at scale, no latency requirement |
| Variable-traffic API | Cloud on-demand (Spheron) | Burst handling, per-second billing |
| Always-on ambient AI | Edge-only | Cost would be prohibitive at cloud rates |
| Research / frontier models | Cloud-only | Only viable on H100/H200 scale |

Spheron provides the cloud GPU tier in hybrid inference architectures: on-demand H100 and A100 instances billed per second, with no reserved contracts required. It is well suited to burst escalation workloads where you pay only for actual usage.

Rent H100 → | Rent A100 → | View all pricing →
