Engineering

How to Build GPU Infrastructure for AI Agents: The 2026 Compute Playbook

Written by Mitrasish, Co-founder · Mar 8, 2026

Tags: AI Agents, GPU Infrastructure, Inference, VRAM, Latency, vLLM, Production AI, GPU Cloud

Training a 70B model requires predictable, sustained GPU utilization over 72 hours. An AI agent handling customer requests must respond in under 500ms, handle anywhere from 0 to 10,000 concurrent requests with no warning, and have the model already loaded in VRAM when the request arrives.

The GPU strategy for one is completely wrong for the other. This post covers the agent side: what compute you actually need, how to size it, and how to avoid paying for idle capacity between bursts.

If you're still working out the fundamentals of GPU capacity planning, read how to plan GPU capacity for AI deployment first. This post picks up where that one leaves off, specifically for the unique demands of agent workloads.

What Makes Agent Workloads Different from Standard LLM Serving

Agent infrastructure is not just "LLM serving but faster." The compute requirements are structurally different across several dimensions.

Bursty traffic patterns

Standard LLM APIs get relatively predictable traffic. Agent workloads don't. A workflow automation tool might get zero calls for an hour, then 500 in a minute when a scheduled job fires. Your infrastructure needs to handle that spike without over-provisioning 24/7 to meet an unpredictable peak.

Latency sensitivity

Users interacting with agents in real time notice the difference between 200ms and 2,000ms responses. Training jobs don't. This changes which GPU optimizations matter: TTFT (time to first token) matters far more than raw throughput. A GPU that achieves 3x the tokens-per-second at twice the TTFT is the wrong choice for a user-facing agent.

Always-resident models

Your agent's LLM must be loaded in VRAM before the request arrives; cold starts are not acceptable in a live agent loop. A 13B model in FP16 weighs approximately 26 GB. Loading it from a default EBS gp3 volume (125 MB/s baseline throughput) takes roughly 3-4 minutes, which is a broken experience for anyone waiting on a response. Provisioned gp3 volumes at 500+ MB/s can bring this down to under a minute, but provisioned throughput costs extra, and downloading from object storage (S3, GCS) without local caching is typically slower still. Local NVMe SSD can load the same model in under 30 seconds, but most cloud GPU instances rely on network-attached volumes. The upshot: you can't rely on serverless spin-up for latency-sensitive agents.
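
The load-time figures above are simple arithmetic: weight size divided by storage throughput. A quick sketch (the NVMe figure assumes ~3 GB/s sequential read, a conservative estimate):

```python
def load_time_seconds(model_size_gb: float, throughput_mb_s: float) -> float:
    """Time to read model weights from storage at a given sustained throughput."""
    return model_size_gb * 1000 / throughput_mb_s  # GB -> MB, then divide

# A 13B model in FP16 is ~26 GB of weights
for name, mbps in [("EBS gp3 baseline", 125), ("provisioned gp3", 500), ("local NVMe", 3000)]:
    t = load_time_seconds(26, mbps)
    print(f"{name:>18}: {t:6.0f} s (~{t / 60:.1f} min)")
```

At 125 MB/s the 26 GB model takes ~208 seconds, which is where the 3-4 minute figure comes from once you add filesystem and deserialization overhead.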

Tool-calling overhead

Each tool call (web search, code execution, database lookup) pauses the model, executes the tool, then re-enters inference. The model occupies VRAM throughout the pause. This increases your minimum VRAM requirements compared to a stateless single-turn inference setup.

Context accumulation

Multi-turn agent conversations grow the KV cache over time. A 10-turn conversation with a 70B model consumes significantly more VRAM than a single-turn query. You must size for your expected conversation length, not just your model size.

| Factor | Standard LLM Serving | AI Agent Workload |
| --- | --- | --- |
| Traffic pattern | Predictable, steady | Bursty, unpredictable |
| Latency requirement | < 2-5s acceptable | < 500ms often required |
| Model loading | Can cold-start | Must be VRAM-resident |
| VRAM growth | Static per request | Grows with conversation turns |
| Idle cost risk | Low | High if over-provisioned |

The 4 Agent Compute Patterns and What Each One Needs

Not all agents have the same requirements. Before choosing hardware, identify which pattern your agent fits.

Pattern 1: Single-agent, low-concurrency

Example: personal coding assistant, internal knowledge bot, single-user automation tool.

What it needs: the RTX 5090 (32GB GDDR7, approximately 1.79 TB/s memory bandwidth) handles up to ~36 concurrent 7B-model sessions before VRAM saturation at 4K context (GQA-based models such as Mistral 7B). The H100 PCIe (80GB HBM2e) handles far more: the same 7B workload can sustain over 100 concurrent sessions within VRAM at 4K context. For a lightweight single-agent deployment, either GPU is more than sufficient, and an on-demand instance is fine.

Cost tip: if your agent runs background tasks that can be checkpointed (conversation state saved to a database between turns), a spot instance works. If sessions are live and synchronous, stay on on-demand.

Pattern 2: Multi-agent orchestration

Example: coding agent + browser agent + tool agent running in parallel pipelines.

What it needs: H100 PCIe (80GB) with enough VRAM to hold multiple models simultaneously. Each model must be separately loaded and VRAM-resident. A 3-agent system with a 7B orchestrator + 13B coder + 7B reviewer needs approximately 54GB combined VRAM minimum, before accounting for KV cache. (The L40S at 48GB cannot fit all three models at FP16 precision; use the H100 PCIe or consider INT4/INT8 quantization if targeting the L40S.)
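
The 54GB figure is just the FP16 weight sizes summed; a fit check is a one-liner using the rough rule of ~2 GB per billion parameters in FP16:

```python
FP16_GB_PER_B_PARAMS = 2  # ~2 bytes per parameter in FP16

agents = {"orchestrator": 7, "coder": 13, "reviewer": 7}  # params in billions
weights_gb = sum(b * FP16_GB_PER_B_PARAMS for b in agents.values())
print(f"Combined weights: {weights_gb} GB")  # 54 GB

for gpu, vram_gb in [("L40S", 48), ("H100 PCIe", 80)]:
    verdict = "fits" if weights_gb <= vram_gb else "does not fit"
    print(f"{gpu} ({vram_gb}GB): {verdict} (before KV cache)")
```

The L40S fails before you even account for KV cache, which is why the pattern calls for the H100 PCIe or quantization.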

Key consideration: don't assume you can time-share a single GPU across models with warm-swap. Load times make this impractical for real-time agent loops.

Pattern 3: High-concurrency agent API

Example: customer service bot handling thousands of simultaneous users, high-traffic copilot API.

What it needs: H100 SXM cluster with InfiniBand, tensor parallelism across nodes, vLLM with continuous batching. At this scale you're essentially running a full production LLM inference service. The production GPU cloud architecture guide covers the reliability patterns that apply here.

Pattern 4: Reasoning-heavy agents

Example: DeepSeek R1-style multi-step reasoning, complex planning agents that generate hundreds of chain-of-thought tokens before producing output.

What it needs: H200 (141GB HBM3e, available at major cloud providers including AWS, GCP, and Azure, though often waitlisted or region-limited) or B200 where supply permits. Reasoning models generate significantly more tokens before outputting a result, which drives up both VRAM consumption (long KV caches) and inference time. On a standard H100 (80GB), you may see KV cache spill to host memory for very long reasoning chains, which degrades latency significantly. The H200's 141GB is the current standard for avoiding this at production scale.

VRAM Requirements for Agent Workloads: The Real Numbers

Walk through this calculation for your own stack before choosing hardware.

The basic formula:

Total VRAM = Model weights + (KV cache per session × max concurrent sessions) + overhead

KV cache per session depends on model size, context length, attention mechanism, and data type. A rough rule: for a modern GQA-based 7B model (such as Mistral 7B) at 4K context in FP16, each session uses approximately 0.5GB of KV cache. Older MHA-based 7B models use roughly 4x more KV cache. At 32K context on a GQA model, that becomes ~4GB per session.
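
The ~0.5GB figure follows from the standard KV cache formula: 2 (K and V) x layers x KV heads x head dimension x bytes per element x tokens. A sketch using Mistral 7B's published architecture (32 layers, 8 KV heads via GQA, head dim 128, FP16):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one session: K and V tensors across all layers."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1024**3

# Mistral 7B (GQA, 8 KV heads) at 4K context in FP16
print(f"GQA 7B @ 4K:  {kv_cache_gb(32, 8, 128, 4096):.2f} GB")   # ~0.5 GB
# An MHA-style 7B (32 KV heads) uses 4x more
print(f"MHA 7B @ 4K:  {kv_cache_gb(32, 32, 128, 4096):.2f} GB")  # ~2.0 GB
# Same GQA model at 32K context
print(f"GQA 7B @ 32K: {kv_cache_gb(32, 8, 128, 32768):.2f} GB")  # ~4.0 GB
```

The GQA-vs-MHA 4x gap and the 4K-vs-32K 8x gap both fall straight out of the formula, which is why attention mechanism and context length matter as much as parameter count.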

Model weight memory in FP16 (weights only; add 15-30% for serving overhead and KV cache):

  • 7B model: ~14GB
  • 13B model: ~26GB
  • 70B model: ~140GB

Example calculations:

| Scenario | Model | Max Sessions | Context Length | Total VRAM Needed | Recommended GPU |
| --- | --- | --- | --- | --- | --- |
| Simple Q&A agent | 7B | 20 | 4K | ~24GB | RTX 5090 |
| Customer service agent | 13B (GQA) | 50 | 8K | ~90GB | H200 |
| Complex reasoning agent | 70B | 10 | 32K | ~280GB | 4x H100 SXM or 3x H200 SXM |
| Orchestrator + 3 sub-agents | 7B + 3x7B | 10 each | 4K | ~80GB | 2x H100 PCIe |

Add 10-15% overhead for the inference runtime itself (vLLM, process memory, CUDA context). If your calculation comes out at 48GB, size for an 80GB GPU; you don't want to be at 95% VRAM utilization in production.
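
Putting the formula together, a minimal sizing helper; the 13B example assumes ~1.1GB of KV cache per session at 8K context for a GQA model, an illustrative figure rather than a measured one:

```python
def total_vram_gb(weights_gb: float, kv_per_session_gb: float,
                  max_sessions: int, overhead: float = 0.15) -> float:
    """Total VRAM = weights + (KV cache x sessions), plus runtime overhead."""
    return (weights_gb + kv_per_session_gb * max_sessions) * (1 + overhead)

# Customer service agent: 13B GQA model, 50 sessions, 8K context
need = total_vram_gb(weights_gb=26, kv_per_session_gb=1.1, max_sessions=50)
print(f"Need ~{need:.0f} GB")  # ~93 GB: beyond one 80GB H100, into H200 territory
```

Run the same function with your own model's KV-cache-per-session figure (from the formula above) before committing to hardware.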

For a deeper dive into model memory requirements and quantization tradeoffs, see GPU memory requirements for LLMs.

Latency Budgets: What "Fast Enough" Actually Means

Understanding where latency comes from lets you optimize in the right place. The full stack looks like this:

User request
    -> Network routing:            ~10-50ms  (not GPU-dependent)
    -> Token prefill (input):      ~50-200ms (GPU compute-dependent)
    -> Token decode (output):      ~100-500ms (GPU memory bandwidth-dependent)
    -> Network response:           ~10-50ms  (not GPU-dependent)
Total:                            170-800ms
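
The component ranges above can be sanity-checked against a target when budgeting a new agent; a sketch:

```python
# (component, best_ms, worst_ms) from the latency stack above
BUDGET = [
    ("network routing",  10,  50),
    ("token prefill",    50, 200),
    ("token decode",    100, 500),
    ("network response", 10,  50),
]

best = sum(lo for _, lo, _ in BUDGET)
worst = sum(hi for _, _, hi in BUDGET)
print(f"Best case: {best} ms, worst case: {worst} ms")  # 170 ms / 800 ms

target_ms = 500  # e.g. a voice agent's round-trip target
if worst > target_ms:
    # Decode dominates the worst case, so optimize memory bandwidth first
    print("Worst case exceeds target: attack the largest component (decode)")
```

Note that only the two middle components respond to GPU choice; the network terms are fixed by your deployment region and routing.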

Latency targets by agent type:

  • Voice agents: 300-500ms total round-trip, with the LLM component needing to complete in under 200ms. The sub-200ms RAG pipeline case study demonstrates this is achievable on bare metal H100s, achieving p99 latency of 190ms serving 2M queries/day.
  • Chat agents: under 2 seconds is acceptable to most users for a complete response. TTFT under 500ms is the key threshold.
  • Background/workflow agents: no real-time requirement. Optimize for throughput (tokens/second) rather than latency. This is where spot instances and batch processing make sense.

Where your GPU choice has the most impact:

  • Prefill latency: scales with raw compute (FLOPs). H100 > RTX 5090 for large prompts. If your agent receives long system prompts or tool outputs as input, this matters.
  • Decode latency: scales with memory bandwidth. This is where the RTX 5090's GDDR7 memory is competitive for smaller models where the weights fit comfortably.
  • TTFT (time to first token): the number your users actually feel. Optimize this first. It's dominated by prefill for long inputs and by model loading if you have cold starts.

Spot vs Dedicated vs Reserved: The Right Pricing Model for Agents

The right pricing model depends on your agent's user-facing requirements and traffic predictability.

Spot instances

Right for: background agents, batch reasoning tasks, non-user-facing workflows, tasks that can checkpoint state and resume after interruption.

Wrong for: any agent with a live user waiting for a response. Interruption means broken UX.

Spot instances can cut compute costs by 60-90%. If your agent architecture allows it (state saved externally, retryable at the task level), they're the right default for non-critical paths. The GPU cost optimization playbook covers the detailed strategy for managing spot instances in production, including bid strategies and fallback patterns.

Dedicated on-demand

Right for: production agent APIs, user-facing applications, any workload where a spot interruption = broken user experience.

Spheron aggregates capacity across multiple GPU providers, so you get on-demand access without committing to a single provider's availability constraints. If your first-choice provider is at capacity, you can provision from another.

Reserved instances

Only makes sense once you've run production traffic for 30+ days and measured a stable baseline load. Don't pre-commit to a reservation before you know your actual sustained utilization; agent traffic can be far more variable than you expect in the first weeks of a launch.

The burst strategy (what most production agent teams end up running):

Maintain a small dedicated instance (keeps models warm, handles baseline traffic) and provision additional spot instances during peak demand. Spheron makes this practical: you're not locked into one provider's spot pool, so you have real options when bursts hit.

Three Production Architecture Patterns

Architecture 1: Single-Node, Multi-Replica

Best for: small-to-medium agent APIs, up to ~100 concurrent users.

Setup: one H100 PCIe, vLLM with multiple model replicas loaded, NGINX round-robin load balancing across replicas.

Cost: predictable single line item. Easy to reason about. Right first choice for most teams before they have real traffic data.

Architecture 2: Multi-Provider Burst

Best for: applications with variable traffic and cost sensitivity.

Setup: one dedicated H100 always warm for baseline load, provision Spheron spot RTX 5090s during peak demand.

Key advantage: you pay for burst capacity only when you actually need it. Your always-on cost is a single instance; peak capacity scales without pre-commitment.

Architecture 3: Tiered by Agent Complexity

Best for: multi-agent orchestration systems with agents of different sizes.

Setup: RTX 4090s (24GB GDDR6X) or RTX 5090s (32GB GDDR7) for lightweight tool-routing agents and intent classifiers, H100s reserved for heavy reasoning and generation steps. The 4090 remains widely available and cost-effective for smaller models; the 5090 adds headroom if your routing model is larger or you need more concurrent sessions.

Key advantage: don't pay H100 prices to run a 7B routing model that handles 95% of your requests. Tier the compute to the task. This is the architecture that lets production multi-agent systems stay cost-efficient at scale.

Framework Choices for Agent Serving

Your GPU handles compute; your serving framework shapes how efficiently that compute is used.

| Framework | Best For | Not For |
| --- | --- | --- |
| vLLM | Production inference, continuous batching, high concurrency | Simple single-request setups |
| SGLang | Multi-step agent workloads, structured outputs, KV cache sharing across agent steps (RadixAttention) | Teams already invested in the vLLM ecosystem |
| TGI (Text Generation Inference) | Streaming responses, Hugging Face ecosystem | Fine-grained batching control; less configurable than vLLM for custom scheduling |
| Ollama | Local development and testing | Production (not designed for production load) |
| TensorRT-LLM | Maximum throughput on NVIDIA hardware | Flexibility; harder to customize |

For production agent APIs, vLLM with continuous batching remains the most widely-deployed open-source inference engine. SGLang has emerged as a strong alternative specifically for multi-step agentic workloads, offering RadixAttention for efficient KV cache reuse across agent steps. Set --max-num-seqs based on your target concurrent sessions; this is the primary lever for VRAM utilization management.
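
As an illustration, a vLLM OpenAI-compatible server launch for an agent endpoint might look like the following; the model name and flag values are placeholders to tune for your own workload:

```shell
# Serve a model with batching sized for ~50 concurrent agent sessions.
# --max-num-seqs caps simultaneously batched requests (the main VRAM lever);
# --gpu-memory-utilization leaves headroom rather than running at 100% VRAM;
# --max-model-len bounds per-session KV cache growth.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --max-num-seqs 50 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192
```

Orchestration frameworks then point at this endpoint as a standard OpenAI-compatible base URL.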

Orchestration layers (LangGraph, CrewAI, AutoGen, and similar frameworks) sit above your GPU stack. They handle agent logic, tool routing, and conversation management, but they call your inference endpoint over HTTP. Your GPU choice affects what that endpoint can handle; the orchestration framework is GPU-agnostic.

Real Cost Model: What Agent Infrastructure Actually Costs

Scenario: Customer service agent, 10,000 conversations per day, average 5 turns per conversation, approximately 200 tokens per turn (input + output combined).

Daily token volume:   10,000 x 5 x 200 = 10,000,000 tokens/day
Peak throughput need: 10x average burst  = ~1,000 tokens/sec

GPU sizing: a single H100 PCIe running vLLM with continuous batching handles approximately 1,500-3,500 tokens/second for a 13B model at moderate concurrency (10-50 concurrent sessions) in FP16, depending on batch size, sequence length, and output length distribution. With FP8 quantization or at higher concurrency, throughput can reach 5,000 tokens/second. One H100 is sufficient for this workload, with headroom.
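
The sizing arithmetic above, as a sketch (the H100 figure uses the conservative end of the throughput range quoted above):

```python
conversations_per_day = 10_000
turns_per_conversation = 5
tokens_per_turn = 200  # input + output combined

daily_tokens = conversations_per_day * turns_per_conversation * tokens_per_turn
avg_tps = daily_tokens / 86_400   # seconds in a day
peak_tps = avg_tps * 10           # assume a 10x burst over average

print(f"Daily tokens: {daily_tokens:,}")                      # 10,000,000
print(f"Average: {avg_tps:.0f} tok/s, peak: ~{peak_tps:.0f} tok/s")

h100_tps = 1_500  # conservative end of the 1,500-3,500 tok/s range
print(f"Single-H100 headroom at peak: {h100_tps / peak_tps:.1f}x")
```

Even at the conservative throughput estimate and a 10x burst multiplier, one H100 covers the peak with margin.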

Monthly compute cost: 1x H100 PCIe x hourly rate x 720 hours

Compare that to running the equivalent on AWS SageMaker or GCP Vertex AI; managed inference on hyperscalers typically carries a 5-10x markup over raw compute costs when you factor in both the higher instance pricing and the per-token API overhead on top. See how to avoid unexpected AWS GPU costs for a breakdown of where those costs accumulate.

For current H100 pricing on Spheron, check the GPU pricing page.

For running autonomous ML research agents overnight, see how to run Karpathy's autoresearch on a Spheron GPU VM.

Getting Started on Spheron in 4 Steps

  1. Choose your GPU using the formula above: model weights + (KV cache x concurrent sessions) + 15% overhead.
  2. Provision a bare metal or VM instance on Spheron to get full root access to your server. From there, install and configure vLLM, tune kernel parameters, and deploy it as a Docker container; you control the entire stack.
  3. Set --max-num-seqs in your vLLM config based on your expected concurrent sessions. This controls how many requests vLLM batches simultaneously.
  4. Monitor VRAM utilization and TTFT with nvidia-smi and your inference server's metrics endpoint. If sustained VRAM > 80%, scale up before you hit OOM errors under load.
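
For step 4, a small parser can turn nvidia-smi's CSV output into the 80% alert; a sketch (in production you would feed it the output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` via subprocess, and the sample values below are illustrative):

```python
def vram_utilization(csv_line: str) -> float:
    """Parse one 'used, total' line (MiB values, nounits CSV) into a fraction."""
    used, total = (float(x) for x in csv_line.split(","))
    return used / total

# Example line in the shape emitted by the query flags above
sample = "68400, 81559"  # ~68 GB used of an ~80 GB card
util = vram_utilization(sample)
print(f"VRAM utilization: {util:.0%}")
if util > 0.80:
    print("Above 80%: scale up before hitting OOM under load")
```

Polling this every few seconds alongside your inference server's TTFT metric gives you both halves of the monitoring loop described above.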

Agent workloads need flexible compute that scales with demand, not hyperscaler contracts designed for predictable batch jobs. Spheron gives your agent stack access to H100, H200, RTX 5090, and more via bare metal and VM instances. Pay only for what you actually use.

Explore GPU options ->

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.