Training a 70B model requires predictable, sustained GPU utilization over 72 hours. An AI agent handling customer requests needs to respond in under 500ms, handle anywhere from 0 to 10,000 concurrent requests with no warning, and have the model already loaded in VRAM when the request arrives.
The GPU strategy for one is completely wrong for the other. This post covers the agent side: what compute you actually need, how to size it, and how to avoid paying for idle capacity between bursts.
If you're still working out the fundamentals of GPU capacity planning, read how to plan GPU capacity for AI deployment first. This post picks up where that one leaves off, specifically for the unique demands of agent workloads.
What Makes Agent Workloads Different from Standard LLM Serving
Agent infrastructure is not just "LLM serving but faster." The compute requirements are structurally different across several dimensions.
Bursty traffic patterns
Standard LLM APIs get relatively predictable traffic. Agent workloads don't. A workflow automation tool might get zero calls for an hour then 500 in a minute when a scheduled job fires. Your infrastructure needs to handle that spike without over-provisioning 24/7 to meet an unpredictable peak.
Latency sensitivity
Users interacting with agents in real time notice the difference between 200ms and 2,000ms responses. Training jobs don't. This changes which GPU optimizations matter: TTFT (time to first token) matters far more than raw throughput. A GPU that achieves 3x the tokens-per-second at twice the TTFT is the wrong choice for a user-facing agent.
Always-resident models
Your agent's LLM must be loaded in VRAM before the request arrives; a multi-minute cold start in a live agent loop is a broken experience. A 13B model in FP16 weighs approximately 26 GB. Loading it from a default EBS gp3 volume (125 MB/s baseline throughput) takes roughly 3-4 minutes; provisioned gp3 volumes at 500+ MB/s can bring this down to under a minute, but provisioned throughput costs extra. Downloading from object storage (S3, GCS) without local caching is typically slower still. A local NVMe SSD can load the same model in under 30 seconds, but most cloud GPU instances rely on network-attached volumes. The upshot: you can't rely on serverless spin-up for latency-sensitive agents.
Tool-calling overhead
Each tool call (web search, code execution, database lookup) pauses the model, executes the tool, then re-enters inference. The model occupies VRAM throughout the pause. This increases your minimum VRAM requirements compared to a stateless single-turn inference setup.
Context accumulation
Multi-turn agent conversations grow the KV cache over time. A 10-turn conversation with a 70B model consumes significantly more VRAM than a single-turn query. You must size for your expected conversation length, not just your model size.
| Factor | Standard LLM Serving | AI Agent Workload |
|---|---|---|
| Traffic pattern | Predictable, steady | Bursty, unpredictable |
| Latency requirement | < 2-5s acceptable | < 500ms often required |
| Model loading | Can cold-start | Must be VRAM-resident |
| VRAM growth | Static per request | Grows with conversation turns |
| Idle cost risk | Low | High if over-provisioned |
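The context-accumulation point is easy to quantify. Here is a minimal sketch of how KV cache grows with conversation turns for a GQA model; the architecture numbers are illustrative (Llama-style 70B) and the tokens-per-turn figure is an assumption, so substitute your model's actual config:

```python
# Sketch: KV cache growth across conversation turns for a GQA model.
# Architecture numbers below are illustrative (Llama-style 70B config).
N_LAYERS = 80       # transformer layers
N_KV_HEADS = 8      # KV heads (GQA, not total attention heads)
HEAD_DIM = 128      # dimension per head
DTYPE_BYTES = 2     # FP16

def kv_bytes_per_token():
    # 2x for the separate K and V tensors, per layer
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * DTYPE_BYTES

def kv_gb_after_turns(turns, tokens_per_turn=400):
    """KV cache size once `turns` turns (input + output tokens) have accumulated."""
    total_tokens = turns * tokens_per_turn
    return total_tokens * kv_bytes_per_token() / 1024**3

kv_gb_after_turns(1)    # ~0.12 GB after one turn
kv_gb_after_turns(10)   # ~1.2 GB after ten turns -- 10x the single-turn footprint
```

The growth is linear in accumulated tokens, which is why sizing for model weights alone underestimates production VRAM needs for multi-turn agents.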
The 4 Agent Compute Patterns and What Each One Needs
Not all agents have the same requirements. Before choosing hardware, identify which pattern your agent fits.
Pattern 1: Single-agent, low-concurrency
Example: personal coding assistant, internal knowledge bot, single-user automation tool.
What it needs: the RTX 5090 (32GB GDDR7, approximately 1.79 TB/s memory bandwidth) handles up to ~36 concurrent 7B-model sessions before VRAM saturation at 4K context (GQA-based models such as Mistral 7B). The H100 PCIe (80GB HBM2e) handles far more: the same 7B workload can sustain over 100 concurrent sessions within VRAM at 4K context. For a lightweight single-agent deployment, either GPU is more than sufficient. An on-demand instance is fine.
Cost tip: if your agent runs background tasks that can be checkpointed (conversation state saved to a database between turns), a spot instance works. If sessions are live and synchronous, stay on on-demand.
Pattern 2: Multi-agent orchestration
Example: coding agent + browser agent + tool agent running in parallel pipelines.
What it needs: H100 PCIe (80GB) with enough VRAM to hold multiple models simultaneously. Each model must be separately loaded and VRAM-resident. A 3-agent system with a 7B orchestrator + 13B coder + 7B reviewer needs approximately 54GB combined VRAM minimum, before accounting for KV cache. (The L40S at 48GB cannot fit all three models at FP16 precision; use the H100 PCIe or consider INT4/INT8 quantization if targeting the L40S.)
Key consideration: don't assume you can time-share a single GPU across models with warm-swap. Load times make this impractical for real-time agent loops.
Pattern 3: High-concurrency agent API
Example: customer service bot handling thousands of simultaneous users, high-traffic copilot API.
What it needs: H100 SXM cluster with InfiniBand, tensor parallelism across nodes, vLLM with continuous batching. At this scale you're essentially running a full production LLM inference service. The production GPU cloud architecture guide covers the reliability patterns that apply here.
Pattern 4: Reasoning-heavy agents
Example: DeepSeek R1-style multi-step reasoning, complex planning agents that generate hundreds of chain-of-thought tokens before producing output.
What it needs: H200 (141GB HBM3e, available at major cloud providers including AWS, GCP, and Azure, though often waitlisted or region-limited) or B200 where supply permits. Reasoning models generate significantly more tokens before outputting a result, which drives up both VRAM consumption (long KV caches) and inference time. On a standard H100 (80GB), you may see KV cache spill to host memory for very long reasoning chains, which degrades latency significantly. The H200's 141GB is the current standard for avoiding this at production scale.
VRAM Requirements for Agent Workloads: The Real Numbers
Walk through this calculation for your own stack before choosing hardware.
The basic formula:
Total VRAM = Model weights + (KV cache per session × max concurrent sessions) + overhead

KV cache per session depends on model size, context length, attention mechanism, and data type. A rough rule: for a modern GQA-based 7B model (such as Mistral 7B) at 4K context in FP16, each session uses approximately 0.5GB of KV cache. Older MHA-based 7B models use roughly 4x more KV cache. At 32K context on a GQA model, that becomes ~4GB per session.
Model weight memory in FP16 (weights only; add 15-30% for serving overhead and KV cache):
- 7B model: ~14GB
- 13B model: ~26GB
- 70B model: ~140GB
Example calculations:
| Scenario | Model | Max Sessions | Context Length | Total VRAM Needed | Recommended GPU |
|---|---|---|---|---|---|
| Simple Q&A agent | 7B | 20 | 4K | ~24GB | RTX 5090 |
| Customer service agent | 13B (GQA) | 50 | 8K | ~90GB | H200 |
| Complex reasoning agent | 70B | 10 | 32K | ~280GB | 4x H100 SXM or 3x H200 SXM |
| Orchestrator + 3 sub-agents | 7B + 3x7B | 10 each | 4K | ~80GB | 2x H100 PCIe |
Add 10-15% overhead for the inference runtime itself (vLLM, process memory, CUDA context). If your calculation comes out at 48GB, size for an 80GB GPU; you don't want to be at 95% VRAM utilization in production.
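The formula above can be sketched in a few lines. The KV-per-session figures below are the rough rules of thumb from this post, not measured values, so treat the outputs as planning estimates:

```python
# Sketch of the VRAM sizing formula:
# Total VRAM = weights + (KV cache per session x sessions) + runtime overhead.
def total_vram_gb(model_params_b, kv_gb_per_session, max_sessions,
                  overhead_frac=0.15):
    """Estimate total VRAM in GB for an agent serving workload."""
    weights_gb = model_params_b * 2          # FP16: ~2 bytes per parameter
    subtotal = weights_gb + kv_gb_per_session * max_sessions
    return subtotal * (1 + overhead_frac)

# Simple Q&A agent: 7B model, 20 sessions at 4K context (~0.5 GB KV each)
total_vram_gb(7, 0.5, 20)     # ~27.6 GB -> fits a 32 GB RTX 5090
# Customer service agent: 13B (GQA), 50 sessions at 8K context (~1 GB KV each)
total_vram_gb(13, 1.0, 50)    # ~87.4 GB -> H200 territory
```

These estimates include the 15% runtime overhead, which is why they come out slightly above the weights-plus-KV subtotals in the table.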
For a deeper dive into model memory requirements and quantization tradeoffs, see GPU memory requirements for LLMs.
Latency Budgets: What "Fast Enough" Actually Means
Understanding where latency comes from lets you optimize in the right place. The full stack looks like this:
User request
-> Network routing: ~10-50ms (not GPU-dependent)
-> Token prefill (input): ~50-200ms (GPU compute-dependent)
-> Token decode (output): ~100-500ms (GPU memory bandwidth-dependent)
-> Network response: ~10-50ms (not GPU-dependent)
Total: ~170-800ms

Latency targets by agent type:
- Voice agents: 300-500ms total round-trip, with the LLM component needing to complete in under 200ms. The sub-200ms RAG pipeline case study demonstrates this is achievable on bare metal H100s, achieving p99 latency of 190ms serving 2M queries/day.
- Chat agents: under 2 seconds is acceptable to most users for a complete response. TTFT under 500ms is the key threshold.
- Background/workflow agents: no real-time requirement. Optimize for throughput (tokens/second) rather than latency. This is where spot instances and batch processing make sense.
Where your GPU choice has the most impact:
- Prefill latency: scales with raw compute (FLOPs). H100 > RTX 5090 for large prompts. If your agent receives long system prompts or tool outputs as input, this matters.
- Decode latency: scales with memory bandwidth. This is where the RTX 5090's GDDR7 memory is competitive for smaller models where the weights fit comfortably.
- TTFT (time to first token): the number your users actually feel. Optimize this first. It's dominated by prefill for long inputs and by model loading if you have cold starts.
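The stage breakdown above lends itself to a quick budget check. A minimal sketch, using the rough per-stage ranges from this post (not benchmarks of any specific GPU):

```python
# Sketch: check a per-request latency budget against the stage ranges above.
# Ranges are the rough figures from this post, in milliseconds.
STAGES_MS = {
    "network_in": (10, 50),
    "prefill": (50, 200),     # compute-bound; scales with input length
    "decode": (100, 500),     # memory-bandwidth-bound; scales with output length
    "network_out": (10, 50),
}

def total_range_ms():
    """Best-case and worst-case end-to-end latency across all stages."""
    lo = sum(r[0] for r in STAGES_MS.values())
    hi = sum(r[1] for r in STAGES_MS.values())
    return lo, hi

def fits_budget(budget_ms, worst_case=True):
    """True if the budget covers the summed stage latencies."""
    lo, hi = total_range_ms()
    return (hi if worst_case else lo) <= budget_ms

total_range_ms()     # (170, 800)
fits_budget(500)     # False: worst case blows a 500ms voice-agent budget
fits_budget(2000)    # True: comfortable for a chat agent
```

A voice agent that must land under 500ms can't rely on worst-case stages fitting; it needs to attack the dominant term (usually decode) rather than shave network overhead.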
Spot vs Dedicated vs Reserved: The Right Pricing Model for Agents
The right pricing model depends on your agent's user-facing requirements and traffic predictability.
Spot instances
Right for: background agents, batch reasoning tasks, non-user-facing workflows, tasks that can checkpoint state and resume after interruption.
Wrong for: any agent with a live user waiting for a response. Interruption means broken UX.
Spot instances can cut compute costs by 60-90%. If your agent architecture allows it (state saved externally, retryable at the task level), they're the right default for non-critical paths. The GPU cost optimization playbook covers the detailed strategy for managing spot instances in production, including bid strategies and fallback patterns.
Dedicated on-demand
Right for: production agent APIs, user-facing applications, any workload where a spot interruption = broken user experience.
Spheron aggregates capacity across multiple GPU providers, so you get on-demand access without committing to a single provider's availability constraints. If your first-choice provider is at capacity, you can provision from another.
Reserved instances
Only makes sense once you've run production traffic for 30+ days and measured a stable baseline load. Don't pre-commit to a reservation before you know your actual sustained utilization; agent traffic can be far more variable than you expect in the first weeks of a launch.
The burst strategy (what most production agent teams end up running):
Maintain a small dedicated instance (keeps models warm, handles baseline traffic) and provision additional spot instances during peak demand. Spheron makes this practical: you're not locked into one provider's spot pool, so you have real options when bursts hit.
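To see why the burst strategy wins, compare it against provisioning the full peak 24/7. A sketch with placeholder hourly rates (the figures below are assumptions for illustration, not Spheron's actual pricing):

```python
# Sketch: monthly cost of baseline-plus-burst vs. always-on peak capacity.
# Hourly rates are placeholder assumptions, not real pricing.
DEDICATED_RATE = 2.50   # $/hr, always-on dedicated instance (assumed)
SPOT_RATE = 0.80        # $/hr, burst spot capacity (assumed)
HOURS_PER_MONTH = 720

def burst_cost(peak_instances, burst_hours):
    """One dedicated instance always on; spot covers the rest during bursts."""
    baseline = DEDICATED_RATE * HOURS_PER_MONTH
    burst = SPOT_RATE * (peak_instances - 1) * burst_hours
    return baseline + burst

def always_on_cost(peak_instances):
    """Provision the full peak on dedicated instances, 24/7."""
    return DEDICATED_RATE * peak_instances * HOURS_PER_MONTH

# Peak of 4 instances, bursting ~60 hours per month
burst_cost(4, 60)       # ~$1,944/month with the assumed rates
always_on_cost(4)       # ~$7,200/month with the assumed rates
```

The gap widens the spikier your traffic is: the burst term scales with hours actually bursting, while always-on peak capacity bills every hour regardless.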
Three Production Architecture Patterns
Architecture 1: Single-Node, Multi-Replica
Best for: small-to-medium agent APIs, up to ~100 concurrent users.
Setup: one H100 PCIe, vLLM with multiple model replicas loaded, NGINX round-robin load balancing across replicas.
Cost: predictable single line item. Easy to reason about. Right first choice for most teams before they have real traffic data.
Architecture 2: Multi-Provider Burst
Best for: applications with variable traffic and cost sensitivity.
Setup: one dedicated H100 always warm for baseline load, provision Spheron spot RTX 5090s during peak demand.
Key advantage: you pay for burst capacity only when you actually need it. Your always-on cost is a single instance; peak capacity scales without pre-commitment.
Architecture 3: Tiered by Agent Complexity
Best for: multi-agent orchestration systems with agents of different sizes.
Setup: RTX 4090s (24GB GDDR6X) or RTX 5090s (32GB GDDR7) for lightweight tool-routing agents and intent classifiers, H100s reserved for heavy reasoning and generation steps. The 4090 remains widely available and cost-effective for smaller models; the 5090 adds headroom if your routing model is larger or you need more concurrent sessions.
Key advantage: don't pay H100 prices to run a 7B routing model that handles 95% of your requests. Tier the compute to the task. This is the architecture that lets production multi-agent systems stay cost-efficient at scale.
Framework Choices for Agent Serving
Your GPU handles compute; your serving framework shapes how efficiently that compute is used.
| Framework | Best For | Not For |
|---|---|---|
| vLLM | Production inference, continuous batching, high concurrency | Simple single-request setups |
| SGLang | Multi-step agent workloads, structured outputs, KV cache sharing across agent steps (RadixAttention) | Teams already invested in the vLLM ecosystem |
| TGI (Text Generation Inference) | Streaming responses, Hugging Face ecosystem | Fine-grained batching control; less configurable than vLLM for custom scheduling |
| Ollama | Local development and testing | Production (not designed for production load) |
| TensorRT-LLM | Maximum throughput on NVIDIA hardware | Flexibility; harder to customize |
For production agent APIs, vLLM with continuous batching remains the most widely-deployed open-source inference engine. SGLang has emerged as a strong alternative specifically for multi-step agentic workloads, offering RadixAttention for efficient KV cache reuse across agent steps. Set `--max-num-seqs` based on your target concurrent sessions; this is the primary lever for VRAM utilization management.
Orchestration layers (LangGraph, CrewAI, AutoGen, and similar frameworks) sit above your GPU stack. They handle agent logic, tool routing, and conversation management, but they call your inference endpoint over HTTP. Your GPU choice affects what that endpoint can handle; the orchestration framework is GPU-agnostic.
Real Cost Model: What Agent Infrastructure Actually Costs
Scenario: Customer service agent, 10,000 conversations per day, average 5 turns per conversation, approximately 200 tokens per turn (input + output combined).
Daily token volume: 10,000 x 5 x 200 = 10,000,000 tokens/day
Peak throughput need: 10x average burst = ~1,000 tokens/sec

GPU sizing: a single H100 PCIe running vLLM with continuous batching handles approximately 1,500-3,500 tokens/second for a 13B model at moderate concurrency (10-50 concurrent sessions) in FP16, depending on batch size, sequence length, and output length distribution. With FP8 quantization or at higher concurrency, throughput can reach 5,000 tokens/second. One H100 is sufficient for this workload, with headroom.
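The token-volume arithmetic above can be reproduced in a few lines (the 10x burst multiplier is the assumption from this scenario, not a universal rule):

```python
# Sketch of the throughput sizing arithmetic for the scenario above.
conversations_per_day = 10_000
turns_per_conv = 5
tokens_per_turn = 200          # input + output combined
burst_multiplier = 10          # assumed peak-to-average ratio

daily_tokens = conversations_per_day * turns_per_conv * tokens_per_turn
avg_tps = daily_tokens / 86_400            # average tokens/second over the day
peak_tps = avg_tps * burst_multiplier

daily_tokens       # 10,000,000 tokens/day
round(avg_tps)     # ~116 tokens/sec average
round(peak_tps)    # ~1,157 tokens/sec peak -- inside one H100's envelope
```

The takeaway: the average rate is tiny; it's the burst multiplier that determines the hardware. Measure your real peak-to-average ratio before sizing.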
Monthly compute cost: 1x H100 PCIe x hourly rate x 720 hours

Compare that to running the equivalent on AWS SageMaker or GCP Vertex AI; managed inference on hyperscalers typically carries a 5-10x markup over raw compute costs when you factor in both the higher instance pricing and the per-token API overhead on top. See how to avoid unexpected AWS GPU costs for a breakdown of where those costs accumulate.
For current H100 pricing on Spheron, check the GPU pricing page.
For running autonomous ML research agents overnight, see how to run Karpathy's autoresearch on a Spheron GPU VM.
Getting Started on Spheron in 4 Steps
- Choose your GPU using the formula above: model weights + (KV cache x concurrent sessions) + 15% overhead.
- Provision a bare metal or VM instance on Spheron to get full root access to your server. From there, install and configure vLLM, tune kernel parameters, and deploy it as a Docker container; you control the entire stack.
- Set `--max-num-seqs` in your vLLM config based on your expected concurrent sessions. This controls how many requests vLLM batches simultaneously.
- Monitor VRAM utilization and TTFT with `nvidia-smi` and your inference server's metrics endpoint. If sustained VRAM utilization exceeds 80%, scale up before you hit OOM errors under load.
Agent workloads need flexible compute that scales with demand, not hyperscaler contracts designed for predictable batch jobs. Spheron gives your agent stack access to H100, H200, RTX 5090, and more via bare metal and VM instances. Pay only for what you actually use.
