Training a 70B model requires predictable, sustained GPU utilization over 72 hours. An AI agent handling customer requests needs to respond in under 500ms, handle anywhere from 0 to 10,000 concurrent requests with no warning, and have the model already loaded in VRAM when the request arrives.
The GPU strategy for one is completely wrong for the other. This post covers the agent side: what compute you actually need, how to size it, and how to avoid paying for idle capacity between bursts.
If you're still working out the fundamentals of GPU capacity planning, read how to plan GPU capacity for AI deployment first. This post picks up where that one leaves off, specifically for the unique demands of agent workloads.
What Makes Agent Workloads Different from Standard LLM Serving
Agent infrastructure is not just "LLM serving but faster." The compute requirements are structurally different across several dimensions.
Bursty traffic patterns
Standard LLM APIs get relatively predictable traffic. Agent workloads don't. A workflow automation tool might get zero calls for an hour then 500 in a minute when a scheduled job fires. Your infrastructure needs to handle that spike without over-provisioning 24/7 to meet an unpredictable peak.
Latency sensitivity
Users interacting with agents in real time notice the difference between 200ms and 2,000ms responses. Training jobs don't. This changes which GPU optimizations matter: TTFT (time to first token) matters far more than raw throughput. A GPU that achieves 3x the tokens-per-second at twice the TTFT is the wrong choice for a user-facing agent.
Always-resident models
Your agent's LLM must be loaded in VRAM before the request arrives; a multi-minute cold start in a live agent loop is a broken experience. A 13B model in FP16 weighs approximately 26 GB. Loading it from a default EBS gp3 volume (125 MB/s baseline throughput) takes roughly 3-4 minutes; provisioned gp3 volumes at 500+ MB/s can bring this down to under a minute, but provisioned throughput costs extra. Downloading from object storage (S3, GCS) without local caching is typically slower still. A local NVMe SSD can load the same model in under 30 seconds, but most cloud GPU instances rely on network-attached volumes. The upshot: you can't rely on serverless spin-up for latency-sensitive agents.
Tool-calling overhead
Each tool call (web search, code execution, database lookup) pauses the model, executes the tool, then re-enters inference. The model occupies VRAM throughout the pause. This increases your minimum VRAM requirements compared to a stateless single-turn inference setup.
Context accumulation
Multi-turn agent conversations grow the KV cache over time. A 10-turn conversation with a 70B model consumes significantly more VRAM than a single-turn query. You must size for your expected conversation length, not just your model size.
| Factor | Standard LLM Serving | AI Agent Workload |
|---|---|---|
| Traffic pattern | Predictable, steady | Bursty, unpredictable |
| Latency requirement | < 2-5s acceptable | < 500ms often required |
| Model loading | Can cold-start | Must be VRAM-resident |
| VRAM growth | Static per request | Grows with conversation turns |
| Idle cost risk | Low | High if over-provisioned |
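The context-accumulation point is easy to quantify. Here is a minimal sketch of how KV cache grows with conversation turns for a GQA model; the architecture numbers are illustrative (Llama-style 70B) and the tokens-per-turn figure is an assumption, so substitute your model's actual config:

```python
# Sketch: KV cache growth across conversation turns for a GQA model.
# Architecture numbers below are illustrative (Llama-style 70B config).
N_LAYERS = 80       # transformer layers
N_KV_HEADS = 8      # KV heads (GQA, not total attention heads)
HEAD_DIM = 128      # dimension per head
DTYPE_BYTES = 2     # FP16

def kv_bytes_per_token():
    # 2x for the separate K and V tensors, per layer
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * DTYPE_BYTES

def kv_gb_after_turns(turns, tokens_per_turn=400):
    """KV cache size once `turns` turns (input + output tokens) have accumulated."""
    total_tokens = turns * tokens_per_turn
    return total_tokens * kv_bytes_per_token() / 1024**3

kv_gb_after_turns(1)    # ~0.12 GB after one turn
kv_gb_after_turns(10)   # ~1.2 GB after ten turns -- 10x the single-turn footprint
```

The growth is linear in accumulated tokens, which is why sizing for model weights alone underestimates production VRAM needs for multi-turn agents.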
The 4 Agent Compute Patterns and What Each One Needs
Not all agents have the same requirements. Before choosing hardware, identify which pattern your agent fits.
Pattern 1: Single-agent, low-concurrency
Example: personal coding assistant, internal knowledge bot, single-user automation tool.
What it needs: the RTX 5090 (32GB GDDR7, approximately 1.79 TB/s memory bandwidth) handles up to ~36 concurrent 7B-model sessions before VRAM saturation at 4K context (GQA-based models such as Mistral 7B). The H100 PCIe (80GB HBM2e) handles far more: the same 7B workload can sustain over 100 concurrent sessions within VRAM at 4K context. For a lightweight single-agent deployment, either GPU is more than sufficient. An on-demand instance is fine.
Cost tip: if your agent runs background tasks that can be checkpointed (conversation state saved to a database between turns), a spot instance works. If sessions are live and synchronous, stay on on-demand.
Pattern 2: Multi-agent orchestration
Example: coding agent + browser agent + tool agent running in parallel pipelines.
What it needs: H100 PCIe (80GB) with enough VRAM to hold multiple models simultaneously. Each model must be separately loaded and VRAM-resident. A 3-agent system with a 7B orchestrator + 13B coder + 7B reviewer needs approximately 54GB combined VRAM minimum, before accounting for KV cache. (The L40S at 48GB cannot fit all three models at FP16 precision; use the H100 PCIe or consider INT4/INT8 quantization if targeting the L40S.)
Key consideration: don't assume you can time-share a single GPU across models with warm-swap. Load times make this impractical for real-time agent loops.
Pattern 3: High-concurrency agent API
Example: customer service bot handling thousands of simultaneous users, high-traffic copilot API.
What it needs: H100 SXM cluster with InfiniBand, tensor parallelism across nodes, vLLM with continuous batching. At this scale you're essentially running a full production LLM inference service. The production GPU cloud architecture guide covers the reliability patterns that apply here.
Pattern 4: Reasoning-heavy agents
Example: DeepSeek R1-style multi-step reasoning, complex planning agents that generate hundreds of chain-of-thought tokens before producing output.
What it needs: H200 (141GB HBM3e, available at major cloud providers including AWS, GCP, and Azure, though often waitlisted or region-limited) or B200 where supply permits. Reasoning models generate significantly more tokens before outputting a result, which drives up both VRAM consumption (long KV caches) and inference time. On a standard H100 (80GB), you may see KV cache spill to host memory for very long reasoning chains, which degrades latency significantly. The H200's 141GB is the current standard for avoiding this at production scale.
VRAM Requirements for Agent Workloads: The Real Numbers
Walk through this calculation for your own stack before choosing hardware.
The basic formula:
Total VRAM = Model weights + (KV cache per session × max concurrent sessions) + overhead

KV cache per session depends on model size, context length, attention mechanism, and data type. A rough rule: for a modern GQA-based 7B model (such as Mistral 7B) at 4K context in FP16, each session uses approximately 0.5GB of KV cache. Older MHA-based 7B models use roughly 4x more KV cache. At 32K context on a GQA model, that becomes ~4GB per session.
Model weight memory in FP16 (weights only; add 15-30% for serving overhead and KV cache):
- 7B model: ~14GB
- 13B model: ~26GB
- 70B model: ~140GB
Example calculations:
| Scenario | Model | Max Sessions | Context Length | Total VRAM Needed | Recommended GPU |
|---|---|---|---|---|---|
| Simple Q&A agent | 7B | 20 | 4K | ~24GB | RTX 5090 |
| Customer service agent | 13B (GQA) | 50 | 8K | ~90GB | H200 |
| Complex reasoning agent | 70B | 10 | 32K | ~280GB | 4x H100 SXM or 3x H200 SXM |
| Orchestrator + 3 sub-agents | 7B + 3x7B | 10 each | 4K | ~80GB | 2x H100 PCIe |
Add 10-15% overhead for the inference runtime itself (vLLM, process memory, CUDA context). If your calculation comes out at 48GB, size for an 80GB GPU; you don't want to be at 95% VRAM utilization in production.
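The formula above can be sketched in a few lines. The KV-per-session figures below are the rough rules of thumb from this post, not measured values, so treat the outputs as planning estimates:

```python
# Sketch of the VRAM sizing formula:
# Total VRAM = weights + (KV cache per session x sessions) + runtime overhead.
def total_vram_gb(model_params_b, kv_gb_per_session, max_sessions,
                  overhead_frac=0.15):
    """Estimate total VRAM in GB for an agent serving workload."""
    weights_gb = model_params_b * 2          # FP16: ~2 bytes per parameter
    subtotal = weights_gb + kv_gb_per_session * max_sessions
    return subtotal * (1 + overhead_frac)

# Simple Q&A agent: 7B model, 20 sessions at 4K context (~0.5 GB KV each)
total_vram_gb(7, 0.5, 20)     # ~27.6 GB -> fits a 32 GB RTX 5090
# Customer service agent: 13B (GQA), 50 sessions at 8K context (~1 GB KV each)
total_vram_gb(13, 1.0, 50)    # ~87.4 GB -> H200 territory
```

These estimates include the 15% runtime overhead, which is why they come out slightly above the weights-plus-KV subtotals in the table.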
For a deeper dive into model memory requirements and quantization tradeoffs, see GPU memory requirements for LLMs.
Latency Budgets: What "Fast Enough" Actually Means
Understanding where latency comes from lets you optimize in the right place. The full stack looks like this:
User request
-> Network routing: ~10-50ms (not GPU-dependent)
-> Token prefill (input): ~50-200ms (GPU compute-dependent)
-> Token decode (output): ~100-500ms (GPU memory bandwidth-dependent)
-> Network response: ~10-50ms (not GPU-dependent)
Total: ~170-800ms

Latency targets by agent type:
- Voice agents: 300-500ms total round-trip, with the LLM component needing to complete in under 200ms. The sub-200ms RAG pipeline case study demonstrates this is achievable on bare metal H100s, achieving p99 latency of 190ms serving 2M queries/day.
- Chat agents: under 2 seconds is acceptable to most users for a complete response. TTFT under 500ms is the key threshold.
- Background/workflow agents: no real-time requirement. Optimize for throughput (tokens/second) rather than latency. This is where spot instances and batch processing make sense.
Where your GPU choice has the most impact:
- Prefill latency: scales with raw compute (FLOPs). H100 > RTX 5090 for large prompts. If your agent receives long system prompts or tool outputs as input, this matters.
- Decode latency: scales with memory bandwidth. This is where the RTX 5090's GDDR7 memory is competitive for smaller models where the weights fit comfortably.
- TTFT (time to first token): the number your users actually feel. Optimize this first. It's dominated by prefill for long inputs and by model loading if you have cold starts.
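The stage breakdown above lends itself to a quick budget check. A minimal sketch, using the rough per-stage ranges from this post (not benchmarks of any specific GPU):

```python
# Sketch: check a per-request latency budget against the stage ranges above.
# Ranges are the rough figures from this post, in milliseconds.
STAGES_MS = {
    "network_in": (10, 50),
    "prefill": (50, 200),     # compute-bound; scales with input length
    "decode": (100, 500),     # memory-bandwidth-bound; scales with output length
    "network_out": (10, 50),
}

def total_range_ms():
    """Best-case and worst-case end-to-end latency across all stages."""
    lo = sum(r[0] for r in STAGES_MS.values())
    hi = sum(r[1] for r in STAGES_MS.values())
    return lo, hi

def fits_budget(budget_ms, worst_case=True):
    """True if the budget covers the summed stage latencies."""
    lo, hi = total_range_ms()
    return (hi if worst_case else lo) <= budget_ms

total_range_ms()     # (170, 800)
fits_budget(500)     # False: worst case blows a 500ms voice-agent budget
fits_budget(2000)    # True: comfortable for a chat agent
```

A voice agent that must land under 500ms can't rely on worst-case stages fitting; it needs to attack the dominant term (usually decode) rather than shave network overhead.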
Spot vs Dedicated vs Reserved: The Right Pricing Model for Agents
The right pricing model depends on your agent's user-facing requirements and traffic predictability.
Spot instances
Right for: background agents, batch reasoning tasks, non-user-facing workflows, tasks that can checkpoint state and resume after interruption.
Wrong for: any agent with a live user waiting for a response. Interruption means broken UX.
Spot instances can cut compute costs by 60-90%. If your agent architecture allows it (state saved externally, retryable at the task level), they're the right default for non-critical paths. The GPU cost optimization playbook covers the detailed strategy for managing spot instances in production, including bid strategies and fallback patterns.
Dedicated on-demand
Right for: production agent APIs, user-facing applications, any workload where a spot interruption = broken user experience.
Spheron aggregates capacity across multiple GPU providers, so you get on-demand access without committing to a single provider's availability constraints. If your first-choice provider is at capacity, you can provision from another.
Reserved instances
Only makes sense once you've run production traffic for 30+ days and measured a stable baseline load. Don't pre-commit to a reservation before you know your actual sustained utilization; agent traffic can be far more variable than you expect in the first weeks of a launch.
The burst strategy (what most production agent teams end up running):
Maintain a small dedicated instance (keeps models warm, handles baseline traffic) and provision additional spot instances during peak demand. Spheron makes this practical: you're not locked into one provider's spot pool, so you have real options when bursts hit.
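To see why the burst strategy wins, compare it against provisioning the full peak 24/7. A sketch with placeholder hourly rates (the figures below are assumptions for illustration, not Spheron's actual pricing):

```python
# Sketch: monthly cost of baseline-plus-burst vs. always-on peak capacity.
# Hourly rates are placeholder assumptions, not real pricing.
DEDICATED_RATE = 2.50   # $/hr, always-on dedicated instance (assumed)
SPOT_RATE = 0.80        # $/hr, burst spot capacity (assumed)
HOURS_PER_MONTH = 720

def burst_cost(peak_instances, burst_hours):
    """One dedicated instance always on; spot covers the rest during bursts."""
    baseline = DEDICATED_RATE * HOURS_PER_MONTH
    burst = SPOT_RATE * (peak_instances - 1) * burst_hours
    return baseline + burst

def always_on_cost(peak_instances):
    """Provision the full peak on dedicated instances, 24/7."""
    return DEDICATED_RATE * peak_instances * HOURS_PER_MONTH

# Peak of 4 instances, bursting ~60 hours per month
burst_cost(4, 60)       # ~$1,944/month with the assumed rates
always_on_cost(4)       # ~$7,200/month with the assumed rates
```

The gap widens the spikier your traffic is: the burst term scales with hours actually bursting, while always-on peak capacity bills every hour regardless.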
Three Production Architecture Patterns
Architecture 1: Single-Node, Multi-Replica
Best for: small-to-medium agent APIs, up to ~100 concurrent users.
Setup: one H100 PCIe, vLLM with multiple model replicas loaded, NGINX round-robin load balancing across replicas.
Cost: predictable single line item. Easy to reason about. Right first choice for most teams before they have real traffic data.
Architecture 2: Multi-Provider Burst
Best for: applications with variable traffic and cost sensitivity.
Setup: one dedicated H100 always warm for baseline load, provision Spheron spot RTX 5090s during peak demand.
Key advantage: you pay for burst capacity only when you actually need it. Your always-on cost is a single instance; peak capacity scales without pre-commitment.
Architecture 3: Tiered by Agent Complexity
Best for: multi-agent orchestration systems with agents of different sizes.
Setup: RTX 4090s (24GB GDDR6X) or RTX 5090s (32GB GDDR7) for lightweight tool-routing agents and intent classifiers, H100s reserved for heavy reasoning and generation steps. The 4090 remains widely available and cost-effective for smaller models; the 5090 adds headroom if your routing model is larger or you need more concurrent sessions.
Key advantage: don't pay H100 prices to run a 7B routing model that handles 95% of your requests. Tier the compute to the task. This is the architecture that lets production multi-agent systems stay cost-efficient at scale.
Framework Choices for Agent Serving
Your GPU handles compute; your serving framework shapes how efficiently that compute is used.
| Framework | Best For | Not For |
|---|---|---|
| vLLM | Production inference, continuous batching, high concurrency | Simple single-request setups |
| SGLang | Multi-step agent workloads, structured outputs, KV cache sharing across agent steps (RadixAttention) | Teams already invested in the vLLM ecosystem |
| TGI (Text Generation Inference) | Streaming responses, Hugging Face ecosystem | Fine-grained batching control; less configurable than vLLM for custom scheduling |
| Ollama | Local development and testing | Production (not designed for production load) |
| TensorRT-LLM | Maximum throughput on NVIDIA hardware | Flexibility; harder to customize |
For production agent APIs, vLLM with continuous batching remains the most widely-deployed open-source inference engine. SGLang has emerged as a strong alternative specifically for multi-step agentic workloads, offering RadixAttention for efficient KV cache reuse across agent steps. Set `--max-num-seqs` based on your target concurrent sessions; this is the primary lever for VRAM utilization management.
Orchestration layers (LangGraph, CrewAI, AutoGen, and similar frameworks) sit above your GPU stack. They handle agent logic, tool routing, and conversation management, but they call your inference endpoint over HTTP. Your GPU choice affects what that endpoint can handle; the orchestration framework is GPU-agnostic.
Real Cost Model: What Agent Infrastructure Actually Costs
Scenario: Customer service agent, 10,000 conversations per day, average 5 turns per conversation, approximately 200 tokens per turn (input + output combined).
Daily token volume: 10,000 x 5 x 200 = 10,000,000 tokens/day
Peak throughput need: 10x average burst = ~1,000 tokens/sec

GPU sizing: a single H100 PCIe running vLLM with continuous batching handles approximately 1,500-3,500 tokens/second for a 13B model at moderate concurrency (10-50 concurrent sessions) in FP16, depending on batch size, sequence length, and output length distribution. With FP8 quantization or at higher concurrency, throughput can reach 5,000 tokens/second. One H100 is sufficient for this workload, with headroom.
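The token-volume arithmetic above can be reproduced in a few lines (the 10x burst multiplier is the assumption from this scenario, not a universal rule):

```python
# Sketch of the throughput sizing arithmetic for the scenario above.
conversations_per_day = 10_000
turns_per_conv = 5
tokens_per_turn = 200          # input + output combined
burst_multiplier = 10          # assumed peak-to-average ratio

daily_tokens = conversations_per_day * turns_per_conv * tokens_per_turn
avg_tps = daily_tokens / 86_400            # average tokens/second over the day
peak_tps = avg_tps * burst_multiplier

daily_tokens       # 10,000,000 tokens/day
round(avg_tps)     # ~116 tokens/sec average
round(peak_tps)    # ~1,157 tokens/sec peak -- inside one H100's envelope
```

The takeaway: the average rate is tiny; it's the burst multiplier that determines the hardware. Measure your real peak-to-average ratio before sizing.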
Monthly compute cost: 1x H100 PCIe x hourly rate x 720 hours

Compare that to running the equivalent on AWS SageMaker or GCP Vertex AI; managed inference on hyperscalers typically carries a 5-10x markup over raw compute costs when you factor in both the higher instance pricing and the per-token API overhead on top. See how to avoid unexpected AWS GPU costs for a breakdown of where those costs accumulate.
For current H100 pricing on Spheron, check the GPU pricing page.
For running autonomous ML research agents overnight, see how to run Karpathy's autoresearch on a Spheron GPU VM.
Getting Started on Spheron in 4 Steps
- Choose your GPU using the formula above: model weights + (KV cache x concurrent sessions) + 15% overhead.
- Provision a bare metal or VM instance on Spheron to get full root access to your server. From there, install and configure vLLM, tune kernel parameters, and deploy it as a Docker container; you control the entire stack.
- Set `--max-num-seqs` in your vLLM config based on your expected concurrent sessions. This controls how many requests vLLM batches simultaneously.
- Monitor VRAM utilization and TTFT with `nvidia-smi` and your inference server's metrics endpoint. If sustained VRAM utilization exceeds 80%, scale up before you hit OOM errors under load.
Agent workloads need flexible compute that scales with demand, not hyperscaler contracts designed for predictable batch jobs. Spheron gives your agent stack access to H100, H200, RTX 5090, and more via bare metal and VM instances. Pay only for what you actually use.
