One agent calling tools on a single GPU is a demo. A hundred agents handling real user traffic, with queues, retries, and cost constraints, is a system. Most teams nail the single-agent case and then get surprised by what breaks when they try to scale it.
This guide covers the infrastructure decisions that matter at scale: GPU right-sizing for different agent tiers, MCP orchestration patterns, autoscaling triggers, and cost modeling from 1,000 to 100,000 concurrent agent interactions.
For the GPU fundamentals (VRAM sizing, TTFT budgets, KV cache math), see GPU Infrastructure for AI Agents: The 2026 Compute Playbook. This post starts where that one ends.
Why AI Agent Fleets Are the Fastest-Growing GPU Workload in 2026
Training runs get bigger headlines, but agent inference is what's actually filling GPU capacity right now. A training job runs for days, then finishes. An agent fleet runs continuously, spikes unpredictably, and needs to respond in under a second.
The compute profile is different in every way that matters:
- Bursty, not steady. Agent traffic follows user behavior. A customer service agent fleet sees 10x traffic during business hours and near-zero overnight. Training clusters run flat.
- Latency-sensitive. Users waiting for an agent response feel 2-second latency. Batch training doesn't care about per-step time.
- State-dependent. Multi-turn agent conversations accumulate KV cache. Each additional turn costs more VRAM than the last.
- Tool-heavy. MCP-connected agents make multiple tool calls per task. Each tool call is a round-trip inference request. At 5 tool calls per interaction and 10,000 concurrent users, you can have on the order of 50,000 inference requests in flight.
That last point is why GPU right-sizing for agent fleets is harder than it looks. You're not sizing for one model at one concurrency level. You're sizing for a pipeline with multiple inference steps, variable tool call frequency, and unpredictable user load.
Agent Fleet Architecture: MCP Servers, Routing Layers, and Shared GPU Pools
A production agent fleet has three distinct compute layers. Conflating them leads to over-provisioning in some areas and bottlenecks in others.
Layer 1: The orchestrator
The orchestrator decides which tools to call, in what order, and what to do with results. For most agent frameworks (LangGraph, CrewAI, AutoGen), the orchestrator is an LLM inference call. It receives the user request, the tool definitions, and any accumulated context, then outputs a decision.
Orchestrator requirements: low latency is everything. TTFT under 500ms is the standard target for user-facing agents. Use your fastest GPU here. The H100 PCIe at $2.51/hr handles orchestrator workloads with TTFT well under 500ms for 7B-13B models under typical concurrency.
Layer 2: Tool execution (MCP servers)
MCP servers handle the actual tool calls. Some tools are CPU-only (web search, database queries, code parsing). Some need GPU: inference, embeddings, image generation, code execution with GPU-accelerated libraries.
The key architectural insight: MCP servers themselves don't need GPU. The GPU lives in the backend inference service the MCP server calls. You can run many MCP server processes on a single CPU host and point them all at a shared GPU inference pool.
For a detailed walkthrough of setting up GPU-backed MCP servers, see How to Deploy GPU-Accelerated MCP Servers.
Layer 3: Shared inference pool
The inference pool is where the actual GPU compute lives. A single vLLM instance can serve multiple MCP servers and agent instances simultaneously via continuous batching. This is where you can get dramatic cost savings compared to naive per-agent GPU allocation.
Without shared pooling: 100 agents x 1 GPU each = 100 GPUs, most of them idle.
With shared pooling: 100 agents multiplexed onto a high-utilization inference pool = 4-8 GPUs.
The math only works if your concurrency patterns allow batching. Synchronous, serial agents (one request at a time, waiting for response) don't batch well. Async agents that submit requests and process results later batch extremely well.
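The batching difference can be demonstrated without a GPU. The sketch below is an assumption-level simulation: `pool_infer` stands in for an HTTP call to the shared pool's OpenAI-compatible endpoint, with a fixed sleep as latency. It contrasts a serial agent (which the pool can never batch) with an async agent submitting the same requests concurrently:

```python
import asyncio
import time

# Stand-in for a request to the shared pool. In production this is an
# HTTP call to the pool's OpenAI-compatible endpoint; here it is a
# sleep so the batching effect can be measured without a GPU.
async def pool_infer(prompt: str) -> str:
    await asyncio.sleep(0.05)  # simulated per-request latency
    return f"response:{prompt}"

async def serial_agent(prompts: list[str]) -> list[str]:
    # Serial agent: waits for each response before sending the next.
    # The pool only ever sees one request from it at a time.
    return [await pool_infer(p) for p in prompts]

async def async_agent(prompts: list[str]) -> list[str]:
    # Async agent: submits everything up front. A continuous-batching
    # server like vLLM can fold these into shared forward passes.
    return await asyncio.gather(*(pool_infer(p) for p in prompts))

async def main() -> tuple[float, float]:
    prompts = [f"tool-call-{i}" for i in range(20)]
    t0 = time.perf_counter()
    await serial_agent(prompts)
    serial_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    await async_agent(prompts)
    async_s = time.perf_counter() - t0
    return serial_s, async_s

serial_s, async_s = asyncio.run(main())
print(f"serial {serial_s:.2f}s vs async {async_s:.2f}s")
```

The serial path takes roughly 20x the single-request latency; the async path takes roughly one request's worth, which is the utilization gap the shared pool exploits.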
GPU Right-Sizing for Agent Workloads: TTFT vs Throughput Tradeoffs
Not all GPUs are equal for agent workloads. The tradeoff you're optimizing depends on your tier.
| GPU | On-Demand ($/hr) | Spot ($/hr) | Best for agent workloads |
|---|---|---|---|
| L40S PCIe | $0.72 | N/A | Lightweight to mid-tier inference |
| A100 80G PCIe | $1.04 | $1.14 | High-throughput MCP servers |
| H100 PCIe | $2.51 | N/A | Latency-critical routing tier |
| H100 SXM5 | $4.41 | N/A | High-concurrency orchestrators |
Pricing fluctuates based on GPU availability. The prices above are based on 19 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Note: A100 80G PCIe spot pricing is currently slightly above on-demand, which is atypical. This reflects current market demand. Check live pricing before choosing spot over on-demand for A100 PCIe workloads.
TTFT-first workloads (orchestrators, user-facing agents)
The metric that matters here is time to first token under load. TTFT degrades as queue depth grows, which means you need to overprovision relative to average load, not just peak.
H100 PCIe at $2.51/hr is the standard choice. Its HBM2e memory bandwidth (roughly 2 TB/s) and high FP16 throughput make it the best option for fast prefill (the stage that dominates TTFT at long context lengths). For very high-concurrency orchestrators, the H100 SXM5 at $4.41/hr delivers higher memory bandwidth (3.35 TB/s HBM3) and more headroom before KV cache pressure degrades TTFT.
See H100 rental details for current availability and per-second billing options.
Throughput-first workloads (background agents, batch pipelines)
For agents that run in the background (document analysis, research pipelines, scheduled summarization), TTFT is irrelevant. Total tokens per hour and cost per million tokens are what matter.
The A100 80G PCIe at $1.04/hr delivers strong throughput for 7B-13B models and is significantly cheaper than the H100. For background agent pipelines, spot instances (where available) can cut costs further — check current GPU pricing for live spot rates.
Embedding-heavy workloads (RAG tool calls)
If your agents rely heavily on vector search and retrieval, the bottleneck is often the embedding service, not the generation model. L40S at $0.72/hr handles embedding workloads for most production RAG setups with headroom to spare. For the full agentic RAG architecture, see Agentic RAG on GPU Cloud.
Autoscaling Patterns: Scale-to-Zero, Burst Scaling, and Queue-Based GPU Allocation
The wrong autoscaling strategy is expensive in both directions: too aggressive, and you're paying for idle GPUs; too conservative, and you're dropping requests during traffic spikes.
Pattern 1: Scale-to-zero for batch agents
Scale-to-zero releases GPU instances when the task queue is empty, then provisions new instances when tasks arrive. Cold start time is the main cost: spinning up a GPU instance and loading a model takes 2-5 minutes end-to-end on most providers (covering instance provisioning, container pull, CUDA initialization, and weight loading).
This is fine for background agents where users aren't waiting. Scheduled document processing, nightly analysis pipelines, batch research workflows. Completely wrong for user-facing agents.
Implementation: use a task queue (Redis with Celery, RabbitMQ, or SQS) as the source of truth. Set a scale-out trigger at queue depth > 0 sustained for 30 seconds. Set a scale-in trigger at queue depth == 0 sustained for 5 minutes. On Spheron, use the API docs to provision instances programmatically from your queue consumer.
Pattern 2: Minimum-floor scaling for user-facing agents
Keep a minimum of N warm GPU instances at all times to handle baseline traffic without cold starts. Scale out when queue depth grows, scale in when it shrinks, but never go below the floor.
The right floor is: how many GPUs do you need to serve your baseline traffic (typically overnight/weekends) with acceptable TTFT? For most teams starting out, that's 1-2 instances.
Scale-out trigger: queue depth > 10 requests per GPU instance sustained for 60 seconds (the threshold from MCP server autoscaling patterns).
Scale-in trigger: queue has been below threshold for 5+ minutes. Use a cooldown to prevent flapping.
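The floor-and-threshold rules can be reduced to one pure function. This is a sketch under the thresholds stated above (10 queued requests per GPU instance); `desired_instances` is a hypothetical name, and the cooldown timer mentioned in the comment is assumed to live in the calling loop:

```python
def desired_instances(queue_depth: int, current: int, floor: int,
                      per_gpu_threshold: int = 10) -> int:
    # Scale out when queue depth exceeds the per-instance threshold;
    # scale in one step at a time, never below the floor.
    if queue_depth > per_gpu_threshold * current:
        # Add enough capacity to bring depth back under the threshold.
        needed = -(-queue_depth // per_gpu_threshold)  # ceil division
        return max(needed, current)
    if queue_depth == 0 and current > floor:
        return current - 1   # gradual scale-in; pair with a cooldown timer
    return max(current, floor)

print(desired_instances(queue_depth=45, current=2, floor=2))  # 5
print(desired_instances(queue_depth=0, current=5, floor=2))   # 4
```

Returning `current - 1` rather than jumping straight to the floor is deliberate: combined with the 5-minute cooldown, it prevents flapping when traffic is hovering around a threshold.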
Pattern 3: Predictive scaling for known traffic patterns
If your agent traffic follows a predictable pattern (business hours, daily peaks, weekly cycles), pre-scale based on the schedule rather than waiting for queue depth to trigger reactive scaling. Add 30% headroom above predicted peak.
This works well when combined with spot instances for the burst capacity: maintain a core fleet of on-demand instances at the floor, burst onto spot instances for the predictable traffic spikes. If spot is interrupted, fall back gracefully.
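A schedule-based target can be as simple as a forecast table plus the 30% headroom rule. The numbers below are illustrative assumptions (a business-hours traffic shape, L40S-tier capacity of 50 concurrent interactions per GPU, a floor of 2), not measurements:

```python
import math

PER_GPU_CAPACITY = 50   # concurrent interactions per GPU (L40S-tier assumption)
HEADROOM = 1.30         # 30% above predicted peak
FLOOR = 2               # never below the warm floor

# Hypothetical hourly forecast of concurrent interactions, e.g. built
# from last week's observed load. Business-hours traffic shape.
FORECAST = {h: 900 if 9 <= h < 18 else 120 for h in range(24)}

def scheduled_gpus(hour: int) -> int:
    # Pre-scale to predicted load plus headroom, respecting the floor.
    predicted = FORECAST[hour]
    return max(FLOOR, math.ceil(predicted * HEADROOM / PER_GPU_CAPACITY))

print(scheduled_gpus(10))  # 24 (business hours)
print(scheduled_gpus(3))   # 4  (overnight)
```

Anything above the floor in this schedule is a candidate for spot capacity, with reactive queue-depth scaling kept on as a backstop for forecast misses.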
What not to do: scale on GPU utilization
GPU utilization is the wrong autoscaling signal. A GPU at 70% utilization with a growing request queue is already degrading TTFT. By the time utilization hits 95%, you've been dropping requests for minutes. Scale on queue depth, not utilization.
Multi-Agent Orchestration Frameworks: LangGraph, CrewAI, and AutoGen on GPU Cloud
The choice of orchestration framework affects your GPU architecture more than most people expect. Different frameworks make different assumptions about state, concurrency, and tool call patterns.
LangGraph
LangGraph models agents as stateful graphs with explicit checkpointing. Each node in the graph can be an LLM call, a tool call, or a conditional branch. The stateful design means LangGraph agents are naturally suited for long-running tasks with complex tool chains.
GPU implication: LangGraph's checkpoint-based state management means you can interrupt and resume agent runs. This makes it compatible with spot GPU instances for non-user-facing nodes, since a spot interruption just means resuming from the last checkpoint. Keep the user-facing orchestrator node on on-demand instances; move background research nodes to spot.
LangGraph's native async support allows multiple graph nodes to run in parallel, which maps directly to GPU batching. If your graph has parallel tool call branches, those branches can share a single GPU inference pool efficiently.
CrewAI
CrewAI structures agents as a crew with defined roles, where each agent has specific responsibilities and a set of tools. The framework handles routing between agents automatically.
GPU implication: CrewAI's multi-agent routing often means multiple LLM calls per user request. If each agent uses a different model (e.g., a researcher agent uses a 70B model and a writer agent uses a 13B model), you need both models VRAM-resident simultaneously. Plan for multi-model co-resident VRAM requirements rather than per-agent VRAM.
AutoGen
AutoGen takes a conversation-based approach where agents communicate by passing messages to each other. Multi-agent AutoGen workflows tend to generate significantly more tokens than single-agent equivalents because each agent explains its reasoning to the others.
GPU implication: higher token counts mean higher KV cache pressure. AutoGen workflows at scale need more VRAM per session than comparable single-agent workflows. Size for the full conversation length, not just the per-turn output length.
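Sizing for full conversation length is straightforward arithmetic. The sketch below assumes a Llama-3-8B-like geometry (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache) — substitute your model's config values:

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors, per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

# Llama-3-8B-like geometry: 32 layers, 8 KV heads (GQA), head_dim 128
per_token = kv_cache_bytes(1, 32, 8, 128)        # 131,072 bytes = 128 KiB/token
long_chat = kv_cache_bytes(16_000, 32, 8, 128)   # ~1.95 GiB for one long session
print(per_token, long_chat / 2**30)
```

At 128 KiB per token, a multi-agent conversation that balloons to 16K tokens holds nearly 2 GiB of VRAM for a single session — which is why AutoGen-style fleets hit KV cache pressure long before weight memory becomes the constraint.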
Note: Microsoft unified AutoGen and Semantic Kernel under the Microsoft Agent Framework in late 2025. The GPU architecture patterns described here apply to both the original AutoGen and the new framework.
For the benchmark data on running 100 concurrent agent tasks with LangGraph and vLLM, see Running 100 Concurrent AI Agent Tasks on Bare Metal GPUs.
For cost comparison between on-premises and cloud GPU for these frameworks, see LLM Inference On-Premise vs GPU Cloud.
Cost Modeling: GPU Spend Per Agent Interaction at 1K, 10K, and 100K Concurrent Agents
The numbers below use the tiered architecture (L40S for lightweight workers, A100 80G PCIe for mid-tier, H100 for orchestrators) with shared inference pools.
Assumptions:
- Average interaction: 600 input tokens + 350 output tokens
- Tool calls per interaction: 4 (2 embeddings + 2 inference calls)
- Concurrency handled per GPU: L40S = 50 interactions, A100 80G PCIe = 75 interactions, H100 = 100 interactions
- Mix: 70% lightweight agents (L40S), 25% mid-tier (A100 80G PCIe), 5% orchestrator (H100)
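The assumptions above translate mechanically into GPU counts and hourly cost. A minimal sketch of that model (prices and capacities taken from this section; `fleet_cost` is a hypothetical helper):

```python
import math

GPU_SPECS = {  # $/hr on-demand, concurrent interactions per GPU, traffic share
    "L40S": {"price": 0.72, "per_gpu": 50,  "share": 0.70},
    "A100": {"price": 1.04, "per_gpu": 75,  "share": 0.25},
    "H100": {"price": 2.51, "per_gpu": 100, "share": 0.05},
}

def fleet_cost(total_concurrent: int) -> tuple[int, float]:
    gpus, cost = 0, 0.0
    for spec in GPU_SPECS.values():
        tier_load = total_concurrent * spec["share"]
        n = math.ceil(tier_load / spec["per_gpu"])  # round up per tier
        gpus += n
        cost += n * spec["price"]
    return gpus, round(cost, 2)

print(fleet_cost(1_000))    # (19, 16.75)
print(fleet_cost(10_000))   # (179, 148.71)
print(fleet_cost(100_000))  # (1784, 1480.86)
```

Note that cost grows slightly sublinearly between tiers only because of per-tier rounding; the model is otherwise linear in concurrency, which is why the per-interaction figures later in this section stay roughly constant across scales.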
At 1,000 concurrent interactions:
| Tier | GPUs needed | On-demand cost/hr | Spot cost/hr |
|---|---|---|---|
| L40S (700 interactions) | 14 GPUs | $10.08 | N/A |
| A100 80G PCIe (250 interactions) | 4 GPUs | $4.16 | N/A |
| H100 (50 interactions) | 1 GPU | $2.51 | N/A |
| Total | 19 GPUs | $16.75/hr | N/A |
At 10,000 concurrent interactions:
| Tier | GPUs needed | On-demand cost/hr | Spot cost/hr |
|---|---|---|---|
| L40S (7,000 interactions) | 140 GPUs | $100.80 | N/A |
| A100 80G PCIe (2,500 interactions) | 34 GPUs | $35.36 | N/A |
| H100 (500 interactions) | 5 GPUs | $12.55 | N/A |
| Total | 179 GPUs | $148.71/hr | N/A |
At 100,000 concurrent interactions:
| Tier | GPUs needed | On-demand cost/hr | Spot cost/hr |
|---|---|---|---|
| L40S (70,000 interactions) | 1,400 GPUs | $1,008.00 | N/A |
| A100 80G PCIe (25,000 interactions) | 334 GPUs | $347.36 | N/A |
| H100 (5,000 interactions) | 50 GPUs | $125.50 | N/A |
| Total | 1,784 GPUs | $1,480.86/hr | N/A |
The A100 80G PCIe mid-tier costs $35.36/hr at 10,000 concurrent interactions and $347.36/hr at 100,000. Check current GPU pricing for spot availability — when spot rates drop below on-demand for mid-tier GPUs, the per-interaction savings at this scale are substantial.
Cost per interaction (on-demand): at 10,000 concurrent interactions, $148.71/hr works out to about $0.0149 per concurrent slot per hour. At a typical 3-second end-to-end latency, each slot turns over roughly 1,200 interactions per hour, or roughly $0.0000124 per interaction.
The numbers change significantly with model size. Switching from 8B to 70B models for the orchestrator tier roughly triples GPU cost for that tier. For most fleets, using a capable 8B-13B model at the orchestrator tier and only escalating to larger models for specific high-complexity tasks keeps costs manageable.
Shared vs Dedicated GPU Pools for Mixed Agent Workloads
The architecture choice between shared and dedicated GPU pools has the biggest single impact on your per-interaction cost.
Shared pool architecture
All agent instances submit inference requests to a central vLLM pool. The pool handles batching, queuing, and priority. GPU utilization is high because the pool aggregates demand from many agents.
Pros: high GPU utilization (60-80% vs 10-30% for dedicated), simple to scale (add capacity to the pool, all agents benefit), lower total GPU count.
Cons: one model per pool (unless you run multiple pools). Agents with very different latency requirements compete for the same queue. Noisy-neighbor effects at high load.
Works best for: homogeneous agent fleets where all agents use the same model and have similar latency requirements.
Dedicated pool architecture
Each agent type (or agent tier) has its own GPU pool. Orchestrators have an H100 pool, tool-calling workers have an L40S pool, embedding services have their own pool.
Pros: each tier gets guaranteed capacity and predictable latency, no cross-tier interference, independent scaling per tier.
Cons: lower average GPU utilization per pool, more complex provisioning.
Works best for: heterogeneous fleets with mixed latency requirements, different models per tier, or explicit SLA guarantees per agent type.
MIG partitioning for mixed workloads
NVIDIA MIG (Multi-Instance GPU) lets you partition a single A100 or H100 into multiple isolated GPU instances, each with dedicated VRAM and compute. A 40GB A100 can be partitioned into up to 7 instances (each 5GB VRAM) or various larger configurations.
This is useful when you have low-concurrency agents that don't need a full GPU, but you still want isolation and predictable performance. A single H100 80G with MIG can run a 7B orchestrator in a 2g.20gb slice and an embedding service in a 1g.10gb slice simultaneously, with full isolation.
For a complete guide to MIG partitioning and time-slicing techniques, see Running Multiple LLMs on One GPU with MIG and Time-Slicing.
Production Checklist: Deploy Your First Agent Fleet on Spheron GPU Cloud
Before going to production with your agent fleet, verify these in order:
Infrastructure:
- GPU tier selection matches workload (TTFT-first vs throughput-first)
- vLLM configured with `--max-num-seqs` at 1.5-2x expected concurrent sessions
- MCP servers deployed as CPU processes, not on GPU instances
- Task queue in place (Redis/RabbitMQ) to buffer requests between agents and inference pool
- Health check endpoint on inference server returning HTTP 200
- Load balancer with health check intervals at 10-15 seconds
- Minimum 2 GPU instances behind load balancer for failover
Autoscaling:
- Scale trigger based on queue depth, not GPU utilization
- Scale-out threshold: queue depth > N sustained for 60 seconds
- Scale-in threshold: queue empty for 5+ minutes with cooldown
- Minimum floor defined (for user-facing agents: never scale to zero)
- Spot instances assigned only to interruptible workloads
Monitoring:
- TTFT p95 alert configured (alert before it exceeds SLA, not after)
- KV cache utilization tracked (alert at 85%, not 100%)
- Queue depth dashboarded alongside GPU count
- Cost per hour tracked by tier
Cost controls:
- On-demand used only for orchestrators and user-facing tiers
- Spot pricing applied to background and batch agent tiers
- Per-hour cost budget configured with alerts
For Spheron deployment specifics, including API provisioning and per-second billing setup, see the Spheron documentation.
Agent fleets are compute-intensive at scale, but you don't need hyperscaler pricing. Spheron's spot and on-demand GPU instances let you right-size each tier of your fleet, from $0.72/hr L40S GPUs for lightweight tool agents to H100s for latency-critical orchestrators.
