A single agent calling one LLM is straightforward to size. A multi-agent system where an orchestrator spawns 20 specialized worker agents, each with its own context window, is a different problem entirely. A LangGraph workflow running 5 specialized sub-agents in parallel, say a researcher, a web searcher, a code writer, a reviewer, and a summarizer, is effectively running 5 simultaneous inference sessions, all competing for the same GPU's VRAM and compute. Size it like a single agent and you will run out of KV cache slots within minutes of going to production.
This guide covers the infrastructure layer: topology choices, KV cache math at scale, GPU selection from single-instance to fleet deployments, and cost modeling that tells you when managed API pricing stops making sense. For the single-agent fundamentals, VRAM sizing formulas, and latency budgets, start with the GPU infrastructure requirements for AI agents post first. This one picks up where that leaves off.
Why Multi-Agent Systems Are Architecturally Different
Three structural differences separate multi-agent infrastructure from single-agent serving.
Multiple simultaneous model loads
A coordinator agent may use a 70B reasoning model for complex planning decisions while worker agents run 8B instruct models for targeted tasks. These cannot share the same VRAM region; each model needs its own contiguous allocation. On an 80GB H100, a 70B FP8 model consumes almost the entire card, leaving no room for worker models. Either you run separate GPUs for different model tiers, or you constrain your orchestrator to a smaller model class. The architecture decision you make at the topology level determines your minimum hardware footprint.
Coordination overhead is GPU compute
Agent-to-agent calls via HTTP or message queues add 5-50ms per hop. The serialization and deserialization of tool outputs consumes CPU, but when the result re-enters the LLM context window, the GPU processes that additional context on every subsequent decode step. Each tool call that appends 500 tokens to the context increases KV cache size by 500 slots for the duration of the session. A 5-tool-call workflow can triple the effective KV cache size compared to a direct single-turn request.
Session state is multiplicative
In a single-agent setup, 100 concurrent users equals 100 KV cache slots. In a multi-agent setup with a 5-agent workflow, 100 concurrent user workflows equals up to 500 active KV cache slots if all sub-agents run in parallel. This is the number that breaks naive infrastructure sizing. If you sized your single-agent deployment for 100 sessions, your multi-agent deployment with the same user count needs 5x the KV cache capacity. The formulas from the GPU infrastructure requirements for AI agents post apply here, but the multiplier is topology-dependent.
Multi-Agent Topologies and Their GPU Footprints
Which topology you choose determines your infrastructure shape before you pick a single GPU.
Flat topology: All agents share one model and one inference server. N agents equals N slots in the same vLLM instance. Simplest to operate. You only manage one model and one serving process. The downside: you hit the VRAM ceiling fastest, because every agent must use the same model and you cannot specialize compute by task type. Works well when all your agents genuinely need the same model class and context length.
Hierarchical topology: One orchestrator model (larger, reasoning-optimized) plus N worker agents (smaller, faster). Requires at least two GPU processes, ideally two separate GPUs or a multi-GPU node with partitioned VRAM. The orchestrator is latency-critical; workers can tolerate some queueing. This is the topology that production multi-agent APIs most commonly end up running because it lets you match compute cost to task complexity.
Pipeline topology: Agents hand off sequentially (researcher passes to writer, writer passes to editor, editor passes to publisher). Only one agent is active per step per user workflow. Peak concurrency is lower than flat: if you have 100 concurrent user workflows in a pipeline with 5 stages, at any moment roughly 100 agents are active (one per workflow, distributed as ~20 per stage across all 5 stages), not 500. Pipeline topology requires one active KV cache slot per concurrent workflow, not one per agent. This is good for background document pipelines and async batch jobs. For real-time use cases where users wait for the result, the accumulated latency across pipeline stages usually makes this the wrong choice.
| Topology | Concurrent VRAM slots | GPU config | Best for |
|---|---|---|---|
| Flat | N agents x 1 session each | Single large GPU | Under 50 concurrent workflows, homogeneous tasks |
| Hierarchical | (1 orchestrator + N workers) x concurrent users | 2+ GPUs or large VRAM | Production APIs, specialized pipelines |
| Pipeline | 1 active slot per workflow step | Single mid-tier GPU | Async batch pipelines, doc processing |
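The slot arithmetic in the table can be sanity-checked with a few lines of Python. This is a simplified model (the function and parameter names are illustrative, not from any framework); for hierarchical topologies, `agents_per_workflow` counts the orchestrator plus its workers:

```python
def concurrent_kv_slots(topology: str, workflows: int, agents_per_workflow: int) -> int:
    """Estimate peak KV cache slots for a topology.

    Simplified model: flat and hierarchical topologies can run every
    sub-agent of a workflow in parallel; a pipeline keeps only one
    agent active per workflow at any moment.
    """
    if topology in ("flat", "hierarchical"):
        return workflows * agents_per_workflow
    if topology == "pipeline":
        return workflows  # one active stage per workflow
    raise ValueError(f"unknown topology: {topology}")

# 100 concurrent workflows, 5 agents each:
print(concurrent_kv_slots("flat", 100, 5))      # 500 slots
print(concurrent_kv_slots("pipeline", 100, 5))  # 100 slots
```

The 5x gap between the flat and pipeline results is exactly the multiplier that breaks naive single-agent sizing.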
Persistent Sessions: What Happens After 60 Minutes
Most infrastructure guides treat agents as stateless. Multi-agent systems often are not. A research agent that has processed 50 web pages may have accumulated 30,000-50,000 tokens of context. That context has to live somewhere, and on a GPU, it lives in the KV cache.
KV cache growth with conversation length
KV cache memory scales linearly with context length. The per-token KV cache size depends on model architecture. For Llama 3.1 8B with 32 layers, 8 KV heads, and 128-dimensional head embeddings in FP8 (1 byte per element):
```
KV bytes per token = num_layers x 2 (K and V) x num_kv_heads x head_dim x dtype_bytes
                   = 32 x 2 x 8 x 128 x 1 byte
                   = 65,536 bytes (64 KB) per token
```

At 8K context, each session consumes 64 KB x 8,192 = 512 MB of KV cache. At 32K context, that is 2 GB per session. With 100 concurrent sessions at 32K context, you need 200 GB of KV cache alone, before model weights. That exceeds any single GPU currently available and requires a multi-GPU deployment. For FP16 instead of FP8, double these numbers. Adding --kv-cache-dtype fp8 cuts KV cache memory by 50% on Hopper and Blackwell architectures (H100, H200, B200). It is not available on Ampere (A100, A10).
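The per-token formula translates directly into code for checking other architectures. A back-of-envelope sketch (function names are illustrative):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """KV cache bytes per token: layers x 2 (K and V) x heads x dim x bytes."""
    return num_layers * 2 * num_kv_heads * head_dim * dtype_bytes

def session_kv_gb(context_len: int, per_token_bytes: int) -> float:
    """KV cache for one session at a given context length, in GiB."""
    return context_len * per_token_bytes / 1024**3

# Llama 3.1 8B in FP8: 32 layers, 8 KV heads, head_dim 128, 1 byte/element
per_tok = kv_bytes_per_token(32, 8, 128, 1)
print(per_tok)                         # 65536 bytes (64 KB)
print(session_kv_gb(8192, per_tok))    # 0.5 GB at 8K context
print(session_kv_gb(32768, per_tok))   # 2.0 GB at 32K context
```

Swap in the layer count, KV head count, and head dimension from any model's config to get its per-session footprint.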
Three context management strategies
When sessions grow beyond your target context window, you have three options:
- Sliding window: Discard the oldest tokens when hitting the context limit. Fast. No extra inference calls. The downside is that the agent loses its earliest context, which can matter for research agents that established important facts early in the session.
- Summarization checkpoint: Compress older conversation turns into a short summary, keep recent turns verbatim. This preserves semantic continuity at the cost of one extra inference call per checkpoint. A research agent can summarize every 5K tokens of earlier context into a 500-token summary, keeping effective context concise without losing key facts.
- Reduced context cap: Set --max-model-len to a practical limit (8K or 16K) for most sessions. Route the rare long-running sessions that genuinely need 50K+ tokens to a dedicated high-VRAM GPU (H200 or B200). This is the most cost-effective approach for production systems where most sessions stay short but occasional outliers run long.
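As an illustration of the first strategy, a minimal sliding-window sketch. Here `count_tokens` is a whitespace stand-in; production code would use the model's actual tokenizer:

```python
def sliding_window(turns: list[str], max_tokens: int,
                   count_tokens=lambda s: len(s.split())) -> list[str]:
    """Keep the most recent turns whose combined token count fits the cap.

    Walks the history newest-first and stops at the budget, which is
    exactly why early facts get dropped: the oldest turns go first.
    """
    kept, total = [], 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if total + cost > max_tokens:
            break
        kept.append(turn)
        total += cost
    return list(reversed(kept))

turns = [
    "fact one established early",
    "long middle discussion " * 10,
    "latest user question",
]
print(sliding_window(turns, max_tokens=25))  # drops the earliest turns
```

The summarization-checkpoint strategy replaces the `break` with an extra inference call that compresses the dropped turns instead of discarding them.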
vLLM configuration for long sessions
Key flags for managing persistent agent sessions:
- --max-model-len: Set to your actual 95th-percentile context length, not the model's theoretical maximum. Reducing from 32K to 8K quadruples your concurrent session capacity for typical agent workloads.
- --kv-cache-dtype fp8: Cuts KV cache memory by roughly 50%. H100, H200, and B200 only.
- --enable-prefix-caching: Reuses KV cache for repeated system prompts across sessions. If your agents share a long system prompt (a common pattern in multi-agent orchestration where every worker gets the same base instructions), prefix caching can save 20-40% of KV cache on repeated requests.
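The capacity effect of lowering --max-model-len is easy to quantify. A rough KV-bound estimate (the 64 KB/token figure assumes the 8B FP8 example above; the function name is illustrative):

```python
def max_sessions(kv_budget_gb: float, context_len: int,
                 kv_kb_per_token: float = 64) -> int:
    """How many concurrent sessions fit in a given KV cache budget,
    assuming every session uses its full context window."""
    per_session_gb = context_len * kv_kb_per_token / 1024 / 1024
    return int(kv_budget_gb / per_session_gb)

# 50 GB of KV cache headroom on one GPU, 8B FP8 model:
print(max_sessions(50, 32768))  # 25 sessions at 32K
print(max_sessions(50, 8192))   # 100 sessions at 8K, the 4x gain
```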
GPU Memory Budget at Scale
The table below maps model size, context length, and concurrent session count to VRAM requirements. These figures assume --kv-cache-dtype fp8 is enabled. Without it, KV cache numbers roughly double.
| Model | Context | Sessions | KV Cache | Weights | Total VRAM | GPU |
|---|---|---|---|---|---|---|
| 8B FP8 | 8K | 100 | ~50 GB | ~8 GB | ~60 GB | H100 PCIe 80GB |
| 8B FP8 | 32K | 100 | ~200 GB | ~8 GB | ~210 GB | 2x H200 |
| 70B FP8 | 8K | 50 | ~64 GB | ~70 GB | ~134 GB (+15% overhead ~154 GB) | 2x H100 SXM5 (160 GB) or B200 (192 GB) |
| 70B FP8 | 32K | 50 | ~256 GB | ~70 GB | ~330 GB | 2x B200 (384 GB) |
Add 10-15% overhead for the inference runtime, CUDA context, and operating system. If your calculation comes out at 75 GB, provision an 80 GB GPU, not a 48 GB one. Running at 95% VRAM under normal load means any traffic spike causes KV cache exhaustion and request failures.
The dedicated vs shared GPU memory post explains why bare-metal deployments outperform virtualized multi-tenant setups for production agent workloads.
GPU Recommendations for Multi-Agent Serving
The table below maps GPU options to multi-agent fleet size. Prices reflect Spheron on-demand rates as of March 2026 (spot prices noted separately). Note: prices fluctuate based on GPU availability and market conditions. Check current GPU pricing before building your cost model.
| GPU | VRAM | Agent fleet size | Typical config | Spheron price |
|---|---|---|---|---|
| A100 80GB PCIe | 80 GB | 40-80 sessions | Single vLLM, 8B model, 16K context | $1.04/hr |
| H100 PCIe 80GB | 80 GB | 80-150 sessions | Single vLLM + prefix caching | $2.01/hr |
| H100 SXM5 80GB | 80 GB | 150-250 sessions | vLLM with tensor parallelism (TP=2) | ~$2.40/hr per GPU (~$4.80/hr for TP=2) |
| H200 SXM 141GB | 141 GB | 250-500 sessions | vLLM, long-context agents | $3.70/hr |
| B200 192GB | 192 GB | 500-1,000 sessions | vLLM FP4, highest density | $2.01/hr spot |
For multi-agent hierarchical topologies, a common production configuration is one H200 for the orchestrator model (70B, long-context planning) plus a pool of H100 PCIe instances for worker agents. The orchestrator handles one request at a time per user workflow; worker agents handle many simultaneous short-context calls.
Inference-optimized variants matter for tensor parallelism
For flat multi-agent topology on a single GPU, PCIe vs SXM is irrelevant for performance. But at TP=2 or TP=4 configurations serving 1,000+ concurrent sessions, NVLink bandwidth becomes the bottleneck. The H100 SXM variant offers 900 GB/s NVLink bandwidth. The PCIe variant requires optional NVLink bridges (sold separately) to achieve 600 GB/s GPU-to-GPU bandwidth; without bridges, multi-GPU communication is limited to PCIe bandwidth. The difference shows up directly in tensor-parallel all-reduce latency: for a 70B model split across two GPUs, SXM cuts the inter-GPU communication overhead by roughly 30%.
GPU rental links for each tier:
- RTX 5090 rental for development and small production
- A100 rental for mid-scale flat topologies
- H100 rental for production multi-agent APIs
- H200 rental for long-context orchestration
- B200 rental for maximum session density
For a head-to-head comparison of H100 and H200 memory specifications and KV cache capacity, see the H100 vs H200 comparison. For B200 specs and when FP4 quantization changes the economics, see the B200 complete guide.
Scaling from 1 Agent to 1,000 Concurrent Sessions
Three distinct stages, each with different configuration requirements.
Stage 1: Single GPU, 1-50 concurrent sessions
One vLLM server, one GPU (RTX 5090 or H100 PCIe). This is where you start.
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 100 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --port 8000
```

Set --max-num-seqs to 1.5-2x your target peak. No load balancer is needed at this stage; direct HTTP to vLLM's OpenAI-compatible endpoint works. Session affinity is not a concern when there is only one instance.
Stage 2: Horizontal scaling, 50-250 sessions
Two to four vLLM instances behind litellm proxy or nginx upstream. The critical addition at this stage is session affinity.
```nginx
upstream vllm_pool {
    hash $http_x_session_id consistent;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
}
```

Without session affinity via consistent hashing on session or user ID, a follow-up request in a multi-turn agent conversation can land on a vLLM instance that does not have the session's KV cache. That forces a full re-prefill of the entire conversation history, adding 50-200ms to TTFT for every turn after the first. For multi-turn agents, this is a correctness issue as much as a latency issue.
Add a Redis session store for LangGraph or CrewAI state persistence. The vLLM instance should be stateless from the orchestration layer's perspective; only the inference KV cache needs locality.
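If routing lives in the orchestration layer rather than nginx, rendezvous hashing gives the same session stickiness in a few lines. A sketch (server addresses and session IDs are illustrative):

```python
import hashlib

def pick_server(session_id: str, servers: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: each session maps to a
    stable server, and removing one server only remaps that server's
    sessions rather than reshuffling the whole pool."""
    def weight(server: str) -> int:
        digest = hashlib.sha256(f"{session_id}:{server}".encode()).hexdigest()
        return int(digest, 16)
    return max(servers, key=weight)

servers = ["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"]
# Same session always lands on the same vLLM instance:
print(pick_server("sess-42", servers) == pick_server("sess-42", servers))  # True
```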
Cost implication: two GPUs at Stage 2 versus one at Stage 1. Worth it above roughly 60 concurrent sessions, where a single GPU starts showing P99 TTFT degradation from queue saturation.
Stage 3: Multi-model cluster, 250-1,000 sessions
Separate GPU pools for different agent roles.
Orchestrator model: deploy on its own GPU or GPU pair. Use a faster, smaller model for routing and planning decisions (7B-8B class) rather than a 70B model, keeping the orchestrator available with low latency. The orchestrator does not need to be the largest model; it needs to be the fastest model that reliably routes tasks.
Worker agents: horizontal GPU pool, scaled independently of the orchestrator. Worker pool size determines your peak concurrency ceiling.
Message queue: Redis Streams or RabbitMQ for async agent tasks that do not need real-time response. Agents that perform background research or process documents do not need to block a live HTTP connection; queue them and poll for results.
Auto-scaling trigger: queue depth or P95 TTFT exceeding threshold, not CPU or GPU utilization alone. A queue that grows by 10 requests per minute will overwhelm the system in 20 minutes even if current GPU utilization is 70%.
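A scaling trigger along those lines might look like the following sketch. The thresholds are illustrative starting points, not a drop-in autoscaler:

```python
def should_scale_up(queue_depth: int, queue_growth_per_min: float,
                    p95_ttft_ms: float,
                    max_queue: int = 50, max_growth: float = 5.0,
                    max_ttft_ms: float = 500.0) -> bool:
    """Scale on queue depth, queue growth rate, or P95 TTFT, not on raw
    GPU utilization: a growing queue predicts saturation before
    utilization metrics show it."""
    return (queue_depth > max_queue
            or queue_growth_per_min > max_growth
            or p95_ttft_ms > max_ttft_ms)

# A queue growing 10 req/min triggers scaling even at 70% GPU utilization:
print(should_scale_up(queue_depth=20, queue_growth_per_min=10, p95_ttft_ms=300))  # True
```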
Rough cost math at 1,000 sessions:
1,000 concurrent sessions at 8B FP8 with 4K context needs approximately 4 H100 SXMs based on 250 sessions per GPU at 4K context. Fleet cost: 4 x $2.40/hr = $9.60/hr. The equivalent workload on a managed API at $0.0000004/token average (mixing input and output), running 1,000 sessions each generating 200 tokens/minute, produces 12,000,000 tokens per hour. At $0.40/M tokens: $4.80/hr. At 1,000 sessions with short agent loops (200 tokens/min per session), managed API wins on cost. At 5,000 tokens per session per minute (typical agent loops with tool calls and longer context), managed API reaches $120/hr and bare metal wins by 12x. Short-loop sessions are the edge case; most production agent workloads hit the higher token rate and favor bare metal. The crossover is workload-dependent. The next section covers how to model it.
For how these numbers hold up in practice, the 100 concurrent AI agent case study shows exact benchmark results from a production H100 setup.
Cost Modeling for Agent-Heavy Workloads
When does fixed GPU infrastructure beat per-token managed API pricing?
Variables:
T = average tokens per interaction (input + output combined)
R = interactions per hour
P_api = managed API price per million tokens
P_gpu = GPU cost per hour
I_gpu = interactions per GPU per hour at sustained load
Crossover: GPU wins when
```
P_api x T x R / 1,000,000 > P_gpu x (R / I_gpu)
```

Simplified (cost per interaction):

```
P_api x T / 1,000,000 > P_gpu / I_gpu
```

Worked example:
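The crossover formula is easy to wrap in code and run against your own numbers. A sketch using the worked example's figures:

```python
def cost_per_interaction_api(tokens: int, price_per_m_tokens: float) -> float:
    """Managed API: pay per token."""
    return price_per_m_tokens * tokens / 1_000_000

def cost_per_interaction_gpu(gpu_hourly: float,
                             interactions_per_gpu_hour: float) -> float:
    """Fixed GPU: hourly rate amortized over sustained throughput."""
    return gpu_hourly / interactions_per_gpu_hour

# 950 tokens/interaction at $0.40/M vs an H100 SXM5 at $2.40/hr
# handling ~22,000 interactions/hr at sustained load:
api = cost_per_interaction_api(950, 0.40)
gpu = cost_per_interaction_gpu(2.40, 22_000)
print(f"API ${api:.6f} vs GPU ${gpu:.6f} per interaction")
print(api > gpu)  # True: GPU wins at sustained load
```

Note the GPU side assumes sustained load; the idle-time caveat below is what flips the comparison at low utilization.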
Workload: 500 interactions per hour, 950 tokens average each.
Managed API at $0.40/M tokens: 0.40 x (500 x 950) / 1,000,000 = $0.19/hr
Single H100 SXM5 at $2.40/hr handles approximately 22,000+ interactions/hr at sustained load. Cost for 500 interactions: $2.40 x (500 / 22,000) = $0.055/hr
At 500 interactions/hour, the H100 SXM5 is roughly 3.5x cheaper per interaction than the managed API, even though it is mostly idle.
At 100 interactions/hour:

- Managed API: $0.40 x (100 x 950) / 1,000,000 = $0.038/hr
- H100 SXM5 prorated: $2.40 x (100 / 22,000) = $0.011/hr
H100 SXM5 still wins. But if you run the GPU 24/7 and only use it 2 hours a day:

- GPU cost: $2.40 x 24 = $57.60/day for 200 interactions
- Managed API: $0.038 x 2 = $0.076/day
Fixed GPU infrastructure wins when your GPU utilization exceeds roughly 30% of hours in a month. Below that threshold, managed API is cheaper because you pay only for the tokens you actually process. Above it, you are effectively subsidizing the idle time with per-interaction savings that compound at scale.
For concrete cost-per-interaction numbers from a real production benchmark, the 100 concurrent AI agent case study measured $0.11 per 1,000 interactions on Spheron H100 SXM5 on-demand versus $0.38 per 1,000 interactions on GPT-4o-mini equivalent pricing.
Where to Start
If you are at 1-50 concurrent sessions: a single H100 PCIe on Spheron with one vLLM instance handles this range comfortably for 8B models. The RTX 5090 works for development and for small production deployments under 40 sessions.
If you are at 50-250 sessions: horizontal scaling with session affinity. Two to three H100 instances behind a load balancer, Redis for LangGraph state, consistent hashing on session ID. If your orchestrator uses a 70B model, provision an H200 for that tier.
If you are planning for 1,000+ sessions: the B200 gives the highest session density per GPU for dense models. For a multi-H100 cluster at TP=2, use SXM variants for NVLink bandwidth. Check current pricing before committing to a specific configuration.
For the benchmarks backing the concurrency numbers above, the 100 concurrent AI agent case study covers exact TTFT, throughput, and cost results from a production H100 deployment.
Multi-agent systems need GPU infrastructure that scales with topology, not just with user count. Spheron gives you H100, H200, and B200 instances on demand, with no minimum commitments. Start with a single node and scale horizontally as your session count grows.
