A single agent calling one LLM is straightforward to size. A multi-agent system where an orchestrator spawns 20 specialized worker agents, each with its own context window, is a different problem entirely. A LangGraph workflow running 5 specialized sub-agents in parallel, say a researcher, a web searcher, a code writer, a reviewer, and a summarizer, is effectively running 5 simultaneous inference sessions, all competing for the same GPU's VRAM and compute. Size it like a single agent and you will run out of KV cache slots within minutes of going to production.
This guide covers the infrastructure layer: topology choices, KV cache math at scale, GPU selection from single-instance to fleet deployments, and cost modeling that tells you when managed API pricing stops making sense. For the single-agent fundamentals, VRAM sizing formulas, and latency budgets, start with the GPU infrastructure requirements for AI agents post first. This one picks up where that leaves off.
Why Multi-Agent Systems Are Architecturally Different
Three structural differences separate multi-agent infrastructure from single-agent serving.
Multiple simultaneous model loads
A coordinator agent may use a 70B reasoning model for complex planning decisions while worker agents run 8B instruct models for targeted tasks. These cannot share the same VRAM region; each model needs its own contiguous allocation. On an 80GB H100, a 70B FP8 model consumes almost the entire card, leaving no room for worker models. Either you run separate GPUs for different model tiers, or you constrain your orchestrator to a smaller model class. The architecture decision you make at the topology level determines your minimum hardware footprint.
Coordination overhead is GPU compute
Agent-to-agent calls via HTTP or message queues add 5-50ms per hop. The serialization and deserialization of tool outputs consumes CPU, but when the result re-enters the LLM context window, the GPU processes that additional context on every subsequent decode step. Each tool call that appends 500 tokens to the context increases KV cache size by 500 slots for the duration of the session. A 5-tool-call workflow can triple the effective KV cache size compared to a direct single-turn request.
Session state is multiplicative
In a single-agent setup, 100 concurrent users equals 100 KV cache slots. In a multi-agent setup with a 5-agent workflow, 100 concurrent user workflows equals up to 500 active KV cache slots if all sub-agents run in parallel. This is the number that breaks naive infrastructure sizing. If you sized your single-agent deployment for 100 sessions, your multi-agent deployment with the same user count needs 5x the KV cache capacity. The formulas from the GPU infrastructure requirements for AI agents post apply here, but the multiplier is topology-dependent.
Multi-Agent Topologies and Their GPU Footprints
Which topology you choose determines your infrastructure shape before you pick a single GPU.
Flat topology: All agents share one model and one inference server. N agents equals N slots in the same vLLM instance. Simplest to operate. You only manage one model and one serving process. The downside: you hit the VRAM ceiling fastest, because every agent must use the same model and you cannot specialize compute by task type. Works well when all your agents genuinely need the same model class and context length.
Hierarchical topology: One orchestrator model (larger, reasoning-optimized) plus N worker agents (smaller, faster). Requires at least two GPU processes, ideally two separate GPUs or a multi-GPU node with partitioned VRAM. The orchestrator is latency-critical; workers can tolerate some queueing. This is the topology that production multi-agent APIs most commonly end up running because it lets you match compute cost to task complexity.
Pipeline topology: Agents hand off sequentially (researcher passes to writer, writer passes to editor, editor passes to publisher). Only one agent is active per step per user workflow. Peak concurrency is lower than flat: if you have 100 concurrent user workflows in a pipeline with 5 stages, at any moment roughly 100 agents are active (one per workflow, distributed as ~20 per stage across all 5 stages), not 500. Pipeline topology requires one active KV cache slot per concurrent workflow, not one per agent. This is good for background document pipelines and async batch jobs. For real-time use cases where users wait for the result, the accumulated latency across pipeline stages usually makes this the wrong choice.
| Topology | Concurrent VRAM slots | GPU config | Best for |
|---|---|---|---|
| Flat | N agents x 1 session each | Single large GPU | Under 50 concurrent workflows, homogeneous tasks |
| Hierarchical | (1 orchestrator + N workers) x concurrent users | 2+ GPUs or large VRAM | Production APIs, specialized pipelines |
| Pipeline | 1 active slot per workflow step | Single mid-tier GPU | Async batch pipelines, doc processing |
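The slot arithmetic in the table can be sanity-checked with a few lines of Python. This is a simplified model (the function and parameter names are illustrative, not from any framework); for hierarchical topologies, `agents_per_workflow` counts the orchestrator plus its workers:

```python
def concurrent_kv_slots(topology: str, workflows: int, agents_per_workflow: int) -> int:
    """Estimate peak KV cache slots for a topology.

    Simplified model: flat and hierarchical topologies can run every
    sub-agent of a workflow in parallel; a pipeline keeps only one
    agent active per workflow at any moment.
    """
    if topology in ("flat", "hierarchical"):
        return workflows * agents_per_workflow
    if topology == "pipeline":
        return workflows  # one active stage per workflow
    raise ValueError(f"unknown topology: {topology}")

# 100 concurrent workflows, 5 agents each:
print(concurrent_kv_slots("flat", 100, 5))      # 500 slots
print(concurrent_kv_slots("pipeline", 100, 5))  # 100 slots
```

The 5x gap between the flat and pipeline results is exactly the multiplier that breaks naive single-agent sizing.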
Persistent Sessions: What Happens After 60 Minutes
Most infrastructure guides treat agents as stateless. Multi-agent systems often are not. A research agent that has processed 50 web pages may have accumulated 30,000-50,000 tokens of context. That context has to live somewhere, and on a GPU, it lives in the KV cache.
KV cache growth with conversation length
KV cache memory scales linearly with context length. The per-token KV cache size depends on model architecture. For Llama 3.1 8B with 32 layers, 8 KV heads, and 128-dimensional head embeddings in FP8 (1 byte per element):
```
KV bytes per token = num_layers x 2 (K and V) x num_kv_heads x head_dim x dtype_bytes
                   = 32 x 2 x 8 x 128 x 1 byte
                   = 65,536 bytes (64 KB) per token
```

At 8K context, each session consumes 64 KB x 8,192 = 512 MB of KV cache. At 32K context, that is 2 GB per session. With 100 concurrent sessions at 32K context, you need 200 GB of KV cache alone, before model weights. That exceeds any single GPU currently available and requires a multi-GPU deployment. For FP16 instead of FP8, double these numbers. Adding --kv-cache-dtype fp8 cuts KV cache memory by 50% on Hopper and Blackwell architectures (H100, H200, B200). It is not available on Ampere (A100, A10).
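The per-token formula translates directly into code for checking other architectures. A back-of-envelope sketch (function names are illustrative):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """KV cache bytes per token: layers x 2 (K and V) x heads x dim x bytes."""
    return num_layers * 2 * num_kv_heads * head_dim * dtype_bytes

def session_kv_gb(context_len: int, per_token_bytes: int) -> float:
    """KV cache for one session at a given context length, in GiB."""
    return context_len * per_token_bytes / 1024**3

# Llama 3.1 8B in FP8: 32 layers, 8 KV heads, head_dim 128, 1 byte/element
per_tok = kv_bytes_per_token(32, 8, 128, 1)
print(per_tok)                         # 65536 bytes (64 KB)
print(session_kv_gb(8192, per_tok))    # 0.5 GB at 8K context
print(session_kv_gb(32768, per_tok))   # 2.0 GB at 32K context
```

Swap in the layer count, KV head count, and head dimension from any model's config to get its per-session footprint.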
Three context management strategies
When sessions grow beyond your target context window, you have three options:
- Sliding window: Discard the oldest tokens when hitting the context limit. Fast. No extra inference calls. The downside is that the agent loses its earliest context, which can matter for research agents that established important facts early in the session.
- Summarization checkpoint: Compress older conversation turns into a short summary, keep recent turns verbatim. This preserves semantic continuity at the cost of one extra inference call per checkpoint. A research agent can summarize every 5K tokens of earlier context into a 500-token summary, keeping effective context concise without losing key facts.
- Reduced context cap: Set --max-model-len to a practical limit (8K or 16K) for most sessions. Route the rare long-running sessions that genuinely need 50K+ tokens to a dedicated high-VRAM GPU (H200 or B200). This is the most cost-effective approach for production systems where most sessions stay short but occasional outliers run long.
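As an illustration of the first strategy, a minimal sliding-window sketch. Here `count_tokens` is a whitespace stand-in; production code would use the model's actual tokenizer:

```python
def sliding_window(turns: list[str], max_tokens: int,
                   count_tokens=lambda s: len(s.split())) -> list[str]:
    """Keep the most recent turns whose combined token count fits the cap.

    Walks the history newest-first and stops at the budget, which is
    exactly why early facts get dropped: the oldest turns go first.
    """
    kept, total = [], 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if total + cost > max_tokens:
            break
        kept.append(turn)
        total += cost
    return list(reversed(kept))

turns = [
    "fact one established early",
    "long middle discussion " * 10,
    "latest user question",
]
print(sliding_window(turns, max_tokens=25))  # drops the earliest turns
```

The summarization-checkpoint strategy replaces the `break` with an extra inference call that compresses the dropped turns instead of discarding them.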
vLLM configuration for long sessions
Key flags for managing persistent agent sessions:
- --max-model-len: Set to your actual 95th-percentile context length, not the model's theoretical maximum. Reducing from 32K to 8K quadruples your concurrent session capacity for typical agent workloads.
- --kv-cache-dtype fp8: Cuts KV cache memory by roughly 50%. H100, H200, and B200 only.
- --enable-prefix-caching: Reuses KV cache for repeated system prompts across sessions. If your agents share a long system prompt (a common pattern in multi-agent orchestration where every worker gets the same base instructions), prefix caching can save 20-40% of KV cache on repeated requests.
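The capacity effect of lowering --max-model-len is easy to quantify. A rough KV-bound estimate (the 64 KB/token figure assumes the 8B FP8 example above; the function name is illustrative):

```python
def max_sessions(kv_budget_gb: float, context_len: int,
                 kv_kb_per_token: float = 64) -> int:
    """How many concurrent sessions fit in a given KV cache budget,
    assuming every session uses its full context window."""
    per_session_gb = context_len * kv_kb_per_token / 1024 / 1024
    return int(kv_budget_gb / per_session_gb)

# 50 GB of KV cache headroom on one GPU, 8B FP8 model:
print(max_sessions(50, 32768))  # 25 sessions at 32K
print(max_sessions(50, 8192))   # 100 sessions at 8K, the 4x gain
```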
GPU Memory Budget at Scale
The table below maps model size, context length, and concurrent session count to VRAM requirements. These figures assume --kv-cache-dtype fp8 is enabled. Without it, KV cache numbers roughly double.
| Model | Context | Sessions | KV Cache | Weights | Total VRAM | GPU |
|---|---|---|---|---|---|---|
| 8B FP8 | 8K | 100 | ~50 GB | ~8 GB | ~60 GB | H100 PCIe 80GB |
| 8B FP8 | 32K | 100 | ~200 GB | ~8 GB | ~210 GB | 2x H200 |
| 70B FP8 | 8K | 50 | ~64 GB | ~70 GB | ~134 GB (+15% overhead ~154 GB) | 2x H100 SXM5 (160 GB) or B200 (192 GB) |
| 70B FP8 | 32K | 50 | ~256 GB | ~70 GB | ~330 GB | 2x B200 (384 GB) |
Add 10-15% overhead for the inference runtime, CUDA context, and operating system. If your calculation comes out at 75 GB, provision an 80 GB GPU, not a 48 GB one. Running at 95% VRAM under normal load means any traffic spike causes KV cache exhaustion and request failures.
The dedicated vs shared GPU memory post explains why bare-metal deployments outperform virtualized multi-tenant setups for production agent workloads.
GPU Recommendations for Multi-Agent Serving
The table below maps GPU options to multi-agent fleet size. Prices reflect Spheron on-demand rates as of March 2026 (spot prices noted separately). Note: prices fluctuate based on GPU availability and market conditions. Check current GPU pricing before building your cost model.
| GPU | VRAM | Agent fleet size | Typical config | Spheron price |
|---|---|---|---|---|
| A100 80GB PCIe | 80 GB | 40-80 sessions | Single vLLM, 8B model, 16K context | $1.04/hr |
| H100 PCIe 80GB | 80 GB | 80-150 sessions | Single vLLM + prefix caching | $2.01/hr |
| H100 SXM5 80GB | 80 GB | 150-250 sessions | vLLM with tensor parallelism (TP=2) | ~$2.40/hr per GPU (~$4.80/hr for TP=2) |
| H200 SXM 141GB | 141 GB | 250-500 sessions | vLLM, long-context agents | $3.70/hr |
| B200 192GB | 192 GB | 500-1,000 sessions | vLLM FP4, highest density | $2.01/hr spot |
For multi-agent hierarchical topologies, a common production configuration is one H200 for the orchestrator model (70B, long-context planning) plus a pool of H100 PCIe instances for worker agents. The orchestrator handles one request at a time per user workflow; worker agents handle many simultaneous short-context calls.
Inference-optimized variants matter for tensor parallelism
For flat multi-agent topology on a single GPU, PCIe vs SXM is irrelevant for performance. But at TP=2 or TP=4 configurations serving 1,000+ concurrent sessions, NVLink bandwidth becomes the bottleneck. The H100 SXM variant offers 900 GB/s NVLink bandwidth. The PCIe variant requires optional NVLink bridges (sold separately) to achieve 600 GB/s GPU-to-GPU bandwidth; without bridges, multi-GPU communication is limited to PCIe bandwidth. The difference shows up directly in tensor-parallel all-reduce latency: for a 70B model split across two GPUs, SXM cuts the inter-GPU communication overhead by roughly 30%.
GPU rental links for each tier:
- RTX 5090 rental for development and small production
- A100 rental for mid-scale flat topologies
- H100 rental for production multi-agent APIs
- H200 rental for long-context orchestration
- B200 rental for maximum session density
For a head-to-head comparison of H100 and H200 memory specifications and KV cache capacity, see the H100 vs H200 comparison. For B200 specs and when FP4 quantization changes the economics, see the B200 complete guide.
Scaling from 1 Agent to 1,000 Concurrent Sessions
Three distinct stages, each with different configuration requirements.
Stage 1: Single GPU, 1-50 concurrent sessions
One vLLM server, one GPU (RTX 5090 or H100 PCIe). This is where you start.
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 100 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --port 8000
```

Set --max-num-seqs to 1.5-2x your target peak. No load balancer is needed at this stage; direct HTTP to vLLM's OpenAI-compatible endpoint works. Session affinity is not a concern when there is only one instance.
Stage 2: Horizontal scaling, 50-250 sessions
Two to four vLLM instances behind litellm proxy or nginx upstream. The critical addition at this stage is session affinity.
```nginx
upstream vllm_pool {
    hash $http_x_session_id consistent;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
}
```

Without session affinity via consistent hashing on session or user ID, a follow-up request in a multi-turn agent conversation can land on a vLLM instance that does not have the session's KV cache. That forces a full re-prefill of the entire conversation history, adding 50-200ms to TTFT for every turn after the first. For multi-turn agents, this is a correctness issue as much as a latency issue.
Add a Redis session store for LangGraph or CrewAI state persistence. The vLLM instance should be stateless from the orchestration layer's perspective; only the inference KV cache needs locality.
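If routing lives in the orchestration layer rather than nginx, rendezvous hashing gives the same session stickiness in a few lines. A sketch (server addresses and session IDs are illustrative):

```python
import hashlib

def pick_server(session_id: str, servers: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: each session maps to a
    stable server, and removing one server only remaps that server's
    sessions rather than reshuffling the whole pool."""
    def weight(server: str) -> int:
        digest = hashlib.sha256(f"{session_id}:{server}".encode()).hexdigest()
        return int(digest, 16)
    return max(servers, key=weight)

servers = ["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"]
# Same session always lands on the same vLLM instance:
print(pick_server("sess-42", servers) == pick_server("sess-42", servers))  # True
```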
Cost implication: two GPUs at Stage 2 versus one at Stage 1. Worth it above roughly 60 concurrent sessions, where a single GPU starts showing P99 TTFT degradation from queue saturation.
Stage 3: Multi-model cluster, 250-1,000 sessions
Separate GPU pools for different agent roles.
Orchestrator model: deploy on its own GPU or GPU pair. Use a faster, smaller model for routing and planning decisions (7B-8B class) rather than a 70B model, keeping the orchestrator available with low latency. The orchestrator does not need to be the largest model; it needs to be the fastest model that reliably routes tasks.
Worker agents: horizontal GPU pool, scaled independently of the orchestrator. Worker pool size determines your peak concurrency ceiling.
Message queue: Redis Streams or RabbitMQ for async agent tasks that do not need real-time response. Agents that perform background research or process documents do not need to block a live HTTP connection; queue them and poll for results.
Auto-scaling trigger: queue depth or P95 TTFT exceeding threshold, not CPU or GPU utilization alone. A queue that grows by 10 requests per minute will overwhelm the system in 20 minutes even if current GPU utilization is 70%.
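A scaling trigger along those lines might look like the following sketch. The thresholds are illustrative starting points, not a drop-in autoscaler:

```python
def should_scale_up(queue_depth: int, queue_growth_per_min: float,
                    p95_ttft_ms: float,
                    max_queue: int = 50, max_growth: float = 5.0,
                    max_ttft_ms: float = 500.0) -> bool:
    """Scale on queue depth, queue growth rate, or P95 TTFT, not on raw
    GPU utilization: a growing queue predicts saturation before
    utilization metrics show it."""
    return (queue_depth > max_queue
            or queue_growth_per_min > max_growth
            or p95_ttft_ms > max_ttft_ms)

# A queue growing 10 req/min triggers scaling even at 70% GPU utilization:
print(should_scale_up(queue_depth=20, queue_growth_per_min=10, p95_ttft_ms=300))  # True
```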
Rough cost math at 1,000 sessions:
1,000 concurrent sessions at 8B FP8 with 4K context needs approximately 4 H100 SXMs based on 250 sessions per GPU at 4K context. Fleet cost: 4 x $2.40/hr = $9.60/hr. The equivalent workload on a managed API at $0.0000004/token average (mixing input and output), running 1,000 sessions each generating 200 tokens/minute, produces 12,000,000 tokens per hour. At $0.40/M tokens: $4.80/hr. At 1,000 sessions with short agent loops (200 tokens/min per session), managed API wins on cost. At 5,000 tokens per session per minute (typical agent loops with tool calls and longer context), managed API reaches $120/hr and bare metal wins by 12x. Short-loop sessions are the edge case; most production agent workloads hit the higher token rate and favor bare metal. The crossover is workload-dependent. The next section covers how to model it.
For how these numbers hold up in practice, the 100 concurrent AI agent case study shows exact benchmark results from a production H100 setup.
Cost Modeling for Agent-Heavy Workloads
When does fixed GPU infrastructure beat per-token managed API pricing?
Variables:
T = average tokens per interaction (input + output combined)
R = interactions per hour
P_api = managed API price per million tokens
P_gpu = GPU cost per hour
I_gpu = interactions per GPU per hour at sustained load
Crossover: GPU wins when
```
P_api x T x R / 1,000,000 > P_gpu x (R / I_gpu)
```

Simplified (cost per interaction):

```
P_api x T / 1,000,000 > P_gpu / I_gpu
```

Worked example:
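The crossover formula is easy to wrap in code and run against your own numbers. A sketch using the worked example's figures:

```python
def cost_per_interaction_api(tokens: int, price_per_m_tokens: float) -> float:
    """Managed API: pay per token."""
    return price_per_m_tokens * tokens / 1_000_000

def cost_per_interaction_gpu(gpu_hourly: float,
                             interactions_per_gpu_hour: float) -> float:
    """Fixed GPU: hourly rate amortized over sustained throughput."""
    return gpu_hourly / interactions_per_gpu_hour

# 950 tokens/interaction at $0.40/M vs an H100 SXM5 at $2.40/hr
# handling ~22,000 interactions/hr at sustained load:
api = cost_per_interaction_api(950, 0.40)
gpu = cost_per_interaction_gpu(2.40, 22_000)
print(f"API ${api:.6f} vs GPU ${gpu:.6f} per interaction")
print(api > gpu)  # True: GPU wins at sustained load
```

Note the GPU side assumes sustained load; the idle-time caveat below is what flips the comparison at low utilization.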
Workload: 500 interactions per hour, 950 tokens average each.
Managed API at $0.40/M tokens: 0.40 x (500 x 950) / 1,000,000 = $0.19/hr
Single H100 SXM5 at $2.40/hr handles approximately 22,000+ interactions/hr at sustained load. Cost for 500 interactions: $2.40 x (500 / 22,000) = $0.055/hr
At 500 interactions/hour, the H100 SXM5 is roughly 3.5x cheaper per interaction than the managed API, even though it is mostly idle.
At 100 interactions/hour:

- Managed API: $0.40 x (100 x 950) / 1,000,000 = $0.038/hr
- H100 SXM5 prorated: $2.40 x (100 / 22,000) = $0.011/hr
H100 SXM5 still wins. But if you run the GPU 24/7 and only use it 2 hours a day:

- GPU cost: $2.40 x 24 = $57.60/day for 200 interactions
- Managed API: $0.038 x 2 = $0.076/day
Fixed GPU infrastructure wins when your GPU utilization exceeds roughly 30% of hours in a month. Below that threshold, managed API is cheaper because you pay only for the tokens you actually process. Above it, you are effectively subsidizing the idle time with per-interaction savings that compound at scale.
For concrete cost-per-interaction numbers from a real production benchmark, the 100 concurrent AI agent case study measured $0.11 per 1,000 interactions on Spheron H100 SXM5 on-demand versus $0.38 per 1,000 interactions on GPT-4o-mini equivalent pricing.
Where to Start
If you are at 1-50 concurrent sessions: a single H100 PCIe on Spheron with one vLLM instance handles this range comfortably for 8B models. The RTX 5090 works for development and for small production deployments under 40 sessions.
If you are at 50-250 sessions: horizontal scaling with session affinity. Two to three H100 instances behind a load balancer, Redis for LangGraph state, consistent hashing on session ID. If your orchestrator uses a 70B model, provision an H200 for that tier.
If you are planning for 1,000+ sessions: the B200 gives the highest session density per GPU for dense models. For a multi-H100 cluster at TP=2, use SXM variants for NVLink bandwidth. Check current pricing before committing to a specific configuration.
For the benchmarks backing the concurrency numbers above, the 100 concurrent AI agent case study covers exact TTFT, throughput, and cost results from a production H100 deployment.
Multi-agent systems need GPU infrastructure that scales with topology, not just with user count. Spheron gives you H100, H200, and B200 instances on demand, with no minimum commitments. Start with a single node and scale horizontally as your session count grows.
