A B2B SaaS company operating an AI-powered customer support platform needed to serve 100 simultaneous agent sessions during peak business hours, each agent handling a live customer conversation, querying a product knowledge base, and generating step-by-step resolution guidance in under three seconds end-to-end. The constraint was hard: customers waited for each response, making P99 latency as important as P50.
Here is exactly how the infrastructure was built, what hardware it ran on, and what the performance and cost numbers looked like. Every configuration value and benchmark figure here is reproducible on Spheron using the setup described. Understanding GPU infrastructure requirements for AI agents is the foundation this case study builds on.
The Scenario: What "100 Concurrent AI Agents" Means
Before benchmarking, the scenario was defined precisely. Vague "concurrent agent" claims collapse under scrutiny; these numbers only hold for this specific setup.
Agent type: Customer support agents for a B2B SaaS platform. Each agent handles inbound product support questions: billing queries, integration troubleshooting, and feature guidance drawn from a structured knowledge base.
Agent architecture:
- Orchestration framework: LangGraph 1.1.x
- LLM: Llama 3.1 8B Instruct (FP8 quantization)
- Tools per agent: `knowledge_base_lookup` (vector search over product docs), `ticket_create` (structured output for CRM integration)
- Average conversation length: 3 turns before resolution
- Average tokens per interaction: 600 input (system prompt + conversation history + retrieved context) + 350 output (agent response)
Concurrency model:
- 100 simultaneous agent sessions
- Each session had an active end user waiting for a response (synchronous, not background)
- Maximum acceptable TTFT (time to first token): 500ms at P95 (P99 budget: 800ms)
- Maximum acceptable end-to-end response time: 3,000ms at P50 (P95 budget: 5,000ms)
What 100 concurrent sessions maps to in practice: A customer support tool handling 100 simultaneous conversations is approximately what a mid-market SaaS company (500-2,000 customers) experiences during peak morning hours on the US East Coast. It is also the threshold at which managed LLM API costs start to become a significant budget line. At 200,000 interactions per day, per-token API pricing starts adding up in ways that fixed infrastructure costs do not.
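The per-token vs. fixed-cost tradeoff in the paragraph above can be sketched as a quick break-even calculation. The $0.50-per-1M-token API price below is a hypothetical placeholder for illustration, not a quote from any vendor; the 950 tokens per interaction (600 in + 350 out) and $2.50/hr GPU rate come from this case study.

```python
# Break-even sketch: managed per-token API vs. one dedicated GPU.
# ASSUMPTION: the $0.50/1M blended API price is hypothetical, for illustration only.

def daily_api_cost(interactions: int, price_per_1m_tokens: float,
                   tokens_per_interaction: int = 950) -> float:
    """Daily cost of serving a volume through a per-token API."""
    total_tokens = interactions * tokens_per_interaction
    return total_tokens / 1_000_000 * price_per_1m_tokens

def daily_gpu_cost(gpus: int, hourly_rate: float) -> float:
    """Daily cost of dedicated GPUs running around the clock."""
    return gpus * hourly_rate * 24

def break_even_interactions_per_day(hourly_rate: float,
                                    price_per_1m_tokens: float,
                                    tokens_per_interaction: int = 950) -> float:
    """Daily volume above which one dedicated GPU beats the per-token API."""
    per_interaction = tokens_per_interaction / 1_000_000 * price_per_1m_tokens
    return daily_gpu_cost(1, hourly_rate) / per_interaction

api = daily_api_cost(200_000, 0.50)                      # $95/day at the assumed price
gpu = daily_gpu_cost(1, 2.50)                            # $60/day on-demand
breakeven = break_even_interactions_per_day(2.50, 0.50)  # ~126K interactions/day
```

Under these assumptions, the fixed-cost GPU wins somewhere above ~126K interactions per day; the exact crossover shifts with the API price you actually pay.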
Infrastructure Setup
Hardware:
One bare metal H100 SXM5 80GB server on Spheron. No load balancer, no multi-GPU tensor parallelism. The goal was to establish a single-GPU baseline before scaling out.
| Component | Configuration |
|---|---|
| GPU | 1x H100 SXM5 80GB HBM3 |
| RAM | 116 GB system RAM |
| vCPUs | 26 vCPUs |
| Storage | 2,400 GB NVMe |
| Network | 10 Gbps |
Software stack:
| Layer | Technology | Role |
|---|---|---|
| LLM | Llama 3.1 8B Instruct (FP8) | Response generation |
| Inference server | vLLM 0.17.1 | Continuous batching, KV cache management |
| Agent framework | LangGraph 1.1 | Multi-turn state machine, tool dispatch |
| Knowledge base | FAISS (CPU) | Product doc retrieval (pre-embedded) |
| Load generator | Locust 2.43.3 | 100-agent concurrent load simulation |
| Monitoring | Prometheus + Grafana | GPU utilization, TTFT, queue depth |
Deployment commands:
vLLM was launched directly on the bare metal instance, not inside Docker, to eliminate container overhead on GPU memory allocation:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 200 \
  --max-num-batched-tokens 16384 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --disable-log-requests \
  --port 8000
```

Flag rationale:
- `--max-num-seqs 200`: 2x the target concurrency. vLLM queues requests above this limit; setting it to 2x target leaves headroom for traffic spikes without queue saturation.
- `--quantization fp8`: applies FP8 quantization to the model weights, reducing the model footprint from ~16 GB (FP16) to ~8.5 GB.
- `--kv-cache-dtype fp8`: stores the KV cache in FP8 format (1 byte per element). Without this flag, vLLM defaults the KV cache dtype to the compute dtype (`--dtype bfloat16`, i.e. 2 bytes per element), halving the effective KV cache capacity. With this flag, each token occupies 32 × 8 × 128 × 2 × 1 byte = 64 KB, enabling the ~248 concurrent session slots calculated below.
- `--max-model-len 4096`: agents rarely exceeded 3,000 tokens in testing. Reducing from the default 8,192 doubled KV cache capacity from ~125 concurrent slots to ~250.
- `--enable-chunked-prefill`: interleaves long prefill operations with decode steps, preventing one long-context request from blocking all active decode sessions. Critical for P99.
KV cache math:
Llama 3.1 8B uses 32 layers, 8 KV heads, 128-dimensional head embeddings. In FP8, each token occupies 32 × 8 × 128 × 2 × 1 byte = 65,536 bytes (64 KB). With --gpu-memory-utilization 0.90 on an 80 GB GPU, approximately 72 GB is available. After reserving ~8.5 GB for model weights, ~63.5 GB remains for KV cache. At 4,096 tokens per slot, each slot uses 64 KB × 4,096 = 256 MB, yielding approximately 248 concurrent session slots, well above the 100-session target.
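The arithmetic above can be checked with a few lines. This is a rough capacity model, not vLLM's actual allocator: it ignores runtime overhead (CUDA graphs, activation buffers), which is why it lands slightly above the ~248 figure quoted in the text.

```python
# KV cache capacity sketch for Llama 3.1 8B in FP8 on one H100 80GB.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 1  # FP8 KV cache (--kv-cache-dtype fp8)

# K and V each store (kv_heads x head_dim) per layer per token.
kv_per_token = layers * kv_heads * head_dim * 2 * bytes_per_elem  # 65,536 B = 64 KB

gpu_mem_gb = 80 * 0.90   # --gpu-memory-utilization 0.90 -> 72 GB usable
weights_gb = 8.5         # FP8 model weights
kv_budget_bytes = (gpu_mem_gb - weights_gb) * 1024**3  # ~63.5 GB for KV cache

slot_bytes = kv_per_token * 4096        # one full --max-model-len 4096 slot = 256 MB
slots = kv_budget_bytes // slot_bytes   # ~250 concurrent session slots (upper bound)
```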
LangGraph agent configuration:
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator


class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    tool_calls: list
    resolution: str | None


def create_support_agent(vllm_endpoint: str, kb_index) -> StateGraph:
    graph = StateGraph(AgentState)
    graph.add_node("reason", reason_node(vllm_endpoint))
    graph.add_node("lookup", lookup_node(kb_index))
    graph.add_node("respond", respond_node(vllm_endpoint))
    graph.set_entry_point("reason")
    graph.add_conditional_edges(
        "reason",
        route_after_reasoning,
        {"lookup": "lookup", "respond": "respond", "end": END},
    )
    graph.add_edge("lookup", "reason")
    graph.add_edge("respond", END)
    return graph.compile()
```

Each agent instance maintains its own state graph. The 100 concurrent sessions ran as 100 independent LangGraph instances sharing the same vLLM endpoint via the OpenAI-compatible API (`/v1/chat/completions`).
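The node factories and the router used by the conditional edge are not shown. A minimal sketch of `route_after_reasoning` might look like the following; the state-key conventions (pending entries in `tool_calls`, `resolution` set once the agent is done) are assumptions for illustration, not the production logic.

```python
# Hypothetical router for the conditional edge in the graph above.
# ASSUMPTION: the reason node appends pending tool calls to state["tool_calls"]
# and sets state["resolution"] when it has a final answer.
def route_after_reasoning(state: dict) -> str:
    if state.get("resolution"):   # reason node produced a final answer
        return "end"
    if state.get("tool_calls"):   # pending knowledge-base lookups to execute
        return "lookup"
    return "respond"              # no tools needed: generate the reply directly
```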
Load Test Methodology
Load generation:
Locust was configured to ramp 0 to 100 concurrent agents over 60 seconds, then sustain 100 agents for 20 minutes. Each simulated agent sent a realistic support question drawn from a dataset of 500 real-world support ticket templates, then waited for the full streamed response before sending the next turn.
```python
# locustfile.py
from locust import HttpUser, task, between
import random
import time

SYSTEM_PROMPT = "You are a helpful customer support agent."  # define your system prompt here
SUPPORT_QUESTIONS = [...]  # 500-item dataset


class SupportAgentUser(HttpUser):
    wait_time = between(0.1, 0.5)  # simulate think time between turns

    @task
    def run_agent_turn(self):
        payload = {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": random.choice(SUPPORT_QUESTIONS)},
            ],
            "max_tokens": 400,
            "stream": True,
        }
        start_time = time.perf_counter()
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            stream=True,
            catch_response=True,
        ) as resp:
            if resp.status_code >= 400:
                resp.failure(f"HTTP {resp.status_code}")
            else:
                first_token_time = None
                for chunk in resp.iter_lines():
                    if first_token_time is None and chunk:
                        first_token_time = time.perf_counter()
                        ttft = first_token_time - start_time
                        # report TTFT as a separate Locust metric
                        self.environment.events.request.fire(
                            request_type="stream_ttft",
                            name="/v1/chat/completions",
                            response_time=ttft * 1000,
                            response_length=0,
                            exception=None,
                        )
                if first_token_time is None:
                    resp.failure("No tokens received")
                else:
                    resp.success()
```

Metrics measured:
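For reference, the ramp profile described above maps onto a standard Locust headless invocation along these lines. The host placeholder and the 2/s spawn rate are assumptions (100 users at 2/s ramps in ~50 seconds, close to the 60-second profile described); this is a sketch, not the exact command used.

```shell
# Headless run: ramp to 100 users at ~2/s, run ~21 min total (ramp + 20 min sustained).
# Replace <vllm-host> with your vLLM server address.
locust -f locustfile.py --headless -u 100 -r 2 -t 21m \
  --host http://<vllm-host>:8000
```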
- TTFT (time to first token): P50, P95, P99, measured client-side from request send to first streamed chunk
- End-to-end latency: P50, P95, P99, from request send to final token
- Throughput (tokens/sec): measured server-side via vLLM's `/metrics` Prometheus endpoint
- GPU utilization (%): via `nvidia-smi dmon`
- VRAM utilization (GB): via vLLM's memory stats endpoint
- Error rate: HTTP 4xx/5xx and timeout rate during sustained load
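Percentile aggregation over the client-side samples, excluding the ramp-up phase, can be sketched as follows. This is a plausible reconstruction of the analysis step, not the authors' actual tooling (Locust can also export these figures directly via `--csv`).

```python
# Sketch: P50/P95/P99 over the sustained-load phase only.
from statistics import quantiles

def tail_percentiles(samples_ms: list[float], ramp_cutoff_idx: int = 0) -> dict:
    """Compute P50/P95/P99 after dropping ramp-up samples (indices < cutoff)."""
    sustained = samples_ms[ramp_cutoff_idx:]
    # quantiles(n=100) returns 99 cut points; q[i] is the (i+1)th percentile.
    q = quantiles(sustained, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}
```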
Baseline comparison:
The same Locust workload was replayed against the AWS p5.48xlarge (8x H100 SXM5 NVLink, same model, same vLLM configuration). AWS GPU pricing varies with instance type and commitment level; the on-demand p5.48xlarge is priced at $55.04/hr for the full 8-GPU node (reflecting the 44% price reduction AWS applied to P5 instances in June 2025). For a single-GPU comparison, that is $6.88/hr per H100.
Results
This section reports results from the sustained 20-minute load phase with 100 concurrent agents. The ramp-up phase is excluded from percentile calculations.
Performance results:
| Metric | Result | Target | Pass/Fail |
|---|---|---|---|
| TTFT P50 | 78 ms | < 200 ms | ✅ |
| TTFT P95 | 193 ms | < 500 ms | ✅ |
| TTFT P99 | 342 ms | < 800 ms | ✅ |
| End-to-end latency P50 | 1,840 ms | < 3,000 ms | ✅ |
| End-to-end latency P95 | 2,920 ms | < 5,000 ms | ✅ |
| Throughput (tokens/sec) | 22,400 | - | - |
| GPU utilization (avg) | 84% | - | - |
| VRAM utilization | 71 GB / 80 GB | - | - |
| Error rate | 0.0% | < 1% | ✅ |
TTFT P50 of 78ms is well under the 200ms threshold, driven by FP8 quantization reducing the effective model size and chunked prefill preventing head-of-line blocking. The gap between P50 (78ms) and P99 (342ms) reflects request scheduling variance under full load; requests that arrive when the batch is already processing a long prefill wait in the vLLM queue. With chunked prefill enabled, this variance is bounded rather than unbounded.
Cost results:
Note: GPU pricing on Spheron fluctuates over time based on availability and market conditions. The figures below reflect pricing as of March 16, 2026. Check current GPU pricing before planning your cost model.
| Metric | On-Demand | Spot |
|---|---|---|
| GPU | H100 SXM5 80GB | H100 SXM5 80GB |
| Hourly cost (Spheron) | $2.50/hr | $0.99/hr |
| Interactions served/hr | ~230,000 | ~230,000 |
| Cost per 1,000 interactions | $0.011 | $0.0043 |
| Cost per 1M tokens | $0.011 | $0.0045 |
| AWS p5.48xlarge (per GPU) | $6.88/hr | - |
| Savings vs AWS on-demand | 64% | - |
At $0.011 per 1,000 interactions on-demand, a platform handling 10 million agent interactions per day (a large enterprise deployment) requires 2 GPUs running 24/7 to maintain real-time availability (10M interactions divided by 230K/hr = ~43.5 GPU-hours/day, which exceeds a single GPU's 24-hour capacity). That works out to approximately $120/day or $3,600/month on Spheron (2 x $2.50/hr x 24 hours). The same workload on AWS runs approximately $330/day (2 x $6.88/hr x 24 hours). The cost difference justifies the infrastructure investment for any team at scale. GPU pricing fluctuates based on availability, so check current rates when building your cost model.
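The fleet-sizing arithmetic in the paragraph above reduces to a short function. The figures plugged in (230K interactions per GPU-hour, $2.50 and $6.88 hourly rates) are the ones measured and quoted in this case study.

```python
import math

def gpus_needed(interactions_per_day: int, interactions_per_gpu_hour: int) -> int:
    """Dedicated GPUs required for 24/7 real-time coverage of a daily volume."""
    gpu_hours = interactions_per_day / interactions_per_gpu_hour
    return math.ceil(gpu_hours / 24)  # each always-on GPU contributes 24 h/day

def fleet_daily_cost(gpus: int, hourly_rate: float) -> float:
    return gpus * hourly_rate * 24

fleet = gpus_needed(10_000_000, 230_000)      # 43.5 GPU-hours/day -> 2 GPUs
spheron = fleet_daily_cost(fleet, 2.50)       # $120/day
aws = fleet_daily_cost(fleet, 6.88)           # ~$330/day
```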
Spot instances are appropriate only for background agent workloads: batch analysis jobs, async enrichment pipelines, scheduled research tasks. For the customer support case described here (users waiting for a live response), dedicated on-demand instances are required. A spot interruption mid-conversation means a broken session.
What Broke and How It Was Fixed
Three problems came up during the benchmark before the final configuration was stable.
Problem 1: KV cache exhaustion at default settings
The first vLLM launch used the default --max-model-len 8192. Under 100 concurrent sessions with an average context of 600 tokens and growing KV cache from conversation history, the server hit memory pressure errors after 8 minutes. vLLM logs showed VRAM at 98.7% utilization, causing requests to queue indefinitely as the KV cache was full.
Fix: Reduced --max-model-len from 8,192 to 4,096. Support conversations rarely exceeded 3,000 tokens in testing, so this bound was never hit in practice. The reduction doubled the available concurrent session slots and eliminated KV cache exhaustion under sustained load.
Problem 2: P99 latency spikes from simultaneous long-context prefills
In early runs using the V0 engine (explicitly set via VLLM_USE_V1=0) without chunked prefill enabled, P99 TTFT reached 1,840ms, over 5x the P50. The cause was a burst of 8-12 simultaneous requests arriving with long system prompts (600 tokens each). Without chunked prefill, vLLM processes these as a single large prefill batch, blocking all active decode sessions for the entire prefill duration.
Fix: Switched to the V1 engine (the default in vLLM 0.8.0 and later) with --enable-chunked-prefill included explicitly for documentation clarity. In vLLM V1, chunked prefill is always enabled and cannot be disabled. This broke large prefill batches into 2,048-token chunks, interleaving prefill with decode at each scheduler step. P99 TTFT dropped from 1,840ms to 342ms with no throughput regression.
Problem 3: Queue saturation during traffic spikes above 100 concurrent
The load test uncovered that brief spikes to 130 concurrent requests (simulating a traffic burst from an outage alert) caused vLLM's request queue to back up significantly. With --max-num-seqs 100, the scheduler processed at most 100 requests per batch; anything above that waited in queue. A 130-request burst at the default setting caused P99 TTFT to exceed 2,000ms for 40 seconds after the spike.
Fix: Increased --max-num-seqs from 100 to 200. vLLM maintains throughput across larger batches on H100 due to sufficient compute headroom; the 200-sequence setting handled 130-request spikes without P99 degradation. See the GPU infrastructure AI agent guide for more on sizing max-num-seqs relative to your traffic profile.
Scaling Beyond 100 Concurrent Agents
The single-H100 setup described here handles 100 concurrent agents comfortably. Here is the path forward as workloads grow.
Horizontal scaling (100 -> 1,000 concurrent agents):
Add more H100 instances running identical vLLM configurations, load balanced with Nginx using least_conn (route each new request to the instance with the fewest active connections). Each H100 handles ~200 concurrent sessions; 10 instances handle ~2,000. Scaling is near-linear: 5 instances produce 5x throughput at the same TTFT percentiles as the single-instance benchmark.
```nginx
upstream vllm_pool {
    least_conn;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
    # add instances as needed
}
```

Vertical scaling (move to H200 or larger model):
If your workload requires a larger model (13B-70B parameters) or longer context windows (16K-128K), the H200 (141GB HBM3e, 4.8 TB/s bandwidth) dramatically increases KV cache capacity and decode throughput. A 70B model in FP8 weighs ~70GB; on an H200 you have ~70GB remaining for KV cache. On an H100 80GB, there is essentially no room for KV cache after loading the 70B weights. The H100 vs H200 comparison covers this tradeoff in detail.
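The H100 vs H200 headroom claim above comes down to simple subtraction. This sketch uses raw VRAM minus weights and deliberately ignores the `--gpu-memory-utilization` reserve and runtime overhead, which shave several more GB off in practice (leaving the H100 with effectively nothing, as the text notes).

```python
# Rough KV cache headroom for a 70B model in FP8 (weights ~70 GB).
WEIGHTS_70B_FP8_GB = 70

def kv_headroom_gb(vram_gb: float, weights_gb: float = WEIGHTS_70B_FP8_GB) -> float:
    """Raw VRAM left for KV cache after loading weights (overhead ignored)."""
    return max(0.0, vram_gb - weights_gb)

h100 = kv_headroom_gb(80)    # ~10 GB raw; near zero after reserves and overhead
h200 = kv_headroom_gb(141)   # ~71 GB: deep KV cache and real concurrency
```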
When to move to multi-GPU serving:
At 500+ concurrent agents with a 13B+ model, consider `--tensor-parallel-size 2` across two GPUs on the same node. Tensor parallelism splits the model weights across GPUs, freeing per-GPU memory for KV cache, and lowers the latency floor since each matmul is sharded across both devices. For the 8B model in this benchmark, single-GPU is strictly better; tensor parallelism adds communication overhead that increases TTFT without improving throughput at this scale.
For teams approaching 1,000+ concurrent agents, the production GPU cloud architecture guide covers multi-node deployments with autoscaling.
Key Learnings
- `--max-num-seqs` is the primary concurrency lever in vLLM. Set it to 1.5-2x your target concurrent sessions to absorb traffic bursts without queue saturation.
- `--max-model-len` directly determines KV cache capacity. For agent workloads where conversations stay under 4,096 tokens, halving this value from the default doubles your concurrent session ceiling.
- KV cache utilization is the binding constraint for concurrent agent serving, not compute; monitor it with vLLM's Prometheus `/metrics` endpoint before tuning anything else.
- P95 and P99 TTFT matter more than P50 for user-facing agents. Users notice the tail, not the median.
- Chunked prefill is non-negotiable for mixed-load agent serving. Without it, one long-context request blocks the entire batch, causing P99 spikes that are invisible in P50 metrics. In vLLM V1 (the default engine since 0.8.0, including 0.17.1), chunked prefill is always on by default. Pass `--enable-chunked-prefill` explicitly only for documentation clarity.
- FP8 quantization on H100 is worth enabling for conversational agent tasks. The ~8.5 GB model footprint (vs ~16 GB in FP16) nearly doubles KV cache capacity with negligible quality impact on instruction-following tasks.
- Spot instances work for background agents. For live sessions with users waiting, use on-demand dedicated instances; a spot interruption mid-conversation breaks the session and the user experience.
- Spheron's on-demand H100 at $2.50/hr (as of March 16, 2026) versus AWS at $6.88/hr per H100 is a 64% cost reduction. GPU pricing fluctuates over time based on availability. At scale, that gap is the difference between a viable and unviable unit economics model.
For teams building agent infrastructure from first principles, the AI agent GPU infrastructure guide covers VRAM sizing, latency budget allocation, and the key differences between agent and standard LLM serving in detail.
Want to reproduce this benchmark on Spheron? The full setup is described above. Deploy an H100 or RTX 5090, configure vLLM with the flags above, and run the same Locust workload against your agent endpoint.
