Most inference servers recompute the KV cache on every request, even when 80% of the tokens are an identical system prompt or conversation history shared across every turn. SGLang's RadixAttention solves this by caching KV activations in a radix tree and reusing them across requests with shared prefixes, delivering significant TTFT reductions on workloads with 60%+ prefix overlap. If you want a raw engine comparison before committing to SGLang, the vLLM vs TensorRT-LLM vs SGLang benchmark covers throughput, latency, and VRAM numbers across all three on the same H100.
Why SGLang Wins for Agentic Workloads
Every agent turn arrives carrying a long, mostly-static context: tool definitions, memory state, and prior conversation turns. A standard inference server treats each request as independent and recomputes attention from scratch, including the tokens that haven't changed since the last turn.
RadixAttention changes this. It stores cached attention activations in a radix tree keyed by token sequence. When a new request shares a prefix with an existing cached entry, SGLang starts computation from the branching point instead of the beginning. Workloads where agents share a fixed system prompt and tool definitions across all sessions see 75-95% cache hit rates on multi-turn conversations.
| Engine | Cache Reuse Mechanism | Shared-Prefix Benefit |
|---|---|---|
| vLLM | PagedAttention + APC (opt-in) | Supported via APC; requires --enable-prefix-caching flag |
| TensorRT-LLM | Static KV cache allocation | None by default |
| SGLang | RadixAttention (shared radix tree) | Significant TTFT reduction at 60%+ overlap |
How RadixAttention Works
The radix tree stores token sequences as paths from root to leaf. Each node represents a sequence of tokens; child nodes extend the parent's prefix. When a request arrives, SGLang walks the tree to find the longest matching prefix already in cache and begins computation at that point instead of position 0.
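The prefix walk described above can be sketched as a toy per-token trie with a longest-cached-prefix lookup. This is an illustration of the idea only, not SGLang's actual implementation, which compresses token runs into radix edges and evicts least-recently-used branches when the KV pool fills up:

```python
# Toy per-token trie with longest-cached-prefix lookup. Illustration only.

class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.cached = False  # KV activations for this prefix are resident

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        # Record that the KV cache for this token sequence now exists.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.cached = True

    def longest_cached_prefix(self, tokens):
        # Return how many leading tokens can skip prefill entirely.
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or not child.cached:
                break
            node, matched = child, matched + 1
        return matched

cache = RadixCache()
cache.insert([1, 2, 3, 4])                        # e.g. system prompt + turn 1
print(cache.longest_cached_prefix([1, 2, 3, 9]))  # 3: prefill starts at token 4
```

A new request pays prefill cost only for the tokens past the matched prefix, which is why byte-identical system prompts matter so much.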
Two scenarios drive most of the cache hits:
- Shared system prompt: every request starts with the same 512-token system prompt. SGLang computes the KV cache for that prefix once and reuses it across all requests, regardless of how different the user turns are.
- Multi-turn conversation: each turn's history is the prefix for the next turn. SGLang accumulates the cached KV state for the entire conversation and only processes the new user message.
A simplified view of the tree structure:
[root]
|
+-- [system prompt: "You are a helpful assistant with access to tools..."]
|
+-- [user turn 1: "What is the weather in Paris?"]
| |
| +-- [assistant turn 1: "I'll check that for you..."]
| |
| +-- [user turn 2: "And in London?"]
|
+-- [user turn 1: "Summarize this document: ..."]
|
+-- [assistant turn 1: "The document covers..."]

All requests sharing the same system prompt reuse the cached activations from the root to the system prompt node. Each conversation branch is cached independently, so parallel agent sessions don't evict each other's prefixes under normal load.
--mem-fraction-static controls how much GPU memory is reserved for the KV cache versus model weights. Setting it too high leaves no room for activations during decode; too low limits cache depth. Start at 0.92 for single-model deployments on an 80GB H100.
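A back-of-envelope sketch of that trade-off, assuming ~70 GB of FP8 weights for a 70B model and Llama-3.3-70B attention dimensions (80 layers, 8 KV heads, head dim 128); these are rough assumptions for illustration, not measured values:

```python
# Rough KV budget for one 80 GB H100 at --mem-fraction-static 0.92.
# All sizes below are illustrative assumptions, not measurements.

GB = 1024**3
static_pool = 80 * GB * 0.92      # fraction reserved for weights + KV pool
weights_fp8 = 70 * GB             # ~1 byte/param for a 70B model at FP8
kv_pool = static_pool - weights_fp8

# Per-token KV size: 2 (K and V) * kv_heads * head_dim * layers * bytes/elem
layers, kv_heads, head_dim = 80, 8, 128
kv_bytes_per_token = 2 * kv_heads * head_dim * layers * 1  # FP8 -> 1 byte

print(f"KV pool: {kv_pool / GB:.1f} GB")
print(f"Cacheable tokens: {kv_pool // kv_bytes_per_token:,.0f}")
```

Under these assumptions each 0.01 of --mem-fraction-static is 0.8 GB, roughly 5,000 cached tokens, which is why the flag directly bounds how deep the radix tree can grow on large models.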
GPU Requirements by Model Size
| Model Size | Precision | GPU | VRAM Used | Key Flags | Spheron Price |
|---|---|---|---|---|---|
| 7B | FP16 | RTX 4090 (24GB) | ~14 GB | --tp 1 | check pricing |
| 13B | FP16 | A100 40GB | ~26 GB | --tp 1 | from $0.73/hr |
| 70B | FP8 | H100 SXM5 80GB | ~72 GB | --tp 1 --quantization fp8 | from $2.40/hr |
| 70B | FP16 | 2x H100 SXM5 | ~140 GB | --tp 2 | from $4.80/hr |
| 405B | FP8 | 8x H100 NVL | ~405 GB | --tp 8 --quantization fp8 | from $16.48/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
For a deeper look at GPU sizing for different model architectures, see the GPU memory requirements guide for LLMs and best GPU for AI inference 2026.
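The VRAM column in the table above follows a simple rule of thumb: parameter count times bytes per parameter, with KV cache and activations added on top. A hypothetical helper to reproduce the weight-only part:

```python
# Weight-memory rule of thumb behind the sizing table:
# params * bytes per parameter. KV cache, activations, and CUDA
# context come on top of these weight-only numbers.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weights_gb(params_billion: float, precision: str) -> float:
    # Billions of parameters * bytes each ~= GB of weight memory.
    return params_billion * BYTES_PER_PARAM[precision]

for size, prec in [(7, "fp16"), (13, "fp16"), (70, "fp8"), (70, "fp16")]:
    print(f"{size}B at {prec}: ~{weights_gb(size, prec):.0f} GB weights")
```

The billions-to-GB shortcut is slightly loose (10^9 bytes versus 2^30), which the headroom in the table's GPU choices absorbs.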
Step-by-Step: Deploy SGLang on Spheron
Step 1: Provision your Spheron GPU instance
Go to app.spheron.ai, select H100 SXM5 from the GPU catalog, choose your region, and provision. SSH in once the instance is ready. Verify you're on the right hardware:
nvidia-smi
The output should show NVIDIA H100 80GB HBM3 with 81,559 MiB total memory. For setup details, see the Spheron quick-start guide.
Step 2: Verify Docker and NVIDIA Container Toolkit
Most Spheron GPU instances ship with the NVIDIA Container Toolkit pre-installed. Confirm it's working:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
If this fails, install the toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Step 3: Single-GPU SGLang deployment
Pull and run the SGLang container. RadixAttention is on by default; no extra flags needed to enable it.
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.5.9-cu130-runtime \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--context-length 8192 \
--mem-fraction-static 0.92 \
--max-running-requests 128 \
--enable-metrics \
--host 0.0.0.0 \
--port 8000
Model loading takes 50-70 seconds on an H100. Once it's up:
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
Test with a completion:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 50
}'
Step 4: Multi-GPU with tensor parallelism
For 70B at FP16 (requiring 140GB), add --tp 2 to split attention heads and MLP layers across two GPUs:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.5.9-cu130-runtime \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--tp 2 \
--context-length 16384 \
--mem-fraction-static 0.90 \
--host 0.0.0.0 \
--port 8000
--ipc=host is not optional. NCCL uses shared memory for inter-GPU communication, and omitting this flag produces silent CUDA errors under load that are difficult to trace back to this cause. Always include it.
For data parallelism (replicate the full model across multiple GPUs to increase request throughput rather than handle larger models), add --dp 2.
Step 5: Create a systemd service for persistence
Run SGLang as a managed service so it restarts on failures and starts on boot:
[Unit]
Description=SGLang Inference Server
After=docker.service
Requires=docker.service
[Service]
Restart=always
RestartSec=5
ExecStartPre=-/usr/bin/docker rm -f sglang
ExecStart=/usr/bin/docker run --name sglang --gpus all --ipc=host -p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.5.9-cu130-runtime \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--context-length 8192 \
--mem-fraction-static 0.92 \
--enable-metrics \
--host 0.0.0.0 \
--port 8000
ExecStop=/usr/bin/docker stop sglang
[Install]
WantedBy=multi-user.target
Save to /etc/systemd/system/sglang.service, then:
sudo systemctl daemon-reload
sudo systemctl enable sglang
sudo systemctl start sglang
Step 6: Verify RadixAttention is working
Send 20 requests that share a long system prompt and measure TTFT on the first versus subsequent requests:
import asyncio
import time
import aiohttp
SYSTEM_PROMPT = "You are a helpful assistant. " * 100 # ~512 tokens
async def send_request(session, user_message):
start = time.monotonic()
first_token_time = None
async with session.post(
"http://localhost:8000/v1/chat/completions",
json={
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_message}
],
"max_tokens": 20,
"stream": True
}
) as resp:
async for chunk in resp.content:
if first_token_time is None and chunk.strip():
first_token_time = time.monotonic()
break
if first_token_time is None:
raise RuntimeError(f'No token received for: {user_message}')
return first_token_time - start
async def benchmark():
async with aiohttp.ClientSession() as session:
# First request: cold cache, computes the system prompt prefix
first = await send_request(session, "What is 2+2?")
print(f"First request TTFT: {first*1000:.0f} ms")
# Subsequent requests: shared prefix cached
times = []
for i in range(19):
t = await send_request(session, f"What is {i+3}+{i+3}?")
times.append(t)
avg = sum(times) / len(times) * 1000
print(f"Subsequent requests avg TTFT: {avg:.0f} ms")
asyncio.run(benchmark())
Expected output: first request around 280-320ms TTFT, subsequent requests 80-120ms. The first request builds the KV cache for the system prompt; subsequent requests reuse it and only process the new user turn. If subsequent requests are still around 280ms, the system prompt is not byte-identical across requests, which invalidates the cache each time.
Optimizing for Agentic Workloads
Prefix Caching: What Actually Affects Hit Rate
Three things determine how much benefit you actually see:
System prompt consistency. The radix tree keys on exact byte sequences. A single extra space, a newline difference, or a version string in the system prompt creates a different cache key and a full cache miss. Pin your system prompt as a constant string in your application code, not a template that gets regenerated.
Conversation history structure. For multi-turn, pass the full conversation history as a single prefix before the new user turn. If you reconstruct the messages array differently each turn (e.g., trimming old messages and re-adding them), you're creating new prefixes that don't match the cached tree.
Chunked prefill. For long-context requests, --chunked-prefill-size 4096 lets decode start while the long prefix is still being processed by the prefill stage. This doesn't improve cache hit rate, but it reduces user-perceived latency by overlapping the two phases.
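The conversation-history point above comes down to treating the messages array as append-only. A minimal sketch, assuming the OpenAI-style message format the server already accepts; SYSTEM_PROMPT and record_turn are illustrative names, not an SDK API:

```python
# Append-only conversation state: each request extends the previous token
# sequence, so the server sees prior turns as an already-cached prefix.

SYSTEM_PROMPT = "You are a helpful assistant with access to tools."

history = [{"role": "system", "content": SYSTEM_PROMPT}]  # built once, never rebuilt

def record_turn(user_text: str, assistant_text: str) -> None:
    # Append the new turn; never trim, reorder, or re-render earlier
    # messages, or the byte sequence changes and the cached prefix dies.
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})

record_turn("What is the weather in Paris?", "I'll check that for you...")
record_turn("And in London?", "Checking London now...")
print(len(history))  # 5 messages; the first 3 match the prior request byte for byte
```

If you must bound context length, drop whole oldest turns rather than rewriting surviving ones, so the remaining prefix still matches a cached branch.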
Monitor your actual hit rate:
curl http://localhost:8000/metrics | grep sglang_cache_hit_rate
Structured Output for Tool Calling
SGLang's constrained decoding engine enforces JSON schemas at the token level. It cannot produce invalid JSON, which removes an entire class of agent failures where the model generates a malformed function call payload:
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct",
messages=[{"role": "user", "content": "Get weather for Paris"}],
response_format={
"type": "json_schema",
"json_schema": {
"name": "weather_call",
"schema": {
"type": "object",
"properties": {
"function": {"type": "string"},
"location": {"type": "string"}
},
"required": ["function", "location"]
}
}
}
)
print(response.choices[0].message.content)
# Always valid JSON matching the schema
Concurrent Agent Sessions
The throughput advantage from RadixAttention compounds with concurrency because more sessions share more prefix nodes in the radix tree:
| Concurrency | Unique Prompts (tok/s) | 80% Shared Prefix (tok/s) | Cache Hit Rate |
|---|---|---|---|
| 1 | 125 | 125 | - |
| 10 | 680 | 890 | ~78% |
| 50 | 1,920 | 2,480 | ~82% |
| 100 | 2,460 | 3,100 | ~85% |
On unique-prompt workloads, SGLang performs similarly to vLLM; the gap opens specifically when shared prefixes are present.
SGLang vs vLLM vs TensorRT-LLM vs LMDeploy: Decision Framework
| Use This | When |
|---|---|
| SGLang | Multi-turn conversations, agentic workloads, RAG with shared context documents, tool-calling agents |
| vLLM | Broadest model support needed, frequent model updates, team with limited DevOps capacity |
| TensorRT-LLM | Single fixed model, max throughput priority, comfortable with a 28-minute compile pipeline |
| LMDeploy | Smaller deployment footprint, TurboMind backend preference |
For the full throughput and TTFT numbers, see the inference framework benchmark covering SGLang, vLLM, and TensorRT-LLM. All four engines expose OpenAI-compatible APIs, so switching engines does not require application code changes.
For Spheron-specific configuration, see Spheron's SGLang docs and the Spheron LLM quick-guides.
Production Monitoring
Prometheus Metrics
SGLang exposes a Prometheus-compatible /metrics endpoint. To activate it, you must pass --enable-metrics in the launch command (already included in the Docker commands above).
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| sglang_num_queue_reqs | Queue depth | > 20 sustained |
| sglang_token_usage | KV cache fill % | > 85% |
| sglang_cache_hit_rate | RadixAttention effectiveness | < 30% (check prefix consistency) |
| sglang_time_to_first_token_seconds | TTFT p50/p95 | p95 > 2s |
| sglang_num_running_reqs | Active request count | Near --max-running-requests |
# Scrape metrics
curl http://localhost:8000/metrics | grep sglang
# Check cache hit rate
curl http://localhost:8000/metrics | grep sglang_cache_hit_rate
If sglang_cache_hit_rate drops below 30%, check whether your system prompt has drifted. Even a single character difference creates a full cache miss.
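A small sketch of acting on that threshold programmatically. The metric name comes from the table above; the sample payload is made up, and the parser handles only the plain Prometheus text exposition format:

```python
# Minimal parser for Prometheus text output, used to alert on hit rate.
# The sample payload below is fabricated for illustration.

def parse_metric(text: str, name: str):
    # First matching sample line wins; "# HELP"/"# TYPE" lines are skipped
    # naturally because they do not start with the metric name.
    for line in text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    return None

sample = """\
# HELP sglang_cache_hit_rate RadixAttention prefix cache hit rate
# TYPE sglang_cache_hit_rate gauge
sglang_cache_hit_rate 0.81
"""

rate = parse_metric(sample, "sglang_cache_hit_rate")
if rate is not None and rate < 0.30:
    print(f"WARNING: hit rate {rate:.0%} - check for system prompt drift")
else:
    print(f"cache hit rate: {rate:.0%}")
```

In production you would fetch the text from http://localhost:8000/metrics instead of a hardcoded sample and wire the warning into your alerting.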
Load Balancing Across Multiple SGLang Instances
When horizontal scaling, session affinity is critical for multi-turn workloads. Without it, the same conversation lands on different instances each turn, and each instance starts with a cold cache for that conversation's history:
upstream sglang_backends {
# Sticky sessions by IP - keeps conversation history in same instance's KV cache
ip_hash;
server 10.0.0.1:8000;
server 10.0.0.2:8000;
server 10.0.0.3:8000;
}
server {
listen 80;
location /v1/ {
proxy_pass http://sglang_backends;
proxy_read_timeout 300s;
proxy_buffering off;
}
}
ip_hash is a reasonable starting point. For production agents, a session cookie or conversation ID in a header is more reliable since multiple agents may share the same IP. See the MCP server GPU deployment guide for a more complete session affinity example.
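The conversation-ID approach can also live in the application layer rather than the load balancer. A sketch with hypothetical backend addresses and ID scheme:

```python
import hashlib

# Application-layer session affinity: hash the conversation ID to a
# backend, the same idea as hashing a header in nginx. The backend list
# and the conversation-ID format are illustrative.

BACKENDS = ["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"]

def backend_for(conversation_id: str) -> str:
    # sha256 keeps the mapping stable across processes and restarts,
    # unlike Python's per-process salted built-in hash().
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]

print(backend_for("conv-42") == backend_for("conv-42"))  # True: stable routing
```

Note that plain modulo remaps most conversations whenever the backend list changes; if instances scale up and down frequently, a consistent-hashing ring limits the reshuffle to roughly 1/N of sessions.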
GPU-Level Monitoring
# Watch GPU utilization and memory every 5 seconds
nvidia-smi dmon -s pum -d 5
# For multi-GPU setups, check NVLink topology
nvidia-smi topo -m
For deeper GPU monitoring in production, see the GPU monitoring for ML guide.
Benchmark Results on Spheron H100
The full three-way benchmark covers SGLang vs vLLM vs TensorRT-LLM at unique-prompt workloads. Here are the SGLang-specific numbers across different agentic workload types, where RadixAttention's advantage shows clearly:
| Workload | TTFT p50 (10 req) | Throughput (50 req) | Cache Hit Rate |
|---|---|---|---|
| Unique prompts | 112 ms | 1,920 tok/s | 0% |
| RAG (1 shared 2k-token document) | 68 ms | 2,280 tok/s | ~72% |
| Multi-turn chat (4-turn history) | 54 ms | 2,480 tok/s | ~81% |
| Agentic (shared tool defs + memory) | 41 ms | 2,620 tok/s | ~88% |
Hardware: single H100 SXM5 80GB, Llama 3.3 70B Instruct at FP8, SGLang v0.5.9.
SGLang is worth choosing when your workload has shared prefixes: multi-turn conversations, agents with shared tool definitions, RAG pipelines that reuse the same context documents. For unique-prompt batch jobs, vLLM or TensorRT-LLM are equally good choices. If you want to push TTFT further, speculative decoding can add another 2-4x improvement on top of RadixAttention - see the speculative decoding production guide for configuration details. For multi-node orchestration above SGLang, the NVIDIA Dynamo disaggregated inference guide covers prefill-decode separation at scale.
SGLang's RadixAttention advantages are most pronounced on H100 and GH200 GPUs where high memory bandwidth makes KV cache reuse fast enough to matter at production scale. Spheron provides on-demand H100 SXM5 instances from $2.40/hr with no minimum commitment, matching the hardware profile where SGLang's cache hit rate gains translate to real latency improvements.
