Most inference servers recompute the KV cache on every request, even when 80% of the tokens are an identical system prompt or conversation history shared across every turn. SGLang's RadixAttention solves this by caching KV activations in a radix tree and reusing them across requests with shared prefixes, delivering significant TTFT reductions on workloads with 60%+ prefix overlap. If you want a raw engine comparison before committing to SGLang, the vLLM vs TensorRT-LLM vs SGLang benchmark covers throughput, latency, and VRAM numbers across all three on the same H100.
Why SGLang Wins for Agentic Workloads
Every agent turn arrives carrying a long, mostly-static context: tool definitions, memory state, and prior conversation turns. A standard inference server treats each request as independent and recomputes attention from scratch, including the tokens that haven't changed since the last turn.
RadixAttention changes this. It stores cached attention activations in a radix tree keyed by token sequence. When a new request shares a prefix with an existing cached entry, SGLang starts computation from the branching point instead of the beginning. Workloads where agents share a fixed system prompt and tool definitions across all sessions see 75-95% cache hit rates on multi-turn conversations.
| Engine | Cache Reuse Mechanism | Shared-Prefix Benefit |
|---|---|---|
| vLLM | PagedAttention + APC (opt-in) | Supported via APC; requires --enable-prefix-caching flag |
| TensorRT-LLM | Static KV cache allocation | None by default |
| SGLang | RadixAttention (shared radix tree) | Significant TTFT reduction at 60%+ overlap |
How RadixAttention Works
The radix tree stores token sequences as paths from root to leaf. Each node represents a sequence of tokens; child nodes extend the parent's prefix. When a request arrives, SGLang walks the tree to find the longest matching prefix already in cache and begins computation at that point instead of position 0.
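The prefix walk described above can be sketched as a toy per-token trie with a longest-cached-prefix lookup. This is an illustration of the idea only, not SGLang's actual implementation, which compresses token runs into radix edges and evicts least-recently-used branches when the KV pool fills up:

```python
# Toy per-token trie with longest-cached-prefix lookup. Illustration only.

class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.cached = False  # KV activations for this prefix are resident

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        # Record that the KV cache for this token sequence now exists.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.cached = True

    def longest_cached_prefix(self, tokens):
        # Return how many leading tokens can skip prefill entirely.
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or not child.cached:
                break
            node, matched = child, matched + 1
        return matched

cache = RadixCache()
cache.insert([1, 2, 3, 4])                        # e.g. system prompt + turn 1
print(cache.longest_cached_prefix([1, 2, 3, 9]))  # 3: prefill starts at token 4
```

A new request pays prefill cost only for the tokens past the matched prefix, which is why byte-identical system prompts matter so much.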
Two scenarios drive most of the cache hits:
- Shared system prompt: every request starts with the same 512-token system prompt. SGLang computes the KV cache for that prefix once and reuses it across all requests, regardless of how different the user turns are.
- Multi-turn conversation: each turn's history is the prefix for the next turn. SGLang accumulates the cached KV state for the entire conversation and only processes the new user message.
A simplified view of the tree structure:
[root]
|
+-- [system prompt: "You are a helpful assistant with access to tools..."]
|
+-- [user turn 1: "What is the weather in Paris?"]
| |
| +-- [assistant turn 1: "I'll check that for you..."]
| |
| +-- [user turn 2: "And in London?"]
|
+-- [user turn 1: "Summarize this document: ..."]
|
+-- [assistant turn 1: "The document covers..."]

All requests sharing the same system prompt reuse the cached activations from the root to the system prompt node. Each conversation branch is cached independently, so parallel agent sessions don't evict each other's prefixes under normal load.
--mem-fraction-static controls how much GPU memory is reserved for the KV cache versus model weights. Setting it too high leaves no room for activations during decode; too low limits cache depth. Start at 0.92 for single-model deployments on an 80GB H100.
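A back-of-envelope sketch of that trade-off, assuming ~70 GB of FP8 weights for a 70B model and Llama-3.3-70B attention dimensions (80 layers, 8 KV heads, head dim 128); these are rough assumptions for illustration, not measured values:

```python
# Rough KV budget for one 80 GB H100 at --mem-fraction-static 0.92.
# All sizes below are illustrative assumptions, not measurements.

GB = 1024**3
static_pool = 80 * GB * 0.92      # fraction reserved for weights + KV pool
weights_fp8 = 70 * GB             # ~1 byte/param for a 70B model at FP8
kv_pool = static_pool - weights_fp8

# Per-token KV size: 2 (K and V) * kv_heads * head_dim * layers * bytes/elem
layers, kv_heads, head_dim = 80, 8, 128
kv_bytes_per_token = 2 * kv_heads * head_dim * layers * 1  # FP8 -> 1 byte

print(f"KV pool: {kv_pool / GB:.1f} GB")
print(f"Cacheable tokens: {kv_pool // kv_bytes_per_token:,.0f}")
```

Under these assumptions each 0.01 of --mem-fraction-static is 0.8 GB, roughly 5,000 cached tokens, which is why the flag directly bounds how deep the radix tree can grow on large models.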
GPU Requirements by Model Size
| Model Size | Precision | GPU | VRAM Used | Key Flags | Spheron Price |
|---|---|---|---|---|---|
| 7B | FP16 | RTX 4090 (24GB) | ~14 GB | --tp 1 | check pricing |
| 13B | FP16 | A100 40GB | ~26 GB | --tp 1 | from $0.73/hr |
| 70B | FP8 | H100 SXM5 80GB | ~72 GB | --tp 1 --quantization fp8 | from $2.40/hr |
| 70B | FP16 | 2x H100 SXM5 | ~140 GB | --tp 2 | from $4.80/hr |
| 405B | FP8 | 8x H100 NVL | ~405 GB | --tp 8 --quantization fp8 | from $16.48/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
For a deeper look at GPU sizing for different model architectures, see the GPU memory requirements guide for LLMs and best GPU for AI inference 2026.
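The VRAM column in the table above follows a simple rule of thumb: parameter count times bytes per parameter, with KV cache and activations added on top. A hypothetical helper to reproduce the weight-only part:

```python
# Weight-memory rule of thumb behind the sizing table:
# params * bytes per parameter. KV cache, activations, and CUDA
# context come on top of these weight-only numbers.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weights_gb(params_billion: float, precision: str) -> float:
    # Billions of parameters * bytes each ~= GB of weight memory.
    return params_billion * BYTES_PER_PARAM[precision]

for size, prec in [(7, "fp16"), (13, "fp16"), (70, "fp8"), (70, "fp16")]:
    print(f"{size}B at {prec}: ~{weights_gb(size, prec):.0f} GB weights")
```

The billions-to-GB shortcut is slightly loose (10^9 bytes versus 2^30), which the headroom in the table's GPU choices absorbs.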
Step-by-Step: Deploy SGLang on Spheron
Step 1: Provision your Spheron GPU instance
Go to app.spheron.ai, select H100 SXM5 from the GPU catalog, choose your region, and provision. SSH in once the instance is ready. Verify you're on the right hardware:
nvidia-smi
The output should show NVIDIA H100 80GB HBM3 with 81,559 MiB total memory. For setup details, see the Spheron quick-start guide.
Step 2: Verify Docker and NVIDIA Container Toolkit
Most Spheron GPU instances ship with the NVIDIA Container Toolkit pre-installed. Confirm it's working:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
If this fails, install the toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Step 3: Single-GPU SGLang deployment
Pull and run the SGLang container. RadixAttention is on by default; no extra flags needed to enable it.
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.5.9-cu130-runtime \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--context-length 8192 \
--mem-fraction-static 0.92 \
--max-running-requests 128 \
--enable-metrics \
--host 0.0.0.0 \
--port 8000
Model loading takes 50-70 seconds on an H100. Once it's up:
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
Test with a completion:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 50
}'
Step 4: Multi-GPU with tensor parallelism
For 70B at FP16 (requiring 140GB), add --tp 2 to split attention heads and MLP layers across two GPUs:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.5.9-cu130-runtime \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--tp 2 \
--context-length 16384 \
--mem-fraction-static 0.90 \
--host 0.0.0.0 \
--port 8000
--ipc=host is not optional. NCCL uses shared memory for inter-GPU communication, and omitting this flag produces silent CUDA errors under load that are difficult to trace back to this cause. Always include it.
For data parallelism (replicate the full model across multiple GPUs to increase request throughput rather than handle larger models), add --dp 2.
Step 5: Create a systemd service for persistence
Run SGLang as a managed service so it restarts on failures and starts on boot:
[Unit]
Description=SGLang Inference Server
After=docker.service
Requires=docker.service
[Service]
Restart=always
RestartSec=5
ExecStartPre=-/usr/bin/docker rm -f sglang
ExecStart=/usr/bin/docker run --name sglang --gpus all --ipc=host -p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.5.9-cu130-runtime \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--context-length 8192 \
--mem-fraction-static 0.92 \
--enable-metrics \
--host 0.0.0.0 \
--port 8000
ExecStop=/usr/bin/docker stop sglang
[Install]
WantedBy=multi-user.target
Save to /etc/systemd/system/sglang.service, then:
sudo systemctl daemon-reload
sudo systemctl enable sglang
sudo systemctl start sglang
Step 6: Verify RadixAttention is working
Send 20 requests that share a long system prompt and measure TTFT on the first versus subsequent requests:
import asyncio
import time
import aiohttp
SYSTEM_PROMPT = "You are a helpful assistant. " * 100 # ~512 tokens
async def send_request(session, user_message):
start = time.monotonic()
first_token_time = None
async with session.post(
"http://localhost:8000/v1/chat/completions",
json={
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_message}
],
"max_tokens": 20,
"stream": True
}
) as resp:
async for chunk in resp.content:
if first_token_time is None and chunk.strip():
first_token_time = time.monotonic()
break
if first_token_time is None:
raise RuntimeError(f'No token received for: {user_message}')
return first_token_time - start
async def benchmark():
async with aiohttp.ClientSession() as session:
# First request: cold cache, computes the system prompt prefix
first = await send_request(session, "What is 2+2?")
print(f"First request TTFT: {first*1000:.0f} ms")
# Subsequent requests: shared prefix cached
times = []
for i in range(19):
t = await send_request(session, f"What is {i+3}+{i+3}?")
times.append(t)
avg = sum(times) / len(times) * 1000
print(f"Subsequent requests avg TTFT: {avg:.0f} ms")
asyncio.run(benchmark())
Expected output: first request around 280-320ms TTFT, subsequent requests 80-120ms. The first request builds the KV cache for the system prompt; subsequent requests reuse it and only process the new user turn. If subsequent requests are still around 280ms, the system prompt is not byte-identical across requests, which invalidates the cache each time.
Optimizing for Agentic Workloads
Prefix Caching: What Actually Affects Hit Rate
Three things determine how much benefit you actually see:
System prompt consistency. The radix tree keys on exact byte sequences. A single extra space, a newline difference, or a version string in the system prompt creates a different cache key and a full cache miss. Pin your system prompt as a constant string in your application code, not a template that gets regenerated.
Conversation history structure. For multi-turn, pass the full conversation history as a single prefix before the new user turn. If you reconstruct the messages array differently each turn (e.g., trimming old messages and re-adding them), you're creating new prefixes that don't match the cached tree.
Chunked prefill. For long-context requests, --chunked-prefill-size 4096 lets decode start while the long prefix is still being processed by the prefill stage. This doesn't improve cache hit rate, but it reduces user-perceived latency by overlapping the two phases.
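The conversation-history point above comes down to treating the messages array as append-only. A minimal sketch, assuming the OpenAI-style message format the server already accepts; SYSTEM_PROMPT and record_turn are illustrative names, not an SDK API:

```python
# Append-only conversation state: each request extends the previous token
# sequence, so the server sees prior turns as an already-cached prefix.

SYSTEM_PROMPT = "You are a helpful assistant with access to tools."

history = [{"role": "system", "content": SYSTEM_PROMPT}]  # built once, never rebuilt

def record_turn(user_text: str, assistant_text: str) -> None:
    # Append the new turn; never trim, reorder, or re-render earlier
    # messages, or the byte sequence changes and the cached prefix dies.
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})

record_turn("What is the weather in Paris?", "I'll check that for you...")
record_turn("And in London?", "Checking London now...")
print(len(history))  # 5 messages; the first 3 match the prior request byte for byte
```

If you must bound context length, drop whole oldest turns rather than rewriting surviving ones, so the remaining prefix still matches a cached branch.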
Monitor your actual hit rate:
curl http://localhost:8000/metrics | grep sglang_cache_hit_rate
Structured Output for Tool Calling
SGLang's constrained decoding engine enforces JSON schemas at the token level. It cannot produce invalid JSON, which removes an entire class of agent failures where the model generates a malformed function call payload:
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct",
messages=[{"role": "user", "content": "Get weather for Paris"}],
response_format={
"type": "json_schema",
"json_schema": {
"name": "weather_call",
"schema": {
"type": "object",
"properties": {
"function": {"type": "string"},
"location": {"type": "string"}
},
"required": ["function", "location"]
}
}
}
)
print(response.choices[0].message.content)
# Always valid JSON matching the schema
Concurrent Agent Sessions
The throughput advantage from RadixAttention compounds with concurrency because more sessions share more prefix nodes in the radix tree:
| Concurrency | Unique Prompts (tok/s) | 80% Shared Prefix (tok/s) | Cache Hit Rate |
|---|---|---|---|
| 1 | 125 | 125 | - |
| 10 | 680 | 890 | ~78% |
| 50 | 1,920 | 2,480 | ~82% |
| 100 | 2,460 | 3,100 | ~85% |
On unique-prompt workloads, SGLang performs similarly to vLLM; the gap opens specifically when shared prefixes are present.
SGLang vs vLLM vs TensorRT-LLM vs LMDeploy: Decision Framework
| Use This | When |
|---|---|
| SGLang | Multi-turn conversations, agentic workloads, RAG with shared context documents, tool-calling agents |
| vLLM | Broadest model support needed, frequent model updates, team with limited DevOps capacity |
| TensorRT-LLM | Single fixed model, max throughput priority, comfortable with a 28-minute compile pipeline |
| LMDeploy | Smaller deployment footprint, TurboMind backend preference |
For the full throughput and TTFT numbers, see the inference framework benchmark covering SGLang, vLLM, and TensorRT-LLM. All four engines expose OpenAI-compatible APIs, so switching engines does not require application code changes.
For Spheron-specific configuration, see Spheron's SGLang docs and the Spheron LLM quick-guides.
Production Monitoring
Prometheus Metrics
SGLang exposes a Prometheus-compatible /metrics endpoint. To activate it, you must pass --enable-metrics in the launch command (already included in the Docker commands above).
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| sglang_num_queue_reqs | Queue depth | > 20 sustained |
| sglang_token_usage | KV cache fill % | > 85% |
| sglang_cache_hit_rate | RadixAttention effectiveness | < 30% (check prefix consistency) |
| sglang_time_to_first_token_seconds | TTFT p50/p95 | p95 > 2s |
| sglang_num_running_reqs | Active request count | Near --max-running-requests |
# Scrape metrics
curl http://localhost:8000/metrics | grep sglang
# Check cache hit rate
curl http://localhost:8000/metrics | grep sglang_cache_hit_rate
If sglang_cache_hit_rate drops below 30%, check whether your system prompt has drifted. Even a single character difference creates a full cache miss.
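A small sketch of acting on that threshold programmatically. The metric name comes from the table above; the sample payload is made up, and the parser handles only the plain Prometheus text exposition format:

```python
# Minimal parser for Prometheus text output, used to alert on hit rate.
# The sample payload below is fabricated for illustration.

def parse_metric(text: str, name: str):
    # First matching sample line wins; "# HELP"/"# TYPE" lines are skipped
    # naturally because they do not start with the metric name.
    for line in text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    return None

sample = """\
# HELP sglang_cache_hit_rate RadixAttention prefix cache hit rate
# TYPE sglang_cache_hit_rate gauge
sglang_cache_hit_rate 0.81
"""

rate = parse_metric(sample, "sglang_cache_hit_rate")
if rate is not None and rate < 0.30:
    print(f"WARNING: hit rate {rate:.0%} - check for system prompt drift")
else:
    print(f"cache hit rate: {rate:.0%}")
```

In production you would fetch the text from http://localhost:8000/metrics instead of a hardcoded sample and wire the warning into your alerting.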
Load Balancing Across Multiple SGLang Instances
When horizontal scaling, session affinity is critical for multi-turn workloads. Without it, the same conversation lands on different instances each turn, and each instance starts with a cold cache for that conversation's history:
upstream sglang_backends {
# Sticky sessions by IP - keeps conversation history in same instance's KV cache
ip_hash;
server 10.0.0.1:8000;
server 10.0.0.2:8000;
server 10.0.0.3:8000;
}
server {
listen 80;
location /v1/ {
proxy_pass http://sglang_backends;
proxy_read_timeout 300s;
proxy_buffering off;
}
}
ip_hash is a reasonable starting point. For production agents, a session cookie or conversation ID in a header is more reliable since multiple agents may share the same IP. See the MCP server GPU deployment guide for a more complete session affinity example.
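The conversation-ID approach can also live in the application layer rather than the load balancer. A sketch with hypothetical backend addresses and ID scheme:

```python
import hashlib

# Application-layer session affinity: hash the conversation ID to a
# backend, the same idea as hashing a header in nginx. The backend list
# and the conversation-ID format are illustrative.

BACKENDS = ["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"]

def backend_for(conversation_id: str) -> str:
    # sha256 keeps the mapping stable across processes and restarts,
    # unlike Python's per-process salted built-in hash().
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]

print(backend_for("conv-42") == backend_for("conv-42"))  # True: stable routing
```

Note that plain modulo remaps most conversations whenever the backend list changes; if instances scale up and down frequently, a consistent-hashing ring limits the reshuffle to roughly 1/N of sessions.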
GPU-Level Monitoring
# Watch GPU utilization and memory every 5 seconds
nvidia-smi dmon -s pum -d 5
# For multi-GPU setups, check NVLink topology
nvidia-smi topo -m
For deeper GPU monitoring in production, see the GPU monitoring for ML guide.
Benchmark Results on Spheron H100
The full three-way benchmark covers SGLang vs vLLM vs TensorRT-LLM at unique-prompt workloads. Here are the SGLang-specific numbers across different agentic workload types, where RadixAttention's advantage shows clearly:
| Workload | TTFT p50 (10 req) | Throughput (50 req) | Cache Hit Rate |
|---|---|---|---|
| Unique prompts | 112 ms | 1,920 tok/s | 0% |
| RAG (1 shared 2k-token document) | 68 ms | 2,280 tok/s | ~72% |
| Multi-turn chat (4-turn history) | 54 ms | 2,480 tok/s | ~81% |
| Agentic (shared tool defs + memory) | 41 ms | 2,620 tok/s | ~88% |
Hardware: single H100 SXM5 80GB, Llama 3.3 70B Instruct at FP8, SGLang v0.5.9.
SGLang is worth choosing when your workload has shared prefixes: multi-turn conversations, agents with shared tool definitions, RAG pipelines that reuse the same context documents. For unique-prompt batch jobs, vLLM or TensorRT-LLM are equally good choices. If you want to push TTFT further, speculative decoding can add another 2-4x improvement on top of RadixAttention - see the speculative decoding production guide for configuration details. For multi-node orchestration above SGLang, the NVIDIA Dynamo disaggregated inference guide covers prefill-decode separation at scale.
SGLang's RadixAttention advantages are most pronounced on H100 and GH200 GPUs where high memory bandwidth makes KV cache reuse fast enough to matter at production scale. Spheron provides on-demand H100 SXM5 instances from $2.40/hr with no minimum commitment, matching the hardware profile where SGLang's cache hit rate gains translate to real latency improvements.
