
SGLang Production Deployment Guide: RadixAttention and Multi-Turn Inference on GPU Cloud (2026)

Written by Mitrasish, Co-founder · Mar 31, 2026

Tags: SGLang, LLM Inference, GPU Cloud, RadixAttention, H100, Multi-Turn, Agentic AI

Most inference servers recompute the KV cache on every request, even when 80% of the tokens are an identical system prompt or conversation history shared across every turn. SGLang's RadixAttention solves this by caching KV activations in a radix tree and reusing them across requests with shared prefixes, delivering significant TTFT reductions on workloads with 60%+ prefix overlap. If you want a raw engine comparison before committing to SGLang, the vLLM vs TensorRT-LLM vs SGLang benchmark covers throughput, latency, and VRAM numbers across all three on the same H100.

Why SGLang Wins for Agentic Workloads

Every agent turn arrives carrying a long, mostly-static context: tool definitions, memory state, and prior conversation turns. A standard inference server treats each request as independent and recomputes attention from scratch, including the tokens that haven't changed since the last turn.

RadixAttention changes this. It stores cached attention activations in a radix tree keyed by token sequence. When a new request shares a prefix with an existing cached entry, SGLang starts computation from the branching point instead of the beginning. Workloads where agents share a fixed system prompt and tool definitions across all sessions see 75-95% cache hit rates on multi-turn conversations.

| Engine | Cache Reuse Mechanism | Shared-Prefix Benefit |
| --- | --- | --- |
| vLLM | PagedAttention + APC (opt-in) | Supported via APC; requires --enable-prefix-caching flag |
| TensorRT-LLM | Static KV cache allocation | None by default |
| SGLang | RadixAttention (shared radix tree) | Significant TTFT reduction at 60%+ overlap |

How RadixAttention Works

The radix tree stores token sequences as paths from root to leaf. Each node represents a sequence of tokens; child nodes extend the parent's prefix. When a request arrives, SGLang walks the tree to find the longest matching prefix already in cache and begins computation at that point instead of position 0.

Two scenarios drive most of the cache hits:

  1. Shared system prompt: every request starts with the same 512-token system prompt. SGLang computes the KV cache for that prefix once and reuses it across all requests, regardless of how different the user turns are.
  2. Multi-turn conversation: each turn's history is the prefix for the next turn. SGLang accumulates the cached KV state for the entire conversation and only processes the new user message.

A simplified view of the tree structure:

[root]
  |
  +-- [system prompt: "You are a helpful assistant with access to tools..."]
        |
        +-- [user turn 1: "What is the weather in Paris?"]
        |     |
        |     +-- [assistant turn 1: "I'll check that for you..."]
        |           |
        |           +-- [user turn 2: "And in London?"]
        |
        +-- [user turn 1: "Summarize this document: ..."]
              |
              +-- [assistant turn 1: "The document covers..."]

All requests sharing the same system prompt reuse the cached activations from the root to the system prompt node. Each conversation branch is cached independently, so parallel agent sessions don't evict each other's prefixes under normal load.
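The prefix walk described above can be sketched in a few lines of Python. This is a toy trie over token IDs for illustration only; SGLang's actual radix tree compresses multi-token edges and handles eviction, reference counting, and GPU memory management.

```python
# Illustrative sketch of RadixAttention-style prefix matching (a plain trie,
# not SGLang's actual compressed radix tree).

class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.has_kv = False  # True if KV activations are cached up to here

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record that KV state for this token sequence is now cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.has_kv = True

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or not child.has_kv:
                break
            node, matched = child, matched + 1
        return matched

cache = PrefixCache()
system_prompt = [1, 2, 3, 4]            # shared system prompt tokens
cache.insert(system_prompt + [10, 11])  # first conversation's turn

# A new request with the same system prompt skips recomputing those tokens
# and starts attention computation at position 4 instead of position 0.
request = system_prompt + [20, 21]
print(cache.longest_cached_prefix(request))  # → 4
```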

--mem-fraction-static controls how much GPU memory is reserved for the KV cache versus model weights. Setting it too high leaves no room for activations during decode; too low limits cache depth. Start at 0.92 for single-model deployments on an 80GB H100.
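A rough way to reason about that trade-off: the KV cache budget is roughly total VRAM × mem-fraction minus weight memory, divided by the per-token KV footprint. The sketch below uses approximate Llama 3.3 70B FP8 figures; the layer count, KV-head count, head dimension, and weight size are assumptions for illustration, not exact SGLang accounting.

```python
# Back-of-the-envelope KV cache capacity for --mem-fraction-static.
# All model figures below are approximate assumptions for illustration.

def kv_cache_tokens(gpu_gb, mem_fraction_static, weights_gb,
                    layers, kv_heads, head_dim, dtype_bytes):
    # Memory left for KV cache after reserving weights inside the fraction
    budget_gb = gpu_gb * mem_fraction_static - weights_gb
    # K and V tensors, per layer, per token
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return int(budget_gb * 1024**3 / per_token_bytes)

tokens = kv_cache_tokens(
    gpu_gb=80, mem_fraction_static=0.92, weights_gb=70,
    layers=80, kv_heads=8, head_dim=128, dtype_bytes=1,  # FP8 KV cache
)
print(f"~{tokens:,} cacheable tokens")  # roughly 23-24k with these assumptions
```

This makes the failure mode concrete: at 0.92 only a few GB remain for cached prefixes with a 70B FP8 model, so pushing the fraction higher starves decode activations while pushing it lower shrinks the radix tree.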

GPU Requirements by Model Size

| Model Size | Precision | GPU | VRAM Used | Key Flags | Spheron Price |
| --- | --- | --- | --- | --- | --- |
| 7B | FP16 | RTX 4090 (24GB) | ~14 GB | --tp 1 | check pricing |
| 13B | FP16 | A100 40GB | ~26 GB | --tp 1 | from $0.73/hr |
| 70B | FP8 | H100 SXM5 80GB | ~72 GB | --tp 1 --quantization fp8 | from $2.40/hr |
| 70B | FP16 | 2x H100 SXM5 | ~140 GB | --tp 2 | from $4.80/hr |
| 405B | FP8 | 8x H100 NVL | ~405 GB | --tp 8 --quantization fp8 | from $16.48/hr |

Pricing fluctuates based on GPU availability. The prices above are as of 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

For a deeper look at GPU sizing for different model architectures, see the GPU memory requirements guide for LLMs and best GPU for AI inference 2026.

Step-by-Step: Deploy SGLang on Spheron

Step 1: Provision your Spheron GPU instance

Go to app.spheron.ai, select H100 SXM5 from the GPU catalog, choose your region, and provision. SSH in once the instance is ready. Verify you're on the right hardware:

bash
nvidia-smi

The output should show NVIDIA H100 80GB HBM3 with 81,559 MiB total memory. For setup details, see the Spheron quick-start guide.

Step 2: Verify Docker and NVIDIA Container Toolkit

Most Spheron GPU instances ship with the NVIDIA Container Toolkit pre-installed. Confirm it's working:

bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

If this fails, install the toolkit:

bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Step 3: Single-GPU SGLang deployment

Pull and run the SGLang container. RadixAttention is on by default; no extra flags needed to enable it.

bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  lmsysorg/sglang:v0.5.9-cu130-runtime \
  python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.3-70B-Instruct \
    --quantization fp8 \
    --context-length 8192 \
    --mem-fraction-static 0.92 \
    --max-running-requests 128 \
    --enable-metrics \
    --host 0.0.0.0 \
    --port 8000

Model loading takes 50-70 seconds on an H100. Once it's up:

bash
curl http://localhost:8000/health
curl http://localhost:8000/v1/models

Test with a completion:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'

Step 4: Multi-GPU with tensor parallelism

For 70B at FP16 (requiring 140GB), add --tp 2 to split attention heads and MLP layers across two GPUs:

bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  lmsysorg/sglang:v0.5.9-cu130-runtime \
  python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.3-70B-Instruct \
    --tp 2 \
    --context-length 16384 \
    --mem-fraction-static 0.90 \
    --host 0.0.0.0 \
    --port 8000

--ipc=host is not optional. NCCL uses shared memory for inter-GPU communication, and omitting this flag produces silent CUDA errors under load that are difficult to trace back to this cause. Always include it.

For data parallelism (replicate the full model across multiple GPUs to increase request throughput rather than handle larger models), add --dp 2.

Step 5: Create a systemd service for persistence

Run SGLang as a managed service so it restarts on failures and starts on boot:

ini
[Unit]
Description=SGLang Inference Server
After=docker.service
Requires=docker.service

[Service]
Restart=always
RestartSec=5
ExecStartPre=-/usr/bin/docker rm -f sglang
ExecStart=/usr/bin/docker run --name sglang --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  lmsysorg/sglang:v0.5.9-cu130-runtime \
  python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.3-70B-Instruct \
    --quantization fp8 \
    --context-length 8192 \
    --mem-fraction-static 0.92 \
    --enable-metrics \
    --host 0.0.0.0 \
    --port 8000
ExecStop=/usr/bin/docker stop sglang

[Install]
WantedBy=multi-user.target

Save to /etc/systemd/system/sglang.service, then:

bash
sudo systemctl daemon-reload
sudo systemctl enable sglang
sudo systemctl start sglang

Step 6: Verify RadixAttention is working

Send 20 requests that share a long system prompt and measure TTFT on the first versus subsequent requests:

python
import asyncio
import time
import aiohttp

SYSTEM_PROMPT = "You are a helpful assistant. " * 100  # ~512 tokens

async def send_request(session, user_message):
    start = time.monotonic()
    first_token_time = None
    async with session.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.3-70B-Instruct",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_message}
            ],
            "max_tokens": 20,
            "stream": True
        }
    ) as resp:
        async for chunk in resp.content:
            if first_token_time is None and chunk.strip():
                first_token_time = time.monotonic()
                break
    if first_token_time is None:
        raise RuntimeError(f'No token received for: {user_message}')
    return first_token_time - start

async def benchmark():
    async with aiohttp.ClientSession() as session:
        # First request: cold cache, computes the system prompt prefix
        first = await send_request(session, "What is 2+2?")
        print(f"First request TTFT: {first*1000:.0f} ms")

        # Subsequent requests: shared prefix cached
        times = []
        for i in range(19):
            t = await send_request(session, f"What is {i+3}+{i+3}?")
            times.append(t)
        avg = sum(times) / len(times) * 1000
        print(f"Subsequent requests avg TTFT: {avg:.0f} ms")

asyncio.run(benchmark())

Expected output: first request around 280-320ms TTFT, subsequent requests 80-120ms. The first request builds the KV cache for the system prompt; subsequent requests reuse it and only process the new user turn. If subsequent requests are still around 280ms, the system prompt is not byte-identical across requests, which invalidates the cache each time.

Optimizing for Agentic Workloads

Prefix Caching: What Actually Affects Hit Rate

Three things kill your cache hit rate:

System prompt consistency. The radix tree keys on exact byte sequences. A single extra space, a newline difference, or a version string in the system prompt creates a different cache key and a full cache miss. Pin your system prompt as a constant string in your application code, not a template that gets regenerated.

Conversation history structure. For multi-turn, pass the full conversation history as a single prefix before the new user turn. If you reconstruct the messages array differently each turn (e.g., trimming old messages and re-adding them), you're creating new prefixes that don't match the cached tree.
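A minimal sketch of the append-only pattern, assuming an OpenAI-style messages array (the helper function is hypothetical, not a library API):

```python
# Sketch: build the messages array append-only so each turn's serialized
# prompt is byte-identical to the previous request's prefix, which keeps
# RadixAttention cache hits. Pin the system prompt as a constant.
SYSTEM_PROMPT = "You are a helpful assistant with access to tools."

history = [{"role": "system", "content": SYSTEM_PROMPT}]

def next_request(user_message):
    """Append the new turn and return the full history; never rebuild or
    trim earlier messages in place, which would change the cached prefix."""
    history.append({"role": "user", "content": user_message})
    return list(history)

req1 = next_request("What is the weather in Paris?")
history.append({"role": "assistant", "content": "I'll check that for you..."})
req2 = next_request("And in London?")

# req2 begins with every message of req1, so the serialized prompt of the
# second turn shares the first turn's entire prefix in the radix tree.
assert req2[:len(req1)] == req1
```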

Chunked prefill. For long-context requests, --chunked-prefill-size 4096 splits a long prefill into chunks so the scheduler can interleave decode steps for already-running requests instead of blocking on one huge prompt. This doesn't improve cache hit rate, but it prevents a single long-context request from stalling inter-token latency for everyone else.

Monitor your actual hit rate:

bash
curl http://localhost:8000/metrics | grep sglang_cache_hit_rate

Structured Output for Tool Calling

SGLang's constrained decoding engine enforces JSON schemas at the token level. It cannot produce invalid JSON, which removes an entire class of agent failures where the model generates a malformed function call payload:

python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Get weather for Paris"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "weather_call",
            "schema": {
                "type": "object",
                "properties": {
                    "function": {"type": "string"},
                    "location": {"type": "string"}
                },
                "required": ["function", "location"]
            }
        }
    }
)

print(response.choices[0].message.content)
# Always valid JSON matching the schema

Concurrent Agent Sessions

The throughput advantage from RadixAttention compounds with concurrency because more sessions share more prefix nodes in the radix tree:

| Concurrency | Unique Prompts (tok/s) | 80% Shared Prefix (tok/s) | Cache Hit Rate |
| --- | --- | --- | --- |
| 1 | 125 | 125 | - |
| 10 | 680 | 890 | ~78% |
| 50 | 1,920 | 2,480 | ~82% |
| 100 | 2,460 | 3,100 | ~85% |

At unique-prompt workloads, SGLang performs similarly to vLLM. The gap opens up specifically when shared prefixes are present.

SGLang vs vLLM vs TensorRT-LLM vs LMDeploy: Decision Framework

| Use This | When |
| --- | --- |
| SGLang | Multi-turn conversations, agentic workloads, RAG with shared context documents, tool-calling agents |
| vLLM | Broadest model support needed, frequent model updates, team with limited DevOps capacity |
| TensorRT-LLM | Single fixed model, max throughput priority, comfortable with a 28-minute compile pipeline |
| LMDeploy | Smaller deployment footprint, TurboMind backend preference |

For the full throughput and TTFT numbers across SGLang, vLLM, and TensorRT-LLM, see the inference framework benchmark. All four engines expose OpenAI-compatible APIs, so switching between them does not require application code changes.

For Spheron-specific configuration, see Spheron's SGLang docs and the Spheron LLM quick-guides.

Production Monitoring

Prometheus Metrics

SGLang exposes a Prometheus-compatible /metrics endpoint. To activate it, you must pass --enable-metrics in the launch command (already included in the Docker commands above).

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| sglang_num_queue_reqs | Queue depth | > 20 sustained |
| sglang_token_usage | KV cache fill % | > 85% |
| sglang_cache_hit_rate | RadixAttention effectiveness | < 30% (check prefix consistency) |
| sglang_time_to_first_token_seconds | TTFT p50/p95 | p95 > 2s |
| sglang_num_running_reqs | Active request count | Near --max-running-requests |

bash
# Scrape metrics
curl http://localhost:8000/metrics | grep sglang

# Check cache hit rate
curl http://localhost:8000/metrics | grep sglang_cache_hit_rate

If sglang_cache_hit_rate drops below 30%, check whether your system prompt has drifted. Even a single character difference creates a full cache miss.
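A minimal polling sketch that parses the Prometheus text format for this metric. The HTTP fetch is left commented out, and the sample scrape line is made up for illustration:

```python
# Sketch: extract sglang_cache_hit_rate from a Prometheus /metrics scrape
# and warn when it falls below the 30% threshold discussed above.
import re

def cache_hit_rate(metrics_text):
    """Return the first sglang_cache_hit_rate sample, or None if absent."""
    m = re.search(r"^sglang_cache_hit_rate\S*\s+([0-9.eE+-]+)",
                  metrics_text, re.M)
    return float(m.group(1)) if m else None

# In production, fetch the text with any HTTP client, e.g.:
#   text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
text = "sglang_cache_hit_rate 0.81\n"  # made-up sample scrape

rate = cache_hit_rate(text)
if rate is not None and rate < 0.30:
    print("WARNING: low cache hit rate; check system prompt consistency")
print(rate)  # → 0.81
```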

Load Balancing Across Multiple SGLang Instances

When scaling horizontally, session affinity is critical for multi-turn workloads. Without it, the same conversation lands on a different instance each turn, and each instance starts with a cold cache for that conversation's history:

nginx
upstream sglang_backends {
    # Sticky sessions by IP - keeps conversation history in same instance's KV cache
    ip_hash;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://sglang_backends;
        proxy_read_timeout 300s;
        proxy_buffering off;
    }
}

ip_hash is a reasonable starting point. For production agents, a session cookie or conversation ID in a header is more reliable since multiple agents may share the same IP. See the MCP server GPU deployment guide for a more complete session affinity example.
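One way to implement conversation-ID affinity at the application layer is a deterministic hash of the ID before dispatching the request. The backend list and ID scheme below are hypothetical:

```python
# Sketch: route by conversation ID instead of client IP, so every turn of
# one conversation hits the same SGLang instance's warm radix tree.
import hashlib

BACKENDS = ["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"]

def backend_for(conversation_id: str) -> str:
    """Map a conversation ID to a fixed backend via a stable hash.
    (Python's built-in hash() is salted per process, so use hashlib.)"""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]

# The same conversation always maps to the same instance:
assert backend_for("conv-42") == backend_for("conv-42")
print(backend_for("conv-42"))
```

Note that plain modulo hashing remaps many conversations when the backend list changes; a consistent-hashing ring reduces that churn if you scale the pool frequently.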

GPU-Level Monitoring

bash
# Watch GPU utilization and memory every 5 seconds
nvidia-smi dmon -s pum -d 5

# For multi-GPU setups, check NVLink topology
nvidia-smi topo -m

For deeper GPU monitoring in production, see the GPU monitoring for ML guide.

Benchmark Results on Spheron H100

The full three-way benchmark covers SGLang vs vLLM vs TensorRT-LLM at unique-prompt workloads. Here are the SGLang-specific numbers across different agentic workload types, where RadixAttention's advantage shows clearly:

| Workload | TTFT p50 (10 req) | Throughput (50 req) | Cache Hit Rate |
| --- | --- | --- | --- |
| Unique prompts | 112 ms | 1,920 tok/s | 0% |
| RAG (1 shared 2k-token document) | 68 ms | 2,280 tok/s | ~72% |
| Multi-turn chat (4-turn history) | 54 ms | 2,480 tok/s | ~81% |
| Agentic (shared tool defs + memory) | 41 ms | 2,620 tok/s | ~88% |

Hardware: single H100 SXM5 80GB, Llama 3.3 70B Instruct at FP8, SGLang v0.5.9.



SGLang is worth choosing when your workload has shared prefixes: multi-turn conversations, agents with shared tool definitions, RAG pipelines that reuse the same context documents. For unique-prompt batch jobs, vLLM or TensorRT-LLM are equally good choices. If you want to push generation latency further, speculative decoding can add another 2-4x improvement on top of RadixAttention; see the speculative decoding production guide for configuration details. For multi-node orchestration above SGLang, the NVIDIA Dynamo disaggregated inference guide covers prefill-decode separation at scale.

SGLang's RadixAttention advantages are most pronounced on H100 and GH200 GPUs where high memory bandwidth makes KV cache reuse fast enough to matter at production scale. Spheron provides on-demand H100 SXM5 instances from $2.40/hr with no minimum commitment, matching the hardware profile where SGLang's cache hit rate gains translate to real latency improvements.

Rent H100 → | View all GPU pricing →

Get started on Spheron →
