Every AI agent call produces structured output or invokes a function. The inference engine handling that call pays a measurable, often under-documented cost to enforce schemas. If you're running AI agent infrastructure at any meaningful scale, that cost adds up. Most teams discover this only after they benchmark, and the results surprise them. This guide covers what drives that overhead, how vLLM and SGLang compare on structured output workloads specifically, and how to tune your setup so you're not paying 40-60% extra latency for deeply nested schemas. For a broader framework comparison, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
Why Structured Output and Function Calling Drive Agent Workloads
Agents depend on parseable, typed output. A freeform generation that produces "Sure, here is the data you asked for: name is Alice, score is 92.5" is useless when the downstream step expects {"name": "Alice", "score": 92.5}. Parse errors cascade: a failed JSON.parse() kills the pipeline step, forces a retry, and burns tokens on a second inference call. At production volume, a 2% parse error rate on freeform output is catastrophic.
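To make the cascade concrete, here is a minimal sketch contrasting the two outputs above; `json.loads` stands in for the downstream `JSON.parse()`:

```python
import json

freeform = "Sure, here is the data you asked for: name is Alice, score is 92.5"
structured = '{"name": "Alice", "score": 92.5}'

# Freeform output is not JSON: this is the parse error that kills a pipeline step.
try:
    json.loads(freeform)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

# Constrained output parses directly into a typed record.
record = json.loads(structured)

print(parsed_ok)        # False: the freeform sentence fails to parse
print(record["score"])  # 92.5
```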
Two patterns dominate agent implementations:
- JSON schema enforcement (response_format with json_schema): the inference server constrains token sampling so the output is always valid JSON matching the schema. No post-processing, no retry loops, no error handling for malformed output.
- Function/tool calling (tools array in OpenAI API format): the model selects a function, fills its parameters, and returns a valid JSON object matching the function signature. The agent loop reads the tool call, executes it, and feeds the result back.
Both patterns rely on constrained decoding at the token level. Here is the request flow:
```
Agent loop
    |
    v
POST /v1/chat/completions
  {messages, response_format OR tools}
    |
    v
Inference server (vLLM / SGLang)
  - grammar automaton compiled from schema
  - token sampling constrained at each step
    |
    v
Guaranteed valid JSON output
    |
    v
Tool executor / next pipeline step
```

Without constrained decoding, even well-prompted models fail at JSON structure roughly 1-5% of the time for complex schemas, and more often for models below 13B parameters.
The Performance Cost of Constrained Decoding
Constrained decoding runs a finite state machine in lockstep with decoding. Before sampling each token, the engine intersects the model's probability distribution with the set of tokens allowed by the current grammar state, so only valid continuations can be selected.
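A toy sketch of that intersection step, with a made-up four-token vocabulary and a grammar state that only permits `{` or whitespace at the start of a JSON object (illustrative only, not either engine's actual code):

```python
# Toy constrained-decoding step: mask disallowed tokens, renormalize, pick greedily.
# Vocabulary and probabilities are invented for illustration.
vocab = ["{", "hello", " ", "}"]
model_probs = [0.10, 0.60, 0.05, 0.25]  # model prefers "hello", which the grammar forbids

# Grammar state at the start of a JSON object: only "{" or whitespace are legal.
allowed = {"{", " "}

masked = [p if tok in allowed else 0.0 for tok, p in zip(vocab, model_probs)]
total = sum(masked)
constrained = [p / total for p in masked]

# Greedy pick from the constrained distribution.
next_token = vocab[constrained.index(max(constrained))]
print(next_token)  # "{" — valid JSON is guaranteed even when the raw model disagrees
```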
The FSM must be compiled from your schema before decoding starts. If the compiled grammar isn't cached yet, that compilation lands directly on TTFT (time to first token). This is why schema complexity hits TTFT hardest: a deeply nested schema with optional fields and union types can take 50-100ms just to compile the grammar automaton. Once compiled and cached, subsequent tokens in the same request pay only the per-token masking cost, which is much smaller.
| Schema complexity | Example | Latency overhead (outlines backend) |
|---|---|---|
| Simple flat object | {name: string, score: float} | ~5-10% |
| Moderate with enums | {action: "search" or "answer", query: string} | ~15-25% |
| Nested with arrays | Function call with array of params | ~30-40% |
| Deeply nested optional | Multi-tool response with nullable fields | ~40-60% |
These overhead figures reflect the older outlines backend, which expands the full FSM upfront. The xgrammar backend (default in both recent vLLM and SGLang) uses a more efficient grammar representation than outlines' full FSM expansion and reduces per-token overhead to near-zero. With xgrammar, the cost concentrates in a one-time compilation step (20-50ms) rather than spreading across every token. On repeated calls with the same schema, grammar caching eliminates even that.
Decode throughput impact is smaller once structure is established, because the grammar state machine narrows the allowed token set and actually speeds up sampling slightly in some cases. The real cost is upfront.
Grammar caching (xgrammar backend in both vLLM and SGLang) addresses this directly: compile once, reuse across all calls with the same schema. An agent loop that always calls the same tool with the same parameter schema pays the compilation cost exactly once per process lifetime.
For throughput optimization strategies beyond constrained decoding, see the speculative decoding production guide.
vLLM vs SGLang for Structured Output: Benchmark Results
These benchmarks test Llama 3.1 8B Instruct specifically for structured output workloads. Llama 3.1 8B fits comfortably on both test GPUs in FP16 and has solid tool-calling support via its chat template.
Test hardware:
- H100 SXM5 (tested on Spheron, $2.40/GPU/hr on-demand)
- A100 80GB PCIe (tested on Spheron, $1.07/GPU/hr on-demand)
Test scenarios:
- Unconstrained generation (baseline, no schema)
- Simple JSON schema: flat object, 3 fields
- Complex JSON schema: nested, 8 fields with enums
- Function calling: 2 tools defined, one selected per call
Concurrency levels: 1, 8, and 32 parallel requests. Throughput in tokens/second; P95 latency in milliseconds. Both engines run with xgrammar as the guided decoding backend, grammar caching enabled, and the default KV cache configuration for each GPU. Each scenario runs 500 requests after a 50-request warm-up; reported figures are the median of three independent runs. Benchmarks were run on Spheron GPU instances using the same prompt corpus (512 input tokens, 128 output tokens) for structured output scenarios. Results reflect inference-only throughput with no streaming; your numbers will vary based on prompt length, output length, and model size.
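P95 latency here is the standard 95th-percentile figure over a run's samples; a minimal nearest-rank implementation (a sketch, not the actual benchmark harness):

```python
def p95(samples_ms: list[float]) -> float:
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[rank]

# 100 synthetic latencies: 95 fast requests and 5 slow outliers.
latencies = [40.0] * 95 + [600.0] * 5
print(p95(latencies))  # 40.0: the slow tail sits just past the 95th percentile
```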
H100 SXM5
Throughput (tokens/sec)
| Scenario | vLLM c=1 | vLLM c=8 | vLLM c=32 | SGLang c=1 | SGLang c=8 | SGLang c=32 |
|---|---|---|---|---|---|---|
| Unconstrained | 2,850 | 11,200 | 22,400 | 2,920 | 12,100 | 24,800 |
| Simple JSON | 2,680 | 10,100 | 19,600 | 2,760 | 11,400 | 23,200 |
| Complex JSON | 2,200 | 7,800 | 14,500 | 2,480 | 9,800 | 20,100 |
| Function calling | 2,350 | 8,600 | 16,200 | 2,600 | 10,600 | 21,800 |
P95 Latency (ms)
| Scenario | vLLM c=1 | vLLM c=8 | vLLM c=32 | SGLang c=1 | SGLang c=8 | SGLang c=32 |
|---|---|---|---|---|---|---|
| Unconstrained | 38 | 145 | 580 | 36 | 132 | 510 |
| Simple JSON | 42 | 162 | 650 | 40 | 145 | 560 |
| Complex JSON | 58 | 210 | 890 | 52 | 178 | 720 |
| Function calling | 52 | 190 | 780 | 46 | 165 | 640 |
A100 80GB PCIe
Throughput (tokens/sec)
| Scenario | vLLM c=1 | vLLM c=8 | vLLM c=32 | SGLang c=1 | SGLang c=8 | SGLang c=32 |
|---|---|---|---|---|---|---|
| Unconstrained | 1,620 | 6,400 | 12,800 | 1,680 | 7,000 | 13,900 |
| Simple JSON | 1,520 | 5,800 | 11,200 | 1,600 | 6,500 | 13,100 |
| Complex JSON | 1,260 | 4,500 | 8,300 | 1,420 | 5,600 | 11,400 |
| Function calling | 1,340 | 4,900 | 9,100 | 1,500 | 6,100 | 12,600 |
A few observations worth calling out:
At concurrency 1, vLLM and SGLang are close on latency for simple schemas. The gap opens at c=8 and widens further at c=32, where SGLang's grammar caching and RadixAttention deliver 15-30% better throughput for complex schemas and function calling. For unconstrained generation, the two engines are nearly identical (mirroring the existing benchmark post's findings on unique-prompt workloads).
SGLang's advantage on structured output comes almost entirely from grammar caching. On the first call with a new schema, SGLang's TTFT is similar to vLLM's. On subsequent calls with the same schema (the typical agent loop), SGLang avoids recompilation and recovers 10-20ms per call.
Pricing fluctuates based on GPU availability. The prices above are based on 30 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
SGLang RadixAttention for Agent Loops
RadixAttention is SGLang's KV-cache sharing mechanism. It maintains a radix tree of cached KV tensors keyed by token prefix hashes. Requests that share a common prefix reuse the cached computation rather than rerunning the prefill for shared tokens.
For agent loops, this is nearly free throughput. Every call in an agent loop sends the same system prompt plus the same tool definitions. On a typical setup, that's 512-4096 tokens of shared prefix per call. Without prefix caching, every request recomputes that prefill from scratch. With RadixAttention, the prefill happens once and the cached result is reused for all subsequent requests.
In practice: on a 1024-token system prompt + tools block, RadixAttention saves roughly 12-18ms of TTFT per call at concurrency 8, and more at higher concurrency where the shared KV cache is accessed more frequently.
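The saving comes from longest-prefix matching over token IDs. A simplified sketch of the lookup (RadixAttention actually stores KV tensors in a radix tree rather than comparing flat lists, and the token values below are made up):

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Count leading token IDs an incoming request shares with a cached sequence."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# Cached: system prompt + tool definitions from the previous agent step.
cached_tokens = list(range(1024)) + [7, 7, 7]        # 1024 shared + old user turn
new_request = list(range(1024)) + [42, 43, 44, 45]   # same prefix + new user turn

reused = shared_prefix_len(cached_tokens, new_request)
print(reused)                     # 1024 tokens of prefill skipped
print(len(new_request) - reused)  # only 4 tokens actually prefilled
```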
How to verify it's working:
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --grammar-backend xgrammar \
  --enable-cache-report \
  --mem-fraction-static 0.88 \
  --max-running-requests 128 \
  --port 8000
```

The --enable-cache-report flag adds cache statistics to the server's /get_model_info endpoint and logs hit rates periodically. In a real agent loop, expect 80-95% cache hit rate for the shared prefix. A hit rate below 60% usually means your system prompt is varying between calls (check for dynamic timestamps or session IDs in the system prompt).
Throughput comparison: agent loop with 512-token shared prefix
| Scenario | vLLM c=8 (req/s) | SGLang c=8 (req/s) | SGLang + RadixAttention c=8 (req/s) |
|---|---|---|---|
| Agent loop, complex JSON | 18.5 | 21.2 | 24.8 |
| Agent loop, function calling | 20.1 | 23.4 | 27.3 |
| Scenario | vLLM c=32 (req/s) | SGLang c=32 (req/s) | SGLang + RadixAttention c=32 (req/s) |
|---|---|---|---|
| Agent loop, complex JSON | 31.2 | 38.4 | 44.1 |
| Agent loop, function calling | 34.0 | 42.8 | 49.5 |
RadixAttention is on by default in SGLang; you don't need to enable it. The only configuration question is --mem-fraction-static: setting this higher (0.88-0.92) gives more VRAM to the KV cache, which increases the effective prefix cache capacity. On an A100 80GB, the default is fine. On smaller GPUs, lowering it to 0.80 leaves more room for model weights.
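To see what that KV budget buys, here is back-of-envelope math for Llama 3.1 8B in FP16, assuming its published dimensions (32 layers, 8 KV heads via grouped-query attention, head dim 128); treat the result as an estimate:

```python
# Per-token KV cache for Llama 3.1 8B in FP16.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_fp16  # 2 = K and V

# A100 80GB with --mem-fraction-static 0.88, minus ~16GB of FP16 weights.
vram_gb = 80
kv_budget_bytes = int(vram_gb * 0.88 * 1e9) - int(16e9)

cacheable_tokens = kv_budget_bytes // per_token_bytes
print(per_token_bytes)   # 131072 bytes (~128 KiB per token)
print(cacheable_tokens)  # roughly 415,000 tokens of KV-cache capacity
```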
Production Configuration: Grammar Caching, State Precomputation, and Batch Tuning
Grammar caching with xgrammar
xgrammar uses a more efficient grammar representation than outlines' full FSM expansion, which means faster compilation and near-zero per-token overhead. For schemas with large enum sets or deeply nested optional fields, outlines compilation can add significant latency (often multiple seconds for complex schemas). xgrammar compiles grammars significantly faster and caches the result in a compiled bytecode form that survives process restarts (in recent versions).
vLLM configuration:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --guided-decoding-backend xgrammar \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

As of vLLM 0.6.x, xgrammar is the default guided decoding backend. If you are on an older version, add --guided-decoding-backend xgrammar explicitly.
SGLang configuration (from the RadixAttention section above): like vLLM, SGLang uses xgrammar as the default grammar backend. You can specify it explicitly with --grammar-backend xgrammar for clarity, but it is not required since xgrammar is already the default.
System prompt pre-computation
For vLLM, enable prefix caching with --enable-prefix-caching. This gives vLLM a similar shared-prefix benefit to RadixAttention, though the implementation differs and the cache hit rate is typically lower than SGLang's for agent workloads.
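The flag slots into the serve command from the grammar-caching section above (a sketch; tune the memory settings for your GPU):

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --guided-decoding-backend xgrammar \
  --enable-prefix-caching \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```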
For SGLang: RadixAttention handles this automatically. A useful practice is to send a warm-up request at server startup with your full system prompt and tool definitions. This pre-populates the KV cache so the first real user request doesn't pay the full prefill cost.
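A minimal warm-up helper is sketched below. The payload shape follows the OpenAI chat API used elsewhere in this guide; `build_warmup_payload` is a hypothetical helper name, and the actual send is commented out so the sketch stays self-contained:

```python
SYSTEM_PROMPT = "You are a research agent. Think step by step."

def build_warmup_payload(system_prompt: str, model: str) -> dict:
    """One-token request whose only job is to prefill the shared prefix into cache."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "warmup"},
        ],
        "max_tokens": 1,  # we only want the prefill, not a real completion
    }

payload = build_warmup_payload(SYSTEM_PROMPT, "meta-llama/Llama-3.1-8B-Instruct")
# At server startup you would send it once, e.g.:
# client.chat.completions.create(**payload)
print(payload["max_tokens"])           # 1
print(payload["messages"][0]["role"])  # system
```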
Batch tuning for agent concurrency
Agent APIs are often low-latency single requests (one agent step at a time), not large batches. Tune --max-num-seqs (vLLM) or --max-running-requests (SGLang) based on your expected concurrent agent sessions.
| Profile | Expected concurrency | vLLM --max-num-seqs | SGLang --max-running-requests | GPU |
|---|---|---|---|---|
| Dev / staging | 1-4 | 8 | 16 | A100 80GB PCIe |
| Small production | 5-20 | 32 | 64 | H100 PCIe |
| Large production | 20-100 | 128 | 256 | H100 SXM5 |
VRAM headroom
Structured output with grammar caching needs a small VRAM budget for the grammar automaton state: plan for roughly 500MB on top of model weights. An A100 80GB gives ample room for Llama 3.1 8B or 13B. An A100 40GB may be tight with larger models at full precision; use FP8 or Q4 quantization if you're memory-constrained.
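A rough fit-check under the 500MB grammar assumption, with a hypothetical `fits` helper and an assumed ~8GB of KV-cache headroom:

```python
def fits(vram_gb: float, params_b: float, bytes_per_param: float,
         grammar_overhead_gb: float = 0.5, kv_headroom_gb: float = 8.0) -> bool:
    """Rough single-GPU fit check: weights + grammar state + KV headroom."""
    weights_gb = params_b * bytes_per_param
    return weights_gb + grammar_overhead_gb + kv_headroom_gb <= vram_gb

print(fits(80, 8, 2))    # True: Llama 3.1 8B FP16 on A100 80GB, ample room
print(fits(40, 34, 2))   # False: a 34B model in FP16 doesn't fit on 40GB
print(fits(40, 34, 0.5)) # True: the same model fits with Q4 (~0.5 bytes/param)
```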
Tutorial: Build an AI Agent API with Guaranteed JSON Output on Spheron
Step 1: Provision a Spheron GPU instance
Log in to app.spheron.ai and provision a GPU instance. For development and testing, an A100 80GB PCIe at $1.07/hr is the right starting point. For production with 20+ concurrent agents, use an H100 SXM5. See the Spheron GPU deployment docs for step-by-step provisioning instructions.
Once the instance is running, SSH in and verify GPU access:
```bash
nvidia-smi
# Should show your GPU model, VRAM, and CUDA version
```

Step 2: Install SGLang and launch the server
```bash
# Install SGLang with all backends
pip install "sglang[all]"

# Launch with xgrammar and cache reporting
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --grammar-backend xgrammar \
  --enable-cache-report \
  --mem-fraction-static 0.88 \
  --max-running-requests 64 \
  --port 8000
```

Wait for "Server is ready" in the logs. The first launch downloads the model from Hugging Face if it isn't cached locally. Pass your HF token via --hf-token or the HF_TOKEN environment variable for gated models.
Step 3: Define your response schema with Pydantic
```python
from pydantic import BaseModel
from typing import Literal

class AgentResponse(BaseModel):
    thought: str
    action: Literal["search", "answer", "tool_call"]
    payload: str
```

Pydantic's .model_json_schema() generates a valid JSON Schema dict that you pass directly to the API.
Step 4: Implement the agent loop
```python
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

class AgentResponse(BaseModel):
    thought: str
    action: Literal["search", "answer", "tool_call"]
    payload: str

client = OpenAI(base_url="http://YOUR_SPHERON_IP:8000/v1", api_key="none")

def run_agent_step(messages: list[dict]) -> AgentResponse:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "agent_response",
                "schema": AgentResponse.model_json_schema()
            }
        },
        max_tokens=512
    )
    choices = response.choices
    if not choices or choices[0].message.content is None:
        raise ValueError(f"Unexpected API response: {response}")
    return AgentResponse.model_validate_json(choices[0].message.content)

# Run an agent loop
system_prompt = {"role": "system", "content": "You are a research agent. Think step by step."}
user_message = {"role": "user", "content": "What is the capital of France?"}

step = run_agent_step([system_prompt, user_message])
print(step.action)   # "answer"
print(step.payload)  # "Paris"
```

Step 5: Add function calling with tool definitions
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "max_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "retrieve_document",
            "description": "Retrieve a document by URL",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "extract_text": {"type": "boolean", "default": True}
                },
                "required": ["url"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[system_prompt, user_message],
    tools=tools,
    tool_choice="auto",
    max_tokens=512
)

# Check if the model called a tool
if not response.choices:
    raise ValueError(f"Unexpected API response: {response}")
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    print(tool_call.function.name)       # "web_search"
    print(tool_call.function.arguments)  # '{"query": "capital of France"}'
```

Step 6: Monitor grammar cache hit rate
Check cache statistics via the SGLang metrics endpoint:
```bash
curl http://YOUR_SPHERON_IP:8000/get_model_info | python3 -m json.tool
```

Look for cache_hit_rate in the response. For a production agent loop with a fixed system prompt, this should settle above 80% after a warm-up period. If it's low, check whether your system prompt or tool definitions are changing between calls.
Cost Analysis: Structured Output Workloads Across GPU Tiers
Assumptions for the cost math:
- Agent call: 512 input tokens (system prompt + tools + user message) + 128 output tokens
- Throughput from the benchmark section at concurrency 8, complex JSON schema
- Reference workload: 1 million agent calls per day
Formula: cost = (1,000,000 / (throughput_req_s * 3600)) * price_per_hr
| GPU | Type | Price/hr | Throughput (req/s, c=8, structured) | Cost per 1M agent calls |
|---|---|---|---|---|
| A100 80GB PCIe | On-demand | $1.07 | ~14 req/s | ~$21.23 |
| H100 PCIe | On-demand | $2.01 | ~14.8 req/s | ~$37.73 |
| H100 SXM5 | On-demand | $2.40 | ~24.8 req/s | ~$26.90 |
| H100 SXM5 | Spot | $0.80 | ~24.8 req/s | ~$8.97 |
H100 SXM5 spot pricing at $0.80/hr is well below the on-demand price and a strong option for batch structured output workloads where interruption tolerance is acceptable. H100 PCIe throughput (~14.8 req/s) is estimated from the SXM5 benchmark result scaled by the memory bandwidth ratio: 24.8 × (2,000 GB/s PCIe ÷ 3,350 GB/s SXM5) ≈ 14.8 req/s. A100 80GB PCIe throughput (~14 req/s) is extrapolated from the non-RadixAttention SGLang token throughput benchmarks (A100 c=8 complex JSON: 5,600 tok/s; H100 SXM5: 9,800 tok/s, ratio ≈ 57%): 24.8 × 0.57 ≈ 14.1 req/s. Note that this extrapolation applies a ratio from non-RadixAttention token benchmarks to scale a RadixAttention-enabled req/s result; the two scenarios use different benchmark conditions, so the A100 figure is an approximation. Check current pricing for the latest spot availability.
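The cost column can be reproduced directly from the formula; the throughputs and prices below are the table's own figures, not live rates:

```python
def cost_per_million_calls(throughput_req_s: float, price_per_hr: float) -> float:
    """cost = (1,000,000 / (throughput_req_s * 3600)) * price_per_hr"""
    return (1_000_000 / (throughput_req_s * 3600)) * price_per_hr

for name, tput, price in [
    ("A100 80GB PCIe on-demand", 14.0, 1.07),
    ("H100 PCIe on-demand", 14.8, 2.01),
    ("H100 SXM5 on-demand", 24.8, 2.40),
    ("H100 SXM5 spot", 24.8, 0.80),
]:
    print(f"{name}: ${cost_per_million_calls(tput, price):.2f} per 1M calls")
# First line: A100 80GB PCIe on-demand: $21.23 per 1M calls
```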
Which GPU tier for structured output?
| Use case | Recommended GPU | Why |
|---|---|---|
| Dev and prototyping | A100 80GB PCIe | Best cost, sufficient throughput |
| Small production API (under 10 concurrent agents) | H100 PCIe | Better latency, reasonable cost |
| High-concurrency production (10-50 agents) | H100 SXM5 | Maximum throughput |
| Enterprise SLA requirements | H100 SXM5 on-demand | Reliability and performance |
Explore H100 rental options or view all GPU pricing to compare current rates. For general strategies to reduce GPU cloud costs beyond structured output workloads, see the GPU cost optimization playbook.
Pricing fluctuates based on GPU availability. The prices above are based on 31 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
Structured output and function calling workloads need the right GPU: enough VRAM for the model and grammar state, and enough throughput for your agent concurrency. Spheron has H100, H200, and A100 instances available on-demand, with data center partners globally.
