Context Engineering for Production AI Agents: KV Cache, Prefix Caching, and Long-Context GPU Economics (2026 Guide)

Production agents are not compute-bound on generation. They are compute-bound on context. A ReAct agent making 10 tool calls in a session might produce 500 output tokens total and consume 800,000 input tokens, because each call carries the full system prompt, tool schemas, and conversation history. Each generated token takes milliseconds. The prefill pass over 80,000 input tokens takes seconds, and you pay for it on every call. For a broader look at how this fits the inference engineering discipline, see the inference engineering guide 2026.

What Context Engineering Is

Context engineering is the practice of deciding what goes into the context window, in what order, and how to cache and compress it to minimize prefill compute without hurting output quality.

It is distinct from prompt engineering, which focuses on what instructions and examples produce the best outputs. Prompt engineering optimizes quality. Context engineering optimizes cost and latency.

It is also distinct from RAG, which focuses on retrieval. RAG decides which external documents to fetch. Context engineering decides how to include those documents, whether to compress them, how to order them relative to cached prefixes, and whether the retrieval was even necessary given what's already in the KV cache.

Context engineering sits between the agent framework and the inference server. The framework decides what to run. The inference server handles computation. Context engineering determines how efficiently those two layers connect.

The 100:1 Input:Output Ratio

Standard chatbots have balanced input:output ratios, often 1:1 to 3:1. Agents are different. Every agent turn carries accumulated overhead:

  • System prompt: 2K-10K tokens for instructions, persona, and constraints
  • Tool schemas: 5K-50K tokens for function definitions passed with every request
  • Conversation history: grows by 100-2,000 tokens per turn
  • Retrieved documents: 5K-100K tokens per RAG call

A concrete example: a ReAct agent making 10 tool calls with a 2K system prompt, 50K tool schemas, and 30K conversation history at turn 10 sends roughly 82,000 input tokens per round and produces 500 output tokens. That's a 164:1 ratio.

| Workload | Typical input tokens | Typical output tokens | Ratio |
| --- | --- | --- | --- |
| Chatbot (short) | 500 | 300 | 1.7:1 |
| RAG pipeline | 8,000 | 400 | 20:1 |
| ReAct agent (early turns) | 20,000 | 200 | 100:1 |
| ReAct agent (late turns) | 80,000 | 300 | 267:1 |
| Code review agent | 120,000 | 1,000 | 120:1 |
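The ratio arithmetic from the ReAct example can be sketched in a few lines. This is a minimal illustration using the figures quoted above; the function name and parameters are hypothetical, not part of any framework.

```python
def input_output_ratio(system_prompt, tool_schemas, history,
                       retrieved=0, output_tokens=500):
    """Return (input_tokens, ratio) for a single agent round.

    All arguments are token counts. The names are illustrative only.
    """
    input_tokens = system_prompt + tool_schemas + history + retrieved
    return input_tokens, input_tokens / output_tokens


# Turn 10 of the ReAct example: 2K system prompt, 50K schemas, 30K history.
tokens, ratio = input_output_ratio(2_000, 50_000, 30_000, output_tokens=500)
print(f"{tokens:,} input tokens, {ratio:.0f}:1")  # 82,000 input tokens, 164:1
```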

At 100:1 and above, input tokens dominate everything: cost, latency, and GPU utilization. The generation step is not the problem. For a deeper look at how memory layers (working memory, episodic memory, semantic memory) interact with token costs, the agent memory infrastructure guide covers those tradeoffs in detail.

KV Cache Hit Rate: The #1 Cost Lever

When the inference server processes input tokens, it computes key-value attention tensors for each token at each transformer layer. These tensors are stored in the KV cache. If a subsequent request shares the same prefix, the server can reuse those tensors instead of recomputing them. The fraction of input tokens served from cache rather than recomputed is the KV cache hit rate.

At 100:1 input:output ratios, prefill compute accounts for 85-95% of total GPU time per request. A 90% KV cache hit rate means the server skips 90% of that prefill work, reducing effective compute cost per request by 80-90%.

The math:

cost_saving_fraction = hit_rate * prefill_fraction

At 90% hit rate, 92% prefill fraction:

cost_saving_fraction = 0.90 * 0.92 = 0.828 (82.8% of total GPU time)
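As a rough sketch, the same relationship in code, assuming cached tokens cost nothing to serve (a simplification, since cache lookups and KV reads are not free). The function name is hypothetical.

```python
def cost_saving_fraction(hit_rate: float, prefill_fraction: float) -> float:
    """Fraction of total GPU time saved when `hit_rate` of prefill is skipped.

    Assumes cached tokens cost nothing to serve, which is a simplification.
    """
    return hit_rate * prefill_fraction


for hit_rate in (0.50, 0.90, 0.95):
    saving = cost_saving_fraction(hit_rate, prefill_fraction=0.92)
    print(f"hit rate {hit_rate:.0%}: saves {saving:.1%} of GPU time")
```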

Cost comparison for a 70B model serving a workload with 80K input tokens per request:

| KV cache hit rate | Effective prefill tokens | TTFT | Cost per 1M requests |
| --- | --- | --- | --- |
| 0% (no prefix caching) | 80,000 | ~8s | baseline |
| 50% | 40,000 | ~4s | ~50% of baseline |
| 90% | 8,000 | ~800ms | ~10% of baseline |
| 95% | 4,000 | ~400ms | ~5% of baseline |

The difference between 0% and 90% hit rate is not a marginal optimization. It is the difference between a $20,000/month GPU bill and a $2,000/month bill for the same workload. For the full KV cache memory math and per-layer VRAM calculations, see the KV cache optimization guide.
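The table rows follow from a simple model: uncached tokens are the input tokens times the miss rate, and TTFT is assumed to scale linearly with uncached tokens from the ~8s baseline. That linear scaling is an assumption made for this sketch; real prefill time also depends on attention over the cached portion, batching, and hardware.

```python
def effective_prefill(input_tokens: int, hit_rate: float,
                      baseline_ttft_s: float = 8.0):
    """Return (uncached_tokens, estimated_ttft_seconds).

    Assumes TTFT scales linearly with the number of uncached tokens,
    starting from `baseline_ttft_s` at a 0% hit rate.
    """
    uncached = round(input_tokens * (1 - hit_rate))
    ttft = baseline_ttft_s * uncached / input_tokens
    return uncached, ttft


for rate in (0.0, 0.5, 0.9, 0.95):
    uncached, ttft = effective_prefill(80_000, rate)
    print(f"{rate:.0%} hit rate: {uncached:,} tokens to prefill, ~{ttft:.1f}s TTFT")
```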

Prefix Caching and RadixAttention

Two implementations dominate production in 2026: SGLang's RadixAttention and vLLM's automatic prefix caching.

SGLang RadixAttention stores KV activations in a radix tree keyed by token sequence. When a new request arrives, SGLang walks the tree to find the longest matching prefix already in cache and starts computation at that branch point. Every request that shares a system prompt benefits. Every turn in a multi-turn conversation benefits more because each turn's history is the prefix for the next.
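The longest-prefix lookup can be illustrated with a toy structure. The sketch below uses a plain trie with one token per edge rather than a compressed radix tree, stores no KV tensors, and is not SGLang's implementation; it only shows how a shared prefix determines where computation would resume.

```python
class PrefixTrie:
    """Toy token trie: one token per edge, no eviction, no KV payloads."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a token sequence as cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def longest_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixTrie()
system = [1, 2, 3, 4]                 # stand-in for system prompt tokens
cache.insert(system + [10, 11])       # first request
hit = cache.longest_prefix(system + [10, 12, 13])
print(f"{hit} of 7 tokens reusable")  # 5 of 7 tokens reusable
```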

vLLM automatic prefix caching (APC) works at the block level. vLLM divides the KV cache into fixed-size blocks (16 tokens each by default) and hashes each block by its token content. Blocks from one request are reused in later requests if the hash matches.
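Block-level matching can be sketched as follows. This is an illustration of the idea, not vLLM's code: it assumes each block's hash also covers the hash of the preceding block, so a block only matches when everything before it matches too, and it uses SHA-256 purely for convenience.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block, matching the default quoted above


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block, chaining in the previous block's hash."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size  # drop partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_tokens(cached, token_ids, block_size=BLOCK_SIZE):
    """Count leading tokens whose blocks are all present in `cached`."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cached:
            break
        hits += 1
    return hits * block_size


first = list(range(40))                    # 2 full blocks + 8 leftover tokens
cached = set(block_hashes(first))
second = list(range(32)) + [999] * 10      # same first 32 tokens, new tail
print(cached_prefix_tokens(cached, second))  # 32
```

The partial tail block is never cached in this sketch, which is why block-level reuse rounds down to a multiple of the block size.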

| Feature | SGLang RadixAttention | vLLM APC |
| --- | --- | --- |
| Default on | Yes | vLLM V1 / v0.18.0+ |
| Granularity | Token-level (radix tree) | Block-level (16 tokens/block) |
| Multi-turn handling | Strong (accumulates state per conversation branch) | Good |
| Monitoring metric | sglang_cache_hit_rate | num_prefix_cache_hit_tokens |

To maximize cache hit rate from either:

  1. Fix your system prompt character-for-character across requests. A single whitespace difference invalidates the entire cached prefix.
  2. Fix tool definition order. Passing [search, write, read] in one request and [read, search, write] in another produces cache misses even though the schemas are identical.
  3. Pass conversation history as a stable prefix before the new user turn. Don't regenerate or summarize turns unless you've flushed the cached context intentionally.
  4. Keep retrieved documents in the same position relative to the system prompt. Interleaving retrieved chunks differently across turns fragments the prefix tree.
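One way to enforce the rules above is to build the prompt through a single function that canonicalizes the prefix. The sketch below is a hypothetical helper, assuming tools are dicts with a "name" key and that your framework lets you control the serialized prompt; the exact prompt layout is an assumption.

```python
import json


def build_prefix(system_prompt: str, tools: list) -> str:
    """Serialize the system prompt and tool schemas deterministically."""
    ordered = sorted(tools, key=lambda t: t["name"])      # fixed tool order
    tool_text = json.dumps(ordered, sort_keys=True,       # fixed key order
                           separators=(",", ":"))         # fixed whitespace
    return system_prompt.strip() + "\n" + tool_text


def build_prompt(system_prompt, tools, history, user_turn):
    """Stable prefix first, then history, then the new user turn last."""
    return "\n".join([build_prefix(system_prompt, tools), *history, user_turn])


search = {"name": "search", "parameters": {"q": "string"}}
write = {"name": "write", "parameters": {"path": "string", "body": "string"}}
read = {"name": "read", "parameters": {"path": "string"}}

a = build_prefix("You are a research agent.\n", [search, write, read])
b = build_prefix("You are a research agent.", [read, search, write])
print(a == b)  # True
```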

Workloads with 60%+ prefix overlap consistently hit 75-95% cache hit rates. Workloads where every request is unique (creative generation, personalized recommendations) see near-zero benefit. For the full deployment walkthrough including monitoring setup, see the SGLang production deployment guide.

Long-Context Infrastructure: VRAM From 32K to 1M Tokens

KV cache VRAM scales linearly with context length. For Llama 3.1 70B (80 layers, 8 GQA heads, 128 head_dim), the formula is:

kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_element
         = 2 * 80 * 8 * 128 * context_len * bytes_per_element

At BF16 (2 bytes per element), a single concurrent user at 128K context needs 42.9 GB for KV alone. Add 140 GB for model weights at FP16, and a single H200 (141 GB) cannot serve even one user at this context. FP8 KV quantization alone does not fix that: the weights must also be quantized to FP8 (roughly 70 GB) before a 128K KV cache fits.

| Context | BF16 KV per user | FP8 KV per user | Recommended GPU |
| --- | --- | --- | --- |
| 32K | ~10.7 GB | ~5.4 GB | H100 SXM5 80 GB |
| 128K | ~42.9 GB | ~21.5 GB | H200 SXM5 141 GB |
| 512K | ~171 GB | ~85.5 GB | 2x B200 SXM6 192 GB |
| 1M | ~343 GB | ~172 GB | NVMe KV offload or sequence parallelism |

These figures are for Llama 3.1 70B. Model architecture matters: different GQA configurations and head dimensions produce different KV footprints. Treat these as planning estimates, not guarantees.
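The formula translates directly into a small calculator. The defaults below are the Llama 3.1 70B values quoted above; the sketch assumes "K" means 1,024 tokens and reports decimal gigabytes, which is what reproduces the table figures.

```python
def kv_cache_bytes(context_len: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_element: int = 2) -> int:
    """KV cache size in bytes for one sequence (keys and values, all layers)."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_element


for label, ctx in [("32K", 32 * 1024), ("128K", 128 * 1024),
                   ("512K", 512 * 1024), ("1M", 1024 * 1024)]:
    bf16_gb = kv_cache_bytes(ctx) / 1e9
    fp8_gb = kv_cache_bytes(ctx, bytes_per_element=1) / 1e9
    print(f"{label:>4}: BF16 {bf16_gb:6.1f} GB | FP8 {fp8_gb:6.1f} GB")
```

Swapping in another model's layer count, KV head count, and head dimension gives its footprint under the same assumptions.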

FP8 KV quantization is the single most effective lever below 512K context. It halves KV memory with minimal quality impact on most models. Enable it in vLLM with --kv-cache-dtype fp8 on H100 and H200 hardware.

Techniques for Context-Heavy Agent Workloads

Prefix and Prompt Caching

Enable prefix caching before anything else. It has the largest impact on cost and latency and requires nothing beyond a flag.

For vLLM:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92
```

For SGLang (RadixAttention is on by default, no flag needed):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --enable-metrics \
  --host 0.0.0.0 --port 8000
```

To monitor hit rate with SGLang, scrape sglang_cache_hit_rate from the Prometheus /metrics endpoint. Exact metric names differ between releases, so confirm them against your server's /metrics output. With vLLM, compute it as:

hit_rate = num_prefix_cache_hit_tokens / (num_prefix_cache_hit_tokens + num_prefix_cache_miss_tokens)

A well-configured agent workload should hit 70%+ after the cache warms across the first few requests.
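A minimal sketch of that computation over Prometheus text output is shown below. The counter names default to the ones used in this article and are passed as parameters because the names your server exposes may differ; the parser assumes no trailing timestamps on sample lines.

```python
def read_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of `name` in Prometheus text format, across labels."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:    # strip the label set
            total += float(value)
    return total


def prefix_hit_rate(metrics_text: str,
                    hit_name: str = "num_prefix_cache_hit_tokens",
                    miss_name: str = "num_prefix_cache_miss_tokens") -> float:
    """hits / (hits + misses); metric names are assumptions from this article."""
    hits = read_counter(metrics_text, hit_name)
    misses = read_counter(metrics_text, miss_name)
    seen = hits + misses
    return hits / seen if seen else 0.0


sample = """
# TYPE num_prefix_cache_hit_tokens counter
num_prefix_cache_hit_tokens{model="llama"} 7200
num_prefix_cache_miss_tokens{model="llama"} 800
"""
print(f"{prefix_hit_rate(sample):.0%}")  # 90%
```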

The most common reason hit rates stay low: prompt variation. Even one character difference in the system prompt (trailing newline, different whitespace) causes a full miss. Audit your prompt templates before concluding that prefix caching doesn't work for your workload.

KV Offload to NVMe

When GPU HBM fills, you have two options: add more GPUs or tier cold KV blocks to NVMe. LMCache adds NVMe-backed KV persistence to vLLM.

Cold KV reads from a fast NVMe drive run at roughly 7 GB/s. Recomputing a 128K prefix on H100 takes about 11 seconds. Loading the same prefix from disk (about 21.5 GB of FP8 KV for a 70B model) takes roughly 3 seconds at that read speed, so serving from disk beats recomputing from scratch for most production workloads with large shared prefixes.
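The comparison is simple arithmetic. The sketch below uses the 7 GB/s and 11 second figures from the text and the 128K KV sizes for Llama 3.1 70B; it ignores deserialization and host-to-GPU copy time, which is an assumption that flatters the disk path.

```python
def nvme_load_seconds(kv_gb: float, read_gb_per_s: float = 7.0) -> float:
    """Time to stream a KV cache of `kv_gb` gigabytes from NVMe."""
    return kv_gb / read_gb_per_s


RECOMPUTE_S = 11.0  # 128K prefix recompute on H100, as quoted above

for label, kv_gb in [("FP8", 21.5), ("BF16", 42.9)]:
    load_s = nvme_load_seconds(kv_gb)
    print(f"{label}: load ~{load_s:.1f}s vs recompute ~{RECOMPUTE_S:.0f}s")
```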

Setup:

```yaml
# lmcache.yaml
local_disk: true
local_disk_path: /mnt/nvme/lmcache
max_local_disk_size: 500  # GB
```

```bash
LMCACHE_CONFIG_FILE=lmcache.yaml python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```

For a full NVMe KV offload walkthrough, see the NVMe KV cache offloading guide. For multi-node setups where you want to share a single KV cache pool across all server instances via Redis, see the LMCache vLLM deployment guide.

Sleep-Time Pre-Computation

For predictable context patterns, you can pre-warm the KV cache during idle GPU cycles rather than on the first live request. If you know that every user session will start with a 50K-token codebase context, compute and store the KV cache for that context during off-peak hours so the first request in a session hits the warm cache.

This is especially valuable for multi-tenant agents where each customer has a stable, large context (a codebase, a knowledge base, a document set) that rarely changes. One pre-computation pass per customer context, shared across all their requests. For the full sleep-time compute pattern including scheduling and cache invalidation strategy, see the sleep-time compute guide.
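One simple way to pre-warm is to send the shared context through the server with a one-token completion, so the prefill runs and the KV blocks land in cache. The sketch below assumes an OpenAI-compatible /v1/completions endpoint; the helper names, base URL, and timeout are hypothetical, and the warm-up text must tokenize identically to the prefix of live requests for the cache to hit.

```python
import json
import urllib.request


def build_warmup_payload(model: str, shared_context: str) -> dict:
    """Request body that forces a prefill but generates only one token."""
    return {"model": model, "prompt": shared_context, "max_tokens": 1}


def warm_cache(base_url: str, model: str, shared_context: str,
               timeout: float = 600.0) -> int:
    """POST the warm-up request and return the HTTP status code."""
    body = json.dumps(build_warmup_payload(model, shared_context)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Example (not executed here): run once per customer context during off-peak.
# warm_cache("http://localhost:8000", "meta-llama/Llama-3.1-70B-Instruct", codebase_text)
```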

Measuring and Eliminating Context Waste

The average production agent context carries 40-70% dead weight: tool schemas for functions never called, conversation history from turns irrelevant to the current task, retrieved chunks that score high on retrieval similarity but low on actual relevance to the query.

Waste matters because every dead token contributes to prefill cost and KV cache pressure. At 100:1 ratios, a 30% reduction in input tokens reduces cost by roughly 30% and TTFT by roughly 30% with no model changes.

Identifying waste:

  1. Token attribution analysis: run a small set of representative requests through the model and measure per-token attention weights at decode time. Tokens with near-zero attention across all decode steps are candidates for pruning.
  2. Tool schema audit: log which tools are actually called per agent type. If a research agent never calls write_file in 10,000 sessions, remove that schema from its system prompt.
  3. History compression threshold: track when older conversation turns stop influencing generation (roughly after 10-15 turns for most tasks). Summarize turns beyond that threshold with a small LLM rather than keeping raw text.
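The tool schema audit can be sketched as a count over call logs. The log format here (a flat list of tool names, one entry per call) is an assumption for illustration; adapt it to whatever your agent framework records.

```python
from collections import Counter


def unused_tools(registered, call_log):
    """Return registered tools that never appear in the call log."""
    counts = Counter(call_log)
    return sorted(name for name in registered if counts[name] == 0)


registered = ["search", "read_file", "write_file", "run_tests"]
call_log = ["search", "read_file", "search", "search", "read_file"]

print(unused_tools(registered, call_log))  # ['run_tests', 'write_file']
```

Schemas that show up in this list across a large sample of sessions are candidates for removal from that agent type's prompt.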

Removing waste:

  • Compress historical turns using a fast small model (8B-class) into structured memory objects. The compressed summary is 10-50x smaller than raw history. See the agent memory guide for deployment patterns.
  • Filter retrieved chunks by relevance threshold before injecting. A reranker with a score cutoff at 0.75-0.85 typically removes 30-50% of retrieved text with minimal impact on answer quality.
  • Use dynamic tool schema injection: detect which tool category the current request targets and include only the relevant subset of schemas.
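The relevance filter from the list above might look like this. It is a sketch assuming you already have (chunk, score) pairs from a reranker with scores in the 0 to 1 range; the fallback that always keeps the top chunk is a design choice added here, not something the text prescribes.

```python
def filter_chunks(scored_chunks, cutoff: float = 0.8, keep_at_least: int = 1):
    """Keep chunks scoring at or above `cutoff`, best first.

    `scored_chunks` is a list of (text, score) pairs. If too few chunks
    pass, fall back to the top `keep_at_least` so the context is never empty.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    kept = [pair for pair in ranked if pair[1] >= cutoff]
    return kept if len(kept) >= keep_at_least else ranked[:keep_at_least]


chunks = [("pricing table", 0.91), ("changelog", 0.62), ("api limits", 0.80)]
print(filter_chunks(chunks))  # [('pricing table', 0.91), ('api limits', 0.8)]
```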

For the application-level semantic caching layer that intercepts repeated identical queries before they even reach the inference server, see the semantic cache guide.

Self-Hosted vs API: The Economics of Context-Heavy Agents

The standard advice is "use managed APIs for low volume, self-host at scale." The crossover point for context-heavy agents is lower than most teams expect.

At 100K input tokens per request and GPT-4o pricing (~$2.50 per 1M input tokens):

  • 1,000 requests/month: $250
  • 5,000 requests/month: $1,250
  • 20,000 requests/month: $5,000

A dedicated H200 on Spheron at $4.84/hr on-demand runs $3,484/month. At 50% utilization with FP8 quantization and prefix caching, that instance handles roughly 2-4 million 100K-token requests per month, depending on concurrency and hit rate. Per-request cost works out to roughly $0.001-$0.002.

| Monthly volume | GPT-4o cost (100K input/req) | H200 on Spheron ($4.84/hr, 50% util) | Cheaper option |
| --- | --- | --- | --- |
| 1,000 req | $250 | $3,484 (fixed) | API |
| 5,000 req | $1,250 | $3,484 (fixed) | API |
| 20,000 req | $5,000 | $3,484 (fixed) | Self-hosted |
| 100,000 req | $25,000 | $3,484 (fixed) | Self-hosted |

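The crossover is roughly 14,000 requests per month at 100K tokens each ($3,484 divided by $0.25 per request). That is under 1% of what the instance can serve at 50% utilization, so beyond the crossover each additional request is close to free. At lower context lengths, the API cost per request falls and the crossover shifts higher.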
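The crossover calculation can be reproduced with the prices quoted above. The sketch assumes a 720-hour month (which is what the $3,484 figure implies) and counts input tokens only on the API side; function names are hypothetical.

```python
import math


def monthly_gpu_cost(hourly_rate: float, gpus: int = 1, hours: int = 720) -> float:
    """Fixed monthly cost of dedicated GPUs at an on-demand hourly rate."""
    return hourly_rate * gpus * hours


def crossover_requests(gpu_monthly: float, tokens_per_request: int,
                       api_price_per_million: float = 2.50) -> int:
    """Monthly request count at which API spend reaches the GPU cost."""
    api_cost_per_request = tokens_per_request / 1_000_000 * api_price_per_million
    return math.ceil(gpu_monthly / api_cost_per_request)


h200 = monthly_gpu_cost(4.84)
print(f"H200: ${h200:,.0f}/month, crossover at "
      f"{crossover_requests(h200, 100_000):,} requests")
```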

For workloads requiring 512K+ context, B200 is the right tier. At $7.41/hr per GPU on Spheron, two B200s (192 GB each, 384 GB total) run $10,670/month and can serve 512K-context agents at meaningful concurrency. At that context length, managed APIs become extremely expensive: 500K tokens at $2.50 per 1M = $1.25 per request. Around 8,600 requests/month exceeds the B200 pair cost.

Pricing fluctuates based on GPU availability. The prices above are based on 17 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Spheron H200 and B200 for Context-Heavy Agents

H200 SXM5 has 141 GB of HBM3e memory at 4.8 TB/s bandwidth. With FP8 quantization applied to both model weights and KV cache, a 70B model shrinks to roughly 70 GB, leaving around 71 GB for KV cache. That fits 3 concurrent users at 128K context each (about 21 GB of KV per user at FP8), or 1 user at 256K, which covers most production agentic workloads where concurrency per instance stays in single digits. FP8 KV cache quantization alone is not sufficient here: at FP16 the 70B weights occupy around 140 GB, leaving no room for KV cache on a 141 GB GPU.

B200 SXM6 has 192 GB of HBM3e at 8 TB/s bandwidth. Two B200s give 384 GB total, which fits a 70B model plus KV cache for 7 concurrent users at 256K context each at FP8. The 8 TB/s bandwidth matters specifically for decode throughput when serving large concurrent batches where KV cache reads dominate memory access patterns.
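The concurrency figures for both GPUs come from dividing the VRAM left after weights by the per-user KV footprint. The sketch below reproduces them; it ignores activation memory, CUDA graphs, and allocator overhead, so treat the results as upper bounds.

```python
def concurrent_users(total_vram_gb: float, weights_gb: float,
                     kv_per_user_gb: float) -> int:
    """Whole number of user KV caches that fit after loading the weights."""
    free_gb = max(total_vram_gb - weights_gb, 0.0)
    return int(free_gb // kv_per_user_gb)


# FP8 weights (~70 GB) and FP8 KV: ~21.5 GB at 128K, ~42.9 GB at 256K.
print("H200, 128K:", concurrent_users(141, 70, 21.5))     # 3
print("H200, 256K:", concurrent_users(141, 70, 42.9))     # 1
print("2x B200, 256K:", concurrent_users(384, 70, 42.9))  # 7
print("H200, FP16 weights:", concurrent_users(141, 140, 21.5))  # 0
```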

Both GPU tiers on Spheron are bare-metal instances with NVMe local storage, which matters for LMCache KV offload. Managed GPU services that abstract away the underlying hardware often prevent direct NVMe access, limiting KV offload to CPU RAM. CPU RAM is much faster (around 200 GB/s versus 7 GB/s for NVMe) but much smaller, so it fills up sooner under sustained load.

Per-minute billing means you pay for compute time, not reserved capacity. For agent workloads with uneven traffic, this matters: a 10-minute burst is billed as 10 minutes, not as a full reserved hour.


Context-heavy agents need high-VRAM GPUs with KV-cache-friendly serving stacks. Spheron's H200 and B200 instances come bare-metal with NVMe storage, vLLM/SGLang pre-configured, and per-minute billing so you pay for compute, not idle time.

Quick Setup Guide

  1. Profile your agent's input:output token ratio

    Add token logging to your agent loop. Record input_tokens and output_tokens per LLM call. If your ratio exceeds 10:1 (input to output), context engineering will have a larger impact on cost than any model-level optimization. If the ratio exceeds 50:1, context cost is the dominant factor and prefix caching is your highest priority.

  2. Enable prefix caching in vLLM or SGLang

    For vLLM: add --enable-prefix-caching to your launch command (enabled by default in vLLM V1/v0.18.0+). For SGLang: RadixAttention is on by default. Critical: fix your system prompt byte-for-byte across calls - any whitespace change invalidates the cached prefix. Fix tool definition order in every request.

  3. Instrument KV cache hit rate

    For SGLang: launch with --enable-metrics and scrape the sglang_cache_hit_rate Prometheus metric. For vLLM: query the /metrics endpoint and track num_prefix_cache_hit_tokens / (num_prefix_cache_hit_tokens + num_prefix_cache_miss_tokens). A healthy agent workload should achieve 70%+ hit rate after the first few requests warm the cache.

  4. Calculate VRAM requirements for your context window target

    Use: kv_bytes = 2 * num_layers * num_kv_heads * head_dim * context_length * bytes_per_element. For Llama 3.1 70B at FP8: 2 * 80 * 8 * 128 * context_len * 1 bytes. At 128K context: ~21.5 GB per concurrent user slot. For 10 concurrent users at 128K: 215 GB of KV cache plus roughly 70 GB of FP8 model weights, which requires at least 3x H200 (141 GB each).

  5. Configure NVMe KV offload for >128K contexts

    Install LMCache: pip install lmcache. Create lmcache.yaml with local_disk: true, local_disk_path: /mnt/nvme/lmcache, max_local_disk_size: <available NVMe in GB>. Launch vLLM with: LMCACHE_CONFIG_FILE=lmcache.yaml python -m vllm.entrypoints.openai.api_server --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'. Cold KV reads from NVMe (~7 GB/s) still beat recomputing a 128K prefix (~11s on H100).

  6. Set up LMCache for multi-node KV cache sharing

    Start Redis: docker run -d -p 6379:6379 redis:7. Add remote_url: redis://YOUR_REDIS_IP:6379 to lmcache.yaml. With a shared Redis backend, all vLLM workers across nodes share the same prefix cache pool. The first worker that computes a given prefix writes it to Redis; every subsequent worker on any node serves it from cache.
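For step 1, the token logging can be as small as an accumulator wrapped around each LLM call. The sketch below is a hypothetical helper that applies the 10:1 and 50:1 thresholds from that step; how you obtain the per-call token counts (usually from the response's usage field) depends on your client.

```python
class TokenRatioProfiler:
    """Accumulate per-call token counts and classify the workload."""

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0
        self.calls = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Call once per LLM request with the counts reported by the server."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.calls += 1

    def ratio(self) -> float:
        """Aggregate input:output ratio across all recorded calls."""
        return self.input_tokens / max(self.output_tokens, 1)

    def verdict(self) -> str:
        r = self.ratio()
        if r > 50:
            return "context cost dominates: prioritize prefix caching"
        if r > 10:
            return "context engineering outweighs model-level tuning"
        return "context is not yet the dominant cost"


profiler = TokenRatioProfiler()
profiler.record(20_000, 200)   # early turn
profiler.record(80_000, 300)   # late turn
print(f"{profiler.ratio():.0f}:1 -> {profiler.verdict()}")
```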

Frequently Asked Questions

What is context engineering?

Context engineering is the practice of designing and managing what goes into an AI agent's context window to minimize compute cost without degrading output quality. In 2026, agents routinely send 50,000-500,000 input tokens per request against only a few hundred output tokens. Because LLM APIs price input and output tokens differently, and because compute cost scales with context, deciding what to include, cache, compress, or prune is the biggest cost lever in agentic AI systems.

What is KV cache hit rate, and why does it matter?

KV cache hit rate is the fraction of input tokens for which pre-computed key-value attention tensors are served from cache rather than recomputed via a full prefill pass. At 100:1 input:output token ratios, prefill compute accounts for roughly 85-95% of total GPU time per request. A 90% KV cache hit rate means the server skips 90% of that prefill work, typically reducing TTFT from several seconds to under a second and cutting effective compute cost per request by 80-90%.

How much VRAM does long-context inference need?

KV cache VRAM scales linearly with context length. For a 70B-class model at BF16 with one concurrent user: 32K context = ~10.7 GB, 128K = ~42.9 GB, 512K = ~171 GB, 1M = ~343 GB. FP8 KV quantization halves these figures. In practice, an H200 SXM5 (141 GB HBM3e) can serve a 70B model with full KV cache at up to ~128K context per user slot when both weights and KV cache are at FP8 precision. Beyond 512K tokens, NVMe KV offload or sequence parallelism is required.

What is RadixAttention?

RadixAttention is SGLang's prefix caching mechanism. It stores KV activations in a radix tree indexed by token sequence. When multiple requests share a common prefix (a system prompt, tool definitions, or conversation history), SGLang reuses the cached KV tensors instead of recomputing them. Workloads with 60%+ prefix overlap achieve 75-95% cache hit rates, reducing TTFT from seconds to hundreds of milliseconds on long shared prefixes. vLLM implements the same concept as automatic prefix caching at the block level.

When is self-hosting cheaper than a managed API?

The crossover depends on context length and monthly volume. At 100K input tokens per request and GPT-4o API pricing, a single long-context request costs roughly $0.25-0.50. A dedicated H200 on Spheron at on-demand pricing amortizes to under $0.003 per 100K tokens at 50% utilization. For agents making more than roughly 14,000 long-context calls per month (at $0.25/call) or 7,000 calls per month (at $0.50/call), self-hosted GPU is typically cheaper. For sub-1,000 call volumes, managed APIs with prefix caching are usually more economical due to fixed GPU costs.
