GPU inference is expensive. At $2.54/hr per H100 SXM5 GPU, every token that doesn't need to be generated is money back in your budget. The problem is that most production LLM workloads repeat themselves constantly: agent frameworks send the same tool descriptions on every call, FAQ bots field near-identical questions from hundreds of users, RAG pipelines embed queries that differ by a word or two. Running a 70B model to answer "What are your business hours?" for the ten-thousandth time is pure waste.
Semantic caching intercepts those repeated queries before they reach the model. When a new request is semantically similar to a past one (above a configurable cosine similarity threshold), the cache returns the stored response in 3-8ms instead of 500-2000ms. Production hit rates on agent workflows and FAQ traffic typically land between 30% and 70%. This post covers the three caching layers, how to pick the right tool, and how to deploy a co-located cache-plus-inference stack on a single GPU node.
Three Caching Layers: Semantic, KV, and Prompt Cache
These are complementary layers, not alternatives. Most production stacks benefit from running all three.
| Layer | Where it operates | What it stores | Who manages it |
|---|---|---|---|
| Semantic cache | Application layer | Full LLM responses, indexed by query embedding | Application code / GPTCache |
| KV cache | GPU memory (inside model) | Attention key-value tensors for processed tokens | Inference framework (vLLM, SGLang) |
| Prompt cache / prefix cache | Inference framework | Computed prefill tensors for shared prefixes | SGLang RadixAttention / vLLM prefix cache (block-level hashing) |
Semantic cache sits furthest upstream. A query hits the cache proxy, gets embedded into a vector, and a nearest-neighbor search checks if any past response is similar enough to reuse. If yes, the LLM is never touched. If no, the request goes through to the model and the response gets stored for future hits.
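That decision flow can be sketched in a few lines. This is an illustrative brute-force version (the `lookup` and `cosine_sim` helpers are hypothetical names, not a library API); a production cache replaces the linear scan with an ANN index such as HNSW:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity; reduces to a plain dot product for unit-normalized vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vec: np.ndarray, cache: list, threshold: float = 0.92):
    # cache: list of (embedding, stored_response) pairs.
    # Return the closest stored response above the threshold, else None (miss).
    best_score, best_response = -1.0, None
    for vec, response in cache:
        score = cosine_sim(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

On a hit the stored response is returned directly; on a miss the request proceeds to the model and the new (embedding, response) pair is appended to the cache.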
KV cache operates entirely inside the GPU. The inference framework stores the key-value tensors computed during the attention pass for each token in the context. This avoids recomputing attention for tokens already processed, which matters most for long shared prefixes (system prompts, few-shot examples) and multi-turn conversations where the context grows with each turn.
Prompt cache / prefix cache (RadixAttention in SGLang, automatic block-level prefix caching in vLLM) is a smarter version of KV caching that identifies identical prompt prefixes across requests and reuses their computed tensors. Teams using a consistent system prompt see 20-40% compute reduction from this alone. For a full walkthrough of SGLang's RadixAttention behavior and tuning options, see the SGLang production deployment guide.
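The 20-40% figure follows directly from the fraction of each prompt occupied by the shared prefix. A rough sketch (hypothetical `prefill_savings` helper; assumes prefill compute scales roughly linearly with token count, a first-order simplification):

```python
def prefill_savings(shared_prefix_tokens: int, avg_prompt_tokens: int) -> float:
    # First-order estimate: prefill compute scales roughly linearly with
    # token count, so a cached prefix skips its share of the prefill work.
    return shared_prefix_tokens / avg_prompt_tokens

# A fixed 600-token system prompt inside an average 2,000-token prompt
# reuses 30% of prefill compute, in line with the 20-40% range above.
saved = prefill_savings(600, 2000)
```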
See KV Cache Optimization Guide for the full breakdown of attention memory management.
When Semantic Caching Works (and When It Doesn't)
High-hit workloads
These workloads see 40-70% cache hit rates in practice:
- Agent frameworks: LangChain, LlamaIndex, and custom agents call the same tool descriptions on every invocation. The system prompt and tool schemas are nearly identical across thousands of requests. Hit rates here often exceed 60%. For GPU infrastructure planning behind agentic workloads, see Agentic RAG on GPU Cloud.
- FAQ and support bots: Users ask similar questions in different words. "How do I reset my password?" and "I forgot my password, what do I do?" should map to the same cache entry with a threshold around 0.90.
- RAG over static documents: If the document corpus doesn't change frequently, similar queries hit overlapping chunks and generate similar answers. Caching makes sense when the underlying data is stable.
- Code generation with templated prompts: If your prompt structure is mostly fixed and only a small variable changes, the embedding similarity stays high.
Low-hit workloads
Don't bother with semantic caching for these:
- Creative generation (temperature > 0.5, unique prompts): Every request is intentionally different. The cache miss rate will be 95%+, and you're paying embedding overhead on every request for no benefit.
- Stateful multi-turn conversations: Each turn depends on the full conversation history, making queries unique by design. The context window changes with every message.
- Personalized recommendations: If the prompt encodes user-specific data, queries are structurally similar but semantically distinct. The cache might return the wrong user's recommendations.
Quick smell test before adding semantic caching:
- Do your users tend to ask the same things in different ways? If yes, proceed.
- Is your system prompt or tool schema repeated across most requests? Definitely add caching.
- Does your workload require fresh generation every time (creative, personalized)? Skip it.
Expected hit rates by workload:
| Workload type | Expected hit rate | Notes |
|---|---|---|
| FAQ bot (support, docs) | 50-70% | High repetition, well-defined query space |
| Agent tool calls (fixed schemas) | 40-65% | Tool descriptions dominate token count |
| RAG over static corpus | 30-50% | Depends on query diversity |
| Multi-turn chat | 5-15% | Context grows with each turn |
| Creative generation | 0-5% | Not a good fit |
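The hit rates above translate into average latency via a simple expectation. A sketch with assumed latency figures drawn from the ranges earlier in this post (`expected_latency_ms` is an illustrative helper, not a library function):

```python
def expected_latency_ms(hit_rate: float,
                        cache_hit_ms: float = 5.0,
                        embed_ms: float = 2.0,
                        llm_ms: float = 800.0) -> float:
    # Every request pays embedding + lookup; misses additionally pay the LLM call.
    hit_path = embed_ms + cache_hit_ms
    miss_path = embed_ms + cache_hit_ms + llm_ms
    return hit_rate * hit_path + (1 - hit_rate) * miss_path

faq = expected_latency_ms(0.60)      # ~327 ms average for a FAQ bot
creative = expected_latency_ms(0.02) # ~791 ms: barely better than no cache
```

This is also why low-hit workloads are a net loss: at a 2% hit rate you pay the embedding overhead on 100% of requests for almost no latency return.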
GPTCache vs Redis Vector Cache vs LangChain Cache: Feature Comparison
| Tool | Backend support | Similarity algorithm | Production-ready | Latency overhead | License |
|---|---|---|---|---|---|
| GPTCache | FAISS, Qdrant, Redis, Milvus, Weaviate | Cosine similarity (configurable) | Yes (0.1.44+) | 3-10ms | MIT |
| Redis Vector Cache (RediSearch) | Redis Stack | Cosine, IP, L2 | Yes | 2-5ms | RSAL (server-side) |
| LangChain InMemoryCache | In-process dict | Exact match only | Dev/test only | <1ms | MIT |
| LangChain RedisCache (langchain-community) | Redis | Exact match only | Yes | 2-5ms | MIT |
| Qdrant-backed custom cache | Qdrant | Cosine (HNSW index) | Yes | 3-8ms | Apache 2.0 |
When to use GPTCache: Python-native stacks where you want minimal setup. GPTCache wraps your OpenAI client with two lines of code and handles embedding, vector search, and response storage transparently. It supports the most backends and has the richest configuration API for threshold tuning and eviction policies.
When to use Redis Vector Cache: Multi-language or distributed deployments where cache state needs to be shared across multiple inference pods. Redis is the standard choice for teams already running Redis in their infrastructure. It supports filtering (metadata-based cache scoping) and has strong operational tooling.
When to use LangChain caches: Only for development or exact-match use cases. LangChain's InMemoryCache doesn't persist across restarts and only matches identical queries. RedisCache adds persistence but still does exact matching. Neither is suited for semantic similarity.
When to build a custom Qdrant-backed cache: If you need fine-grained control over the HNSW index parameters, payload filtering (e.g., cache lookups scoped to a user tier or tenant), or you want Qdrant's native REST API for cache management. The operational cost is higher than GPTCache but the flexibility is greater.
Embedding Model Selection for Cache Key Generation
The embedding model determines cache quality more than any other tuning choice. A poor embedding model maps semantically distinct queries to nearby vectors, causing false cache hits and hallucinated responses. A model that's too slow adds latency to every request, including cache hits.
| Model | Dims | Latency (p50, CPU) | Latency (p50, GPU) | Recall@10 (MTEB) | Use case |
|---|---|---|---|---|---|
| BGE-M3 (512-dim, MRL-truncated) | 512 | ~12ms | ~2ms | 0.78 | Low-latency cache lookups |
| BGE-M3 (1024-dim, native) | 1024 | ~18ms | ~3ms | 0.82 | Balanced |
| Qwen3-Embedding | 2048 | ~25ms | ~4ms | 0.87 | High-recall RAG caching |
| text-embedding-3-small | 1536 | API round-trip | N/A | 0.86 | Managed API (not self-hosted) |
BGE-M3's native output dimension is 1024. The 512-dim variant is produced via Matryoshka Representation Learning (MRL) truncation: the model is trained to produce useful embeddings at multiple dimension levels, so you can truncate to 512 without retraining and lose only a few recall points.
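The truncation itself is trivial. A minimal numpy sketch (hypothetical `mrl_truncate` helper; the renormalization step keeps cosine scores comparable after truncation):

```python
import numpy as np

def mrl_truncate(embedding: np.ndarray, dims: int = 512) -> np.ndarray:
    # MRL-trained models front-load information into the leading components,
    # so plain truncation keeps most of the signal. Re-normalize so that
    # cosine similarity scores stay on the same scale as the full vector.
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)
```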
BGE-M3 at 512 dimensions is the right starting point for most caching use cases. It runs in 2ms on GPU, which keeps the cache lookup latency well below the variance of a real LLM call. The recall is adequate for most FAQ and agent workloads.
Qwen3-Embedding is worth the extra 2ms if your query space is complex (long, technical queries with nuanced differences) and you need tighter recall to avoid false hits on RAG pipelines. Don't use it on the critical path for latency-sensitive applications unless your p99 budget allows 25-30ms for the embedding step.
Avoid large 1536-dim models on GPU for cache lookups. The vector search cost scales with dimension count, and the recall improvement over 512-dim models is marginal for short-to-medium prompts. Dimension reduction (via PCA or the model's built-in Matryoshka support) is worth exploring if you're storing millions of embeddings.
For TEI deployment details, see Self-Host Embeddings and Rerankers on GPU Cloud.
Memory sizing for the vector store: A 512-dim float32 embedding takes 2 KB per entry. One million cache entries in FAISS or Qdrant occupies ~2 GB RAM. For a 10M-entry cache, budget 20 GB RAM for the vector index alone. Use float16 or product quantization to halve this when scaling.
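That sizing arithmetic as a small helper (illustrative; it counts raw vector storage only and ignores HNSW graph overhead, which typically adds another 10-30% on top):

```python
def index_ram_gb(entries: int, dims: int = 512, bytes_per_dim: int = 4) -> float:
    # float32 = 4 bytes per dimension; float16 halves this (bytes_per_dim=2).
    return entries * dims * bytes_per_dim / 1e9

one_million = index_ram_gb(1_000_000)                      # ~2 GB, as above
ten_million_fp16 = index_ram_gb(10_000_000, bytes_per_dim=2)  # ~10 GB
```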
Deploying the Full Stack on a Single Spheron GPU Node
For stacks serving models up to 13B parameters, a single H100 SXM5 GPU (80 GB VRAM) has enough memory to host all four services: the embedding model, vector store, vLLM serving endpoint, and cache proxy; the Compose file below provisions an 8-GPU node for extra throughput headroom. Co-location eliminates network round-trips between services, which keeps the cache overhead under 5ms. The cache proxy in this guide exposes an OpenAI-compatible endpoint. See how to self-host an OpenAI-compatible API with vLLM for the underlying serving setup.
Client Request
|
v
Cache Proxy (FastAPI + GPTCache)
|
+-- Cache HIT (cosine similarity >= threshold)
| |
| v
| Return cached response immediately (~3-8ms)
|
+-- Cache MISS
|
v
[Embedding Model (TEI, BGE-M3)] <-- same node, <2ms
|
v
[Vector Store (Qdrant)] <-- same node, <1ms
|
No match found
|
v
[vLLM Endpoint] <-- same node, 200-2000ms
|
v
Store response in vector store
|
v
Return response to client

Docker Compose setup
version: "3.9"
services:
embedding:
image: ghcr.io/huggingface/text-embeddings-inference:latest
command: --model-id BAAI/bge-m3 --port 8080 --max-batch-tokens 65536
ports:
- "8080:8080"
volumes:
- embedding_cache:/data
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [gpu]
networks:
- inference_net
qdrant:
image: qdrant/qdrant:latest
ports:
- "6333:6333"
volumes:
- qdrant_storage:/qdrant/storage
networks:
- inference_net
vllm:
image: vllm/vllm-openai:latest
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--port 8000
--dtype bfloat16
--max-model-len 8192
--tensor-parallel-size 8
--enable-prefix-caching
ports:
- "8000:8000"
volumes:
- model_cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0", "1", "2", "3", "4", "5", "6", "7"]
capabilities: [gpu]
networks:
- inference_net
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
cache_proxy:
image: python:3.11-slim
command: sh -c "pip install fastapi uvicorn httpx qdrant-client numpy && uvicorn cache_proxy:app --host 0.0.0.0 --port 8888"
ports:
- "8888:8888"
volumes:
- ./cache_proxy.py:/app/cache_proxy.py
working_dir: /app
depends_on:
- embedding
- qdrant
- vllm
networks:
- inference_net
environment:
- EMBEDDING_URL=http://embedding:8080
- QDRANT_URL=http://qdrant:6333
- VLLM_URL=http://vllm:8000
volumes:
embedding_cache:
qdrant_storage:
model_cache:
networks:
inference_net:

Note: GPU 0 is shared between the embedding model and vLLM. The embedding model uses only ~1.5 GB VRAM on GPU 0, leaving the remaining ~78.5 GB free for vLLM's tensor parallel shard on that GPU. vLLM uses all 8 GPUs with tp=8, which is a valid configuration for Llama 3.1 8B (the 8 KV heads divide evenly across 8 GPUs).
Cache proxy implementation
# cache_proxy.py
import hashlib
import json
import os
import time
import uuid
from typing import Optional
import httpx
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from qdrant_client import AsyncQdrantClient
from qdrant_client.models import Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams
from contextlib import asynccontextmanager
EMBEDDING_URL = os.environ.get("EMBEDDING_URL", "http://localhost:8080")
QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")
VLLM_URL = os.environ.get("VLLM_URL", "http://localhost:8000")
COLLECTION_NAME = "llm_cache"
SIMILARITY_THRESHOLD = 0.92
TTL_SECONDS = 72 * 3600 # 72-hour default TTL
qdrant = AsyncQdrantClient(url=QDRANT_URL)
@asynccontextmanager
async def lifespan(app: FastAPI):
# Create collection on startup if it doesn't exist
try:
await qdrant.create_collection(
collection_name=COLLECTION_NAME,
vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
except Exception:
pass # collection already exists
yield
await qdrant.close()
app = FastAPI(lifespan=lifespan)
class ChatRequest(BaseModel):
model: str
messages: list[dict]
temperature: Optional[float] = 0.0
max_tokens: Optional[int] = 512
async def embed_query(text: str) -> list[float]:
async with httpx.AsyncClient() as client:
response = await client.post(
f"{EMBEDDING_URL}/embed",
json={"inputs": text},
timeout=10.0,
)
response.raise_for_status()
return response.json()[0]
async def cache_lookup(vector: list[float], context_hash: str) -> Optional[str]:
results = await qdrant.search(
collection_name=COLLECTION_NAME,
query_vector=vector,
query_filter=Filter(
must=[FieldCondition(key="context_hash", match=MatchValue(value=context_hash))]
),
limit=1,
score_threshold=SIMILARITY_THRESHOLD,
)
if results and results[0].score >= SIMILARITY_THRESHOLD:
payload = results[0].payload
# Check TTL
if time.time() - payload.get("created_at", 0) < TTL_SECONDS:
return payload.get("response")
return None
async def cache_store(query_text: str, vector: list[float], response: str, context_hash: str) -> None:
point_id = str(uuid.UUID(bytes=hashlib.sha256((context_hash + query_text).encode()).digest()[:16]))
await qdrant.upsert(
collection_name=COLLECTION_NAME,
points=[
PointStruct(
id=point_id,
vector=vector,
payload={
"query": query_text,
"response": response,
"context_hash": context_hash,
"created_at": time.time(),
},
)
],
)
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
# Only cache near-deterministic requests (temperature <= 0.1); higher
# temperatures produce intentionally varied outputs, so bypass the cache
if request.temperature is None or request.temperature > 0.1:
async with httpx.AsyncClient() as client:
resp = await client.post(
f"{VLLM_URL}/v1/chat/completions",
json=request.model_dump(),
timeout=120.0,
)
resp.raise_for_status()
return resp.json()
# Extract the last user message as the semantic query text
user_messages = [m for m in request.messages if m.get("role") == "user"]
if not user_messages:
raise HTTPException(status_code=400, detail="No user message found")
query_text = user_messages[-1]["content"]
# Hash the system prompt and prior conversation context to namespace the
# cache. Without this, two requests with the same final user message but
# different system prompts (different personas, RAG contexts, or tool
# schemas) would share a cache entry and receive the wrong response.
prior_context_str = json.dumps(request.messages[:-1], sort_keys=True)
context_hash = hashlib.sha256(prior_context_str.encode()).hexdigest()[:16]
vector = None
try:
vector = await embed_query(query_text)
cached_response = await cache_lookup(vector, context_hash)
except Exception:
cached_response = None # cache unavailable; fall through to vLLM
if cached_response:
# Return cached response in OpenAI format
return {
"id": f"cache-{uuid.uuid4().hex[:8]}",
"object": "chat.completion",
"model": request.model,
"choices": [
{
"index": 0,
"message": {"role": "assistant", "content": cached_response},
"finish_reason": "stop",
}
],
"usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
}
# Cache miss: call vLLM
async with httpx.AsyncClient() as client:
resp = await client.post(
f"{VLLM_URL}/v1/chat/completions",
json=request.model_dump(),
timeout=120.0,
)
resp.raise_for_status()
result = resp.json()
# Cache write is best-effort; failures must not block the caller
try:
response_text = result["choices"][0]["message"]["content"]
if vector is not None:
await cache_store(query_text, vector, response_text, context_hash)
except Exception:
pass # log error; cache write failure is non-fatal
return result

Spheron provisioning
To deploy this stack on Spheron, provision an H100 SXM5 on-demand instance and clone your repository to the node. The 8-GPU H100 SXM5 configuration provides enough VRAM (640 GB total) to run the embedding model, Qdrant, and a 70B model with FP8 quantization simultaneously. For smaller models (7B-13B), an L40S PCIe node at $0.72/hr keeps costs low while still benefiting from co-location.
Tuning Similarity Thresholds and Eviction Policies
The similarity threshold is the most important knob in your cache configuration. Set it too low and you return wrong answers (hallucinated cache hits). Set it too high and your hit rate collapses to near zero.
Understanding the score distribution: When you first deploy, instrument your cache to log the cosine similarity score of every lookup (hits and near-misses). Plot a histogram. You'll typically see a bimodal distribution: a cluster of scores above 0.95 (genuine near-duplicates) and a cluster below 0.85 (unrelated queries). The dangerous zone is 0.88-0.94, where questions are topically related but not identical enough to share an answer.
Threshold tuning by workload:
| Workload | Recommended threshold | Risk of too-low | Risk of too-high |
|---|---|---|---|
| FAQ bot | 0.90-0.93 | Returning wrong answers to edge-case questions | Too many cache misses on paraphrase variants |
| RAG pipeline | 0.92-0.95 | Hallucinated responses from wrong cached context | Defeats the purpose of caching similar queries |
| Agent tool calls | 0.88-0.92 | Minor answer drift in tool output | Acceptable miss rate, mostly hits on identical calls |
| Code generation | 0.94-0.97 | Code for wrong task returned | Near-identical prompts only |
Start at 0.92 for any factual workload. Monitor false-positive rate (defined as: a user follow-up indicates the cached answer was wrong) for 48 hours. Adjust in 0.01 increments.
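One way to pick the starting threshold from that logged data is a simple sweep over candidate values, accepting the lowest threshold whose false-positive rate stays under budget (illustrative sketch with hypothetical review data; `sweep_thresholds` is not a GPTCache or Qdrant API):

```python
def sweep_thresholds(logged, max_false_positive_rate=0.01):
    # logged: list of (similarity_score, answer_was_correct) pairs gathered
    # from manually reviewing would-be cache hits.
    for t in [x / 100 for x in range(85, 99)]:
        hits = [(s, ok) for s, ok in logged if s >= t]
        if not hits:
            continue
        fp_rate = sum(1 for _, ok in hits if not ok) / len(hits)
        if fp_rate <= max_false_positive_rate:
            return t  # lowest acceptable threshold maximizes hit rate
    return None
```

The lowest acceptable threshold is the right pick because every extra 0.01 of threshold costs genuine paraphrase hits without buying additional safety.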
Eviction policies:
- TTL eviction is the right default. Set TTL based on how often your underlying facts change:
- News/current events: 24 hours
- Stable factual content (documentation, product specs): 72 hours
- Static FAQ responses: 7 days
- Never: only for truly immutable content (math definitions, historical facts)
- LRU eviction makes sense when your cache is memory-constrained and you want to keep the most recently accessed entries. LRU alone without TTL risks serving stale answers indefinitely.
- Capacity-based eviction: Set a maximum entry count (e.g., 500K entries) and evict least-recently-used entries when the limit is reached. Combine with TTL for the best results.
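Combining TTL with capacity-based LRU eviction, as recommended above, fits in a short sketch (illustrative in-memory version; a Qdrant or Redis deployment would enforce the same policy via payload timestamps and scheduled cleanup jobs):

```python
import time
from collections import OrderedDict

class EvictingCache:
    # TTL + capacity-based LRU eviction combined, per the guidance above.
    def __init__(self, ttl_seconds: float = 72 * 3600, max_entries: int = 500_000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.entries = OrderedDict()  # key -> (response, created_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self.entries.get(key)
        if item is None:
            return None
        response, created_at = item
        if now - created_at >= self.ttl:
            del self.entries[key]      # TTL expiry: stale entry is dropped
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return response

    def put(self, key, response, now=None):
        now = time.time() if now is None else now
        self.entries[key] = (response, now)
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used
```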
Cache poisoning mitigation: Validate and sanitize all inputs before they hit the embedding model. An adversarial user who crafts a query specifically to collide with another query's embedding can cause the wrong response to be served. Namespace the cache by user tier, model version, or system prompt hash to limit the blast radius.
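A minimal sketch of that namespacing (hypothetical `cache_namespace` helper; the proxy above applies the same idea with its `context_hash`):

```python
import hashlib

def cache_namespace(tenant_id: str, model_version: str, system_prompt: str) -> str:
    # Any change to tenant, model version, or system prompt lands in a fresh
    # namespace, so a poisoned or stale entry can never leak across boundaries.
    # \x1f is a separator so field boundaries can't be spoofed by concatenation.
    material = "\x1f".join([tenant_id, model_version, system_prompt])
    return hashlib.sha256(material.encode()).hexdigest()[:16]
```

Store the namespace as a payload field and require an exact match on it in every vector search, as the proxy's `cache_lookup` does with its Qdrant filter.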
Cost Math: GPU Hours Saved per 1M Requests
Live pricing as of 23 Apr 2026:
- H100 SXM5: $2.54/hr per GPU on-demand (8-GPU node: $20.32/hr)
- L40S PCIe: $0.72/hr per GPU on-demand
For a production FAQ bot serving 1M requests per day on a Llama 3.1 8B model:
Assumptions:
- Average prompt + response: 800 tokens
- vLLM throughput on H100: ~12,000 tokens/sec (batched)
- Time to process 1M requests without cache: 1M × 800 tokens / 12,000 tokens/sec = ~66,667 seconds = 18.5 GPU-hours
Cost comparison:
| Scenario | GPU-hours/day | Cost/day (H100, on-demand) | Cost/month |
|---|---|---|---|
| No caching (1M req/day) | 18.5 | $46.99 | ~$1,410 |
| 40% hit rate | 11.1 | $28.19 | ~$846 |
| 60% hit rate | 7.4 | $18.80 | ~$564 |
| 70% hit rate | 5.6 | $14.22 | ~$427 |
A 60% hit rate saves roughly $846/month on a single-GPU stack. At the L40S tier ($0.72/hr for smaller models), the absolute savings are lower but the relative percentage is identical.
Savings formula:
gpu_hours_saved = total_requests * hit_rate * avg_tokens_per_request / tokens_per_second / 3600
monthly_savings = gpu_hours_saved * 30 * gpu_hourly_rate

The cache itself adds minimal cost: Qdrant and the embedding model together use less than 5 GB VRAM and under $0.10/hr of the node's compute budget at low query volumes.
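The same formula as a checkable function (assumes total_requests is counted per day, matching the table above):

```python
def monthly_savings_usd(requests_per_day: int, hit_rate: float,
                        avg_tokens: int = 800,
                        tokens_per_second: float = 12_000,
                        gpu_hourly_rate: float = 2.54) -> float:
    # GPU-hours avoided per day: cached requests never touch the model.
    gpu_hours_saved_per_day = (
        requests_per_day * hit_rate * avg_tokens / tokens_per_second / 3600
    )
    return gpu_hours_saved_per_day * 30 * gpu_hourly_rate

# 1M req/day at a 60% hit rate: ~$846/month, matching the table above.
savings = monthly_savings_usd(1_000_000, 0.60)
```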
Spot vs on-demand: If your inference workload can tolerate occasional interruptions (batch jobs, async pipelines), spot instances on Spheron cut GPU costs significantly. Check current GPU pricing for spot vs on-demand rates, which fluctuate with supply.
For a full framework covering all inference cost levers beyond caching, see the GPU Cost Optimization Playbook.
Pricing fluctuates based on GPU availability. The prices above are based on 23 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Observability: Cache Hit Dashboards, Staleness, and Drift
Running a semantic cache without observability is the fastest way to start serving wrong answers at scale without noticing.
Three metrics to track:
- Hit rate (primary health signal):
cache_hits / (cache_hits + cache_misses). Track this per model, per endpoint, and per user cohort. A sudden drop in hit rate means your query distribution has shifted or TTL is evicting too aggressively.
- Similarity score distribution (quality signal): Log the cosine similarity score of every lookup. If the mean similarity score of cache hits starts declining over weeks (e.g., from 0.96 to 0.91), your embedding model's representation is diverging from your evolving query distribution. This is the "semantic drift" signal: time to re-evaluate the embedding model or rebuild the cache index.
- TTL expiry rate (freshness signal): How many entries expire before being accessed again? A high TTL expiry rate means your cache entries aren't hot enough to justify the TTL, and you're wasting storage on entries that will never hit.
Prometheus instrumentation (custom; GPTCache doesn't expose Prometheus metrics natively):
from prometheus_client import Counter, Histogram, start_http_server
cache_hits = Counter("cache_hits_total", "Total semantic cache hits")
cache_misses = Counter("cache_misses_total", "Total semantic cache misses")
similarity_scores = Histogram(
"cache_similarity_score",
"Cosine similarity scores for cache lookups",
buckets=[0.80, 0.85, 0.88, 0.90, 0.92, 0.94, 0.96, 0.98, 1.0],
)
cache_latency = Histogram(
"cache_lookup_latency_seconds",
"End-to-end cache lookup latency",
buckets=[0.001, 0.003, 0.005, 0.010, 0.020, 0.050],
)
start_http_server(9090)

Add these counters to the cache proxy's cache_lookup function: increment cache_hits on a hit, cache_misses on a miss, and observe the similarity score and latency on every call.
Drift detection: Set up a weekly job that computes the mean similarity score of the last 10,000 cache hits. If the mean drops more than 0.03 below your initial baseline, flag it for review. It means your user queries have evolved enough that the stored embeddings no longer represent the query space well. In most cases, rebuilding the cache (clearing entries and letting it warm up with new traffic) fixes the issue without retraining the embedding model.
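That weekly check is only a few lines (illustrative `drift_alert` helper; assumes you log hit scores somewhere queryable, e.g. via the Prometheus histogram above or an application log):

```python
def drift_alert(hit_scores: list, baseline_mean: float,
                window: int = 10_000, tolerance: float = 0.03) -> bool:
    # hit_scores: similarity scores of recent cache hits, oldest first.
    # Flag when the recent mean has slipped more than `tolerance` below
    # the baseline captured at deployment time.
    recent = hit_scores[-window:]
    if not recent:
        return False
    current_mean = sum(recent) / len(recent)
    return baseline_mean - current_mean > tolerance
```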
Conclusion
Semantic caching is one of the few inference optimizations that scales with usage: the more requests you get, the more your cache warms up and the higher your hit rate climbs. A 60% hit rate on a 1M-requests/day workload translates to roughly $846/month saved at H100 on-demand pricing, without changing the model or degrading response quality.
The key to getting there is the right stack: embedding model fast enough to stay off the critical path, similarity threshold tuned to your workload, and TTL policies matched to how often your underlying facts change. Co-locating the embedding model, vector store, and LLM on the same GPU node is the simplest architecture that satisfies all three requirements.
Semantic caching is cheapest when the embedding model, vector store, and LLM share the same GPU node. Spheron on-demand H100 instances let you co-locate all three in a single region with no egress fees between services.
Rent H100 → | View spot pricing → | Get started on Spheron →
