GPU inference is expensive. At $2.54/hr per H100 SXM5 GPU, every token that doesn't need to be generated is money back in your budget. The problem is that most production LLM workloads repeat themselves constantly: agent frameworks send the same tool descriptions on every call, FAQ bots field near-identical questions from hundreds of users, RAG pipelines embed queries that differ by a word or two. Running a 70B model to answer "What are your business hours?" for the ten-thousandth time is pure waste.
Semantic caching intercepts those repeated queries before they reach the model. When a new request is semantically similar to a past one (above a configurable cosine similarity threshold), the cache returns the stored response in 3-8ms instead of 500-2000ms. Production hit rates on agent workflows and FAQ traffic typically land between 30% and 70%. This post covers the three caching layers, how to pick the right tool, and how to deploy a co-located cache-plus-inference stack on a single GPU node.
Three Caching Layers: Semantic, KV, and Prompt Cache
These are complementary layers, not alternatives. Most production stacks benefit from running all three.
| Layer | Where it operates | What it stores | Who manages it |
|---|---|---|---|
| Semantic cache | Application layer | Full LLM responses, indexed by query embedding | Application code / GPTCache |
| KV cache | GPU memory (inside model) | Attention key-value tensors for processed tokens | Inference framework (vLLM, SGLang) |
| Prompt cache / prefix cache | Inference framework | Computed prefill tensors for shared prefixes | SGLang RadixAttention / vLLM prefix cache (block-level hashing) |
Semantic cache sits furthest upstream. A query hits the cache proxy, gets embedded into a vector, and a nearest-neighbor search checks if any past response is similar enough to reuse. If yes, the LLM is never touched. If no, the request goes through to the model and the response gets stored for future hits.
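That decision flow can be sketched in a few lines. This is an illustrative brute-force version (the `lookup` and `cosine_sim` helpers are hypothetical names, not a library API); a production cache replaces the linear scan with an ANN index such as HNSW:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity; reduces to a plain dot product for unit-normalized vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vec: np.ndarray, cache: list, threshold: float = 0.92):
    # cache: list of (embedding, stored_response) pairs.
    # Return the closest stored response above the threshold, else None (miss).
    best_score, best_response = -1.0, None
    for vec, response in cache:
        score = cosine_sim(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

On a hit the stored response is returned directly; on a miss the request proceeds to the model and the new (embedding, response) pair is appended to the cache.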
KV cache operates entirely inside the GPU. The inference framework stores the key-value tensors computed during the attention pass for each token in the context. This avoids recomputing attention for tokens already processed, which matters most for long shared prefixes (system prompts, few-shot examples) and multi-turn conversations where the context grows with each turn.
Prompt cache / prefix cache (RadixAttention in SGLang, automatic block-level prefix caching in vLLM) is a smarter version of KV caching that identifies identical prompt prefixes across requests and reuses their computed tensors. Teams using a consistent system prompt see 20-40% compute reduction from this alone. For a full walkthrough of SGLang's RadixAttention behavior and tuning options, see the SGLang production deployment guide.
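The 20-40% figure follows directly from the fraction of each prompt occupied by the shared prefix. A rough sketch (hypothetical `prefill_savings` helper; assumes prefill compute scales roughly linearly with token count, a first-order simplification):

```python
def prefill_savings(shared_prefix_tokens: int, avg_prompt_tokens: int) -> float:
    # First-order estimate: prefill compute scales roughly linearly with
    # token count, so a cached prefix skips its share of the prefill work.
    return shared_prefix_tokens / avg_prompt_tokens

# A fixed 600-token system prompt inside an average 2,000-token prompt
# reuses 30% of prefill compute, in line with the 20-40% range above.
saved = prefill_savings(600, 2000)
```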
See KV Cache Optimization Guide for the full breakdown of attention memory management.
When Semantic Caching Works (and When It Doesn't)
High-hit workloads
These workloads see 40-70% cache hit rates in practice:
- Agent frameworks: LangChain, LlamaIndex, and custom agents call the same tool descriptions on every invocation. The system prompt and tool schemas are nearly identical across thousands of requests. Hit rates here often exceed 60%. For GPU infrastructure planning behind agentic workloads, see Agentic RAG on GPU Cloud.
- FAQ and support bots: Users ask similar questions in different words. "How do I reset my password?" and "I forgot my password, what do I do?" should map to the same cache entry with a threshold around 0.90.
- RAG over static documents: If the document corpus doesn't change frequently, similar queries hit overlapping chunks and generate similar answers. Caching makes sense when the underlying data is stable.
- Code generation with templated prompts: If your prompt structure is mostly fixed and only a small variable changes, the embedding similarity stays high.
Low-hit workloads
Don't bother with semantic caching for these:
- Creative generation (temperature > 0.5, unique prompts): Every request is intentionally different. The cache miss rate will be 95%+, and you're paying embedding overhead on every request for no benefit.
- Stateful multi-turn conversations: Each turn depends on the full conversation history, making queries unique by design. The context window changes with every message.
- Personalized recommendations: If the prompt encodes user-specific data, queries are structurally similar but semantically distinct. The cache might return the wrong user's recommendations.
Quick smell test before adding semantic caching:
- Do your users tend to ask the same things in different ways? If yes, proceed.
- Is your system prompt or tool schema repeated across most requests? Definitely add caching.
- Does your workload require fresh generation every time (creative, personalized)? Skip it.
Expected hit rates by workload:
| Workload type | Expected hit rate | Notes |
|---|---|---|
| FAQ bot (support, docs) | 50-70% | High repetition, well-defined query space |
| Agent tool calls (fixed schemas) | 40-65% | Tool descriptions dominate token count |
| RAG over static corpus | 30-50% | Depends on query diversity |
| Multi-turn chat | 5-15% | Context grows with each turn |
| Creative generation | 0-5% | Not a good fit |
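The hit rates above translate into average latency via a simple expectation. A sketch with assumed latency figures drawn from the ranges earlier in this post (`expected_latency_ms` is an illustrative helper, not a library function):

```python
def expected_latency_ms(hit_rate: float,
                        cache_hit_ms: float = 5.0,
                        embed_ms: float = 2.0,
                        llm_ms: float = 800.0) -> float:
    # Every request pays embedding + lookup; misses additionally pay the LLM call.
    hit_path = embed_ms + cache_hit_ms
    miss_path = embed_ms + cache_hit_ms + llm_ms
    return hit_rate * hit_path + (1 - hit_rate) * miss_path

faq = expected_latency_ms(0.60)      # ~327 ms average for a FAQ bot
creative = expected_latency_ms(0.02) # ~791 ms: barely better than no cache
```

This is also why low-hit workloads are a net loss: at a 2% hit rate you pay the embedding overhead on 100% of requests for almost no latency return.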
GPTCache vs Redis Vector Cache vs LangChain Cache: Feature Comparison
| Tool | Backend support | Similarity algorithm | Production-ready | Latency overhead | License |
|---|---|---|---|---|---|
| GPTCache | FAISS, Qdrant, Redis, Milvus, Weaviate | Cosine similarity (configurable) | Yes (0.1.44+) | 3-10ms | MIT |
| Redis Vector Cache (RediSearch) | Redis Stack | Cosine, IP, L2 | Yes | 2-5ms | RSAL (server-side) |
| LangChain InMemoryCache | In-process dict | Exact match only | Dev/test only | <1ms | MIT |
| LangChain RedisCache (langchain-community) | Redis | Exact match only | Yes | 2-5ms | MIT |
| Qdrant-backed custom cache | Qdrant | Cosine (HNSW index) | Yes | 3-8ms | Apache 2.0 |
When to use GPTCache: Python-native stacks where you want minimal setup. GPTCache wraps your OpenAI client with two lines of code and handles embedding, vector search, and response storage transparently. It supports the most backends and has the richest configuration API for threshold tuning and eviction policies.
When to use Redis Vector Cache: Multi-language or distributed deployments where cache state needs to be shared across multiple inference pods. Redis is the standard choice for teams already running Redis in their infrastructure. It supports filtering (metadata-based cache scoping) and has strong operational tooling.
When to use LangChain caches: Only for development or exact-match use cases. LangChain's InMemoryCache doesn't persist across restarts and only matches identical queries. RedisCache adds persistence but still does exact matching. Neither is suited for semantic similarity.
When to build a custom Qdrant-backed cache: If you need fine-grained control over the HNSW index parameters, payload filtering (e.g., cache lookups scoped to a user tier or tenant), or you want Qdrant's native REST API for cache management. The operational cost is higher than GPTCache but the flexibility is greater.
Embedding Model Selection for Cache Key Generation
The embedding model determines cache quality more than any other tuning choice. A poor embedding model maps semantically distinct queries to nearby vectors, causing false cache hits and hallucinated responses. A model that's too slow adds latency to every request, including cache hits.
| Model | Dims | Latency (p50, CPU) | Latency (p50, GPU) | Recall@10 (MTEB) | Use case |
|---|---|---|---|---|---|
| BGE-M3 (512-dim, MRL-truncated) | 512 | ~12ms | ~2ms | 0.78 | Low-latency cache lookups |
| BGE-M3 (1024-dim, native) | 1024 | ~18ms | ~3ms | 0.82 | Balanced |
| Qwen3-Embedding | 2048 | ~25ms | ~4ms | 0.87 | High-recall RAG caching |
| text-embedding-3-small | 1536 | API round-trip | N/A | 0.86 | Managed API (not self-hosted) |
BGE-M3's native output dimension is 1024. The 512-dim variant is produced via Matryoshka Representation Learning (MRL) truncation: the model is trained to produce useful embeddings at multiple dimension levels, so you can truncate to 512 without retraining and lose only a few recall points.
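The truncation itself is trivial. A minimal numpy sketch (hypothetical `mrl_truncate` helper; the renormalization step keeps cosine scores comparable after truncation):

```python
import numpy as np

def mrl_truncate(embedding: np.ndarray, dims: int = 512) -> np.ndarray:
    # MRL-trained models front-load information into the leading components,
    # so plain truncation keeps most of the signal. Re-normalize so that
    # cosine similarity scores stay on the same scale as the full vector.
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)
```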
BGE-M3 at 512 dimensions is the right starting point for most caching use cases. It runs in 2ms on GPU, which keeps the cache lookup latency well below the variance of a real LLM call. The recall is adequate for most FAQ and agent workloads.
Qwen3-Embedding is worth the extra 2ms if your query space is complex (long, technical queries with nuanced differences) and you need tighter recall to avoid false hits on RAG pipelines. Don't use it on the critical path for latency-sensitive applications unless your p99 budget allows 25-30ms for the embedding step.
Avoid large 1536-dim models on GPU for cache lookups. The vector search cost scales with dimension count, and the recall improvement over 512-dim models is marginal for short-to-medium prompts. Dimension reduction (via PCA or the model's built-in Matryoshka support) is worth exploring if you're storing millions of embeddings.
For TEI deployment details, see Self-Host Embeddings and Rerankers on GPU Cloud.
Memory sizing for the vector store: A 512-dim float32 embedding takes 2 KB per entry. One million cache entries in FAISS or Qdrant occupies ~2 GB RAM. For a 10M-entry cache, budget 20 GB RAM for the vector index alone. Use float16 or product quantization to halve this when scaling.
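That sizing arithmetic as a small helper (illustrative; it counts raw vector storage only and ignores HNSW graph overhead, which typically adds another 10-30% on top):

```python
def index_ram_gb(entries: int, dims: int = 512, bytes_per_dim: int = 4) -> float:
    # float32 = 4 bytes per dimension; float16 halves this (bytes_per_dim=2).
    return entries * dims * bytes_per_dim / 1e9

one_million = index_ram_gb(1_000_000)                      # ~2 GB, as above
ten_million_fp16 = index_ram_gb(10_000_000, bytes_per_dim=2)  # ~10 GB
```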
Deploying the Full Stack on a Single Spheron GPU Node
For stacks serving models up to 13B parameters, a single H100 SXM5 GPU (80 GB VRAM) has enough memory to host all four services: the embedding model, vector store, vLLM serving endpoint, and cache proxy; the Compose file below provisions an 8-GPU node for extra throughput headroom. Co-location eliminates network round-trips between services, which keeps the cache overhead under 5ms. The cache proxy in this guide exposes an OpenAI-compatible endpoint. See how to self-host an OpenAI-compatible API with vLLM for the underlying serving setup.
Client Request
|
v
Cache Proxy (FastAPI + GPTCache)
|
+-- Cache HIT (cosine similarity >= threshold)
| |
| v
| Return cached response immediately (~3-8ms)
|
+-- Cache MISS
|
v
[Embedding Model (TEI, BGE-M3)] <-- same node, <2ms
|
v
[Vector Store (Qdrant)] <-- same node, <1ms
|
No match found
|
v
[vLLM Endpoint] <-- same node, 200-2000ms
|
v
Store response in vector store
|
v
Return response to client

Docker Compose setup
version: "3.9"
services:
embedding:
image: ghcr.io/huggingface/text-embeddings-inference:latest
command: --model-id BAAI/bge-m3 --port 8080 --max-batch-tokens 65536
ports:
- "8080:8080"
volumes:
- embedding_cache:/data
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [gpu]
networks:
- inference_net
qdrant:
image: qdrant/qdrant:latest
ports:
- "6333:6333"
volumes:
- qdrant_storage:/qdrant/storage
networks:
- inference_net
vllm:
image: vllm/vllm-openai:latest
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--port 8000
--dtype bfloat16
--max-model-len 8192
--tensor-parallel-size 8
--enable-prefix-caching
ports:
- "8000:8000"
volumes:
- model_cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0", "1", "2", "3", "4", "5", "6", "7"]
capabilities: [gpu]
networks:
- inference_net
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
cache_proxy:
image: python:3.11-slim
command: sh -c "pip install fastapi uvicorn httpx qdrant-client numpy && uvicorn cache_proxy:app --host 0.0.0.0 --port 8888"
ports:
- "8888:8888"
volumes:
- ./cache_proxy.py:/app/cache_proxy.py
working_dir: /app
depends_on:
- embedding
- qdrant
- vllm
networks:
- inference_net
environment:
- EMBEDDING_URL=http://embedding:8080
- QDRANT_URL=http://qdrant:6333
- VLLM_URL=http://vllm:8000
volumes:
embedding_cache:
qdrant_storage:
model_cache:
networks:
inference_net:

Note: GPU 0 is shared between the embedding model and vLLM. The embedding model uses only ~1.5 GB VRAM on GPU 0, leaving the remaining ~78.5 GB free for vLLM's tensor parallel shard on that GPU. vLLM uses all 8 GPUs with tp=8, which is a valid configuration for Llama 3.1 8B (the 8 KV heads divide evenly across 8 GPUs).
Cache proxy implementation
# cache_proxy.py
import hashlib
import json
import os
import time
import uuid
from typing import Optional
import httpx
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from qdrant_client import AsyncQdrantClient
from qdrant_client.models import Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams
from contextlib import asynccontextmanager
EMBEDDING_URL = os.environ.get("EMBEDDING_URL", "http://localhost:8080")
QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")
VLLM_URL = os.environ.get("VLLM_URL", "http://localhost:8000")
COLLECTION_NAME = "llm_cache"
SIMILARITY_THRESHOLD = 0.92
TTL_SECONDS = 72 * 3600 # 72-hour default TTL
qdrant = AsyncQdrantClient(url=QDRANT_URL)
@asynccontextmanager
async def lifespan(app: FastAPI):
# Create collection on startup if it doesn't exist
try:
await qdrant.create_collection(
collection_name=COLLECTION_NAME,
vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
except Exception:
pass # collection already exists
yield
await qdrant.close()
app = FastAPI(lifespan=lifespan)
class ChatRequest(BaseModel):
model: str
messages: list[dict]
temperature: Optional[float] = 0.0
max_tokens: Optional[int] = 512
async def embed_query(text: str) -> list[float]:
async with httpx.AsyncClient() as client:
response = await client.post(
f"{EMBEDDING_URL}/embed",
json={"inputs": text},
timeout=10.0,
)
response.raise_for_status()
return response.json()[0]
async def cache_lookup(vector: list[float], context_hash: str) -> Optional[str]:
results = await qdrant.search(
collection_name=COLLECTION_NAME,
query_vector=vector,
query_filter=Filter(
must=[FieldCondition(key="context_hash", match=MatchValue(value=context_hash))]
),
limit=1,
score_threshold=SIMILARITY_THRESHOLD,
)
if results and results[0].score >= SIMILARITY_THRESHOLD:
payload = results[0].payload
# Check TTL
if time.time() - payload.get("created_at", 0) < TTL_SECONDS:
return payload.get("response")
return None
async def cache_store(query_text: str, vector: list[float], response: str, context_hash: str) -> None:
point_id = str(uuid.UUID(bytes=hashlib.sha256((context_hash + query_text).encode()).digest()[:16]))
await qdrant.upsert(
collection_name=COLLECTION_NAME,
points=[
PointStruct(
id=point_id,
vector=vector,
payload={
"query": query_text,
"response": response,
"context_hash": context_hash,
"created_at": time.time(),
},
)
],
)
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
# Only cache near-deterministic requests (temperature <= 0.1); higher
# temperatures produce intentionally varied outputs, so bypass the cache
if request.temperature is None or request.temperature > 0.1:
async with httpx.AsyncClient() as client:
resp = await client.post(
f"{VLLM_URL}/v1/chat/completions",
json=request.model_dump(),
timeout=120.0,
)
resp.raise_for_status()
return resp.json()
# Extract the last user message as the semantic query text
user_messages = [m for m in request.messages if m.get("role") == "user"]
if not user_messages:
raise HTTPException(status_code=400, detail="No user message found")
query_text = user_messages[-1]["content"]
# Hash the system prompt and prior conversation context to namespace the
# cache. Without this, two requests with the same final user message but
# different system prompts (different personas, RAG contexts, or tool
# schemas) would share a cache entry and receive the wrong response.
prior_context_str = json.dumps(request.messages[:-1], sort_keys=True)
context_hash = hashlib.sha256(prior_context_str.encode()).hexdigest()[:16]
vector = None
try:
vector = await embed_query(query_text)
cached_response = await cache_lookup(vector, context_hash)
except Exception:
cached_response = None # cache unavailable; fall through to vLLM
if cached_response:
# Return cached response in OpenAI format
return {
"id": f"cache-{uuid.uuid4().hex[:8]}",
"object": "chat.completion",
"model": request.model,
"choices": [
{
"index": 0,
"message": {"role": "assistant", "content": cached_response},
"finish_reason": "stop",
}
],
"usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
}
# Cache miss: call vLLM
async with httpx.AsyncClient() as client:
resp = await client.post(
f"{VLLM_URL}/v1/chat/completions",
json=request.model_dump(),
timeout=120.0,
)
resp.raise_for_status()
result = resp.json()
# Cache write is best-effort; failures must not block the caller
try:
response_text = result["choices"][0]["message"]["content"]
if vector is not None:
await cache_store(query_text, vector, response_text, context_hash)
except Exception:
pass # log error; cache write failure is non-fatal
return result

Spheron provisioning
To deploy this stack on Spheron, provision an H100 SXM5 on-demand instance and clone your repository to the node. The 8-GPU H100 SXM5 configuration provides enough VRAM (640 GB total) to run the embedding model, Qdrant, and a 70B model with FP8 quantization simultaneously. For smaller models (7B-13B), an L40S PCIe node at $0.72/hr keeps costs low while still benefiting from co-location.
Tuning Similarity Thresholds and Eviction Policies
The similarity threshold is the most important knob in your cache configuration. Set it too low and you return wrong answers (hallucinated cache hits). Set it too high and your hit rate collapses to near zero.
Understanding the score distribution: When you first deploy, instrument your cache to log the cosine similarity score of every lookup (hits and near-misses). Plot a histogram. You'll typically see a bimodal distribution: a cluster of scores above 0.95 (genuine near-duplicates) and a cluster below 0.85 (unrelated queries). The dangerous zone is 0.88-0.94, where questions are topically related but not identical enough to share an answer.
Threshold tuning by workload:
| Workload | Recommended threshold | Risk of too-low | Risk of too-high |
|---|---|---|---|
| FAQ bot | 0.90-0.93 | Returning wrong answers to edge-case questions | Too many cache misses on paraphrase variants |
| RAG pipeline | 0.92-0.95 | Hallucinated responses from wrong cached context | Defeats the purpose of caching similar queries |
| Agent tool calls | 0.88-0.92 | Minor answer drift in tool output | Acceptable miss rate, mostly hits on identical calls |
| Code generation | 0.94-0.97 | Code for wrong task returned | Near-identical prompts only |
Start at 0.92 for any factual workload. Monitor false-positive rate (defined as: a user follow-up indicates the cached answer was wrong) for 48 hours. Adjust in 0.01 increments.
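One way to pick the starting threshold from that logged data is a simple sweep over candidate values, accepting the lowest threshold whose false-positive rate stays under budget (illustrative sketch with hypothetical review data; `sweep_thresholds` is not a GPTCache or Qdrant API):

```python
def sweep_thresholds(logged, max_false_positive_rate=0.01):
    # logged: list of (similarity_score, answer_was_correct) pairs gathered
    # from manually reviewing would-be cache hits.
    for t in [x / 100 for x in range(85, 99)]:
        hits = [(s, ok) for s, ok in logged if s >= t]
        if not hits:
            continue
        fp_rate = sum(1 for _, ok in hits if not ok) / len(hits)
        if fp_rate <= max_false_positive_rate:
            return t  # lowest acceptable threshold maximizes hit rate
    return None
```

The lowest acceptable threshold is the right pick because every extra 0.01 of threshold costs genuine paraphrase hits without buying additional safety.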
Eviction policies:
- TTL eviction is the right default. Set TTL based on how often your underlying facts change:
- News/current events: 24 hours
- Stable factual content (documentation, product specs): 72 hours
- Static FAQ responses: 7 days
- Never: only for truly immutable content (math definitions, historical facts)
- LRU eviction makes sense when your cache is memory-constrained and you want to keep the most recently accessed entries. LRU alone without TTL risks serving stale answers indefinitely.
- Capacity-based eviction: Set a maximum entry count (e.g., 500K entries) and evict least-recently-used entries when the limit is reached. Combine with TTL for the best results.
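Combining TTL with capacity-based LRU eviction, as recommended above, fits in a short sketch (illustrative in-memory version; a Qdrant or Redis deployment would enforce the same policy via payload timestamps and scheduled cleanup jobs):

```python
import time
from collections import OrderedDict

class EvictingCache:
    # TTL + capacity-based LRU eviction combined, per the guidance above.
    def __init__(self, ttl_seconds: float = 72 * 3600, max_entries: int = 500_000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.entries = OrderedDict()  # key -> (response, created_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self.entries.get(key)
        if item is None:
            return None
        response, created_at = item
        if now - created_at >= self.ttl:
            del self.entries[key]      # TTL expiry: stale entry is dropped
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return response

    def put(self, key, response, now=None):
        now = time.time() if now is None else now
        self.entries[key] = (response, now)
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used
```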
Cache poisoning mitigation: Validate and sanitize all inputs before they hit the embedding model. An adversarial user who crafts a query specifically to collide with another query's embedding can cause the wrong response to be served. Namespace the cache by user tier, model version, or system prompt hash to limit the blast radius.
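A minimal sketch of that namespacing (hypothetical `cache_namespace` helper; the proxy above applies the same idea with its `context_hash`):

```python
import hashlib

def cache_namespace(tenant_id: str, model_version: str, system_prompt: str) -> str:
    # Any change to tenant, model version, or system prompt lands in a fresh
    # namespace, so a poisoned or stale entry can never leak across boundaries.
    # \x1f is a separator so field boundaries can't be spoofed by concatenation.
    material = "\x1f".join([tenant_id, model_version, system_prompt])
    return hashlib.sha256(material.encode()).hexdigest()[:16]
```

Store the namespace as a payload field and require an exact match on it in every vector search, as the proxy's `cache_lookup` does with its Qdrant filter.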
Cost Math: GPU Hours Saved per 1M Requests
Live pricing as of 23 Apr 2026:
- H100 SXM5: $2.54/hr per GPU on-demand (8-GPU node: $20.32/hr)
- L40S PCIe: $0.72/hr per GPU on-demand
For a production FAQ bot serving 1M requests per day on a Llama 3.1 8B model:
Assumptions:
- Average prompt + response: 800 tokens
- vLLM throughput on H100: ~12,000 tokens/sec (batched)
- Time to process 1M requests without cache: 1M × 800 tokens / 12,000 tokens/sec = ~66,667 seconds = 18.5 GPU-hours
Cost comparison:
| Scenario | GPU-hours/day | Cost/day (H100, on-demand) | Cost/month |
|---|---|---|---|
| No caching (1M req/day) | 18.5 | $46.99 | ~$1,410 |
| 40% hit rate | 11.1 | $28.19 | ~$846 |
| 60% hit rate | 7.4 | $18.80 | ~$564 |
| 70% hit rate | 5.6 | $14.22 | ~$427 |
A 60% hit rate saves roughly $846/month on a single-GPU stack. At the L40S tier ($0.72/hr for smaller models), the absolute savings are lower but the relative percentage is identical.
Savings formula:
gpu_hours_saved = total_requests * hit_rate * avg_tokens_per_request / tokens_per_second / 3600
monthly_savings = gpu_hours_saved * 30 * gpu_hourly_rate

The cache itself adds minimal cost: Qdrant and the embedding model together use less than 5 GB VRAM and under $0.10/hr of the node's compute budget at low query volumes.
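The same formula as a checkable function (assumes total_requests is counted per day, matching the table above):

```python
def monthly_savings_usd(requests_per_day: int, hit_rate: float,
                        avg_tokens: int = 800,
                        tokens_per_second: float = 12_000,
                        gpu_hourly_rate: float = 2.54) -> float:
    # GPU-hours avoided per day: cached requests never touch the model.
    gpu_hours_saved_per_day = (
        requests_per_day * hit_rate * avg_tokens / tokens_per_second / 3600
    )
    return gpu_hours_saved_per_day * 30 * gpu_hourly_rate

# 1M req/day at a 60% hit rate: ~$846/month, matching the table above.
savings = monthly_savings_usd(1_000_000, 0.60)
```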
Spot vs on-demand: If your inference workload can tolerate occasional interruptions (batch jobs, async pipelines), spot instances on Spheron cut GPU costs significantly. Check current GPU pricing for spot vs on-demand rates, which fluctuate with supply.
For a full framework covering all inference cost levers beyond caching, see the GPU Cost Optimization Playbook.
Pricing fluctuates based on GPU availability. The prices above are based on 23 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Observability: Cache Hit Dashboards, Staleness, and Drift
Running a semantic cache without observability is the fastest way to start serving wrong answers at scale without noticing.
Three metrics to track:
- Hit rate (primary health signal):
cache_hits / (cache_hits + cache_misses). Track this per model, per endpoint, and per user cohort. A sudden drop in hit rate means your query distribution has shifted or TTL is evicting too aggressively.
- Similarity score distribution (quality signal): Log the cosine similarity score of every lookup. If the mean similarity score of cache hits starts declining over weeks (e.g., from 0.96 to 0.91), your embedding model's representation is diverging from your evolving query distribution. This is the "semantic drift" signal: time to re-evaluate the embedding model or rebuild the cache index.
- TTL expiry rate (freshness signal): How many entries expire before being accessed again? A high TTL expiry rate means your cache entries aren't hot enough to justify the TTL, and you're wasting storage on entries that will never hit.
Prometheus instrumentation (custom; GPTCache doesn't expose Prometheus metrics natively):
from prometheus_client import Counter, Histogram, start_http_server
cache_hits = Counter("cache_hits_total", "Total semantic cache hits")
cache_misses = Counter("cache_misses_total", "Total semantic cache misses")
similarity_scores = Histogram(
"cache_similarity_score",
"Cosine similarity scores for cache lookups",
buckets=[0.80, 0.85, 0.88, 0.90, 0.92, 0.94, 0.96, 0.98, 1.0],
)
cache_latency = Histogram(
"cache_lookup_latency_seconds",
"End-to-end cache lookup latency",
buckets=[0.001, 0.003, 0.005, 0.010, 0.020, 0.050],
)
start_http_server(9090)

Add these counters to the cache proxy's cache_lookup function: increment cache_hits on a hit, cache_misses on a miss, and observe the similarity score and latency on every call.
Drift detection: Set up a weekly job that computes the mean similarity score of the last 10,000 cache hits. If the mean drops more than 0.03 below your initial baseline, flag it for review. It means your user queries have evolved enough that the stored embeddings no longer represent the query space well. In most cases, rebuilding the cache (clearing entries and letting it warm up with new traffic) fixes the issue without retraining the embedding model.
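That weekly check is only a few lines (illustrative `drift_alert` helper; assumes you log hit scores somewhere queryable, e.g. via the Prometheus histogram above or an application log):

```python
def drift_alert(hit_scores: list, baseline_mean: float,
                window: int = 10_000, tolerance: float = 0.03) -> bool:
    # hit_scores: similarity scores of recent cache hits, oldest first.
    # Flag when the recent mean has slipped more than `tolerance` below
    # the baseline captured at deployment time.
    recent = hit_scores[-window:]
    if not recent:
        return False
    current_mean = sum(recent) / len(recent)
    return baseline_mean - current_mean > tolerance
```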
Conclusion
Semantic caching is one of the few inference optimizations that scales with usage: the more requests you get, the more your cache warms up and the higher your hit rate climbs. A 60% hit rate on a 1M-requests/day workload translates to roughly $846/month saved at H100 on-demand pricing, without changing the model or degrading response quality.
The key to getting there is the right stack: embedding model fast enough to stay off the critical path, similarity threshold tuned to your workload, and TTL policies matched to how often your underlying facts change. Co-locating the embedding model, vector store, and LLM on the same GPU node is the simplest architecture that satisfies all three requirements.
Semantic caching is cheapest when the embedding model, vector store, and LLM share the same GPU node. Spheron on-demand H100 instances let you co-locate all three in a single region with no egress fees between services.
Rent H100 → | View spot pricing → | Get started on Spheron →
