Sleep-Time Compute on GPU Cloud: Pre-Compute Long-Lived Context for 5x Lower Query Latency

Every persistent coding agent query that loads a 200K-token codebase re-pays the full prefill cost from scratch. Every RAG request that prepends a 50K-token document corpus re-computes those KV blocks on every call. Unlike test-time compute scaling, which spends extra compute at query time to improve answer quality, sleep-time compute moves that prefill work to idle GPU cycles before the query arrives. The result: time to first token (TTFT) drops by 3-5x, and your serving fleet handles the same traffic with less compute.

What Is Sleep-Time Compute

The idea was formalized in 2025 research from UC Berkeley and Letta (Lin et al., "Sleep-time Compute: Beyond Inference Scaling at Test-time"): between user queries, GPU capacity sits partially idle. Sleep-time compute uses that idle capacity to run LLM inference in the background, pre-populating KV caches or generating compressed summaries against context that will almost certainly be needed again.

When a real query arrives, the serving instance checks the shared KV store for a pre-populated entry. If the context fingerprint matches, prefill for the shared context is skipped and only the new query tokens are processed before decode begins, which is why TTFT drops so dramatically.
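A minimal sketch of that lookup, assuming the `kv:ready:{fingerprint}` marker key used by the prefetch worker later in this article. The `normalize` helper and the in-memory `FakeStore` are hypothetical stand-ins (a real deployment would use a Redis client, and the worker shown later hashes the raw string). The marker only signals that a worker finished prefill for that context; the KV pages themselves are managed by the cache backend.

```python
import hashlib


def normalize(context: str) -> str:
    """Hypothetical normalization: unify line endings, strip trailing whitespace."""
    lines = context.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip()


def fingerprint(context: str) -> str:
    """Truncated SHA-256 of the normalized context (16 hex characters)."""
    return hashlib.sha256(normalize(context).encode("utf-8")).hexdigest()[:16]


class FakeStore:
    """In-memory stand-in for the two Redis calls used here (get/set)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def is_prepopulated(store, context: str) -> bool:
    """True when a sleep-time worker has marked this context as prefilled."""
    return store.get(f"kv:ready:{fingerprint(context)}") is not None


store = FakeStore()
codebase = "def add(a, b):\r\n    return a + b   \n"
store.set(f"kv:ready:{fingerprint(codebase)}", "1")

# Whitespace-only differences map to the same fingerprint, so this still hits.
same_code = "def add(a, b):\n    return a + b\n"
print(is_prepopulated(store, same_code))         # True
print(is_prepopulated(store, "print('other')"))  # False
```

If you normalize before hashing, send the normalized text to the model as well: prefix caching matches on exact tokens, so a fingerprint hit on text that tokenizes differently would still trigger a cold prefill.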

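The key insight is amortization. For a coding agent serving 1,000 queries per day against the same codebase, the prefill cost is paid 1,000 times. Sleep-time compute pays it once per cache TTL (typically 1-4 hours), so 6 to 24 times per day instead of 1,000. The pre-computation cost is spread over every query served within a TTL window: roughly 40 to 170 queries at this traffic level, and close to 1,000 only if the cached context stays valid for a full day.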
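The amortization arithmetic can be sketched as follows. This is a simplified model that assumes traffic is spread evenly over the day and that the cache is refreshed exactly once per TTL; the function name is illustrative only.

```python
def prefills_per_day(queries_per_day: int, ttl_hours: float) -> dict:
    """Compare inline prefill count with once-per-TTL background prefill."""
    refreshes = 24.0 / ttl_hours  # background prefills needed per day
    return {
        "inline_prefills": queries_per_day,
        "background_prefills": refreshes,
        "queries_per_prefill": queries_per_day / refreshes,
    }


for ttl in (1, 4, 24):
    stats = prefills_per_day(1000, ttl)
    print(f"TTL {ttl:>2}h: {stats['background_prefills']:.0f} background prefills/day, "
          f"{stats['queries_per_prefill']:.0f} queries served per prefill")
```

Longer TTLs amortize better but risk serving stale context, so the TTL should track how often the underlying codebase or corpus actually changes.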

Contrast this with test-time compute, which optimizes answer quality by spending more tokens thinking. Sleep-time compute does not change what the model outputs. It only changes when the expensive part of inference runs.

Approach	When compute runs	Optimizes for	Ideal workload
Standard inference	Query time	Correctness	Short, stateless queries
Test-time scaling	Query time (extended)	Answer quality	Reasoning, math, code
Sleep-time compute	Idle periods (pre-query)	Query latency	Long-lived shared context

Workload Patterns That Benefit

Persistent Coding Agents

A coding agent loads a codebase at the start of every session. For a 200K-token codebase (on the order of 20K lines of Python), each cold query on Llama 3.3 70B at FP16 costs 4-8 seconds TTFT on H100 before the model produces a single output token. That prefill is identical for every developer hitting the same repo.

Persistent agent memory systems reduce what needs to be passed in context on each turn. Sleep-time compute handles the remaining long-context cost by pre-filling the codebase prefix so the agent can skip that step at query time. The two techniques layer well: memory systems shrink the context, sleep-time compute pre-populates whatever context remains.

Customer-Support Knowledge Bases

A support bot that loads full conversation history plus product docs faces a growing context problem. Each new session turn adds tokens, and the full history is re-submitted on every turn. Sleep-time workers can pre-populate the context up to the most recent turn so the model's next response starts from a cache hit rather than a cold prefill.

For pipelines with stable system prompts and product documentation (tens of thousands of tokens), pre-computation amortizes across the entire user base. Every user hitting the same product docs shares a single pre-populated KV cache.

Document QA Over Static Corpora

A fixed PDF corpus, a regulatory document set, or a static knowledge base that multiple users query is a perfect sleep-time target. The context is identical across all queries. Pre-populating it once means thousands of queries can skip prefill entirely. Combine this with semantic caching for exact-hit deduplication when the same question recurs verbatim.

Codebase-Aware RAG

Retrieval indexes that change on a nightly schedule (re-indexed after code commits or document updates) have a 23-hour window where the retrieved chunks are stable. Sleep-time workers can pre-fill the top-K chunks from a retrieval index during the overnight idle window, reducing both retrieval latency and prefill cost for the next day's traffic.

Architecture

Idle-Time Worker Pools

The topology separates two fleets. The serving fleet runs on on-demand GPUs to guarantee SLO. The sleep-time worker fleet runs on spot GPUs and handles all background pre-computation.

Both fleets share a KV cache store via LMCache with a Redis backend. Workers write pre-populated KV pages to Redis with a TTL. Serving instances check Redis before prefill, and serve from cache on a hit.

[Serving fleet]            [Sleep-time worker fleet]
On-demand H100 x N    <-- Redis KV store --> Spot A100 x M
    |                                            |
Check cache hit                         Pop context queue
Skip prefill if hit                     Run prefill-only
Return decoded tokens                   Write KV pages to Redis

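Worker restarts on preemption do not lose completed work because the KV store is external. The next worker picks up the queue from where the previous one stopped; only an item that was mid-prefill when the worker was preempted needs to be enqueued again.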

KV Cache Pre-Population

The sleep-time worker runs prefill-only inference. No decode output is needed because the goal is to populate the KV cache, not generate tokens. vLLM's --enable-prefix-caching flag lets an instance reuse KV blocks for a matching prompt prefix, and the shared LMCache backend is what makes the worker's pre-populated blocks available to the serving instances.

For a KV cache pre-population strategy that handles multi-node deployments, LMCache v0.2+ is recommended. Check the LMCache documentation for version-specific vLLM compatibility before deploying.

bash

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

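FP8 KV cache quantization cuts KV cache VRAM roughly in half compared to FP16, with minimal quality impact on prefill-heavy workloads. For Llama 3.3 70B (80 layers, 8 KV heads, head dimension 128), a single 128K-token sequence needs roughly 20 GB of KV cache at FP8 versus about 40 GB at FP16. With the tensor-parallel setup above, the weights and the KV cache are both sharded across the H100 SXM5 GPUs, which leaves headroom for the cache alongside the model weights.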
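A back-of-the-envelope check on per-sequence KV cache size. The model shape below (80 layers, 8 grouped-query KV heads, head dimension 128) is taken from Llama 3.3 70B's published configuration and should be treated as an assumption to verify against the model's config file. Real usage adds block-allocation overhead and scales with the number of concurrent sequences.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """KV cache size for one sequence: keys plus values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed Llama 3.3 70B shape: 80 layers, 8 KV heads (GQA), head dimension 128.
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)
TOKENS = 128 * 1024

for name, width in (("fp16", 2), ("fp8", 1)):
    gib = kv_cache_bytes(TOKENS, bytes_per_value=width, **LLAMA_70B) / 2**30
    print(f"{name}: {gib:.1f} GiB per 128K-token sequence")
# fp16: 40.0 GiB per 128K-token sequence
# fp8: 20.0 GiB per 128K-token sequence
```

With tensor parallelism the KV heads are split across GPUs, so the per-GPU share is the total divided by the tensor-parallel degree (as long as that degree does not exceed the number of KV heads).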

Summary Distillation

For contexts longer than 64K tokens, pre-generating a compressed summary is often more practical than pre-filling the raw context. The serving model attends to the summary plus the current query rather than the full raw context. This trades a small quality loss for a 4-8x reduction in effective context length, which compounds the TTFT benefit.

Distillation runs during the same idle window as KV pre-population. A smaller summary model (7B-13B) handles distillation cheaply while the primary model handles inference.

Inferred-Question Prefetching

For FAQ-style knowledge bases, a lightweight question-prediction model can pre-fill not just the context but the most likely user queries. The worker generates the top N probable questions given the context, pre-fills those prompts, and caches the results. In our internal testing on FAQ-style workloads, cache hit rates of 30-60% are typical for structured FAQ corpora, meaning roughly a third to more than half of all serving queries skip prefill entirely.

Reference Implementation

Project Structure

sleep-time-compute/
├── docker-compose.yml
├── worker/
│   ├── prefetch_worker.py
│   └── scheduler.py
├── serving/
│   └── serving_vllm.py
└── redis/
    └── queue.py

Docker Compose

yaml

version: "3.9"

services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes

  sleep-worker:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - REDIS_URL=redis://redis:6379
      - MODEL_NAME=${MODEL_NAME:-meta-llama/Llama-3.3-70B-Instruct}
      - KV_CACHE_BUDGET_TOKENS=${KV_CACHE_BUDGET_TOKENS:-500000}
      - NVIDIA_VISIBLE_DEVICES=${SPOT_GPU_IDS:-0,1}
      - LMCACHE_STORAGE_BACKEND=redis
      - LMCACHE_REMOTE_URL=redis://redis:6379
    volumes:
      - ./worker:/app/worker
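    # The vllm-openai image's default entrypoint launches the API server,
    # so override it to run the worker script instead
    entrypoint: ["python3", "/app/worker/prefetch_worker.py"]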
    depends_on:
      - redis

  serving:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - REDIS_URL=redis://redis:6379
      - MODEL_NAME=${MODEL_NAME:-meta-llama/Llama-3.3-70B-Instruct}
      - NVIDIA_VISIBLE_DEVICES=${SERVING_GPU_IDS:-2,3}
      - LMCACHE_STORAGE_BACKEND=redis
      - LMCACHE_REMOTE_URL=redis://redis:6379
    ports:
      - "8000:8000"
    # These arguments are appended to the image's API-server entrypoint
    command: >
      --model ${MODEL_NAME:-meta-llama/Llama-3.3-70B-Instruct}
      --enable-prefix-caching
      --kv-cache-dtype fp8
      --tensor-parallel-size 2
    depends_on:
      - redis

volumes:
  redis_data:

Prefetch Worker

python

import hashlib
import logging
import os
import time

import redis
from vllm import LLM, SamplingParams

logging.basicConfig(level=logging.INFO)

r = redis.Redis.from_url(os.environ["REDIS_URL"])
# Prefix caching alone keeps KV blocks in this process's GPU memory. Sharing them
# with the serving fleet relies on the LMCache connector being configured for
# your vLLM/LMCache versions (see the LMCache documentation).
llm = LLM(
    model=os.environ.get("MODEL_NAME", "meta-llama/Llama-3.3-70B-Instruct"),
    enable_prefix_caching=True,
    kv_cache_dtype="fp8",
    tensor_parallel_size=2,
)

READY_TTL_SECONDS = 3600  # how long a context counts as pre-populated


def prefetch_context(context: str) -> str:
    """Prefill `context` once and mark its fingerprint as ready in Redis."""
    fingerprint = hashlib.sha256(context.encode()).hexdigest()[:16]
    # Check if already cached and fresh
    if r.get(f"kv:ready:{fingerprint}"):
        return fingerprint
    # Prefill-only: 1 token max to trigger KV population
    # max_tokens=0 is not supported in all vLLM versions; use 1 instead
    params = SamplingParams(max_tokens=1, temperature=0)
    llm.generate([context], params)
    r.setex(f"kv:ready:{fingerprint}", READY_TTL_SECONDS, "1")
    return fingerprint


def run_worker():
    while True:
        # Pop highest-priority context from sorted set. ZPOPMAX removes the item
        # before it is processed, so an item in flight during a spot preemption
        # stays dropped until the scheduler enqueues it again.
        items = r.zpopmax("prefetch:queue", count=1)
        if not items:
            time.sleep(5)
            continue
        context_key, score = items[0]
        key = context_key.decode()
        context = r.get(f"context:{key}")
        if context is None:
            logging.warning("context missing for key %s, item dropped", key)
            continue
        try:
            prefetch_context(context.decode())
            r.delete(f"context:{key}")
        except Exception:
            logging.exception("prefetch_context failed for key %s", key)
            # Requeue with original priority so the item is not permanently lost
            r.zadd("prefetch:queue", {key: score})
            time.sleep(5)  # brief pause so a persistent failure does not spin


if __name__ == "__main__":
    run_worker()

Scheduler

python

import redis
import os
import time
from datetime import datetime, timezone

r = redis.Redis.from_url(os.environ["REDIS_URL"])
DAILY_TOKEN_BUDGET = int(os.environ.get("KV_CACHE_BUDGET_TOKENS", 500_000))

# Atomically increment the budget counter and set the TTL only when none is
# present. Without atomicity, a crash between INCRBY and EXPIREAT leaves the
# key with TTL=-1 (never expires), permanently inflating the user's counter.
_BUDGET_LUA = """
local v = redis.call('INCRBY', KEYS[1], ARGV[1])
if redis.call('TTL', KEYS[1]) == -1 then
    redis.call('EXPIREAT', KEYS[1], ARGV[2])
end
return v
"""

def get_midnight_utc_timestamp() -> int:
    now = datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp()) + 86400

def check_budget(user_id: str, tokens: int) -> bool:
    key = f"user:{user_id}:bg_tokens"
    new_val = r.eval(_BUDGET_LUA, 1, key, tokens, get_midnight_utc_timestamp())
    if new_val > DAILY_TOKEN_BUDGET:
        r.decrby(key, tokens)
        return False
    return True

def enqueue_context(user_id: str, context_key: str, context_text: str, token_count: int, priority: float):
    """Queue a context for background prefill if the user's daily budget allows it."""
    if not check_budget(user_id, token_count):
        return
    # Store payload so the worker can retrieve it by key; TTL exceeds kv:ready TTL so
    # stale payloads self-expire even if the worker never processes them
    r.setex(f"context:{context_key}", 7200, context_text)
    # priority comes from compute_priority() below; ZPOPMAX serves the highest score first
    r.zadd("prefetch:queue", {context_key: priority})

def compute_priority(last_access_seconds_ago: float, access_count_7d: int) -> float:
    recency_weight = 0.7
    freq_weight = 0.3
    recency_score = 1.0 / (1.0 + last_access_seconds_ago / 3600)
    freq_score = min(access_count_7d / 100.0, 1.0)
    return recency_weight * recency_score + freq_weight * freq_score

Scheduler Design

Two triggers wake the sleep-time worker pool:

Scheduled cron runs during off-peak hours. A typical schedule: 0 2 * * * (2 AM UTC) pre-populates the highest-priority contexts for the day's traffic.

Queue depth threshold runs on demand. When the pending queue exceeds N items (e.g., 50), the scheduler wakes workers immediately rather than waiting for the cron window.
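The two triggers can be combined in a small decision function. This is a sketch under stated assumptions: the `should_wake` name is hypothetical, the threshold of 50 and the 02:00 UTC window mirror the examples above, and in a live system `queue_depth` would come from something like `ZCARD prefetch:queue` on the shared Redis instance.

```python
from datetime import datetime, timezone
from typing import Optional

QUEUE_DEPTH_THRESHOLD = 50  # the "N items" threshold from the text
OFF_PEAK_HOUR_UTC = 2       # mirrors the 0 2 * * * cron example


def should_wake(queue_depth: int, now: datetime) -> Optional[str]:
    """Return the name of the trigger that fired, or None to keep workers asleep."""
    if queue_depth > QUEUE_DEPTH_THRESHOLD:
        return "queue_depth"
    if now.hour == OFF_PEAK_HOUR_UTC and now.minute == 0:
        return "cron"
    return None


afternoon = datetime(2026, 5, 19, 14, 30, tzinfo=timezone.utc)
off_peak = datetime(2026, 5, 19, 2, 0, tzinfo=timezone.utc)

print(should_wake(75, afternoon))  # queue_depth
print(should_wake(10, off_peak))   # cron
print(should_wake(10, afternoon))  # None
```

Checking queue depth first means a backlog wakes workers immediately, even outside the scheduled window.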

Prioritization uses a simple scoring formula:

score = 0.7 * (1 / (1 + last_access_hours)) + 0.3 * min(access_count_7d / 100, 1.0)

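A context accessed 10 minutes ago has a recency term of about 0.60, so it scores between 0.60 and 0.90 depending on access frequency. A context untouched for 24 hours has a recency term of about 0.03. High-frequency contexts (100+ accesses in 7 days) earn the full 0.3 frequency term even when stale, so they still score about 0.33.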
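To make the formula concrete, the snippet below evaluates it for a few cases. It reuses the `compute_priority` definition from the scheduler code, with the weights inlined; the case labels are illustrative.

```python
def compute_priority(last_access_seconds_ago: float, access_count_7d: int) -> float:
    """Score = 0.7 * recency + 0.3 * frequency, as in the scheduler above."""
    recency = 1.0 / (1.0 + last_access_seconds_ago / 3600)
    frequency = min(access_count_7d / 100.0, 1.0)
    return 0.7 * recency + 0.3 * frequency


cases = {
    "10 min ago, never reused": (600, 0),
    "10 min ago, 100+ accesses": (600, 150),
    "24 h ago, never reused": (86_400, 0),
    "24 h ago, 100+ accesses": (86_400, 150),
}
for label, (age_s, count) in cases.items():
    print(f"{label}: {compute_priority(age_s, count):.3f}")
# 10 min ago, never reused: 0.600
# 10 min ago, 100+ accesses: 0.900
# 24 h ago, never reused: 0.028
# 24 h ago, 100+ accesses: 0.328
```

The maximum possible score is 1.0, reached only by a context that was accessed moments ago and has 100 or more accesses in the last 7 days.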

Per-user budget enforcement prevents any single user from consuming the entire background fleet. Redis counters with EXPIREAT set to next midnight UTC reset automatically. The budget key pattern is user:{id}:bg_tokens and increments on every enqueue, blocking further enqueues once the daily cap is reached.

Cache Layer Integration

Requests go through a layered cache stack. Each miss falls through to the next layer:

Query arrives
    |
[1] Semantic cache (GPTCache / Zilliz) -- exact/fuzzy hit for repeated questions
    | miss
[2] Sleep-time KV store (vLLM prefix cache + LMCache) -- pre-populated long context
    | miss
[3] Prompt cache (Anthropic API / vLLM built-in) -- static system prompt cache
    | miss
[4] Standard prefill + decode

Layer 1 catches exact and near-duplicate queries before they reach the model. Layer 2 catches queries against pre-populated contexts even when the query itself is novel. Layer 3 handles static system prompt caching. Layer 4 is the fallback.
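The fall-through logic can be sketched as an ordered list of lookups. Everything here is a hypothetical stand-in: the request shape, the in-memory sets and dictionary, and the layer functions are illustrative only and do not call GPTCache, LMCache, or any provider API.

```python
from typing import Callable, Optional

Request = dict  # hypothetical shape: {"system": str, "context_id": str, "query": str}
Lookup = Callable[[Request], Optional[str]]

# Hypothetical in-memory stand-ins for the three cache layers.
SEMANTIC_ANSWERS = {"how do i reset my password?": "cached answer"}
PREFILLED_CONTEXTS = {"product-docs-v7"}
CACHED_SYSTEM_PROMPTS = {"You are a support assistant."}


def semantic_cache(req: Request) -> Optional[str]:
    return SEMANTIC_ANSWERS.get(req["query"].strip().lower())


def sleep_time_kv(req: Request) -> Optional[str]:
    return req["context_id"] if req["context_id"] in PREFILLED_CONTEXTS else None


def prompt_cache(req: Request) -> Optional[str]:
    return req["system"] if req["system"] in CACHED_SYSTEM_PROMPTS else None


LAYERS: list[tuple[str, Lookup]] = [
    ("semantic_cache", semantic_cache),
    ("sleep_time_kv", sleep_time_kv),
    ("prompt_cache", prompt_cache),
]


def route(req: Request) -> str:
    """Return the first layer that hits; fall through to a full prefill otherwise."""
    for name, lookup in LAYERS:
        if lookup(req) is not None:
            return name
    return "standard_prefill"


base = {"system": "You are a support assistant.", "context_id": "product-docs-v7"}
print(route({**base, "query": "How do I reset my password?"}))  # semantic_cache
print(route({**base, "query": "Does v7 support SSO?"}))         # sleep_time_kv
print(route({"system": "other", "context_id": "new-docs", "query": "hi"}))  # standard_prefill
```

Ordering the layers from cheapest to most expensive keeps the common case fast: a semantic hit never touches the model, and a sleep-time KV hit only pays for the new query tokens.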

For workloads where the KV cache grows beyond GPU VRAM, cold cache pages can be evicted to NVMe and reloaded in roughly 50ms via NVMe KV cache offloading. This extends effective context depth well beyond what GPU VRAM alone can hold.

GPU Selection and Pricing

Sleep-time workers and serving instances have different requirements. Workers need enough VRAM for prefill-only inference, accept spot preemption, and run during off-peak hours. Serving instances need stable on-demand capacity and responsive decode throughput.

GPU	VRAM	On-Demand	Spot	Best for
H100 SXM5	80 GB	$2.57/hr	$1.66/hr	Serving + prefill of 70B+ models
A100 80GB PCIe	80 GB	$1.04/hr	$1.14/hr*	Sleep-time workers on 7B-70B models
L40S PCIe	48 GB	$0.72/hr	$1.07/hr*	Smaller models, summary distillation

*A100 80GB PCIe and L40S PCIe spot are currently priced above their on-demand rates, which is unusual. Use on-demand for predictable budgeting on those GPUs, or switch to H100 SXM5 spot at $1.66/hr for 70B models.

Teams running sleep-time workers for Llama 3.3 70B typically provision H100 SXM5 instances on Spheron for serving and spot H100 SXM5 rentals for background pre-computation. H100 SXM5 spot at $1.66/hr covers 70B prefill comfortably with FP8 KV quantization. If you prefer a lower-cost worker GPU, A100 80GB PCIe on-demand at $1.04/hr is currently cheaper than its spot price and works well for 70B models with tensor parallelism.

Pricing fluctuates based on GPU availability. The prices above are as of 19 May 2026 and may have changed. Check current GPU pricing for live rates.

Benchmarks

The table below shows approximate TTFT measurements on H100 SXM5, running vLLM with FP8 KV cache and --enable-prefix-caching. Baseline is cold prefill (no cache). The sleep-time column assumes a 100% KV cache hit rate. At real-world hit rates of 40-80%, mean TTFT is the hit-rate-weighted average of the two columns, so the average latency saving scales with the hit rate.

Model	Context Length	TTFT (baseline)	TTFT (sleep-time)	Reduction
Llama 3.3 70B	64K tokens	~6.2s	~1.1s	5.6x
Llama 3.3 70B	128K tokens	~12.8s	~1.4s	9.1x
Qwen 3 32B	32K tokens	~2.9s	~0.7s	4.1x
Qwen 3 32B	64K tokens	~5.8s	~0.9s	6.4x

These are approximate figures at 100% KV cache hit rate on a single-request, non-batched benchmark. Batch inference, tensor parallelism configuration, and hardware generation all affect absolute numbers. Run your own benchmarks on your target hardware before committing to infrastructure decisions.
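To see what a partial hit rate does to the headline numbers, the sketch below blends the cached and cold TTFT figures from the Llama 3.3 70B 64K-token row. It assumes a simple two-outcome model (full hit or full miss) and ignores queueing effects.

```python
def expected_ttft(hit_rate: float, ttft_hit_s: float, ttft_miss_s: float) -> float:
    """Hit-rate-weighted mean TTFT for a full-hit / full-miss model."""
    return hit_rate * ttft_hit_s + (1.0 - hit_rate) * ttft_miss_s


# Llama 3.3 70B at 64K tokens, from the table: ~6.2 s cold, ~1.1 s on a cache hit.
COLD_S, HIT_S = 6.2, 1.1

for hit_rate in (0.4, 0.6, 0.8, 1.0):
    mean_s = expected_ttft(hit_rate, HIT_S, COLD_S)
    print(f"hit rate {hit_rate:.0%}: mean TTFT {mean_s:.2f}s "
          f"({COLD_S / mean_s:.1f}x faster than baseline)")
# hit rate 40%: mean TTFT 4.16s (1.5x faster than baseline)
# hit rate 60%: mean TTFT 3.14s (2.0x faster than baseline)
# hit rate 80%: mean TTFT 2.12s (2.9x faster than baseline)
# hit rate 100%: mean TTFT 1.10s (5.6x faster than baseline)
```

The seconds saved grow linearly with the hit rate, but the speedup factor does not: most of the multiplier arrives in the last 20% of hit rate, which is why prioritizing the right contexts matters.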

Cost Model

Let C_prefill be the cost of prefilling a context once at the serving GPU's rate. Let C_bg be the cost of the same prefill on a background worker at spot rates. Let Q be the number of queries that reuse the pre-computed context.

Break-even occurs when Q * C_prefill > C_bg, which simplifies to Q > C_bg / C_prefill.

Since both costs are proportional to the same token count, the ratio reduces to the GPU rate ratio:

Q_breakeven = spot_rate / on_demand_rate = $1.66 / $2.57 ≈ 0.65

At just one reuse query, you are already past break-even and saving roughly 35% of the prefill cost for that query.

Context length	Background cost (H100 SXM5 spot, ~10K tok/sec)	Prefill saved per query (H100 SXM5 on-demand, ~10K tok/sec)	Break-even Q
32K tokens	~$0.0015/run	~$0.0023/query	1 query
64K tokens	~$0.0030/run	~$0.0046/query	1 query
128K tokens	~$0.0059/run	~$0.0091/query	1 query

For document QA workloads where hundreds of users share the same corpus, the savings multiply by the number of users. A 64K-token corpus queried 500 times per day saves about $2.30/day in serving prefill cost. Each background run costs about $0.003, so refreshing the cache every 1-4 hours adds roughly $0.02-$0.07/day. The ROI is not marginal.
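The cost model can be reproduced with a few lines of arithmetic. This sketch uses the rates and the ~10K tokens/sec prefill throughput quoted in the table; the function names are illustrative, and `net_daily_savings` additionally assumes every query hits the cache and that the cache is refreshed once per TTL.

```python
SPOT_RATE = 1.66               # H100 SXM5 spot, $/hr (from the pricing table)
ON_DEMAND_RATE = 2.57          # H100 SXM5 on-demand, $/hr
PREFILL_TOKENS_PER_SEC = 10_000


def prefill_cost_usd(tokens: int, gpu_rate_per_hr: float) -> float:
    """Cost of one prefill pass: seconds of GPU time times the per-second rate."""
    return tokens / PREFILL_TOKENS_PER_SEC * gpu_rate_per_hr / 3600.0


def net_daily_savings(tokens: int, queries_per_day: int, ttl_hours: float) -> float:
    """Serving prefill avoided minus background refreshes, assuming a 100% hit rate."""
    saved = queries_per_day * prefill_cost_usd(tokens, ON_DEMAND_RATE)
    spent = (24.0 / ttl_hours) * prefill_cost_usd(tokens, SPOT_RATE)
    return saved - spent


for tokens in (32_000, 64_000, 128_000):
    c_bg = prefill_cost_usd(tokens, SPOT_RATE)
    c_prefill = prefill_cost_usd(tokens, ON_DEMAND_RATE)
    print(f"{tokens:>7} tokens: C_bg=${c_bg:.4f}  C_prefill=${c_prefill:.4f}  "
          f"break-even Q={c_bg / c_prefill:.2f}")

print(f"net savings, 64K corpus, 500 queries/day, 1h TTL: "
      f"${net_daily_savings(64_000, 500, 1):.2f}/day")
```

Because both costs scale with the same token count, the break-even ratio is independent of context length; only the GPU rate ratio matters.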

Quick Setup Checklist

Profile your query logs and identify contexts with prefixes longer than 4K tokens that recur across requests
Choose a background GPU (H100 SXM5 spot at $1.66/hr is the recommended default for 70B models; A100 80GB PCIe on-demand at $1.04/hr works if you need lower cost, but note A100 spot is currently priced above on-demand)
Deploy Redis as the shared KV page store with persistence enabled
Launch a vLLM worker with --enable-prefix-caching and LMCache v0.2+ connector pointed at Redis
Build a context fingerprinting queue (SHA-256 of normalized context, stored as Redis sorted set)
Set per-user daily token budgets enforced by Redis counters with midnight UTC TTL
Configure two wake triggers: scheduled cron (off-peak) and queue depth threshold
Deploy the sleep-time worker fleet as spot instances and serving fleet as on-demand
Wire the serving instances to check Redis for KV cache hits before running prefill
Benchmark TTFT before and after using vLLM's /metrics endpoint, tracking the time-to-first-token histogram (vllm:time_to_first_token_seconds in recent releases)

Sleep-time compute turns idle spot GPU cycles into latency headroom for your serving fleet. Spheron's spot H100 and A100 instances are priced for exactly this kind of background workload - run pre-computation cheap, serve queries fast.


Quick Setup Guide

1. Identify long-lived context patterns in your application
Profile your existing query logs to find requests that share context prefixes longer than 4K tokens. These are the pre-computation targets. A coding agent that loads a codebase on every query is a strong candidate.

2. Set up a vLLM worker with persistent KV cache
Launch a vLLM worker with --enable-prefix-caching and the LMCache connector. Point the cache backend at a Redis cluster shared with your serving fleet. This worker runs during low-traffic windows.

3. Build a Redis-backed prefetch queue
Push context fingerprints (SHA-256 of normalized context) into a Redis sorted set ordered by predicted query probability. The sleep-time worker pops items and runs prefill-only inference, writing filled KV pages back into the shared store.

4. Schedule background workers on Spheron spot GPUs
Deploy sleep-time workers as spot instances on Spheron. Set a per-user token budget (e.g., 50K background tokens/day) enforced by the scheduler. Workers wake on a configurable cron or when inbound queue depth exceeds a threshold.

5. Integrate semantic cache and prompt cache layers
Place the semantic cache (GPTCache or Zilliz) above the sleep-time KV layer for exact/fuzzy hit detection, and keep the prompt cache (Anthropic API or vLLM prefix cache) below it as a fallback. The stack order is: semantic cache -> sleep-time KV -> prompt cache -> standard prefill.

6. Benchmark TTFT before and after
Use vLLM's built-in /metrics endpoint to track the time-to-first-token histogram (vllm:time_to_first_token_seconds in recent releases). Run a 100-query replay against Llama 3.3 70B with and without sleep-time pre-computation on a fixed context corpus. Expect 3-5x TTFT reduction on high-hit workloads.
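A small helper for the benchmark step, sketched under the assumption that the server exposes TTFT as a Prometheus histogram named `vllm:time_to_first_token_seconds` (confirm the name against your vLLM version's /metrics output). The sample text is made up for illustration; in practice you would fetch the endpoint body before and after the replay and compare.

```python
import re
from typing import Optional

# Assumed metric name; confirm it against your vLLM version's /metrics output.
TTFT_METRIC = "vllm:time_to_first_token_seconds"
_LINE = re.compile(
    r"^" + re.escape(TTFT_METRIC) + r"_(sum|count)(?:\{[^}]*\})?\s+(\S+)\s*$",
    re.MULTILINE,
)


def mean_ttft_seconds(metrics_text: str) -> Optional[float]:
    """Mean TTFT from a Prometheus histogram: total _sum divided by total _count."""
    totals = {"sum": 0.0, "count": 0.0}
    for kind, value in _LINE.findall(metrics_text):
        totals[kind] += float(value)
    if totals["count"] == 0:
        return None
    return totals["sum"] / totals["count"]


# Illustrative sample, not real measurements.
SAMPLE = """\
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="1.0",model_name="demo"} 4.0
vllm:time_to_first_token_seconds_sum{model_name="demo"} 12.5
vllm:time_to_first_token_seconds_count{model_name="demo"} 10.0
"""

print(mean_ttft_seconds(SAMPLE))  # 1.25
```

Histogram counters are cumulative since server start, so for a before/after comparison either restart the server between runs or subtract the earlier `_sum` and `_count` values from the later ones.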


Frequently Asked Questions

What is sleep-time compute in LLMs?
Sleep-time compute runs LLM inference during idle periods to pre-populate KV caches and pre-generate summaries against long-lived context (codebases, customer histories, document corpora). When a real query arrives, the model skips the expensive prefill step, which cuts time-to-first-token by 3-5x.

How does sleep-time compute differ from test-time compute scaling?
Test-time compute scaling (chain-of-thought, best-of-N sampling) spends extra compute at query time to improve answer quality. Sleep-time compute spends compute before the query arrives, during idle GPU cycles, to reduce query-time latency. The two are orthogonal and can be layered.

Which workloads benefit most from sleep-time pre-computation?
Workloads with long-lived, slowly-changing context see the biggest gains: persistent coding agents with large codebases, customer-support bots with full conversation histories, document QA over static corpora, and knowledge-base RAG with a fixed retrieval index.

Can sleep-time compute run on spot GPU instances?
Yes. Sleep-time workers are the ideal spot workload: they are stateless once the KV cache is written to a shared store, interruptible (the next worker restart picks up from Redis), and latency-insensitive. Serving instances should use on-demand capacity for SLO guarantees.

How much VRAM does KV cache pre-population require for a 70B model?
A single 128K-token sequence on Llama 3.3 70B (80 layers, 8 KV heads, head dimension 128) needs roughly 40 GB of KV cache at FP16. With FP8 KV quantization this drops to about 20 GB. Budget proportionally more for multiple concurrent long contexts. Use H100 SXM5 (80 GB HBM3) or offload cold cache pages to NVMe using LMCache for longer contexts.