
GPU Cold Start on Serverless LLM Inference: 4 Fixes That Actually Work (2026)


Serverless GPU promises zero infrastructure management and scale-to-zero billing. It delivers both, at the cost of a 40-90 second penalty the first time a pod has to load a large model from scratch. For any production LLM API with real users, that first-token latency is a product-level problem that vendor marketing does not acknowledge.

This post covers four techniques that directly address the cold start gap: model weight streaming, CUDA and GPU memory snapshotting, container image slimming, and pre-warming with persistent NVMe caches. For teams running Kubernetes-native autoscaling, the GPU autoscaling guide with CRIU restore covers KEDA and Knative specifics in depth. For the underlying billing model that determines when any of this matters economically, see the serverless vs dedicated GPU billing model comparison.

What each fix eliminates

| Fix | Phase eliminated or reduced | Time saved (70B model) | Infrastructure required |
| --- | --- | --- | --- |
| safetensors streaming load | Weight transfer overlap | 20-30s | Local NVMe 3+ GB/s |
| CUDA graph cache persistence | Graph capture | 10-30s | Persistent NVMe volume |
| Container image slimming | Container pull | 3-6 min (cold cache) | Multi-stage Dockerfile |
| CRIU/GPU snapshot restore | All phases | 50-80s | NVIDIA driver 550+, cgroups v2 |
| Keep-warm min replica | N/A (no cold start) | 40-90s (entire cold start) | Persistent dedicated GPU |

Anatomy of a GPU cold start

A cold start is not one thing. It is four sequential phases, each with its own bottleneck:

| Phase | 8B model | 70B model | Dominant factor |
| --- | --- | --- | --- |
| Container pull (uncached) | 4-8 min | 4-8 min | Image size (15-18 GB) |
| Weight load from storage | ~8s | ~40-45s | NVMe bandwidth |
| CUDA context + graph capture | ~12s | ~30s | Model complexity |
| KV cache warmup | ~5s | ~10s | First N requests |
| Total cold (uncached) | ~6 min | ~7-9 min | |
| Total warm (cached image) | ~25s | ~85s | |

The "warm" row assumes the container image is already on the node - no pull required. That is the floor you can realistically hit with weight loading and CUDA graph capture still running. The four fixes below attack specific rows in this table.


Why a 7B model moves ~14 GB to VRAM

The math is straightforward: 7 billion parameters at FP16 (2 bytes each) = 14 GB. A 70B model at FP16 is 140 GB. That is why a 70B model needs 2-4 H100s (80 GB VRAM each) and takes 40-45 seconds to load even over fast NVMe.

Quantization cuts the per-parameter footprint:

  • FP16: 2 bytes/param, 14 GB for 7B
  • FP8: 1 byte/param, 7 GB for 7B
  • INT4 (AWQ/GPTQ): ~0.5 bytes/param, roughly 3.5-4 GB for 7B once quantization scales are included

The catch is storage bandwidth. At 500 MB/s over NFS, a 4 GB INT4 7B model still takes 8 seconds to load. On local PCIe Gen4 NVMe at 3.5 GB/s, the same load drops to about 1 second. This gap is why NVMe locality matters more than quantization alone for cold start latency. The GPU memory requirements for LLMs post has detailed VRAM sizing tables if you are picking the right GPU tier for a specific model.
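The sizing and load-time arithmetic above can be reproduced with a short calculation. This is a minimal sketch, not a benchmark: the function names are illustrative, it uses decimal units (1 GB = 10^9 bytes), and it assumes storage bandwidth is the only bottleneck. Note that raw INT4 at 0.5 bytes per parameter gives 3.5 GB for 7B; the 4 GB figure used in the text presumably includes quantization metadata.

```python
# Bytes per parameter for the precisions discussed above.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_size_gb(params_billions: float, dtype: str) -> float:
    """Approximate raw weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


def load_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on load time if storage bandwidth is the only limit."""
    return size_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = weight_size_gb(7, dtype)
        nfs = load_seconds(size, 0.5)   # 500 MB/s network storage
        nvme = load_seconds(size, 3.5)  # local PCIe Gen4 NVMe
        print(f"7B {dtype}: {size:.1f} GB, NFS {nfs:.1f}s, NVMe {nvme:.1f}s")
```

Running the same functions for a 70B FP16 model gives 140 GB and 40 seconds at 3.5 GB/s, which matches the lower end of the weight-load row in the phase table.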


Fix 1: Model weight streaming with safetensors

The traditional weight loading pattern is sequential: read all shards from disk, transfer to GPU, then start the model. safetensors format breaks this by storing tensors in a flat binary layout with a small index header. The loader can mmap specific shards directly and overlap disk reads of shard N+1 with CUDA transfers of shard N.

For a Llama-3-8B model split across 4 shards, the streaming loader starts transferring shard 1 to GPU while reading shard 2 from disk in parallel. The first few transformer layers become resident in VRAM before the last shards have finished loading. vLLM uses this path automatically when safetensors weights are available and there is no --load-format override forcing pt or gguf.

To verify you are on the streaming path:

bash
# Confirm the model uses safetensors format
ls ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/*/
# Look for model.safetensors.index.json and multiple model-*.safetensors shards

# Launch vLLM without forcing a non-streaming loader.
# No --load-format argument is needed: the default (auto) prefers safetensors.
vllm serve meta-llama/Meta-Llama-3-8B \
  --dtype bfloat16 \
  --max-model-len 4096

The real limiting factor for streaming is NVMe bandwidth. At 3-4 GB/s on local NVMe, streaming overlap saves 20-30 seconds on a 70B model. At 500 MB/s over NFS, the benefit narrows because the disk read bottleneck dominates. For workloads where NVMe bandwidth is the ceiling, GPU Direct Storage for maximum NVMe bandwidth covers the cuFile path that eliminates the CPU bounce buffer entirely.
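The overlap described in this section can be modeled as a two-stage pipeline: a disk-read stage feeding a GPU-copy stage. The sketch below is a simplified model, not vLLM's loader. It assumes each shard must be fully read before its copy starts, that the reader can run ahead without limit, and the per-shard timings are made-up illustrative values.

```python
from typing import Sequence


def sequential_seconds(read_s: Sequence[float], copy_s: Sequence[float]) -> float:
    """Read every shard, then copy every shard: no overlap at all."""
    return sum(read_s) + sum(copy_s)


def pipelined_seconds(read_s: Sequence[float], copy_s: Sequence[float]) -> float:
    """Copy shard N to the GPU while shard N+1 is being read from disk."""
    read_done = 0.0  # time at which the current shard has been read
    copy_done = 0.0  # time at which the current shard is resident on the GPU
    for r, c in zip(read_s, copy_s):
        read_done += r
        # A copy starts once its shard is read AND the previous copy is finished.
        copy_done = max(copy_done, read_done) + c
    return copy_done


if __name__ == "__main__":
    reads = [2.0, 2.0, 2.0, 2.0]   # hypothetical seconds per shard read
    copies = [1.0, 1.0, 1.0, 1.0]  # hypothetical seconds per shard copy
    print("sequential:", sequential_seconds(reads, copies))
    print("pipelined: ", pipelined_seconds(reads, copies))
```

When reads are slower than copies, the pipelined total collapses to the sum of the reads plus one trailing copy. That is why the benefit narrows on slow storage: the read stage alone dominates the total.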


Fix 2: CUDA and GPU memory snapshotting

There are two distinct snapshotting approaches, ordered by implementation complexity:

CUDA graph and Inductor cache persistence (simpler)

Every time vLLM starts, it captures CUDA graphs for the model's forward pass across a range of batch sizes. This takes 10-30 seconds depending on model complexity. With torch.compile, Inductor kernel compilation adds another 30-120 seconds on the first run.

Both are pure computation that produces deterministic output given the same model and hardware. They can be persisted to disk and reused across container restarts:

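bash
# TORCHINDUCTOR_CACHE_DIR is PyTorch's variable for compiled Inductor artifacts.
export TORCHINDUCTOR_CACHE_DIR=/mnt/nvme/inductor-cache
# Confirm your vLLM/PyTorch version honors this variable before relying on it.
export CUDA_GRAPHS_CACHE_DIR=/mnt/nvme/cuda-graphs

# 70B in BF16 (~140 GB) needs multiple 80 GB GPUs; adjust the parallel size to your node.
vllm serve meta-llama/Meta-Llama-3-70B \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --tensor-parallel-size 4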

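On the first run, graph capture takes about 30 seconds and the compilation artifacts are written to the NVMe cache directories. On subsequent cold starts with the same model, GPU, and software versions, the cached artifacts are read back instead of being rebuilt. Check the startup logs to confirm that the compile and capture phases are actually shorter, since what can be reused depends on your vLLM and PyTorch versions. The torch.compile and CUDA graph capture guide covers cache management in detail, including invalidation rules when you change model configs or CUDA versions.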

This requires a persistent NVMe volume that survives container restarts. On serverless platforms with ephemeral filesystems, the cache is lost on every cold start, so this fix only works on dedicated or persistent instances.
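A quick way to tell whether a cold start will hit a warm cache is to check, before launching the server, that the cache directory exists and already holds files. The helper below is a hypothetical pre-flight check, not part of vLLM or PyTorch. The function names are assumptions, and it only inspects the directory named by the environment variable you pass in.

```python
import os
from pathlib import Path


def cache_is_warm(cache_dir: str) -> bool:
    """True if the directory exists and contains at least one file at any depth."""
    path = Path(cache_dir)
    if not path.is_dir():
        return False
    return any(p.is_file() for p in path.rglob("*"))


def report(env_var: str) -> str:
    """Describe the state of the cache directory named by an environment variable."""
    value = os.environ.get(env_var)
    if not value:
        return f"{env_var}: not set"
    state = "warm" if cache_is_warm(value) else "cold (empty or missing)"
    return f"{env_var}: {value} is {state}"


if __name__ == "__main__":
    # Variable name taken from the export lines above.
    print(report("TORCHINDUCTOR_CACHE_DIR"))
```

If this reports a cold cache on every start, the volume is most likely ephemeral and the persistence fix is not taking effect.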

Full process CRIU checkpoint/restore (eliminates weight loading too)

CRIU (Checkpoint/Restore In Userspace) captures the complete process state: CPU memory, GPU VRAM contents, file descriptors, CUDA context, and execution state. On restore, the process resumes from the exact point of the checkpoint without re-executing weight loading, graph capture, or any initialization code.

Requirements:

  • NVIDIA driver 550+ (565+ recommended for the full CRIUgpu reference stack)
  • CUDA 12.x
  • CRIU 3.19+ with kernel 5.15+ patches
  • cgroups v2 enabled
  • Kubernetes 1.30+ for the ContainerCheckpoint API (beta, enabled by default)

With a checkpoint stored on local NVMe, cold start drops from 60-90 seconds to 2-5 seconds for small models (7B-13B). The checkpoint is dominated by the weight data, so its size scales with the model: expect roughly 14 GB for a 7B FP16 model and 140 GB for 70B. At 3.5 GB/s restore speed, a 14 GB checkpoint restores in about 4 seconds; a 140 GB checkpoint (70B) restores in about 40 seconds.

To trigger a checkpoint via the Kubernetes ContainerCheckpoint API:

bash
# Trigger checkpoint after the pod has fully warmed (weights loaded, graphs captured)
curl -X POST \
  https://localhost:10250/checkpoint/<namespace>/<pod-name>/<container-name> \
  --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
  --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
  --cacert /etc/kubernetes/pki/ca.crt

# Checkpoint archive appears at:
# /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace>-<container-name>-<timestamp>.tar

This is the same mechanism covered in the KEDA/Knative guide. Modal's GPU memory snapshot feature (alpha 2025) follows the same principle on their managed platform.


Fix 3: Container image slimming and dependency pruning

Slimming the container image is the most impactful cold start fix for teams that have not addressed it yet. A full LLM serving image includes Python, CUDA runtime, PyTorch, vLLM, and all transitive dependencies - typically 15-18 GB. Without image caching on every node, pulling that image takes 4-8 minutes.

The multi-stage build pattern separates build tools from runtime:

dockerfile
# Build stage: devel image with the compiler toolchain for packages that build from source
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build

RUN apt-get update \
    && apt-get install -y python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir vllm transformers accelerate safetensors

# Runtime stage: only the installed packages, none of the build tools
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

RUN apt-get update \
    && apt-get install -y --no-install-recommends python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*
COPY --from=build /usr/local/lib/python3.10/dist-packages \
                  /usr/local/lib/python3.10/dist-packages
COPY --from=build /usr/local/bin/vllm /usr/local/bin/vllm

CMD ["vllm", "serve", "meta-llama/Meta-Llama-3-8B", "--dtype", "bfloat16"]

Before/after sizes:

| Component | Full image | Slim runtime image |
| --- | --- | --- |
| CUDA base | ~4.5 GB (devel) | ~1.8 GB (runtime) |
| Python + pip | ~300 MB | ~200 MB |
| PyTorch | ~2.5 GB | ~2.5 GB |
| vLLM + deps | ~4 GB | ~2 GB (no build tools) |
| Total | ~15-18 GB | ~5-7 GB |

A 6 GB slim image at 100 MB/s pull speed transfers in about 1 minute. A 16 GB full image takes just under 3 minutes at the same speed, before layer extraction is counted. With image pre-pulling on every node via a Kubernetes DaemonSet, the pull phase drops to zero for pods that land on a pre-warmed node.


Fix 4: Pre-warming pools, keep-warm replicas, and NVMe snapshot caches

The most reliable fix for cold starts is to avoid them entirely. Three approaches, ordered by cost:

Minimum replica for production APIs

Set minScale: 1 on Knative or minReplicaCount: 1 on KEDA. One warm pod stays running at all times. The idle cost is the hourly GPU rate multiplied by idle hours. For an H100 at $2.01/hr on Spheron, 24 idle hours cost $48.24. That is cheap compared to losing users to 60-second first-token latency on a consumer product. Scale-to-zero is a dev/staging optimization, not a production strategy.

Pre-warming on schedule

Use a KEDA cron trigger or a Knative cron source to spin up replicas 5 minutes before expected traffic peaks. If your inference API receives 90% of requests between 9 AM and 10 PM, keep one replica warm only during those hours:

yaml
# KEDA ScaledObject with cron pre-warm
triggers:
  - type: cron
    metadata:
      timezone: "America/New_York"
      start: "55 8 * * 1-5"  # warm up 5 min before 9 AM weekdays
      end: "0 22 * * 1-5"    # scale to zero at 10 PM
      desiredReplicas: "1"

NVMe snapshot cache shared across replicas

Mount a PersistentVolume backed by local NVMe as the CUDA graph and CRIU checkpoint directory. The first pod to warm populates the cache. Every subsequent cold start on that node reads from the warm cache rather than re-capturing graphs or re-loading weights. Because a local PersistentVolume is tied to one node, replicas scheduled on other nodes need their own warm-up or a copy of the cache. This is the compound benefit: the first cold start is slow (it populates the cache), and every subsequent one is fast.

Spheron bare-metal nodes have local PCIe Gen4 NVMe attached directly to the host. This is what makes snapshot restores and graph cache loads fast compared to shared NFS or S3: 3-4 GB/s local bandwidth vs 100-500 MB/s over network storage. For the KV cache side of this story, NVMe KV cache offloading covers the InfiniStore/DiskKV patterns for extending context beyond VRAM.

For a production deployment that combines all these techniques, the production LLM deployment guide covers the full stack from container build to monitoring. Spheron's LLM quick-guides also cover the end-to-end provisioning steps.


The cost tradeoff: when serverless loses to dedicated

Serverless GPU platforms bill at a premium per compute-second. The convenience of scale-to-zero justifies that premium at low utilization. At high utilization, you are paying the premium without getting the convenience benefit.

Breakeven analysis (H100)

The breakeven utilization is: on-demand rate / serverless effective rate.

Serverless platforms do not publish a single "per-GPU-hour" price because billing models vary, but effective H100 rates based on public pricing and typical usage patterns are roughly 1.5-2x the bare-metal rate. Using Modal's published H100 pricing as a reference estimate:

| Scenario | Serverless effective rate | Spheron H100 PCIe on-demand | Breakeven utilization |
| --- | --- | --- | --- |
| H100 PCIe | ~$3.95/hr (Modal, estimated) | $2.01/hr | ~51% |

At 60% utilization, dedicated wins on both cost and latency. The math for a 720-hour month:

| Billing model | Compute hours (60% util × 720) | Effective rate | Monthly cost |
| --- | --- | --- | --- |
| Serverless | 432 compute hours | $3.95/hr | $1,706 |
| Spheron H100 PCIe (dedicated) | 720 hours (always-on) | $2.01/hr | $1,447 |
| Saving | | | ~$259/month |
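The breakeven formula and the monthly comparison can be checked with a few lines of arithmetic. This is a sketch with illustrative function names. The rates are the ones quoted in this section (including the estimated serverless rate), and it assumes serverless bills only for utilized hours while the dedicated instance bills for every hour of a 720-hour month.

```python
HOURS_PER_MONTH = 720


def breakeven_utilization(dedicated_rate: float, serverless_rate: float) -> float:
    """Utilization above which an always-on dedicated GPU is cheaper."""
    return dedicated_rate / serverless_rate


def monthly_costs(utilization: float, dedicated_rate: float,
                  serverless_rate: float, hours: int = HOURS_PER_MONTH):
    """Return (serverless_cost, dedicated_cost) in dollars for one month."""
    serverless = utilization * hours * serverless_rate  # pay only for busy hours
    dedicated = hours * dedicated_rate                  # pay for every hour
    return serverless, dedicated


if __name__ == "__main__":
    h100_dedicated, h100_serverless = 2.01, 3.95
    print(f"breakeven: {breakeven_utilization(h100_dedicated, h100_serverless):.0%}")
    s, d = monthly_costs(0.60, h100_dedicated, h100_serverless)
    print(f"serverless ${s:,.0f} vs dedicated ${d:,.0f}, saving ${s - d:,.0f}")
```

Swapping in your own platform's effective rate is the main use of a calculation like this, since the serverless figure is an estimate rather than a published per-hour price.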

The dedicated instance also eliminates cold start latency entirely for the first user request of any session, which has a real but hard-to-quantify cost in user satisfaction and retry rates.

Small-model tier: L40S for 7B-30B models

The H100 analysis above applies to 70B-class workloads. For 7B-30B parameter models, a single L40S (48 GB GDDR6) fits a 30B FP8 model comfortably and runs at $0.96/hr on-demand on Spheron. That is roughly half the H100 rate for workloads that do not need H100-class HBM bandwidth.

| Scenario | Serverless effective rate | Spheron L40S on-demand | Breakeven utilization |
| --- | --- | --- | --- |
| L40S (7B-30B tier) | ~$1.92/hr (estimated, 2x on-demand) | $0.96/hr | ~50% |

At 60% utilization, a 720-hour month on L40S:

| Billing model | Compute hours (60% util × 720) | Effective rate | Monthly cost |
| --- | --- | --- | --- |
| Serverless | 432 compute hours | $1.92/hr | $829 |
| Spheron L40S (dedicated) | 720 hours (always-on) | $0.96/hr | $691 |
| Saving | | | ~$138/month |

Running an H100 for a 7B model wastes the bulk of its VRAM and costs roughly 2x more than necessary. If your model fits on an L40S, start there. See L40S rental pricing and availability for current rates.

For H200 at $4.84/hr on-demand (useful for memory-bound 70B+ models), the math runs similarly: check on-demand pricing vs your serverless platform's H200 rate to find your specific breakeven point.

Pricing fluctuates based on GPU availability. The prices above are as of 15 Jun 2026 and may have changed. Check current GPU pricing for live rates.


Decision framework by workload type

| Workload | Traffic pattern | Cold start tolerance | Recommendation |
| --- | --- | --- | --- |
| Chat API (consumer product) | Sustained, latency-sensitive | Zero | Dedicated GPU, minScale >= 1 |
| Voice AI (real-time) | Bursty, sub-second TTFT | Zero | Dedicated GPU with pre-warm |
| Batch inference | Async, no SLA | High | Serverless + CRIU snapshots |
| Agentic pipelines | Unpredictable bursts | Medium | Serverless with GPU memory snapshots or CRIU |
| Dev/staging endpoints | Sporadic | High | Scale-to-zero with CRIU restore |

The threshold between "tolerable cold start" and "unacceptable" is workload-specific. The LLM serving optimization guide covers the batching and scheduling side of inference optimization once you have eliminated the cold start tax. For latency budgets and TTFT targets per workload type, the LLM inference SLO guide has the full breakdown.

The short version: if users are waiting for the first token and you have no cold start mitigation, fix the container and snapshot caching first. If you are already past 50-60% GPU utilization, a dedicated instance is cheaper and faster than serverless for the long run.

Serverless cold starts are a problem you stop having with a persistent dedicated instance. Spheron H100 and H200 nodes include local NVMe for snapshot caching and CUDA graph persistence, giving you the sub-second first-token latency that serverless cannot match at production scale.



Quick Setup Guide

  1. Measure your baseline cold start phases

    SSH into your inference node or container and run time kubectl exec <pod> -- python -c 'import time; print(time.time())' while watching docker events or containerd events in parallel. Record timestamps at container start, first Python import, first model weight shard load (add logging to your load() method), CUDA graph capture completion (watch the vLLM logs for the capture completion message, for example a line beginning 'Graph capturing finished' in recent versions), and first successful token generation. This gives you the phase-by-phase breakdown before you start optimizing. You cannot fix what you have not measured.

  2. Enable safetensors streaming weight load

    Install the safetensors library (pip install safetensors) and confirm your model weights are in safetensors format (model.safetensors.index.json present in the model dir). In vLLM, safetensors streaming is the default loader when the format is available. Set VLLM_USE_MODELSCOPE=0 and verify there is no --load-format gguf or --load-format pt override in your launch flags. The safetensors mmap() path can overlap disk read and GPU transfer for multi-shard checkpoints.

  3. Persist the CUDA Inductor and graph cache to local NVMe

    Export TORCHINDUCTOR_CACHE_DIR=/mnt/nvme/inductor-cache and CUDA_GRAPHS_CACHE_DIR=/mnt/nvme/cuda-graphs before starting vLLM or your custom PyTorch inference server. On Spheron H100 nodes, /mnt/nvme is backed by local PCIe Gen4 NVMe at 3-4 GB/s. The compiled kernels persist across container restarts, so subsequent cold starts can reuse them instead of rebuilding. Check the startup logs to confirm how much of the 10-30 second graph capture phase your vLLM version actually skips. On serverless platforms with ephemeral filesystems, these caches are lost on every cold start.

  4. Slim your container image

    Start from the official nvidia/cuda:12.4.1-runtime-ubuntu22.04 base (runtime, not devel). Install Python dependencies with pip install --no-cache-dir so that pip's download cache is not baked into an image layer. Use a multi-stage Dockerfile: a build stage that installs all packages, and a final stage that only copies the installed site-packages tree. Run docker inspect <image> | jq '.[0].Size' before and after to verify the reduction. A typical vLLM container drops from 15-18 GB to 5-7 GB with this approach.

  5. Set up a keep-warm minimum replica for production

    For production serving endpoints, set minScale: 1 (Knative) or minReplicaCount: 1 (KEDA). This keeps at least one warmed replica running at all times, eliminating cold starts for user-facing requests. Calculate the idle cost as (hourly GPU rate) * (idle hours). For Spheron H100 at the current on-demand rate, compare this idle cost to the latency tax of cold starts - the idle cost is almost always cheaper than losing users to 60-second TTFT. For development or staging, scale-to-zero with CRIU restore is the right tradeoff.
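For step 1, the recorded timestamps turn into a phase breakdown by taking differences between consecutive events. The sketch below is one way to do that. The event names are hypothetical labels for the five measurement points listed in step 1, and the sample values are made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical labels for the five measurement points in step 1, in order.
EVENT_ORDER = [
    "container_start",
    "first_import",
    "first_shard_load",
    "graphs_captured",
    "first_token",
]


def phase_durations(timestamps: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (phase label, seconds) for each pair of consecutive events."""
    ordered = [(name, timestamps[name]) for name in EVENT_ORDER]
    durations = []
    for (prev_name, prev_t), (name, t) in zip(ordered, ordered[1:]):
        if t < prev_t:
            raise ValueError(f"{name} is earlier than {prev_name}")
        durations.append((f"{prev_name} -> {name}", t - prev_t))
    return durations


if __name__ == "__main__":
    # Made-up sample: seconds since container start.
    sample = {
        "container_start": 0.0,
        "first_import": 3.0,
        "first_shard_load": 5.0,
        "graphs_captured": 80.0,
        "first_token": 90.0,
    }
    for label, seconds in phase_durations(sample):
        print(f"{label:40s} {seconds:6.1f}s")
```

The largest interval tells you which of the later steps to apply first.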


Frequently Asked Questions

A GPU cold start for LLM serving has four distinct phases: container image pull (the dominant phase when the image is not cached, 4-8 minutes for a 15 GB image), model weight transfer from storage to GPU VRAM (40+ seconds for a 70B FP16 model at 3-4 GB/s NVMe), CUDA context initialization and CUDA graph capture (10-30 seconds for vLLM's default graph sizes), and optionally KV cache warmup. Without optimization, a 70B model cold start on an uncached container easily hits 6-10 minutes total. Even with pre-pulled images, 60-90 seconds is common.

Model weight streaming moves tensor shards from disk to GPU progressively, overlapping the disk read of one shard with the GPU transfer of the previous one instead of running the two steps back to back. With safetensors format and a sufficiently fast NVMe (3+ GB/s), the first few layers of a Llama-3-8B model are already resident in VRAM while later shards are still being read, so total load time approaches the slower of the two stages rather than their sum. The model still needs every layer loaded before it can produce a token; streaming shortens that wait rather than removing it.

CUDA memory snapshotting captures the full GPU state - model weights in VRAM, CUDA kernel objects, and execution context - as a byte-for-byte snapshot on persistent storage. On restore, the GPU state is replayed directly from the snapshot, skipping weight loading, CUDA graph capture, and context initialization entirely. CRIU (Checkpoint/Restore In Userspace) supports this at the Linux process level with NVIDIA driver 550+. Modal introduced GPU memory snapshots (alpha 2025) in their platform. With a snapshot on local NVMe, cold start time drops from 60-90 seconds to 2-5 seconds for small models (7B-13B). Larger models take longer because the checkpoint size scales with weight data: a 70B FP16 checkpoint is ~140 GB, which restores in about 40 seconds at 3.5 GB/s.

The breakeven point is workload utilization. Serverless GPU platforms charge a premium per compute-second for the convenience of scale-to-zero. At Spheron's H100 on-demand rate and a typical serverless effective rate roughly 1.5-2x higher, the breakeven is around 50-60% average GPU utilization. Below that threshold, serverless with tolerable cold starts saves money. Above it, a persistent dedicated instance on Spheron beats serverless on both cost-per-token and latency. For production APIs with consistent traffic, the utilization threshold is almost always exceeded.

Container image pull is the single largest driver of GPU cold start latency for teams without a local image cache. A typical LLM serving container bundles Python, CUDA libraries, PyTorch, and the serving framework. These commonly add up to 10-18 GB. Slimming techniques include using multi-stage builds to separate build tools from runtime, using the CUDA runtime base image instead of the larger devel image, removing test suites and documentation from installed packages, and pre-baking the image onto the node. A 3 GB runtime-only image with pre-pulled caching on every node reduces the container phase from minutes to under 10 seconds.
