Standard Kubernetes HPA breaks on LLM inference. Not occasionally, not in edge cases. Every time a GPU pod restarts cold, the new replica spends 3-10 minutes pulling images, loading model weights from storage, capturing CUDA graphs, and warming its KV cache. By the time it is ready to serve, the replicas it was meant to relieve have been under load for minutes.
The tools that fix this, KEDA for queue-depth autoscaling and Knative for scale-to-zero, have good documentation for web services. Neither has a coherent guide for GPU LLM workloads specifically. The forum posts are scattered. The cold-start anatomy is underdocumented. No one source covers the full stack from metric triggers through CRIU checkpoint restore.
This guide walks through the complete stack on Spheron bare-metal GPU nodes: KEDA for metric-driven scaling, Knative for scale-to-zero, cold-start optimization techniques including CRIU checkpointing, and a cost model across three realistic traffic patterns. For Kubernetes GPU cluster setup fundamentals, see Kubernetes GPU Orchestration 2026. For vLLM tuning and multi-GPU tensor parallelism, see the vLLM Multi-GPU Production Deployment 2026 guide.
Why Standard HPA Fails for LLM Workloads
CPU and memory utilization do not reflect how an LLM inference pod is actually stressed. A vLLM pod handling 50 concurrent requests runs at 5-8% CPU while its GPU sits at 95%. HPA sees a healthy pod. Users see 30-second token generation latency as the KV cache fills and requests stack in the queue.
Here is a broken HPA spec that teams typically start with:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-server
minReplicas: 1
maxReplicas: 5
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
This will never scale. The CPU target of 70% will not be reached during normal inference load. Even under severe GPU saturation, the CPU metric stays low because token generation is GPU-bound.
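You can see the mismatch directly on a loaded pod. A quick check, assuming curl is available inside the container (otherwise port-forward port 8000 and query from your machine) and substituting your own pod name:
# CPU as the HPA sees it: a few hundred millicores at most (requires metrics-server, bundled with k3s)
kubectl top pod vllm-server-abc123 -n inference
# Queue depth as vLLM reports it: climbs while users wait
kubectl exec -n inference vllm-server-abc123 -- \
  curl -s localhost:8000/metrics | grep num_requests_waiting
# GPU utilization in the same pod: near 100% under the same load
kubectl exec -n inference vllm-server-abc123 -- \
  nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader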
The second problem is cold-start time. When HPA eventually triggers a scale event (after CPU actually spikes due to something unrelated, or never), the new pod faces this timeline:
| Phase | 8B model (H100 NVMe) | 70B model (H100 NVMe) |
|---|---|---|
| Container pull (uncached) | 4-6 min (15GB image) | 4-6 min |
| Weight load from NVMe | ~8s | ~45s |
| CUDA graph capture | ~12s | ~30s |
| KV cache warmup | ~5s | ~10s |
| Total (cold, uncached) | ~5-7 min | ~6-8 min |
| Total (warm image cache) | ~25s | ~85s |
Pre-pulled images are the single biggest lever. Everything else is seconds. The container pull is minutes.
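A rough way to put numbers on the pull phase for your own image, run directly on a GPU node (k3s ships crictl; on other distros install it separately):
# First pull is the multi-minute phase in the table above
time crictl pull vllm/vllm-openai:v0.8.5
# Once cached, the image shows up here and subsequent pod starts skip the pull
crictl images | grep vllm-openai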
Cold-Start Anatomy: What Actually Takes Time
Weight loading from NVMe. This is where bare metal matters. Spheron's H100 nodes have local NVMe with sequential read bandwidth of 3-4 GB/s. A 70B BF16 model is about 140GB on disk. At 3.5 GB/s that is 40 seconds. Cloud-attached NFS storage runs 400-600 MB/s. The same model takes 4-6 minutes. That one difference makes local NVMe essential for any workload that expects to scale up and be serving within 90 seconds.
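To verify the arithmetic on your own node, measure sequential read bandwidth where the weights live (the /data path is an assumption; fio must be installed):
# Sequential read benchmark on the NVMe volume
fio --name=seqread --filename=/data/fio-test --rw=read --bs=1M \
    --size=8G --direct=1 --group_reporting
rm /data/fio-test
# Expected weight-load time: model size on disk divided by the measured rate,
# e.g. 140 GB / 3.5 GB/s = 40 s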
CUDA graph capture. When vLLM runs without --enforce-eager (the default), it captures CUDA graphs for common batch sizes. A CUDA graph records a fixed sequence of GPU operations and replays them with lower launch overhead than dynamic dispatch. The capture itself takes 10-30 seconds depending on the model and how many batch sizes are traced. On subsequent runs, if you persist the graph cache to a PVC, capture time drops to near zero. The env var CUDA_GRAPHS_CACHE_DIR controls the cache location.
KV cache warmup. The first N requests populate the prefix cache. vLLM's --enable-prefix-caching flag lets subsequent requests with shared prefixes skip redundant computation. The warmup period is typically 5-10 requests, after which hit rates stabilize. For more detail on KV cache sizing and prefix cache tuning, see KV Cache Optimization Guide.
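The warmup can be automated so the first real user does not pay for it. A minimal sketch, assuming the Service name and model used later in this guide, firing a handful of requests that share a common prefix:
# Populate the prefix cache before routing real traffic
for i in $(seq 1 8); do
  curl -s http://vllm-server.inference.svc:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"meta-llama/Llama-3-8B-Instruct","prompt":"You are a helpful assistant. Warmup request.","max_tokens":8}' \
    > /dev/null
done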
This NVMe advantage is built into Spheron H100 nodes: bare-metal provisioning means model weights load from local disk at full NVMe bandwidth, not over a network filesystem.
KEDA vs Knative vs Custom Controller: Which for Which Stack
| Autoscaler | Trigger type | Scale-to-zero | Routing changes | Best for |
|---|---|---|---|---|
| KEDA | Any metric (Prometheus, queues, custom) | With KEDA-HTTP add-on | None (standard Service) | Queue-depth LLM serving, batch processing |
| Knative Serving | HTTP concurrency / RPS | Native | Replaces Service with Knative Route | HTTP inference endpoints, user-facing chatbots |
| Ray Serve autoscaler | Ray task queue | Partial (cluster-level) | Ray Serve router | Multi-model pipelines, agent frameworks |
| Custom operator (llm-d) | Routing gateway metrics | No | Kubernetes Gateway API | Prefill/decode disaggregation, multi-pool routing |
KEDA works with standard Kubernetes Deployments and Services. No routing changes. You add a ScaledObject and KEDA manages an HPA behind the scenes, driven by metrics you define instead of CPU/memory. The tradeoff: KEDA-HTTP is required for true scale-to-zero, and it adds a reverse proxy in the request path.
Knative Serving intercepts all traffic through its Activator component. The Activator buffers incoming requests while a pod cold-starts, then proxies them once the pod is ready. This makes scale-to-zero transparent to callers, with latency as the only user-visible impact. The tradeoff: Knative replaces your Service with a Knative Route, which changes how you address the endpoint.
For disaggregated inference patterns, see llm-d Kubernetes disaggregated inference guide. For Ray Serve on GPU cloud, see Ray Serve GPU cloud LLM deployment.
For workloads that need GPU quota management and fractional sharing on top of Kubernetes rather than reactive autoscaling, NVIDIA Run:ai on GPU Cloud covers the scheduler-first approach.
KEDA: Scale on Queue Depth and GPU Utilization
Install KEDA via Helm:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace \
--version 2.19.0
vLLM exposes a Prometheus-compatible /metrics endpoint. The key metric for queue depth is vllm:num_requests_waiting. When this exceeds your threshold, new replicas should come up. The second trigger, GPU utilization from DCGM Exporter, catches burst conditions before the queue fills.
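Before wiring these into KEDA, confirm both metrics are actually scrapeable. A quick check, assuming the Deployment and dcgm-exporter Service names used in this guide:
# vLLM queue depth, via a temporary port-forward
kubectl port-forward -n inference deploy/vllm-server 8000:8000 &
curl -s localhost:8000/metrics | grep num_requests_waiting
# DCGM GPU utilization (dcgm-exporter listens on 9400 by default)
kubectl port-forward -n monitoring svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
With both metrics visible, the ScaledObject below ties them to scale decisions: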
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-scaledobject
namespace: inference
spec:
scaleTargetRef:
name: vllm-server
pollingInterval: 15
cooldownPeriod: 120
minReplicaCount: 1
maxReplicaCount: 8
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleDown:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 1
periodSeconds: 60
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-operated.monitoring.svc:9090
metricName: vllm_requests_waiting
query: sum(vllm:num_requests_waiting{namespace="inference"})
threshold: "5"
- type: prometheus
metadata:
serverAddress: http://prometheus-operated.monitoring.svc:9090
metricName: dcgm_gpu_utilization
query: avg(DCGM_FI_DEV_GPU_UTIL{namespace="inference",pod=~"vllm-server-.*"})
threshold: "80"Key settings to understand:
- pollingInterval: 15 checks metrics every 15 seconds. Lower values give faster reaction but add Prometheus query load.
- cooldownPeriod: 120 waits 2 minutes after the last scale event before allowing scale-down. Prevents flapping.
- stabilizationWindowSeconds: 60 on scale-down means KEDA tracks the highest replica count recommendation over the last 60 seconds before reducing. This prevents a momentary metric dip from triggering a premature scale-down.
- minReplicaCount: 1 keeps one warm replica for production. Use 0 only with KEDA-HTTP for true scale-to-zero.
The dual-trigger approach matters: the GPU utilization trigger fires when existing replicas are hot, before the queue fills and before user-visible latency degrades. The queue depth trigger catches cases where GPU utilization is moderate but requests are stacking (long-context inputs with slow decode).
Knative Serving: Scale-to-Zero with GPU Pods
Install Knative Serving and the net-kourier ingress:
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.22.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.22.0/serving-core.yaml
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.22.0/kourier.yaml
kubectl patch configmap/config-network \
--namespace knative-serving \
--type merge \
--patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'Define a Knative Service for vLLM:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: vllm-ksvc
namespace: inference
spec:
template:
metadata:
annotations:
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: concurrency
autoscaling.knative.dev/target: "10"
autoscaling.knative.dev/minScale: "0"
autoscaling.knative.dev/maxScale: "5"
autoscaling.knative.dev/scale-to-zero-grace-period: "300s"
autoscaling.knative.dev/window: "60s"
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.5
resources:
limits:
nvidia.com/gpu: "1"
memory: 80Gi
requests:
nvidia.com/gpu: "1"
memory: 80Gi
env:
- name: VLLM_WORKER_MULTIPROC_METHOD
value: spawn
- name: HF_HOME
value: /models
- name: CUDA_GRAPHS_CACHE_DIR
value: /cuda-graphs
args:
- --model
- meta-llama/Llama-3-8B-Instruct
- --enable-prefix-caching
volumeMounts:
- name: model-cache
mountPath: /models
- name: cuda-graph-cache
mountPath: /cuda-graphs
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-weights-pvc
- name: cuda-graph-cache
persistentVolumeClaim:
claimName: cuda-graphs-pvc
GPU-specific considerations with Knative:
Activator buffering. When a pod is scaled to zero, Knative's Activator receives the first request and holds it (with HTTP 503 internally, not exposed to the caller) while the GPU pod comes up. The caller sees a slow response, not an error. This works as long as the client does not time out before the pod is ready. Set client timeouts above your worst-case cold-start time, typically 120-180 seconds for 8B models with warm image cache.
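The timeout lives on the client side, not in Knative. A minimal example with curl, using the Knative Service defined above:
# Allow up to 180s so a cold start surfaces as a slow response, not a client error
curl --max-time 180 -s http://vllm-ksvc.inference.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3-8B-Instruct","prompt":"ping","max_tokens":1}'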
Queue proxy overhead. Knative injects a queue-proxy sidecar into every pod. It consumes ~100m CPU and 100Mi memory. Budget for this in your node sizing.
minScale: 0 vs minScale: 1. For production user-facing inference, use minScale: 1. The cold-start latency tax is not acceptable for interactive use cases. For dev/staging endpoints, internal tools, and batch workloads where a visible scheduling delay is acceptable, minScale: 0 with CRIU restore makes sense economically.
Cold-Start Optimization Techniques
1. Pre-pull images via DaemonSet
Container image pull is the single longest phase for uncached pods. A DaemonSet that pre-pulls the vLLM image on all GPU nodes eliminates it entirely:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: vllm-image-prepuller
namespace: inference
spec:
selector:
matchLabels:
app: vllm-prepuller
template:
metadata:
labels:
app: vllm-prepuller
spec:
nodeSelector:
nvidia.com/gpu.present: "true"
initContainers:
- name: pull-vllm
image: vllm/vllm-openai:v0.8.5
command: ["sh", "-c", "echo Image pulled"]
resources:
limits:
memory: 100Mi
containers:
- name: pause
image: gcr.io/google_containers/pause:3.9
resources:
limits:
memory: 10Mi
The initContainer triggers a pull. The pause container keeps the DaemonSet pod running. When vLLM scales up, the image is already present on the node.
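Once the DaemonSet rolls out, verify the image actually landed on every GPU node:
# One prepuller pod per GPU node, all Running
kubectl get pods -n inference -l app=vllm-prepuller -o wide
# On any GPU node, the image should now be in the containerd cache
crictl images | grep vllm-openai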
2. Quantized weights
FP8 quantization reduces a 70B BF16 model from ~140GB to roughly 70GB on disk, cutting weight-load time roughly in half. Use --quantization fp8 in vLLM:
args:
- --model
- meta-llama/Llama-3-70B-Instruct
- --quantization
- fp8
For FP8 setup details and accuracy impact, see vLLM Multi-GPU Production Deployment 2026.
3. CUDA graph cache on a PVC
CUDA graphs capture GPU operation sequences. Without caching, every new pod re-captures them at startup. With a shared PVC:
volumes:
- name: cuda-graph-cache
persistentVolumeClaim:
claimName: cuda-graphs-pvcSet CUDA_GRAPHS_CACHE_DIR=/cuda-graphs and omit --enforce-eager so CUDA graphs remain enabled. The first pod to start writes the captured graphs to the PVC. Subsequent pods load from it, shaving 12-30 seconds from startup.
One caveat: ReadWriteOnce PVCs can only be mounted by pods on the same node simultaneously. If your replicas scale across multiple nodes, provision the PVC with a ReadWriteMany storage class (NFS, Longhorn in RWX mode, or a cloud-native equivalent) so all pods can read the cache regardless of placement.
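Two quick checks before relying on a shared cache across nodes (PVC name from the manifests above):
# Access modes on the graph-cache PVC: should include ReadWriteMany
kubectl get pvc cuda-graphs-pvc -n inference -o jsonpath='{.spec.accessModes}'
# Storage classes available in the cluster, to pick one that supports RWX
kubectl get storageclass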
4. NVMe model weight cache via hostPath
The fastest way to serve weights to a new pod is from local NVMe. Mount the model weights directory as a hostPath volume on GPU nodes. Weights are downloaded once from the Hugging Face hub and cached locally:
volumes:
- name: model-cache
hostPath:
path: /data/model-weights
type: DirectoryOrCreate
This survives pod restarts and node reboots (the files persist on disk). On Spheron's bare-metal nodes, NVMe sequential reads run 3-4 GB/s vs 400-600 MB/s for network-attached storage.
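The hostPath directory has to be seeded once per node. A sketch assuming the huggingface_hub CLI is installed on the node and HF_TOKEN is exported for gated models:
# Download into the standard hub cache layout under the hostPath directory
export HF_HOME=/data/model-weights
huggingface-cli download meta-llama/Llama-3-8B-Instruct
# The pod mounts /data/model-weights at /models and sets HF_HOME=/models,
# so vLLM finds the weights without touching the network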
5. Layer caching in Dockerfiles
If you want model weights baked into the image, keep them in a separate weight-only base image so they survive vLLM version bumps:
# Weight image - built separately, rarely changes
# Tag this independently (e.g. weights:llama-3-70b-v1) so it is not
# rebuilt when vLLM is updated.
FROM python:3.11-slim AS weights
COPY model_weights/ /weights/
# Runtime image - changes with vLLM updates
FROM vllm/vllm-openai:v0.8.5
COPY serving_config.json /config/
COPY --from=weights /weights/ /weights/Because the weight image is built and tagged independently, upgrading vllm/vllm-openai does not invalidate it. Note that if you build both stages in the same docker build invocation with the same base, Docker will still re-copy the weights layer. The decoupling only holds when the weight image is pre-built and referenced by a fixed tag. For 40-140GB models the simpler approach is a hostPath or PVC mount (covered above) so weights live entirely outside the image.
Snapshot-Based Fast Restore: CRIU and CUDA Checkpoint
CRIU (Checkpoint/Restore In Userspace) captures the entire state of a running process, including its GPU context, to disk. On restore, the process continues from exactly where it was checkpointed, without re-executing initialization code.
For a vLLM pod, this means:
- Checkpoint after first successful model load (weights loaded, CUDA graphs captured, KV cache initialized)
- On subsequent cold starts, restore from the checkpoint bundle
- Time to first token: under 5 seconds instead of 25-85 seconds
Requirements:
- NVIDIA driver 550+ (565+ recommended for the CRIUgpu reference stack)
- CUDA 12.x
- containerd 1.7+ with CRIU support
- Linux kernel 5.15+ (cgroups v2 enabled)
- Privileged containers (required for CRIU ptrace access)
Bare-metal matters here. Serverless GPU products run containers in restricted security contexts that block ptrace and CAP_SYS_PTRACE, which CRIU requires. Spheron's bare-metal H100 and L40S nodes give you full root access to configure cgroups v2 and run privileged containers.
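A few one-liners confirm the prerequisites on a node before attempting a checkpoint:
uname -r                          # expect 5.15 or newer
stat -fc %T /sys/fs/cgroup        # "cgroup2fs" means cgroups v2 is active
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # expect 550+
criu check                        # prints "Looks good." when CRIU can run
containerd --version              # expect 1.7+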
Triggering a checkpoint with the Kubernetes ContainerCheckpoint API:
The ContainerCheckpoint API (alpha in 1.25, beta and enabled by default since 1.30) is not exposed as a kubectl subcommand; you call the kubelet API directly on the node where the pod is running:
# Find which node the pod is on
NODE=$(kubectl get pod vllm-server-abc123 -n inference -o jsonpath='{.spec.nodeName}')
NODE_IP=$(kubectl get node $NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
# POST to the kubelet checkpoint endpoint
# Endpoint format: /checkpoint/<namespace>/<pod>/<container>
curl -X POST \
"https://${NODE_IP}:10250/checkpoint/inference/vllm-server-abc123/vllm" \
--cert /path/to/client.crt \
--key /path/to/client.key \
--cacert /path/to/ca.crt
This writes a checkpoint bundle (OCI image archive) to the node's containerd checkpoint directory, typically /var/lib/kubelet/checkpoints/. To restore from it, use the checkpoint image as an Image volume in a new pod spec (available from Kubernetes 1.31, beta in 1.33, via the ImageVolume feature gate) so containerd restores the CRIU snapshot instead of starting the container from scratch.
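On the node, confirm the bundle exists before building a restore flow around it (the filename pattern is indicative; the exact timestamped name will differ):
ls -lh /var/lib/kubelet/checkpoints/
# List the archive contents without extracting
tar -tf /var/lib/kubelet/checkpoints/checkpoint-vllm-server-abc123_inference-*.tar | head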
Expected numbers:
| Restore method | 8B model | 70B model |
|---|---|---|
| Scratch (warm image cache) | ~25s | ~85s |
| CRIU restore | <5s | <5s |
| CRIU restore (quantized FP8 70B) | <5s | <5s |
CRIU restores GPU memory directly. Model size does not affect restore time since the checkpoint captures the already-loaded state.
This pairs well with persistent NVMe CUDA graph caches, covered in Torch.compile and CUDA Graphs for LLM Inference with PyTorch 2.6.
Scale-to-Zero Economics: When Is It Worth It
Using live pricing from the Spheron API (H100 SXM5 at $4.41/hr on-demand):
| Pattern | GPU | $/hr (on-demand) | Monthly idle cost | Break-even RPS |
|---|---|---|---|---|
| Always-on 1 replica | H100 SXM5 | $4.41 | $3,222 | N/A (always warm) |
| minScale=1 + KEDA burst | H100 SXM5 | $4.41 | $3,222 (1 replica) | Any traffic |
| Scale-to-zero (Knative) | H100 SXM5 | $4.41 (pay only while active) | ~$0-50 | <5 RPS sustained |
| Scale-to-zero | L40S | See /pricing/ for current rates | ~$0-30 | <10 RPS sustained |
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 07 May 2026 and may have changed. Check current GPU pricing → for live rates.
The math flips when traffic exceeds about 5-10 RPS sustained. At that point a warm replica costs less than the user-facing latency tax of cold starts, especially without CRIU.
For a full comparison of billing models including reserved and spot pricing, see serverless vs on-demand vs reserved GPU billing comparison.
Step-by-Step: Deploy vLLM on Spheron with KEDA Autoscaling
Step 1: Provision a Spheron GPU Node and Bootstrap Kubernetes
Spin up an H100 GPU rental on Spheron for this walkthrough. An H100 SXM5 with 80GB HBM3 handles Llama-3-70B with 32K context without CPU offload.
After the node is provisioned, install k3s with GPU support:
# Install k3s with containerd (GPU support enabled by default)
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable=traefik" sh -
# Verify node
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# gpu-node-001 Ready control-plane,master 60s v1.32.4+k3s1
# Verify GPU
nvidia-smi
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
# +-----------------------------------------------------------------------------+
Install the NVIDIA device plugin for Kubernetes:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
# Verify GPU resource is visible
kubectl describe node gpu-node-001 | grep nvidia.com/gpu
# nvidia.com/gpu: 1
Step 2: Deploy vLLM as a Kubernetes Deployment
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-weights-pvc
namespace: inference
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Gi
storageClassName: nfs # Must support ReadWriteMany (NFS, Longhorn RWX, AWS EFS, etc.)
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: vllm-server
template:
metadata:
labels:
app: vllm-server
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.5
resources:
limits:
nvidia.com/gpu: "1"
memory: 80Gi
requests:
nvidia.com/gpu: "1"
memory: 40Gi
env:
- name: VLLM_WORKER_MULTIPROC_METHOD
value: spawn
- name: HF_HOME
value: /models
- name: HF_HUB_OFFLINE
value: "0"
args:
- --model
- meta-llama/Llama-3-8B-Instruct
- --enable-prefix-caching
- --port
- "8000"
ports:
- containerPort: 8000
name: http
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 30
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
failureThreshold: 12
volumeMounts:
- name: model-cache
mountPath: /models
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-weights-pvc
---
apiVersion: v1
kind: Service
metadata:
name: vllm-server
namespace: inference
spec:
selector:
app: vllm-server
ports:
- name: http
port: 8000
targetPort: 8000
type: ClusterIP
Storage class note: local-path binds the PV to a single Kubernetes node and only supports ReadWriteOnce. When KEDA scales this Deployment to multiple replicas (maxReplicaCount: 8 in Step 3), the scheduler will place burst replicas on additional nodes. Any replica on a node other than the one holding the local-path PV will get stuck in Pending because it cannot mount the volume. Use a ReadWriteMany storage class so all replicas can mount the weights regardless of node placement. Common options: NFS (nfs), Longhorn in RWX mode (longhorn), or a cloud-native equivalent like AWS EFS. For single-node clusters only, you can use ReadWriteOnce with storageClassName: local-path instead. Alternatively, skip the PVC entirely and use hostPath as described in the NVMe model weight cache section earlier in this post, which gives faster local NVMe reads and works naturally with per-node autoscaling.
Verify the pod reaches Running state and the API responds:
kubectl wait --for=condition=ready pod -l app=vllm-server -n inference --timeout=300s
curl http://$(kubectl get svc vllm-server -n inference -o jsonpath='{.spec.clusterIP}'):8000/v1/models
Step 3: Install KEDA and Configure ScaledObject
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace \
--version 2.19.0
# Wait for KEDA to be ready
kubectl wait --for=condition=ready pod -l app=keda-operator -n keda --timeout=60s
Apply the ScaledObject from the KEDA section above. Once applied, verify:
# KEDA creates and manages an HPA
kubectl get hpa -n inference
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# keda-hpa-vllm-scaledobject Deployment/vllm-server 0/5 (avg) 1 8 1
# Check ScaledObject status
kubectl get scaledobject -n inference
# NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS
# vllm-scaledobject   apps/v1.Deployment   vllm-server   1   8   prometheus, prometheus
Step 4: Add GPU Utilization Trigger and Prometheus Scraping
Install DCGM Exporter for GPU metrics:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring \
--create-namespace \
--version 3.5.0
Install kube-prometheus-stack for Prometheus:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.scrapeInterval=15s
To observe KEDA scaling in action, run a basic load test in one terminal and watch pods in another:
# Terminal 1: Watch pods
kubectl get pods -n inference -w
# Terminal 2: Generate load
for i in $(seq 1 20); do
curl -s http://VLLM_SERVICE:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3-8B-Instruct","prompt":"Explain GPU autoscaling:","max_tokens":100}' &
done
wait
KEDA should trigger a scale-up within one to two polling intervals (15-30 seconds) when the queue depth or GPU utilization exceeds the thresholds.
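To confirm the scale event came from the KEDA triggers rather than anything else:
kubectl get hpa keda-hpa-vllm-scaledobject -n inference
kubectl describe scaledobject vllm-scaledobject -n inference | tail -20
kubectl get events -n inference --sort-by=.lastTimestamp | grep -i scale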
Step 5: Configure Knative for Dev/Staging Endpoints
Apply Knative Serving (from the Knative section above). Then apply the Knative Service manifest.
Test cold-start behavior:
# Scale to zero by deleting all pods (Knative will restore from zero on next request)
kubectl delete pods -l serving.knative.dev/service=vllm-ksvc -n inference
# Time the cold start
time curl -s http://vllm-ksvc.inference.svc.cluster.local/v1/models
With pre-pulled images and a warm NVMe cache, expect 25-30 seconds for an 8B model cold start. With CRIU restore, under 5 seconds.
Step 6: Load Test and Observe Autoscaling
Use k6 for a gradual ramp-up:
import http from 'k6/http';
import { sleep } from 'k6';
export const options = {
stages: [
{ duration: '2m', target: 5 }, // ramp to 5 users
{ duration: '5m', target: 20 }, // ramp to 20 users
{ duration: '2m', target: 0 }, // scale down
],
};
export default function () {
const payload = JSON.stringify({
model: 'meta-llama/Llama-3-8B-Instruct',
prompt: 'Write a short explanation of Kubernetes autoscaling:',
max_tokens: 150,
});
http.post('http://vllm-ksvc.inference.svc.cluster.local/v1/completions', payload, {
headers: { 'Content-Type': 'application/json' },
});
sleep(1);
}
Run with k6 run load-test.js and in a separate terminal:
# Watch ScaledObject metric readings
watch "kubectl get scaledobject vllm-scaledobject -n inference -o yaml | grep -A5 externalMetricNames"
# Watch pod count
kubectl get pods -n inference -w
Multi-Replica Patterns: Warm + Burst
The most cost-effective production pattern: one always-warm replica that handles baseline P50 traffic, with KEDA scaling burst replicas when queue depth spikes.
ScaledObject configuration:
spec:
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-operated.monitoring.svc:9090
query: sum(vllm:num_requests_waiting{namespace="inference"})
threshold: "5"Add a PodDisruptionBudget to ensure the warm replica survives node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: vllm-pdb
namespace: inference
spec:
minAvailable: 1
selector:
matchLabels:
app: vllm-server
Use pod anti-affinity to place burst replicas on different nodes from the warm replica, avoiding resource contention:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: vllm-server
topologyKey: kubernetes.io/hostname
Cost math for a single warm H100 SXM5 plus burst replicas on-demand: during off-peak hours (16h/day) you run exactly 1 replica at $4.41/hr. During peak hours you might average 3 replicas. Compared to always-on 5 replicas, this cuts costs by roughly 60-70%.
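The arithmetic behind that range, using the illustrative replica counts above:
# warm + burst: 16h x 1 replica + 8h x 3 replicas = 40 GPU-hours/day
# always-on:    24h x 5 replicas                  = 120 GPU-hours/day
echo "(1 - 40/120) * 100" | bc -l     # ~66.7% fewer GPU-hours
echo "40 * 4.41" | bc -l              # ~$176/day warm + burst
echo "120 * 4.41" | bc -l             # ~$529/day always-on 5 replicas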
Cost Model: 24/7 Reserved vs Autoscaled vs Scale-to-Zero
| Traffic pattern | Reserved 24/7 | On-demand autoscaled (KEDA) | Scale-to-zero (Knative) |
|---|---|---|---|
| Constant 50 RPS | $3,222/mo | $3,222/mo (no idle savings) | Not viable (SLA impact) |
| 9am-5pm burst (8h/day) | $3,222/mo | ~$1,611/mo (~50% savings) | ~$1,128/mo (~65% savings) |
| Dev endpoint (<100 req/day) | $3,222/mo | ~$330/mo (~90% savings) | ~$50/mo (>95% savings) |
Recommendations:
- Constant 50 RPS: Use always-on with KEDA burst scaling. Scale-to-zero adds cold-start latency that users will notice.
- 9am-5pm burst: KEDA with minReplicaCount: 0 at night (via cron trigger or manual annotation, as sketched after this list) gives 50% savings. Knative with CRIU restore gets to 65% by eliminating idle time between request bursts during the workday.
- Dev endpoint: Scale-to-zero is the correct choice. Even the KEDA/on-demand column (~$330/mo) assumes active use during working hours. Knative scale-to-zero with CRIU restore brings the cost to near zero for low-traffic endpoints.
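The cron trigger mentioned above comes from KEDA's built-in cron scaler. A hedged sketch combining it with the queue-depth trigger so a warm replica exists only during business hours (timezone and schedule are assumptions to adjust):
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: cron                  # hold one warm replica during the workday
      metadata:
        timezone: America/New_York
        start: 0 9 * * 1-5
        end: 0 17 * * 1-5
        desiredReplicas: "1"
    - type: prometheus            # burst on queue depth at any hour
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        query: sum(vllm:num_requests_waiting{namespace="inference"})
        threshold: "5"
EOF
Overnight, with no warm pod, the first request still has nowhere to land unless KEDA-HTTP or Knative fronts the Service, which is the tradeoff already noted in the comparison table.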
For AI agent fleet scaling patterns, see scale AI agent fleets with GPU autoscaling and MCP orchestration.
GPU autoscaling for LLMs requires bare-metal control: pre-pulled image caches, NVMe weight loading, and privileged CRIU containers that serverless GPU products do not expose. Spheron's bare-metal H100 and L40S nodes give you those knobs without locking you into a managed serving layer.
Rent H100 on Spheron → | See L40S pricing → | View all GPU pricing →
