Standard Kubernetes HPA breaks on LLM inference. Not occasionally, not in edge cases. Every time a GPU pod restarts cold, the new replica spends 3-10 minutes pulling images, loading model weights from storage, capturing CUDA graphs, and warming its KV cache. By the time it is ready to serve, the replicas it was meant to relieve have been under load for minutes.
The tools that fix this, KEDA for queue-depth autoscaling and Knative for scale-to-zero, have good documentation for web services. Neither has a coherent guide for GPU LLM workloads specifically. The forum posts are scattered. The cold-start anatomy is underdocumented. No one source covers the full stack from metric triggers through CRIU checkpoint restore.
This guide walks through the complete stack on Spheron bare-metal GPU nodes: KEDA for metric-driven scaling, Knative for scale-to-zero, cold-start optimization techniques including CRIU checkpointing, and a cost model across three realistic traffic patterns. For Kubernetes GPU cluster setup fundamentals, see Kubernetes GPU Orchestration 2026. For vLLM tuning and multi-GPU tensor parallelism, see the vLLM Multi-GPU Production Deployment 2026 guide.
Why Standard HPA Fails for LLM Workloads
CPU and memory utilization do not reflect how an LLM inference pod is actually stressed. A vLLM pod handling 50 concurrent requests runs at 5-8% CPU while its GPU sits at 95%. HPA sees a healthy pod. Users see 30-second token generation latency as the KV cache fills and requests stack in the queue.
Here is a broken HPA spec that teams typically start with:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-server
minReplicas: 1
maxReplicas: 5
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
This will never scale. The CPU target of 70% will not be reached during normal inference load. Even under severe GPU saturation, the CPU metric stays low because token generation is GPU-bound.
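You can see the mismatch directly on a loaded pod. A quick check, assuming curl is available inside the container (otherwise port-forward port 8000 and query from your machine) and substituting your own pod name:
# CPU as the HPA sees it: a few hundred millicores at most (requires metrics-server, bundled with k3s)
kubectl top pod vllm-server-abc123 -n inference
# Queue depth as vLLM reports it: climbs while users wait
kubectl exec -n inference vllm-server-abc123 -- \
  curl -s localhost:8000/metrics | grep num_requests_waiting
# GPU utilization in the same pod: near 100% under the same load
kubectl exec -n inference vllm-server-abc123 -- \
  nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader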
The second problem is cold-start time. When HPA eventually triggers a scale event (after CPU actually spikes due to something unrelated, or never), the new pod faces this timeline:
| Phase | 8B model (H100 NVMe) | 70B model (H100 NVMe) |
|---|---|---|
| Container pull (uncached) | 4-6 min (15GB image) | 4-6 min |
| Weight load from NVMe | ~8s | ~45s |
| CUDA graph capture | ~12s | ~30s |
| KV cache warmup | ~5s | ~10s |
| Total (cold, uncached) | ~5-7 min | ~6-8 min |
| Total (warm image cache) | ~25s | ~85s |
Pre-pulled images are the single biggest lever. Everything else is seconds. The container pull is minutes.
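A rough way to put numbers on the pull phase for your own image, run directly on a GPU node (k3s ships crictl; on other distros install it separately):
# First pull is the multi-minute phase in the table above
time crictl pull vllm/vllm-openai:v0.8.5
# Once cached, the image shows up here and subsequent pod starts skip the pull
crictl images | grep vllm-openai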
Cold-Start Anatomy: What Actually Takes Time
Weight loading from NVMe. This is where bare metal matters. Spheron's H100 nodes have local NVMe with sequential read bandwidth of 3-4 GB/s. A 70B BF16 model is about 140GB on disk. At 3.5 GB/s that is 40 seconds. Cloud-attached NFS storage runs 400-600 MB/s. The same model takes 4-6 minutes. That one difference makes local NVMe essential for any workload that expects to scale up and be serving within 90 seconds.
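To verify the arithmetic on your own node, measure sequential read bandwidth where the weights live (the /data path is an assumption; fio must be installed):
# Sequential read benchmark on the NVMe volume
fio --name=seqread --filename=/data/fio-test --rw=read --bs=1M \
    --size=8G --direct=1 --group_reporting
rm /data/fio-test
# Expected weight-load time: model size on disk divided by the measured rate,
# e.g. 140 GB / 3.5 GB/s = 40 s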
CUDA graph capture. When vLLM runs without --enforce-eager (the default), it captures CUDA graphs for common batch sizes. A CUDA graph records a fixed sequence of GPU operations and replays them with lower launch overhead than dynamic dispatch. The capture itself takes 10-30 seconds depending on the model and how many batch sizes are traced. On subsequent runs, if you persist the graph cache to a PVC, capture time drops to near zero. The env var CUDA_GRAPHS_CACHE_DIR controls the cache location.
KV cache warmup. The first N requests populate the prefix cache. vLLM's --enable-prefix-caching flag lets subsequent requests with shared prefixes skip redundant computation. The warmup period is typically 5-10 requests, after which hit rates stabilize. For more detail on KV cache sizing and prefix cache tuning, see KV Cache Optimization Guide.
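The warmup can be automated so the first real user does not pay for it. A minimal sketch, assuming the Service name and model used later in this guide, firing a handful of requests that share a common prefix:
# Populate the prefix cache before routing real traffic
for i in $(seq 1 8); do
  curl -s http://vllm-server.inference.svc:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"meta-llama/Llama-3-8B-Instruct","prompt":"You are a helpful assistant. Warmup request.","max_tokens":8}' \
    > /dev/null
done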
This NVMe advantage is built into Spheron H100 nodes: bare-metal provisioning means model weights load from local disk at full NVMe bandwidth, not over a network filesystem.
KEDA vs Knative vs Custom Controller: Which for Which Stack
| Autoscaler | Trigger type | Scale-to-zero | Routing changes | Best for |
|---|---|---|---|---|
| KEDA | Any metric (Prometheus, queues, custom) | With KEDA-HTTP add-on | None (standard Service) | Queue-depth LLM serving, batch processing |
| Knative Serving | HTTP concurrency / RPS | Native | Replaces Service with Knative Route | HTTP inference endpoints, user-facing chatbots |
| Ray Serve autoscaler | Ray task queue | Partial (cluster-level) | Ray Serve router | Multi-model pipelines, agent frameworks |
| Custom operator (llm-d) | Routing gateway metrics | No | Kubernetes Gateway API | Prefill/decode disaggregation, multi-pool routing |
KEDA works with standard Kubernetes Deployments and Services. No routing changes. You add a ScaledObject and KEDA manages an HPA behind the scenes, driven by metrics you define instead of CPU/memory. The tradeoff: KEDA-HTTP is required for true scale-to-zero, and it adds a reverse proxy in the request path.
Knative Serving intercepts all traffic through its Activator component. The Activator buffers incoming requests while a pod cold-starts, then proxies them once the pod is ready. This makes scale-to-zero transparent to callers, with latency as the only user-visible impact. The tradeoff: Knative replaces your Service with a Knative Route, which changes how you address the endpoint.
For disaggregated inference patterns, see llm-d Kubernetes disaggregated inference guide. For Ray Serve on GPU cloud, see Ray Serve GPU cloud LLM deployment.
For workloads that need GPU quota management and fractional sharing on top of Kubernetes rather than reactive autoscaling, NVIDIA Run:ai on GPU Cloud covers the scheduler-first approach.
KEDA: Scale on Queue Depth and GPU Utilization
Install KEDA via Helm:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace \
--version 2.19.0
vLLM exposes a Prometheus-compatible /metrics endpoint. The key metric for queue depth is vllm:num_requests_waiting. When this exceeds your threshold, new replicas should come up. The second trigger, GPU utilization from DCGM Exporter, catches burst conditions before the queue fills.
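Before wiring these into KEDA, confirm both metrics are actually scrapeable. A quick check, assuming the Deployment and dcgm-exporter Service names used in this guide:
# vLLM queue depth, via a temporary port-forward
kubectl port-forward -n inference deploy/vllm-server 8000:8000 &
curl -s localhost:8000/metrics | grep num_requests_waiting
# DCGM GPU utilization (dcgm-exporter listens on 9400 by default)
kubectl port-forward -n monitoring svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
With both metrics visible, the ScaledObject below ties them to scale decisions: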
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-scaledobject
namespace: inference
spec:
scaleTargetRef:
name: vllm-server
pollingInterval: 15
cooldownPeriod: 120
minReplicaCount: 1
maxReplicaCount: 8
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleDown:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 1
periodSeconds: 60
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-operated.monitoring.svc:9090
metricName: vllm_requests_waiting
query: sum(vllm:num_requests_waiting{namespace="inference"})
threshold: "5"
- type: prometheus
metadata:
serverAddress: http://prometheus-operated.monitoring.svc:9090
metricName: dcgm_gpu_utilization
query: avg(DCGM_FI_DEV_GPU_UTIL{namespace="inference",pod=~"vllm-server-.*"})
threshold: "80"Key settings to understand:
- pollingInterval: 15 checks metrics every 15 seconds. Lower values give faster reaction but add Prometheus query load.
- cooldownPeriod: 120 waits 2 minutes after the last scale event before allowing scale-down. Prevents flapping.
- stabilizationWindowSeconds: 60 on scale-down means KEDA tracks the highest replica count recommendation over the last 60 seconds before reducing. This prevents a momentary metric dip from triggering a premature scale-down.
- minReplicaCount: 1 keeps one warm replica for production. Use 0 only with KEDA-HTTP for true scale-to-zero.
The dual-trigger approach matters: the GPU utilization trigger fires when existing replicas are hot, before the queue fills and before user-visible latency degrades. The queue depth trigger catches cases where GPU utilization is moderate but requests are stacking (long-context inputs with slow decode).
Knative Serving: Scale-to-Zero with GPU Pods
Install Knative Serving and the net-kourier ingress:
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.22.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.22.0/serving-core.yaml
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.22.0/kourier.yaml
kubectl patch configmap/config-network \
--namespace knative-serving \
--type merge \
--patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'Define a Knative Service for vLLM:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: vllm-ksvc
namespace: inference
spec:
template:
metadata:
annotations:
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: concurrency
autoscaling.knative.dev/target: "10"
autoscaling.knative.dev/minScale: "0"
autoscaling.knative.dev/maxScale: "5"
autoscaling.knative.dev/scale-to-zero-grace-period: "300s"
autoscaling.knative.dev/window: "60s"
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.5
resources:
limits:
nvidia.com/gpu: "1"
memory: 80Gi
requests:
nvidia.com/gpu: "1"
memory: 80Gi
env:
- name: VLLM_WORKER_MULTIPROC_METHOD
value: spawn
- name: HF_HOME
value: /models
- name: CUDA_GRAPHS_CACHE_DIR
value: /cuda-graphs
args:
- --model
- meta-llama/Llama-3-8B-Instruct
- --enable-prefix-caching
volumeMounts:
- name: model-cache
mountPath: /models
- name: cuda-graph-cache
mountPath: /cuda-graphs
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-weights-pvc
- name: cuda-graph-cache
persistentVolumeClaim:
claimName: cuda-graphs-pvc
GPU-specific considerations with Knative:
Activator buffering. When a pod is scaled to zero, Knative's Activator receives the first request and holds it (with HTTP 503 internally, not exposed to the caller) while the GPU pod comes up. The caller sees a slow response, not an error. This works as long as the client does not time out before the pod is ready. Set client timeouts above your worst-case cold-start time, typically 120-180 seconds for 8B models with warm image cache.
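The timeout lives on the client side, not in Knative. A minimal example with curl, using the Knative Service defined above:
# Allow up to 180s so a cold start surfaces as a slow response, not a client error
curl --max-time 180 -s http://vllm-ksvc.inference.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3-8B-Instruct","prompt":"ping","max_tokens":1}'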
Queue proxy overhead. Knative injects a queue-proxy sidecar into every pod. It consumes ~100m CPU and 100Mi memory. Budget for this in your node sizing.
minScale: 0 vs minScale: 1. For production user-facing inference, use minScale: 1. The cold-start latency tax is not acceptable for interactive use cases. For dev/staging endpoints, internal tools, and batch workloads where a visible scheduling delay is acceptable, minScale: 0 with CRIU restore makes sense economically.
Cold-Start Optimization Techniques
1. Pre-pull images via DaemonSet
Container image pull is the single longest phase for uncached pods. A DaemonSet that pre-pulls the vLLM image on all GPU nodes eliminates it entirely:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: vllm-image-prepuller
namespace: inference
spec:
selector:
matchLabels:
app: vllm-prepuller
template:
metadata:
labels:
app: vllm-prepuller
spec:
nodeSelector:
nvidia.com/gpu.present: "true"
initContainers:
- name: pull-vllm
image: vllm/vllm-openai:v0.8.5
command: ["sh", "-c", "echo Image pulled"]
resources:
limits:
memory: 100Mi
containers:
- name: pause
image: gcr.io/google_containers/pause:3.9
resources:
limits:
memory: 10Mi
The initContainer triggers a pull. The pause container keeps the DaemonSet pod running. When vLLM scales up, the image is already present on the node.
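Once the DaemonSet rolls out, verify the image actually landed on every GPU node:
# One prepuller pod per GPU node, all Running
kubectl get pods -n inference -l app=vllm-prepuller -o wide
# On any GPU node, the image should now be in the containerd cache
crictl images | grep vllm-openai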
2. Quantized weights
FP8 quantization reduces a 70B BF16 model from ~140GB to roughly 70GB on disk, cutting weight-load time roughly in half. Use --quantization fp8 in vLLM:
args:
- --model
- meta-llama/Llama-3-70B-Instruct
- --quantization
- fp8
For FP8 setup details and accuracy impact, see vLLM Multi-GPU Production Deployment 2026.
3. CUDA graph cache on a PVC
CUDA graphs capture GPU operation sequences. Without caching, every new pod re-captures them at startup. With a shared PVC:
volumes:
- name: cuda-graph-cache
persistentVolumeClaim:
claimName: cuda-graphs-pvcSet CUDA_GRAPHS_CACHE_DIR=/cuda-graphs and omit --enforce-eager so CUDA graphs remain enabled. The first pod to start writes the captured graphs to the PVC. Subsequent pods load from it, shaving 12-30 seconds from startup.
One caveat: ReadWriteOnce PVCs can only be mounted by pods on the same node simultaneously. If your replicas scale across multiple nodes, provision the PVC with a ReadWriteMany storage class (NFS, Longhorn in RWX mode, or a cloud-native equivalent) so all pods can read the cache regardless of placement.
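Two quick checks before relying on a shared cache across nodes (PVC name from the manifests above):
# Access modes on the graph-cache PVC: should include ReadWriteMany
kubectl get pvc cuda-graphs-pvc -n inference -o jsonpath='{.spec.accessModes}'
# Storage classes available in the cluster, to pick one that supports RWX
kubectl get storageclass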
4. NVMe model weight cache via hostPath
The fastest way to serve weights to a new pod is from local NVMe. Mount the model weights directory as a hostPath volume on GPU nodes. Weights are downloaded once from the Hugging Face hub and cached locally:
volumes:
- name: model-cache
hostPath:
path: /data/model-weights
type: DirectoryOrCreate
This survives pod restarts and node reboots (the files persist on disk). On Spheron's bare-metal nodes, NVMe sequential reads run 3-4 GB/s vs 400-600 MB/s for network-attached storage.
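The hostPath directory has to be seeded once per node. A sketch assuming the huggingface_hub CLI is installed on the node and HF_TOKEN is exported for gated models:
# Download into the standard hub cache layout under the hostPath directory
export HF_HOME=/data/model-weights
huggingface-cli download meta-llama/Llama-3-8B-Instruct
# The pod mounts /data/model-weights at /models and sets HF_HOME=/models,
# so vLLM finds the weights without touching the network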
5. Layer caching in Dockerfiles
If you want model weights baked into the image, keep them in a separate weight-only base image so they survive vLLM version bumps:
# Weight image - built separately, rarely changes
# Tag this independently (e.g. weights:llama-3-70b-v1) so it is not
# rebuilt when vLLM is updated.
FROM python:3.11-slim AS weights
COPY model_weights/ /weights/
# Runtime image - changes with vLLM updates
FROM vllm/vllm-openai:v0.8.5
COPY serving_config.json /config/
COPY --from=weights /weights/ /weights/Because the weight image is built and tagged independently, upgrading vllm/vllm-openai does not invalidate it. Note that if you build both stages in the same docker build invocation with the same base, Docker will still re-copy the weights layer. The decoupling only holds when the weight image is pre-built and referenced by a fixed tag. For 40-140GB models the simpler approach is a hostPath or PVC mount (covered above) so weights live entirely outside the image.
Snapshot-Based Fast Restore: CRIU and CUDA Checkpoint
CRIU (Checkpoint/Restore In Userspace) captures the entire state of a running process, including its GPU context, to disk. On restore, the process continues from exactly where it was checkpointed, without re-executing initialization code.
For a vLLM pod, this means:
- Checkpoint after first successful model load (weights loaded, CUDA graphs captured, KV cache initialized)
- On subsequent cold starts, restore from the checkpoint bundle
- Time to first token: under 5 seconds instead of 25-85 seconds
Requirements:
- NVIDIA driver 550+ (565+ recommended for the CRIUgpu reference stack)
- CUDA 12.x
- containerd 1.7+ with CRIU support
- Linux kernel 5.15+ (cgroups v2 enabled)
- Privileged containers (required for CRIU ptrace access)
Bare-metal matters here. Serverless GPU products run containers in restricted security contexts that block ptrace and CAP_SYS_PTRACE, which CRIU requires. Spheron's bare-metal H100 and L40S nodes give you full root access to configure cgroups v2 and run privileged containers.
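A few one-liners confirm the prerequisites on a node before attempting a checkpoint:
uname -r                          # expect 5.15 or newer
stat -fc %T /sys/fs/cgroup        # "cgroup2fs" means cgroups v2 is active
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # expect 550+
criu check                        # prints "Looks good." when CRIU can run
containerd --version              # expect 1.7+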
Triggering a checkpoint with the Kubernetes ContainerCheckpoint API:
The ContainerCheckpoint API (alpha in 1.25, beta and enabled by default since 1.30) is not exposed as a kubectl subcommand; you call the kubelet API directly on the node where the pod is running:
# Find which node the pod is on
NODE=$(kubectl get pod vllm-server-abc123 -n inference -o jsonpath='{.spec.nodeName}')
NODE_IP=$(kubectl get node $NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
# POST to the kubelet checkpoint endpoint
# Endpoint format: /checkpoint/<namespace>/<pod>/<container>
curl -X POST \
"https://${NODE_IP}:10250/checkpoint/inference/vllm-server-abc123/vllm" \
--cert /path/to/client.crt \
--key /path/to/client.key \
--cacert /path/to/ca.crt
This writes a checkpoint bundle (OCI image archive) to the node's containerd checkpoint directory, typically /var/lib/kubelet/checkpoints/. To restore from it, use the checkpoint image as an Image volume in a new pod spec (available from Kubernetes 1.31, beta in 1.33, via the ImageVolume feature gate) so containerd restores the CRIU snapshot instead of starting the container from scratch.
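On the node, confirm the bundle exists before building a restore flow around it (the filename pattern is indicative; the exact timestamped name will differ):
ls -lh /var/lib/kubelet/checkpoints/
# List the archive contents without extracting
tar -tf /var/lib/kubelet/checkpoints/checkpoint-vllm-server-abc123_inference-*.tar | head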
Expected numbers:
| Restore method | 8B model | 70B model |
|---|---|---|
| Scratch (warm image cache) | ~25s | ~85s |
| CRIU restore | <5s | <5s |
| CRIU restore (quantized FP8 70B) | <5s | <5s |
CRIU restores GPU memory directly. Model size does not affect restore time since the checkpoint captures the already-loaded state.
This pairs well with persistent NVMe CUDA graph caches, covered in Torch.compile and CUDA Graphs for LLM Inference with PyTorch 2.6.
Scale-to-Zero Economics: When Is It Worth It
Using live pricing from the Spheron API (H100 SXM5 at $4.41/hr on-demand):
| Pattern | GPU | $/hr (on-demand) | Monthly idle cost | Break-even RPS |
|---|---|---|---|---|
| Always-on 1 replica | H100 SXM5 | $4.41 | $3,222 | N/A (always warm) |
| minScale=1 + KEDA burst | H100 SXM5 | $4.41 | $3,222 (1 replica) | Any traffic |
| Scale-to-zero (Knative) | H100 SXM5 | $4.41 (pay only while active) | ~$0-50 | <5 RPS sustained |
| Scale-to-zero | L40S | See /pricing/ for current rates | ~$0-30 | <10 RPS sustained |
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 07 May 2026 and may have changed. Check current GPU pricing → for live rates.
The math flips when traffic exceeds about 5-10 RPS sustained. At that point a warm replica costs less than the user-facing latency tax of cold starts, especially without CRIU.
For a full comparison of billing models including reserved and spot pricing, see serverless vs on-demand vs reserved GPU billing comparison.
Step-by-Step: Deploy vLLM on Spheron with KEDA Autoscaling
Step 1: Provision a Spheron GPU Node and Bootstrap Kubernetes
Spin up an H100 GPU rental on Spheron for this walkthrough. An H100 SXM5 with 80GB HBM3 handles Llama-3-70B with 32K context without CPU offload.
After the node is provisioned, install k3s with GPU support:
# Install k3s with containerd (GPU support enabled by default)
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable=traefik" sh -
# Verify node
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# gpu-node-001 Ready control-plane,master 60s v1.32.4+k3s1
# Verify GPU
nvidia-smi
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
# +-----------------------------------------------------------------------------+
Install the NVIDIA device plugin for Kubernetes:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
# Verify GPU resource is visible
kubectl describe node gpu-node-001 | grep nvidia.com/gpu
# nvidia.com/gpu: 1
Step 2: Deploy vLLM as a Kubernetes Deployment
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-weights-pvc
namespace: inference
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Gi
storageClassName: nfs # Must support ReadWriteMany (NFS, Longhorn RWX, AWS EFS, etc.)
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: vllm-server
template:
metadata:
labels:
app: vllm-server
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.8.5
resources:
limits:
nvidia.com/gpu: "1"
memory: 80Gi
requests:
nvidia.com/gpu: "1"
memory: 40Gi
env:
- name: VLLM_WORKER_MULTIPROC_METHOD
value: spawn
- name: HF_HOME
value: /models
- name: HF_HUB_OFFLINE
value: "0"
args:
- --model
- meta-llama/Llama-3-8B-Instruct
- --enable-prefix-caching
- --port
- "8000"
ports:
- containerPort: 8000
name: http
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 30
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
failureThreshold: 12
volumeMounts:
- name: model-cache
mountPath: /models
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-weights-pvc
---
apiVersion: v1
kind: Service
metadata:
name: vllm-server
namespace: inference
spec:
selector:
app: vllm-server
ports:
- name: http
port: 8000
targetPort: 8000
type: ClusterIP
Storage class note: local-path binds the PV to a single Kubernetes node and only supports ReadWriteOnce. When KEDA scales this Deployment to multiple replicas (maxReplicaCount: 8 in Step 3), the scheduler will place burst replicas on additional nodes. Any replica on a node other than the one holding the local-path PV will get stuck in Pending because it cannot mount the volume. Use a ReadWriteMany storage class so all replicas can mount the weights regardless of node placement. Common options: NFS (nfs), Longhorn in RWX mode (longhorn), or a cloud-native equivalent like AWS EFS. For single-node clusters only, you can use ReadWriteOnce with storageClassName: local-path instead. Alternatively, skip the PVC entirely and use hostPath as described in the NVMe model weight cache section earlier in this post, which gives faster local NVMe reads and works naturally with per-node autoscaling.
Verify the pod reaches Running state and the API responds:
kubectl wait --for=condition=ready pod -l app=vllm-server -n inference --timeout=300s
curl http://$(kubectl get svc vllm-server -n inference -o jsonpath='{.spec.clusterIP}'):8000/v1/models
Step 3: Install KEDA and Configure ScaledObject
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace \
--version 2.19.0
# Wait for KEDA to be ready
kubectl wait --for=condition=ready pod -l app=keda-operator -n keda --timeout=60s
Apply the ScaledObject from the KEDA section above. Once applied, verify:
# KEDA creates and manages an HPA
kubectl get hpa -n inference
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# keda-hpa-vllm-scaledobject Deployment/vllm-server 0/5 (avg) 1 8 1
# Check ScaledObject status
kubectl get scaledobject -n inference
# NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS
# vllm-scaledobject   apps/v1.Deployment   vllm-server   1   8   prometheus, prometheus
Step 4: Add GPU Utilization Trigger and Prometheus Scraping
Install DCGM Exporter for GPU metrics:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring \
--create-namespace \
--version 3.5.0
Install kube-prometheus-stack for Prometheus:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.scrapeInterval=15s
To observe KEDA scaling in action, run a basic load test in one terminal and watch pods in another:
# Terminal 1: Watch pods
kubectl get pods -n inference -w
# Terminal 2: Generate load
for i in $(seq 1 20); do
curl -s http://VLLM_SERVICE:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3-8B-Instruct","prompt":"Explain GPU autoscaling:","max_tokens":100}' &
done
wait
KEDA should trigger a scale-up within one to two polling intervals (15-30 seconds) when the queue depth or GPU utilization exceeds the thresholds.
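To confirm the scale event came from the KEDA triggers rather than anything else:
kubectl get hpa keda-hpa-vllm-scaledobject -n inference
kubectl describe scaledobject vllm-scaledobject -n inference | tail -20
kubectl get events -n inference --sort-by=.lastTimestamp | grep -i scale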
Step 5: Configure Knative for Dev/Staging Endpoints
Apply Knative Serving (from the Knative section above). Then apply the Knative Service manifest.
Test cold-start behavior:
# Scale to zero by deleting all pods (Knative will restore from zero on next request)
kubectl delete pods -l serving.knative.dev/service=vllm-ksvc -n inference
# Time the cold start
time curl -s http://vllm-ksvc.inference.svc.cluster.local/v1/models
With pre-pulled images and a warm NVMe cache, expect 25-30 seconds for an 8B model cold start. With CRIU restore, under 5 seconds.
Step 6: Load Test and Observe Autoscaling
Use k6 for a gradual ramp-up:
import http from 'k6/http';
import { sleep } from 'k6';
export const options = {
stages: [
{ duration: '2m', target: 5 }, // ramp to 5 users
{ duration: '5m', target: 20 }, // ramp to 20 users
{ duration: '2m', target: 0 }, // scale down
],
};
export default function () {
const payload = JSON.stringify({
model: 'meta-llama/Llama-3-8B-Instruct',
prompt: 'Write a short explanation of Kubernetes autoscaling:',
max_tokens: 150,
});
http.post('http://vllm-ksvc.inference.svc.cluster.local/v1/completions', payload, {
headers: { 'Content-Type': 'application/json' },
});
sleep(1);
}
Run with k6 run load-test.js and in a separate terminal:
# Watch ScaledObject metric readings
watch "kubectl get scaledobject vllm-scaledobject -n inference -o yaml | grep -A5 externalMetricNames"
# Watch pod count
kubectl get pods -n inference -w
Multi-Replica Patterns: Warm + Burst
The most cost-effective production pattern: one always-warm replica that handles baseline P50 traffic, with KEDA scaling burst replicas when queue depth spikes.
ScaledObject configuration:
spec:
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-operated.monitoring.svc:9090
query: sum(vllm:num_requests_waiting{namespace="inference"})
threshold: "5"Add a PodDisruptionBudget to ensure the warm replica survives node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: vllm-pdb
namespace: inference
spec:
minAvailable: 1
selector:
matchLabels:
app: vllm-server
Use pod anti-affinity to place burst replicas on different nodes from the warm replica, avoiding resource contention:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: vllm-server
topologyKey: kubernetes.io/hostname
Cost math for a single warm H100 SXM5 plus burst replicas on-demand: during off-peak hours (16h/day) you run exactly 1 replica at $4.41/hr. During peak hours you might average 3 replicas. Compared to always-on 5 replicas, this cuts costs by roughly 60-70%.
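The arithmetic behind that range, using the illustrative replica counts above:
# warm + burst: 16h x 1 replica + 8h x 3 replicas = 40 GPU-hours/day
# always-on:    24h x 5 replicas                  = 120 GPU-hours/day
echo "(1 - 40/120) * 100" | bc -l     # ~66.7% fewer GPU-hours
echo "40 * 4.41" | bc -l              # ~$176/day warm + burst
echo "120 * 4.41" | bc -l             # ~$529/day always-on 5 replicas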
Cost Model: 24/7 Reserved vs Autoscaled vs Scale-to-Zero
| Traffic pattern | Reserved 24/7 | On-demand autoscaled (KEDA) | Scale-to-zero (Knative) |
|---|---|---|---|
| Constant 50 RPS | $3,222/mo | $3,222/mo (no idle savings) | Not viable (SLA impact) |
| 9am-5pm burst (8h/day) | $3,222/mo | ~$1,611/mo (~50% savings) | ~$1,128/mo (~65% savings) |
| Dev endpoint (<100 req/day) | $3,222/mo | ~$330/mo (~90% savings) | ~$50/mo (>95% savings) |
Recommendations:
- Constant 50 RPS: Use always-on with KEDA burst scaling. Scale-to-zero adds cold-start latency that users will notice.
- 9am-5pm burst: KEDA with minReplicaCount: 0 at night (via cron trigger or manual annotation, as sketched after this list) gives 50% savings. Knative with CRIU restore gets to 65% by eliminating idle time between request bursts during the workday.
- Dev endpoint: Scale-to-zero is the correct choice. Even the KEDA/on-demand column (~$330/mo) assumes active use during working hours. Knative scale-to-zero with CRIU restore brings the cost to near zero for low-traffic endpoints.
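The cron trigger mentioned above comes from KEDA's built-in cron scaler. A hedged sketch combining it with the queue-depth trigger so a warm replica exists only during business hours (timezone and schedule are assumptions to adjust):
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: cron                  # hold one warm replica during the workday
      metadata:
        timezone: America/New_York
        start: 0 9 * * 1-5
        end: 0 17 * * 1-5
        desiredReplicas: "1"
    - type: prometheus            # burst on queue depth at any hour
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        query: sum(vllm:num_requests_waiting{namespace="inference"})
        threshold: "5"
EOF
Overnight, with no warm pod, the first request still has nowhere to land unless KEDA-HTTP or Knative fronts the Service, which is the tradeoff already noted in the comparison table.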
For AI agent fleet scaling patterns, see scale AI agent fleets with GPU autoscaling and MCP orchestration.
GPU autoscaling for LLMs requires bare-metal control: pre-pulled image caches, NVMe weight loading, and privileged CRIU containers that serverless GPU products do not expose. Spheron's bare-metal H100 and L40S nodes give you those knobs without locking you into a managed serving layer.
Rent H100 on Spheron → | See L40S pricing → | View all GPU pricing →
