
GPU Goodput Engineering: Why AI Clusters Sit at 5% Utilization and How to Fix It (2026)

Written by Mitrasish, Co-founder | May 19, 2026

Enterprise AI clusters run at a median of 5% GPU utilization. That number comes from Cast AI's 2026 State of Kubernetes Optimization Report, which measured SM utilization across real production clusters. Five percent. That means for every dollar you spend on H100 time, ninety-five cents is generating heat, not tokens.

The gap between what you pay for and what you get is not a hardware problem. It's a scheduling, sizing, and measurement problem. This post diagnoses the four root causes and gives you concrete fixes for each. For broader cost context, the GPU cost optimization playbook covers the full spectrum of GPU spend reduction, including this domain.

The 5% Number and What It Actually Measures

SM utilization measures the fraction of streaming multiprocessors that are actively executing warps at a given instant. A 5% SM utilization reading means the GPU's compute units are doing something useful only 5% of the time. The other 95% is idle: waiting for memory transactions, waiting for the next batch to arrive, waiting for prefill to finish before decode can start.

Three metrics matter for diagnosing this:

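SM utilization (DCGM_FI_DEV_GPU_UTIL): The top-level signal. Strictly, this counter reports the share of the sample period in which any kernel was running, so it can overstate how busy the SMs are; DCGM_FI_PROF_SM_ACTIVE gives the finer-grained view. Anything below 40% for an inference server under load is a red flag.

Memory bandwidth utilization (DCGM_FI_DEV_MEM_COPY_UTIL): The share of time device memory was being read or written. A high MBU with low SM util points to memory-bound workloads where the GPU is stalled on data movement, not compute. Common with large-context KV-cache reads.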

Goodput: The metric neither of those captures. More on this in the next section.

The Cast AI report's 5% figure is a median across anonymized enterprise clusters. Individual teams see wide variance, but the direction is consistent: reserved GPU capacity sits underused because it's provisioned for peak loads that rarely arrive, and because the workloads running on it aren't scheduled efficiently.

The $401 billion AI infrastructure market is being built on clusters that are 95% idle. That math eventually comes due.

What Goodput Actually Means

Goodput is not throughput. Throughput counts tokens generated per second. Goodput counts tokens that arrive within your latency budget.

A useful operator-friendly approximation, inspired by the DistServe paper (Zhong et al., 2024): goodput rate equals throughput multiplied by the probability that latency falls below the SLO threshold.

goodput_rate = throughput * P(latency < SLO_threshold)

A cluster can show 80% SM utilization and deliver 20% goodput. Here's a concrete example. You have an H100 running a 70B model with max_num_seqs=256 and a chat endpoint with a TTFT SLO of 500 ms. Batches of 256 requests take 2-3 seconds to prefill. SM utilization is high: all that compute is genuinely busy doing the prefill. But 80% of requests exceed the 500 ms TTFT target before they get a single output token. Throughput looks fine. Goodput is 20%.
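The approximation above is easy to compute from a window of per-request latencies. The following is a minimal sketch, not a production metric pipeline: the function name, the sample latencies, and the 1,000 tokens/s throughput figure are all hypothetical and chosen only to reproduce the 20% case.

```python
from typing import Sequence


def goodput_rate(throughput_tok_s: float, latencies_s: Sequence[float], slo_s: float) -> float:
    """Approximate goodput as throughput * P(latency < SLO)."""
    if not latencies_s:
        return 0.0
    within_slo = sum(1 for latency in latencies_s if latency < slo_s)
    return throughput_tok_s * within_slo / len(latencies_s)


# Hypothetical TTFT samples in seconds: 2 of 10 requests meet a 500 ms SLO.
ttft_samples_s = [0.3, 0.4, 2.1, 2.4, 2.2, 2.8, 2.5, 2.9, 2.6, 3.0]
rate = goodput_rate(throughput_tok_s=1000.0, latencies_s=ttft_samples_s, slo_s=0.5)
print(f"goodput: {rate:.0f} tokens/s out of 1000 tokens/s throughput")
```

The throughput figure stays the same whether the SLO is met or not; only the fraction of requests inside the latency budget changes the result.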

Raw utilization hides this completely. You need both metrics instrumented before you can diagnose anything.

Diagnosing Your Cluster: The Four Failure Modes

Deploy NVIDIA DCGM Exporter as a DaemonSet and configure Prometheus to scrape these counters at 10-second intervals:

  • DCGM_FI_DEV_GPU_UTIL (SM utilization)
  • DCGM_FI_DEV_MEM_COPY_UTIL (memory bandwidth utilization)
  • DCGM_FI_PROF_SM_ACTIVE (fraction of cycles with at least one warp active)
  • DCGM_FI_PROF_SM_OCCUPANCY (ratio of resident warps to max warps)
  • DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE (VRAM pressure)

Cross-reference with vLLM's built-in metrics: vllm:gpu_cache_usage_perc, vllm:request_success_total, and vllm:time_to_first_token_seconds.

Failure Mode 1: Oversized Batch Windows

Symptom: SM utilization above 70%, but TTFT p95 is 3-5x your SLO target. Throughput looks fine in your dashboard. Users are hitting timeouts.

Cause: max_num_seqs is set too high for your TTFT SLO. Under load, vLLM's scheduler keeps admitting requests up to that ceiling, and prefilling 128-256 sequences takes seconds. Each new request queues behind that prefill work before it gets its first token.

DCGM signals: High SM active, moderate MBU, no obvious idle gaps. The compute is genuinely busy. The problem isn't idleness, it's that you're optimizing for throughput at the expense of latency.

Fix: Tune max_num_seqs down until p95 TTFT meets your SLO. Enable chunked prefill (--enable-chunked-prefill) to interleave prefill and decode steps rather than running full-batch prefill serially.

Failure Mode 2: Prefill Blocking Decode

Symptom: Long TTFT spikes that don't correlate with batch size. Spiky SM utilization: high during prefill, low during decode streaming.

Cause: Prefill and decode share the same GPU. When a new long-context request arrives, it monopolizes the GPU for prefill while all in-progress decode requests stall. Users streaming responses see a pause mid-token, then a burst.

DCGM signals: SM util oscillates between 60-90% and 10-20% in cycles of 1-3 seconds. In an Nsight Systems trace, you see long compute kernels (prefill) followed by smaller, fragmented decode kernels with memory-wait gaps between them.

Fix: Prefill-decode disaggregation, covered in Fix 2 below.

Failure Mode 3: KV-Cache Thrash

Symptom: SM utilization looks normal, but vllm:gpu_cache_usage_perc stays above 90%. Request latency climbs steadily over time. The KV-cache optimization guide covers this failure mode in depth, including prefix caching and chunked prefill interactions.

For long-context workloads, see also the NVMe KV cache offloading guide for extending the effective cache size beyond VRAM.

Cause: The KV cache is full. New requests force eviction of cached key-value tensors from earlier requests. Those requests then need to re-compute their KV cache on the next step, wasting compute on work that was already done.

DCGM signals: High MBU with rising latency. DCGM_FI_DEV_FB_USED pegged near the VRAM ceiling. Preemption and eviction counts in vLLM logs and metrics (for example vllm:num_preemptions_total, where your version exposes it) climbing steadily.

Fix: Reduce max context length to match actual request distribution (most requests are short; don't reserve KV cache for 128K tokens if p95 is 8K). Enable prefix caching. Consider NVMe offloading for long-context workloads.

Failure Mode 4: Single-Tenant GPU Pinning

Symptom: SM utilization is low during off-peak hours, high during business hours. One team's workload claims a full H100 that runs at 8% SM utilization during the night shift.

Cause: Reserved capacity model. Each team gets dedicated GPUs that sit mostly idle when their workload isn't running. The GPU is claimed but not used.

DCGM signals: Spiky SM utilization with long idle valleys. DCGM_FI_DEV_FB_USED near zero during idle periods even though the GPU is nominally reserved.

Fix: Multi-tenant packing with MIG or time-slicing (Fix 1), or switching from reserved capacity to on-demand scheduling (Fix 4).
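The four symptom patterns can be turned into a rough first-pass triage. This is a sketch under stated assumptions: the ClusterSnapshot fields and the triage function are hypothetical names, the inputs are values you would pull from DCGM and vLLM yourself, and the thresholds are loose readings of the symptoms described above (the idle thresholds for Failure Mode 4 in particular are guesses to tune).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSnapshot:
    sm_util_mean_pct: float    # mean SM utilization over the window
    sm_util_peak_pct: float    # highest SM utilization in the window
    sm_util_trough_pct: float  # lowest SM utilization in the window
    ttft_p95_s: float          # p95 time to first token
    ttft_slo_s: float          # TTFT target
    kv_cache_usage_pct: float  # vllm:gpu_cache_usage_perc as a percentage
    idle_fb_used_gb: float     # DCGM_FI_DEV_FB_USED during quiet periods, in GB


def triage(s: ClusterSnapshot) -> List[str]:
    """Return the failure modes whose symptoms match the snapshot."""
    findings = []
    ttft_ratio = s.ttft_p95_s / s.ttft_slo_s
    # Mode 1: busy GPU, but TTFT p95 is at least 3x the target.
    if s.sm_util_mean_pct > 70 and ttft_ratio >= 3:
        findings.append("oversized batch window")
    # Mode 2: utilization swings between a high band and a low band with SLO misses.
    if s.sm_util_peak_pct >= 60 and s.sm_util_trough_pct <= 20 and ttft_ratio > 1:
        findings.append("prefill blocking decode")
    # Mode 3: KV cache stays above 90% full.
    if s.kv_cache_usage_pct > 90:
        findings.append("KV-cache thrash")
    # Mode 4: GPU reserved but close to empty when idle (assumed thresholds).
    if s.sm_util_trough_pct <= 10 and s.idle_fb_used_gb < 1:
        findings.append("single-tenant pinning")
    return findings


busy = ClusterSnapshot(85, 90, 75, ttft_p95_s=2.0, ttft_slo_s=0.5,
                       kv_cache_usage_pct=50, idle_fb_used_gb=40)
print(triage(busy))
```

More than one mode can match at once, which is why the function returns a list rather than a single label.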

Fix 1: Pack Workloads with MIG and Time-Slicing

Multi-Instance GPU (MIG) partitions one physical GPU into isolated slices, each with dedicated SM capacity, L2 cache, and VRAM. Packing three 7B models onto a single H100 on Spheron delivers roughly 3x the tokens per dollar of running them on three separate H100 instances at 8% utilization each.

MIG is available on H100, A100, H200, and B200 only. For L40S and RTX PRO 6000, time-slicing is the alternative.

MIG configuration for H100 (7x 1g.10gb slices):

bash
# Enable MIG mode
sudo nvidia-smi -mig 1

# Create 7 smallest slices (each gets ~10GB VRAM and ~14% SM)
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C

# List created instances
nvidia-smi -L

MIG for larger models (2x 3g.40gb):

bash
# Two slices of 40GB VRAM, ~38% SM each
sudo nvidia-smi mig -cgi 3g.40gb,3g.40gb -C

Time-slicing for non-MIG GPUs:

yaml
# NVIDIA GPU Operator ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

For a detailed walkthrough of running multiple models on a single GPU, see the guide to run multiple LLMs on one GPU with MIG and the fractional GPU inference guide for MPS and vGPU alternatives.

Practical packing example: An H100 SXM5 running a single Llama 3 8B instance at 6% SM utilization. With 7x 1g.10gb MIG slices, you run 7 separate INT4-quantized Llama 3 8B instances, each isolated with 10GB VRAM (FP16 weights alone require ~16 GB, so INT4 quantization is required to fit within the 10 GB slice). Total SM utilization: 42-50%. Cost per request: ~7x lower. The GPU is the same. The difference is scheduling.
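The quantization requirement in the packing example follows from simple arithmetic on weight size. Here is a small sketch of that check; the function names are hypothetical, and the 30% headroom reserved for KV cache, activations, and runtime overhead is an assumption to adjust for your workload.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return params_billion * bits_per_weight / 8


def fits_in_slice(params_billion: float, bits_per_weight: int,
                  slice_gb: float, headroom_fraction: float = 0.3) -> bool:
    """True if the weights fit after reserving headroom (assumed 30%) for everything else."""
    usable_gb = slice_gb * (1 - headroom_fraction)
    return weight_footprint_gb(params_billion, bits_per_weight) <= usable_gb


for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_footprint_gb(8, bits)
    verdict = "fits" if fits_in_slice(8, bits, slice_gb=10) else "does not fit"
    print(f"8B {label}: {size:.0f} GB of weights, {verdict} in a 1g.10gb slice")
```

Under these assumptions only the INT4 variant leaves room for a KV cache inside a 10 GB slice, which matches the example above.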

Fix 2: Decouple Prefill and Decode

Prefill and decode have opposite compute profiles. Prefill is compute-bound: it processes the full input prompt in parallel and saturates SM capacity. Decode is memory-bandwidth-bound: it generates one token at a time, reading the full KV cache on each step. Running both on the same GPU means neither runs optimally.

The DistServe paper demonstrated that separating prefill and decode into distinct worker pools improves goodput by 3-8x for latency-sensitive workloads, depending on input length distribution. The prefill-decode disaggregation guide covers deployment patterns in detail.

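vLLM disaggregated prefill (experimental; connector names and options vary by vLLM release, so check the docs for your version):

bash
# Prefill instance (KV producer). A 70B model usually also needs
# --tensor-parallel-size or quantization to fit in GPU memory.
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}' \
  --port 8100

# Decode instance (KV consumer)
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}' \
  --port 8200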

Route requests based on prompt length: requests with prompts above 2,000 tokens go to prefill workers first. Decode workers handle streaming output independently. The result is consistent TTFT regardless of input length, because long-context prefill no longer blocks decode.
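The routing rule can be expressed as a small decision function in front of the two pools. This sketch only encodes the decision, not the proxying or KV transfer; the localhost URLs are assumptions built from the ports in the commands above, and plan_route is a hypothetical helper name.

```python
from typing import List

PREFILL_URL = "http://localhost:8100"  # assumed host; port from the prefill instance
DECODE_URL = "http://localhost:8200"   # assumed host; port from the decode instance
LONG_PROMPT_THRESHOLD_TOKENS = 2000


def plan_route(prompt_tokens: int, threshold: int = LONG_PROMPT_THRESHOLD_TOKENS) -> List[str]:
    """Return the ordered list of pools a request should visit."""
    if prompt_tokens > threshold:
        # Long prompts are prefilled on the dedicated pool, then handed to decode.
        return [PREFILL_URL, DECODE_URL]
    # Short prompts go straight to the decode pool.
    return [DECODE_URL]


print(plan_route(8000))
print(plan_route(300))
```

The threshold is worth load testing rather than fixing at 2,000 tokens, since the point at which prefill starts to stall decode depends on model size and hardware.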

The throughput tradeoff: you need more total GPU instances to achieve the same peak throughput as a monolithic setup. The goodput gain justifies this for latency-sensitive endpoints. For batch workloads where TTFT doesn't matter, monolithic serving is more efficient.

Fix 3: Right-Size Your GPU Choice

The highest-utilization fix is often just picking a smaller GPU. Most teams default to H100 because it's the benchmark-topping flagship. But H100 HBM3 bandwidth (3.35 TB/s) only meaningfully outperforms the L40S's GDDR6 (864 GB/s) when you're running large models at large batch sizes. For a 7B model at batch size 32, you're not close to saturating either.

The cost-per-useful-token argument depends on three factors: model size, batch size, and SLO. The GPU cost-per-token benchmarks give you empirical data across a range of models and batch sizes to make this comparison concrete.

See the L40S for AI inference guide: the L40S trades HBM bandwidth for a lower price and still delivers 80-90% of H100 throughput on 7-34B models at batch sizes under 64.

Right-sizing table:

| Workload | Model size | Recommended SKU | Why |
|---|---|---|---|
| Chat endpoint | 7-13B | L40S or RTX PRO 6000 (Blackwell) | Bandwidth sufficient, cost 2-3x lower |
| Chat endpoint | 30-34B | L40S | 48GB fits INT8-quantized model, batch size < 64 |
| Chat endpoint | 70B | H100 SXM5 | HBM3 bandwidth needed for latency SLO |
| Batch inference | 7-34B | L40S (spot) | Throughput-optimized; 30-34B requires INT8 quantization |
| Batch inference | 70B+ | H100 (spot) | Model size requires HBM |
| Training / fine-tune | Any | H100 or H200 | NVLink and HBM3 matter for gradient sync |
| Frontier MoE (405B+) | 405B+ | H200 or B200 | 141-192GB VRAM required |

L40S vs H100 concrete comparison (Llama 3 8B, batch size 32):

  • L40S throughput: ~2,800 tokens/sec
  • H100 SXM5 throughput: ~3,400 tokens/sec
  • Throughput gap: 18%
  • Price gap: H100 at $3.90/hr, L40S at $1.99/hr = ~1.96x cheaper
  • L40S cost-per-token: ~40% lower
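The cost-per-token figure can be reproduced from the throughput and price numbers in the list above. A minimal sketch follows; the function name is hypothetical, and it assumes the GPU runs at the quoted throughput for the whole billed hour.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


l40s = cost_per_million_tokens(price_per_hour=1.99, tokens_per_second=2800)
h100 = cost_per_million_tokens(price_per_hour=3.90, tokens_per_second=3400)
savings = 1 - l40s / h100

print(f"L40S: ${l40s:.3f} per 1M tokens")
print(f"H100: ${h100:.3f} per 1M tokens")
print(f"L40S is {savings:.0%} cheaper per token")
```

With these inputs the saving works out to about 38%, in line with the rounded figure in the list. Any idle time on either GPU raises its effective cost per token, so the comparison only holds at similar utilization.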

For workloads that fit this profile, L40S on Spheron delivers substantially better economics than defaulting to H100.

Fix 4: Schedule Across the Marketplace

Reserved GPU capacity has a structural goodput ceiling. When you provision for peak traffic, off-peak hours generate zero revenue from those GPUs. A cluster sized for 10,000 requests/hour at 9 AM that idles at 200 requests/hour at 2 AM is paying for 98% GPU idle time during those low-traffic windows. No amount of MIG packing or disaggregation fixes that.

The structural fix is switching from reserved to on-demand scheduling. You rent H100 instances for the duration of the job, pay only for that time, and release the GPUs when the job is done. For latency-sensitive endpoints, keep a minimal on-demand pool. For batch workloads, use spot.

Current pricing from the Spheron marketplace (as of 19 May 2026):

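| GPU | VRAM | On-demand ($/hr) | Spot ($/hr) |
|---|---|---|---|
| L40S | 48 GB | $1.99 | $1.03 |
| RTX PRO 6000 | 96 GB | $1.77 | $1.27 |
| A100 SXM4 | 80 GB | $1.76 | N/A |
| H100 SXM5 | 80 GB | $3.90 | $1.66 |
| H200 SXM5 | 141 GB | $4.62 | $1.92 |
| B200 SXM6 | 192 GB | $7.21 | $3.81 |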

For L40S batch workloads, spot pricing takes about 48% off the on-demand rate ($1.03/hr vs $1.99/hr); across the table, spot discounts run from roughly 28% to 58% where spot is available. For training jobs, see the spot GPU checkpointing guide for making batch pipelines interruption-tolerant.

A practical fleet split: 20% on-demand H100 or H200 for latency-sensitive chat and agent traffic, 80% spot L40S (or A100, where spot capacity is listed) for batch embedding, evals, and fine-tuning. At these prices, that mix cuts GPU spend 40-60% versus an all-reserved H100 fleet at the same throughput.
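The spot discounts and the blended fleet rate can be checked directly against the pricing table. This is a rough sketch with two loud assumptions: it compares price per GPU-hour rather than per unit of throughput, and it uses the H100 on-demand rate as a stand-in for a reserved rate, which the table does not list. The helper names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (on-demand $/hr, spot $/hr or None) from the 19 May 2026 table.
PRICES: Dict[str, Tuple[float, Optional[float]]] = {
    "L40S": (1.99, 1.03),
    "RTX PRO 6000": (1.77, 1.27),
    "A100 SXM4": (1.76, None),
    "H100 SXM5": (3.90, 1.66),
    "H200 SXM5": (4.62, 1.92),
    "B200 SXM6": (7.21, 3.81),
}


def spot_discount(gpu: str) -> Optional[float]:
    """Fractional discount of spot vs on-demand, or None if no spot price is listed."""
    on_demand, spot = PRICES[gpu]
    return None if spot is None else 1 - spot / on_demand


def blended_hourly_rate(on_demand_gpu: str, spot_gpu: str, on_demand_share: float = 0.2) -> float:
    """Average $/GPU-hour for a fleet split between one on-demand SKU and one spot SKU."""
    spot_price = PRICES[spot_gpu][1]
    if spot_price is None:
        raise ValueError(f"no spot price listed for {spot_gpu}")
    return on_demand_share * PRICES[on_demand_gpu][0] + (1 - on_demand_share) * spot_price


blended = blended_hourly_rate("H100 SXM5", "L40S")
savings = 1 - blended / PRICES["H100 SXM5"][0]
print(f"blended rate: ${blended:.2f}/GPU-hr, {savings:.0%} below the H100 on-demand rate")
```

On a per-GPU-hour basis this lands near the top of the 40-60% range. Normalizing for the lower throughput of the spot SKU would pull the saving down, so treat the result as an upper bound.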

Pricing fluctuates based on GPU availability. The prices above are based on 19 May 2026 and may have changed. Check current GPU pricing for live rates.

Goodput Targets by Workload

A reference table for setting SLOs before you start tuning:

| Workload | TTFT target | ITL target | Goodput minimum | Recommended SKU |
|---|---|---|---|---|
| Interactive chat | < 500 ms | < 50 ms | 95% | L40S (7-34B), H100 (70B) |
| Coding agent | < 2,000 ms | < 100 ms | 90% | L40S or H100 |
| RAG retrieval | < 200 ms | < 30 ms | 98% | L40S (embedding), H100 (70B reader) |
| Batch embedding | Unconstrained | Unconstrained | N/A (throughput-maximized) | L40S spot |
| Batch inference | < 60 s/job | Unconstrained | 99% (job completion) | H100 spot |
| Fine-tuning | N/A | N/A | N/A (throughput) | H100 or H200 |
| Frontier MoE (405B+) | < 3,000 ms | < 150 ms | 85% | H200 or B200 |

These are starting points. Your actual SLOs depend on user expectations and contract commitments. The key is that you pick them before you instrument, not after. A goodput number without a defined SLO threshold is just a number.
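One way to make the targets enforceable is to keep them as data and evaluate measured requests against them. The sketch below encodes three rows of the table; the LatencySlo type, the SLOS mapping, and the check_goodput helper are hypothetical names, and the sample measurements are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class LatencySlo:
    ttft_s: float       # time-to-first-token target in seconds
    itl_s: float        # inter-token latency target in seconds
    goodput_min: float  # minimum fraction of requests that must meet both targets


SLOS: Dict[str, LatencySlo] = {
    "interactive_chat": LatencySlo(ttft_s=0.5, itl_s=0.05, goodput_min=0.95),
    "coding_agent": LatencySlo(ttft_s=2.0, itl_s=0.1, goodput_min=0.90),
    "rag_retrieval": LatencySlo(ttft_s=0.2, itl_s=0.03, goodput_min=0.98),
}


def check_goodput(workload: str, samples: Sequence[Tuple[float, float]]) -> Tuple[float, bool]:
    """samples holds (ttft_s, itl_s) per request; returns (attained fraction, target met)."""
    if not samples:
        raise ValueError("need at least one request sample")
    slo = SLOS[workload]
    ok = sum(1 for ttft, itl in samples if ttft < slo.ttft_s and itl < slo.itl_s)
    attained = ok / len(samples)
    return attained, attained >= slo.goodput_min


# Invented measurements: 18 requests inside the chat SLO, 2 with slow first tokens.
chat_samples = [(0.3, 0.03)] * 18 + [(1.2, 0.03)] * 2
print(check_goodput("interactive_chat", chat_samples))
```

A request counts toward goodput only if it meets both the TTFT and the ITL target, which is stricter than tracking either latency alone.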

How to Set SLOs That Drive Scheduler Decisions

An SLO that doesn't affect scheduling decisions is a vanity metric. The goal is to connect your TTFT and ITL targets directly to vLLM configuration parameters so that hitting the SLO is a consequence of correct config, not heroic tuning.

For vLLM, the key parameters are max_num_seqs (batch size ceiling) and max_num_batched_tokens (token budget per scheduler step, which with chunked prefill enabled also bounds how large each prefill chunk can be). A TTFT SLO of 500 ms with a 70B model at p95 prompt length of 1,000 tokens gives you a concrete max_num_seqs budget you can calculate: if prefill takes 200 ms per 1,000 tokens at batch size 1, and prefill time scales roughly linearly with the number of sequences, then max_num_seqs of 2 means 400 ms prefill, leaving 100 ms margin. Scale from there with load testing.
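The budget calculation above can be written down so it is easy to rerun with measured numbers. This is a back-of-envelope sketch, assuming prefill time grows linearly with both prompt length and the number of sequences prefilled together; the function name and parameters are hypothetical, and the result is only a starting point for load testing.

```python
import math


def max_num_seqs_budget(ttft_slo_ms: float, prefill_ms_per_1k_tokens: float,
                        p95_prompt_tokens: int, margin_ms: float = 0.0) -> int:
    """Largest sequence count whose combined prefill fits inside the TTFT budget."""
    per_request_ms = prefill_ms_per_1k_tokens * p95_prompt_tokens / 1000
    budget = math.floor((ttft_slo_ms - margin_ms) / per_request_ms)
    return max(1, budget)  # never go below one sequence


seqs = max_num_seqs_budget(ttft_slo_ms=500, prefill_ms_per_1k_tokens=200, p95_prompt_tokens=1000)
prefill_ms = seqs * 200 * 1000 / 1000
print(f"max_num_seqs budget: {seqs} ({prefill_ms:.0f} ms prefill, {500 - prefill_ms:.0f} ms margin)")
```

Real prefill cost is rarely perfectly linear, so measure prefill time at a few batch sizes before trusting the estimate.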

Set --enable-chunked-prefill and tune --max-num-batched-tokens to cap the per-step compute budget. This interleaves prefill and decode, smoothing TTFT across variable-length inputs without requiring full disaggregation.

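Then add a Prometheus alerting rule that fires when the share of requests meeting your SLO drops below your goodput minimum. Note that vllm:request_success_total counts finished requests regardless of latency, so a success ratio alone does not measure goodput; a closer proxy is the fraction of observations in the vllm:time_to_first_token_seconds histogram that fall in buckets at or below your TTFT target. When the alert fires, you have a specific metric, a specific SLO breach, and specific knobs to turn. That's a tractable operational model. The vLLM production deployment guide goes deeper on tuning these parameters for different traffic patterns.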
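The alert condition reduces to a ratio over cumulative histogram buckets. The sketch below shows that arithmetic on Prometheus-style buckets, where each upper bound maps to the count of observations at or below it; the function name and the bucket counts are hypothetical, and in practice the same ratio would be computed in your alerting rule rather than in application code.

```python
from typing import Mapping


def slo_attainment(cumulative_buckets: Mapping[float, int], slo_s: float) -> float:
    """Fraction of observations at or below the largest bucket bound <= slo_s.

    cumulative_buckets maps each upper bound in seconds to a cumulative count,
    with float("inf") holding the total number of observations.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 0.0
    eligible = [bound for bound in cumulative_buckets if bound <= slo_s]
    if not eligible:
        return 0.0
    return cumulative_buckets[max(eligible)] / total


# Hypothetical TTFT histogram: 900 of 1,000 requests saw a first token within 0.5 s.
ttft_buckets = {0.25: 600, 0.5: 900, 1.0: 980, float("inf"): 1000}
attainment = slo_attainment(ttft_buckets, slo_s=0.5)
print(f"TTFT SLO attainment: {attainment:.0%}")
```

If the SLO does not line up with a bucket boundary, the function rounds down to the nearest lower bound, which makes the estimate conservative.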


Running inference workloads at 5% GPU utilization is a scheduling and sizing problem, not a hardware one. The fixes here, MIG packing, prefill disaggregation, right-sizing to L40S, and on-demand scheduling, all pull in the same direction: more tokens per dollar.


Quick Setup Guide

  1. Install DCGM Exporter and Prometheus

    Deploy the NVIDIA DCGM Exporter as a DaemonSet so every GPU node is covered. Configure Prometheus to scrape DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_SM_OCCUPANCY, and DCGM_FI_DEV_FB_USED. Set a 10-second scrape interval for inference clusters.

  2. Define your goodput SLO

    Set time-to-first-token (TTFT) and inter-token latency (ITL) targets per workload type: chat (TTFT < 500 ms, ITL < 50 ms), coding agent (TTFT < 2 s, ITL < 100 ms), batch (throughput maximized, TTFT unconstrained), embedding (latency < 20 ms at batch size 64). Add a Prometheus recording rule that counts requests meeting SLO vs total and derives a goodput percentage.

  3. Profile for the four failure modes

    Run nvidia-smi dmon -s u or DCGM dashboard queries to check SM utilization vs MBU vs goodput simultaneously. Look for: (1) SM util > 80% but goodput < 50% - oversized batch, (2) SM util spiky with long idle gaps - single-tenant pinning, (3) vllm:gpu_cache_usage_perc near 100% - KV-cache thrash, (4) high TTFT with moderate SM util - prefill blocking decode.

  4. Apply MIG or time-slicing for multi-tenant packing

    On H100/A100 instances from Spheron, enable MIG mode and create 3g.40gb slices for INT8-quantized 30-40B models or 1g.10gb for INT4-quantized 7B models. For non-MIG GPUs (L40S, RTX PRO 6000), configure time-slicing via the NVIDIA GPU Operator. Point multiple vLLM instances at separate MIG devices by setting CUDA_VISIBLE_DEVICES to each MIG device's UUID, as listed by nvidia-smi -L.

  5. Enable prefill-decode disaggregation

    Use vLLM's disaggregated prefill mode or deploy DistServe/LMDeploy with separate prefill and decode worker pools. Route long-context requests (prompt > 2K tokens) to prefill workers, keeping decode workers free for streaming output. This eliminates prefill blocking and is the single highest-impact fix for TTFT SLO violations.

  6. Right-size GPU SKU per workload

    Map each model to the smallest GPU whose throughput meets the SLO. Use Spheron's catalog: RTX PRO 6000 for models up to 13B at < 30 requests/sec, L40S for 13-34B or higher throughput needs, H100 for 70B models or heavy batch, H200/B200 for 405B+ or MoE frontier models. Validate with a 15-minute load test before committing.

  7. Schedule spot instances for batch workloads

    Deploy embedding generation, batch inference, and evals on Spheron spot GPUs. Configure your job scheduler to use on-demand only for latency-sensitive chat and coding-agent traffic. Use checkpointing and retry logic so spot interruptions do not lose work. A mixed fleet of 20% on-demand and 80% spot typically cuts overall GPU spend 40-60% without changing the goodput of critical paths.


Frequently Asked Questions

GPU utilization (SM active %) measures whether the hardware is busy. GPU goodput measures the fraction of that compute that produces output meeting your SLO - requests per second that complete within the target latency window. A cluster can show 80% SM utilization while delivering 20% goodput if batches are oversized, prefill blocks decode, or KV-cache thrash forces re-computation. A useful approximation, inspired by the DistServe paper: goodput = (requests meeting SLO) / (total requests attempted) * total throughput.

Cast AI's 2026 State of Kubernetes Optimization Report measured SM utilization across anonymized enterprise clusters and found the median at 5%. The main causes: single-tenant GPU pinning reserves full H100s for one team or model, oversized batch windows inflate latency and trigger SLO misses, KV-cache thrash from long-context workloads forces expensive re-computation, and over-provisioned reserved capacity sits idle during off-peak hours. None of these show up as 'wasted' in a billing dashboard until you instrument goodput separately.

Start with DCGM_FI_DEV_GPU_UTIL (SM utilization), DCGM_FI_DEV_MEM_COPY_UTIL (memory bandwidth), and DCGM_FI_DEV_FB_USED vs DCGM_FI_DEV_FB_FREE (VRAM pressure). Cross-reference with your inference framework's own metrics: vLLM exposes vllm:gpu_cache_usage_perc, vllm:request_success_total, and vllm:time_to_first_token_seconds. A pattern of high SM util with high cache eviction rate and low request success rate points to KV-cache thrash. High SM util with healthy throughput but high time-to-first-token points to oversized batch windows.

The L40S wins on cost-per-useful-token for inference workloads under 34B parameters at moderate batch sizes, where the H100's HBM bandwidth advantage is less critical and GDDR6 bandwidth is sufficient. At the 19 May 2026 prices listed above, the on-demand rate gap between L40S and H100 SXM5 is roughly 2x ($1.99/hr vs $3.90/hr). For a 7B chat model at batch size 32, L40S throughput is within 15-20% of H100, making its cost per token roughly 40% lower. For 70B models or batch sizes above 64, the H100's HBM3 bandwidth advantage erases that cost edge.

The traditional reserved-capacity model forces teams to provision for peak load, leaving clusters idle during off-peak hours. Spheron's on-demand and spot marketplace means you pay only when running inference. Spot instances can cut per-token cost 40-70% for batch and training workloads. The heterogeneous SKU catalog (RTX PRO 6000, L40S, H100, H200, B200) lets you right-size per workload rather than pinning everything to flagship GPUs - L40S for moderate inference, H100/H200 for large-batch or 70B+ work, B200 for frontier MoE models.
