Engineering

Heterogeneous GPU Inference: Mix GPU Types to Cut Costs by 40% (2026)

Written by Mitrasish, Co-founder · Apr 17, 2026
LLM Inference · GPU Cloud · Inference Optimization · NVIDIA Dynamo · vLLM · H100 · A100 · L40S · Cost Optimization

Most teams running LLM inference default to homogeneous H100 clusters. That is the expensive default.

Why Homogeneous GPU Clusters Waste Money on Inference

LLM inference has two phases with completely different hardware requirements. Running both on the same GPU means you are paying for capabilities you cannot use half the time.

| Phase | Bottleneck | What it needs | What a homogeneous cluster gives it |
|---|---|---|---|
| Prefill | FP8 TFLOPS | H100/B200 | Full H100 capacity (good) |
| Decode | HBM bandwidth (TB/s) | Memory-dense GPU | Also a full H100, most of which sits idle |

During decode, an H100's 1,979 FP8 TFLOPS are almost entirely idle. You are paying for compute you cannot use. The GPU spends its time reading KV cache tensors from memory, not running matrix multiplies.

Here is what that looks like in practice. A 512-token output generation on Llama-3.1-70B spends roughly 80% of wall-clock time in decode. On a pure-H100 cluster, 80% of your H100 budget is doing memory reads, not computation. The arithmetic TFLOPS you paid a premium for are sitting unused.
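The waste is easy to quantify. A back-of-envelope sketch, using the 80% decode share above and the $2.57/hr H100 on-demand price cited later in this article:

```python
# Back-of-envelope: share of a homogeneous H100 budget spent in the
# bandwidth-bound decode phase, where FP8 compute sits mostly idle.
# 0.80 is the decode share for 512-token outputs on Llama-3.1-70B;
# adjust for your own workload profile.
H100_PRICE_PER_HR = 2.57   # on-demand $/hr, 17 Apr 2026 snapshot
DECODE_FRACTION = 0.80     # fraction of wall-clock time in decode

decode_spend_per_gpu = H100_PRICE_PER_HR * DECODE_FRACTION
print(f"${decode_spend_per_gpu:.2f}/hr of every H100 pays for memory "
      f"reads, not the FP8 TFLOPS you bought it for")
```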

If you are new to the prefill-decode split, see our prefill-decode disaggregation guide for the foundational concepts.

Prefill vs Decode: The Hardware Mismatch

The two phases need fundamentally different hardware. Here is how the relevant GPUs compare:

| GPU | Peak TOPS | Memory BW | VRAM | On-demand price/hr |
|---|---|---|---|---|
| H100 SXM5 | 1,979 FP8 | 3.35 TB/s | 80 GB HBM3 | $2.57 |
| B200 SXM6 | ~4,500 FP8 | 8.0 TB/s | 192 GB HBM3e | $7.43 |
| L40S | 733 FP8 | 864 GB/s | 48 GB GDDR6 | $0.72 |
| A100 80G PCIe | 624 INT8† | ~1.94 TB/s | 80 GB HBM2e | $1.04 |

Pricing fluctuates with GPU availability. The prices above are a 17 Apr 2026 snapshot and may have changed. Check current GPU pricing → for live rates.

†A100 (Ampere) does not support FP8. The 624 TOPS figure reflects INT8 TOPS. FP8 support was introduced with the Hopper architecture (H100). The ~1.94 TB/s bandwidth figure is for the PCIe variant; the SXM4 variant reaches 2.0 TB/s.

Prefill is bottlenecked by TFLOPS. H100 SXM5 delivers 1,979 FP8 TFLOPS. L40S delivers 733. You do not want an L40S handling prefill on long prompts: it will be 2.7x slower on that phase.

Decode is bottlenecked by memory bandwidth. L40S has 864 GB/s of GDDR6 bandwidth at $0.72/hr. H100 SXM5 has 3.35 TB/s of HBM3 at $2.57/hr. For decode alone, you are paying roughly 3.6x more per hour for bandwidth you do not need on the H100. The L40S bandwidth is sufficient for decode on models up to 70B at moderate concurrency.
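Those ratios are easy to sanity-check from the table above (prices and bandwidths, 17 Apr 2026 snapshot):

```python
# Decode is bandwidth-bound: compare what you pay per hour against
# the bandwidth you get. If L40S bandwidth already sustains your
# decode throughput, the extra H100 bandwidth is paid-for headroom
# you never touch.
h100_price, h100_bw_tbs = 2.57, 3.35    # $/hr, TB/s
l40s_price, l40s_bw_tbs = 0.72, 0.864

price_ratio = h100_price / l40s_price
bw_ratio = h100_bw_tbs / l40s_bw_tbs
print(f"hourly price ratio: {price_ratio:.1f}x")  # ~3.6x
print(f"bandwidth ratio:    {bw_ratio:.1f}x")
```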

Architecture Pattern: H100/B200 for Prefill, L40S/A100 for Decode

The heterogeneous pattern routes each phase to the GPU that handles it well:

```
Client Request
     |
     v
[Load Balancer / Router (Dynamo)]
     |               |
     v               v
[Prefill Pool]   [Decode Pool]
H100 SXM5 x2    L40S x4 (or A100 x2)
     |               ^
     +--KV Cache---->+
        (NIXL/RDMA)
```

The critical plumbing here is NVIDIA NIXL (NVIDIA Inference Xfer Library). After a prefill worker processes the input prompt and generates the KV cache, NIXL ships that KV cache over InfiniBand or RDMA to a decode worker. The decode worker picks up generation from there without re-running prefill.

The router assigns each incoming request to a prefill worker. That worker processes the prompt, generates the KV cache, and sends it over the network to a decode worker. The decode worker runs token generation to completion. Neither pool ever blocks the other.
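The size of that KV cache payload is worth estimating up front, since it crosses the network on every request. A rough sketch for Llama-3.1-70B (80 layers, 8 GQA KV heads, head dimension 128; an FP16 KV cache is assumed, and FP8 KV caches roughly halve these numbers):

```python
# Rough KV cache size per request: this is the payload NIXL moves
# from the prefill worker to the decode worker.
# Formula: 2 (K and V) x layers x kv_heads x head_dim x bytes/elem.
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
prompt_tokens = 256
transfer_mb = kv_bytes_per_token * prompt_tokens / 2**20
print(f"{kv_bytes_per_token / 1024:.0f} KB per token, "
      f"{transfer_mb:.0f} MB for a {prompt_tokens}-token prompt")
```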

NVIDIA Dynamo 1.0 is the primary open-source framework for this routing layer. See our Dynamo deployment guide for the full setup walkthrough.

Setting Up Heterogeneous Inference with NVIDIA Dynamo and vLLM

Step 1: Provision heterogeneous GPU instances on Spheron

Spheron lets you mix GPU types in a single account without minimum commitments. Provision at least one prefill node (H100 SXM5 or B200) and two decode nodes (L40S or A100 80G).

The pools need low-latency network connectivity to each other. Ideally, deploy them in the same datacenter region to keep KV cache transfer latency low.

Step 2: Install NVIDIA Dynamo and vLLM on each node

```bash
# On all nodes
pip install 'vllm>=0.8.0'
pip install nvidia-dynamo
# Or pull from NVIDIA NGC
docker pull nvcr.io/nvidia/dynamo:latest
```

Step 3: Configure dynamo.yaml for heterogeneous worker pools

```yaml
prefill_workers:
  - host: "prefill-node-1"
    gpu_type: "H100_SXM5"
    gpu_count: 8
    role: prefill
decode_workers:
  - host: "decode-node-1"
    gpu_type: "L40S"
    gpu_count: 4
    role: decode
  - host: "decode-node-2"
    gpu_type: "L40S"
    gpu_count: 4
    role: decode
kv_transfer:
  backend: nixl
  protocol: rdma  # or tcp for non-InfiniBand setups
model: meta-llama/Llama-3.1-70B-Instruct
```

Step 4: Start prefill workers

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode prefill \
  --tensor-parallel-size 8 \
  --dtype fp8
```

Step 5: Start decode workers

The --dtype flag depends on your decode GPU. FP8 is supported on Hopper (H100, H200), Ada Lovelace (L40S), and Blackwell (B200), but not on Ampere (A100). Use bfloat16 for A100 nodes instead.

If your decode nodes are L40S, H100, H200, or B200 (FP8-capable):

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --tensor-parallel-size 4 \
  --dtype fp8
```

If your decode nodes are A100 (Ampere, no FP8 support):

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --tensor-parallel-size 4 \
  --dtype bfloat16
```

Step 6: Start the Dynamo router

```bash
dynamo serve --config dynamo.yaml --port 8000
```
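Once the router is up, you can smoke-test it with a minimal client. The sketch below assumes Dynamo exposes an OpenAI-compatible /v1/chat/completions endpoint on the port configured above; verify the exact path against your Dynamo version's docs.

```python
# Minimal smoke test against the Dynamo router. The endpoint path is
# an assumption (OpenAI-compatible frontend); the port matches the
# `dynamo serve` command above.
import json
import urllib.request

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the router is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```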

For smaller deployments without Dynamo, vLLM v0.8+ supports disaggregated prefilling natively via NixlConnector. You can run prefill and decode as separate vLLM instances that transfer KV cache over the network. This works well for 1-2 node setups but lacks Dynamo's load balancing and worker lifecycle management. See the vLLM production deployment guide for the vLLM-only setup path.

Real-World Cost Savings: Before and After

Workload: Llama-3.1-70B-Instruct, 1,000 req/hr, 256-token average input, 512-token average output.

| Configuration | GPUs | Cost/hr | Throughput (tok/s) | Cost per 1M tokens |
|---|---|---|---|---|
| Homogeneous H100 SXM5 x4 | 4x H100 | $10.28 | ~3,200 | ~$0.89 |
| Heterogeneous: H100 x1 prefill + L40S x2 decode | 1x H100, 2x L40S | $4.01 | ~2,400 | ~$0.46 |
| Heterogeneous: H100 x2 prefill + A100 x4 decode | 2x H100, 4x A100 | $9.30 | ~4,800 | ~$0.54 |

The first heterogeneous config costs 61% less per hour than 4x H100 ($4.01 vs $10.28). It also runs at 25% lower throughput (2,400 vs 3,200 tok/s), so this is not an equivalent-throughput comparison.

On a cost-per-token basis, the reduction is roughly 48% ($0.46 vs $0.89 per 1M tokens). That is the right metric for comparing these configurations on equivalent work.

If your workload can run at 2,400 tok/s, the first config is the right call. If you need to match the 3,200 tok/s baseline, the second config gets you there at nearly the same hourly cost but with 39% better cost-per-token, because the A100 decode nodes are better utilized than H100s doing decode work.

Note that throughput numbers are representative. Actual results vary by model size, prompt/output ratio, batch size, and network latency between prefill and decode nodes.
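You can reproduce the cost-per-token column from the per-GPU hourly prices and the representative throughputs above:

```python
# Recompute the table's cost-per-1M-tokens figures from hourly GPU
# prices (17 Apr 2026 snapshot) and representative throughputs.
def cost_per_million_tokens(cost_per_hr, tokens_per_sec):
    tokens_per_hr = tokens_per_sec * 3600
    return cost_per_hr / tokens_per_hr * 1_000_000

homogeneous = cost_per_million_tokens(4 * 2.57, 3200)           # 4x H100
hetero_a = cost_per_million_tokens(1 * 2.57 + 2 * 0.72, 2400)   # 1x H100 + 2x L40S
hetero_b = cost_per_million_tokens(2 * 2.57 + 4 * 1.04, 4800)   # 2x H100 + 4x A100

print(f"4x H100:           ${homogeneous:.2f}/M tok")
print(f"1x H100 + 2x L40S: ${hetero_a:.2f}/M tok")
print(f"2x H100 + 4x A100: ${hetero_b:.2f}/M tok")
```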


GPU Pairing Guide: Which Combinations Work Best

| Model Size | Prefill GPU | Decode GPU | Notes |
|---|---|---|---|
| 7B-13B | H100 SXM5 (x1) | L40S (x1-2) | L40S 48GB fits the model and handles decode bandwidth |
| 30B-70B | H100 SXM5 (x2-4) | L40S (x4) or A100 80G (x2) | 70B needs tensor parallelism on prefill |
| 70B-180B | B200 (x2) | A100 80G (x4) | B200 dominates prefill at this scale |
| 405B+ | B200 (x4-8) | A100 80G (x8) or H200 (x4) | Large decode VRAM requirement |

One case where you should choose H200 for decode instead of L40S or A100: models above 70B where KV cache size is the binding constraint. The H200 has 141GB of HBM3e. The L40S has 48GB. At long context lengths (32K+ tokens), KV cache alone can exceed what an L40S can hold per decode worker, and you would need to shard across more L40S nodes to compensate. For 70B models at normal context lengths, L40S or A100 are the better cost choices.
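To see why context length is the deciding factor, estimate KV cache growth for Llama-3.1-70B (80 layers, 8 GQA KV heads, head dimension 128; FP16 KV cache assumed, so quantized KV caches roughly halve these figures):

```python
# KV cache per sequence grows linearly with context length; at long
# contexts it, not model weights, becomes the binding VRAM constraint.
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2   # K and V, ~320 KB/token

for context_tokens in (8_192, 32_768, 131_072):
    gib = kv_bytes_per_token * context_tokens / 2**30
    print(f"{context_tokens:>7} tokens -> {gib:5.1f} GiB KV cache per sequence")
```

At 32K context a single sequence already consumes 10 GiB, which is why the 141 GB H200 leaves far more batching headroom per decode worker than a 48 GB L40S.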

See Best GPU for AI Inference 2026 for full GPU specs and a workload-based decision guide.

How to Deploy Heterogeneous GPU Clusters on Spheron

Most hyperscalers offer GPU families only in uniform clusters. Mixing H100 and L40S on the same account, connected on the same network fabric, requires separate instance setups and manual networking on AWS or GCP. Spheron is designed differently: you can mix GPU types in a single deployment without locked-in instance families or minimum cluster sizes.

The workflow: browse the GPU rental catalog, pick H100 instances for your prefill nodes and L40S or A100 instances for your decode nodes, deploy via the dashboard or API, then connect via SSH. All instances in the same region share the same datacenter network fabric for low-latency NIXL KV cache transfers.

For multi-node networking setup, see the Spheron documentation at docs.spheron.ai.

Spheron's per-minute billing means you can scale decode capacity up during peak hours and back down during off-peak. That is not possible with reserved hyperscaler instances, where you pay for the cluster size you committed to regardless of load.

For further cost reduction on decode nodes, spot GPU instances are viable here. Decode worker preemption is handled by the Dynamo router: if a spot decode node gets reclaimed, Dynamo redistributes its pending work to other decode workers. Spot pricing on decode nodes can cut their cost by another 50-70%.

Key Tradeoffs to Watch

Network latency matters. KV cache transfer between prefill and decode nodes adds latency. On InfiniBand at 200 Gb/s or higher, this is typically a few milliseconds or less per request. On commodity Ethernet, it can add 5-20 ms. For interactive use cases where time-to-first-token is critical, verify your inter-node bandwidth before committing to a heterogeneous setup.
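A quick estimator makes this concrete. The sketch assumes an 80 MB KV cache payload (representative of a 256-token Llama-3.1-70B prompt in FP16) and ignores protocol overhead and compute overlap, so treat the results as rough lower bounds:

```python
# Lower-bound KV cache transfer time over a single link:
# payload megabits divided by link rate (1 Gb/s moves 1 Mb per ms).
def kv_transfer_ms(payload_mb: float, link_gbps: float) -> float:
    return payload_mb * 8 / link_gbps

for link_gbps in (25, 100, 200, 400):
    print(f"{link_gbps:>3} Gb/s: {kv_transfer_ms(80, link_gbps):5.1f} ms")
```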

Prefill-to-decode ratio determines savings. If your workload produces very short outputs (under 64 tokens), decode is fast on H100 anyway. The savings from cheaper decode GPUs shrink with shorter outputs. Heterogeneous setups are most beneficial for workloads with output length above 256 tokens.

Model sharding complexity. Models above 70B require tensor parallelism across multiple GPUs on both prefill and decode sides. This adds configuration complexity: you need to define tensor parallel groups within each pool, not just across the pool. Start with a 7B-13B model to validate your heterogeneous setup before scaling to larger models.

Spot preemption on decode nodes. Use Dynamo's built-in failover, not custom retry logic. Dynamo tracks KV cache locations and can reroute generation to another decode worker when a node goes down. Custom retry logic at the application layer will not have visibility into where the KV cache lives.


Spheron lets you mix H100, L40S, A100, and B200 instances in a single account with per-minute billing. There are no locked-in instance families or minimum cluster sizes. Start a heterogeneous inference cluster now or view live GPU pricing →.

Get started on Spheron →
