Most teams running LLM inference default to homogeneous H100 clusters. That is the expensive default.
Why Homogeneous GPU Clusters Waste Money on Inference
LLM inference has two phases with completely different hardware requirements. Running both on the same GPU means you are paying for capabilities you cannot use half the time.
| Phase | Bottleneck | What it needs | What a homogeneous cluster gives it |
|---|---|---|---|
| Prefill | FP8 TFLOPS | H100/B200 | Full H100 capacity (good) |
| Decode | HBM bandwidth (TB/s) | Memory-dense GPU | Also a full H100, most of which sits idle |
During decode, an H100's 1,979 FP8 TFLOPS are almost entirely idle. You are paying for compute you cannot use. The GPU spends its time reading KV cache tensors from memory, not running matrix multiplies.
Here is what that looks like in practice. A 512-token output generation on Llama-3.1-70B spends roughly 80% of wall-clock time in decode. On a pure-H100 cluster, 80% of your H100 budget is doing memory reads, not computation. The arithmetic TFLOPS you paid a premium for are sitting unused.
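A back-of-envelope roofline model makes the split concrete. The sketch below is illustrative only: the derating factors (40% compute utilization for prefill, 70% bandwidth efficiency for decode) are assumptions, and it models batch size 1.

```python
# Back-of-envelope roofline model of one request on a single H100-class
# node. Derating factors (40% MFU, 70% bandwidth efficiency) are
# illustrative assumptions; batch size is 1.

def phase_times(params_b=70, prompt_toks=256, output_toks=512,
                tflops=1979 * 0.40,     # H100 FP8 peak, derated
                bw_tbs=3.35 * 0.70):    # HBM3 peak, derated
    # Prefill is compute-bound: ~2 * params * prompt_tokens FLOPs.
    prefill_s = 2 * params_b * 1e9 * prompt_toks / (tflops * 1e12)
    # Decode is bandwidth-bound: each generated token re-reads the FP8
    # weights (~1 byte/param) from HBM.
    decode_s = output_toks * params_b / (bw_tbs * 1000)
    return prefill_s, decode_s

p, d = phase_times()
print(f"prefill {p*1000:.0f} ms, decode {d:.1f} s, decode share {d/(p+d):.1%}")
```

At batch size 1 decode dominates almost completely; production serving batches decode, which amortizes the weight reads across concurrent sequences and brings the share down toward the ~80% figure above.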
If you are new to the prefill-decode split, see our prefill-decode disaggregation guide for the foundational concepts.
Prefill vs Decode: The Hardware Mismatch
The two phases need fundamentally different hardware. Here is how the relevant GPUs compare:
| GPU | Peak TFLOPS / TOPS | Memory BW | VRAM | On-demand price/hr |
|---|---|---|---|---|
| H100 SXM5 | 1,979 FP8 | 3.35 TB/s | 80 GB HBM3 | $2.57 |
| B200 SXM6 | ~4,500 FP8 | 8.0 TB/s | 192 GB HBM3e | $7.43 |
| L40S | 733 FP8 | 864 GB/s | 48 GB GDDR6 | $0.72 |
| A100 80G PCIe | 624 INT8† | ~1.94 TB/s | 80 GB HBM2e | $1.04 |
Pricing fluctuates based on GPU availability. The prices above are based on 17 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
†A100 (Ampere) does not support FP8. The 624 TOPS figure reflects INT8 TOPS. FP8 support was introduced with the Hopper architecture (H100). The ~1.94 TB/s bandwidth figure is for the PCIe variant; the SXM4 variant reaches 2.0 TB/s.
Prefill is bottlenecked by TFLOPS. H100 SXM5 delivers 1,979 FP8 TFLOPS. L40S delivers 733. You do not want an L40S handling prefill on long prompts: it will be 2.7x slower on that phase.
Decode is bottlenecked by memory bandwidth. L40S has 864 GB/s of GDDR6 bandwidth at $0.72/hr. H100 SXM5 has 3.35 TB/s of HBM3 at $2.57/hr. For decode alone, you are paying roughly 3.6x more per hour for bandwidth you do not need on the H100. The L40S bandwidth is sufficient for decode on models up to 70B at moderate concurrency.
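One way to sanity-check "sufficient": the bandwidth roofline puts a floor on per-token decode latency, since each generated token must stream the GPU's weight shard from memory at least once. The sketch below assumes FP8 weights (~1 byte per parameter), ignores KV-cache reads, and uses a hypothetical 50 ms/token interactive target (~20 tok/s per stream).

```python
# Bandwidth-roofline floor on per-token decode latency. Assumes FP8
# weights (~1 byte/param), ignores KV-cache reads; the SLO is a
# hypothetical interactive target.

def decode_floor_ms(params_b, n_gpus, bw_gb_s):
    shard_gb = params_b / n_gpus        # tensor-parallel weight shard per GPU
    return shard_gb / bw_gb_s * 1000    # ms per generated token

slo_ms = 50
h100_ms = decode_floor_ms(70, n_gpus=1, bw_gb_s=3350)   # 1x H100, $2.57/hr
l40s_ms = decode_floor_ms(70, n_gpus=2, bw_gb_s=864)    # 2x L40S, $1.44/hr

print(f"H100 x1: {h100_ms:.0f} ms/token, "
      f"L40S x2: {l40s_ms:.0f} ms/token (SLO {slo_ms} ms)")
```

Both configurations clear the hypothetical SLO, and the L40S pair does it at 56% of the hourly price; the H100's bandwidth headroom only pays off at concurrency high enough to actually saturate it.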
Architecture Pattern: H100/B200 for Prefill, L40S/A100 for Decode
The heterogeneous pattern routes each phase to the GPU that handles it well:
```
Client Request
      |
      v
[Load Balancer / Router (Dynamo)]
    |                  |
    v                  v
[Prefill Pool]    [Decode Pool]
 H100 SXM5 x2      L40S x4 (or A100 x2)
    |                  ^
    +----KV Cache----->+
       (NIXL/RDMA)
```

The critical plumbing here is NVIDIA NIXL (NVIDIA Inference Xfer Library). After a prefill worker processes the input prompt and generates the KV cache, NIXL ships that KV cache over InfiniBand or RDMA to a decode worker. The decode worker picks up generation from there without re-running prefill.
The router assigns each incoming request to a prefill worker. That worker processes the prompt, generates the KV cache, and sends it over the network to a decode worker. The decode worker runs token generation to completion. Neither pool ever blocks the other.
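To get a feel for the transfer cost, here is a back-of-envelope sketch assuming Llama-3.1-70B's published shape (80 layers, 8 KV heads via GQA, head dimension 128) and FP16 KV cache; the fabric speeds are illustrative.

```python
# How much KV cache moves per request, assuming Llama-3.1-70B's shape
# (80 layers, 8 KV heads via GQA, head dim 128) with FP16 KV cache.

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes   # 2 = keys + values

prompt_toks = 256
mb = kv_bytes_per_token() * prompt_toks / 1e6
for fabric, gb_per_s in [("InfiniBand 200 Gb/s", 25.0), ("25 GbE", 3.125)]:
    ms = (mb / 1000) / gb_per_s * 1000
    print(f"{mb:.0f} MB over {fabric}: {ms:.1f} ms")
```

At these sizes the handoff is a few milliseconds on InfiniBand and tens of milliseconds on commodity Ethernet, which is why inter-node bandwidth matters for latency-sensitive workloads.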
NVIDIA Dynamo 1.0 is the primary open-source framework for this routing layer. See our Dynamo deployment guide for the full setup walkthrough.
Setting Up Heterogeneous Inference with NVIDIA Dynamo and vLLM
Step 1: Provision heterogeneous GPU instances on Spheron
Spheron lets you mix GPU types in a single account without minimum commitments. Provision at least one prefill node (H100 SXM5 or B200) and two decode nodes (L40S or A100 80G).
- Rent H100 → for prefill nodes
- Rent L40S → or Rent A100 → for decode nodes
The prefill and decode nodes must be able to reach each other over the network. Ideally, deploy them in the same datacenter region to keep KV cache transfer latency low.
Step 2: Install NVIDIA Dynamo and vLLM on each node
```bash
# On all nodes
pip install 'vllm>=0.8.0'
pip install nvidia-dynamo

# Or pull from NVIDIA NGC
docker pull nvcr.io/nvidia/dynamo:latest
```

Step 3: Configure dynamo.yaml for heterogeneous worker pools
```yaml
prefill_workers:
  - host: "prefill-node-1"
    gpu_type: "H100_SXM5"
    gpu_count: 8
    role: prefill
decode_workers:
  - host: "decode-node-1"
    gpu_type: "L40S"
    gpu_count: 4
    role: decode
  - host: "decode-node-2"
    gpu_type: "L40S"
    gpu_count: 4
    role: decode
kv_transfer:
  backend: nixl
  protocol: rdma  # or tcp for non-InfiniBand setups
model: meta-llama/Llama-3.1-70B-Instruct
```

Step 4: Start prefill workers
```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode prefill \
  --tensor-parallel-size 8 \
  --dtype fp8
```

Step 5: Start decode workers
The --dtype flag depends on your decode GPU. FP8 is supported on Hopper (H100, H200), Ada Lovelace (L40S), and Blackwell (B200), but not on Ampere (A100). Use bfloat16 for A100 nodes instead.
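If you script your launches, the dtype choice can be derived from the CUDA compute capability (Ampere is SM 8.0, Ada Lovelace 8.9, Hopper 9.0, Blackwell 10.0). A minimal helper sketch, with the SM 8.9 threshold as the working assumption:

```python
# Pick the decode --dtype from CUDA compute capability. FP8 tensor cores
# arrived with Ada (SM 8.9); Ampere (SM 8.0) falls back to bfloat16.

def decode_dtype(cc: tuple) -> str:
    return "fp8" if cc >= (8, 9) else "bfloat16"

assert decode_dtype((8, 0)) == "bfloat16"   # A100 (Ampere)
assert decode_dtype((8, 9)) == "fp8"        # L40S (Ada Lovelace)
assert decode_dtype((9, 0)) == "fp8"        # H100 (Hopper)
assert decode_dtype((10, 0)) == "fp8"       # B200 (Blackwell)
```

On a live node you would feed it the tuple from torch.cuda.get_device_capability().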
If your decode nodes are L40S, H100, H200, or B200 (FP8-capable):

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --tensor-parallel-size 4 \
  --dtype fp8
```

If your decode nodes are A100 (Ampere, no FP8 support):
```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --tensor-parallel-size 4 \
  --dtype bfloat16
```

Step 6: Start the Dynamo router

```bash
dynamo serve --config dynamo.yaml --port 8000
```

For smaller deployments without Dynamo, vLLM v0.8+ supports disaggregated prefilling natively via NixlConnector. You can run prefill and decode as separate vLLM instances that transfer KV cache over the network. This works well for 1-2 node setups but lacks Dynamo's load balancing and worker lifecycle management. See the vLLM production deployment guide for the vLLM-only setup path.
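As a sketch of that vLLM-only path: one vLLM instance per phase, with the KV handoff configured via --kv-transfer-config. The connector and role names below follow recent vLLM releases; verify them against the docs for your installed version.

```bash
# Sketch: vLLM-only disaggregated prefill, no Dynamo. Connector/role
# names ("NixlConnector", "kv_producer"/"kv_consumer") are taken from
# recent vLLM releases -- confirm against your version's docs.

# Prefill instance (H100 node)
vllm serve meta-llama/Llama-3.1-70B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode instance (L40S or A100 node)
vllm serve meta-llama/Llama-3.1-70B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```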
Real-World Cost Savings: Before and After
Workload: Llama-3.1-70B-Instruct, 1,000 req/hr, 256-token average input, 512-token average output.
| Configuration | GPUs | Cost/hr | Throughput (tok/s) | Cost per 1M tokens |
|---|---|---|---|---|
| Homogeneous H100 SXM5 x4 | 4x H100 | $10.28 | ~3,200 | ~$0.89 |
| Heterogeneous: H100 x1 prefill + L40S x2 decode | 1x H100, 2x L40S | $4.01 | ~2,400 | ~$0.46 |
| Heterogeneous: H100 x2 prefill + A100 x4 decode | 2x H100, 4x A100 | $9.30 | ~4,800 | ~$0.54 |
The first heterogeneous config costs 61% less per hour than 4x H100 ($4.01 vs $10.28). It also runs at 25% lower throughput (2,400 vs 3,200 tok/s), so this is not an equivalent-throughput comparison.
On a cost-per-token basis, the reduction is roughly 48% ($0.46 vs $0.89 per 1M tokens). That is the right metric for comparing these configurations on equivalent work.
If your workload can run at 2,400 tok/s, the first config is the right call. If you need to match the 3,200 tok/s baseline, the second config gets you there at nearly the same hourly cost but with 39% better cost-per-token, because the A100 decode nodes are better utilized than H100s doing decode work.
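The cost-per-token column is plain arithmetic, reproduced here so you can plug in your own prices and throughput numbers:

```python
# Cost per 1M tokens = hourly cost / tokens generated per hour, scaled.

def cost_per_m_tokens(usd_per_hr, tok_per_s):
    return usd_per_hr / (tok_per_s * 3600) * 1e6

for name, usd, tps in [
    ("4x H100",            10.28, 3200),
    ("1x H100 + 2x L40S",   4.01, 2400),
    ("2x H100 + 4x A100",   9.30, 4800),
]:
    print(f"{name}: ${cost_per_m_tokens(usd, tps):.2f} per 1M tokens")
```

These reproduce the table's ~$0.89, ~$0.46, and ~$0.54 figures.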
Note that throughput numbers are representative. Actual results vary by model size, prompt/output ratio, batch size, and network latency between prefill and decode nodes.
Pricing fluctuates based on GPU availability. The prices above are based on 17 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
GPU Pairing Guide: Which Combinations Work Best
| Model Size | Prefill GPU | Decode GPU | Notes |
|---|---|---|---|
| 7B-13B | H100 SXM5 (x1) | L40S (x1-2) | L40S 48GB fits the model and handles decode bandwidth |
| 30B-70B | H100 SXM5 (x2-4) | L40S (x4) or A100 80G (x2) | 70B needs tensor parallelism on prefill |
| 70B-180B | B200 (x2) | A100 80G (x4) | B200 dominates prefill at this scale |
| 405B+ | B200 (x4-8) | A100 80G (x8) or H200 (x4) | Large decode VRAM requirement |
One case where you should choose H200 for decode instead of L40S or A100: models above 70B where KV cache size is the binding constraint. The H200 has 141GB of HBM3e. The L40S has 48GB. At long context lengths (32K+ tokens), KV cache alone can exceed what an L40S can hold per decode worker, and you would need to shard across more L40S nodes to compensate. For 70B models at normal context lengths, L40S or A100 are the better cost choices.
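A quick sizing check shows why. The sketch below assumes a 70B-class KV layout (80 layers, 8 KV heads, head dim 128) with FP16 KV cache; the batch size is illustrative.

```python
# When KV cache becomes the binding constraint. Assumes a 70B-class
# layout: 80 layers, 8 KV heads (GQA), head dim 128, FP16 KV cache.

KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2   # keys + values, 2 bytes each

def kv_gb(context_len, batch):
    return KV_BYTES_PER_TOKEN * context_len * batch / 1e9

print(f"32K ctx, batch 8: {kv_gb(32768, 8):.0f} GB of KV cache")
# vs 48 GB total on an L40S (minus the weight shard), 141 GB on an H200.
```

At 32K context even a modest batch overwhelms a single L40S's 48 GB before weights are accounted for, which is the regime where H200 decode nodes earn their price.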
See Best GPU for AI Inference 2026 for full GPU specs and a workload-based decision guide.
How to Deploy Heterogeneous GPU Clusters on Spheron
Most hyperscalers offer GPU families only in uniform clusters. On AWS or GCP, putting H100 and L40S nodes on the same network fabric means juggling separate instance families and manual networking setup. Spheron is designed differently: you can mix GPU types in a single deployment, with no locked-in instance families or minimum cluster sizes.
The workflow: browse the GPU rental catalog, pick H100 instances for your prefill nodes and L40S or A100 instances for your decode nodes, deploy via the dashboard or API, then connect via SSH. All instances on the same region share the same datacenter network fabric for low-latency NIXL KV cache transfers.
For multi-node networking setup, see the Spheron documentation at docs.spheron.ai.
Spheron's per-minute billing means you can scale decode capacity up during peak hours and back down during off-peak. That is not possible with reserved hyperscaler instances, where you pay for the cluster size you committed to regardless of load.
For further cost reduction on decode nodes, spot GPU instances are viable here. Decode worker preemption is handled by the Dynamo router: if a spot decode node gets reclaimed, Dynamo redistributes its pending work to other decode workers. Spot pricing on decode nodes can cut their cost by another 50-70%.
Key Tradeoffs to Watch
Network latency matters. KV cache transfer between prefill and decode nodes adds latency. On InfiniBand at 200Gb/s or higher, this is sub-millisecond for most models. On commodity Ethernet, it can add 5-20ms per request. For interactive use cases where time-to-first-token is critical, verify your inter-node bandwidth before committing to a heterogeneous setup.
Prefill-to-decode ratio determines savings. If your workload produces very short outputs (under 64 tokens), decode is fast on H100 anyway. The savings from cheaper decode GPUs shrink with shorter outputs. Heterogeneous setups are most beneficial for workloads with output length above 256 tokens.
Model sharding complexity. Models above 70B require tensor parallelism across multiple GPUs on both prefill and decode sides. This adds configuration complexity: you need to define tensor parallel groups within each pool, not just across the pool. Start with a 7B-13B model to validate your heterogeneous setup before scaling to larger models.
Spot preemption on decode nodes. Use Dynamo's built-in failover, not custom retry logic. Dynamo tracks KV cache locations and can reroute generation to another decode worker when a node goes down. Custom retry logic at the application layer will not have visibility into where the KV cache lives.
Spheron lets you mix H100, L40S, A100, and B200 instances in a single account with per-minute billing. There are no locked-in instance families or minimum cluster sizes. Start a heterogeneous inference cluster now or view live GPU pricing →.
