Most teams running LLM inference default to homogeneous H100 clusters. That is the expensive default.
Why Homogeneous GPU Clusters Waste Money on Inference
LLM inference has two phases with completely different hardware requirements. Running both on the same GPU means you are paying for capabilities you cannot use half the time.
| Phase | Bottleneck | What it needs | What a homogeneous cluster gives it |
|---|---|---|---|
| Prefill | FP8 TFLOPS | H100/B200 | Full H100 capacity (good) |
| Decode | HBM bandwidth (TB/s) | Memory-dense GPU | Also a full H100, most of which sits idle |
During decode, an H100's 1,979 FP8 TFLOPS are almost entirely idle. You are paying for compute you cannot use. The GPU spends its time reading KV cache tensors from memory, not running matrix multiplies.
Here is what that looks like in practice. A 512-token output generation on Llama-3.1-70B spends roughly 80% of wall-clock time in decode. On a pure-H100 cluster, 80% of your H100 budget is doing memory reads, not computation. The arithmetic TFLOPS you paid a premium for are sitting unused.
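A back-of-envelope roofline model makes the split concrete. The sketch below is illustrative only: the derating factors (40% compute utilization for prefill, 70% bandwidth efficiency for decode) are assumptions, and it models batch size 1.

```python
# Back-of-envelope roofline model of one request on a single H100-class
# node. Derating factors (40% MFU, 70% bandwidth efficiency) are
# illustrative assumptions; batch size is 1.

def phase_times(params_b=70, prompt_toks=256, output_toks=512,
                tflops=1979 * 0.40,     # H100 FP8 peak, derated
                bw_tbs=3.35 * 0.70):    # HBM3 peak, derated
    # Prefill is compute-bound: ~2 * params * prompt_tokens FLOPs.
    prefill_s = 2 * params_b * 1e9 * prompt_toks / (tflops * 1e12)
    # Decode is bandwidth-bound: each generated token re-reads the FP8
    # weights (~1 byte/param) from HBM.
    decode_s = output_toks * params_b / (bw_tbs * 1000)
    return prefill_s, decode_s

p, d = phase_times()
print(f"prefill {p*1000:.0f} ms, decode {d:.1f} s, decode share {d/(p+d):.1%}")
```

At batch size 1 decode dominates almost completely; production serving batches decode, which amortizes the weight reads across concurrent sequences and brings the share down toward the ~80% figure above.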
If you are new to the prefill-decode split, see our prefill-decode disaggregation guide for the foundational concepts.
Prefill vs Decode: The Hardware Mismatch
The two phases need fundamentally different hardware. Here is how the relevant GPUs compare:
| GPU | Peak TFLOPS / TOPS | Memory BW | VRAM | On-demand price/hr |
|---|---|---|---|---|
| H100 SXM5 | 1,979 FP8 | 3.35 TB/s | 80 GB HBM3 | $2.57 |
| B200 SXM6 | ~4,500 FP8 | 8.0 TB/s | 192 GB HBM3e | $7.43 |
| L40S | 733 FP8 | 864 GB/s | 48 GB GDDR6 | $0.72 |
| A100 80G PCIe | 624 INT8† | ~1.94 TB/s | 80 GB HBM2e | $1.04 |
Pricing fluctuates based on GPU availability. The prices above are based on 17 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
†A100 (Ampere) does not support FP8. The 624 TOPS figure reflects INT8 TOPS. FP8 support was introduced with the Hopper architecture (H100). The ~1.94 TB/s bandwidth figure is for the PCIe variant; the SXM4 variant reaches 2.0 TB/s.
Prefill is bottlenecked by TFLOPS. H100 SXM5 delivers 1,979 FP8 TFLOPS. L40S delivers 733. You do not want an L40S handling prefill on long prompts: it will be 2.7x slower on that phase.
Decode is bottlenecked by memory bandwidth. L40S has 864 GB/s of GDDR6 bandwidth at $0.72/hr. H100 SXM5 has 3.35 TB/s of HBM3 at $2.57/hr. For decode alone, you are paying roughly 3.6x more per hour for bandwidth you do not need on the H100. The L40S bandwidth is sufficient for decode on models up to 70B at moderate concurrency.
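One way to sanity-check "sufficient": the bandwidth roofline puts a floor on per-token decode latency, since each generated token must stream the GPU's weight shard from memory at least once. The sketch below assumes FP8 weights (~1 byte per parameter), ignores KV-cache reads, and uses a hypothetical 50 ms/token interactive target (~20 tok/s per stream).

```python
# Bandwidth-roofline floor on per-token decode latency. Assumes FP8
# weights (~1 byte/param), ignores KV-cache reads; the SLO is a
# hypothetical interactive target.

def decode_floor_ms(params_b, n_gpus, bw_gb_s):
    shard_gb = params_b / n_gpus        # tensor-parallel weight shard per GPU
    return shard_gb / bw_gb_s * 1000    # ms per generated token

slo_ms = 50
h100_ms = decode_floor_ms(70, n_gpus=1, bw_gb_s=3350)   # 1x H100, $2.57/hr
l40s_ms = decode_floor_ms(70, n_gpus=2, bw_gb_s=864)    # 2x L40S, $1.44/hr

print(f"H100 x1: {h100_ms:.0f} ms/token, "
      f"L40S x2: {l40s_ms:.0f} ms/token (SLO {slo_ms} ms)")
```

Both configurations clear the hypothetical SLO, and the L40S pair does it at 56% of the hourly price; the H100's bandwidth headroom only pays off at concurrency high enough to actually saturate it.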
Architecture Pattern: H100/B200 for Prefill, L40S/A100 for Decode
The heterogeneous pattern routes each phase to the GPU that handles it well:
```
Client Request
      |
      v
[Load Balancer / Router (Dynamo)]
    |                  |
    v                  v
[Prefill Pool]    [Decode Pool]
 H100 SXM5 x2      L40S x4 (or A100 x2)
    |                  ^
    +----KV Cache----->+
       (NIXL/RDMA)
```

The critical plumbing here is NVIDIA NIXL (NVIDIA Inference Xfer Library). After a prefill worker processes the input prompt and generates the KV cache, NIXL ships that KV cache over InfiniBand or RDMA to a decode worker. The decode worker picks up generation from there without re-running prefill.
The router assigns each incoming request to a prefill worker. That worker processes the prompt, generates the KV cache, and sends it over the network to a decode worker. The decode worker runs token generation to completion. Neither pool ever blocks the other.
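To get a feel for the transfer cost, here is a back-of-envelope sketch assuming Llama-3.1-70B's published shape (80 layers, 8 KV heads via GQA, head dimension 128) and FP16 KV cache; the fabric speeds are illustrative.

```python
# How much KV cache moves per request, assuming Llama-3.1-70B's shape
# (80 layers, 8 KV heads via GQA, head dim 128) with FP16 KV cache.

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes   # 2 = keys + values

prompt_toks = 256
mb = kv_bytes_per_token() * prompt_toks / 1e6
for fabric, gb_per_s in [("InfiniBand 200 Gb/s", 25.0), ("25 GbE", 3.125)]:
    ms = (mb / 1000) / gb_per_s * 1000
    print(f"{mb:.0f} MB over {fabric}: {ms:.1f} ms")
```

At these sizes the handoff is a few milliseconds on InfiniBand and tens of milliseconds on commodity Ethernet, which is why inter-node bandwidth matters for latency-sensitive workloads.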
NVIDIA Dynamo 1.0 is the primary open-source framework for this routing layer. See our Dynamo deployment guide for the full setup walkthrough.
Setting Up Heterogeneous Inference with NVIDIA Dynamo and vLLM
Step 1: Provision heterogeneous GPU instances on Spheron
Spheron lets you mix GPU types in a single account without minimum commitments. Provision at least one prefill node (H100 SXM5 or B200) and two decode nodes (L40S or A100 80G).
- Rent H100 → for prefill nodes
- Rent L40S → or Rent A100 → for decode nodes
The prefill and decode nodes must be able to reach each other over the network. Ideally, deploy them in the same datacenter region to keep KV cache transfer latency low.
Step 2: Install NVIDIA Dynamo and vLLM on each node
```bash
# On all nodes
pip install 'vllm>=0.8.0'
pip install nvidia-dynamo

# Or pull from NVIDIA NGC
docker pull nvcr.io/nvidia/dynamo:latest
```

Step 3: Configure dynamo.yaml for heterogeneous worker pools
```yaml
prefill_workers:
  - host: "prefill-node-1"
    gpu_type: "H100_SXM5"
    gpu_count: 8
    role: prefill
decode_workers:
  - host: "decode-node-1"
    gpu_type: "L40S"
    gpu_count: 4
    role: decode
  - host: "decode-node-2"
    gpu_type: "L40S"
    gpu_count: 4
    role: decode
kv_transfer:
  backend: nixl
  protocol: rdma  # or tcp for non-InfiniBand setups
model: meta-llama/Llama-3.1-70B-Instruct
```

Step 4: Start prefill workers
```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode prefill \
  --tensor-parallel-size 8 \
  --dtype fp8
```

Step 5: Start decode workers
The --dtype flag depends on your decode GPU. FP8 is supported on Hopper (H100, H200), Ada Lovelace (L40S), and Blackwell (B200), but not on Ampere (A100). Use bfloat16 for A100 nodes instead.
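If you script your launches, the dtype choice can be derived from the CUDA compute capability (Ampere is SM 8.0, Ada Lovelace 8.9, Hopper 9.0, Blackwell 10.0). A minimal helper sketch, with the SM 8.9 threshold as the working assumption:

```python
# Pick the decode --dtype from CUDA compute capability. FP8 tensor cores
# arrived with Ada (SM 8.9); Ampere (SM 8.0) falls back to bfloat16.

def decode_dtype(cc: tuple) -> str:
    return "fp8" if cc >= (8, 9) else "bfloat16"

assert decode_dtype((8, 0)) == "bfloat16"   # A100 (Ampere)
assert decode_dtype((8, 9)) == "fp8"        # L40S (Ada Lovelace)
assert decode_dtype((9, 0)) == "fp8"        # H100 (Hopper)
assert decode_dtype((10, 0)) == "fp8"       # B200 (Blackwell)
```

On a live node you would feed it the tuple from torch.cuda.get_device_capability().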
If your decode nodes are L40S, H100, H200, or B200 (FP8-capable):

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --tensor-parallel-size 4 \
  --dtype fp8
```

If your decode nodes are A100 (Ampere, no FP8 support):
```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --tensor-parallel-size 4 \
  --dtype bfloat16
```

Step 6: Start the Dynamo router

```bash
dynamo serve --config dynamo.yaml --port 8000
```

For smaller deployments without Dynamo, vLLM v0.8+ supports disaggregated prefilling natively via NixlConnector. You can run prefill and decode as separate vLLM instances that transfer KV cache over the network. This works well for 1-2 node setups but lacks Dynamo's load balancing and worker lifecycle management. See the vLLM production deployment guide for the vLLM-only setup path.
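As a sketch of that vLLM-only path: one vLLM instance per phase, with the KV handoff configured via --kv-transfer-config. The connector and role names below follow recent vLLM releases; verify them against the docs for your installed version.

```bash
# Sketch: vLLM-only disaggregated prefill, no Dynamo. Connector/role
# names ("NixlConnector", "kv_producer"/"kv_consumer") are taken from
# recent vLLM releases -- confirm against your version's docs.

# Prefill instance (H100 node)
vllm serve meta-llama/Llama-3.1-70B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode instance (L40S or A100 node)
vllm serve meta-llama/Llama-3.1-70B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```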
Real-World Cost Savings: Before and After
Workload: Llama-3.1-70B-Instruct, 1,000 req/hr, 256-token average input, 512-token average output.
| Configuration | GPUs | Cost/hr | Throughput (tok/s) | Cost per 1M tokens |
|---|---|---|---|---|
| Homogeneous H100 SXM5 x4 | 4x H100 | $10.28 | ~3,200 | ~$0.89 |
| Heterogeneous: H100 x1 prefill + L40S x2 decode | 1x H100, 2x L40S | $4.01 | ~2,400 | ~$0.46 |
| Heterogeneous: H100 x2 prefill + A100 x4 decode | 2x H100, 4x A100 | $9.30 | ~4,800 | ~$0.54 |
The first heterogeneous config costs 61% less per hour than 4x H100 ($4.01 vs $10.28). It also runs at 25% lower throughput (2,400 vs 3,200 tok/s), so this is not an equivalent-throughput comparison.
On a cost-per-token basis, the reduction is roughly 48% ($0.46 vs $0.89 per 1M tokens). That is the right metric for comparing these configurations on equivalent work.
If your workload can run at 2,400 tok/s, the first config is the right call. If you need to match the 3,200 tok/s baseline, the second config gets you there at nearly the same hourly cost but with 39% better cost-per-token, because the A100 decode nodes are better utilized than H100s doing decode work.
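The cost-per-token column is plain arithmetic, reproduced here so you can plug in your own prices and throughput numbers:

```python
# Cost per 1M tokens = hourly cost / tokens generated per hour, scaled.

def cost_per_m_tokens(usd_per_hr, tok_per_s):
    return usd_per_hr / (tok_per_s * 3600) * 1e6

for name, usd, tps in [
    ("4x H100",            10.28, 3200),
    ("1x H100 + 2x L40S",   4.01, 2400),
    ("2x H100 + 4x A100",   9.30, 4800),
]:
    print(f"{name}: ${cost_per_m_tokens(usd, tps):.2f} per 1M tokens")
```

These reproduce the table's ~$0.89, ~$0.46, and ~$0.54 figures.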
Note that throughput numbers are representative. Actual results vary by model size, prompt/output ratio, batch size, and network latency between prefill and decode nodes.
Pricing fluctuates based on GPU availability. The prices above are based on 17 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
GPU Pairing Guide: Which Combinations Work Best
| Model Size | Prefill GPU | Decode GPU | Notes |
|---|---|---|---|
| 7B-13B | H100 SXM5 (x1) | L40S (x1-2) | L40S 48GB fits the model and handles decode bandwidth |
| 30B-70B | H100 SXM5 (x2-4) | L40S (x4) or A100 80G (x2) | 70B needs tensor parallelism on prefill |
| 70B-180B | B200 (x2) | A100 80G (x4) | B200 dominates prefill at this scale |
| 405B+ | B200 (x4-8) | A100 80G (x8) or H200 (x4) | Large decode VRAM requirement |
One case where you should choose H200 for decode instead of L40S or A100: models above 70B where KV cache size is the binding constraint. The H200 has 141GB of HBM3e. The L40S has 48GB. At long context lengths (32K+ tokens), KV cache alone can exceed what an L40S can hold per decode worker, and you would need to shard across more L40S nodes to compensate. For 70B models at normal context lengths, L40S or A100 are the better cost choices.
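A quick sizing check shows why. The sketch below assumes a 70B-class KV layout (80 layers, 8 KV heads, head dim 128) with FP16 KV cache; the batch size is illustrative.

```python
# When KV cache becomes the binding constraint. Assumes a 70B-class
# layout: 80 layers, 8 KV heads (GQA), head dim 128, FP16 KV cache.

KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2   # keys + values, 2 bytes each

def kv_gb(context_len, batch):
    return KV_BYTES_PER_TOKEN * context_len * batch / 1e9

print(f"32K ctx, batch 8: {kv_gb(32768, 8):.0f} GB of KV cache")
# vs 48 GB total on an L40S (minus the weight shard), 141 GB on an H200.
```

At 32K context even a modest batch overwhelms a single L40S's 48 GB before weights are accounted for, which is the regime where H200 decode nodes earn their price.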
See Best GPU for AI Inference 2026 for full GPU specs and a workload-based decision guide.
How to Deploy Heterogeneous GPU Clusters on Spheron
Most hyperscalers offer GPU families only in uniform clusters. On AWS or GCP, putting H100 and L40S nodes on the same network fabric means juggling separate instance families and manual networking setup. Spheron is designed differently: you can mix GPU types in a single deployment, with no locked-in instance families or minimum cluster sizes.
The workflow: browse the GPU rental catalog, pick H100 instances for your prefill nodes and L40S or A100 instances for your decode nodes, deploy via the dashboard or API, then connect via SSH. All instances on the same region share the same datacenter network fabric for low-latency NIXL KV cache transfers.
For multi-node networking setup, see the Spheron documentation at docs.spheron.ai.
Spheron's per-minute billing means you can scale decode capacity up during peak hours and back down during off-peak. That is not possible with reserved hyperscaler instances, where you pay for the cluster size you committed to regardless of load.
For further cost reduction on decode nodes, spot GPU instances are viable here. Decode worker preemption is handled by the Dynamo router: if a spot decode node gets reclaimed, Dynamo redistributes its pending work to other decode workers. Spot pricing on decode nodes can cut their cost by another 50-70%.
Key Tradeoffs to Watch
Network latency matters. KV cache transfer between prefill and decode nodes adds latency. On InfiniBand at 200Gb/s or higher, this is sub-millisecond for most models. On commodity Ethernet, it can add 5-20ms per request. For interactive use cases where time-to-first-token is critical, verify your inter-node bandwidth before committing to a heterogeneous setup.
Prefill-to-decode ratio determines savings. If your workload produces very short outputs (under 64 tokens), decode is fast on H100 anyway. The savings from cheaper decode GPUs shrink with shorter outputs. Heterogeneous setups are most beneficial for workloads with output length above 256 tokens.
Model sharding complexity. Models above 70B require tensor parallelism across multiple GPUs on both prefill and decode sides. This adds configuration complexity: you need to define tensor parallel groups within each pool, not just across the pool. Start with a 7B-13B model to validate your heterogeneous setup before scaling to larger models.
Spot preemption on decode nodes. Use Dynamo's built-in failover, not custom retry logic. Dynamo tracks KV cache locations and can reroute generation to another decode worker when a node goes down. Custom retry logic at the application layer will not have visibility into where the KV cache lives.
Spheron lets you mix H100, L40S, A100, and B200 instances in a single account with per-minute billing. There are no locked-in instance families or minimum cluster sizes. Start a heterogeneous inference cluster now or view live GPU pricing →.
