Giant MoE models waste 20-40% of GPU cycles on hot-expert queuing while cold experts sit idle. Wide Expert Parallelism (Wide-EP) and the Expert Parallelism Load Balancer (EPLB) fix this by spreading expert routing across more GPUs and dynamically rebalancing assignments as traffic patterns shift. For background on MoE memory planning and expert routing basics, see the MoE inference optimization guide. For the kernel-level dispatch optimizations that work alongside load balancing, see the DeepEP and DeepGEMM guide.
The Hot/Cold Expert Problem
Standard expert parallelism assigns each GPU a fixed slice of the expert pool. On 8 GPUs with 256 experts (DeepSeek V4, top-8 routing), each GPU holds 32 experts. The assumption is that tokens will distribute evenly across all experts. They don't.
In practice, 3-5% of experts absorb 30-50% of token activations in any given batch. These are the "hot" experts: frequently chosen by the router for common token patterns like technical terminology, code constructs, or domain-specific vocabulary. The GPUs that happen to hold these experts run at 95% utilization. The GPUs holding the cold experts run at 20-30%. The hot GPUs become the bottleneck for every forward step.
| Expert group | Share of token activations | GPU utilization (8x EP) |
|---|---|---|
| Top 5% (hot experts) | ~45% | ~92% |
| Middle 50% | ~45% | ~55% avg |
| Bottom 45% (cold experts) | ~10% | ~22% |
At the step level, the entire batch waits for the slowest GPU to finish, and the slowest GPU is the one holding the hot experts. The time the cold GPUs spend waiting is wasted.
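To make the "slowest GPU sets the step time" point concrete, here is a minimal sketch. It assumes static, contiguous expert slices and uses made-up token counts (13 hot experts at 900 tokens each, 243 others at 100); the function names `gpu_loads` and `step_efficiency` are illustrative and not part of any framework.

```python
def gpu_loads(expert_tokens, num_gpus):
    """Sum token counts per GPU under static, contiguous expert slices.

    Assumes len(expert_tokens) is divisible by num_gpus.
    """
    per_gpu = len(expert_tokens) // num_gpus
    return [
        sum(expert_tokens[g * per_gpu:(g + 1) * per_gpu])
        for g in range(num_gpus)
    ]


def step_efficiency(loads):
    """Mean load divided by max load; 1.0 means perfectly balanced."""
    return (sum(loads) / len(loads)) / max(loads)


# Illustrative skew only: 256 experts, the first 13 (about 5%) are hot.
expert_tokens = [900] * 13 + [100] * 243
loads = gpu_loads(expert_tokens, num_gpus=8)
print("per-GPU token load:", loads)
print("step efficiency (mean/max):", round(step_efficiency(loads), 2))
```

Because every GPU must finish before the step completes, the mean-to-max ratio is a rough proxy for how much of the cluster's time does useful work in that step.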
The mismatch compounds with routing skew from real-world traffic. A batch of chemistry papers routes heavily through chemistry-vocabulary experts. A batch of Python completions routes through code-structure experts. The hot/cold pattern changes by batch but the GPU assignment is static. EPLB and Wide-EP break this static coupling.
What Wide-EP Does
Standard expert parallelism (narrow EP) confines the expert routing problem to one NVLink domain, typically 8 GPUs. Wide-EP expands that group to 16, 32, or 72 GPUs. Now, instead of having one GPU per expert slice, you can have multiple GPUs that each hold a copy of the hot experts.
The key mechanism: hot experts are replicated. If expert 47 is hot, it gets copies on 3 GPUs instead of 1. The router dispatches to whichever copy is least loaded. Cold experts stay at 1 copy.
The effect on the utilization table:
| Expert group | Share of token activations | GPU utilization (16x Wide-EP + EPLB) |
|---|---|---|
| Top 5% (hot experts, 3 copies each) | ~45% | ~82% |
| Middle 50% | ~45% | ~78% avg |
| Bottom 45% (cold experts, 1 copy) | ~10% | ~68% |
Utilization is more uniform. The hot-GPU bottleneck flattens. Each forward step completes faster because no single GPU is stuck working through a flood of hot-expert tokens while the others wait.
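The replicate-and-route-to-least-loaded mechanism can be sketched in a few lines. This is a toy model, not how any framework implements dispatch: `dispatch` and the replica map are hypothetical, and each token counts as one unit of load.

```python
def dispatch(token_experts, replicas, num_gpus):
    """Route each token to the least-loaded GPU that holds its expert.

    token_experts: expert id chosen for each token
    replicas: expert id -> list of GPU ids holding a copy of that expert
    """
    load = [0] * num_gpus
    for expert in token_experts:
        gpu = min(replicas[expert], key=lambda g: load[g])
        load[gpu] += 1
    return load


# Nine tokens for hot expert 47, three tokens for cold expert 3.
tokens = [47] * 9 + [3] * 3
single_copy = dispatch(tokens, {47: [0], 3: [1]}, num_gpus=4)
three_copies = dispatch(tokens, {47: [0, 2, 3], 3: [1]}, num_gpus=4)
print("1 copy of expert 47:  ", single_copy)
print("3 copies of expert 47:", three_copies)
```

With one copy, GPU 0 carries all nine hot-expert tokens; with three copies the same tokens spread evenly across the GPUs that hold a replica.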
The All-to-All Cost Tradeoff
Wide-EP comes with a cost: larger all-to-all communication. At 8x EP within one NVLink domain, all-to-all uses 900 GB/s NVLink bandwidth on H200 SXM5 (Gen4 NVLink); B200 SXM6 nodes with Gen5 NVLink deliver 1.8 TB/s, so the intra-node leg is faster there. At 16x Wide-EP across two NVLink nodes connected by InfiniBand NDR 400G, the inter-node leg drops to ~350-390 Gb/s per port (~44-49 GB/s per HCA) effective. The all-to-all now crosses two layers of fabric.
On a GB200 NVL72 rack with 72 GPUs on one flat NVSwitch fabric at 130 TB/s, this penalty disappears. The 72-GPU all-to-all is as fast as the 8-GPU version because the fabric is non-blocking.
For teams using standard multi-node InfiniBand clusters (the common case on Spheron): Wide-EP pays off reliably above batch size 64. At small batch sizes, the per-token all-to-all overhead exceeds the load-balancing savings. The practical rule: if your production serving load is throughput-bound (large batches, tens of concurrent requests), Wide-EP helps on InfiniBand. If you're running single-request latency-optimized serving with TTFT SLOs under 50ms, stay within a single NVLink node.
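The bandwidth figures above can be sanity-checked with simple arithmetic. The sketch below converts the quoted InfiniBand rates from Gb/s to GB/s and estimates a pure-bandwidth transfer time for a hypothetical 256 MiB all-to-all payload. The payload size is an assumption for illustration, and the model ignores latency, protocol overhead, and contention, so treat the results as rough lower bounds rather than predictions.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


def transfer_ms(payload_bytes, gbyte_per_s):
    """Pure bandwidth time in milliseconds (no latency or overhead)."""
    return payload_bytes / (gbyte_per_s * 1e9) * 1e3


payload = 256 * 1024 * 1024  # hypothetical 256 MiB of dispatched activations

links = [
    ("NVLink Gen4 (900 GB/s)", 900.0),
    ("IB NDR effective, low (350 Gb/s)", gbit_to_gbyte(350)),
    ("IB NDR effective, high (390 Gb/s)", gbit_to_gbyte(390)),
]
for name, gbyte_per_s in links:
    print(f"{name}: {gbyte_per_s:.2f} GB/s, {transfer_ms(payload, gbyte_per_s):.2f} ms")
```

The conversion reproduces the roughly 44-49 GB/s per HCA quoted above, and the order-of-magnitude gap between the two fabrics is why the inter-node leg dominates the all-to-all cost.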
EPLB Deep Dive
Wide-EP tells you to spread experts across more GPUs. EPLB tells you which experts to put where.
Static vs Online Rebalancing
Static EPLB: collect a calibration trace from representative traffic, run the assignment solver once, fix the placement, deploy. Fast at runtime. The problem is that calibration data is never perfectly representative. Production traffic shifts: morning prompts differ from evening prompts, model usage patterns change week by week. A static assignment optimized for last Tuesday's traffic distribution will degrade as the real distribution drifts.
Online EPLB: every N forward steps, observe the per-expert token counts from the last window, run the solver, update the placement. This adds CPU overhead per rebalance cycle but tracks the actual routing patterns in real time. For production deployments with variable-topic traffic, online EPLB consistently outperforms static.
Starting point: rebalance every N=200 steps. At low-variance traffic (homogeneous batch topics), push to N=500. At high-variance traffic (mixed domains per batch), pull down to N=50. Monitor the rebalance-to-forward-step ratio to keep solver time under 2% of wall time.
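The online loop described above (accumulate per-expert counts, hand them to the solver every N steps, and watch the solver's share of wall time) might look like the following sketch. `RebalanceWindow` and `solver_overhead` are hypothetical helpers written for illustration; they are not vLLM or SGLang APIs.

```python
from collections import Counter


class RebalanceWindow:
    """Accumulates per-expert token counts and signals when to rebalance."""

    def __init__(self, step_interval=200):
        self.step_interval = step_interval
        self.steps = 0
        self.counts = Counter()

    def observe(self, expert_ids):
        """Record one forward step.

        Returns the window's counts when a rebalance is due, else None.
        """
        self.counts.update(expert_ids)
        self.steps += 1
        if self.steps % self.step_interval == 0:
            window, self.counts = self.counts, Counter()
            return window
        return None


def solver_overhead(solver_seconds, wall_seconds):
    """Fraction of wall time spent in the solver; the text targets < 0.02."""
    return solver_seconds / wall_seconds


window = RebalanceWindow(step_interval=2)
first = window.observe([47, 47, 3])   # step 1: no rebalance yet
second = window.observe([47, 12])     # step 2: window closes
print(first, second)
```

Resetting the counter after each window means the solver only ever sees the most recent traffic, which is what lets the placement follow topic drift.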
Hierarchical vs Global Strategies
Global EPLB: one solver optimizes expert placement across all GPUs in the expert-parallel group. This finds the mathematically optimal assignment. Above 16 GPUs, solver time grows quadratically and starts adding measurable latency to each rebalance cycle.
Hierarchical EPLB: split the GPU group into tiers. For two 8-GPU NVLink nodes, the first tier assigns expert groups to each NVLink domain. The second tier assigns within each domain. Two small solvers instead of one large one. Solver time stays bounded regardless of total GPU count.
Both strategies are available in vLLM 0.19+. Use global below 16 GPUs. Use hierarchical above 16. On a 16-GPU two-node InfiniBand cluster, hierarchical EPLB matches global quality at about 40% of the solver CPU cost.
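The text does not specify the solver itself, so the sketch below swaps in a greedy longest-processing-time (LPT) heuristic as a stand-in to show the difference in shape between a single global pass and a two-tier hierarchical pass. It ignores replication and any equal-experts-per-GPU constraint, and `lpt_assign` and `hierarchical_assign` are illustrative names, not framework functions.

```python
import heapq


def lpt_assign(loads, num_bins):
    """Greedy LPT: place the heaviest remaining expert on the lightest bin.

    loads: expert id -> observed token count
    Returns a list of expert-id lists, one per bin.
    """
    heap = [(0, b) for b in range(num_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(num_bins)]
    for expert in sorted(loads, key=loads.get, reverse=True):
        total, b = heapq.heappop(heap)
        bins[b].append(expert)
        heapq.heappush(heap, (total + loads[expert], b))
    return bins


def hierarchical_assign(loads, num_nodes, gpus_per_node):
    """Two tiers: balance experts across nodes, then across GPUs in a node."""
    placement = {}
    for node, experts in enumerate(lpt_assign(loads, num_nodes)):
        node_loads = {e: loads[e] for e in experts}
        for local_gpu, group in enumerate(lpt_assign(node_loads, gpus_per_node)):
            for expert in group:
                placement[expert] = node * gpus_per_node + local_gpu
    return placement


loads = {0: 8, 1: 7, 2: 6, 3: 5}
print("global:", lpt_assign(loads, num_bins=4))
print("hierarchical:", hierarchical_assign(loads, num_nodes=2, gpus_per_node=2))
```

The hierarchical version only ever solves problems of size `num_nodes` or `gpus_per_node`, which is the property that keeps solver time bounded as the group grows.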
Redundant vs Duplicated Hot Experts
Two options for handling hot experts:
Redundant placement: Move the hot expert's canonical copy to a less-loaded GPU. It still exists once. The EPLB solver puts it where the load is lower. No extra VRAM cost. The downside: the expert can still become a bottleneck at its new location if demand stays high.
Duplication: Copy the hot expert's weights to 2 or more GPUs. Both copies serve traffic. The router picks the least-loaded copy. This cuts hot-expert queuing to near-zero but costs VRAM for each copy.
For DeepSeek V4: each expert's weight tensor at FP8 is roughly 0.5-1.5 GB depending on the layer. Duplicating the top 5% of hot experts (roughly 13 experts in a 256-expert model) across 2 GPUs costs 6-20 GB of total additional VRAM. On H200 SXM5 with 141 GB per GPU, that's affordable. On H100 SXM5 with 80 GB, check your headroom.
| Replication factor | Extra VRAM (top 5% of DeepSeek V4 experts) | Hot-expert queue reduction |
|---|---|---|
| 1x (no duplication) | 0 GB | 0% |
| 2x | ~8-15 GB total | ~45% |
| 4x | ~16-30 GB total | ~70% |
4x duplication is rarely worth it; the VRAM cost grows while the marginal queue reduction shrinks. 2x is the practical sweet spot for most production deployments.
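The 2x estimate above is straightforward multiplication, sketched here with the figures from the text (13 hot experts, 0.5-1.5 GB per expert at FP8). The helper `extra_vram_gb` is an illustrative name, and the calculation counts only the additional copies, not the original.

```python
def extra_vram_gb(num_hot_experts, replicas, gb_per_expert):
    """Extra VRAM for the (replicas - 1) additional copies of each hot expert."""
    return num_hot_experts * (replicas - 1) * gb_per_expert


num_hot = round(0.05 * 256)  # top 5% of 256 experts, about 13
low = extra_vram_gb(num_hot, replicas=2, gb_per_expert=0.5)
high = extra_vram_gb(num_hot, replicas=2, gb_per_expert=1.5)
print(f"{num_hot} hot experts at 2x: {low}-{high} GB extra")
```

This range brackets the table's narrower 2x estimate; where a real deployment lands depends on which layers the hot experts sit in.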
Wide-EP on Rack-Scale Systems (NVL72-class)
For teams with access to GB200 NVL72 capacity (see the GB200 NVL72 guide for provisioning details): Wide-EP on NVL72 is the idealized case. All 72 B200 GPUs sit on one flat NVSwitch fabric at 130 TB/s. All-to-all communication at 72-GPU scale costs the same as at 8-GPU scale per byte transferred. The Wide-EP penalty that affects InfiniBand multi-node deployments disappears.
SGLang configuration for a full NVL72 span:
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V4-Pro \
--tp 72 \
--ep-size 72 \
--enable-dp-attention \
--dp 72 \
--dtype fp8 \
--port 30000
For disaggregated prefill/decode at rack scale (covered in the prefill-decode disaggregation guide): dedicate 24 GPUs to prefill workers and 48 to decode workers. The prefill GPUs are compute-bound; the decode GPUs are memory-bandwidth-bound. Both pools stay within the same NVSwitch domain, so KV cache transfer between pools uses NVLink rather than InfiniBand. This eliminates both the prefill-decode contention and the expert imbalance problem.
Hands-On: Enabling EPLB in vLLM, SGLang, and DeepEP
vLLM
vllm serve deepseek-ai/DeepSeek-V4 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--enable-eplb \
--eplb-config '{"step_interval":200}' \
--dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.88 \
--port 8000
--enable-expert-parallel enables expert parallelism across GPUs. --enable-eplb activates the load balancer as a separate step; without this flag, expert parallelism runs but no rebalancing occurs. --eplb-config '{"step_interval":200}' sets online rebalancing every 200 steps (default is 3000). Check logs for 'EPLB: rebalancing experts' to confirm online mode is running. Note: verify the current flag names in your vLLM release notes, as they can shift between minor versions.
For multi-node 16-GPU Wide-EP in vLLM, also add:
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
Then set --tensor-parallel-size 8 (per node) with 2 pipeline stages. This is the current vLLM path for multi-node Wide-EP; native cross-node expert parallelism support is tracked in the vLLM issue tracker.
SGLang with DeepEP
# Install DeepEP first
pip install git+https://github.com/deepseek-ai/DeepEP
# Launch SGLang with Wide-EP (16 GPUs, 2 nodes)
SGLANG_DEEPEP_CHUNK_SIZE=128 \
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V4 \
--tp 16 \
--ep-size 16 \
--moe-a2a-backend deepep \
--enable-dp-attention \
--dp 16 \
--dtype fp8 \
--port 30000
--moe-a2a-backend deepep explicitly selects DeepEP for Wide-EP all-to-all dispatch; installing the library alone does not activate it. SGLANG_DEEPEP_CHUNK_SIZE=128 balances latency and throughput; raise to 256 for pure throughput-optimized batch jobs, drop to 64 for latency-sensitive serving.
Framework Comparison
| Framework | EPLB mode | Online rebalance | Min GPUs | Config flag |
|---|---|---|---|---|
| vLLM 0.17+ | Static + online | Yes (--eplb-config) | 2 | --enable-expert-parallel --enable-eplb |
| SGLang 0.4+ | Online via DeepEP | Yes (built-in) | 2 | --ep-size --moe-a2a-backend deepep |
| DeepEP standalone | Online | Yes | 2 | SGLANG_DEEPEP_CHUNK_SIZE |
Measuring Per-Expert Load
Before enabling Wide-EP, verify that you actually have significant imbalance. The math matters: if your top 10% of experts get 40% of tokens, Wide-EP saves you something. If they get 15%, the overhead may exceed the benefit.
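The "top 10% of experts get X% of tokens" check is easy to compute once you have per-expert counts from either framework. A minimal sketch, where `top_share` is an illustrative helper and the sample counts are invented:

```python
def top_share(counts, fraction=0.10):
    """Share of all tokens handled by the top `fraction` of experts."""
    ranked = sorted(counts, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


# Invented counts for 10 experts: one hot expert, nine cold ones.
sample_counts = [55, 5, 5, 5, 5, 5, 5, 5, 5, 5]
share = top_share(sample_counts)
print(f"top 10% of experts handle {share:.0%} of tokens")
```

Compare the result against the thresholds in the paragraph above: around 40% suggests meaningful skew, while around 15% suggests the overhead may not be worth it.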
SGLang metrics
SGLang exposes a Prometheus endpoint at /metrics. The sglang_expert_token_count histogram tracks per-expert activation counts. To get the imbalance ratio:
# Pull the expert token count histogram
curl http://localhost:30000/metrics | grep sglang_expert_token_count
# Compute top-5 hottest experts and imbalance ratio (Python)
import requests, json
metrics = requests.get("http://localhost:30000/metrics").text
# parse histogram bucket values, sort descending by count
# imbalance_ratio = max_expert_count / mean_expert_count
# ratio > 3x: Wide-EP will help
# ratio < 1.5x: Wide-EP overhead likely exceeds benefit
A ratio above 3x is a clear signal. Below 1.5x, profile further before committing to Wide-EP.
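The ratio itself is a one-liner once the counts are in a dictionary. The sketch below deliberately skips parsing the metrics endpoint, since the exact label layout of the histogram is not shown here; `imbalance_ratio` and `verdict` are illustrative helpers, and the "inconclusive" middle band is an assumption, because the text only defines the 3x and 1.5x thresholds.

```python
from statistics import mean


def imbalance_ratio(expert_counts):
    """max_expert_count / mean_expert_count for a dict of expert id -> count."""
    values = list(expert_counts.values())
    return max(values) / mean(values)


def verdict(ratio):
    """Map a ratio onto the thresholds from the text (middle band assumed)."""
    if ratio > 3.0:
        return "Wide-EP will likely help"
    if ratio < 1.5:
        return "Wide-EP overhead likely exceeds benefit"
    return "inconclusive: profile further"


# Invented counts for four experts.
counts = {0: 20, 1: 2, 2: 1, 3: 1}
ratio = imbalance_ratio(counts)
print(f"imbalance ratio {ratio:.2f}: {verdict(ratio)}")
```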
vLLM expert stats
vllm serve deepseek-ai/DeepSeek-V4 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--collect-expert-stats \
--dtype fp8 \
--max-model-len 32768
--collect-expert-stats (vLLM 0.19+) logs per-expert activation counts every N steps. Check the log file for expert_activation_counts lines. Sort the expert IDs by count; the top few will stand out clearly.
Compare the histogram before and after enabling EPLB. You should see the distribution flatten and the max/mean ratio drop below 2x within a few hundred steps as the rebalancer converges.
Utilization and Cost-per-Token: Before and After
These numbers are representative estimates for a throughput-optimized 16-GPU InfiniBand NDR 400G configuration running DeepSeek V4 at FP8. Actual results vary by batch size, context length, and model.
| Config | Tokens/sec (approx) | GPU util | Dispatch overhead | Cost/M tokens (H200 SXM5) |
|---|---|---|---|---|
| 8x EP, NCCL all-to-all | ~baseline | ~60% avg (hot GPU 95%, cold 25%) | ~18% step time | baseline |
| 8x EP + EPLB (static) | ~+15% | ~78% avg | ~12% step time | -13% |
| 16x Wide-EP + EPLB (online) | ~+30% | ~88% avg | ~8% step time | -23% |
| 16x Wide-EP + DeepEP + EPLB | ~+45% | ~91% avg | ~4% step time | -31% |
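
The cost column in the table above follows from the throughput column if the tokens/sec gains are read as per-GPU gains at an unchanged hourly price per GPU. That reading is an assumption, but it reproduces the table's figures; `cost_per_token_change` is an illustrative helper.

```python
def cost_per_token_change(throughput_gain):
    """Relative change in cost per token for a given per-GPU throughput gain.

    Cost per token scales as 1 / throughput when the hourly price is fixed.
    """
    return 1.0 / (1.0 + throughput_gain) - 1.0


rows = [
    ("8x EP + EPLB (static)", 0.15),
    ("16x Wide-EP + EPLB (online)", 0.30),
    ("16x Wide-EP + DeepEP + EPLB", 0.45),
]
for label, gain in rows:
    print(f"{label}: {cost_per_token_change(gain):+.0%}")
```

Note that the saving grows more slowly than the throughput gain: a 45% throughput increase yields roughly a 31% cost reduction, not 45%.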
Live pricing from Spheron (fetched 03 Jul 2026):
- H200 SXM5: $3.70/hr per GPU on-demand, $3.31/hr spot
- 8x H200 SXM5: $29.58/hr on-demand, $26.51/hr spot
- 16x H200 SXM5: $59.14/hr on-demand, $53.04/hr spot
- B200 SXM6: $8.83/hr per GPU on-demand, $5.34/hr spot
The 31% cost-per-token reduction from the full Wide-EP + DeepEP + EPLB stack on 16x H200 SXM5 makes the second node pay for itself on throughput-bound batch workloads. Spot pricing reduces the baseline cost further, though spot instances can be reclaimed without notice and are better suited for tolerant batch jobs than online serving.
Pricing fluctuates based on GPU availability. The prices above are as of 03 Jul 2026 and may have changed. Check current GPU pricing for live rates.
Running Wide-EP on Multi-Node GPU Cloud Without an NVL72 Rack
Most teams running giant MoE models won't have access to an NVL72 rack. The practical path is 2-4 standard H200 SXM5 or B200 SXM6 nodes connected by InfiniBand NDR 400G. You get most of the Wide-EP benefit at a fraction of the rack cost.
Configuration checklist for multi-node Wide-EP on InfiniBand:
# 1. Activate RDMA transport (NCCL)
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_0 # replace with your IB adapter name from ibstat
# 2. Verify inter-node bandwidth
ib_write_bw -d mlx5_0 # expect ~350-390 Gb/s per port (~44-49 GB/s) on NDR 400G
# 3. Limit Wide-EP group to 16 GPUs (2 nodes)
# --ep-size 16 keeps all-to-all within a manageable InfiniBand budget
# --ep-size 32+ requires 4+ nodes and a non-blocking switch fabric
# 4. Use hierarchical EPLB for this topology
# vLLM uses hierarchical by default above 16 GPUs
# SGLang respects the topology via --ep-size boundary
At batch sizes below 32, the InfiniBand all-to-all overhead per-token exceeds the load-balancing savings. For these small-batch workloads, revert to 8x EP within a single NVLink node and skip the cross-node cost. The same NCCL tuning considerations apply here as in any multi-node training setup; see the multi-node GPU training NCCL flag reference for tuning details.
To provision multi-node clusters on Spheron, use https://docs.spheron.ai/ for SSH setup, InfiniBand configuration, and networking verification steps.
When NOT to Use Wide-EP
Wide-EP is not a universal upgrade. Skip it if:
Small expert count. Models with fewer than 16 experts (early Mixtral 8x7B, some research MoE checkpoints) don't have enough routing skew to justify the all-to-all expansion. The hot/cold delta is small in absolute terms.
Single-GPU or PCIe multi-GPU. Wide-EP requires NVLink or InfiniBand between GPUs. PCIe-connected multi-GPU setups lack the bandwidth for the expanded all-to-all to be worthwhile.
Latency-critical low-batch serving. If your SLO is TTFT under 50ms on single requests, Wide-EP adds communication latency that can push you over budget even as it improves throughput. See the vLLM vs SGLang comparison for latency-focused configuration trade-offs.
Very small batches. Below batch size 32 on InfiniBand, the per-token all-to-all cost exceeds the load-balancing savings. Wait until your serving traffic grows before enabling Wide-EP.
The right time to enable Wide-EP: you have a throughput-bound production deployment, 16+ GPUs across 2+ NVLink nodes with InfiniBand between them, batch sizes consistently above 64, and a measured hot/cold imbalance ratio above 2.5x. If all four are true, Wide-EP plus online EPLB will cut your cost-per-token meaningfully.
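The four conditions above can be written down as a simple gate, which is handy for a deployment checklist or a capacity-planning script. The function name and parameters are hypothetical; the thresholds are the ones stated in the paragraph.

```python
def should_enable_wide_ep(throughput_bound, num_gpus, num_nodes,
                          batch_size, imbalance_ratio):
    """Return True only if all four conditions from the text hold."""
    return bool(
        throughput_bound
        and num_gpus >= 16 and num_nodes >= 2
        and batch_size > 64
        and imbalance_ratio > 2.5
    )


print(should_enable_wide_ep(True, num_gpus=16, num_nodes=2,
                            batch_size=128, imbalance_ratio=3.1))
print(should_enable_wide_ep(True, num_gpus=8, num_nodes=1,
                            batch_size=128, imbalance_ratio=3.1))
```

The gate is intentionally strict: failing any one condition means staying on single-node EP until the workload changes.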
Running DeepSeek V4, GLM-5.2, or Llama 4 Maverick in production? Spheron's multi-node H200 SXM5 and B200 SXM6 clusters include InfiniBand interconnect for Wide-EP and EPLB workloads, at a fraction of hyperscaler rack rental cost.
Quick Setup Guide
Run 'nvidia-smi topo -m' to confirm NVLink fabric within each node. For multi-node Wide-EP, run 'ibstat' and confirm InfiniBand NDR 400G (or RoCE at 400G) is present and links are ACTIVE. Identify your NVLink domain size (8 for standard HGX B200/H200 nodes) and your inter-node bandwidth. Wide-EP all-to-all scales quadratically with GPU count, so confirm your fabric is non-blocking before increasing expert-parallel-size beyond 16.
Install vLLM v0.17+. Launch with --enable-expert-parallel to activate expert parallelism, and add --enable-eplb to turn on the load balancer. For online rebalancing, set the rebalance interval via --eplb-config '{"step_interval":200}' (steps between rebalance solves; start with 200, lower to 50 for high-variance traffic). Example: vllm serve deepseek-ai/DeepSeek-V4 --tensor-parallel-size 8 --enable-expert-parallel --enable-eplb --eplb-config '{"step_interval":200}' --dtype fp8 --max-model-len 32768 --port 8000. Check logs for 'EPLB: rebalancing experts' to confirm online mode is active. Note: verify the current flag names in your vLLM release notes, as they can shift between minor releases.
Install DeepEP (pip install git+https://github.com/deepseek-ai/DeepEP) and launch SGLang with --ep-size equal to your total GPU count (e.g., 16 for two 8-GPU NVLink nodes) and --moe-a2a-backend deepep to route Wide-EP dispatch through DeepEP: python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V4 --tp 16 --ep-size 16 --moe-a2a-backend deepep --enable-dp-attention --dp 16 --dtype fp8 --port 30000. DeepEP is not activated by installation alone; the --moe-a2a-backend deepep flag is required. Set SGLANG_DEEPEP_CHUNK_SIZE=128 for balanced latency/throughput; raise to 256 for throughput-only batch jobs.
Capture per-expert token counts by enabling SGLang's expert metrics endpoint (/metrics, look for sglang_expert_token_count histogram) or by running a calibration trace with vLLM's --collect-expert-stats flag (available in vLLM 0.19+). Sort experts by token count descending. If the top 10% of experts receive more than 40% of tokens, you have significant imbalance and EPLB will help. If the distribution is within 2x from top to bottom, Wide-EP overhead may exceed the benefit for your traffic.
Log in to app.spheron.ai and provision 2 or more H200 SXM5 or B200 SXM6 nodes. Request InfiniBand or high-speed RoCE interconnect between nodes. Verify inter-node bandwidth with 'ib_write_bw -d mlx5_0' (expect ~350-390 Gb/s per port for NDR 400G, which is roughly 44-49 GB/s per HCA). SSH setup and cluster configuration are at https://docs.spheron.ai. For models requiring 16+ GPUs (DeepSeek V4-Pro at FP8), provision across 2-3 H200 SXM5 nodes and configure vLLM pipeline parallelism in combination with Wide-EP.
For production throughput-optimized serving, combine Wide-EP with prefill-decode disaggregation. Assign compute-dense nodes (H100 SXM5 or B200 SXM6) as prefill workers and memory-dense nodes (H200 SXM5) as decode workers. Wide-EP applies to both pools independently. SGLang's dp-attention mode handles the dispatch topology; vLLM with NixlConnector handles KV cache transfer. This combination eliminates both the prefill-decode GPU contention and the expert imbalance problem simultaneously.
Frequently Asked Questions
What is Wide Expert Parallelism (Wide-EP)?
Wide-EP assigns expert groups across a larger set of GPUs than a single NVLink domain, spreading the expert routing problem over 16 to 72 GPUs instead of 8. This increases the chance that a token can be routed to a lightly loaded replica of a hot expert on any of those GPUs, not just the one that happens to own the canonical copy. The downside is larger all-to-all communication, which requires high-bandwidth interconnect (NVLink or InfiniBand NDR 400G) to avoid adding more latency than the load balancing saves.
What is EPLB, and should it run in static or online mode?
EPLB (Expert Parallelism Load Balancer) is the subsystem that rebalances which GPUs hold which experts during serving. vLLM's EPLB can run in static mode (compute the optimal assignment once from calibration data and fix it) or online mode (re-solve the assignment every N steps based on observed token counts per expert in the current request window). Online EPLB is the better choice for production MoE serving because real traffic is rarely as uniform as calibration data; the router's hot-expert distribution shifts with prompt topic.
Can I run Wide-EP without an NVL72 rack?
Yes. The NVL72 rack (72 B200 GPUs, 130 TB/s NVSwitch fabric) is the purpose-built Wide-EP substrate, but Wide-EP over standard InfiniBand NDR 400G multi-node clusters running H200 SXM5 or B200 SXM6 nodes gives most of the load-balancing benefit at significantly lower cost. You accept higher per-step communication latency versus NVLink, which narrows the benefit on small batch sizes but still saves 15-30% of per-step wall time on throughput-optimized batch jobs.
Which models benefit most from Wide-EP and EPLB?
Models with large expert counts and heavy routing skew benefit most: DeepSeek V4 (256 experts, top-8 routing), DeepSeek V4-Pro (384 experts, top-16 routing), GLM-5.2 (128+ experts), and Llama 4 Maverick (128 experts, top-1 routing), as configured at time of writing. Models with fewer than 16 experts or perfectly uniform routing (rare in practice) see minimal benefit because the hot/cold imbalance is small in absolute terms.
What is the difference between global and hierarchical EPLB?
Global EPLB runs one solver that optimizes expert placement across every GPU in the expert-parallel group. This finds the best possible assignment but requires O(N^2) solver time as GPU count grows. Hierarchical EPLB splits the GPU group into tiers (e.g., NVLink domains of 8, then InfiniBand domains of N such groups) and runs a two-level solver: first assign experts to NVLink domains, then assign within each domain. Hierarchical is the production choice above 16 GPUs because solver latency is bounded, and the intra-domain placement remains NVLink-local.
