Speculative decoding for dense models is well-understood at this point. You pick a small draft model, set --num-speculative-tokens 5, and watch throughput double. For a primer on that setup, see the speculative decoding guide for dense models. MoE models are different. The draft model doesn't just need to predict the next token; it also needs to predict which experts the target model will route that token to. When those routing predictions diverge, you lose the efficiency gain and can end up slower than standard decoding. This guide covers the routing prediction problem specifically, the draft-head strategies that solve it, and working configurations for DeepSeek V4 and GLM-5.1.
TL;DR
All numbers on H100 SXM5 ($3.90/hr on-demand), DeepSeek V4 FP8, batch size 1-4.
| Mode | Tokens/sec | Acceptance Rate | P50 Latency | Best For |
|---|---|---|---|---|
| Standard MoE decoding | ~900 | N/A | ~55 ms | High concurrency (16+ req), batch jobs |
| Dense draft model (1B) | ~1,400 | 0.55-0.65 | ~35 ms | When no EAGLE head available |
| EAGLE draft head | ~2,800 | 0.72-0.78 | ~20 ms | Low-concurrency chat, interactive APIs |
| EAGLE draft head + expert parallel | ~3,800 | 0.72-0.78 | ~18 ms | Production MoE serving at scale |
Pricing fluctuates based on GPU availability. The prices above are based on 31 May 2026 and may have changed. Check current GPU pricing for live rates.
Why MoE Speculative Decoding Is a Different Problem
In dense speculative decoding, the draft model predicts the next token and the target model verifies it. That's the entire interaction. In MoE speculative decoding, there are three things that can go wrong instead of one.
The core asymmetry: the target model's next-token prediction is determined by which experts activate on each token. Two tokens with the same representation but different expert routing paths will produce different output distributions. The draft model, which runs before the target's routing decision, cannot see which experts the target will select.
Problem 1: Routing prediction. The draft model generates token candidates based on its own learned distribution. If it was trained on a dense model's outputs, it knows nothing about expert routing. Even if it predicts the right token, the target model's expert routing for that token may differ from what the draft assumed, creating a mismatch in KV cache state.
Problem 2: Expert activation overhead during verification. In dense speculative decoding, verifying N draft tokens costs one forward pass over N positions. In MoE verification, each of the N positions triggers full expert dispatch, including the all-to-all communication that sends tokens to their assigned expert GPUs. You can't short-circuit the routing step.
Problem 3: Draft-target divergence under expert parallelism. With expert parallelism, each GPU holds a different set of expert weights. The draft step and the verification step each require all-to-all synchronization across all expert-parallel GPUs. For 8-GPU expert-parallel with 5 speculative tokens, that's 5 synchronization rounds in the verification pass versus 1 in standard decoding.
Here is the draft-verify loop with the added routing step:
Dense speculative decoding:
Draft model: [tok1] [tok2] [tok3] [tok4] [tok5] <-- fast generation
Target model: ✓ ✓ ✓ ✗ <-- verified in 1 forward pass
MoE speculative decoding:
Draft model: [tok1] [tok2] [tok3] [tok4] [tok5] <-- fast generation
Router check: [exp2] [exp7] [exp2] [exp5] ... <-- routing predicted or ignored
Target verify: ✓ ✓ ✓ ✗ <-- 1 forward pass with N all-to-all hops
The routing check doesn't exist in dense setups. In MoE, whether the draft model anticipated the right expert activations determines whether the verification pass is efficient or wasteful.
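The dense half of the diagram can be sketched as a greedy accept-the-matching-prefix loop. This is a minimal illustration, not any engine's implementation: `speculative_step`, `toy_draft`, and `toy_target` are hypothetical names, the toy models operate on integers, and a real engine scores all N positions in one batched target pass rather than calling the target once per position.

```python
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    num_speculative_tokens: int = 5,
) -> List[int]:
    """Run one draft-verify round and return the tokens to append to prefix."""
    # Draft phase: the cheap model proposes N tokens autoregressively.
    drafted: List[int] = []
    for _ in range(num_speculative_tokens):
        drafted.append(draft_next(prefix + drafted))

    # Verify phase (greedy): keep draft tokens while they match the target.
    accepted: List[int] = []
    for token in drafted:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's own token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # Every draft token matched, so the target contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    # Hypothetical target model: always counts upward.
    return tokens[-1] + 1


def toy_draft(tokens: List[int]) -> int:
    # Hypothetical draft model: agrees with the target until it sees a 3.
    return tokens[-1] + 1 if tokens[-1] != 3 else 0


print(speculative_step([0], toy_draft, toy_target))  # [1, 2, 3, 4]
```

The toy run mirrors the diagram: three draft tokens are accepted, the fourth is rejected and replaced by the target's token, and the fifth is discarded. In the MoE case the same loop applies, but each verified position also goes through expert dispatch.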
How Dense Speculative Decoding Works (Brief Recap)
Token generation is memory-bandwidth-bound. Each forward pass loads model weights from VRAM to generate a single token. For background on how continuous batching and paged attention address this bottleneck, see the guide on continuous batching and paged attention. Speculative decoding breaks the one-token-per-pass constraint: a small draft model generates N candidates cheaply, and the target model verifies all N in a single forward pass.
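When the draft is correct, you get N tokens (plus one bonus token from the target) for the cost of one verification pass. When the draft is wrong at position K, you keep the tokens before K, take the target's own token at K, and discard the rest. The expected gain depends on acceptance rate: if each draft token is accepted independently with probability 0.75 and N=5, one target pass yields about 3.3 tokens on average, (1 - 0.75^6) / (1 - 0.75). That figure is an upper bound on the speedup, since draft generation and the wider verification pass add their own cost.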
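As a back-of-envelope check on the acceptance-rate arithmetic, the sketch below computes the expected number of tokens emitted per target pass. It assumes each draft token is accepted independently with the same probability, which is a common simplification and not something a real workload guarantees; the function name is illustrative.

```python
def expected_tokens_per_pass(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Mean tokens emitted per target pass under independent per-token acceptance.

    With acceptance rate a and N draft tokens, the pass emits 1 + a + a^2 + ... + a^N
    tokens on average: the accepted prefix plus the one token the target always supplies.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        return float(num_speculative_tokens + 1)
    return (1.0 - acceptance_rate ** (num_speculative_tokens + 1)) / (1.0 - acceptance_rate)


for rate in (0.55, 0.65, 0.75):
    print(f"acceptance={rate:.2f}  N=5  tokens/pass={expected_tokens_per_pass(rate, 5):.2f}")
```

For an acceptance rate of 0.75 and N=5 this evaluates to about 3.29 tokens per pass. Real throughput gains are lower, because the draft steps and the verification pass are not free.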
EAGLE-3 improves on vanilla draft models by training the draft head on the target model's internal feature representations rather than token embeddings. The draft head sees richer information and predicts the target's continuations more accurately. For a full treatment of EAGLE-3, Medusa, and lookahead decoding on dense models, see the speculative decoding production guide. For MoE architectures like DeepSeek V4 and GLM-5.1 where expert routing complicates draft prediction, the next sections cover what changes.
MoE-Specific Challenges
Routing Prediction
Each MoE layer contains a router network: a small linear layer that maps hidden states to expert scores and selects the top-K highest-scoring experts for each token. For DeepSeek V4-Flash, that's top-6 from 256 routed experts (plus 1 shared); V4-Pro uses top-6 from 384 routed experts. For GLM-5.1, the architecture uses a similar fine-grained routing scheme.
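To make the router description concrete, here is a minimal pure-Python sketch of top-K routing for one token. It is a generic illustration, not DeepSeek's or GLM's actual gating code: the function name, the toy weights, and the choice to softmax-normalize over only the selected experts are all assumptions.

```python
import math
from typing import List, Tuple


def route_token(
    hidden: List[float],
    router_weights: List[List[float]],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return (expert_id, gate_weight) pairs for the top_k highest-scoring experts."""
    # One score per expert: dot product of the hidden state with that expert's router row.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]

    # Select the indices of the top_k highest scores.
    selected = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Normalize the selected scores with a numerically stable softmax.
    peak = max(scores[i] for i in selected)
    exps = [math.exp(scores[i] - peak) for i in selected]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(selected, exps)]


# Toy example: 4 experts, 2-dimensional hidden state, top-2 routing.
toy_router = [[0.1, 0.9], [2.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
print(route_token([1.0, 0.0], toy_router, top_k=2))
```

A draft model trained only on output tokens never sees `scores` or `selected`, which is the gap the routing-prediction discussion refers to.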
Draft models trained on token-level outputs see neither the router's internal scores nor which experts were selected. They only see the final output distribution after expert processing. This means a dense draft model can predict the right token with 70% accuracy but still produce KV cache state that diverges from the target's because the token took a different expert path.
The draft head approach fixes this by training on shared-expert hidden states. Shared experts are the experts that activate on every token regardless of routing (DeepSeek V4-Flash has one shared expert per MoE layer alongside its routed experts). Their hidden states reflect the pre-routing representation, which is the same signal the router uses to make its dispatch decision. A head trained on this signal can predict routing-consistent tokens more reliably than a head trained on post-expert outputs.
Expert Activation and KV Cache Implications
In dense speculative decoding, a rejected token at position K costs one extra draft forward pass. In MoE, a rejected token at position K forces the target model to run full expert routing for positions 0 through K-1 before it can reject position K. For a model with 256 experts, each routing step involves dispatching tokens across all expert-parallel GPUs even in the verification pass.
This is why MoE speculative decoding is more sensitive to acceptance rate than dense speculative decoding. A drop from 0.75 to 0.65 acceptance rate costs proportionally more in MoE because the rejected positions trigger full routing overhead instead of just a simple softmax comparison.
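One way to reason about this sensitivity is a simplified cost model: divide the expected tokens per pass by the cost of one draft-verify round, measured in units of a standard decode step. The sketch below is illustrative only. The cost ratios (0.05 for a draft step, 1.0 and 1.6 for dense and MoE verification) are made-up assumptions, not measurements, and the model assumes independent per-token acceptance.

```python
from typing import Optional


def expected_tokens_per_pass(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Mean tokens emitted per target pass, assuming independent per-token acceptance."""
    if acceptance_rate >= 1.0:
        return float(num_speculative_tokens + 1)
    return (1.0 - acceptance_rate ** (num_speculative_tokens + 1)) / (1.0 - acceptance_rate)


def estimated_speedup(
    acceptance_rate: float,
    num_speculative_tokens: int,
    draft_cost: float,
    verify_cost: float,
) -> float:
    """Speedup over standard decoding; costs are relative to one standard decode step."""
    round_cost = verify_cost + num_speculative_tokens * draft_cost
    return expected_tokens_per_pass(acceptance_rate, num_speculative_tokens) / round_cost


def break_even_acceptance(
    num_speculative_tokens: int,
    draft_cost: float,
    verify_cost: float,
    tolerance: float = 1e-6,
) -> Optional[float]:
    """Smallest acceptance rate with speedup >= 1, found by bisection (None if unreachable)."""
    if estimated_speedup(1.0, num_speculative_tokens, draft_cost, verify_cost) < 1.0:
        return None
    low, high = 0.0, 1.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if estimated_speedup(mid, num_speculative_tokens, draft_cost, verify_cost) < 1.0:
            low = mid
        else:
            high = mid
    return high


# Hypothetical cost ratios: MoE verification is assumed pricier due to expert dispatch.
for label, verify_cost in (("dense", 1.0), ("MoE", 1.6)):
    fast = estimated_speedup(0.75, 5, draft_cost=0.05, verify_cost=verify_cost)
    slow = estimated_speedup(0.65, 5, draft_cost=0.05, verify_cost=verify_cost)
    edge = break_even_acceptance(5, draft_cost=0.05, verify_cost=verify_cost)
    print(f"{label}: speedup@0.75={fast:.2f} speedup@0.65={slow:.2f} break-even={edge:.2f}")
```

Under these assumed ratios, the more expensive MoE verification pass raises the break-even acceptance rate, so the same drop in acceptance leaves less margin before speculation stops paying off.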
Batched Expert Activation Under Expert Parallelism
With --enable-expert-parallel, each GPU holds a different subset of the total experts. When a verification pass runs over N speculative positions, each position potentially dispatches to a different set of experts on different GPUs. That's N rounds of all-to-all communication instead of one.
For 8-GPU expert-parallel serving N=5 speculative tokens, you get 5 all-to-all synchronization rounds in the verification pass versus 1 in standard decoding. On NVLink-connected H100 SXM5 nodes (900 GB/s bidirectional), this overhead is manageable. On PCIe interconnect, it's often not worth it.
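The dispatch pattern can be illustrated with a small sketch that maps each speculative position's selected experts to the GPUs that hold them. It assumes experts are sharded contiguously and evenly across GPUs (expert e lives on GPU e // experts_per_gpu); real deployments may place experts differently, and the expert ids below are invented for the example.

```python
from typing import List, Set


def gpus_touched(
    expert_ids_per_position: List[List[int]],
    num_experts: int,
    num_gpus: int,
) -> List[Set[int]]:
    """For each speculative position, return the set of GPUs its experts live on."""
    if num_experts % num_gpus != 0:
        raise ValueError("this sketch assumes experts divide evenly across GPUs")
    experts_per_gpu = num_experts // num_gpus
    return [
        {expert_id // experts_per_gpu for expert_id in expert_ids}
        for expert_ids in expert_ids_per_position
    ]


# Two speculative positions, top-6 routing over 256 experts on 8 GPUs (32 experts each).
routing = [
    [3, 40, 41, 100, 200, 255],
    [5, 6, 7, 33, 64, 65],
]
for position, gpus in enumerate(gpus_touched(routing, num_experts=256, num_gpus=8)):
    print(f"position {position} dispatches to GPUs {sorted(gpus)}")
```

Positions whose experts spread over many GPUs contribute more cross-GPU traffic, which is why interconnect bandwidth matters for the verification pass.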
For expert parallelism fundamentals and VRAM planning, see MoE inference optimization on GPU cloud.
Draft Model Strategies for MoE
Strategy 1: Distilled Dense Draft Model
Use a small dense model (1B to 3B parameters) distilled from the MoE target's output distribution. The draft model predicts tokens accurately but has no knowledge of expert routing. For Mixtral 8x7B, using a 1B Llama distilled on Mixtral outputs works reasonably well since Mixtral's routing is relatively stable across inputs.
Acceptance rate: typically 0.55 to 0.65 for instruction-following workloads. This is lower than EAGLE on dense models (0.70 to 0.80) because the draft model can't anticipate expert routing paths. Good default when no EAGLE head exists for your target model.
Strategy 2: Smaller MoE Draft
Use a smaller MoE model in the same architecture family. In theory, a 22B MoE draft for a 685B MoE target would produce routing-consistent predictions because it shares the same routing structure. In practice, almost no published small MoE drafts exist for frontier models. GLM-5.1's publisher doesn't release a smaller MoE version. Mixtral 8x7B as draft for Mixtral 8x22B is the only practical case today.
Not recommended as a general strategy until more MoE draft checkpoints are published.
Strategy 3: Shared-Expert Draft Head (Recommended)
A lightweight head trained on shared-expert hidden states. Shared experts activate on every token regardless of routing, so their outputs are available for draft prediction without triggering the full expert dispatch step. The head trains on these stable representations and learns to predict routing-consistent token distributions.
This shared-expert draft head approach reaches acceptance rates of 0.72 to 0.78 on instruction-following tasks, approaching EAGLE-3 performance on dense models. No canonical implementation is published yet for DeepSeek V4 or GLM-5.1; to experiment, start from the EAGLE-3 repo at https://github.com/SafeAILab/EAGLE and adapt the training script to use shared-expert hidden states. This is the strategy with the best production viability for DeepSeek V4 and similar architectures where shared experts are part of the design.
| Strategy | Requires Custom Model | Acceptance Rate | VRAM Overhead | Recommended For |
|---|---|---|---|---|
| Dense draft model | No | 0.55-0.65 | 2-4 GB | Mixtral 8x7B/22B, any MoE without EAGLE head |
| Smaller MoE draft | Yes | 0.65-0.72 | 20-50 GB | Mixtral 8x22B only (practical case today) |
| Shared-expert draft head | Yes | 0.72-0.78 | 1-3 GB | DeepSeek V4, architectures with shared experts |
Production Setup: vLLM and SGLang
vLLM Configuration for DeepSeek V4
Note: Combining --enable-expert-parallel with --speculative-config requires vLLM 0.9+ (which replaced the legacy per-flag speculative decoding API). The snippet below uses the current --speculative-config JSON parameter. If you are on an older vLLM release, see the vLLM speculative decoding docs for the migration guide. No official EAGLE checkpoint is published for DeepSeek V4; train your own using the EAGLE-3 repo at https://github.com/SafeAILab/EAGLE. Flags verified against vLLM 0.9+ on 2026-05-31.
# No official EAGLE checkpoint is published for DeepSeek V4.
# Train your own using the EAGLE-3 repo: https://github.com/SafeAILab/EAGLE
vllm serve deepseek-ai/DeepSeek-V4 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--speculative-config '{"model": "/path/to/your-eagle-head", "num_speculative_tokens": 5, "draft_tensor_parallel_size": 1, "disable_by_batch_size": 16}' \
--quantization fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--port 8000
Flag breakdown:
- --tensor-parallel-size 8: distributes attention layers across all 8 GPUs.
- --enable-expert-parallel: assigns experts to specific GPUs instead of replicating them. Required for efficient MoE serving at this scale.
- --speculative-config: JSON object bundling all speculative decoding options. It replaces the legacy per-flag API (--speculative-model, --num-speculative-tokens, --speculative-draft-tensor-parallel-size, and --speculative-disable-by-batch-size) removed in vLLM 0.9+. Key fields: model (path to your trained EAGLE head), num_speculative_tokens (5 is a good starting point; tune down to 3 if acceptance rate is below 0.65), draft_tensor_parallel_size (1 is usually optimal for small draft heads), disable_by_batch_size (16 for MoE; dense models typically tolerate 32 before gains disappear).
- --gpu-memory-utilization 0.92: slightly below the maximum to leave headroom for speculative token buffers.
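Hand-quoting a JSON object inside a shell command is easy to get wrong, so one option is to generate the command. The sketch below builds the --speculative-config value with json.dumps and quotes it with shlex.join. The helper name `build_vllm_command` is hypothetical, the field names are the ones listed in the flag breakdown, and the remaining serving flags from the command above are omitted for brevity.

```python
import json
import shlex
from typing import List


def build_vllm_command(
    model: str,
    eagle_head_path: str,
    num_speculative_tokens: int = 5,
    tensor_parallel_size: int = 8,
    disable_by_batch_size: int = 16,
) -> List[str]:
    """Assemble a vllm serve argument list with a JSON speculative config."""
    speculative_config = {
        "model": eagle_head_path,
        "num_speculative_tokens": num_speculative_tokens,
        "draft_tensor_parallel_size": 1,
        "disable_by_batch_size": disable_by_batch_size,
    }
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--enable-expert-parallel",
        "--speculative-config", json.dumps(speculative_config),
    ]


# Placeholder paths, same as in the command above; substitute your own checkpoint.
command = build_vllm_command("deepseek-ai/DeepSeek-V4", "/path/to/your-eagle-head")
print(shlex.join(command))
```

Because num_speculative_tokens is a parameter, the same helper can generate one launch command per value when tuning N.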
For DeepSeek V4 base deployment without speculative decoding, see the DeepSeek V4 GPU cloud guide.
SGLang Configuration for GLM-5.1
SGLang's speculative decoding API differs slightly. Use --speculative-algorithm EAGLE with --speculative-draft-model-path pointing to the draft checkpoint:
# No official EAGLE checkpoint is published for GLM-5.1.
# Train your own using the EAGLE-3 repo: https://github.com/SafeAILab/EAGLE
python -m sglang.launch_server \
--model-path zai-org/GLM-5.1 \
--tp 8 \
--speculative-algorithm EAGLE \
--speculative-draft-model-path /path/to/your-eagle-head \
--speculative-num-steps 5 \
--quantization fp8 \
--mem-fraction-static 0.88 \
--port 30000
GLM-5.1 uses a different model class internally. If SGLang complains about the model architecture at launch, add --model-impl glm (exact flag name may vary by SGLang version; check python -m sglang.launch_server --help). The --mem-fraction-static 0.88 reserves 12% of VRAM for the draft model and speculative token buffers. Note: GLM-5.1 is published under the zai-org organization (formerly THUDM/Zhipu), not THUDM.
For GLM-5.1 base deployment details, see the GLM-5.1 GPU cloud guide.
Benchmark Results: H100 and B200
FP8, batch size 1-4 (low-concurrency serving). DeepSeek V4 with shared-expert EAGLE draft head, N=5 speculative tokens.
| GPU | Mode | Tokens/sec | Acceptance Rate | P50 Latency | P99 Latency |
|---|---|---|---|---|---|
| H100 SXM5 | Standard MoE | ~900 | N/A | ~55 ms | ~120 ms |
| H100 SXM5 | Speculative (EAGLE head) | ~3,800 | 0.74 | ~18 ms | ~42 ms |
| B200 SXM6 | Standard MoE | ~1,400 | N/A | ~38 ms | ~85 ms |
| B200 SXM6 | Speculative (EAGLE head) | ~5,200 | 0.75 | ~14 ms | ~32 ms |
At batch size 16+, speculative decoding gains narrow to near-zero. Set "disable_by_batch_size": 16 inside --speculative-config to automatically fall back.
B200 SXM6 note: B200 SXM6 at $7.00/hr on-demand shows the largest absolute throughput gains from speculative decoding because its higher HBM3e bandwidth makes memory-bound draft generation cheaper relative to verification. The draft head runs faster on B200 than on H100, so the overhead of the draft step is a smaller fraction of total step time.
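To compare modes on cost rather than raw throughput, the hourly price can be converted into a price per million generated tokens. The sketch below is plain arithmetic. The example assumes the H100 throughput figures apply to a whole 8-GPU deployment billed at 8 x $3.90/hr, which the benchmark table does not state explicitly, so treat the printed values as illustrative.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly price and a sustained token rate into dollars per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600.0
    return price_per_hour / tokens_per_hour * 1_000_000.0


# Assumption: 8 GPUs at $3.90/hr each, throughput measured for the whole deployment.
cluster_price = 8 * 3.90
for mode, tokens_per_second in (("standard", 900.0), ("speculative", 3800.0)):
    cost = cost_per_million_tokens(cluster_price, tokens_per_second)
    print(f"{mode}: ${cost:.2f} per 1M tokens")
```

The ratio between the two results equals the throughput ratio, so the comparison holds for any price as long as both modes run on the same hardware.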
For kernel-level MoE optimizations that compound with speculative decoding gains, see DeepEP and DeepGEMM MoE inference kernels.
When NOT to Use Speculative Decoding for MoE
- Prefill-heavy workloads. If prompt tokens dominate (RAG with 8K+ context, document summarization), speculative decoding only helps the decode phase. For workloads where 80%+ of tokens are prompt, TTFT matters more than token generation speed, and speculative decoding adds no benefit there.
- Batch sizes above 16 concurrent requests. At high concurrency, draft acceptance rates drop and the verification pass costs are amplified by expert routing. Set "disable_by_batch_size": 16 in --speculative-config as the cutoff. Below 16 you gain; above 16 you often lose throughput versus standard MoE decoding.
- Long-context inference over 32K tokens. KV cache pressure at long context reduces available GPU memory for draft model weights and speculative token buffers. At 128K context, the KV cache can consume 40 to 60% of VRAM, leaving too little room for the draft head. For strategies to reduce KV cache memory pressure, see the KV cache optimization guide.
- High-temperature sampling or diverse generation. Speculative decoding acceptance rates assume the draft and target models agree on likely tokens. At temperature 1.0+, the target model's distribution is more uniform and acceptance rates drop to 0.40 to 0.50, making draft generation cost barely worthwhile.
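The four conditions above can be folded into a small pre-deployment check. This is a sketch, not a tuning tool: the `Workload` fields and function name are hypothetical, and the thresholds are simply the ones quoted in the list (80% prompt tokens, 16 concurrent requests, 32K context, temperature 1.0).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    prompt_token_fraction: float  # share of all tokens that are prompt tokens, 0 to 1
    concurrent_requests: int
    max_context_tokens: int
    temperature: float


def reasons_to_skip_speculation(workload: Workload) -> List[str]:
    """Return the reasons speculative decoding is unlikely to help; empty means go ahead."""
    reasons: List[str] = []
    if workload.prompt_token_fraction >= 0.80:
        reasons.append("prefill-heavy: decode is a small share of the work")
    if workload.concurrent_requests > 16:
        reasons.append("batch size above 16: verification cost outweighs gains")
    if workload.max_context_tokens > 32 * 1024:
        reasons.append("context over 32K: KV cache leaves little room for the draft head")
    if workload.temperature >= 1.0:
        reasons.append("temperature 1.0+: acceptance rate drops")
    return reasons


chat = Workload(prompt_token_fraction=0.3, concurrent_requests=4,
                max_context_tokens=8192, temperature=0.7)
print(reasons_to_skip_speculation(chat) or "speculative decoding looks worthwhile")
```

Treat the thresholds as starting points and confirm against measured acceptance rate and throughput on your own traffic.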
Spheron Deployment Recipe
GPU selection for MoE speculative decoding. VRAM column is total cluster VRAM.
| GPU | Config | On-Demand (cluster) | VRAM | Use Case |
|---|---|---|---|---|
| H100 SXM5 | 8x (640 GB) | $31.20/hr | 640 GB | DeepSeek V4 FP8, GLM-5.1 INT4 |
| B200 SXM6 | 4x (576 GB) | $28.00/hr | 576 GB | DeepSeek V4 FP8 (fewer GPUs, more HBM bandwidth) |
| B300 SXM6 | 4x (1152 GB) | $38.00/hr | 1152 GB | GLM-5.1 FP8, future 1T+ models |
All SXM variants come with NVLink interconnect (900 GB/s bidirectional on H100 SXM5, higher on B200 and B300). This matters for MoE speculative decoding: the all-to-all synchronization rounds during verification need fast interconnect to stay below the throughput break-even threshold.
Get started at app.spheron.ai. Full provisioning walkthrough in the Spheron deployment docs.
MoE speculative decoding squeezes the most throughput from H100 and B200 hardware. Spheron H100 and B200 instances come with NVLink interconnects and expert-parallelism-ready configurations, so you can run this setup without dealing with bare-metal provisioning.
Quick Setup Guide
Determine your GPU count from total parameter VRAM (not active params). DeepSeek V4 at FP8 needs ~500GB, so 4x H200 or 8x H100. On Spheron, provision via app.spheron.ai and verify NVLink topology with nvidia-smi topo -m. SXM variants with NVLink are strongly preferred for the all-to-all expert communication speculative decoding adds.
Install vLLM 0.20+ with pip install vllm --upgrade. For DeepSeek models, also install DeepGEMM: pip install git+https://github.com/deepseek-ai/DeepGEMM. Confirm the installed version with python -c 'import vllm; print(vllm.__version__)'. This prints the version only; speculative decoding support is confirmed when the server starts cleanly with --speculative-config.
Launch with --speculative-config JSON pointing to your trained EAGLE draft head, --enable-expert-parallel, and --tensor-parallel-size matching your GPU count. No official EAGLE checkpoint is published for DeepSeek V4; train your own using the EAGLE-3 repo at https://github.com/SafeAILab/EAGLE. Example: vllm serve deepseek-ai/DeepSeek-V4 --tensor-parallel-size 8 --enable-expert-parallel --speculative-config '{"model": "/path/to/your-eagle-head", "num_speculative_tokens": 5, "draft_tensor_parallel_size": 1}' --quantization fp8.
Check vllm:spec_decode_draft_acceptance_rate from the /metrics endpoint. For MoE, also watch GPU utilization balance across expert-parallel ranks using nvidia-smi dmon. A large imbalance (one GPU at 95%, others at 40%) indicates the draft head's routing predictions are off, causing load imbalance during verification.
Run python benchmarks/benchmark_serving.py --model <model> --num-prompts 200 --request-rate 2 against servers launched with num_speculative_tokens set to 3, 5, and 8 in --speculative-config. The value is a server-side setting, so relaunch the server between runs. Plot tokens/sec and P99 latency. For MoE the optimal N is usually 4-6; beyond 7 the expert routing divergence cost outweighs the acceptance gains.
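The acceptance-rate check from the monitoring step can be scripted by parsing the Prometheus text that /metrics returns. The sketch below uses only the standard library. The metric name is the one quoted above and may differ between vLLM versions; the sample payload, its label, and its value are invented for illustration; the URL assumes the server from the launch step is listening on port 8000.

```python
import urllib.request
from typing import Optional

METRIC_NAME = "vllm:spec_decode_draft_acceptance_rate"


def parse_metric(metrics_text: str, metric_name: str = METRIC_NAME) -> Optional[float]:
    """Return the first sample value for metric_name in Prometheus text format."""
    for raw_line in metrics_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            value_part = line.rsplit("}", 1)[1].split()
        else:
            fields = line.split()
            name, value_part = fields[0], fields[1:]
        if name == metric_name and value_part:
            return float(value_part[0])
    return None


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the raw metrics page from a running server."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


# Invented sample payload so the parser can be exercised without a server.
sample = """# HELP vllm:spec_decode_draft_acceptance_rate Draft acceptance rate.
# TYPE vllm:spec_decode_draft_acceptance_rate gauge
vllm:spec_decode_draft_acceptance_rate{model_name="example"} 0.74
"""
print(parse_metric(sample))
```

Against a live server, call parse_metric(fetch_metrics()) on a schedule and alert when the value falls below the threshold you chose for tuning N down.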
Frequently Asked Questions
In dense models, the draft model only needs to predict the next token. In MoE models, the draft model also needs to predict which experts the target model will route to. A token accepted by the target model's distribution but routed to different experts than the draft anticipated causes hidden divergence in the KV cache, adding verification overhead that doesn't exist in dense serving.
A draft head is a lightweight module trained on top of the shared expert outputs (the experts that always activate regardless of routing). Because shared experts run on every token, their hidden states are available for draft token prediction without triggering full expert routing, making them much cheaper to use as a draft signal than a separate model.
Speculative decoding works with expert parallelism, but it requires careful configuration. The draft model must run on the same GPU as the shared experts (or a coordinating GPU), and draft token acceptance checks must be synchronized across all expert-parallel GPUs before routing decisions commit. vLLM handles this automatically when you combine --enable-expert-parallel with --speculative-config on supported MoE architectures.
As of mid-2026, vLLM's speculative decoding supports DeepSeek V3/V4-family MoE models using EAGLE-family draft heads and Mixtral 8x7B/8x22B with small dense draft models. GLM-5.1 support is community-contributed and requires vLLM 0.20+. SGLang supports EAGLE-family speculation through --speculative-algorithm for the MoE architectures it can serve.
Skip speculative decoding for prefill-heavy workloads (long prompt, short output), batch sizes above 16-32 concurrent requests, and long-context inference over 32K tokens where KV cache pressure already limits throughput. The acceptance rate drops sharply under these conditions and you end up paying for two forward passes instead of one.
