Token generation in LLMs is memory-bandwidth-bound at low batch sizes: each forward pass loads the full model weights from VRAM to produce exactly one token. Multi-Token Prediction (MTP) gets past that limit without a separate draft model. The prediction heads ship inside the checkpoint, share the trunk's residual stream, and let the model emit more than one token per forward pass at inference: up to 2 with one head, and up to 2-4 with more heads. DeepSeek V4 and GLM-5.1 both include MTP heads in their published weights. For the broader speculative decoding landscape, see the speculative decoding production guide, the Eagle-3 deployment guide, and the DFlash block diffusion guide.
TL;DR
| Mode | Tokens/sec (H200 SXM5) | TTFT p50 | Cost per 1M tokens | Best for |
|---|---|---|---|---|
| Standard decoding | ~1,700 | ~38 ms | ~$0.54 | High concurrency, batch jobs |
| EAGLE-3 | ~6,200 | ~38 ms | ~$0.15 | Low batch, no MTP checkpoint available |
| DFlash | ~9,000 | ~8 ms | ~$0.10 | Best throughput, DFlash checkpoint required |
| MTP on (DeepSeek V4, 1 head) | ~3,400 | ~38 ms | ~$0.27 | MTP-trained models, batch sizes up to 64 |
| MTP + chunked prefill | ~3,400 | ~8 ms | ~$0.27 | Latency-sensitive MTP workloads |
DeepSeek V4 FP8, H200 SXM5, batch size 1-4. Cost formula: ($/hr / 3600) / (tokens/sec / 1_000_000). Cost column uses H200 SXM5 spot rate of $3.31/hr. MTP + chunked prefill shares the same tokens/sec as MTP alone but reduces TTFT by 70-80% via prefill-decode overlap.
Pricing fluctuates based on GPU availability. The prices above are based on 05 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
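The cost formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `cost_per_million_tokens` is illustrative, and the inputs are the spot rate and throughput figures quoted in the table.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Apply the formula ($/hr / 3600) / (tokens/sec / 1_000_000)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    dollars_per_sec = dollars_per_hour / 3600
    million_tokens_per_sec = tokens_per_sec / 1_000_000
    return dollars_per_sec / million_tokens_per_sec


# H200 SXM5 spot rate quoted above, in $/hr
SPOT_RATE = 3.31

# Throughput figures (tokens/sec) taken from the TL;DR table
for mode, tps in [("Standard", 1700), ("EAGLE-3", 6200), ("DFlash", 9000), ("MTP", 3400)]:
    print(f"{mode}: ~${cost_per_million_tokens(SPOT_RATE, tps):.2f} per 1M tokens")
```

Swapping in your own measured tokens/sec and the current hourly rate gives a comparable figure for your workload.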
What Multi-Token Prediction Is
Standard autoregressive decoding generates one token per forward pass. For each step, the transformer runs the full forward pass through all N layers to produce a single probability distribution, samples one token, appends it to the sequence, and repeats. Every token costs one full model pass.
MTP is a training-time change. During training, the model is taught to predict not just the next token (token+1) but also later tokens (token+2, and with more heads token+3 and token+4) simultaneously. This is done by adding N lightweight "future token heads" on top of the main trunk. Each head is a single transformer decoder block that shares the trunk's residual stream. The heads are trained jointly with the main model: the main output head predicts token+1 as usual, the first MTP head predicts token+2, and so on.
At inference, the future-token heads are already loaded with the checkpoint. After the trunk runs, the main output head produces token+1 and the first MTP head proposes token+2 in the same forward pass. The main model still verifies each proposed token (the verification is a parallel pass across all proposals) and discards any proposal it would not have produced itself, so there is no quality tradeoff. Accepted tokens accumulate faster than one per pass: up to 2x the standard rate with one head, and more with additional heads.
The key detail: the heads are baked into the model checkpoint. There is no second model to load, no tokenizer alignment issue, and no draft checkpoint to maintain. Check whether a model supports MTP by inspecting its config.json:
import json

# Read the checkpoint's config.json (assumed to be in the current directory)
with open("config.json") as f:
    cfg = json.load(f)
# A missing key is reported as 0 heads
print("MTP heads:", cfg.get("num_nextn_predict_layers", 0))
A value of 0 means the model was trained without MTP heads. A value of 1 or 2 means 1 or 2 future-token heads are present. DeepSeek V4 ships with 1 MTP head (num_nextn_predict_layers: 1), which enables 2-token generation per forward pass. GLM-5.1 ships with 1 head.
If the field is absent from config.json, check generation_config.json for speculative_decoding settings, or read the model card README for MTP support confirmation. Some DeepSeek forks on Hugging Face omit this key but still include MTP weights.
MTP vs EAGLE-3 vs Medusa vs DFlash
| Method | Draft source | Extra GPU memory | Composes with prefix cache | Acceptance rate range | Best batch size |
|---|---|---|---|---|---|
| MTP | Heads in main checkpoint | 0 GB (shared trunk) | Yes, natively | 0.65-0.80 | 1-64 |
| EAGLE-3 | Separate 1-3B draft head checkpoint | 2-4 GB | Yes | 0.80-0.88 | 1-32 |
| Medusa | Parallel heads on target model | 1-2 GB | Partial | 0.60-0.75 | 1-16 |
| DFlash | Block diffusion draft checkpoint | 3-6 GB | Yes | 0.75-0.85 | 1-32 |
EAGLE-3 achieves higher acceptance rates than MTP on most tasks because its draft head is trained after the fact, specifically on the finished target model's activations. MTP heads are trained jointly with the model and see less task-specific supervision. The tradeoff: EAGLE-3 needs a matching draft checkpoint per model, while MTP needs nothing extra. See the Eagle-3 deployment guide for EAGLE-3 setup and the DFlash guide for DFlash's block diffusion approach.
For models where EAGLE-3 heads exist, EAGLE-3 delivers more throughput. For models where only MTP heads are available (or where loading a draft model into VRAM is not feasible), MTP is the right default.
Why MTP Composes with Continuous Batching
The main practical advantage of MTP over runtime speculative decoding is that MTP has no second model in VRAM.
EAGLE-3 and DFlash require a separate draft checkpoint to be loaded alongside the target model. At high batch sizes, the draft model's memory footprint competes with the target model's KV cache. Continuous batching in vLLM uses paged attention to pack requests efficiently, but that paging strategy has to account for the draft model's own KV cache too.
MTP heads share the main model's KV cache without a separate KV structure. The future-token heads read from the same key-value tensors that the trunk already computed. No additional KV pages are allocated per request for MTP. No tokenizer alignment issue exists because there is no second tokenizer. The paged attention mechanism is unchanged.
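The result: at batch sizes of 32-64, where EAGLE-3 acceptance rates start to fall and the draft model's cache pressure becomes noticeable, MTP still delivers a throughput gain because it adds no memory pressure. Expect that gain to be smaller than the roughly 2x seen at batch size 1 (closer to 1.2-1.4x at batch size 32 and above).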
MTP-Enabled Model Architecture
The MTP architecture is a shared trunk with N future-token heads attached after the final layer and chained one after another.
The trunk runs as normal: input tokens pass through all transformer layers, producing a sequence of hidden states. The final hidden state at each position is the trunk's compressed representation of what comes next. In standard decoding, that state feeds directly into the vocabulary projection (lm_head) to produce token+1's probability distribution.
In an MTP model, the final hidden state also feeds into head 1 (a single lightweight decoder block). Head 1 produces a transformed representation, which feeds into its own vocabulary projection to predict token+2. If there are 2 MTP heads, head 2 takes head 1's output and predicts token+3. Each head is a single-layer transformer decoder block with approximately 1B parameters.
During inference, the framework runs: trunk → lm_head (token+1) + head1 (token+2) + head2 (token+3) in a single forward pass. Then it verifies all proposed tokens against the reference distribution from the trunk. Accepted tokens are appended and the next step begins.
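The accept-or-reject step described above can be sketched for the greedy case. This is a simplified illustration, not the vLLM or SGLang implementation: the function name `accept_greedy` is hypothetical, and it assumes greedy decoding, where a drafted token is accepted only if it equals the token the main model would have picked at that position. Frameworks use a probabilistic acceptance rule when sampling with a temperature.

```python
from typing import List, Sequence


def accept_greedy(drafted: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the tokens to append after one verification pass.

    drafted: tokens proposed by the MTP heads for consecutive future positions.
    target:  the main model's own greedy choice at each of those positions,
             plus one extra position, so len(target) == len(drafted) + 1.
    """
    if len(target) != len(drafted) + 1:
        raise ValueError("target must have exactly one more entry than drafted")
    accepted: List[int] = []
    for i, token in enumerate(drafted):
        if token != target[i]:
            # First mismatch: keep the main model's token and drop later drafts.
            accepted.append(target[i])
            return accepted
        accepted.append(token)
    # Every draft matched, so the pass also yields one extra token for free.
    accepted.append(target[len(drafted)])
    return accepted


# One MTP head: a matching draft yields 2 tokens, a mismatch yields 1.
print(accept_greedy([5], [5, 9]))
print(accept_greedy([5], [7, 9]))
```

Because every appended token is one the main model would have chosen anyway, the output matches standard greedy decoding; only the number of tokens gained per pass changes.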
How vLLM, SGLang, and TensorRT-LLM Detect MTP Heads
vLLM
vLLM detects MTP heads automatically by reading num_nextn_predict_layers from config.json at model load time. You activate MTP inference by passing --speculative-config as a JSON object with method: "mtp".
Minimum vLLM version: v0.6.0 for DeepSeek V4 MTP support. For GLM-5.1 MTP heads, use v0.8.0 nightly (the stable v0.8.0 release may not include GLM-5.1 MTP head handling). Check the vLLM changelog before deploying GLM-5.1.
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-V4 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --speculative-disable-by-batch-size 64
The --speculative-disable-by-batch-size 64 flag tells vLLM to automatically fall back to standard decoding when the batch size exceeds 64, where MTP's per-token gain can be outweighed by verification overhead.
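MTP is transparent to clients: requests go to the same OpenAI-compatible endpoint whether or not speculative decoding is on. The sketch below builds such a request with the standard library only. The helper names `build_completion_request` and `send` are illustrative, and the host and port assume the `-p 8000:8000` mapping from the command above.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the server's /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding keeps runs comparable
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


req = build_completion_request(
    "http://localhost:8000", "deepseek-ai/DeepSeek-V4", "Explain MTP in one sentence."
)
print(req.full_url)
# With the server running: print(send(req))
```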
SGLang
SGLang handles MTP via its speculative algorithm configuration. Use --speculative-algorithm MTP with --speculative-num-steps set to the number of future heads in the model.
For GLM-5.1 with 1 MTP head:
python -m sglang.launch_server \
  --model-path THUDM/GLM-5.1 \
  --speculative-algorithm MTP \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 4 \
  --tensor-parallel-size 4 \
  --port 30000
The --speculative-eagle-topk 4 setting controls how many candidate token sequences are evaluated per step. Higher values increase acceptance rate at the cost of more compute per verification pass. For GLM-5.1 at batch size 1-8, topk 4 with 1 step gives the best balance between acceptance rate and verification cost.
TensorRT-LLM
TensorRT-LLM handles MTP heads through the Medusa decoding path. Set decoding_config.decoding_type = "MEDUSA" in the TRT-LLM engine build config, pointing to the model's own MTP head weights. The engine builder reads num_nextn_predict_layers to determine how many head layers to fuse. TRT-LLM's fused MTP implementation achieves lower per-step latency than vLLM on the same hardware because the CUDA kernels for head verification are compiled at build time. For the full TRT-LLM engine build pipeline, see the TensorRT-LLM production deployment guide.
Deploying DeepSeek V4 with MTP on Spheron H200
DeepSeek V4 is a 671B MoE model with 37B active parameters. At FP8 precision with 4-GPU tensor parallelism, it fits on four H200 SXM5 GPUs (141 GB each) with room for the KV cache.
Instance Setup
Provision a 4x H200 SXM5 instance through H200 GPU rental on Spheron. After launch, verify GPU access:
nvidia-smi
# Expected: 4x H200 SXM5 visible, each 141 GB
Install Docker with the NVIDIA container runtime if not pre-installed on the image. Pull the vLLM image:
docker pull vllm/vllm-openai:latest
Launch Command
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-V4 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096 \
  --speculative-disable-by-batch-size 64
DeepSeek V4 ships with 1 MTP head (num_nextn_predict_layers: 1), so num_speculative_tokens: 1 is the correct setting. Setting it to 2 with only 1 head causes an error at model load time.
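One way to avoid that load-time error is to derive the flag value from the head count instead of typing the JSON by hand. The helper below is a hypothetical convenience, not part of vLLM; it only encodes the rule stated above, that `num_speculative_tokens` must not exceed `num_nextn_predict_layers`.

```python
import json
from typing import Optional


def speculative_config_arg(num_mtp_heads: int,
                           num_speculative_tokens: Optional[int] = None) -> str:
    """Return the JSON string to pass to --speculative-config.

    Defaults to one speculative token per MTP head and rejects values that
    exceed the head count reported in config.json.
    """
    if num_mtp_heads < 1:
        raise ValueError("checkpoint reports no MTP heads")
    tokens = num_mtp_heads if num_speculative_tokens is None else num_speculative_tokens
    if not 1 <= tokens <= num_mtp_heads:
        raise ValueError(
            f"num_speculative_tokens={tokens} is outside 1..{num_mtp_heads}"
        )
    return json.dumps({"method": "mtp", "num_speculative_tokens": tokens})


# DeepSeek V4 reports num_nextn_predict_layers: 1
print(speculative_config_arg(1))
```

Failing in a launch script this way is cheaper than waiting for a multi-GPU model load to abort.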
Validation
After the server starts, check acceptance rate via /metrics:
curl http://localhost:8000/metrics | grep spec_decode_draft_acceptance_rate
Target acceptance rate: above 0.70 on instruction-following prompts. For DeepSeek V4 on chat and coding tasks, expect 0.72-0.78. For math reasoning tasks with varied output distributions, expect 0.65-0.72.
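For automated checks, the same value can be pulled out of the /metrics text in Python. This is a minimal sketch: `read_metric` is an illustrative helper, it returns only the first matching series, and the sample text below is made up for the example (the label name in it is an assumption, not captured server output).

```python
from typing import Optional


def read_metric(metrics_text: str, name_suffix: str) -> Optional[float]:
    """Return the first sample whose metric name ends with name_suffix."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric_name = line.split("{", 1)[0].split(" ", 1)[0]
        if not metric_name.endswith(name_suffix):
            continue
        # The value is the first field after the optional {labels} section.
        tail = line.rsplit("}", 1)[-1] if "}" in line else line[len(metric_name):]
        return float(tail.split()[0])
    return None


# Illustrative sample only; real output contains many more series.
SAMPLE = """\
# HELP vllm:spec_decode_draft_acceptance_rate Speculative token acceptance rate.
# TYPE vllm:spec_decode_draft_acceptance_rate gauge
vllm:spec_decode_draft_acceptance_rate{model_name="deepseek-ai/DeepSeek-V4"} 0.74
"""

rate = read_metric(SAMPLE, "spec_decode_draft_acceptance_rate")
print("acceptance rate:", rate, "| meets 0.70 target:", rate is not None and rate >= 0.70)
```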
Run a throughput benchmark to confirm the MTP gain:
# With MTP
python benchmark_serving.py --model deepseek-ai/DeepSeek-V4 --num-prompts 100 --request-rate 4
# Without MTP (for comparison): remove --speculative-config from the server launch command
Expected: 1.8-2.2x tokens/sec improvement over standard decoding at batch size 1-4.
Pricing
| Billing mode | H200 SXM5 rate | MTP tok/s | Cost per 1M tokens |
|---|---|---|---|
| On-demand | $5.92/hr | ~3,400 | ~$0.48 |
| Spot | $3.31/hr | ~3,400 | ~$0.27 |
Deploying GLM-5.1 with MTP on Spheron B200
GLM-5.1 (the 72B-Chat variant, THUDM/GLM-5.1-72B-Chat) has 1 MTP head. On Spheron B200 instances (192 GB HBM3e per GPU), the model fits comfortably on two B200 SXM6 GPUs at FP8 with space for a long-context KV cache.
Launch Command (SGLang)
python -m sglang.launch_server \
  --model-path THUDM/GLM-5.1-72B-Chat \
  --speculative-algorithm MTP \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 4 \
  --tensor-parallel-size 2 \
  --port 30000 \
  --mem-fraction-static 0.88
GLM-5.1's 1 MTP head allows --speculative-num-steps 1. The head proposes one future token, so the server generates up to 2 tokens per forward pass (1 from the trunk + 1 from the MTP head).
Tuning Prediction Depth
Set speculative-num-steps to match the model's num_nextn_predict_layers. For GLM-5.1-72B-Chat with 1 MTP head, steps: 1 is both the maximum and the right default:
- steps: 1 (token+2 only): acceptance rate ~0.75, ~1.7x throughput
- steps: 2 or higher: only supported if num_nextn_predict_layers >= 2 in config
Start at steps: 1 and monitor the acceptance rate from /metrics. If it falls below 0.55 on your workload, your distribution may be high-entropy and MTP will add limited benefit.
Pricing
| Billing mode | B200 SXM6 rate | MTP tok/s (GLM-5.1) | Cost per 1M tokens |
|---|---|---|---|
| On-demand | $8.61/hr | ~4,600 | ~$0.52 |
| Spot | $5.34/hr | ~4,600 | ~$0.32 |
Latency and Throughput Benchmarks
The benchmarks below are at batch size 1-4 (the sweet spot for MTP). At batch size 32+, MTP throughput gain compresses toward 1.2-1.4x.
| GPU | Model | Batch | Standard tok/s | MTP tok/s | Speedup | MTP TTFT p50 | Cost/1M tokens (MTP, spot) |
|---|---|---|---|---|---|---|---|
| H200 SXM5 | DeepSeek V4 FP8 | 1 | ~1,700 | ~3,400 | 2.0x | ~38 ms | ~$0.27 |
| H200 SXM5 | DeepSeek V4 FP8 | 4 | ~2,800 | ~4,200 | 1.5x | ~42 ms | ~$0.22 |
| H200 SXM5 | GLM-5.1 FP8 | 1 | ~2,100 | ~3,600¹ | ~1.7x | ~35 ms | ~$0.26¹ |
| B200 SXM6 | DeepSeek V4 FP8 | 1 | ~2,200 | ~4,400 | 2.0x | ~28 ms | ~$0.34 |
| B200 SXM6 | GLM-5.1 FP8 | 1 | ~2,700 | ~4,600¹ | ~1.7x | ~25 ms | ~$0.32¹ |
¹ GLM-5.1 figures reflect 1 MTP step (num_nextn_predict_layers: 1). Actual throughput and cost depend on acceptance rate for your workload.
MTP improves decode throughput, not prefill throughput. The TTFT values above are without chunked prefill. Pair MTP with --enable-chunked-prefill and --max-num-batched-tokens 4096 to reduce TTFT by 70-80% on long-context inputs without changing the tok/s numbers.
Acceptance Rates: When MTP Pays Off
MTP acceptance rate is the fraction of proposed future tokens that the verification pass accepts. Higher acceptance means more tokens per forward pass. Lower acceptance means overhead from verification with fewer actual gains.
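The link between acceptance rate and tokens per forward pass can be made concrete with a small model. The sketch below assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection; real acceptance varies by position and prompt, so treat the result as a rough upper-level estimate. The function name is illustrative.

```python
def expected_tokens_per_pass(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per forward pass under the independence assumption.

    One token always comes from the main output head. The i-th drafted token
    counts only if it and every draft before it were accepted, which happens
    with probability acceptance_rate ** i.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    return sum(acceptance_rate ** i for i in range(num_draft_tokens + 1))


# One MTP head at a range of acceptance rates
for rate in (0.55, 0.65, 0.75):
    print(f"acceptance {rate:.2f}: {expected_tokens_per_pass(rate, 1):.2f} tokens per pass")
```

Measured throughput gain sits somewhat below this number because each pass also pays for running the MTP head and verifying its proposals.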
High-acceptance scenarios (MTP delivers well):
- Instruction following and chat with predictable response patterns
- Structured JSON generation with known schemas
- Chain-of-thought reasoning where the model follows a consistent template
- Code completion with repetitive syntax
Low-acceptance scenarios (MTP may help marginally or not at all):
- Sampling from high-temperature distributions (temperature > 1.0)
- Creative writing with wide vocabulary exploration
- Long-tail token generation (uncommon terms, multilingual outputs with mixed scripts)
- Code generation with unusual syntax or libraries the model rarely uses
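Rule of thumb: if your acceptance rate is below 0.60, reduce num_speculative_tokens before disabling MTP entirely. On a model with two or more MTP heads, dropping from 2 speculative tokens to 1 often keeps about 70% of the throughput gain while raising acceptance to 0.68-0.72. On a single-head model such as DeepSeek V4 or GLM-5.1, num_speculative_tokens is already 1, so the remaining option is to disable MTP for that workload.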
Check acceptance rate live via Prometheus PromQL:
rate(vllm:spec_decode_num_accepted_tokens[1m])
  / rate(vllm:spec_decode_num_draft_tokens[1m])
Or via direct metric scrape:
curl -s http://localhost:8000/metrics \
  | grep spec_decode_draft_acceptance_rate
The --speculative-disable-by-batch-size 64 flag (vLLM) handles the automatic fallback for high-concurrency traffic. At batch sizes above 64, vLLM silently reverts to standard decoding, so your throughput does not regress on burst traffic.
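Without a Prometheus server, the PromQL ratio above can be approximated by scraping the two counters twice and dividing the deltas. This is a rough sketch with an illustrative function name; unlike PromQL's rate(), it does not compensate for counter resets and simply returns None when a counter went backwards or no tokens were drafted in the window.

```python
from typing import Optional


def windowed_acceptance_rate(accepted_start: float, accepted_end: float,
                             drafted_start: float, drafted_end: float) -> Optional[float]:
    """Acceptance rate over a window, from two scrapes of the cumulative counters."""
    accepted = accepted_end - accepted_start
    drafted = drafted_end - drafted_start
    if accepted < 0 or drafted <= 0:
        # Counter reset (server restart) or an idle window: no meaningful ratio.
        return None
    return accepted / drafted


# Two scrapes taken one minute apart (made-up counter values)
print(windowed_acceptance_rate(100, 175, 200, 300))
```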
Combining MTP with KV Cache Offloading and PD-Disaggregation
MTP + KV Cache Offloading
MTP composes cleanly with CPU KV cache offloading. The MTP heads read from GPU-resident trunk activations only; the KV cache layout for the main model is unchanged. Configure KV transfer alongside MTP in vLLM using --kv-transfer-config:
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
--kv-transfer-config '{"kv_connector": "SharedStorageConnector", "kv_role": "kv_both"}'
The KV offloading config is independent of the MTP config; they do not interact at the memory management layer.
MTP + Prefill-Decode Disaggregation
In a prefill-decode disaggregated architecture, prefill nodes handle the prompt encoding step and decode nodes handle token generation. MTP only accelerates decode. Prefill nodes run standard forward passes and send KV caches to decode nodes.
Enable MTP only on decode nodes. Prefill nodes should omit the --speculative-config flag entirely. This prevents MTP head overhead on prefill-heavy workloads where the accept-reject cycle would add latency without benefit.
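In a deployment script, that rule reduces to one branch on the node role. The sketch below is a hypothetical helper that only assembles the server arguments discussed in this article; it leaves out the KV transfer settings a disaggregated setup also needs.

```python
import json
from typing import List


def vllm_args_for_role(role: str, model: str, num_speculative_tokens: int = 1) -> List[str]:
    """Build vLLM server arguments for a prefill or decode node.

    Only decode nodes receive --speculative-config; prefill nodes omit it.
    """
    if role not in ("prefill", "decode"):
        raise ValueError(f"unknown role: {role!r}")
    args = ["--model", model, "--enable-chunked-prefill"]
    if role == "decode":
        spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
        args += ["--speculative-config", json.dumps(spec)]
    return args


for node_role in ("prefill", "decode"):
    print(node_role, vllm_args_for_role(node_role, "deepseek-ai/DeepSeek-V4"))
```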
For the full PD-disaggregation architecture on Spheron, see the vLLM production deployment guide.
Spheron Pricing Breakdown: Cost per Million Tokens
| GPU | OD price | Spot price | MTP tok/s | Cost/1M OD | Cost/1M spot | vs AWS H100 est. |
|---|---|---|---|---|---|---|
| H200 SXM5 (4x) | $5.92/hr | $3.31/hr | ~3,400 | ~$0.48 | ~$0.27 | ~4x cheaper |
| B200 SXM6 (2x) | $8.61/hr | $5.34/hr | ~4,600 | ~$0.52 | ~$0.32 | ~3.7x cheaper |
AWS H100 baseline derived from AWS p5.48xlarge list price at the time of writing. The AWS figure represents on-demand H100 without MTP (MTP heads are not available on managed inference APIs), so the per-token comparison accounts for both the hardware price difference and the MTP throughput multiplier.
Spheron provisions compute from 5+ providers through data center partners globally, giving you access to H200 and B200 inventory across multiple regions. See full GPU pricing → for live rates across all GPU types.
Migration Path from EAGLE-3 or Medusa
If you are currently running EAGLE-3 or Medusa and want to evaluate MTP, the decision depends on what your current model supports.
Your model has MTP heads (DeepSeek V4, GLM-5.1):
Disable the external draft model and enable MTP first. MTP heads and external draft models conflict in vLLM: you cannot run --speculative-model (EAGLE-3 checkpoint) and --speculative-config '{"method": "mtp"}' simultaneously. Remove the --speculative-model flag and replace with the MTP config. The VRAM freed by dropping the draft model can go to a larger KV cache or higher --gpu-memory-utilization.
Your model has no MTP heads:
Stay on EAGLE-3 if a draft checkpoint exists for your model. EAGLE-3 delivers higher acceptance rates than MTP on most tasks. If no EAGLE-3 checkpoint exists for your model, try Medusa or wait for MTP-trained versions of your model to ship. The Eagle-3 deployment guide covers how to train a custom EAGLE-3 head in 2-4 hours on 4xH100 instances.
You want to combine both:
Not currently supported. Running MTP heads and an external draft model simultaneously is not a valid vLLM configuration. Pick one approach per model deployment. For a comparison of all active acceleration methods, including EAGLE-3, DFlash, and MTP, see the speculative decoding production guide.
MTP turns the multi-token heads already in your model weights into a direct throughput multiplier. On Spheron H200 and B200 instances, it cuts cost per million tokens by roughly half at low batch sizes, with no quality tradeoff and zero extra VRAM overhead.
Quick Setup Guide
Launch an H200 SXM5 (141 GB) or B200 SXM6 (192 GB) instance on Spheron. Verify GPU access with nvidia-smi. Install Docker with the NVIDIA container runtime.
Pull DeepSeek V4 or GLM-5.1 from Hugging Face. Confirm the model has MTP heads by inspecting config.json for num_nextn_predict_layers > 0. Run: python -c "import json; cfg=json.load(open('config.json')); print(cfg.get('num_nextn_predict_layers', 0))".
Launch vLLM with --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' (for DeepSeek V4, which has 1 MTP head) or adjust num_speculative_tokens to match the model's num_nextn_predict_layers. Add --enable-chunked-prefill for TTFT reduction.
If you serve with SGLang instead, pass --speculative-algorithm MTP and --speculative-num-steps 1 in the launch command. For GLM-5.1, set --speculative-eagle-topk 4 and --speculative-num-steps 1 to match its single MTP head.
Query /metrics and check vllm:spec_decode_draft_acceptance_rate. Target above 0.70. Run benchmark_serving.py with and without MTP enabled to measure actual throughput delta. Acceptance rates below 0.55 indicate a distribution mismatch; reduce num_speculative_tokens first.
Frequently Asked Questions
What is Multi-Token Prediction (MTP)?
Multi-Token Prediction (MTP) is a training-time technique where the model learns to predict multiple future tokens simultaneously using additional output heads that share the trunk. At inference, these extra heads let the model generate 2-4 tokens per forward pass instead of one. Unlike draft-model speculative decoding, MTP requires no separate draft model: the prediction heads are baked into the main model weights and activate at no extra GPU cost.
How does MTP differ from speculative decoding?
Speculative decoding (EAGLE-3, Medusa, DFlash) uses a separate draft model or head trained after the fact to propose candidates that the target model verifies. MTP trains prediction heads jointly with the main model from scratch. MTP composes cleanly with continuous batching and prefix caching with no additional GPU, while speculative decoding needs a second model loaded into VRAM.
Which models ship with MTP heads?
As of 2026, DeepSeek V3.x, DeepSeek V4, GLM-5.1, and several Llama 4 variants include MTP heads in their published weights. The MTP heads are standard transformer decoder layers sharing the trunk's residual stream. Check the model's config.json for a num_nextn_predict_layers field to confirm MTP support.
Do vLLM and SGLang support MTP natively?
Yes. vLLM detects MTP heads automatically from the model config and activates them when --speculative-config with method: mtp is passed. No separate checkpoint is needed. SGLang similarly supports MTP via --speculative-algorithm MTP in its launch configuration.
When does MTP not help?
MTP acceptance rates fall on high-entropy distributions: long-tail tokens, heavily creative prompts, and code generation with unusual syntax. At batch sizes above 64, the batched verification cost can reduce the per-token gain. For those workloads, MTP may add marginal overhead. The vLLM --speculative-disable-by-batch-size flag handles automatic fallback.
Does MTP work with KV cache offloading and prefix caching?
Yes. MTP composes with CPU KV cache offloading and prefix caching because the extra MTP heads operate on the GPU-resident trunk activations only. The KV cache management is unchanged. Configure --kv-transfer-config as you would for a standard model; MTP heads do not alter the KV cache layout.
