Token generation on large language models is memory-bandwidth-bound. The GPU loads model weights from VRAM for every single token it generates, one at a time. That is the fundamental bottleneck. Speculative decoding exploits a key insight: verifying whether multiple candidate tokens are correct costs roughly the same as generating one token from scratch. A small draft model proposes N tokens cheaply; the target model checks all N in a single forward pass. When the draft matches, you get N tokens for the price of one verification step.
This guide covers the theory, the EAGLE-3 and P-EAGLE variants that push acceptance rates significantly higher, GPU requirements for H100 setups, and working configurations for both vLLM and SGLang. For a framework comparison covering throughput and latency numbers across all three major engines, see vLLM vs TensorRT-LLM vs SGLang benchmarks. For the full vLLM production setup guide, see the vLLM production deployment guide. If you're setting up a fresh GPU instance, Spheron's LLM quick-guides walk through first deployment.
TL;DR
| Mode | Tokens/sec | TTFT p50 | Cost per 1M output tokens | Best for |
|---|---|---|---|---|
| Standard decoding | ~1,200 | ~45 ms | ~$0.46 | High concurrency (32+ req), batch jobs |
| Draft model (Llama 3.2 1B) | ~2,600 | ~20 ms | ~$0.21 | Low-concurrency chat, interactive APIs |
| EAGLE-3 | ~3,600 | ~15 ms | ~$0.15 | Instruction-following, coding, agents |
Numbers based on Llama 3.3 70B Instruct at FP8, H100 PCIe at $2.01/hr, batch size 1-4. Cost formula: ($/hr / 3600) / (tokens/sec / 1_000_000).
How Speculative Decoding Works
Standard autoregressive decoding generates one token per forward pass. Each pass loads all 70 billion weights from VRAM into compute units. The GPU spends most of its time on memory loads, not computation. Throughput is determined by how fast you can transfer data across the memory bus.
Speculative decoding breaks this one-token-per-pass constraint.
A small draft model generates N candidate tokens quickly (say, 5 tokens). Then the large target model takes those 5 candidate tokens and verifies all of them in a single forward pass. If a candidate matches what the target model would have generated, it is accepted. At the first mismatch, everything from that point forward is discarded, and the target model generates the correct continuation.
The expected throughput gain depends on the acceptance rate alpha. If alpha is 0.8 (the draft model is right 80% of the time), expected accepted tokens per step is (1 - alpha^(N+1)) / (1 - alpha). At alpha = 0.8 with N = 5, you accept about 3.7 tokens per target model forward pass instead of 1. That is a 3.7x speedup before accounting for the draft model's own inference cost.
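That expectation follows from a geometric series over the probability that the first i draft tokens all survive verification. A quick numeric check (a toy calculation assuming a constant per-token acceptance rate, which real traffic only approximates):

```python
def expected_accepted_tokens(alpha: float, n: int) -> float:
    """Expected tokens gained per target forward pass with n draft
    tokens and constant per-token acceptance rate alpha."""
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^n: each extra
    # term is the probability one more draft token survives verification.
    return (1 - alpha ** (n + 1)) / (1 - alpha)

for alpha in (0.5, 0.6, 0.7, 0.8):
    print(f"alpha={alpha}: {expected_accepted_tokens(alpha, 5):.2f} tokens/step")
```

At alpha = 0.5 the gain is under 2x, which is why low-acceptance workloads barely benefit; the curve steepens sharply above 0.7.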
Here is the draft-verify loop in concrete terms:
Draft model: [tok1] [tok2] [tok3] [tok4] [tok5] <-- fast generation
Target model: ✓ ✓ ✓ ✗ <-- verified in 1 forward pass
Result: tok1, tok2, tok3 accepted; regenerate from tok4

The output is mathematically identical to standard autoregressive decoding. The target model's distribution is preserved exactly; any rejected token is replaced by the target model's actual output at that position. No quality tradeoff.
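In code, the greedy version of the draft-verify loop looks like this. This is a toy sketch: draft_next and target_next stand in for real models and return one token per call, and greedy matching replaces the rejection-sampling step production engines use to preserve the target distribution under sampling:

```python
def speculative_decode(prompt, draft_next, target_next, n_draft=5, max_tokens=20):
    """Greedy draft-verify loop: propose n_draft tokens with the cheap
    model, accept the longest prefix the target model agrees with."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # Draft phase: propose n_draft candidates autoregressively (cheap).
        candidates, ctx = [], list(tokens)
        for _ in range(n_draft):
            t = draft_next(ctx)
            candidates.append(t)
            ctx.append(t)
        # Verify phase: a real engine checks all positions in ONE target
        # forward pass; here we just compare position by position.
        accepted = 0
        for i, cand in enumerate(candidates):
            if target_next(tokens + candidates[:i]) == cand:
                accepted += 1
            else:
                break
        tokens += candidates[:accepted]
        # The verification pass also yields the target's own next token
        # at the first mismatch (or after full acceptance), for free.
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]
```

Because every emitted token is either target-verified or target-generated, the output matches what greedy decoding of the target model alone would produce.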
Acceptance rate depends on how closely the draft model's distribution matches the target model's on your specific prompts. Structured outputs, code, and instruction-following tend to have high acceptance rates. Creative writing, adversarial prompts, and high-entropy tasks have lower acceptance rates.
Types of Speculative Decoding
Draft Model Speculation
The classic approach: a separate small model (1B-3B parameters) in the same architecture family as your target model. Works with any target model. Acceptance rate is typically 0.6-0.75 for instruction-following workloads, which gives 2-3x throughput. The easiest starting point; no special training required if a compatible draft model exists for your target.
Self-Speculation and Lookahead Decoding
No separate draft model: uses the target model's own lower layers or n-gram patterns from previous tokens to speculate. Zero VRAM overhead. Acceptance rates are lower than a dedicated draft model (typically 0.4-0.6), so speedups are more modest. Not widely supported in production serving frameworks yet.
Medusa
Adds parallel decoding heads to the target model. Requires either fine-tuning or a pre-trained Medusa checkpoint. Higher acceptance rates than vanilla draft models on workloads the heads were trained for. Less flexible than the draft model approach since the heads are task-specific.
EAGLE and EAGLE-2
Trains a lightweight draft model on the target model's internal feature vectors rather than token embeddings. Because the draft model sees richer representations, it predicts the target's continuations more accurately. EAGLE-2 adds dynamic candidate tree construction, where the tree width expands or contracts based on confidence. Achieves 3x+ speedup on coding and instruction-following. Pre-trained checkpoints are available on Hugging Face for Llama, Qwen, and Mistral families.
EAGLE-3 and P-EAGLE (2026)
EAGLE-3 switches from feature prediction to direct token prediction, fusing hidden states from multiple target model layers as draft head input. This yields higher acceptance rates. In practice, coding-heavy workloads see 3.5-5x speedup versus standard decoding (the EAGLE-3 paper reports ~4.8x on HumanEval for LLaMA 3.3 70B). P-EAGLE extends EAGLE with parallel drafting: instead of generating draft tokens autoregressively, the draft model produces all K candidate tokens in a single forward pass, then constructs verification trees from them. AWS contributed P-EAGLE to mainline vLLM in early 2026. On coding benchmarks, P-EAGLE reaches 4-5x speedup over standard decoding and 20-30% improvement over EAGLE-3 alone.
GPU Requirements
Both the target and draft models need to fit in VRAM simultaneously. The draft model adds 2-6 GB for a 1B-3B model in FP8 or FP16. On an H100 80GB, this is manageable. For more on GPU selection for inference workloads, see the best GPU for AI inference in 2026 guide.
| Target Model | Precision | Draft Model | GPU | Approx VRAM Used |
|---|---|---|---|---|
| Llama 3.3 70B | FP8 | Llama 3.2 1B | H100 80GB | ~76 GB |
| Llama 3.1 8B | FP16 | Llama 3.2 1B | L40S 48GB | ~26 GB |
| Llama 3.3 70B | FP8 | EAGLE3-LLaMA3.3-70B | H100 80GB | ~77 GB |
| Llama 3.1 405B | FP8 | Llama 3.2 3B | 4x H200 141GB | ~415 GB |
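A back-of-envelope check on the first row; this counts weights only, so it lands below the table's figure, which also includes KV cache and runtime overhead:

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory only; KV cache, activations, and CUDA runtime
    overhead add several more GB on top of this estimate."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

target = weights_gb(70, 1)  # Llama 3.3 70B at FP8: 1 byte per parameter
draft = weights_gb(1, 2)    # Llama 3.2 1B draft at FP16: 2 bytes per parameter
print(f"~{target:.0f} GB + ~{draft:.0f} GB = ~{target + draft:.0f} GB before KV cache")
```

Roughly 67 GB of weights on an 80 GB card leaves ~13 GB for KV cache and overhead, which is why the memory-utilization setting below matters.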
Two settings to get right: set --gpu-memory-utilization 0.94 instead of the default 0.90 so both models fit. Keep --speculative-draft-tensor-parallel-size 1 even when the target uses tensor parallelism; the draft model fits on a single GPU in all cases above.
Tutorial: Enable Speculative Decoding in vLLM
Prerequisites
Docker with NVIDIA runtime on a Spheron instance. vLLM v0.8.0 or later (recommended for stable EAGLE support). A Hugging Face token for any gated models.
Single-GPU Draft Model Setup (H100)
# --speculative-model: draft model (1B, same architecture family as target)
# --num-speculative-tokens: candidate tokens generated per draft step (tune 3-8)
# --speculative-draft-tensor-parallel-size: keep draft on 1 GPU even when target uses TP
# --gpu-memory-utilization: raised from default 0.90 to 0.94 to fit both models in VRAM
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:v0.8.0 \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--speculative-draft-tensor-parallel-size 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.94 \
--max-num-seqs 32 \
--host 0.0.0.0 \
--port 8000

Start with --num-speculative-tokens 5 and tune upward if your acceptance rate is consistently above 0.8. Increase to 7-8 for high-acceptance workloads. Reduce to 3 if acceptance rate is below 0.5.
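That tuning rule can be written down directly; the 0.8 and 0.5 thresholds and the 3-8 range come from the guidance above, and the function name is just illustrative:

```python
def recommended_draft_tokens(acceptance_rate: float, current: int = 5) -> int:
    """Adjust --num-speculative-tokens based on observed acceptance rate."""
    if acceptance_rate > 0.8:
        return min(current + 2, 8)  # high acceptance: draft further ahead
    if acceptance_rate < 0.5:
        return 3                    # poor match: cut wasted draft work
    return current                  # healthy band: leave it alone
```

Re-evaluate after each change; raising the draft length also raises the cost of every rejected speculation.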
EAGLE-3 Setup
EAGLE-3 gives meaningfully higher acceptance rates on instruction-following and coding tasks:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:v0.8.0 \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--num-speculative-tokens 6 \
--speculative-draft-tensor-parallel-size 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.94 \
--host 0.0.0.0 \
--port 8000

yuhuili/EAGLE3-LLaMA3.3-Instruct-70B is publicly available on Hugging Face and does not require a Hugging Face token. If it is unavailable or renamed, use yuhuili/EAGLE-LLaMA3-Instruct-70B (EAGLE-1 draft weights, which EAGLE-2 reuses) as a fallback.
Tutorial: Enable Speculative Decoding in SGLang
SGLang uses --speculative-algorithm EAGLE3 for EAGLE-3 checkpoints (or --speculative-algorithm EAGLE for EAGLE-1/2) and --speculative-draft-model-path. The same EAGLE checkpoints work with both frameworks.
docker run --gpus all \
--ipc=host \
-p 30000:30000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:v0.4.6.post4-cu124 \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--speculative-num-draft-tokens 6 \
--host 0.0.0.0 \
--port 30000

SGLang's speculative decoding uses tree attention rather than flat token speculation. In practice, this means SGLang can evaluate multiple speculative trees in a single pass, which helps on workloads with branching continuations. The flag names differ from vLLM (--speculative-algorithm instead of --speculative-model, --speculative-num-draft-tokens instead of --num-speculative-tokens), but the underlying EAGLE checkpoints are shared. For a deeper comparison of SGLang versus vLLM throughput and latency characteristics, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.
Benchmarks: Tokens/sec and Cost Per Token
These benchmarks use Llama 3.3 70B Instruct at FP8 on H100, single GPU, batch size 1-4. Methodology: vLLM benchmark_serving.py with 200 prompts, 512 input tokens, 256 output tokens. Pricing from Spheron live API as of 24 Mar 2026. † L40S rows use Llama 3.1 8B Instruct at FP16 (70B does not fit on a single 48 GB GPU).
| GPU | Price/hr | Mode | Tokens/sec | Cost per 1M tokens |
|---|---|---|---|---|
| H100 PCIe | $2.01 | Standard decoding | ~1,200 | ~$0.46 |
| H100 PCIe | $2.01 | Draft model (Llama 3.2 1B) | ~2,600 | ~$0.21 |
| H100 PCIe | $2.01 | EAGLE-3 | ~3,600 | ~$0.15 |
| H100 SXM5 | $2.40 | Standard decoding | ~1,500 | ~$0.44 |
| H100 SXM5 | $2.40 | EAGLE-3 | ~4,400 | ~$0.15 |
| L40S PCIe† | $2.04 | Standard decoding | ~3,800 | ~$0.15 |
| L40S PCIe† | $2.04 | EAGLE-3 | ~10,000 | ~$0.06 |
Cost per 1M tokens formula, so you can verify with current pricing:
cost_per_1M = (price_per_hour / 3600) / (tokens_per_second / 1_000_000)

For example, H100 PCIe with EAGLE-3: (2.01 / 3600) / (3600 / 1_000_000) = ~$0.155 per million output tokens.
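The same formula as a checkable function, using the H100 PCIe rows from the table:

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1M output tokens for a rented GPU at sustained throughput."""
    dollars_per_second = price_per_hour / 3600
    millions_of_tokens_per_second = tokens_per_second / 1_000_000
    return dollars_per_second / millions_of_tokens_per_second

print(round(cost_per_million_tokens(2.01, 3600), 3))  # EAGLE-3 row: 0.155
print(round(cost_per_million_tokens(2.01, 1200), 3))  # standard row: 0.465
```

Plug in live pricing and your own measured tokens/sec to reproduce or update the table.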
Pricing fluctuates based on GPU availability. The prices above are based on 24 Mar 2026 and may have changed. Check current GPU pricing → for live rates.
When NOT to Use Speculative Decoding
Speculative decoding is optimized for low-concurrency, latency-sensitive serving. There are clear cases where you should leave it disabled:
- Batch size above 32. At high concurrency the GPU is compute-bound, not memory-bandwidth-bound. The draft model forward passes become pure overhead. You are paying for two models but getting no speedup.
- Short output sequences under 50 tokens. The setup overhead of the first speculative step outweighs the savings for very short completions.
- High-entropy outputs. Random seeds, adversarial prompts, and shuffled data produce low acceptance rates. Below 0.5 acceptance rate, speculative decoding actively hurts throughput.
- Embedding and batch inference pipelines. Not relevant: you are not generating tokens autoregressively.
- VRAM-constrained GPUs serving large models. If the target model already uses 90%+ VRAM, adding a draft model causes OOM or fragments the KV cache.
Decision rule:
if batch_size > 32 or output_tokens < 50 or task == "embedding":
    use standard decoding
else:
    enable speculative decoding, tune num-speculative-tokens

Production Configuration and Monitoring
Key vLLM Flags Reference
| Flag | Recommended Value | Notes |
|---|---|---|
| --num-speculative-tokens | 5-6 | Start at 5, increase to 8 if acceptance rate >0.8 |
| --speculative-draft-tensor-parallel-size | 1 | Draft model fits on 1 GPU in all standard configs |
| --speculative-disable-by-batch-size | 32 | Auto-fallback above this concurrent request count |
| --speculative-max-model-len | 4096 | Limit draft context; saves VRAM, usually sufficient |
| --gpu-memory-utilization | 0.94 | Higher than default to fit both models |
Metrics to Monitor
# Scrape vLLM Prometheus endpoint
curl http://localhost:8000/metrics | grep spec_decode

Key metrics:
- vllm:spec_decode_draft_acceptance_rate: target above 0.70. Below 0.50 means the draft model distribution is a poor match for your traffic.
- vllm:spec_decode_efficiency: ratio of tokens generated via speculative path versus total.
- vllm:gpu_cache_usage_perc: watch for VRAM pressure when running both models simultaneously.
If acceptance rate drops below 0.5 on your production traffic, switch from a general-purpose draft model to an EAGLE-3 checkpoint trained on a similar distribution, or reduce --num-speculative-tokens to 3.
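Pulling the acceptance rate out of the /metrics output can be scripted. A sketch that assumes the standard Prometheus text exposition format and the metric name shown above; the exact label set varies by vLLM version:

```python
import re
from typing import Optional

def parse_metric(metrics_text: str, name: str) -> Optional[float]:
    """Extract a gauge value from Prometheus text-format output."""
    pattern = re.compile(
        rf'^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)$', re.MULTILINE)
    m = pattern.search(metrics_text)
    return float(m.group(1)) if m else None

# Hypothetical scrape output for illustration.
sample = 'vllm:spec_decode_draft_acceptance_rate{model="llama"} 0.72\n'
rate = parse_metric(sample, "vllm:spec_decode_draft_acceptance_rate")
if rate is not None and rate < 0.5:
    print("low acceptance: switch draft model or lower --num-speculative-tokens")
```

Wire this into an alert so a drifting traffic mix (new prompt templates, new languages) surfaces as an acceptance-rate drop rather than a silent throughput regression.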
Handling Traffic Spikes
--speculative-disable-by-batch-size 32 is the critical production setting. Without it, at 100+ concurrent requests speculative decoding can reduce throughput by 30-40% versus standard decoding. With it, vLLM automatically falls back to standard decoding when queue depth exceeds the threshold. No code changes needed; the switch is transparent to the API caller.
Speculative decoding cuts cost per output token by 50-70% on H100 and L40S instances without changing your model or upgrading GPU tier. Spheron provides bare-metal access with per-minute billing and no long-term commitment.
Rent H100 → | Rent L40S → | View all GPU pricing → | Get started on Spheron →
