Token generation on large language models is memory-bandwidth-bound at low batch sizes: the GPU loads the model weights from VRAM for every token it generates. Speculative decoding breaks this one-token-per-pass limit. A lightweight draft head proposes N candidates, the target model verifies all N in a single forward pass, and you collect ~3 accepted tokens for the cost of one target-model step. Eagle-3 is the current production standard for this approach, merged into vLLM, SGLang, and TensorRT-LLM main in early 2026. For speculative decoding fundamentals and the broader framework comparison, start with the speculative decoding guide. For the full server setup, see the vLLM production deployment guide.
TL;DR
| Mode | Tokens/sec (H200 SXM5) | TTFT p50 | Cost per 1M output tokens |
|---|---|---|---|
| Standard decoding | ~1,700 | ~38 ms | ~$0.23 |
| Eagle-2 | ~3,800 | ~38 ms | ~$0.10 |
| Medusa-2 | ~3,200 | ~38 ms | ~$0.12 |
| Eagle-3 | ~6,200 | ~38 ms | ~$0.06 |
| Eagle-3 + chunked prefill | ~6,200 | ~8 ms | ~$0.06 |
Llama 3.3 70B Instruct FP8, H200 SXM5, batch size 1-4. Cost formula: ($/hr / 3600) / (tokens/sec / 1_000_000). The cost column uses the H200 SXM5 spot rate ($1.40/hr, as advertised on the H200 rental page). Adding chunked prefill to Eagle-3 does not change tokens/sec; it cuts TTFT by 70-80% (from ~38 ms to ~8 ms) by interleaving prefill with decode.
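The cost column can be reproduced with a few lines of Python. This is a minimal sketch of the formula in the note above; the function name and the dictionary are illustrative, and the throughput figures are the approximate values from the table.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per 1M output tokens: ($/hr / 3600) / (tokens/sec / 1_000_000)."""
    dollars_per_sec = price_per_hour / 3600
    million_tokens_per_sec = tokens_per_sec / 1_000_000
    return dollars_per_sec / million_tokens_per_sec


SPOT_PRICE = 1.40  # $/hr, the H200 SXM5 spot rate quoted in the note above

# Approximate tokens/sec from the TL;DR table
THROUGHPUT = {
    "Standard decoding": 1700,
    "Eagle-2": 3800,
    "Medusa-2": 3200,
    "Eagle-3": 6200,
}

for mode, tokens_per_sec in THROUGHPUT.items():
    cost = cost_per_million_tokens(SPOT_PRICE, tokens_per_sec)
    print(f"{mode}: ${cost:.2f} per 1M output tokens")
```

Because the hourly price is fixed, the cost per token falls in direct proportion to the throughput gain.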
What Changed in Eagle-3
Eagle-2 feeds the last hidden state of the target model into the draft head at each speculation step. The draft head sees a single vector representing the final layer's activations and predicts what the target model would generate next.
Eagle-3 adds feature fusion across all transformer layers. A learned gating mechanism aggregates activations from every layer before passing the combined representation to the draft head. The draft head now conditions on the full computational context of the target model, not just its last-layer summary.
Two practical consequences:
- Acceptance rate on instruction-following tasks goes from 0.72-0.78 (Eagle-2) to 0.80-0.88 (Eagle-3) on Llama and Qwen families.
- The acceptance rate improvement is largest on tasks where the target model's earlier-layer representations carry useful information: coding, structured output, multi-step reasoning.
The draft head architecture is otherwise unchanged: a single lightweight transformer layer, the same tokenizer as the target model, and roughly 1-3B parameters depending on the target model's hidden dimension. No finetuning of the target model is needed.
Eagle-3 vs Medusa: Medusa adds parallel decoding heads directly to the target model. Each head predicts tokens at distance 1, 2, 3, etc. from the current position independently. The heads are trained jointly but do not share context with each other or with the target model's intermediate layers during speculation. In practice, Medusa acceptance rates drop faster at longer speculation lengths and on varied prompts. Eagle-3's fused feature input makes the draft head significantly more stable across different task types.
Pre-Trained Eagle-3 Draft Heads
| Target Model | Eagle-3 Checkpoint (HuggingFace) | Acceptance Rate | Notes |
|---|---|---|---|
| Llama 3.3 70B Instruct | yuhuili/EAGLE3-LLaMA3.3-Instruct-70B | 0.82-0.88 | Main production checkpoint; FP8 safe |
| Llama 4 Scout | yuhuili/EAGLE3-Llama-4-Scout-17B-16E-Instruct | 0.79-0.85 | MoE target; draft head is dense |
| Llama 4 Maverick | yuhuili/EAGLE3-Llama-4-Maverick-17B-128E-Instruct | 0.81-0.87 | 128-expert MoE; verify tokenizer match |
| Qwen 2.5 72B Instruct | yuhuili/EAGLE3-Qwen2.5-72B-Instruct | 0.80-0.86 | Strong on Chinese + code workloads |
| DeepSeek V3 671B | yuhuili/EAGLE3-DeepSeek-V3 | 0.78-0.83 | Requires multi-GPU setup for target model |
| GLM-5.1 | Training-free adaptation required | N/A | No published checkpoint as of May 2026 |
Acceptance rates are measured at batch size 1-4 on instruction-following benchmarks. Real-world rates depend on your prompt distribution. Before deploying, verify the checkpoint exists on Hugging Face and the tokenizer files match your target model's tokenizer exactly: a tokenizer mismatch produces low acceptance rates without raising an error.
For models without published Eagle-3 checkpoints, you can train a custom head using the official Eagle training repo. The process takes 2-4 hours on 4x H100s with the target model's saved activations as supervision signal. The training script is at github.com/SafeAILab/EAGLE.
Deploying Eagle-3 with vLLM on GPU Cloud
Instance Setup on Spheron
Provision an H200 SXM5 (141 GB) instance. You can also start with H100 SXM5 if H200 capacity is constrained. Install Docker with the NVIDIA container runtime:
# Verify GPU
nvidia-smi
# Pull vLLM image
docker pull vllm/vllm-openai:latest
For first deployment and instance provisioning, Spheron's LLM inference guides cover end-to-end deployment options for vLLM, SGLang, and other frameworks.
spec_decoding_config Fields
The key flags for Eagle-3 in vLLM:
| Flag | What it controls |
|---|---|
| --speculative-model | HuggingFace path or local path to the Eagle-3 draft checkpoint |
| --num-speculative-tokens | Speculation length N; start at 5, tune up or down |
| --speculative-draft-tensor-parallel-size | Keep at 1; the draft head fits on one GPU even in multi-GPU setups |
| --speculative-disable-by-batch-size | Auto-fallback threshold; set to 32 for production |
| --gpu-memory-utilization | Set to 0.94 to leave room for the draft head activation cache |
--num-speculative-tokens 5 is the right starting point for most instruction-following models. Going above 8 rarely helps because the expected accepted tokens per step plateaus. Going below 3 leaves throughput on the table.
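The plateau is easy to see with a simplified model. The sketch below assumes every draft token is accepted independently with the same probability and counts the one token the target model always contributes per step. Real acceptance depends on position and prompt, so treat the numbers as a rough guide and not a prediction; the function name is illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per target forward pass.

    Counts accepted draft tokens plus the one token the target model always
    contributes. Assumes a constant, independent per-token acceptance rate,
    which is a simplification.
    """
    a, n = acceptance_rate, num_speculative_tokens
    if a >= 1.0:
        return float(n + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n
    return (1 - a ** (n + 1)) / (1 - a)


for n in (1, 3, 5, 8, 12, 16):
    print(n, round(expected_tokens_per_step(0.82, n), 2))
```

Under this model the value can never exceed 1 / (1 - acceptance_rate), so each extra draft token buys less than the one before it while still costing draft compute.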
Complete Launch Command
Full working Docker command for Llama 3.3 70B on H200 with Eagle-3:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--tensor-parallel-size 1 \
--speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--num-speculative-tokens 5 \
--speculative-draft-tensor-parallel-size 1 \
--speculative-disable-by-batch-size 32 \
--gpu-memory-utilization 0.94 \
--max-model-len 16384 \
--max-num-seqs 256
H200 SXM5 has enough VRAM to fit Llama 3.3 70B at FP8 (~58 GB) plus the Eagle-3 draft head (~3 GB) comfortably within the 141 GB total. The --ipc=host flag is required to avoid CUDA errors under concurrent load.
On vLLM 0.10+, use the unified JSON form for forward compatibility: --speculative-config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.3-Instruct-70B","num_speculative_tokens":5,"disable_by_batch_size":32}'. The standalone flags above still work in transitional releases.
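Hand-writing JSON inside shell quotes is error-prone, so one option is to generate the argument. This is a small sketch using only the Python standard library; the helper name is illustrative, and the keys are the ones shown in the JSON form above.

```python
import json
import shlex

speculative_config = {
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
    "num_speculative_tokens": 5,
    "disable_by_batch_size": 32,
}


def speculative_config_flag(config: dict) -> str:
    """Render the --speculative-config argument, quoted for a POSIX shell."""
    payload = json.dumps(config, separators=(",", ":"))
    return f"--speculative-config {shlex.quote(payload)}"


print(speculative_config_flag(speculative_config))
```

The printed string can be pasted into the docker run command in place of the standalone speculative flags.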
Validating Acceptance Rate
After the server starts, query the metrics endpoint:
curl -s http://localhost:8000/metrics | grep spec_decode
The key metric is vllm:spec_decode_draft_acceptance_rate. Target 0.75 or higher. If acceptance is consistently below 0.65:
- Check tokenizer match first by running this in Python:
from transformers import AutoTokenizer
t1 = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
t2 = AutoTokenizer.from_pretrained("yuhuili/EAGLE3-LLaMA3.3-Instruct-70B")
# get_vocab() returns the full token-to-id mapping, including added tokens
assert t1.get_vocab() == t2.get_vocab()
If the assertion fails, the tokenizers differ and you will see a persistently low acceptance rate regardless of other tuning.
- Reduce --num-speculative-tokens to 3 and re-measure.
- If still below 0.65, the checkpoint is not well-matched to your prompt distribution. Consider training a custom head.
Also check vllm:spec_decode_num_accepted_tokens_per_step to see the actual accepted tokens per target forward pass. At acceptance rate 0.82 with N=5, you should see around 3.5-4 accepted tokens per step.
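For automated checks, the same metrics can be parsed in Python instead of read by eye. The sketch below handles the Prometheus text format with the standard library only. The sample text is hypothetical, the helper names are illustrative, and the metric names are the two mentioned above; real output carries whatever labels and values your server emits.

```python
def parse_spec_decode_metrics(metrics_text: str) -> dict[str, float]:
    """Extract spec_decode samples from Prometheus text exposition output.

    Assumes each sample line ends with its value (no trailing timestamp).
    """
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and unrelated metrics
        if not line or line.startswith("#") or "spec_decode" not in line:
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        samples[name] = float(value)
    return samples


def diagnose(samples: dict[str, float], target: float = 0.75, floor: float = 0.65) -> str:
    """Classify the acceptance rate against the thresholds used in this section."""
    rate = samples["vllm:spec_decode_draft_acceptance_rate"]
    if rate >= target:
        return "ok"
    if rate < floor:
        return "investigate"
    return "borderline"


# Hypothetical sample of what the endpoint might return
SAMPLE = """\
# TYPE vllm:spec_decode_draft_acceptance_rate gauge
vllm:spec_decode_draft_acceptance_rate{model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.83
vllm:spec_decode_num_accepted_tokens_per_step{model_name="meta-llama/Llama-3.3-70B-Instruct"} 3.6
"""

metrics = parse_spec_decode_metrics(SAMPLE)
print(metrics, diagnose(metrics))
```

In practice you would feed it the body of the curl response and alert when the result stays at "investigate" across several scrapes.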
Deploying Eagle-3 with SGLang on GPU Cloud
SGLang selects the speculative method with --speculative-algorithm. Use EAGLE3 for Eagle-3 checkpoints (EAGLE is for Eagle-1/2):
docker run --gpus all \
--ipc=host \
-p 30000:30000 \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--mem-fraction-static 0.88 \
--tp 1 \
--host 0.0.0.0 \
--port 30000
--speculative-eagle-topk 8 controls how many top-k candidate tokens the draft head considers when building the speculation tree. Higher values improve acceptance rate slightly at the cost of marginally more tree construction overhead. 8 is a good default; increase to 12 for coding workloads.
SGLang's RadixAttention KV cache is fully compatible with Eagle-3 draft verification. The speculative decode pass reuses the existing KV cache prefix for the target model's verification step without re-computing shared prefixes. This makes SGLang particularly strong for multi-turn workloads where the system prompt and conversation history repeat across requests. For the full SGLang setup and RadixAttention tuning, see the SGLang production deployment guide.
KV Cache Layout with Eagle-3
Eagle-3 does not maintain a separate KV cache. The draft head generates candidate tokens autoregressively, and the target model verifies them in a single forward pass using the standard paged attention mechanism. From the KV cache perspective, Eagle-3 adds one full forward pass worth of activations per step to the computation graph, but this does not require additional paged cache pages.
The memory budget breakdown on H200 SXM5 (141 GB) with Llama 3.3 70B FP8:
| Component | VRAM |
|---|---|
| Target model weights (FP8) | ~58 GB |
| Eagle-3 draft head weights | ~2-3 GB |
| KV cache (at 0.94 utilization) | ~71 GB |
| CUDA kernels, framework overhead | ~3 GB |
| Total | ~135-136 GB |
This leaves about 5-6 GB of headroom. Set --gpu-memory-utilization 0.94 to stay within budget. For strategies to maximize KV cache utilization and tune memory budgets, see the KV cache optimization guide.
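To translate the KV cache row into a token budget, a rough calculation helps. The sketch below assumes the Llama 3.3 70B attention shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 16-bit KV cache. Check these against the model's config.json and your --kv-cache-dtype setting before relying on the result.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.3 70B shape; verify against the model's config.json
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)

kv_budget_bytes = 71 * 10**9  # the ~71 GB KV cache row in the table above
max_cached_tokens = kv_budget_bytes // per_token
full_length_sequences = max_cached_tokens // 16384  # at --max-model-len 16384

print(f"{per_token} bytes per token")
print(f"~{max_cached_tokens} tokens fit in the KV cache")
print(f"~{full_length_sequences} concurrent sequences at full context length")
```

Shorter requests share the same pool, so the number of concurrent sequences in practice is much higher than the full-length figure.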
Throughput vs Latency Tradeoffs
Eagle-3 primarily reduces inter-token latency (ITL), not time-to-first-token (TTFT). The first token still requires a full target-model prefill pass. After that, speculative decoding delivers tokens at ~3-4x the normal rate.
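Both metrics fall out of token arrival timestamps on a streaming response. The sketch below shows the arithmetic with hypothetical timings chosen only to illustrate the point; none of the numbers are measurements. With speculative decoding, tokens tend to arrive in bursts of accepted drafts, so the mean gap is the useful summary.

```python
from statistics import mean


def ttft_and_itl(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) in the same time unit as the inputs."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute ITL")
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, mean(gaps)


# Hypothetical arrival times in milliseconds, request sent at t=0
baseline = [38.0, 48.0, 58.0, 68.0, 78.0]
speculative = [38.0, 41.0, 44.0, 47.0, 50.0]

print("baseline:   ", ttft_and_itl(0.0, baseline))
print("speculative:", ttft_and_itl(0.0, speculative))
```

The first token lands at the same time in both traces; only the spacing between later tokens changes.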
Speculation length tuning:
| --num-speculative-tokens | Acceptance rate (typical) | Avg tokens/step | When to use |
|---|---|---|---|
| 3 | ~0.84 | ~2.4 | Short completions, high-variance prompts |
| 5 | ~0.82 | ~3.5 | General instruction-following (default) |
| 8 | ~0.79 | ~5.0 | Coding, structured output, predictable distributions |
| 12 | ~0.74 | ~6.1 | Rarely worth it; marginal gain at high draft cost |
Acceptance rate decay with batch size:
| Batch size | Acceptance rate | Relative throughput vs standard |
|---|---|---|
| 1 | 0.82-0.88 | 3.4-4.1x |
| 4 | 0.79-0.85 | 3.1-3.7x |
| 16 | 0.72-0.78 | 2.2-2.8x |
| 32 | 0.62-0.70 | 1.4-1.9x |
| 64 | 0.52-0.62 | 0.9-1.2x (approaching breakeven) |
At batch size 64+, the throughput gain is negligible. Set --speculative-disable-by-batch-size 32 so vLLM falls back to standard decoding at high concurrency automatically.
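One way to pick the fallback threshold from your own measurements is to take the largest batch size whose worst-case speedup still beats standard decoding. The sketch below applies that rule to the table above; the function name and the min_speedup margin are illustrative choices, not vLLM parameters.

```python
# (low, high) relative throughput vs standard decoding, from the table above
SPEEDUP_BY_BATCH_SIZE = {
    1: (3.4, 4.1),
    4: (3.1, 3.7),
    16: (2.2, 2.8),
    32: (1.4, 1.9),
    64: (0.9, 1.2),
}


def pick_disable_threshold(table: dict[int, tuple[float, float]], min_speedup: float = 1.0) -> int:
    """Largest batch size whose lower-bound speedup still meets min_speedup.

    Returns 0 if speculation never pays off in the measured range.
    """
    safe = [batch for batch, (low, _high) in table.items() if low >= min_speedup]
    return max(safe) if safe else 0


print(pick_disable_threshold(SPEEDUP_BY_BATCH_SIZE))
```

Raising min_speedup gives a more conservative threshold if you want a safety margin over breakeven.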
TTFT vs ITL:
Eagle-3 cuts ITL by 3-4x but leaves TTFT unchanged. If TTFT is your bottleneck, chunked prefill is the fix.
Eagle-3 + Chunked Prefill + Continuous Batching
Three optimizations that compound without interfering with each other:
- Continuous batching (on by default in vLLM and SGLang): new requests join the active batch mid-generation rather than waiting for the current batch to finish. Reduces queue latency at variable traffic rates.
- Eagle-3: reduces ITL by 3-4x. All tokens after the first come faster.
- Chunked prefill (--enable-chunked-prefill): breaks long prefill passes into fixed-size chunks, letting the decode stage interleave with prefill. Reduces the TTFT spike that occurs when a long-context request monopolizes the GPU during prefill.
Before (Eagle-3 only):
docker run ... vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--num-speculative-tokens 5 \
--gpu-memory-utilization 0.94
After (stacked configuration):
docker run ... vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--num-speculative-tokens 5 \
--speculative-disable-by-batch-size 32 \
--gpu-memory-utilization 0.94 \
--enable-chunked-prefill \
--max-num-batched-tokens 4096
--max-num-batched-tokens 4096 sets the maximum tokens processed per forward pass in chunked prefill mode. 4096 balances prefill chunk size against decode interleaving frequency. Go lower (2048) if you have many concurrent short requests; go higher (8192) if you have long-context requests that should prefill quickly.
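The effect of the token budget on a long prompt can be sketched in a few lines. This is a simplified model, not vLLM's scheduler: it assumes decode tokens for already-running requests are scheduled first in each step and the prefill chunk takes whatever budget remains. The function and parameter names are illustrative.

```python
def chunk_prefill(prompt_tokens: int, max_num_batched_tokens: int, decode_tokens_per_step: int = 0) -> list[int]:
    """Split a prompt into per-step prefill chunks under a token budget.

    Decode tokens for running requests are assumed to be scheduled first;
    the prefill chunk gets the remaining budget in each step.
    """
    budget = max_num_batched_tokens - decode_tokens_per_step
    if budget <= 0:
        raise ValueError("decode tokens consume the whole per-step budget")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        step = min(budget, remaining)
        chunks.append(step)
        remaining -= step
    return chunks


# A 10,000-token prompt with an idle server, then with 96 decode tokens per step
print(chunk_prefill(10_000, 4096))
print(chunk_prefill(10_000, 4096, decode_tokens_per_step=96))
```

A smaller budget means more steps to finish the prefill, but running requests get a decode slot in every one of those steps instead of waiting for the whole prompt.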
Expected combined gains on H200 at production traffic (batch size 4-16, mixed long/short prompts):
- Eagle-3 alone: ITL ~0.3ms, TTFT unchanged at ~38ms
- Stacked (Eagle-3 + chunked prefill): ITL ~0.3ms, TTFT ~8-12ms
The TTFT reduction comes entirely from chunked prefill. For a detailed breakdown of how continuous batching and paged attention interact with speculative decoding, see the vLLM serving optimization guide.
Production Benchmarks: H100 vs H200 vs B200
| GPU | On-demand price | Eagle-3 tokens/sec | Standard tokens/sec | Speedup ratio | Cost per 1M tokens (Eagle-3) |
|---|---|---|---|---|---|
| H100 SXM5 | $3.90/hr | ~4,400 | ~1,200 | 3.7x | ~$0.246 |
| H200 SXM5 | $4.62/hr | ~6,200 | ~1,700 | 3.6x | ~$0.207 |
| B200 SXM6 | $6.73/hr | ~8,500 | ~2,400 | 3.5x | ~$0.220 |
Models tested: Llama 3.3 70B FP8, batch size 1-4, vLLM with --speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B, --num-speculative-tokens 5.
H100 SXM5, H200 SXM5, and B200 SXM6 on-demand prices from live Spheron API on 30 May 2026.
Notes:
- B200's HBM3e bandwidth (~8 TB/s vs H200's 4.8 TB/s) means the absolute tokens/sec scales roughly linearly with bandwidth. Eagle-3 acceptance rate stays stable across GPU generations since it depends on the model architecture, not hardware.
- Speedup ratios decrease slightly at higher bandwidth GPUs because the memory-bandwidth bottleneck that makes speculative decoding effective becomes less severe. The absolute tokens/sec gain is still larger on faster hardware.
- Cost per 1M tokens formula: ($/hr / 3600) / (tokens/sec / 1_000_000). At H100 SXM5 $3.90/hr on-demand and 4,400 tok/s with Eagle-3: (3.90/3600)/(4400/1000000) = $0.246/M tokens, vs $0.903/M tokens with standard decoding.
Pricing fluctuates based on GPU availability. The prices above are from 30 May 2026 and may have changed. Check current GPU pricing for live rates.
For a direct comparison of Eagle-3 vs DFlash vs standard decoding across vLLM, TensorRT-LLM, and SGLang, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.
End-to-End Eagle-3 on Spheron H200
Here is the full walkthrough from provisioning to validated throughput comparison.
Step 1: Provision an H200 SXM5 instance on Spheron
Log into app.spheron.ai and provision an H200 SXM5 (141 GB) instance. Verify GPU access:
nvidia-smi
# Should show: NVIDIA H200 SXM5, 141034MiB total memory
Step 2: Establish baseline with standard vLLM
docker run --gpus all --ipc=host -p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--gpu-memory-utilization 0.90 \
--max-model-len 16384
Run benchmark_serving.py against this baseline (200 prompts, 512 input / 256 output tokens):
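# --dataset-path must point at a locally downloaded ShareGPT JSON file (the path below is a placeholder)
python benchmarks/benchmark_serving.py \
--backend openai-chat \
--base-url http://localhost:8000 \
--endpoint /v1/chat/completions \
--model meta-llama/Llama-3.3-70B-Instruct \
--dataset-name sharegpt \
--dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \
--sharegpt-output-len 256 \
--num-prompts 200
Record tokens/sec and tokens-per-dollar:
tokens_per_dollar = tokens_per_sec / (price_per_hour / 3600)
# Standard decoding on H200 SXM5: ~1,700 tok/s / ($X.XX/3600) = ~Y tokens/$
Step 3: Switch to Eagle-3 and re-benchmark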
Stop the standard vLLM container and start the Eagle-3 configuration:
docker run --gpus all --ipc=host -p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--num-speculative-tokens 5 \
--speculative-draft-tensor-parallel-size 1 \
--speculative-disable-by-batch-size 32 \
--gpu-memory-utilization 0.94 \
--max-model-len 16384
Run the same benchmark. Eagle-3 on H200 SXM5 typically shows ~6,200 tok/s vs ~1,700 tok/s standard, a 3.6x improvement. Tokens-per-dollar increases by the same ratio since the hourly cost is fixed.
Step 4: Add chunked prefill for TTFT reduction
Add --enable-chunked-prefill --max-num-batched-tokens 4096 to the Eagle-3 launch command. TTFT p50 drops from ~38ms (standard) to ~8-12ms without changing tokens/sec. This is the production-ready configuration.
Spheron's H200 instances are available on-demand with per-minute billing. For full pricing and availability, see the H200 GPU pricing page.
When NOT to Use Eagle-3
High batch sizes (32+): At batch size 32+, acceptance rate drops to 0.62-0.70 and you are paying for two forward passes per step with diminishing accepted tokens. --speculative-disable-by-batch-size 32 handles this automatically.
Long-context decode (>32K output tokens): As generated output grows, the draft head's context distribution shifts further from the target model's. Acceptance rate degrades past ~10K output tokens. For very long generation tasks, standard decoding is more predictable.
Embedding pipelines: Eagle-3 accelerates autoregressive generation. Embedding pipelines do not generate tokens, so there is nothing to speculate. Standard forward passes are the right tool.
Batch offline inference: For large-batch throughput-optimized jobs, the overhead of maintaining and verifying a draft head hurts efficiency. Fall back to batch-style execution with chunked prefill and continuous batching, which gives better throughput per GPU-hour at batch sizes above 64.
Mixed-modal models without Eagle-3 checkpoints: Eagle-3 checkpoints are model-specific. If no checkpoint exists for your vision encoder or audio model combination, standard decoding or DFlash (if a checkpoint is available) is the path. For the highest throughput when DFlash checkpoints exist for your model, see the DFlash block diffusion speculative decoding guide which achieves 6x speedup over standard decoding.
Eagle-3 raises tokens-per-dollar on H200 by 3-4x compared to standard vLLM, with zero output quality tradeoff. Spheron H200 instances are available on-demand with per-minute billing, live in under 2 minutes.
Quick Setup Guide
Launch an H200 SXM5 instance on Spheron. Verify GPU access with nvidia-smi. Install Docker with the NVIDIA container runtime. Pull the vLLM image: docker pull vllm/vllm-openai:latest.
Find the matching Eagle-3 checkpoint on Hugging Face for your target model (e.g., yuhuili/EAGLE3-LLaMA3.3-Instruct-70B for Llama 3.3 70B). Confirm the checkpoint's tokenizer matches your target model's tokenizer exactly - mismatches cause low acceptance rates.
Add --speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B, --num-speculative-tokens 5, --speculative-draft-tensor-parallel-size 1, and --gpu-memory-utilization 0.94 to your vLLM launch command. Set --speculative-disable-by-batch-size 32 for automatic fallback at high concurrency.
Query the /metrics endpoint and check vllm:spec_decode_draft_acceptance_rate. Target 0.75 or higher. If acceptance is below 0.65, reduce --num-speculative-tokens from 5 to 3, then re-measure. Low acceptance at large N wastes compute on rejected drafts.
Enable chunked prefill with --enable-chunked-prefill and set --max-num-batched-tokens 4096. Continuous batching is on by default in vLLM. This stacked configuration reduces TTFT by an additional 70-80% beyond Eagle-3 alone on H200 at production traffic levels.
Frequently Asked Questions
What is Eagle-3?
Eagle-3 is a draft-head speculative decoding method that trains a lightweight single-layer transformer head on the target model's internal feature vectors instead of token embeddings. In Eagle-3, the draft head also receives fused context from all transformer layers (not just the last), which pushes acceptance rates to 0.80-0.88 on coding and instruction-following tasks. That translates to 3-4x throughput gains on H100 and H200.
How does Eagle-3 differ from Eagle-2 and Medusa?
Eagle-2 uses the last hidden state of the target model as the draft head's input. Eagle-3 adds feature fusion across all layers, raising acceptance rates from 0.72-0.78 to 0.80-0.88 on Llama and Qwen models. Medusa adds multiple parallel decoding heads but does not use feature fusion and has lower acceptance rates on long-form generation. In the H200 benchmarks above, Eagle-3 reaches ~6,200 tokens/sec at batch size 1-4, versus ~3,800 for Eagle-2 and ~3,200 for Medusa-2.
Does vLLM support Eagle-3?
Yes. Eagle-3 draft heads are supported in vLLM via the --speculative-model flag pointing to a pre-trained Eagle-3 checkpoint (e.g. yuhuili/EAGLE3-LLaMA3.3-Instruct-70B). Set --speculative-model-quantization fp8 and --num-speculative-tokens 5 to start. The draft head runs on the same device as the target model and adds minimal memory overhead.
Does SGLang support Eagle-3?
Yes. SGLang supports Eagle-3 via the speculative_algorithm EAGLE3 option in its configuration. Set speculative_num_steps to 5 and speculative_eagle_topk to 8. The RadixAttention KV cache in SGLang is fully compatible with Eagle-3 draft verification.
When should you not use Eagle-3?
Skip Eagle-3 at batch sizes above 32. At high concurrency, the draft head's acceptance rate falls and you pay the cost of two forward passes per step. Long-context decode (>32K tokens) also degrades acceptance rates because the draft head's context becomes harder to predict. For batch inference jobs and embedding pipelines, standard decoding or FlashAttention-4 optimizations deliver better throughput.
Which models have pre-trained Eagle-3 draft heads?
As of May 2026, pre-trained Eagle-3 heads exist for Llama 3.3 70B Instruct, Llama 4 Scout and Maverick, Qwen 2.5 72B, and DeepSeek V3 671B (yuhuili namespace on Hugging Face). GLM-5.1 has no published checkpoint yet. You can train a custom Eagle-3 head for any model using the official Eagle training repo in roughly 2-4 hours on 4x H100s with the target model's activations as supervision signal.
How much VRAM does the Eagle-3 draft head need?
Eagle-3 draft heads are 1-3 billion parameter models with a lightweight single-layer architecture. On an H200 SXM5 (141 GB) running Llama 3.3 70B Instruct at FP8 (about 58 GB), the Eagle-3 head adds 2-3 GB, so the combined weights stay under 65 GB and the KV cache takes most of the remaining memory. Set --gpu-memory-utilization 0.94 to give the draft head room without OOM risk.
