
Speculative Decoding Production Guide: 2-5x Faster LLM Inference on GPU Cloud

Written by Mitrasish, Co-founder | Mar 26, 2026

Token generation on large language models is memory-bandwidth-bound. The GPU loads model weights from VRAM for every single token it generates, one at a time. That is the fundamental bottleneck. Speculative decoding exploits a key insight: verifying whether multiple candidate tokens are correct costs roughly the same as generating one token from scratch. A small draft model proposes N tokens cheaply; the target model checks all N in a single forward pass. When the draft matches, you get N tokens for the price of one verification step.

This guide covers the theory, the EAGLE-3 and P-EAGLE variants that push acceptance rates significantly higher, GPU requirements for H100 setups, and working configuration for both vLLM and SGLang. For a framework comparison covering throughput and latency numbers across all three major engines, see vLLM vs TensorRT-LLM vs SGLang benchmarks. For the full vLLM production setup guide, see the vLLM production deployment guide. If you're setting up a fresh GPU instance, Spheron's LLM quick-guides walk through first deployment.

TL;DR

| Mode | Tokens/sec | TTFT p50 | Cost per 1M output tokens | Best for |
|---|---|---|---|---|
| Standard decoding | ~1,200 | ~45 ms | ~$0.46 | High concurrency (32+ req), batch jobs |
| Draft model (Llama 3.2 1B) | ~2,600 | ~20 ms | ~$0.21 | Low-concurrency chat, interactive APIs |
| EAGLE-3 | ~3,600 | ~15 ms | ~$0.15 | Instruction-following, coding, agents |

Numbers based on Llama 3.3 70B Instruct at FP8, H100 PCIe at $2.01/hr, batch size 1-4. Cost formula: ($/hr / 3600) / (tokens/sec / 1_000_000).

How Speculative Decoding Works

Standard autoregressive decoding generates one token per forward pass. Each pass loads all 70 billion weights from VRAM into compute units. The GPU spends most of its time on memory loads, not computation. Throughput is determined by how fast you can transfer data across the memory bus.

Speculative decoding breaks this one-token-per-pass constraint.

A small draft model generates N candidate tokens quickly (say, 5 tokens). Then the large target model takes those 5 candidate tokens and verifies all of them in a single forward pass. If a candidate matches what the target model would have generated, it is accepted. At the first mismatch, everything from that point forward is discarded, and the target model generates the correct continuation.

The expected throughput gain depends on the acceptance rate alpha. If alpha is 0.8 (the draft model is right 80% of the time), expected accepted tokens per step is (1 - alpha^(N+1)) / (1 - alpha). At alpha = 0.8 with N = 5, you accept about 3.7 tokens per target model forward pass instead of 1. That is a 3.7x speedup before accounting for the draft model's own inference cost.
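The expectation above is a short geometric series, easy to sanity-check in Python. This is the idealized model, which treats each draft token as independently accepted with probability alpha:

```python
def expected_accepted(alpha: float, n: int) -> float:
    """Expected tokens accepted per verification step:
    sum of alpha^k for k = 0..n, i.e. (1 - alpha^(n+1)) / (1 - alpha)."""
    return (1 - alpha ** (n + 1)) / (1 - alpha)

# alpha = 0.8, N = 5 draft tokens -> ~3.7 accepted per target pass
print(round(expected_accepted(0.8, 5), 2))  # 3.69
```

Real acceptance is not independent per token, so treat this as an upper-bound intuition rather than a prediction.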

Here is the draft-verify loop in concrete terms:

```
Draft model:  [tok1] [tok2] [tok3] [tok4] [tok5]   <-- fast generation
Target model:   ✓      ✓      ✓      ✗             <-- verified in 1 forward pass
Result: tok1, tok2, tok3 accepted; regenerate from tok4
```

The output distribution is mathematically identical to standard autoregressive decoding. Rejection sampling preserves the target model's distribution exactly: any rejected token is replaced by the target model's own output at that position. There is no quality tradeoff.

Acceptance rate depends on how closely the draft model's distribution matches the target model's on your specific prompts. Structured outputs, code, and instruction-following tend to have high acceptance rates. Creative writing, adversarial prompts, and high-entropy tasks have lower acceptance rates.

Types of Speculative Decoding

Draft Model Speculation

The classic approach: a separate small model (1B-3B parameters) in the same architecture family as your target model. Works with any target model. Acceptance rate is typically 0.6-0.75 for instruction-following workloads, which gives 2-3x throughput. The easiest starting point; no special training required if a compatible draft model exists for your target.

Self-Speculation and Lookahead Decoding

No separate draft model: uses the target model's own lower layers or n-gram patterns from previous tokens to speculate. Zero VRAM overhead. Acceptance rates are lower than a dedicated draft model (typically 0.4-0.6), so speedups are more modest. Not widely supported in production serving frameworks yet.

Medusa

Adds parallel decoding heads to the target model. Requires either fine-tuning or a pre-trained Medusa checkpoint. Higher acceptance rates than vanilla draft models on workloads the heads were trained for. Less flexible than the draft model approach since the heads are task-specific.

EAGLE and EAGLE-2

Trains a lightweight draft model on the target model's internal feature vectors rather than token embeddings. Because the draft model sees richer representations, it predicts the target's continuations more accurately. EAGLE-2 adds dynamic candidate tree construction, where the tree width expands or contracts based on confidence. Achieves 3x+ speedup on coding and instruction-following. Pre-trained checkpoints are available on Hugging Face for Llama, Qwen, and Mistral families.

EAGLE-3 and P-EAGLE (2026)

EAGLE-3 switches from feature prediction to direct token prediction, fusing hidden states from multiple target model layers as draft head input. This yields higher acceptance rates. In practice, coding-heavy workloads see 3.5-5x speedup versus standard decoding (the EAGLE-3 paper reports ~4.8x on HumanEval for LLaMA 3.3 70B). P-EAGLE extends EAGLE with parallel drafting: instead of generating draft tokens autoregressively, the draft model produces all K candidate tokens in a single forward pass, then constructs verification trees from them. AWS contributed P-EAGLE to mainline vLLM in early 2026. On coding benchmarks, P-EAGLE reaches 4-5x speedup over standard decoding and 20-30% improvement over EAGLE-3 alone.

GPU Requirements

Both the target and draft models need to fit in VRAM simultaneously. The draft model adds 2-6 GB for a 1B-3B model in FP8 or FP16. On an H100 80GB, this is manageable. For more on GPU selection for inference workloads, see the best GPU for AI inference in 2026 guide.

| Target Model | Precision | Draft Model | GPU | Approx VRAM Used |
|---|---|---|---|---|
| Llama 3.3 70B | FP8 | Llama 3.2 1B | H100 80GB | ~76 GB |
| Llama 3.1 8B | FP16 | Llama 3.2 1B | L40S 48GB | ~26 GB |
| Llama 3.3 70B | FP8 | EAGLE3-LLaMA3.3-70B | H100 80GB | ~77 GB |
| Llama 3.1 405B | FP8 | Llama 3.2 3B | 4x H200 141GB | ~415 GB |

Two settings to get right: set --gpu-memory-utilization 0.94 instead of the default 0.90 so both models fit. Keep --speculative-draft-tensor-parallel-size 1 even when the target uses tensor parallelism; the draft model fits on a single GPU in all cases above.
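A back-of-the-envelope check that both models fit before launching. The overhead allowances below are assumed round numbers for KV cache, activations, and CUDA context, not measured values:

```python
def vram_gb(params_billion: float, bytes_per_param: float, overhead_gb: float) -> float:
    """Rough VRAM estimate: weight bytes plus a flat allowance
    for KV cache, activations, and CUDA context."""
    return params_billion * bytes_per_param + overhead_gb

target = vram_gb(70, 1.0, overhead_gb=4.0)  # 70B at FP8 (1 byte/param)
draft = vram_gb(1, 2.0, overhead_gb=0.5)    # 1B draft at FP16 (2 bytes/param)
print(round(target + draft, 1), "GB of 80 GB")  # ~76.5 GB of 80 GB
```

Actual usage varies with context length and batch size, so leave headroom rather than planning to the last gigabyte.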

Tutorial: Enable Speculative Decoding in vLLM

Prerequisites

Docker with NVIDIA runtime on a Spheron instance. vLLM v0.8.0 or later (recommended for stable EAGLE support). A Hugging Face token for any gated models.

Single-GPU Draft Model Setup (H100)

```bash
# --speculative-model: draft model (1B, same architecture family as target)
# --num-speculative-tokens: candidate tokens generated per draft step (tune 3-8)
# --speculative-draft-tensor-parallel-size: keep draft on 1 GPU even when target uses TP
# --gpu-memory-utilization: raised from default 0.90 to 0.94 to fit both models in VRAM
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.8.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.94 \
  --max-num-seqs 32 \
  --host 0.0.0.0 \
  --port 8000
```
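Once the container is serving, you can exercise the OpenAI-compatible endpoint with a quick completion request (a usage sketch; adjust the prompt and `max_tokens` to taste):

```shell
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "Write a haiku about GPUs.",
    "max_tokens": 64
  }'
```

Speculative decoding is transparent to the caller: the response format is identical with or without it.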

Start with --num-speculative-tokens 5 and tune upward if your acceptance rate is consistently above 0.8. Increase to 7-8 for high-acceptance workloads. Reduce to 3 if acceptance rate is below 0.5.
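To see why these tuning directions make sense, here is a toy model of net speedup that charges each draft token a fraction of a target forward pass. The 5% draft cost is an assumed figure for illustration, not a measurement:

```python
def net_speedup(alpha: float, n: int, draft_cost: float = 0.05) -> float:
    """Expected accepted tokens per step divided by step cost,
    where each of the n draft passes costs `draft_cost` target passes."""
    accepted = (1 - alpha ** (n + 1)) / (1 - alpha)
    return accepted / (1 + n * draft_cost)

print([round(net_speedup(0.8, n), 2) for n in (3, 5, 8)])  # rises with n
print([round(net_speedup(0.4, n), 2) for n in (3, 5, 8)])  # falls with n
```

High acceptance rewards longer speculation; low acceptance punishes it, which is exactly the tuning rule above.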

EAGLE-3 Setup

EAGLE-3 gives meaningfully higher acceptance rates on instruction-following and coding tasks:

```bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.8.0 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
  --num-speculative-tokens 6 \
  --speculative-draft-tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.94 \
  --host 0.0.0.0 \
  --port 8000
```

yuhuili/EAGLE3-LLaMA3.3-Instruct-70B is publicly available on Hugging Face and does not require a Hugging Face token. If it is unavailable or renamed, use yuhuili/EAGLE-LLaMA3-Instruct-70B (EAGLE-1 draft weights, which EAGLE-2 reuses) as a fallback.

Tutorial: Enable Speculative Decoding in SGLang

SGLang uses --speculative-algorithm EAGLE3 for EAGLE-3 checkpoints (or --speculative-algorithm EAGLE for EAGLE-1/2) and --speculative-draft-model-path. The same EAGLE checkpoints work with both frameworks.

```bash
docker run --gpus all \
  --ipc=host \
  -p 30000:30000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  lmsysorg/sglang:v0.4.6.post4-cu124 \
  python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.3-70B-Instruct \
    --quantization fp8 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
    --speculative-num-draft-tokens 6 \
    --host 0.0.0.0 \
    --port 30000
```

SGLang's speculative decoding uses tree attention rather than flat token speculation. In practice, this means SGLang can evaluate multiple speculative trees in a single pass, which helps on workloads with branching continuations. The flag names differ from vLLM (--speculative-algorithm instead of --speculative-model, --speculative-num-draft-tokens instead of --num-speculative-tokens), but the underlying EAGLE checkpoints are shared. For a deeper comparison of SGLang versus vLLM throughput and latency characteristics, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

Benchmarks: Tokens/sec and Cost Per Token

These benchmarks use Llama 3.3 70B Instruct at FP8 on H100, single GPU, batch size 1-4. Methodology: vLLM benchmark_serving.py with 200 prompts, 512 input tokens, 256 output tokens. Pricing from Spheron live API as of 24 Mar 2026. † L40S rows use Llama 3.1 8B Instruct at FP16 (70B does not fit on a single 48 GB GPU).

| GPU | Price/hr | Mode | Tokens/sec | Cost per 1M tokens |
|---|---|---|---|---|
| H100 PCIe | $2.01 | Standard decoding | ~1,200 | ~$0.46 |
| H100 PCIe | $2.01 | Draft model (Llama 3.2 1B) | ~2,600 | ~$0.21 |
| H100 PCIe | $2.01 | EAGLE-3 | ~3,600 | ~$0.15 |
| H100 SXM5 | $2.40 | Standard decoding | ~1,500 | ~$0.44 |
| H100 SXM5 | $2.40 | EAGLE-3 | ~4,400 | ~$0.15 |
| L40S PCIe† | $2.04 | Standard decoding | ~3,800 | ~$0.15 |
| L40S PCIe† | $2.04 | EAGLE-3 | ~10,000 | ~$0.06 |

Cost per 1M tokens formula, so you can verify with current pricing:

```
cost_per_1M = (price_per_hour / 3600) / (tokens_per_second / 1_000_000)
```

For example, H100 PCIe with EAGLE-3: (2.01 / 3600) / (3600 / 1_000_000) = ~$0.155 per million output tokens.
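The same arithmetic as a reusable helper, so you can plug in current prices:

```python
def cost_per_1m_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost per 1M output tokens at a steady generation rate."""
    return (price_per_hour / 3600) / (tokens_per_second / 1_000_000)

print(round(cost_per_1m_tokens(2.01, 3600), 3))  # 0.155  (H100 PCIe, EAGLE-3)
print(round(cost_per_1m_tokens(2.01, 1200), 3))  # 0.465  (standard decoding)
```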

Pricing fluctuates based on GPU availability. The prices above are based on 24 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

When NOT to Use Speculative Decoding

Speculative decoding is optimized for low-concurrency, latency-sensitive serving. There are clear cases where you should leave it disabled:

  1. Batch size above 32. At high concurrency the GPU is compute-bound, not memory-bandwidth-bound. The draft model forward passes become pure overhead. You are paying for two models but getting no speedup.
  2. Short output sequences under 50 tokens. The setup overhead of the first speculative step outweighs the savings for very short completions.
  3. High-entropy outputs. Random seeds, adversarial prompts, and shuffled data produce low acceptance rates. Below 0.5 acceptance rate, speculative decoding actively hurts throughput.
  4. Embedding and batch inference pipelines. Not relevant: you are not generating tokens autoregressively.
  5. VRAM-constrained GPUs serving large models. If the target model already uses 90%+ VRAM, adding a draft model causes OOM or fragments the KV cache.

Decision rule:

```
if batch_size > 32 or output_tokens < 50 or task == "embedding":
    use standard decoding
else:
    enable speculative decoding, tune num-speculative-tokens
```

Production Configuration and Monitoring

Key vLLM Flags Reference

| Flag | Recommended Value | Notes |
|---|---|---|
| `--num-speculative-tokens` | 5-6 | Start at 5, increase to 8 if acceptance rate >0.8 |
| `--speculative-draft-tensor-parallel-size` | 1 | Draft model fits on 1 GPU in all standard configs |
| `--speculative-disable-by-batch-size` | 32 | Auto-fallback above this concurrent request count |
| `--speculative-max-model-len` | 4096 | Limit draft context; saves VRAM, usually sufficient |
| `--gpu-memory-utilization` | 0.94 | Higher than default to fit both models |

Metrics to Monitor

```bash
# Scrape vLLM Prometheus endpoint
curl http://localhost:8000/metrics | grep spec_decode
```

Key metrics:

  • vllm:spec_decode_draft_acceptance_rate: target above 0.70. Below 0.50 means the draft model distribution is a poor match for your traffic.
  • vllm:spec_decode_efficiency: ratio of tokens generated via speculative path versus total.
  • vllm:gpu_cache_usage_perc: watch for VRAM pressure when running both models simultaneously.
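A small watchdog sketch for these metrics. The sample text below assumes the metric names listed above; verify the exact names your vLLM version exposes at `/metrics` before wiring this into alerting:

```python
import re

# Sample scrape output (shape assumed for illustration)
sample = """\
vllm:spec_decode_draft_acceptance_rate 0.72
vllm:spec_decode_efficiency 0.64
vllm:gpu_cache_usage_perc 0.81
"""

def parse_metrics(text: str) -> dict:
    """Parse simple `name value` lines from a Prometheus text scrape."""
    pattern = re.compile(r"^([A-Za-z_:][\w:]*)\s+([-+.\deE]+)\s*$", re.MULTILINE)
    return {m.group(1): float(m.group(2)) for m in pattern.finditer(text)}

metrics = parse_metrics(sample)
acceptance = metrics["vllm:spec_decode_draft_acceptance_rate"]
print("OK" if acceptance >= 0.70 else "WARN: acceptance below target")
```

A production version would scrape the live endpoint and skip `# HELP`/`# TYPE` comment lines, which this minimal regex already ignores.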

If acceptance rate drops below 0.5 on your production traffic, switch from a general-purpose draft model to an EAGLE-3 checkpoint trained on a similar distribution, or reduce --num-speculative-tokens to 3.

Handling Traffic Spikes

--speculative-disable-by-batch-size 32 is the critical production setting. Without it, at 100+ concurrent requests speculative decoding can reduce throughput by 30-40% versus standard decoding. With it, vLLM automatically falls back to standard decoding when queue depth exceeds the threshold. No code changes needed; the switch is transparent to the API caller.


Speculative decoding cuts cost per output token by 50-70% on H100 and L40S instances without changing your model or upgrading GPU tier. Spheron provides bare-metal access with per-minute billing and no long-term commitment.

Rent H100 → | Rent L40S → | View all GPU pricing → | Get started on Spheron →
