Standard speculative decoding with EAGLE-3 delivers 3-4x LLM throughput improvement. DFlash gets to 6x, and does it without any quality tradeoff. The difference is in how the draft phase works: instead of generating candidate tokens one at a time, DFlash produces a full block of K tokens in a single forward pass using a block diffusion model.
If you are new to speculative decoding, start with the speculative decoding production guide before continuing. This post assumes familiarity with draft-verify mechanics and focuses on what DFlash does differently.
TL;DR
| Mode | Tokens/sec (H100 PCIe) | TTFT p50 | Cost per 1M output tokens | Best for |
|---|---|---|---|---|
| Standard decoding | ~1,200 | ~45 ms | ~$0.47 | High concurrency (32+ req), batch jobs |
| Draft model (Llama 3.2 1B) | ~2,600 | ~20 ms | ~$0.21 | Low-concurrency chat, interactive APIs |
| EAGLE-3 | ~3,600 | ~15 ms | ~$0.16 | Instruction-following, coding, agents |
| P-EAGLE | ~4,500 | ~13 ms | ~$0.12 | Coding, reasoning, multi-tree |
| DFlash | ~9,000 | ~8 ms | ~$0.06 | Best throughput when DFlash checkpoint available |
Numbers from Llama 3.3 70B FP8, H100 PCIe at $2.01/hr, batch size 1-4, vLLM benchmark_serving.py, 200 prompts, 512 input / 256 output tokens. DFlash numbers projected from the DFlash paper (2.5x over EAGLE-3). See the benchmarks section for full results with current pricing.
What Is DFlash
Speculative decoding has a sequential bottleneck in the draft phase. With EAGLE-3, the draft head generates tokens one at a time: tok1, then tok2, then tok3. Each step depends on the previous one. You pay the draft model's compute cost once per token generated.
DFlash removes that bottleneck entirely. It replaces the autoregressive draft head with a block diffusion model. The diffusion model receives the target model's hidden states and generates K masked positions. A single denoising step fills all K positions simultaneously. The target model then verifies all K in one forward pass, same as before.
The DFlash paper reports:
- 6x lossless acceleration over standard autoregressive decoding
- 2.5x improvement over EAGLE-3
Both figures are from instruction-following workloads at low batch sizes. Real-world speedup depends on your acceptance rate, sequence length, and batch size. Treat these as targets, not guarantees, until you benchmark on your actual traffic.
How DFlash Works: Single-Pass Block Drafting
Here is the step-by-step loop:
- Target model generates the first token normally (autoregressive bootstrapping).
- DFlash draft head receives the target model's hidden states.
- Block diffusion denoising fills K masked positions in one forward pass.
- Target model verifies all K candidate tokens in a single forward pass.
- Accepted tokens are kept up to the first rejection; the target model resamples from that position.
Draft (DFlash block diffusion): [tok1] [tok2] [tok3] [tok4] [tok5] [tok6] [tok7] [tok8]
-- all generated in ONE forward pass --
Target verification: ✓ ✓ ✓ ✓ ✓ ✗ (tok7-tok8 discarded after the first rejection)
Result: tok1-tok5 accepted; resample from tok6

Compare with EAGLE-3, which generates tokens autoregressively in the draft phase:
Draft (EAGLE-3 autoregressive): tok1 -> tok2 -> tok3 -> tok4 -> tok5 -> tok6
-- 6 sequential draft steps --

DFlash's draft cost is roughly constant regardless of K. If you set --num-speculative-tokens 8, DFlash does one draft forward pass to generate 8 candidates; EAGLE-3 does 8 sequential draft steps. At K=8, DFlash spends roughly 8x less compute in the draft phase than EAGLE-3, which is why the overall throughput improvement is so large.
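The draft-then-verify round above can be sketched in a few lines of Python. This is a toy illustration, not the vLLM implementation: `draft_block` stands in for the single-pass block diffusion draft, and the target's tokens are hard-coded to reproduce the diagram.

```python
# Toy model of one DFlash speculation round. All names are illustrative.

def draft_block(context, k):
    """Stand-in for the block diffusion draft: proposes K tokens in ONE
    forward pass, unlike an autoregressive head's K sequential steps."""
    return [context[-1] + i + 1 for i in range(k)]

def verify(draft, target_next):
    """Keep draft tokens up to the first disagreement with the target model."""
    accepted = []
    for tok, tgt in zip(draft, target_next):
        if tok != tgt:
            break  # first rejection: the target resamples from this position
        accepted.append(tok)
    return accepted

context = [0]
draft = draft_block(context, k=8)         # one draft pass for all 8 candidates
target = [1, 2, 3, 4, 5, 99, 7, 8]        # target disagrees at position 6
accepted = verify(draft, target)
assert accepted == [1, 2, 3, 4, 5]        # tok1-tok5 kept; resample from tok6
```

Whether the accepted prefix is long or short, the draft phase cost one forward pass; with EAGLE-3 the same round would have cost eight.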
DFlash vs EAGLE-3 vs Medusa vs Standard Speculative Decoding
| Method | Draft Generation | Tokens/step (H100) | Acceptance Rate | Max Speedup | Requires Finetuning |
|---|---|---|---|---|---|
| Standard draft model | Autoregressive (1B-3B) | 1 draft per step | 0.60-0.75 | ~2.5x | No |
| Medusa | Parallel heads on target | K per step | 0.65-0.80 | ~2.8x | Yes (head finetuning) |
| EAGLE-3 | Autoregressive, feature-based | 1 per step | 0.75-0.85 | ~4.8x | Yes (EAGLE checkpoint) |
| P-EAGLE | Parallel multi-tree | K per step | 0.75-0.85 | ~5-6x† (vs std. decoding) | Yes (EAGLE checkpoint) |
| DFlash | Block diffusion (single pass) | K in 1 pass | 0.75-0.85 | ~6x | Yes (DFlash checkpoint) |
† P-EAGLE reports a 1.10-1.36x improvement over EAGLE-3, which itself runs at ~4.8x vs standard decoding; that converts to ~5.3-6.5x against the standard-decoding baseline used by all other rows in this table. See arXiv:2602.01469 for the P-EAGLE paper.
When to pick each:
- Standard draft model: fastest to deploy, no special checkpoints. Works with any model if a small same-family model exists. Best starting point.
- Medusa: higher acceptance on workloads matching its trained heads. Less portable.
- EAGLE-3: best general-purpose speculative method before DFlash. Wide checkpoint availability.
- DFlash: highest throughput when DFlash checkpoints exist for your model. If they do not, use EAGLE-3.
Supported Models
Qwen Family
DFlash draft checkpoints are available for the Qwen3 and Qwen3.5 series. Checkpoints for Qwen2.5 are not available. Check the DFlash organization on Hugging Face (https://huggingface.co/z-lab) for the latest list. Always verify the checkpoint matches your exact target model variant before deploying.
Available as of the DFlash paper release:
- Qwen3-4B
- Qwen3-8B (verify checkpoint availability at https://huggingface.co/z-lab before deploying)
- Qwen3-Coder-30B-A3B
- Qwen3.5-9B
- Qwen3.5-27B
- Qwen3.5-35B-A3B
LLaMA Family
DFlash checkpoints for the LLaMA family are more limited. Only the following checkpoint has been confirmed on Hugging Face as of the DFlash paper release. Check https://huggingface.co/z-lab before deploying any other LLaMA variant.
Confirmed:
- Llama-3.1-8B-Instruct
Larger variants (70B) should be verified against the z-lab Hugging Face organization before deploying, as availability may have changed since the paper release.
If DFlash checkpoints do not exist for your model, EAGLE-3 remains the best alternative. See the speculative decoding production guide for EAGLE-3 setup.
GPU Requirements and Cost Analysis
VRAM Requirements
| Target Model | Precision | DFlash Checkpoint | GPU | VRAM Used | Notes |
|---|---|---|---|---|---|
| Llama 3.3 70B | FP8 | DFlash-LLaMA3.3-70B | H100 80GB | ~78 GB | Fits with --gpu-memory-utilization 0.94. Verify checkpoint at https://huggingface.co/z-lab before use. |
| Llama 3.1 8B | FP16 | LLaMA3.1-8B-Instruct-DFlash-UltraChat | L40S 48GB | ~20 GB | Plenty of headroom for KV cache |
| Qwen3.5-35B-A3B | FP16 | DFlash-Qwen3.5-35B-A3B | H100 80GB | ~70 GB | Check z-lab HF for latest checkpoint |
| Llama 3.3 70B | FP8 | DFlash-LLaMA3.3-70B | A100 80GB | ~78 GB | ~65% throughput vs H100 PCIe. Verify checkpoint at https://huggingface.co/z-lab before use. |
The DFlash draft checkpoint adds a few GB of VRAM overhead over the target model alone (exact amount depends on the checkpoint; check actual usage with nvidia-smi after loading), comparable to an EAGLE-3 checkpoint. Set --gpu-memory-utilization 0.94 and --speculative-draft-tensor-parallel-size 1 in all configurations. For strategies to maximize the remaining KV cache headroom, see the KV cache optimization guide.
Per-GPU Cost
| GPU | On-Demand Price/hr | Spot Price/hr | Tokens/sec with DFlash | Cost per 1M output tokens |
|---|---|---|---|---|
| H100 PCIe | $2.01 | N/A | ~9,000 (projected) | ~$0.06 |
| H100 SXM5 | $2.90 | $0.80 | ~11,000 (projected) | ~$0.07 |
| A100 80GB PCIe | $1.04 | $1.14‡ | ~5,500 (projected) | ~$0.05 |
| L40S 48GB | $0.72 | $0.32 | ~25,000 (projected, 8B model) | ~$0.008 |
DFlash tokens/sec are projected from the paper's 2.5x improvement over EAGLE-3. Validate on your workload before planning capacity. Cost formula: (price_per_hour / 3600) / (tokens_per_second / 1_000_000).
‡ A100 80GB PCIe spot pricing ($1.14/hr) currently exceeds the on-demand rate ($1.04/hr). Spot pricing can occasionally exceed on-demand depending on availability and demand. On-demand is the better choice for this GPU; check current GPU pricing → for live rates.
Pricing fluctuates based on GPU availability. The prices above are based on 14 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Step-by-Step: Deploy DFlash with vLLM on GPU Cloud
Prerequisites
Docker with NVIDIA runtime on a Spheron instance. DFlash requires a vLLM nightly build (or a release with DFlash support; v0.8.0 does not include DFlash). Hugging Face token if using gated models. See Spheron's LLM quick-guides for first-time instance setup.
For 7B-13B models, L40S 48GB is sufficient. For 70B models you need an H100 80GB or A100 80GB. Verify that a DFlash checkpoint exists for your target model at https://huggingface.co/z-lab before proceeding.
Single-GPU Setup (H100)
# DFlash is configured via --speculative-config JSON object
# method: "dflash" selects block diffusion drafting
# model: path to the DFlash checkpoint for your target model (use z-lab org on Hugging Face)
# num_speculative_tokens: 8 works well; single-pass draft makes higher K efficient
# draft_tensor_parallel_size: keep at 1; draft fits on single GPU
# --gpu-memory-utilization: 0.94 to fit both target and DFlash checkpoint
#
# Example uses Llama-3.1-8B-Instruct with the confirmed z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat checkpoint.
# For other models, verify the checkpoint exists at https://huggingface.co/z-lab before deploying.
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:nightly \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-config '{"method": "dflash", "model": "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat", "num_speculative_tokens": 8, "draft_tensor_parallel_size": 1}' \
--max-model-len 8192 \
--gpu-memory-utilization 0.94 \
--max-num-seqs 32 \
--speculative-disable-by-batch-size 32 \
--host 0.0.0.0 \
--port 8000

Note: use a vLLM nightly image (vllm/vllm-openai:nightly) until DFlash lands in a stable release. Check the DFlash GitHub repository for the current recommended vLLM version and any updated flag names.
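Once the container is up, a quick smoke test confirms the endpoint responds. Speculation is transparent to callers: the request below is a standard OpenAI-compatible completions call, identical to what you would send to a non-speculative vLLM server. The snippet builds the request but leaves the actual send commented out so it can run without a live server.

```python
import json
import urllib.request

# Standard OpenAI-compatible completions payload; nothing DFlash-specific.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Write a haiku about GPUs.",
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:        # requires a running server
#     print(json.load(resp)["choices"][0]["text"])
```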
Tuning num-speculative-tokens for DFlash
DFlash's single-pass drafting changes the tuning calculus compared to EAGLE-3. With autoregressive drafting, higher K means more sequential draft steps. With DFlash, K tokens cost the same as 1 token in the draft phase. So higher K values are more efficient with DFlash than with EAGLE-3.
Recommended starting points:
- K=8 for instruction-following and coding workloads (high acceptance rate)
- K=6 for mixed workloads
- K=4 for high-entropy or creative tasks
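A quick way to reason about these defaults: under a simplified model where each draft token is accepted independently with probability a, the expected tokens gained per round is a geometric sum. Real acceptance is position-dependent, so treat this as a rough guide, not a prediction.

```python
# Expected tokens per speculation round under an i.i.d. acceptance model:
# sum of a^i accepted drafts for i = 1..K, plus the one token the target
# model always contributes. With single-pass drafting, raising K adds
# candidates at no extra draft cost, so the flattening of this curve
# (not draft compute) is what caps useful K.

def expected_tokens_per_round(a: float, k: int) -> float:
    return sum(a ** i for i in range(1, k + 1)) + 1.0

for k in (4, 6, 8):
    print(f"K={k}: {expected_tokens_per_round(0.8, k):.2f} tokens/round")
```

At a = 0.8 this gives roughly 3.36, 3.95, and 4.33 tokens per round for K = 4, 6, and 8: diminishing returns, but still returns, which is why DFlash can default to a higher K than EAGLE-3.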
Always include --speculative-disable-by-batch-size 32 in production. This auto-disables speculation when concurrent requests exceed 32, preventing throughput degradation at high load. The API caller sees no difference; the switch is transparent. For more on how continuous batching and paged attention interact with speculation, see the LLM serving optimization guide.
Step-by-Step: Deploy DFlash with SGLang for Multi-Turn Agent Workloads
SGLang is particularly effective with DFlash on multi-turn agent workloads. SGLang's RadixAttention caches KV state across turns, so each subsequent turn in a conversation is a shorter prefill. DFlash's block diffusion draft reduces the per-turn decode latency, compounding the RadixAttention benefit.
# SGLang uses --speculative-algorithm for the draft method
# and --speculative-draft-model-path for the checkpoint
# --speculative-num-draft-tokens equivalent to --num-speculative-tokens in vLLM
#
# Example uses Llama-3.1-8B-Instruct with the confirmed z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat checkpoint.
# For other models, verify the checkpoint exists at https://huggingface.co/z-lab before deploying.
# Use a recent SGLang image that includes DFlash support (--speculative-algorithm DFLASH).
# Check https://github.com/z-lab/dflash for the current recommended SGLang version.
docker run --gpus all \
--ipc=host \
-p 30000:30000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
--speculative-num-draft-tokens 8 \
--host 0.0.0.0 \
--port 30000

For agent workloads where each turn adds to a growing context, DFlash + SGLang is the combination to benchmark first. See the SGLang production deployment guide for multi-turn tuning details.
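To see why the two compound, here is a toy prefill count for a four-turn conversation (token counts are made up): without prefix caching, each turn re-prefills the full history; with RadixAttention, only the new suffix is prefilled, and DFlash then accelerates the decode of each reply.

```python
# Prefill tokens per turn, with vs without prefix caching. Illustrative only.
turns = (512, 120, 130, 140)   # turn 1 prompt, then per-turn deltas
history = 0
without_cache = 0              # naive: re-prefill the full history each turn
with_cache = 0                 # RadixAttention: prefill only the new suffix
for delta in turns:
    history += delta
    without_cache += history
    with_cache += delta

assert without_cache == 2808   # 512 + 632 + 762 + 902
assert with_cache == 902       # just the per-turn deltas
```

The gap widens with every turn, which is why multi-turn agents see the largest combined benefit.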
Live Benchmarks: Tokens per Second on A100, H100, and L40S
Llama 3.3 70B Instruct FP8 on H100s; L40S rows use Llama 3.1 8B Instruct FP16 (70B does not fit on a single 48GB GPU). Standard and EAGLE-3 reference values from the speculative decoding production guide, benchmarked March 2026. DFlash values projected from the DFlash paper at 2.5x EAGLE-3; validate on your workload before capacity planning.
| GPU | Price/hr | Mode | Tokens/sec | Cost per 1M tokens |
|---|---|---|---|---|
| H100 PCIe | $2.01 | Standard decoding | ~1,200 | ~$0.47 |
| H100 PCIe | $2.01 | EAGLE-3 | ~3,600 | ~$0.16 |
| H100 PCIe | $2.01 | DFlash (projected) | ~9,000 | ~$0.06 |
| H100 SXM5 | $2.90 | Standard decoding | ~1,500 | ~$0.54 |
| H100 SXM5 | $2.90 | EAGLE-3 | ~4,400 | ~$0.18 |
| H100 SXM5 | $2.90 | DFlash (projected) | ~11,000 | ~$0.07 |
| L40S PCIe† | $0.72 | Standard decoding | ~3,800 | ~$0.053 |
| L40S PCIe† | $0.72 | EAGLE-3 | ~10,000 | ~$0.020 |
| L40S PCIe† | $0.72 | DFlash (projected) | ~25,000 | ~$0.008 |
Cost per 1M tokens formula:
cost_per_1M = (price_per_hour / 3600) / (tokens_per_second / 1_000_000)

For example, H100 PCIe with DFlash: (2.01 / 3600) / (9000 / 1_000_000) = ~$0.062 per million output tokens.
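The same formula as a helper, using the on-demand rates quoted above (14 Apr 2026; substitute live prices before capacity planning):

```python
def cost_per_1m_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million output tokens at a sustained rate."""
    return (price_per_hour / 3600) / (tokens_per_second / 1_000_000)

assert round(cost_per_1m_tokens(2.01, 9_000), 3) == 0.062   # H100 PCIe, DFlash
assert round(cost_per_1m_tokens(0.72, 25_000), 3) == 0.008  # L40S, 8B model
```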
Pricing fluctuates based on GPU availability. The prices above are based on 14 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
When to Use DFlash vs Standard Serving
| Scenario | Recommended Mode | Reason |
|---|---|---|
| Latency-sensitive chat (batch size 1-4) | DFlash | High acceptance on instruction-following; TTFT ~8ms |
| High-concurrency API (batch size 32+) | Standard decoding | Draft overhead exceeds savings at high concurrency |
| Agent workloads with multi-turn context | DFlash + SGLang | RadixAttention + block diffusion compound benefits |
| Short outputs (<50 tokens) | Standard decoding | Speculation setup overhead not worth it |
| Code generation (high acceptance rate) | DFlash | Structured code patterns yield high acceptance |
| Embeddings/batch jobs | Standard decoding | Not applicable; no autoregressive generation |
| No DFlash checkpoint for your model | EAGLE-3 | Best alternative with wide checkpoint availability |
Decision rule:
if batch_size > 32 or output_tokens < 50 or task == "embedding":
use standard decoding
elif dflash_checkpoint_available and acceptance_rate > 0.7:
use DFlash
else:
use EAGLE-3

Production Monitoring
Key vLLM Flags Reference
| Flag | Recommended Value | Notes |
|---|---|---|
--speculative-config | '{"method": "dflash", "model": "<checkpoint>", "num_speculative_tokens": 8, "draft_tensor_parallel_size": 1}' | Full DFlash config; pass as JSON string. Requires vLLM nightly or a release with DFlash support. |
--speculative-disable-by-batch-size | 32 | Auto-fallback to standard decoding at high concurrency |
--speculative-max-model-len | 4096 | Limit draft context; saves VRAM, usually sufficient |
--gpu-memory-utilization | 0.94 | Higher than default to fit both target and DFlash checkpoint |
Metrics to Monitor
# Scrape vLLM Prometheus endpoint
curl http://localhost:8000/metrics | grep spec_decode

Key base metrics (scrape from /metrics):
- vllm:spec_decode_num_accepted_tokens: counter of tokens accepted by the target model via the speculative path.
- vllm:spec_decode_num_draft_tokens: counter of tokens proposed by the DFlash draft head.
- vllm:spec_decode_num_drafts: counter of speculative decode iterations.
- vllm:gpu_cache_usage_perc: watch for VRAM pressure; both models run simultaneously.
Derive acceptance rate and efficiency via PromQL:
# Acceptance rate — target above 0.75 for DFlash on instruction-following workloads
rate(vllm:spec_decode_num_accepted_tokens[1m])
/ rate(vllm:spec_decode_num_draft_tokens[1m])
# Average accepted tokens per draft iteration (proxy for speculation efficiency)
rate(vllm:spec_decode_num_accepted_tokens[1m])
/ rate(vllm:spec_decode_num_drafts[1m])

If acceptance rate drops below 0.50, the DFlash checkpoint is a poor match for your traffic distribution. Reduce --num-speculative-tokens to 4 or switch to EAGLE-3. Acceptance rate is workload-dependent; always benchmark with your actual prompt distribution, not a synthetic benchmark.
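The same derivations in plain Python, for scripts or dashboards that read the raw counters directly (counter values below are made-up examples):

```python
# Acceptance rate and per-round efficiency from the raw spec_decode counters.

def acceptance_rate(accepted_tokens: int, draft_tokens: int) -> float:
    """Fraction of drafted tokens the target model accepted."""
    return accepted_tokens / draft_tokens if draft_tokens else 0.0

def tokens_per_draft(accepted_tokens: int, num_drafts: int) -> float:
    """Average accepted tokens per speculation round."""
    return accepted_tokens / num_drafts if num_drafts else 0.0

accepted, drafted, drafts = 120_000, 160_000, 20_000
assert acceptance_rate(accepted, drafted) == 0.75   # at the healthy floor
assert tokens_per_draft(accepted, drafts) == 6.0    # avg accepted per round
```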
DFlash turns a single H100 rental hour into the equivalent of six hours of standard inference throughput. Spheron provides bare-metal H100, A100, and L40S access with per-minute billing and no long-term commitment.
Rent H100 → | Rent A100 → | Rent L40S → | View all GPU pricing → | Get started on Spheron →
