Standard speculative decoding with EAGLE-3 delivers 3-4x LLM throughput improvement. DFlash gets to 6x, and does it without any quality tradeoff. The difference is in how the draft phase works: instead of generating candidate tokens one at a time, DFlash produces a full block of K tokens in a single forward pass using a block diffusion model.
If you are new to speculative decoding, start with the speculative decoding production guide before continuing. This post assumes familiarity with draft-verify mechanics and focuses on what DFlash does differently.
TL;DR
| Mode | Tokens/sec (H100 PCIe) | TTFT p50 | Cost per 1M output tokens | Best for |
|---|---|---|---|---|
| Standard decoding | ~1,200 | ~45 ms | ~$0.47 | High concurrency (32+ req), batch jobs |
| Draft model (Llama 3.2 1B) | ~2,600 | ~20 ms | ~$0.21 | Low-concurrency chat, interactive APIs |
| EAGLE-3 | ~3,600 | ~15 ms | ~$0.16 | Instruction-following, coding, agents |
| P-EAGLE | ~4,500 | ~13 ms | ~$0.12 | Coding, reasoning, multi-tree |
| DFlash | ~9,000 | ~8 ms | ~$0.06 | Best throughput when DFlash checkpoint available |
Numbers from Llama 3.3 70B FP8, H100 PCIe at $2.01/hr, batch size 1-4, vLLM benchmark_serving.py, 200 prompts, 512 input / 256 output tokens. DFlash numbers projected from the DFlash paper (2.5x over EAGLE-3). See the benchmarks section for full results with current pricing.
What Is DFlash
Speculative decoding has a sequential bottleneck in the draft phase. With EAGLE-3, the draft head generates tokens one at a time: tok1, then tok2, then tok3. Each step depends on the previous one. You pay the draft model's compute cost once per token generated.
DFlash removes that bottleneck entirely. It replaces the autoregressive draft head with a block diffusion model. The diffusion model receives the target model's hidden states and generates K masked positions. A single denoising step fills all K positions simultaneously. The target model then verifies all K in one forward pass, same as before.
The DFlash paper reports:
- 6x lossless acceleration over standard autoregressive decoding
- 2.5x improvement over EAGLE-3
Both figures are from instruction-following workloads at low batch sizes. Real-world speedup depends on your acceptance rate, sequence length, and batch size. Treat these as targets, not guarantees, until you benchmark on your actual traffic.
How DFlash Works: Single-Pass Block Drafting
Here is the step-by-step loop:
- Target model generates the first token normally (autoregressive bootstrapping).
- DFlash draft head receives the target model's hidden states.
- Block diffusion denoising fills K masked positions in one forward pass.
- Target model verifies all K candidate tokens in a single forward pass.
- Accepted tokens are kept up to the first rejection; the target model resamples from that position.
Draft (DFlash block diffusion): [tok1] [tok2] [tok3] [tok4] [tok5] [tok6] [tok7] [tok8]
-- all generated in ONE forward pass --
Target verification: ✓ ✓ ✓ ✓ ✓ ✗ (tok7-tok8 discarded after the first rejection)
Result: tok1-tok5 accepted; resample from tok6

Compare with EAGLE-3, which generates tokens autoregressively in the draft phase:
Draft (EAGLE-3 autoregressive): tok1 -> tok2 -> tok3 -> tok4 -> tok5 -> tok6
-- 6 sequential draft steps --

DFlash's draft cost is roughly constant regardless of K. If you set --num-speculative-tokens 8, DFlash does one draft forward pass to generate 8 candidates; EAGLE-3 does 8 sequential draft steps. At K=8, DFlash spends roughly 8x less compute in the draft phase than EAGLE-3, which is why the overall throughput improvement is so large.
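The draft-then-verify round above can be sketched in a few lines of Python. This is a toy illustration, not the vLLM implementation: `draft_block` stands in for the single-pass block diffusion draft, and the target's tokens are hard-coded to reproduce the diagram.

```python
# Toy model of one DFlash speculation round. All names are illustrative.

def draft_block(context, k):
    """Stand-in for the block diffusion draft: proposes K tokens in ONE
    forward pass, unlike an autoregressive head's K sequential steps."""
    return [context[-1] + i + 1 for i in range(k)]

def verify(draft, target_next):
    """Keep draft tokens up to the first disagreement with the target model."""
    accepted = []
    for tok, tgt in zip(draft, target_next):
        if tok != tgt:
            break  # first rejection: the target resamples from this position
        accepted.append(tok)
    return accepted

context = [0]
draft = draft_block(context, k=8)         # one draft pass for all 8 candidates
target = [1, 2, 3, 4, 5, 99, 7, 8]        # target disagrees at position 6
accepted = verify(draft, target)
assert accepted == [1, 2, 3, 4, 5]        # tok1-tok5 kept; resample from tok6
```

Whether the accepted prefix is long or short, the draft phase cost one forward pass; with EAGLE-3 the same round would have cost eight.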
DFlash vs EAGLE-3 vs Medusa vs Standard Speculative Decoding
| Method | Draft Generation | Tokens/step (H100) | Acceptance Rate | Max Speedup | Requires Finetuning |
|---|---|---|---|---|---|
| Standard draft model | Autoregressive (1B-3B) | 1 draft per step | 0.60-0.75 | ~2.5x | No |
| Medusa | Parallel heads on target | K per step | 0.65-0.80 | ~2.8x | Yes (head finetuning) |
| EAGLE-3 | Autoregressive, feature-based | 1 per step | 0.75-0.85 | ~4.8x | Yes (EAGLE checkpoint) |
| P-EAGLE | Parallel multi-tree | K per step | 0.75-0.85 | ~5-6x† (vs std. decoding) | Yes (EAGLE checkpoint) |
| DFlash | Block diffusion (single pass) | K in 1 pass | 0.75-0.85 | ~6x | Yes (DFlash checkpoint) |
† P-EAGLE reports a 1.10-1.36x improvement over EAGLE-3, which itself runs at ~4.8x vs standard decoding; that converts to ~5.3-6.5x against the standard-decoding baseline used by all other rows in this table. See arXiv:2602.01469 for the P-EAGLE paper.
When to pick each:
- Standard draft model: fastest to deploy, no special checkpoints. Works with any model if a small same-family model exists. Best starting point.
- Medusa: higher acceptance on workloads matching its trained heads. Less portable.
- EAGLE-3: best general-purpose speculative method before DFlash. Wide checkpoint availability.
- DFlash: highest throughput when DFlash checkpoints exist for your model. If they do not, use EAGLE-3.
Supported Models
Qwen Family
DFlash draft checkpoints are available for the Qwen3 and Qwen3.5 series. Checkpoints for Qwen2.5 are not available. Check the DFlash organization on Hugging Face (https://huggingface.co/z-lab) for the latest list. Always verify the checkpoint matches your exact target model variant before deploying.
Available as of the DFlash paper release:
- Qwen3-4B
- Qwen3-8B (verify checkpoint availability at https://huggingface.co/z-lab before deploying)
- Qwen3-Coder-30B-A3B
- Qwen3.5-9B
- Qwen3.5-27B
- Qwen3.5-35B-A3B
LLaMA Family
DFlash checkpoints for the LLaMA family are more limited. Only the following checkpoint has been confirmed on Hugging Face as of the DFlash paper release. Check https://huggingface.co/z-lab before deploying any other LLaMA variant.
Confirmed:
- Llama-3.1-8B-Instruct
Larger variants (70B) should be verified against the z-lab Hugging Face organization before deploying, as availability may have changed since the paper release.
If DFlash checkpoints do not exist for your model, EAGLE-3 remains the best alternative. See the speculative decoding production guide for EAGLE-3 setup.
GPU Requirements and Cost Analysis
VRAM Requirements
| Target Model | Precision | DFlash Checkpoint | GPU | VRAM Used | Notes |
|---|---|---|---|---|---|
| Llama 3.3 70B | FP8 | DFlash-LLaMA3.3-70B | H100 80GB | ~78 GB | Fits with --gpu-memory-utilization 0.94. Verify checkpoint at https://huggingface.co/z-lab before use. |
| Llama 3.1 8B | FP16 | LLaMA3.1-8B-Instruct-DFlash-UltraChat | L40S 48GB | ~20 GB | Plenty of headroom for KV cache |
| Qwen3.5-35B-A3B | FP16 | DFlash-Qwen3.5-35B-A3B | H100 80GB | ~70 GB | Check z-lab HF for latest checkpoint |
| Llama 3.3 70B | FP8 | DFlash-LLaMA3.3-70B | A100 80GB | ~78 GB | ~65% throughput vs H100 PCIe. Verify checkpoint at https://huggingface.co/z-lab before use. |
The DFlash draft checkpoint adds a few GB of VRAM overhead over the target model alone (exact amount depends on the checkpoint; check actual usage with nvidia-smi after loading), comparable to an EAGLE-3 checkpoint. Set --gpu-memory-utilization 0.94 and --speculative-draft-tensor-parallel-size 1 in all configurations. For strategies to maximize the remaining KV cache headroom, see the KV cache optimization guide.
Per-GPU Cost
| GPU | On-Demand Price/hr | Spot Price/hr | Tokens/sec with DFlash | Cost per 1M output tokens |
|---|---|---|---|---|
| H100 PCIe | $2.01 | N/A | ~9,000 (projected) | ~$0.06 |
| H100 SXM5 | $2.90 | $0.80 | ~11,000 (projected) | ~$0.07 |
| A100 80GB PCIe | $1.04 | $1.14‡ | ~5,500 (projected) | ~$0.05 |
| L40S 48GB | $0.72 | $0.32 | ~25,000 (projected, 8B model) | ~$0.008 |
DFlash tokens/sec are projected from the paper's 2.5x improvement over EAGLE-3. Validate on your workload before planning capacity. Cost formula: (price_per_hour / 3600) / (tokens_per_second / 1_000_000).
‡ A100 80GB PCIe spot pricing ($1.14/hr) currently exceeds the on-demand rate ($1.04/hr). Spot pricing can occasionally exceed on-demand depending on availability and demand. On-demand is the better choice for this GPU; check current GPU pricing → for live rates.
Pricing fluctuates based on GPU availability. The prices above are based on 14 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Step-by-Step: Deploy DFlash with vLLM on GPU Cloud
Prerequisites
Docker with NVIDIA runtime on a Spheron instance. DFlash requires a vLLM nightly build (or a release with DFlash support; v0.8.0 does not include DFlash). Hugging Face token if using gated models. See Spheron's LLM quick-guides for first-time instance setup.
For 7B-13B models, L40S 48GB is sufficient. For 70B models you need an H100 80GB or A100 80GB. Verify that a DFlash checkpoint exists for your target model at https://huggingface.co/z-lab before proceeding.
Single-GPU Setup (H100)
# DFlash is configured via --speculative-config JSON object
# method: "dflash" selects block diffusion drafting
# model: path to the DFlash checkpoint for your target model (use z-lab org on Hugging Face)
# num_speculative_tokens: 8 works well; single-pass draft makes higher K efficient
# draft_tensor_parallel_size: keep at 1; draft fits on single GPU
# --gpu-memory-utilization: 0.94 to fit both target and DFlash checkpoint
#
# Example uses Llama-3.1-8B-Instruct with the confirmed z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat checkpoint.
# For other models, verify the checkpoint exists at https://huggingface.co/z-lab before deploying.
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:nightly \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-config '{"method": "dflash", "model": "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat", "num_speculative_tokens": 8, "draft_tensor_parallel_size": 1}' \
--max-model-len 8192 \
--gpu-memory-utilization 0.94 \
--max-num-seqs 32 \
--speculative-disable-by-batch-size 32 \
--host 0.0.0.0 \
--port 8000

Note: use a vLLM nightly image (vllm/vllm-openai:nightly) until DFlash lands in a stable release. Check the DFlash GitHub repository for the current recommended vLLM version and any updated flag names.
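Once the container is up, a quick smoke test confirms the endpoint responds. Speculation is transparent to callers: the request below is a standard OpenAI-compatible completions call, identical to what you would send to a non-speculative vLLM server. The snippet builds the request but leaves the actual send commented out so it can run without a live server.

```python
import json
import urllib.request

# Standard OpenAI-compatible completions payload; nothing DFlash-specific.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Write a haiku about GPUs.",
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:        # requires a running server
#     print(json.load(resp)["choices"][0]["text"])
```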
Tuning num-speculative-tokens for DFlash
DFlash's single-pass drafting changes the tuning calculus compared to EAGLE-3. With autoregressive drafting, higher K means more sequential draft steps. With DFlash, K tokens cost the same as 1 token in the draft phase. So higher K values are more efficient with DFlash than with EAGLE-3.
Recommended starting points:
- K=8 for instruction-following and coding workloads (high acceptance rate)
- K=6 for mixed workloads
- K=4 for high-entropy or creative tasks
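A quick way to reason about these defaults: under a simplified model where each draft token is accepted independently with probability a, the expected tokens gained per round is a geometric sum. Real acceptance is position-dependent, so treat this as a rough guide, not a prediction.

```python
# Expected tokens per speculation round under an i.i.d. acceptance model:
# sum of a^i accepted drafts for i = 1..K, plus the one token the target
# model always contributes. With single-pass drafting, raising K adds
# candidates at no extra draft cost, so the flattening of this curve
# (not draft compute) is what caps useful K.

def expected_tokens_per_round(a: float, k: int) -> float:
    return sum(a ** i for i in range(1, k + 1)) + 1.0

for k in (4, 6, 8):
    print(f"K={k}: {expected_tokens_per_round(0.8, k):.2f} tokens/round")
```

At a = 0.8 this gives roughly 3.36, 3.95, and 4.33 tokens per round for K = 4, 6, and 8: diminishing returns, but still returns, which is why DFlash can default to a higher K than EAGLE-3.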
Always include --speculative-disable-by-batch-size 32 in production. This auto-disables speculation when concurrent requests exceed 32, preventing throughput degradation at high load. The API caller sees no difference; the switch is transparent. For more on how continuous batching and paged attention interact with speculation, see the LLM serving optimization guide.
Step-by-Step: Deploy DFlash with SGLang for Multi-Turn Agent Workloads
SGLang is particularly effective with DFlash on multi-turn agent workloads. SGLang's RadixAttention caches KV state across turns, so each subsequent turn in a conversation is a shorter prefill. DFlash's block diffusion draft reduces the per-turn decode latency, compounding the RadixAttention benefit.
# SGLang uses --speculative-algorithm for the draft method
# and --speculative-draft-model-path for the checkpoint
# --speculative-num-draft-tokens equivalent to --num-speculative-tokens in vLLM
#
# Example uses Llama-3.1-8B-Instruct with the confirmed z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat checkpoint.
# For other models, verify the checkpoint exists at https://huggingface.co/z-lab before deploying.
# Use a recent SGLang image that includes DFlash support (--speculative-algorithm DFLASH).
# Check https://github.com/z-lab/dflash for the current recommended SGLang version.
docker run --gpus all \
--ipc=host \
-p 30000:30000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
--speculative-num-draft-tokens 8 \
--host 0.0.0.0 \
--port 30000

For agent workloads where each turn adds to a growing context, DFlash + SGLang is the combination to benchmark first. See the SGLang production deployment guide for multi-turn tuning details.
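To see why the two compound, here is a toy prefill count for a four-turn conversation (token counts are made up): without prefix caching, each turn re-prefills the full history; with RadixAttention, only the new suffix is prefilled, and DFlash then accelerates the decode of each reply.

```python
# Prefill tokens per turn, with vs without prefix caching. Illustrative only.
turns = (512, 120, 130, 140)   # turn 1 prompt, then per-turn deltas
history = 0
without_cache = 0              # naive: re-prefill the full history each turn
with_cache = 0                 # RadixAttention: prefill only the new suffix
for delta in turns:
    history += delta
    without_cache += history
    with_cache += delta

assert without_cache == 2808   # 512 + 632 + 762 + 902
assert with_cache == 902       # just the per-turn deltas
```

The gap widens with every turn, which is why multi-turn agents see the largest combined benefit.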
Live Benchmarks: Tokens per Second on A100, H100, and L40S
Llama 3.3 70B Instruct FP8 on H100s; L40S rows use Llama 3.1 8B Instruct FP16 (70B does not fit on a single 48GB GPU). Standard and EAGLE-3 reference values from the speculative decoding production guide, benchmarked March 2026. DFlash values projected from the DFlash paper at 2.5x EAGLE-3; validate on your workload before capacity planning.
| GPU | Price/hr | Mode | Tokens/sec | Cost per 1M tokens |
|---|---|---|---|---|
| H100 PCIe | $2.01 | Standard decoding | ~1,200 | ~$0.47 |
| H100 PCIe | $2.01 | EAGLE-3 | ~3,600 | ~$0.16 |
| H100 PCIe | $2.01 | DFlash (projected) | ~9,000 | ~$0.06 |
| H100 SXM5 | $2.90 | Standard decoding | ~1,500 | ~$0.54 |
| H100 SXM5 | $2.90 | EAGLE-3 | ~4,400 | ~$0.18 |
| H100 SXM5 | $2.90 | DFlash (projected) | ~11,000 | ~$0.07 |
| L40S PCIe† | $0.72 | Standard decoding | ~3,800 | ~$0.053 |
| L40S PCIe† | $0.72 | EAGLE-3 | ~10,000 | ~$0.020 |
| L40S PCIe† | $0.72 | DFlash (projected) | ~25,000 | ~$0.008 |
Cost per 1M tokens formula:
cost_per_1M = (price_per_hour / 3600) / (tokens_per_second / 1_000_000)

For example, H100 PCIe with DFlash: (2.01 / 3600) / (9000 / 1_000_000) = ~$0.062 per million output tokens.
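The same formula as a helper, using the on-demand rates quoted above (14 Apr 2026; substitute live prices before capacity planning):

```python
def cost_per_1m_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million output tokens at a sustained rate."""
    return (price_per_hour / 3600) / (tokens_per_second / 1_000_000)

assert round(cost_per_1m_tokens(2.01, 9_000), 3) == 0.062   # H100 PCIe, DFlash
assert round(cost_per_1m_tokens(0.72, 25_000), 3) == 0.008  # L40S, 8B model
```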
Pricing fluctuates based on GPU availability. The prices above are based on 14 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
When to Use DFlash vs Standard Serving
| Scenario | Recommended Mode | Reason |
|---|---|---|
| Latency-sensitive chat (batch size 1-4) | DFlash | High acceptance on instruction-following; TTFT ~8ms |
| High-concurrency API (batch size 32+) | Standard decoding | Draft overhead exceeds savings at high concurrency |
| Agent workloads with multi-turn context | DFlash + SGLang | RadixAttention + block diffusion compound benefits |
| Short outputs (<50 tokens) | Standard decoding | Speculation setup overhead not worth it |
| Code generation (high acceptance rate) | DFlash | Structured code patterns yield high acceptance |
| Embeddings/batch jobs | Standard decoding | Not applicable; no autoregressive generation |
| No DFlash checkpoint for your model | EAGLE-3 | Best alternative with wide checkpoint availability |
Decision rule:
if batch_size > 32 or output_tokens < 50 or task == "embedding":
use standard decoding
elif dflash_checkpoint_available and acceptance_rate > 0.7:
use DFlash
else:
use EAGLE-3

Production Monitoring
Key vLLM Flags Reference
| Flag | Recommended Value | Notes |
|---|---|---|
--speculative-config | '{"method": "dflash", "model": "<checkpoint>", "num_speculative_tokens": 8, "draft_tensor_parallel_size": 1}' | Full DFlash config; pass as JSON string. Requires vLLM nightly or a release with DFlash support. |
--speculative-disable-by-batch-size | 32 | Auto-fallback to standard decoding at high concurrency |
--speculative-max-model-len | 4096 | Limit draft context; saves VRAM, usually sufficient |
--gpu-memory-utilization | 0.94 | Higher than default to fit both target and DFlash checkpoint |
Metrics to Monitor
# Scrape vLLM Prometheus endpoint
curl http://localhost:8000/metrics | grep spec_decode

Key base metrics (scrape from /metrics):
- vllm:spec_decode_num_accepted_tokens: counter of tokens accepted by the target model via the speculative path.
- vllm:spec_decode_num_draft_tokens: counter of tokens proposed by the DFlash draft head.
- vllm:spec_decode_num_drafts: counter of speculative decode iterations.
- vllm:gpu_cache_usage_perc: watch for VRAM pressure; both models run simultaneously.
Derive acceptance rate and efficiency via PromQL:
# Acceptance rate — target above 0.75 for DFlash on instruction-following workloads
rate(vllm:spec_decode_num_accepted_tokens[1m])
/ rate(vllm:spec_decode_num_draft_tokens[1m])
# Average accepted tokens per draft iteration (proxy for speculation efficiency)
rate(vllm:spec_decode_num_accepted_tokens[1m])
/ rate(vllm:spec_decode_num_drafts[1m])

If acceptance rate drops below 0.50, the DFlash checkpoint is a poor match for your traffic distribution. Reduce --num-speculative-tokens to 4 or switch to EAGLE-3. Acceptance rate is workload-dependent; always benchmark with your actual prompt distribution, not a synthetic benchmark.
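The same derivations in plain Python, for scripts or dashboards that read the raw counters directly (counter values below are made-up examples):

```python
# Acceptance rate and per-round efficiency from the raw spec_decode counters.

def acceptance_rate(accepted_tokens: int, draft_tokens: int) -> float:
    """Fraction of drafted tokens the target model accepted."""
    return accepted_tokens / draft_tokens if draft_tokens else 0.0

def tokens_per_draft(accepted_tokens: int, num_drafts: int) -> float:
    """Average accepted tokens per speculation round."""
    return accepted_tokens / num_drafts if num_drafts else 0.0

accepted, drafted, drafts = 120_000, 160_000, 20_000
assert acceptance_rate(accepted, drafted) == 0.75   # at the healthy floor
assert tokens_per_draft(accepted, drafts) == 6.0    # avg accepted per round
```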
DFlash turns a single H100 rental hour into the equivalent of six hours of standard inference throughput. Spheron provides bare-metal H100, A100, and L40S access with per-minute billing and no long-term commitment.
Rent H100 → | Rent A100 → | Rent L40S → | View all GPU pricing → | Get started on Spheron →
