DiffusionGemma is Google DeepMind's first open-weight text diffusion language model, released June 10, 2026 under Apache 2.0. It uses a 26B Mixture-of-Experts architecture and generates up to 4x faster than autoregressive models of comparable size by refining 256-token blocks in parallel instead of one token at a time. This guide covers VRAM sizing, step-by-step deployment on Spheron GPU cloud, and a cost-per-token comparison against autoregressive Gemma 4.
For background on how text diffusion models work at the architecture level, see the diffusion LLM deployment guide, which covers the masked diffusion approach, parallel denoising mechanics, and the compute-bound vs memory-bandwidth-bound distinction that shapes GPU selection. For the autoregressive Gemma 4 comparison, see the Gemma 4 deployment guide for side-by-side vLLM setup and cost analysis.
TL;DR
| Aspect | Autoregressive Gemma 4 26B MoE | DiffusionGemma 26B MoE |
|---|---|---|
| Architecture | Causal transformer, MoE | Text diffusion, MoE |
| Active parameters per step | 3.8B | 3.8B |
| Generation speed (batch 1-4) | Baseline | 2-4x faster |
| VRAM footprint (BF16) | ~52 GB | ~52 GB |
| Ecosystem maturity | Wide (vLLM, SGLang, TensorRT-LLM) | Growing (vLLM, HF Transformers, MLX, Unsloth, NeMo) |
| Best for | High-concurrency APIs, long context | Interactive chat, speed-critical tasks |
What Is DiffusionGemma
DiffusionGemma is a text diffusion model built on the same Mixture-of-Experts architecture as Gemma 4. Like Gemma 4's 26B MoE variant, it routes each token through a subset of expert layers, activating approximately 3.8B parameters per step rather than all 26B simultaneously.
What makes DiffusionGemma different is the generation mechanism. Autoregressive models like Gemma 4 produce one token at a time, left-to-right, with each step conditioned on all previous tokens. DiffusionGemma starts with a fully masked 256-token output block and iteratively unmasks it over T denoising steps. Each step uses bidirectional attention, meaning every output position attends to every other position simultaneously, not just the tokens to its left.
This bidirectional attention is key. It is what enables the parallel refinement and what makes DiffusionGemma architecturally closer to BERT than to GPT. The tradeoff is that you cannot do prefix caching or incremental decoding the way vLLM does with standard KV caches. For the background on how dLLMs like LLaDA 2 implement this same masked diffusion approach, see the diffusion LLM deployment guide.
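To make the unmasking loop concrete, here is a toy sketch. It is not DiffusionGemma's actual decoding algorithm: the random confidences stand in for a real model's predictions, and the names `toy_denoise` and `MASK`, as well as the even per-step quota, are assumptions chosen purely for illustration.

```python
import math
import random

MASK = None  # hypothetical placeholder for a still-masked position


def toy_denoise(block_size=256, steps=20, seed=0):
    """Unmask a block over at most `steps` passes; returns (block, history)."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    history = []  # number of unmasked positions after each step
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        # Stand-in for one forward pass: every masked position gets a
        # proposed token and a confidence score at the same time.
        proposals = {i: (f"tok{i}", rng.random()) for i in masked}
        # Commit the most confident proposals; the quota is sized so the
        # block is guaranteed to be fully unmasked by the final step.
        quota = math.ceil(len(masked) / (steps - step))
        keep = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in keep:
            block[i] = proposals[i][0]
        history.append(block_size - len(masked) + len(keep))
    return block, history


if __name__ == "__main__":
    _, history = toy_denoise()
    print(history)
```

The point of the sketch is the shape of the loop: each pass touches every masked position at once, and the number of passes is set by T rather than by the block length.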
The release is Apache 2.0 with no gated access on HuggingFace, which makes it straightforward to pull and run.
Why DiffusionGemma Is Fast
The speed advantage comes from four properties working together.
Bidirectional attention. Every position in the 256-token block attends to every other position at each denoising step. The GPU processes all 256 output positions in one forward pass rather than in 256 sequential ones.
Parallel block refinement. Generating 256 tokens requires T forward passes on a 256-position sequence. An autoregressive model doing the same output requires 256 forward passes on sequences growing from 1 to 256 positions. For T=20, that is 20 forward passes vs 256. At T=10, it drops to 10. Latency scales with T, not with output length.
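The forward-pass arithmetic above can be sketched in a few lines. The function name `forward_passes` is hypothetical, and the handling of outputs longer than one block (one set of T passes per 256-token block) is an assumption, not something the published material confirms.

```python
def forward_passes(output_tokens: int, steps: int, block_size: int = 256) -> dict:
    """Count decoder forward passes for each generation style.

    Autoregressive decoding needs one pass per generated token (prefill is
    ignored here). The diffusion count assumes T passes per block.
    """
    blocks = -(-output_tokens // block_size)  # ceiling division
    return {"autoregressive": output_tokens, "diffusion": blocks * steps}


if __name__ == "__main__":
    for steps in (10, 20, 50):
        print(steps, forward_passes(256, steps))
```

Each diffusion pass covers 256 positions and is therefore more expensive than a single autoregressive decode step, so the pass count is an indicator of latency scaling rather than a direct speedup figure.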
The 1,000 tokens/sec headline. Published DiffusionGemma benchmarks show up to 1,000 tokens/sec on H100 hardware at T=10 with batch size 4-8. Google also reports 700+ tokens/sec on RTX 5090. This is a best-case number at low T and medium batch. At T=20, practical throughput is lower, but still 3-5x ahead of autoregressive Gemma 4 at the same batch size.
Compute-bound, not memory-bandwidth-bound. Autoregressive decoding at low batch sizes is memory-bandwidth-bound: the GPU spends most of its time loading weights from HBM to compute units, one token at a time. DiffusionGemma's forward pass runs dense matrix multiplications across all 256 positions simultaneously, which saturates tensor cores instead of memory bandwidth. This shifts the optimal GPU selection: the H200's memory bandwidth advantage over the H100 (4.8 TB/s vs 3.35 TB/s) matters less, and raw FLOPS matter more. The RTX PRO 6000 Blackwell's higher compute density per dollar makes it particularly well-suited.
DiffusionGemma VRAM Requirements
The 26B MoE architecture loads all expert weights into VRAM at startup, even though only ~3.8B activate per forward pass. This is the same behavior as Gemma 4 26B MoE: you pay the full 26B VRAM cost regardless of active parameter count.
| Precision | VRAM Required | Min GPU | Recommended GPU |
|---|---|---|---|
| BF16 | ~52 GB | H100 SXM5 80GB | RTX PRO 6000 96GB |
| INT8 | ~28 GB | L40S 48GB | A100 80GB |
| NVFP4 (official 4-bit) | ~18 GB | RTX 4090 24GB | L40S 48GB |
| INT4 / GGUF Q4 (llama.cpp, arriving soon) | ~14-16 GB | RTX 4090 24GB | L40S 48GB |
VRAM estimates include ~15% framework overhead on top of model weights. Add headroom for batch-level activations during generation. For 4-bit precision, NVFP4 (NVIDIA FP4) is Google's officially supported quantization path, fitting DiffusionGemma in approximately 18GB. GGUF/llama.cpp support is arriving soon but not yet confirmed.
The L40S 48GB can run DiffusionGemma at INT8. At BF16, the 52GB footprint exceeds the L40S's VRAM and requires either INT8 quantization or a higher-VRAM GPU. The RTX PRO 6000 Blackwell at 96GB fits BF16 with 44GB of headroom for large batches and long context windows. For detailed VRAM sizing methodology for MoE models, see the GPU memory requirements guide.
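As a rough cross-check of the sizing above, the weights-only footprint follows directly from parameter count times bytes per parameter. The sketch below assumes a nominal 26B parameters and ignores framework overhead, quantization metadata, and activations, so it gives a lower bound; the helper names are hypothetical.

```python
# Bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weights-only footprint in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billion * BYTES_PER_PARAM[precision]


def headroom_gb(params_billion: float, precision: str, vram_gb: float) -> float:
    """VRAM left after loading weights; negative means the model does not fit."""
    return vram_gb - weight_gb(params_billion, precision)


if __name__ == "__main__":
    for gpu, vram in (("L40S", 48), ("RTX PRO 6000", 96), ("RTX 4090", 24)):
        for precision in BYTES_PER_PARAM:
            print(gpu, precision, headroom_gb(26, precision, vram))
```

A negative result (BF16 on the 48GB L40S, for example) rules a configuration out immediately; a positive one still needs to leave room for overhead and batch activations.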
Deploy DiffusionGemma on L40S and RTX PRO 6000
Both options run well on Spheron. The L40S is the cost-efficient mid-tier choice for INT8; the RTX PRO 6000 handles BF16 with room to spare.
L40S 48GB, INT8
Teams that want bare-metal performance without configuration overhead can spin up an L40S instance on Spheron and have DiffusionGemma serving in under 10 minutes with INT8 quantization.
Environment setup:
python3 -m venv diffgemma
source diffgemma/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install transformers accelerate bitsandbytes fastapi uvicorn huggingface_hub
Download the checkpoint (verify the repo name on HuggingFace before running):
huggingface-cli download google/diffusiongemma-26B-A4B-it \
  --local-dir ./models/diffusiongemma
Load with INT8 quantization and serve:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn
# Load the checkpoint with 8-bit weights so it fits in the L40S's 48GB.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "./models/diffusiongemma",
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./models/diffusiongemma")
app = FastAPI()
class ChatRequest(BaseModel):
    model: str
    messages: list
    max_tokens: int = 256
    num_inference_steps: int = 20  # denoising steps T
@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    # Render the chat history into a single prompt string.
    prompt = tokenizer.apply_chat_template(req.messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_tokens,
            num_inference_steps=req.num_inference_steps,
        )
    # Drop the prompt tokens and decode only the newly generated block.
    text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Start the server:
python server.py
RTX PRO 6000 96GB, BF16
The Spheron RTX PRO 6000 Blackwell variant provides 96GB of GDDR7, enough to run DiffusionGemma in BF16 with significant headroom for larger batches and longer contexts.
Load in full BF16 precision:
model = AutoModelForCausalLM.from_pretrained(
    "./models/diffusiongemma",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
With 96GB available, you can increase batch size and context window substantially. Start with a batch size of 8 and num_inference_steps=20, then tune upward based on memory monitoring (nvidia-smi dmon -s um).
Recommended batch parameters for RTX PRO 6000:
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    num_inference_steps=20,
    do_sample=True,
    temperature=0.7,
)
RTX 4090 24GB, INT4 / GGUF Q4
The 4090's 24GB is tight for DiffusionGemma. Google's officially supported 4-bit path is NVFP4, which fits the model in approximately 18GB and leaves around 6GB for activations. This works for single-request development use. GGUF/llama.cpp support is arriving soon but not yet available.
Install llama-cpp-python for the GGUF INT4 path (verify DiffusionGemma GGUF support before using):
pip install llama-cpp-python --extra-index-url \
  https://abetlen.github.io/llama-cpp-python/whl/cu124
Check the DiffusionGemma HuggingFace model card and GitHub for current GGUF/llama.cpp compatibility before relying on this path. As of June 2026, llama.cpp GGUF support for DiffusionGemma is not yet available; use NVFP4 via HuggingFace Transformers as the supported 4-bit path in the meantime.
vLLM and framework support. DiffusionGemma shipped with day-zero native vLLM support, along with support for HuggingFace Transformers, MLX, Unsloth, and NVIDIA NeMo. For production serving, vLLM is the recommended stack given its continuous batching and memory efficiency. The FastAPI + transformers approach in this guide is a lightweight alternative for custom configurations or development use. For a full guide to setting up vLLM and other frameworks on Spheron, see the Spheron LLM inference docs.
Throughput Benchmarks and Limits
The following estimates are based on architecture analysis of DiffusionGemma's 26B MoE structure with ~3.8B active parameters per step and 256-token parallel block refinement. Actual measured throughput will vary; benchmark on your hardware before committing to production sizing. The methodology is total_output_tokens / total_wall_time throughout, which is the correct comparison metric against autoregressive inference.
| Model | Precision | T (steps) | Batch Size | Tokens/sec (est.) | GPU |
|---|---|---|---|---|---|
| DiffusionGemma 26B MoE | INT8 | 10 | 1 | ~480 | L40S 48GB |
| DiffusionGemma 26B MoE | INT8 | 20 | 1 | ~270 | L40S 48GB |
| DiffusionGemma 26B MoE | INT8 | 50 | 1 | ~110 | L40S 48GB |
| DiffusionGemma 26B MoE | INT8 | 10 | 4 | ~1,400 | L40S 48GB |
| DiffusionGemma 26B MoE | INT8 | 20 | 4 | ~780 | L40S 48GB |
| DiffusionGemma 26B MoE | BF16 | 10 | 1 | ~750 | RTX PRO 6000 96GB |
| DiffusionGemma 26B MoE | BF16 | 20 | 1 | ~420 | RTX PRO 6000 96GB |
| DiffusionGemma 26B MoE | BF16 | 50 | 1 | ~170 | RTX PRO 6000 96GB |
| DiffusionGemma 26B MoE | BF16 | 10 | 4 | ~2,100 | RTX PRO 6000 96GB |
| DiffusionGemma 26B MoE | BF16 | 20 | 4 | ~1,200 | RTX PRO 6000 96GB |
| Gemma 4 26B MoE (AR) | INT8 | n/a | 1 | ~70 | L40S 48GB |
| Gemma 4 26B MoE (AR) | INT8 | n/a | 4 | ~180 | L40S 48GB |
With continuous batching now available in vLLM for both DiffusionGemma and autoregressive Gemma 4, the throughput gap narrows at high batch sizes. At batch size 32+, autoregressive models pull ahead through KV cache reuse across concurrent requests, an advantage DiffusionGemma's bidirectional attention architecture cannot replicate. DiffusionGemma's latency advantage is strongest at batch sizes 1-8, which covers most interactive chat and single-user inference scenarios.
The estimates for batch size 4 at T=10 (~1,400 tokens/sec on L40S, ~2,100 on RTX PRO 6000) exceed the 1,000 tokens/sec headline number from Google's published benchmarks, which were measured on H100-class hardware at low T. Treat them as optimistic until you have measured on your own hardware.
Where DiffusionGemma Wins vs Where Autoregressive Leads
| Scenario | Choose DiffusionGemma | Choose Autoregressive Gemma 4 |
|---|---|---|
| Interactive chat (low batch) | Yes - 3-4x lower latency | No |
| Throughput-critical API | Yes at batch 1-8 | Better at batch 32+ |
| Long-context tasks (>4K tokens) | No - 256-token blocks limit context | Yes |
| Structured output (JSON schema) | Caution - less mature tooling | Yes |
| Streaming token-by-token display | No - generates full block at once | Yes |
| Cost-sensitivity (single GPU) | Better at T=10-20 | Better at high batch |
The streaming limitation is real: DiffusionGemma outputs a 256-token block at once rather than streaming individual tokens. This affects UX for chat interfaces that display tokens as they generate. Workarounds exist (post-process the block into a simulated stream), but they add complexity.
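The simulated-stream workaround can be sketched as a generator that replays a finished block in small pieces. This is a minimal illustration, not part of any DiffusionGemma tooling: the function name `simulate_stream` and its parameters are hypothetical, and the event shape only loosely follows the `delta` chunks and `[DONE]` terminator used by OpenAI-style streaming.

```python
import json
import time
from typing import Iterator


def simulate_stream(text: str, chunk_words: int = 4, delay_s: float = 0.0) -> Iterator[str]:
    """Yield SSE-style events that replay an already generated block."""
    words = text.split(" ")
    for start in range(0, len(words), chunk_words):
        piece = " ".join(words[start:start + chunk_words])
        if start + chunk_words < len(words):
            piece += " "  # keep the separator so the pieces rejoin exactly
        payload = {"choices": [{"delta": {"content": piece}}]}
        yield f"data: {json.dumps(payload)}\n\n"
        if delay_s:
            time.sleep(delay_s)  # optional pacing for the client UI
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for event in simulate_stream("Diffusion models refine a whole block at once.", chunk_words=3):
        print(event, end="")
```

Note that this only changes how the text is displayed. Time to first token is still the full block latency, because nothing can be sent before the last denoising step finishes.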
Cost Analysis: DiffusionGemma vs Autoregressive Gemma 4
Using live Spheron pricing (fetched June 11, 2026) and the throughput estimates from the section above:
| Deployment | On-Demand $/hr | Spot $/hr | Tokens/sec (T=20, batch=1) | Cost/M tokens (on-demand) | Cost/M tokens (spot) |
|---|---|---|---|---|---|
| DiffusionGemma on L40S (INT8) | $0.96 | $0.67 | ~270 | ~$0.99 | ~$0.69 |
| DiffusionGemma on RTX PRO 6000 (BF16) | $2.39 | $1.32 | ~420 | ~$1.58 | ~$0.87 |
| Gemma 4 26B MoE (AR) on L40S | $0.96 | $0.67 | ~70 | ~$3.81 | ~$2.66 |
| Gemma 4 26B MoE (AR) on RTX PRO 6000 | $2.39 | $1.32 | ~100 | ~$6.64 | ~$3.67 |
At T=20, DiffusionGemma runs at roughly $0.99/M tokens on L40S on-demand and $1.58/M on RTX PRO 6000 on-demand. Spot instances bring those down to $0.69/M and $0.87/M respectively. The autoregressive Gemma 4 baseline on the same hardware runs $3.81/M (L40S) to $6.64/M (RTX PRO 6000) on-demand. That is roughly a 4x cost advantage (3.8x on L40S, 4.2x on RTX PRO 6000) for DiffusionGemma on single-GPU interactive workloads.
At T=10, DiffusionGemma's cost per token drops another ~45%: around $0.56/M tokens on L40S on-demand at batch=1 (or ~$0.39/M on spot). The quality tradeoff at T=10 is noticeable but acceptable for many non-critical tasks. Benchmark T=10 vs T=20 on your workload before committing.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 11 Jun 2026 and may have changed. Check current GPU pricing for live rates.
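The cost-per-token figures in this section come from one formula: hourly price divided by 3,600, divided by tokens per second, scaled to a million tokens. A small sketch, using the prices and throughput estimates quoted above as inputs (the function name is hypothetical):

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per one million output tokens at a sustained throughput."""
    cost_per_second = price_per_hour / 3600
    return cost_per_second / tokens_per_sec * 1_000_000


if __name__ == "__main__":
    # (label, on-demand $/hr, estimated tokens/sec) taken from the table above.
    rows = [
        ("DiffusionGemma on L40S (INT8), T=20", 0.96, 270),
        ("DiffusionGemma on RTX PRO 6000 (BF16), T=20", 2.39, 420),
        ("Gemma 4 26B MoE (AR) on L40S", 0.96, 70),
        ("Gemma 4 26B MoE (AR) on RTX PRO 6000", 2.39, 100),
    ]
    for label, price, tps in rows:
        print(f"{label}: ${cost_per_million_tokens(price, tps):.2f}/M tokens")
```

Swapping in your own measured tokens/sec and the current hourly rate is the quickest way to check whether the cost advantage holds for your workload.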
Production Checklist
Quality caveat from Google. Google recommends standard Gemma 4 for quality-critical production workloads and classifies DiffusionGemma as experimental. Output quality is lower than autoregressive Gemma 4, particularly at low denoising steps (T=10-20). Before routing quality-sensitive traffic to DiffusionGemma, evaluate its output on your actual use case against Gemma 4.
Before routing live traffic to DiffusionGemma:
- GPU selection. RTX PRO 6000 for BF16 and maximum throughput. L40S for INT8 when cost is the primary constraint. RTX 4090 only for dev/test.
- Tune T before deploying. Run your actual use case prompts at T=10, 20, and 50. Measure both quality and latency. Many workloads get acceptable quality at T=15-20. Do not default to T=50 without measuring what you gain vs T=20.
- Bind to localhost in production. The FastAPI server in this guide uses --host 0.0.0.0. For production, bind to 127.0.0.1 and put a reverse proxy (nginx or Caddy) in front. Never expose the raw inference server directly.
- Add authentication. Even internal endpoints should require an API key. Add a Depends(verify_api_key) FastAPI dependency before going live.
- Monitor GPU utilization. During active generation, GPU utilization should be 70-90%. Below 60% suggests the serving stack is not feeding the GPU fast enough. Use nvidia-smi dmon -s pu to track during load tests.
- Check framework compatibility before each framework upgrade. DiffusionGemma uses a non-standard decoding loop. Upgrading transformers or PyTorch can silently break the num_inference_steps parameter. Pin dependency versions in production until you have tested the upgrade.
- Block streaming requests. If your endpoint exposes streaming via SSE, add an explicit block for requests with stream: true until you have tested DiffusionGemma's block-generation behavior with your streaming client. The output shape differs from autoregressive streaming.
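The authentication and streaming items in the checklist can be sketched as one framework-free guard function. This is an illustration under stated assumptions: `check_request` is a hypothetical helper, the key is assumed to arrive as a Bearer token in a lowercase `authorization` header, and in a FastAPI app this logic would sit inside the verify_api_key dependency mentioned above.

```python
import hmac
from typing import Mapping, Optional, Tuple


def check_request(
    headers: Mapping[str, str],
    body: Mapping[str, object],
    expected_key: str,
) -> Tuple[int, Optional[str]]:
    """Return (status_code, error); (200, None) means the request may proceed."""
    supplied = headers.get("authorization", "")
    prefix = "Bearer "
    token = supplied[len(prefix):] if supplied.startswith(prefix) else ""
    # compare_digest avoids leaking key contents through timing differences.
    # An empty expected key is treated as a misconfiguration and rejected.
    if not expected_key or not hmac.compare_digest(token.encode(), expected_key.encode()):
        return 401, "missing or invalid API key"
    # Reject streaming explicitly until block-style output has been tested.
    if body.get("stream") is True:
        return 400, "streaming is not supported by this endpoint"
    return 200, None


if __name__ == "__main__":
    print(check_request({"authorization": "Bearer s3cret"}, {"stream": True}, "s3cret"))
```

Keeping the check as a pure function makes it easy to unit test before wiring it into the server.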
DiffusionGemma's ~52 GB BF16 footprint runs cleanly on a single RTX PRO 6000, and INT8 quantization brings it down to 28GB for the L40S. Run it on Spheron's L40S instances with per-minute billing and no upfront commitment.
Quick Setup Guide
Log in to app.spheron.ai, navigate to GPU Instances, and launch an L40S (48GB) or RTX PRO 6000 instance with Ubuntu 22.04 and CUDA 12.4. SSH into the instance and confirm the GPU is visible with nvidia-smi. The RTX PRO 6000 Blackwell (96GB) is the recommended starting point for DiffusionGemma BF16 since it fits the full 26B MoE weights comfortably. The L40S 48GB works with INT8 quantization and leaves enough headroom for batching.
Create a virtual environment: python3 -m venv diffgemma && source diffgemma/bin/activate. Install PyTorch with CUDA 12.4 (pip install torch --index-url https://download.pytorch.org/whl/cu124), then install transformers, accelerate, fastapi, uvicorn, and huggingface_hub. Confirm CUDA access: python -c 'import torch; print(torch.cuda.get_device_name(0))'.
DiffusionGemma is released under Apache 2.0, no gated access required. Verify the exact repository name at https://huggingface.co/google before downloading, as Google may update repo paths after launch. Pull the checkpoint: huggingface-cli download google/diffusiongemma-26B-A4B-it --local-dir ./models/diffusiongemma. Check the model card for any dependency notes before running.
Create server.py using FastAPI and the transformers pipeline to serve /v1/chat/completions requests. Load the model with AutoModelForCausalLM.from_pretrained('./models/diffusiongemma', torch_dtype=torch.bfloat16, device_map='auto'). Expose port 8000 and run with uvicorn server:app --host 0.0.0.0 --port 8000. Test with: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"diffusiongemma","messages":[{"role":"user","content":"Explain diffusion LLMs in two sentences."}],"max_tokens":256}'.
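The same smoke test as the curl command can be sketched with the Python standard library. The helper names `build_chat_request` and `send` are hypothetical; the payload fields mirror the ChatRequest model used by the server in this guide, and the default URL assumes the server is running locally on port 8000.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    max_tokens: int = 256,
    num_inference_steps: int = 20,
) -> urllib.request.Request:
    """Build a POST request for the guide's /v1/chat/completions endpoint."""
    payload = {
        "model": "diffusiongemma",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "num_inference_steps": num_inference_steps,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires the server from this guide to be running.
    print(send(build_chat_request("Explain diffusion LLMs in two sentences.")))
```

Separating request construction from sending keeps the payload easy to inspect when the server returns a validation error.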
DiffusionGemma's generation quality and latency are controlled by the number of denoising steps T. Set T=10 for minimum latency (interactive chat), T=20-30 for balanced quality (default), T=50 for maximum coherence (quality-critical tasks). Pass num_inference_steps=T to the model's generate() call. Monitor tokens/sec at T=10, 20, and 50 to calibrate for your workload. Lower T reduces latency nearly linearly.
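To see what "nearly linearly" means in practice, the throughput estimates from this guide can be converted into per-block latency. The figures below are the L40S INT8, batch-size-1 estimates from the benchmark table, not measurements, and the helper name is hypothetical.

```python
BLOCK_TOKENS = 256

# Estimated tokens/sec at each step count T (L40S, INT8, batch size 1).
L40S_INT8_TOKENS_PER_SEC = {10: 480, 20: 270, 50: 110}


def block_latency_s(tokens_per_sec: float, block_tokens: int = BLOCK_TOKENS) -> float:
    """Seconds to produce one full block at the given throughput."""
    return block_tokens / tokens_per_sec


if __name__ == "__main__":
    for steps, tps in L40S_INT8_TOKENS_PER_SEC.items():
        latency = block_latency_s(tps)
        per_step_ms = 1000 * latency / steps
        print(f"T={steps}: {latency:.2f} s per block, {per_step_ms:.1f} ms per step")
```

With these estimates the time per denoising step stays within a narrow band (roughly 47 to 53 ms) across T=10, 20, and 50, so total latency is close to proportional to T.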
Benchmark at batch sizes 1, 4, and 8 using a fixed 256-token output prompt. Record total_output_tokens / wall_time_seconds for each configuration. Divide the GPU's on-demand $/hr by 3,600 to get the per-second cost, then divide by tokens/sec to get the cost per output token. Compare this against an autoregressive Gemma 4 26B MoE deployment on the same GPU for a like-for-like cost-per-token comparison.
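A minimal harness for the throughput measurement described above might look like the following. It is a sketch: `measure_throughput` is a hypothetical helper, the `generate` callable is expected to return the number of output tokens produced for each prompt, and `fake_generate` is a stand-in so the harness can be exercised without a GPU.

```python
import time
from typing import Callable, Dict, Sequence


def measure_throughput(
    generate: Callable[[Sequence[str]], Sequence[int]],
    prompts: Sequence[str],
    runs: int = 3,
) -> Dict[str, float]:
    """Time `runs` batches and report total_output_tokens / wall_time_seconds."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += sum(generate(prompts))
    seconds = time.perf_counter() - start
    return {
        "tokens": total_tokens,
        "seconds": seconds,
        "tokens_per_sec": total_tokens / seconds,
    }


def fake_generate(prompts: Sequence[str]) -> Sequence[int]:
    """Placeholder for a real model call; pretends each prompt yields 256 tokens."""
    time.sleep(0.01)
    return [256] * len(prompts)


if __name__ == "__main__":
    for batch_size in (1, 4, 8):
        result = measure_throughput(fake_generate, ["prompt"] * batch_size)
        print(batch_size, round(result["tokens_per_sec"]))
```

When replacing `fake_generate` with a real call, run one untimed warm-up batch first so that model loading and kernel compilation do not distort the first measurement.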
Frequently Asked Questions
DiffusionGemma is Google DeepMind's first open-weight text diffusion language model, released June 10, 2026 under Apache 2.0. It uses a 26B Mixture-of-Experts architecture and generates text by iteratively refining 256-token blocks in parallel rather than producing tokens one at a time. This parallel block approach is what enables generation up to 4x faster than autoregressive models of similar size.
DiffusionGemma's 26B MoE weights occupy roughly 52GB of VRAM at BF16 (the full model loads into memory even though only a subset of experts activate per step). After INT8 quantization, that drops to approximately 28GB. Google's native NVFP4 (4-bit floating point) is the officially supported 4-bit path and brings it to approximately 18GB, making the RTX 4090 (24GB) viable. GGUF/llama.cpp support is arriving soon. This makes DiffusionGemma compatible with an L40S (48GB) at INT8, RTX PRO 6000 (96GB) at BF16, or RTX 4090 (24GB) at NVFP4.
DiffusionGemma generates tokens up to 4x faster than autoregressive Gemma 4 at low batch sizes (1-4 concurrent requests), because it refines all positions in a 256-token block in parallel rather than one at a time. At high batch sizes (32+), autoregressive models with continuous batching close the gap. DiffusionGemma is best for interactive or speed-critical workloads; autoregressive Gemma 4 leads on long-context tasks and structured output reliability.
Yes, with some configuration. DiffusionGemma's 26B MoE weights at BF16 require roughly 52GB VRAM, which exceeds the L40S's 48GB. The L40S 48GB can run DiffusionGemma with INT8 quantization (around 28GB). The RTX PRO 6000 Blackwell's 96GB GDDR7 comfortably fits the BF16 checkpoint with headroom for large batches. Both GPUs are available on Spheron.
DiffusionGemma can be served with vLLM directly, which shipped with day-zero DiffusionGemma support on release. It also supports HuggingFace Transformers, MLX, Unsloth, and NVIDIA NeMo. For a lightweight custom endpoint, you can wrap the HuggingFace generate() call with FastAPI, exposing a /v1/chat/completions endpoint that matches the OpenAI schema. The deployment section of this guide includes a complete example using that approach.
