Tutorial

Deploy DiffusionGemma on GPU Cloud: Self-Host Google's 26B Text Diffusion Model for 4x Faster Generation (2026 Guide)

DiffusionGemma is Google DeepMind's first open-weight text diffusion language model, released June 10, 2026 under Apache 2.0. It uses a 26B Mixture-of-Experts architecture and generates up to 4x faster than autoregressive models of comparable size by refining 256-token blocks in parallel instead of one token at a time. This guide covers VRAM sizing, step-by-step deployment on Spheron GPU cloud, and a cost-per-token comparison against autoregressive Gemma 4.

For background on how text diffusion models work at the architecture level, see the diffusion LLM deployment guide, which covers the masked diffusion approach, parallel denoising mechanics, and the compute-bound vs memory-bandwidth-bound distinction that shapes GPU selection. For the autoregressive Gemma 4 comparison, see the Gemma 4 deployment guide for side-by-side vLLM setup and cost analysis.

TL;DR

| Aspect | Autoregressive Gemma 4 26B MoE | DiffusionGemma 26B MoE |
| --- | --- | --- |
| Architecture | Causal transformer, MoE | Text diffusion, MoE |
| Active parameters per step | 3.8B | 3.8B |
| Generation speed (batch 1-4) | Baseline | 2-4x faster |
| VRAM footprint (BF16) | ~52 GB | ~52 GB |
| Ecosystem maturity | Wide (vLLM, SGLang, TensorRT-LLM) | Growing (vLLM, HF Transformers, MLX, Unsloth, NeMo) |
| Best for | High-concurrency APIs, long context | Interactive chat, speed-critical tasks |

What Is DiffusionGemma

DiffusionGemma is a text diffusion model built on the same Mixture-of-Experts architecture as Gemma 4. Like Gemma 4's 26B MoE variant, it routes each token through a subset of expert layers, activating approximately 3.8B parameters per step rather than all 26B simultaneously.

What makes DiffusionGemma different is the generation mechanism. Autoregressive models like Gemma 4 produce one token at a time, left-to-right, with each step conditioned on all previous tokens. DiffusionGemma starts with a fully masked 256-token output block and iteratively unmasks it over T denoising steps. Each step uses bidirectional attention, meaning every output position attends to every other position simultaneously, not just the tokens to its left.

This bidirectional attention is key. It is what enables the parallel refinement and what makes DiffusionGemma architecturally closer to BERT than to GPT. The tradeoff is that you cannot do prefix caching or incremental decoding the way vLLM does with standard KV caches. For the background on how dLLMs like LLaDA 2 implement this same masked diffusion approach, see the diffusion LLM deployment guide.
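The unmasking loop described above can be illustrated with a toy sketch. Nothing here is DiffusionGemma's actual decoding code: `toy_predict` is a hypothetical stand-in for the model's bidirectional forward pass, and the "unmask the most confident positions first, in even shares per step" schedule is an assumption borrowed from how masked diffusion is commonly described.

```python
import random

MASK = None  # placeholder for a position that is still masked


def toy_predict(block, vocab, rng):
    """Hypothetical stand-in for the model.

    Proposes a (token, confidence) pair for every position in the block.
    A real model would do this with one bidirectional forward pass.
    """
    return [(rng.choice(vocab), rng.random()) for _ in block]


def denoise_block(block_len=16, steps=4, vocab=("a", "b", "c"), seed=0):
    """Start fully masked and unmask the block over at most `steps` passes."""
    rng = random.Random(seed)
    block = [MASK] * block_len
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        proposals = toy_predict(block, vocab, rng)
        # Unmask an even share of what is left, most confident positions first.
        remaining_steps = steps - step
        quota = -(-len(masked) // remaining_steps)  # ceiling division
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in best:
            block[i] = proposals[i][0]
    return block
```

Calling `denoise_block(block_len=256, steps=20)` mirrors the shape of the real workload: 20 passes over a 256-position block, with every position filled in by the final pass.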

The release is Apache 2.0 with no gated access on HuggingFace, which makes it straightforward to pull and run.

Why DiffusionGemma Is Fast

The speed advantage comes from four properties working together.

Bidirectional attention. Every position in the 256-token block attends to every other position at each denoising step. The GPU processes all output positions in a single matrix multiply, not 256 sequential forward passes.

Parallel block refinement. Generating 256 tokens requires T forward passes on a 256-position sequence. An autoregressive model producing the same output requires 256 forward passes on sequences growing from 1 to 256 positions. For T=20, that is 20 forward passes vs 256. At T=10, it drops to 10. Within a block, latency scales with T, not with the number of tokens generated.
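The pass-count arithmetic above can be written out as a short sketch. It assumes outputs longer than one block are produced block by block with T steps each, which this guide does not confirm, and it counts forward passes only: each diffusion pass covers 256 positions, so a single pass costs more than a single autoregressive decode step.

```python
import math

BLOCK_SIZE = 256  # tokens refined in parallel per block, as described above


def autoregressive_passes(output_tokens: int) -> int:
    """One forward pass per generated token."""
    return output_tokens


def diffusion_passes(output_tokens: int, steps: int, block_size: int = BLOCK_SIZE) -> int:
    """T passes per block; assumes blocks are generated one after another."""
    blocks = math.ceil(output_tokens / block_size)
    return blocks * steps
```

For a 256-token reply, `diffusion_passes(256, 20)` gives 20 against `autoregressive_passes(256)` at 256, matching the figures in the paragraph above.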

The 1,000 tokens/sec headline. Published DiffusionGemma benchmarks show up to 1,000 tokens/sec on H100 hardware at T=10 with batch size 4-8. Google also reports 700+ tokens/sec on RTX 5090. This is a best-case number at low T and medium batch. At T=20, practical throughput is lower, but the estimates later in this guide still put it roughly 4x ahead of autoregressive Gemma 4 at the same batch size.

Compute-bound, not memory-bandwidth-bound. Autoregressive decoding at low batch sizes is memory-bandwidth-bound: the GPU spends most of its time loading weights from HBM to compute units, one token at a time. DiffusionGemma's forward pass runs dense matrix multiplications across all 256 positions simultaneously, which saturates tensor cores instead of memory bandwidth. This shifts the optimal GPU selection: the H200's memory-bandwidth advantage over the H100 (4.8 TB/s vs 3.35 TB/s) matters less, and raw FLOPS matters more. The RTX PRO 6000 Blackwell's higher compute density per dollar makes it particularly well-suited.
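A rough way to see the shift is to compare arithmetic intensity, the FLOPs performed per byte of weights read. The sketch below is a simplified dense approximation and not a measurement: it assumes about 2 FLOPs per parameter per position and one read of each weight per pass. It ignores attention, activations, and MoE routing, where more positions touch more experts and so read more bytes.

```python
def arithmetic_intensity(positions: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Dense approximation: ~2 FLOPs (multiply + add) per parameter per position,
    with each weight read from memory once per pass.
    """
    flops_per_param = 2.0 * positions
    return flops_per_param / bytes_per_param
```

With BF16 weights, `arithmetic_intensity(1)` is 1 FLOP per byte for single-token decoding, while `arithmetic_intensity(256)` is 256 for a full block. That two-orders-of-magnitude gap is what moves the bottleneck toward compute.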

DiffusionGemma VRAM Requirements

The 26B MoE architecture loads all expert weights into VRAM at startup, even though only ~3.8B activate per forward pass. This is the same behavior as Gemma 4 26B MoE: you pay the full 26B VRAM cost regardless of active parameter count.

| Precision | VRAM Required | Min GPU | Recommended GPU |
| --- | --- | --- | --- |
| BF16 | ~52 GB | H100 SXM5 80GB | RTX PRO 6000 96GB |
| INT8 | ~28 GB | L40S 48GB | A100 80GB |
| NVFP4 (official 4-bit) | ~18 GB | RTX 4090 24GB | L40S 48GB |
| INT4 / GGUF Q4 (llama.cpp, arriving soon) | ~14-16 GB | RTX 4090 24GB | L40S 48GB |

These VRAM figures are approximate. The BF16 number is essentially the raw weight size (26B parameters at 2 bytes each), so budget extra for framework overhead and for batch-level activations during generation. For 4-bit precision, NVFP4 (NVIDIA FP4) is Google's officially supported quantization path, fitting DiffusionGemma in approximately 18GB. GGUF/llama.cpp support is arriving soon but not yet available.

The L40S 48GB can run DiffusionGemma at INT8. At BF16, the 52GB footprint exceeds the L40S's VRAM and requires either INT8 quantization or a higher-VRAM GPU. The RTX PRO 6000 Blackwell at 96GB fits BF16 with 44GB of headroom for large batches and long context windows. For detailed VRAM sizing methodology for MoE models, see the GPU memory requirements guide.
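The sizing logic above reduces to a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the default `overhead_fraction` of 0.15 is an assumed placeholder, not a measured value. Quantized checkpoints usually come out larger than the raw figure, so prefer the table or your own measurements for those.

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes), before any overhead."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: float, gpu_vram_gb: float,
         overhead_fraction: float = 0.15) -> bool:
    """True if weights plus an assumed overhead allowance fit in GPU memory."""
    needed = weight_gb(params_billion, bits_per_param) * (1 + overhead_fraction)
    return needed <= gpu_vram_gb
```

`weight_gb(26, 16)` returns 52.0, the BF16 figure in the table. `fits(26, 16, 48)` is false for the L40S, while `fits(26, 8, 48)` and `fits(26, 16, 96)` are both true.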

Deploy DiffusionGemma on L40S and RTX PRO 6000

Both options run well on Spheron. The L40S is the cost-efficient mid-tier choice for INT8; the RTX PRO 6000 handles BF16 with room to spare.

L40S 48GB, INT8

Teams that want bare-metal performance without configuration overhead can spin up an L40S instance on Spheron and have DiffusionGemma serving in under 10 minutes with INT8 quantization.

Environment setup:

```bash
python3 -m venv diffgemma
source diffgemma/bin/activate

pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install transformers accelerate bitsandbytes fastapi uvicorn huggingface_hub
```

Download the checkpoint (verify the repo name on HuggingFace before running):

```bash
huggingface-cli download google/diffusiongemma-26B-A4B-it \
  --local-dir ./models/diffusiongemma
```

Load with INT8 quantization and serve:

```python
# server.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

# Load weights in 8-bit via bitsandbytes so the model fits in the L40S's 48GB.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "./models/diffusiongemma",
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./models/diffusiongemma")

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: list
    max_tokens: int = 256
    num_inference_steps: int = 20  # denoising steps T

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    # Render the chat history into the model's prompt format.
    prompt = tokenizer.apply_chat_template(req.messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_tokens,
            num_inference_steps=req.num_inference_steps,
        )
    # Drop the prompt tokens and decode only the newly generated part.
    text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Start the server:

```bash
python server.py
```
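Once the server is up, a small standard-library client is enough to exercise it. The sketch below assumes the server from this section is listening on localhost:8000 and accepts the `num_inference_steps` field defined in `ChatRequest`. The helper names are illustrative.

```python
import json
import urllib.request


def build_chat_payload(prompt, steps=20, max_tokens=256, model="diffusiongemma"):
    """Build a request body matching the ChatRequest schema in server.py."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "num_inference_steps": steps,
    }


def send_chat(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `send_chat(build_chat_payload("Explain diffusion LLMs in two sentences.", steps=10))` requests a low-latency reply at T=10.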

RTX PRO 6000 96GB, BF16

The Spheron RTX PRO 6000 Blackwell variant gives 96GB GDDR7, which means running DiffusionGemma in BF16 with significant headroom for large context batches.

Load in full BF16 precision:

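```python
import torch
from transformers import AutoModelForCausalLM

# No quantization config: load the full BF16 weights (~52 GB).
model = AutoModelForCausalLM.from_pretrained(
    "./models/diffusiongemma",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```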

With 96GB available, you can increase batch size and context window substantially. Start with a batch size of 8 and num_inference_steps=20, then tune upward based on memory monitoring (nvidia-smi dmon -s um).

Recommended generation parameters for RTX PRO 6000:

```python
# Assumes `model` and tokenized `inputs` are already set up as in server.py.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    num_inference_steps=20,
    do_sample=True,
    temperature=0.7,
)
```

RTX 4090 24GB, INT4 / GGUF Q4

The 4090's 24GB is tight for DiffusionGemma. Google's officially supported 4-bit path is NVFP4, which fits the model in approximately 18GB and leaves around 6GB for activations. This works for single-request development use.

To prepare for the GGUF INT4 path, install llama-cpp-python with CUDA wheels (verify DiffusionGemma GGUF support before using):

```bash
pip install llama-cpp-python --extra-index-url \
  https://abetlen.github.io/llama-cpp-python/whl/cu124
```

Check the DiffusionGemma HuggingFace model card and GitHub for current GGUF/llama.cpp compatibility before relying on this path. As of June 2026, llama.cpp GGUF support for DiffusionGemma is not yet available. Use NVFP4 via HuggingFace Transformers as the supported 4-bit path in the meantime.

vLLM and framework support. DiffusionGemma shipped with day-zero native vLLM support, along with support for HuggingFace Transformers, MLX, Unsloth, and NVIDIA NeMo. For production serving, vLLM is the recommended stack given its continuous batching and memory efficiency. The FastAPI + transformers approach in this guide is a lightweight alternative for custom configurations or development use. For a full guide to setting up vLLM and other frameworks on Spheron, see the Spheron LLM inference docs.

Throughput Benchmarks and Limits

The following estimates are based on architecture analysis of DiffusionGemma's 26B MoE structure with ~3.8B active parameters per step and 256-token parallel block refinement. Actual measured throughput will vary; benchmark on your hardware before committing to production sizing. The methodology is total_output_tokens / total_wall_time throughout, which is the correct comparison metric against autoregressive inference.

| Model | Precision | T (steps) | Batch Size | Tokens/sec (est.) | GPU |
| --- | --- | --- | --- | --- | --- |
| DiffusionGemma 26B MoE | INT8 | 10 | 1 | ~480 | L40S 48GB |
| DiffusionGemma 26B MoE | INT8 | 20 | 1 | ~270 | L40S 48GB |
| DiffusionGemma 26B MoE | INT8 | 50 | 1 | ~110 | L40S 48GB |
| DiffusionGemma 26B MoE | INT8 | 10 | 4 | ~1,400 | L40S 48GB |
| DiffusionGemma 26B MoE | INT8 | 20 | 4 | ~780 | L40S 48GB |
| DiffusionGemma 26B MoE | BF16 | 10 | 1 | ~750 | RTX PRO 6000 96GB |
| DiffusionGemma 26B MoE | BF16 | 20 | 1 | ~420 | RTX PRO 6000 96GB |
| DiffusionGemma 26B MoE | BF16 | 50 | 1 | ~170 | RTX PRO 6000 96GB |
| DiffusionGemma 26B MoE | BF16 | 10 | 4 | ~2,100 | RTX PRO 6000 96GB |
| DiffusionGemma 26B MoE | BF16 | 20 | 4 | ~1,200 | RTX PRO 6000 96GB |
| Gemma 4 26B MoE (AR) | BF16 | n/a | 1 | ~70 | L40S 48GB |
| Gemma 4 26B MoE (AR) | BF16 | n/a | 4 | ~180 | L40S 48GB |

With continuous batching now available in vLLM for both DiffusionGemma and autoregressive Gemma 4, the throughput gap narrows at high batch sizes. At batch size 32+, autoregressive models pull ahead through KV cache reuse across concurrent requests, an advantage DiffusionGemma's bidirectional attention architecture cannot replicate. DiffusionGemma's latency advantage is strongest at batch sizes 1-8, which covers most interactive chat and single-user inference scenarios.

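The batch-size-4 estimates for RTX PRO 6000 (~2,100 tokens/sec at T=10, ~1,200 at T=20) are above the 1,000 tokens/sec headline number from Google's published benchmarks, which were measured on H100-class hardware at low T. Treat them as upper-bound estimates until you have benchmarked your own deployment.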

Where DiffusionGemma Wins vs Where Autoregressive Leads

| Scenario | Choose DiffusionGemma | Choose Autoregressive Gemma 4 |
| --- | --- | --- |
| Interactive chat (low batch) | Yes - 3-4x lower latency | No |
| Throughput-critical API | Yes at batch 1-8 | Better at batch 32+ |
| Long-context tasks (>4K tokens) | No - 256-token blocks limit context | Yes |
| Structured output (JSON schema) | Caution - less mature tooling | Yes |
| Streaming token-by-token display | No - generates full block at once | Yes |
| Cost-sensitivity (single GPU) | Better at T=10-20 | Better at high batch |

The streaming limitation is real: DiffusionGemma outputs a 256-token block at once rather than streaming individual tokens. This affects UX for chat interfaces that display tokens as they generate. Workarounds exist (post-process the block into a simulated stream), but they add complexity.
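The simulated-stream workaround can be sketched as a generator that slices a finished block into server-sent-event lines. The chunk shape below is a simplified approximation of OpenAI-style streaming deltas, not a complete implementation of that schema. The function name and the `words_per_chunk` parameter are illustrative.

```python
import json


def simulate_stream(text, words_per_chunk=4):
    """Yield SSE lines for an already finished block of text."""
    words = text.split(" ")
    for start in range(0, len(words), words_per_chunk):
        piece = " ".join(words[start:start + words_per_chunk])
        if start + words_per_chunk < len(words):
            piece += " "  # keep the space that separated this chunk from the next
        chunk = {"choices": [{"delta": {"content": piece}}]}
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```

Concatenating the `content` fields on the client side reproduces the original text exactly. Time to first token does not improve, because the whole block must finish denoising before the first chunk can be sent.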

Cost Analysis: DiffusionGemma vs Autoregressive Gemma 4

Using live Spheron pricing (fetched June 11, 2026) and the throughput estimates from the section above:

| Configuration | On-Demand $/hr | Spot $/hr | Tokens/sec (T=20, batch=1) | Cost/M tokens (on-demand) | Cost/M tokens (spot) |
| --- | --- | --- | --- | --- | --- |
| DiffusionGemma on L40S (INT8) | $0.96 | $0.67 | ~270 | ~$0.99 | ~$0.69 |
| DiffusionGemma on RTX PRO 6000 (BF16) | $2.39 | $1.32 | ~420 | ~$1.58 | ~$0.87 |
| Gemma 4 26B MoE (AR) on L40S | $0.96 | $0.67 | ~70 | ~$3.81 | ~$2.66 |
| Gemma 4 26B MoE (AR) on RTX PRO 6000 | $2.39 | $1.32 | ~100 | ~$6.64 | ~$3.67 |

At T=20, DiffusionGemma runs at roughly $0.99/M tokens on L40S on-demand and $1.58/M on RTX PRO 6000 on-demand. Spot instances bring those down to $0.69/M and $0.87/M respectively. The autoregressive Gemma 4 baseline on the same hardware runs $3.81/M (L40S) to $6.64/M (RTX PRO 6000) on-demand. That is roughly a 4x cost advantage for DiffusionGemma on single-GPU interactive workloads.

At T=10, DiffusionGemma's cost per token drops another ~45%: around $0.56/M tokens on L40S on-demand at batch=1 (or ~$0.39/M on spot). The quality tradeoff at T=10 is noticeable but acceptable for many non-critical tasks. Benchmark T=10 vs T=20 on your workload before committing.
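The per-million-token figures in this section follow from a single formula: hourly price divided by tokens generated per hour. A minimal sketch, with an illustrative function name:

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000
```

Plugging in the table values, `cost_per_million_tokens(0.96, 270)` is about 0.99 and `cost_per_million_tokens(0.96, 70)` is about 3.81. Swap in your own measured tokens/sec and current prices to redo the comparison.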

Pricing fluctuates based on GPU availability. The prices above are as of June 11, 2026 and may have changed. Check current GPU pricing for live rates.

Production Checklist

Quality caveat from Google. Google recommends standard Gemma 4 for quality-critical production workloads and classifies DiffusionGemma as experimental. Output quality is lower than autoregressive Gemma 4, particularly at low denoising steps (T=10-20). Before routing quality-sensitive traffic to DiffusionGemma, evaluate its output on your actual use case against Gemma 4.

Before routing live traffic to DiffusionGemma:

  • GPU selection. RTX PRO 6000 for BF16 and maximum throughput. L40S for INT8 when cost is the primary constraint. RTX 4090 only for dev/test.
  • Tune T before deploying. Run your actual use case prompts at T=10, 20, and 50. Measure both quality and latency. Many workloads get acceptable quality at T=15-20. Do not default to T=50 without measuring what you gain vs T=20.
  • Bind to localhost in production. The FastAPI server in this guide binds to 0.0.0.0, which listens on every network interface. For production, bind to 127.0.0.1 and put a reverse proxy (nginx or Caddy) in front. Never expose the raw inference server directly.
  • Add authentication. Even internal endpoints should require an API key. Add a Depends(verify_api_key) FastAPI dependency before going live.
  • Monitor GPU utilization. During active generation, GPU utilization should be 70-90%. Below 60% suggests the serving stack is not feeding the GPU fast enough. Use nvidia-smi dmon -s pu to track during load tests.
  • Check framework compatibility before each framework upgrade. DiffusionGemma uses a non-standard decoding loop. Upgrading transformers or PyTorch can silently break the num_inference_steps parameter. Pin dependency versions in production until you have tested the upgrade.
  • Block streaming requests. If your endpoint exposes streaming via SSE, add an explicit block for requests with stream: true until you have tested DiffusionGemma's block-generation behavior with your streaming client. The output shape differs from autoregressive streaming.
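The authentication and streaming items in the checklist can share one pre-flight check. The sketch below is framework-agnostic and purely illustrative: `RequestRejected` and `check_request` are hypothetical names, and the `Bearer` header format is an assumption. In FastAPI, a check like this could be wrapped in the `Depends(verify_api_key)` dependency mentioned above.

```python
import hmac


class RequestRejected(Exception):
    """Raised when a request should be refused; carries an HTTP status code."""

    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def check_request(authorization_header, body, expected_key):
    """Reject requests without a valid API key or with stream: true."""
    scheme, _, presented = (authorization_header or "").partition(" ")
    # compare_digest avoids leaking key contents through timing differences.
    key_ok = scheme == "Bearer" and hmac.compare_digest(
        presented.encode("utf-8"), expected_key.encode("utf-8")
    )
    if not key_ok:
        raise RequestRejected(401, "invalid or missing API key")
    if body.get("stream"):
        raise RequestRejected(400, "streaming is not supported by this endpoint")
```

Load the expected key from an environment variable or a secret store, not from source code.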

DiffusionGemma's ~52 GB BF16 footprint runs cleanly on a single RTX PRO 6000, and INT8 quantization brings it down to 28GB for the L40S. Run it on Spheron's L40S instances with per-minute billing and no upfront commitment.


Quick Setup Guide

  1. Provision a GPU instance on Spheron

    Log in to app.spheron.ai, navigate to GPU Instances, and launch an L40S (48GB) or RTX PRO 6000 instance with Ubuntu 22.04 and CUDA 12.4. SSH into the instance and confirm the GPU is visible with nvidia-smi. The RTX PRO 6000 Blackwell (96GB) is the recommended starting point for DiffusionGemma BF16 since it fits the full 26B MoE weights comfortably. The L40S 48GB works with INT8 quantization and leaves enough headroom for batching.

  2. Set up the Python environment

    Create a virtual environment: python3 -m venv diffgemma && source diffgemma/bin/activate. Install PyTorch with CUDA 12.4 (pip install torch --index-url https://download.pytorch.org/whl/cu124), then install transformers, accelerate, fastapi, uvicorn, and huggingface_hub (plus bitsandbytes if you plan to use INT8). Confirm CUDA access: python -c 'import torch; print(torch.cuda.get_device_name(0))'.

  3. Download the checkpoint from Hugging Face

    DiffusionGemma is released under Apache 2.0, no gated access required. Verify the exact repository name at https://huggingface.co/google before downloading, as Google may update repo paths after launch. Pull the checkpoint: huggingface-cli download google/diffusiongemma-26B-A4B-it --local-dir ./models/diffusiongemma. Check the model card for any dependency notes before running.

  4. Launch the inference server with an OpenAI-compatible endpoint

    Create server.py using FastAPI and the transformers pipeline to serve /v1/chat/completions requests. Load the model with AutoModelForCausalLM.from_pretrained('./models/diffusiongemma', torch_dtype=torch.bfloat16, device_map='auto'). Expose port 8000 and run with uvicorn server:app --host 0.0.0.0 --port 8000. Test with: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"diffusiongemma","messages":[{"role":"user","content":"Explain diffusion LLMs in two sentences."}],"max_tokens":256}'.

  5. Tune denoising steps for speed vs quality

    DiffusionGemma's generation quality and latency are controlled by the number of denoising steps T. Set T=10 for minimum latency (interactive chat), T=20-30 for balanced quality (default), T=50 for maximum coherence (quality-critical tasks). Pass num_inference_steps=T to the model's generate() call. Monitor tokens/sec at T=10, 20, and 50 to calibrate for your workload. Lower T reduces latency nearly linearly.

  6. Run throughput benchmarks and calculate cost-per-token

    Benchmark at batch sizes 1, 4, and 8 using a fixed 256-token output prompt. Record total_output_tokens / wall_time_seconds for each configuration. Divide the GPU's on-demand $/hr by 3,600 to get the per-second cost, then divide by tokens/sec to get the cost per output token. Compare this against an autoregressive Gemma 4 26B MoE deployment on the same GPU for a like-for-like cost-per-token comparison.
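A minimal timing harness for the benchmark step might look like the sketch below. `generate_fn` is a hypothetical callable that you supply: it should run one generation, or one whole batch, and return the number of output tokens produced. When timing real GPU work, it should also wait for the GPU to finish before returning, for example with `torch.cuda.synchronize()`.

```python
import time


def measure_throughput(generate_fn, prompts, runs=3):
    """Return output tokens per second: total_output_tokens / total_wall_time."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        for prompt in prompts:
            total_tokens += generate_fn(prompt)  # must return the output token count
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

Run one untimed warm-up call first so model loading and kernel compilation do not skew the result. Then repeat the measurement at T=10, 20, and 50 for each batch size.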

Frequently Asked Questions

What is DiffusionGemma?

DiffusionGemma is Google DeepMind's first open-weight text diffusion language model, released June 10, 2026 under Apache 2.0. It uses a 26B Mixture-of-Experts architecture and generates text by iteratively refining 256-token blocks in parallel rather than producing tokens one at a time. This parallel block approach is what enables up to 4x faster generation compared to autoregressive models of similar size.

How much VRAM does DiffusionGemma need?

DiffusionGemma's 26B MoE weights occupy roughly 52GB of VRAM at BF16 (the full model loads into memory even though only a subset of experts activate per step). After INT8 quantization, that drops to approximately 28GB. NVFP4 (NVIDIA's 4-bit floating point format) is Google's officially supported 4-bit path and brings it to approximately 18GB, making the RTX 4090 (24GB) viable. GGUF/llama.cpp support is arriving soon. This makes DiffusionGemma compatible with an L40S (48GB) at INT8, RTX PRO 6000 (96GB) at BF16, or RTX 4090 (24GB) at NVFP4.

How does DiffusionGemma compare to autoregressive Gemma 4?

DiffusionGemma generates tokens up to 4x faster than autoregressive Gemma 4 at low batch sizes (1-4 concurrent requests), because it refines all positions in a 256-token block in parallel rather than one at a time. At high batch sizes (32+), autoregressive models with continuous batching close the gap. DiffusionGemma is best for interactive or speed-critical workloads; autoregressive Gemma 4 leads on long-context tasks and structured output reliability.

Can DiffusionGemma run on an L40S or RTX PRO 6000?

Yes, with some configuration. DiffusionGemma's 26B MoE weights at BF16 require roughly 52GB VRAM, which exceeds the L40S's 48GB. The L40S 48GB can run DiffusionGemma with INT8 quantization (around 28GB). The RTX PRO 6000 Blackwell's 96GB GDDR7 comfortably fits the BF16 checkpoint with headroom for large batches. Both GPUs are available on Spheron.

How do I serve DiffusionGemma as an API?

DiffusionGemma can be served with vLLM directly, which shipped with day-zero DiffusionGemma support on release. It also supports HuggingFace Transformers, MLX, Unsloth, and NVIDIA NeMo. For a lightweight custom endpoint, you can wrap the HuggingFace generate() call with FastAPI, exposing a /v1/chat/completions endpoint that matches the OpenAI schema. The deployment section of this guide includes a working example using that approach.
