Tutorial

Deploy Google Gemma 4 on GPU Cloud: MoE and Dense Model Guide (2026)

Written by Mitrasish, Co-founder | Apr 5, 2026

Google released Gemma 4 on April 2, 2026 under an Apache 2.0 license. The family includes four variants: E2B and E4B for edge and mobile deployment, plus the two cloud-scale models covered here. This guide focuses on the 31B Dense model (model ID google/gemma-4-31B-it) and the 26B Mixture of Experts model (model ID google/gemma-4-26B-A4B-it). The MoE variant activates only ~4B parameters per forward pass (the "A4B" in the model ID), which makes it faster and cheaper to run than most models of its size. The Dense variant offers peak quality with straightforward deployment on a single H100 or B200. Both variants support multimodal input, including interleaved images and text.

For context on where Gemma 4 sits in the broader model landscape, see our open-source LLM comparison for 2026 and our Gemma 3 deployment guide for the previous generation.

Gemma 4 GPU Requirements

The 31B Dense model loads all 31 billion parameters into VRAM on every forward pass. The 26B MoE model loads all 26B expert weights into VRAM but only activates ~4B of them per inference step. Both need enough VRAM to hold the full set of model weights, regardless of active parameter count.

| Model | Precision | VRAM Required | Min GPU | Recommended GPU |
| --- | --- | --- | --- | --- |
| Gemma 4 31B Dense | BF16 | ~71 GB | H100 SXM5 80GB | B200 SXM6 |
| Gemma 4 31B Dense | INT8 | ~36 GB | A100 40GB | A100 80GB |
| Gemma 4 31B Dense | INT4 | ~16 GB | RTX 4090 | RTX 5090 |
| Gemma 4 26B MoE | BF16 | ~60 GB | H100 SXM5 | B200 SXM6 |
| Gemma 4 26B MoE | FP8 | ~30 GB | RTX 5090 | A100 80GB |
| Gemma 4 26B MoE | INT4 | ~15 GB | RTX 4090 | RTX 5090 |

VRAM estimates include the standard 15% framework overhead on top of raw model weights. Plan additional headroom for KV cache at production context lengths. For detailed VRAM calculations across quantization levels, see our GPU memory requirements guide.
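As a sanity check, the table's figures can be reproduced from parameter count, bytes per parameter, and the 15% framework overhead noted above. This is a rough sketch of the estimate, not a measurement; KV cache is deliberately excluded:

```python
def estimated_vram_gb(params_billion: float, bytes_per_param: float,
                      overhead: float = 0.15) -> float:
    """Rough VRAM estimate: raw weights plus framework overhead (no KV cache)."""
    return params_billion * bytes_per_param * (1 + overhead)

# BF16 = 2 bytes/param, INT8/FP8 = 1, INT4 = 0.5
print(round(estimated_vram_gb(31, 2)))  # 31B Dense, BF16 -> ~71 GB
print(round(estimated_vram_gb(26, 2)))  # 26B MoE, BF16   -> ~60 GB
print(round(estimated_vram_gb(26, 1)))  # 26B MoE, FP8    -> ~30 GB
```

Add your expected KV cache on top of these numbers when sizing a GPU for production context lengths.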

Deploy Gemma 4 31B Dense on H100 and B200

The 31B Dense model is straightforward to serve. On a single H100 SXM5 (80GB), it runs in BF16 with enough headroom for 8K contexts. On a B200 (192GB), you can push to a 64K context without quantization.

H100 SXM5, BF16:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-31B-it \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --port 8000
```

H100 SXM5, FP8 (better throughput, similar quality):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-31B-it \
    --dtype bfloat16 \
    --quantization fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --port 8000
```

B200 SXM6, BF16 (64K context, no quantization):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-31B-it \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 65536 \
    --port 8000
```

The B200's 192GB VRAM fits the full BF16 model with space for 64K+ token KV caches. Use the FP8 variant on H100 when you need context lengths beyond 8K.
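Once one of the servers above is running, it exposes the standard OpenAI-compatible /v1/chat/completions endpoint on port 8000. A minimal multimodal request payload might look like the sketch below; the image URL is a placeholder, and this assumes vLLM's vision input support covers the model:

```python
import json

# Hypothetical request body for the server started above (http://localhost:8000).
payload = {
    "model": "google/gemma-4-31B-it",
    "messages": [
        {
            "role": "user",
            "content": [
                # Interleaved text and image, OpenAI content-parts format
                {"type": "text", "text": "Describe this chart in two sentences."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    "max_tokens": 256,
}

# Send with any HTTP client, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" -d "$(cat payload.json)"
print(json.dumps(payload, indent=2))
```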

Current Spheron pricing for Gemma 4 31B Dense deployments:

| GPU | Config | On-Demand $/hr | Spot $/hr |
| --- | --- | --- | --- |
| H100 SXM5 | 1x, BF16 | $2.57 | $0.80 |
| B200 SXM6 | 1x, BF16 | $7.43 | $1.71 |
| A100 80GB SXM4 | 2x, BF16 (TP=2) | $3.28 | $0.90 |

Pricing fluctuates based on GPU availability. The prices above were captured on Apr 5, 2026 and may have changed. Check current GPU pricing → for live rates.

For hardware setup guides see the H100 GPU rental page and B200 GPU rental page. For production vLLM configuration including health checks and monitoring, see our vLLM production deployment guide.

Deploy Gemma 4 26B MoE (~4B Active Parameters)

The MoE model routes each token through a subset of its expert FFN layers. The router scores each token against all available experts and dispatches it to the top-K highest-scoring ones. Only ~4B out of 26B total parameters run computation on any given token. The rest sit in VRAM, resident but idle.

This matters for inference cost: you pay to load 26B parameters into memory, but each forward pass only does the arithmetic of a ~4B model. On a per-token basis, MoE models are fast. The tradeoff is that you still need the VRAM to hold all 26B weights.
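The arithmetic savings can be made concrete with the common rule of thumb that a decoder forward pass costs roughly 2 FLOPs per active parameter per token. This is an approximation that ignores attention and router overhead:

```python
def flops_per_token(active_params_billion: float) -> float:
    """Rule-of-thumb decode cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_billion * 1e9

dense = flops_per_token(31)  # Dense: all 31B parameters, every token
moe = flops_per_token(4)     # MoE: only ~4B expert parameters active
print(f"MoE does ~{dense / moe:.1f}x less arithmetic per token")  # ~7.8x
```

That ~7.8x gap is why the MoE model posts much higher tokens/sec on the same GPU despite a similar memory footprint.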

For a deep dive on expert parallelism strategy and MoE VRAM math, see our MoE inference optimization guide.

H100 SXM5, BF16:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-26B-A4B-it \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --port 8000
```

For single-GPU deployments of the MoE model, tensor parallelism is not required as long as the full model fits in VRAM. The ~4B active parameter count keeps routing overhead minimal, and a single GPU can typically handle dozens of concurrent requests before batching degrades latency. Add --tensor-parallel-size 2 only if you want to split the model across two GPUs for longer context windows.

Current Spheron pricing for Gemma 4 26B MoE deployments:

| GPU | Config | On-Demand $/hr | Spot $/hr |
| --- | --- | --- | --- |
| RTX 5090 PCIe | 1x, FP8 | $0.86 | - |
| H100 SXM5 | 1x, BF16 | $2.57 | $0.80 |
| A100 80GB SXM4 | 1x, FP8 | $1.64 | $0.45 |

Pricing fluctuates based on GPU availability. The prices above were captured on Apr 5, 2026 and may have changed. Check current GPU pricing → for live rates.

The RTX 5090 at $0.86/hr is the best value option for the MoE model when running FP8, at roughly one-third the cost of an H100. The A100 at $0.45/hr spot is the cheapest option and also requires FP8 quantization to fit the full model. For full BF16 quality, use an H100 SXM5.

Quantized Gemma 4 on Consumer GPUs

INT4 and AWQ quantization bring Gemma 4 within reach of consumer GPUs for development and low-traffic use cases.

AWQ INT4 for the MoE model:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-26B-A4B-it \
    --quantization awq \
    --dtype float16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --port 8000
```

The AWQ INT4 MoE variant requires roughly 15GB VRAM for model weights. An RTX 4090 (24GB) can run it with headroom for KV cache. An RTX 5090 (32GB) runs it with room for larger batches.

GGUF/llama.cpp for RTX 4090:

If you prefer llama.cpp for CPU offloading or simpler setup:

```bash
# Download a Q4_K_M GGUF from HuggingFace
llama-cli -m gemma-4-26B-A4B-it.Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --ctx-size 4096
```

Q4_K_M on the 26B MoE model uses roughly 16GB, fitting an RTX 4090 for development use.
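The ~16GB figure follows from the effective bits per weight of the quantization format. Q4_K_M averages roughly 4.85 bits/weight (an assumption here; the exact value varies by model because of mixed-precision blocks), so a rough file-size estimate looks like:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size from parameter count and effective bits/weight."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# ~4.85 effective bits/weight for Q4_K_M is an assumption, not a spec value
print(f"{gguf_size_gb(26, 4.85):.1f} GB")  # ~15.8 GB for the 26B MoE
```

Whatever remains of the 24GB after the weights is what the KV cache and CUDA context get, which is why --ctx-size stays modest on a 4090.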

The RTX 4090 is not currently listed in the Spheron GPU pricing API. For current RTX 4090 availability on Spheron, check the GPU pricing page. The RTX 5090 at $0.86/hr is available and runs Gemma 4 MoE with FP8 quantization (~30GB VRAM). See the RTX 5090 rental page for configuration details.

For context on FP4 quantization methods and Blackwell GPU cost efficiency, see our FP4 quantization guide.

Performance Benchmarks

Approximate throughput estimates for Gemma 4 models compared to similar-tier alternatives (batch=1, context=512 tokens):

| Model | Active Params | VRAM (BF16) | Tokens/sec (H100) | License |
| --- | --- | --- | --- | --- |
| Gemma 4 26B MoE | ~4B | ~60 GB | ~120 | Apache 2.0 |
| Gemma 4 31B Dense | 31B | ~71 GB | ~65 | Apache 2.0 |
| Llama 4 Scout | 17B active | ~216 GB (3×H100) | ~90† | Custom |
| Qwen 3 32B | 32B | ~64 GB | ~60 | Apache 2.0 |

These are approximate inference throughput estimates. Actual throughput varies with context length, batch size, and concurrency. The MoE model's 120 tok/sec figure reflects its low active parameter count per token; the Dense model at 65 tok/sec reflects full 31B computation per pass. †Llama 4 Scout requires ~216 GB VRAM in BF16, which exceeds a single H100 (80 GB); its ~90 tok/sec figure is for a 3×H100 tensor-parallel setup.
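For interactive workloads, the throughput numbers above translate directly into single-request generation latency:

```python
def generation_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to decode `tokens` at a steady tokens/sec rate."""
    return tokens / tokens_per_sec

# Time to stream a 512-token reply at the benchmarked rates:
print(f"{generation_seconds(512, 120):.1f}s")  # MoE on H100: ~4.3s
print(f"{generation_seconds(512, 65):.1f}s")   # Dense on H100: ~7.9s
```

This ignores prefill time and queueing, so treat it as a lower bound on user-visible latency.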

For Llama 4 deployment specifics, see our Llama 4 GPU cloud guide.

Cost Comparison: Self-Hosted vs. API

Running Gemma 4 on Spheron vs. paying per-token API rates:

| Deployment | Config | Cost/1M tokens (input) | Cost/1M tokens (output) |
| --- | --- | --- | --- |
| Gemma 4 MoE on RTX 5090 | $0.86/hr, ~2M tok/hr | ~$0.43 | ~$0.43 |
| Gemma 4 MoE on H100 spot | $0.80/hr, ~4M tok/hr | ~$0.20 | ~$0.20 |
| Gemma 4 Dense on H100 spot | $0.80/hr, ~2M tok/hr | ~$0.40 | ~$0.40 |

Throughput estimates at batch=8; actual varies with context length and concurrency.

Pricing fluctuates based on GPU availability. The prices above were captured on Apr 5, 2026 and may have changed. Check current GPU pricing → for live rates.

At high throughput, self-hosted Gemma 4 MoE on Spheron spot instances undercuts most per-token API pricing. The break-even point depends on your traffic volume: at 10M+ tokens per day, self-hosting on spot H100 is almost always cheaper. For cost optimization strategies, see our GPU cost optimization playbook and serverless vs. on-demand vs. reserved GPU comparison.
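The per-token costs in the table are just the hourly price divided by sustained throughput, and the break-even volume against an API follows directly. The $2.00/M API rate below is a placeholder assumption, not a quote:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_hour: float) -> float:
    """Effective $/1M tokens for a GPU kept busy at the given throughput."""
    return hourly_usd / (tokens_per_hour / 1e6)

print(cost_per_million_tokens(0.80, 4e6))  # spot H100 running MoE -> 0.2

# Break-even daily volume vs an API priced at $2.00/M tokens (placeholder):
daily_gpu_cost = 0.80 * 24                         # 24/7 spot H100: ~$19.20/day
breakeven_tokens = daily_gpu_cost / 2.00 * 1e6
print(f"break-even at ~{breakeven_tokens / 1e6:.1f}M tokens/day")  # ~9.6M
```

Below the break-even volume the GPU sits partially idle and per-token API pricing wins; above it, the fixed hourly cost amortizes in your favor.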

Production Deployment Checklist

Before serving traffic, confirm these settings:

  • Set --max-model-len based on available VRAM after model weights: 8192 is safe for most single-GPU configs with BF16 Dense
  • Enable --enable-chunked-prefill for better throughput under concurrent load
  • Use --gpu-memory-utilization 0.90 to leave headroom for system processes
  • Set up health checks against GET /health on port 8000 before routing traffic
  • Use spot instances for dev and testing; switch to on-demand for production SLAs
  • Monitor GPU memory with nvidia-smi dmon -s mu or Prometheus/Grafana integration
  • Cache HuggingFace model weights on a persistent volume to avoid re-downloading on restart
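The health-check item above is worth scripting so traffic is never routed before the model finishes loading. A minimal polling sketch, with the HTTP call left injectable (in production, `check` would issue GET http://localhost:8000/health and return True on HTTP 200):

```python
import time
from typing import Callable

def wait_for_health(check: Callable[[], bool],
                    timeout_s: float = 120,
                    interval_s: float = 2) -> bool:
    """Poll `check` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# Example with a stub fetcher that succeeds on the third poll:
polls = iter([False, False, True])
print(wait_for_health(lambda: next(polls), timeout_s=10, interval_s=0.01))  # True
```

Wire the real check to your load balancer or orchestrator so restarts (and the initial weight download) do not serve 503s to users.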

For serving behind a reverse proxy with the OpenAI-compatible API, see our self-hosted OpenAI-compatible API guide.


Gemma 4's MoE architecture means you get frontier-quality inference at a fraction of the compute cost, and Spheron's spot H100 instances at $0.80/hr make it one of the cheapest ways to run a production-grade open model today.

Rent H100 → | Rent B200 → | View all GPU pricing →

Get started on Spheron →
