Deploy Google Gemma 4 on GPU Cloud: MoE and Dense Model Guide (2026)

Written by Mitrasish, Co-founder, Apr 5, 2026
Tags: Google Gemma 4, Gemma 4, vLLM, GPU Cloud, MoE, H100, B200, LLM Deployment

Google released Gemma 4 on April 2, 2026 under an Apache 2.0 license. The family includes four variants: E2B and E4B for edge and mobile deployment, plus the two cloud-scale models covered here. This guide focuses on the 31B Dense model (model ID google/gemma-4-31B-it) and the 26B Mixture of Experts model (model ID google/gemma-4-26B-A4B-it). The MoE variant activates only ~4B parameters per forward pass (the "A4B" in the model ID), which makes it faster and cheaper to run than most models its size. The Dense variant offers peak quality with straightforward deployment on a single H100 or B200. Both variants support multimodal input, including interleaved images and text.

For context on where Gemma 4 sits in the broader model landscape, see our open-source LLM comparison for 2026 and our Gemma 3 deployment guide for the previous generation.

Gemma 4 GPU Requirements

The 31B Dense model uses all 31 billion parameters on every forward pass. The 26B MoE model keeps all 26B parameters, including every expert's weights, in VRAM but activates only ~4B of them per token. Both need enough VRAM to hold the full model weights, regardless of active parameter count.

| Model | Precision | VRAM Required | Min GPU | Recommended GPU |
| --- | --- | --- | --- | --- |
| Gemma 4 31B Dense | BF16 | ~71 GB | H100 SXM5 80GB | B200 SXM6 |
| Gemma 4 31B Dense | INT8 | ~36 GB | A100 40GB | A100 80GB |
| Gemma 4 31B Dense | INT4 | ~18 GB | RTX 4090 | RTX 5090 |
| Gemma 4 26B MoE | BF16 | ~60 GB | H100 SXM5 | B200 SXM6 |
| Gemma 4 26B MoE | FP8 | ~30 GB | RTX 5090 | A100 80GB |
| Gemma 4 26B MoE | INT4 | ~15 GB | RTX 4090 | RTX 5090 |

VRAM estimates include the standard 15% framework overhead on top of raw model weights. Plan additional headroom for KV cache at production context lengths. For detailed VRAM calculations across quantization levels, see our GPU memory requirements guide.
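The arithmetic behind these estimates can be sketched in a few lines of Python. This is a rough sketch, assuming 1 GB = 1e9 bytes, the 15% overhead stated above, and nominal bytes per parameter for each precision; the function and constant names are illustrative, not part of any library. Quantized formats also store scales and may keep some layers in higher precision, so treat the INT4 results as rougher than the BF16 ones.

```python
# Nominal bytes per parameter for each precision (assumption: no extra
# storage for quantization scales or unquantized layers).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

# 15% framework overhead, as stated in this guide.
FRAMEWORK_OVERHEAD = 0.15


def estimate_vram_gb(params_billion: float, precision: str) -> float:
    """Estimate VRAM in GB for model weights plus framework overhead.

    KV cache is NOT included; budget for it separately.
    """
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + FRAMEWORK_OVERHEAD)


if __name__ == "__main__":
    for name, params in (("31B Dense", 31.0), ("26B MoE", 26.0)):
        for precision in ("bf16", "fp8", "int4"):
            estimate = estimate_vram_gb(params, precision)
            print(f"{name:10s} {precision:5s} ~{estimate:.0f} GB")
```

Note that the MoE model is sized by its total 26B parameters here, not its ~4B active parameters, because every expert has to be resident in VRAM.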

Deploy Gemma 4 31B Dense on H100 and B200

The 31B Dense model is straightforward to serve. On a single H100 SXM5 (80GB), it runs in BF16 with enough headroom for 8K contexts. On a B200 (192GB), you can push to full 128K context without quantization.

H100 SXM5, BF16:

bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-31B-it \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --port 8000

H100 SXM5, FP8 (better throughput, similar quality):

bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-31B-it \
    --dtype bfloat16 \
    --quantization fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --port 8000

B200 SXM6, BF16 (64K context, no quantization):

bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-31B-it \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 65536 \
    --port 8000

The B200's 192GB VRAM fits the full BF16 model with space for 64K+ token KV caches. Use the FP8 variant on H100 when you need context lengths beyond 8K.

Current Spheron pricing for Gemma 4 31B Dense deployments:

| GPU | Config | On-Demand $/hr | Spot $/hr |
| --- | --- | --- | --- |
| H100 SXM5 | 1x, BF16 | $2.57 | $0.80 |
| B200 SXM6 | 1x, BF16 | $7.43 | $1.71 |
| A100 80GB SXM4 | 2x, BF16 (TP=2) | $3.28 | $0.90 |

Pricing fluctuates based on GPU availability. The prices above are as of 05 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For hardware setup guides see the H100 GPU rental page and B200 GPU rental page. For production vLLM configuration including health checks and monitoring, see our vLLM production deployment guide.

Deploy Gemma 4 26B MoE (~4B Active Parameters)

The MoE model routes each token through a subset of its expert FFN layers. The router scores each token against all available experts and dispatches it to the top-K highest-scoring ones. Only ~4B out of 26B total parameters run computation on any given token. The rest sit in VRAM, resident but idle.

This matters for inference cost: you pay to load 26B parameters into memory, but each forward pass only does the arithmetic of a ~4B model. On a per-token basis, MoE models are fast. The tradeoff is that you still need the VRAM to hold all 26B weights.
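The routing step described above can be illustrated with a minimal top-K router in plain Python. This is a generic sketch, not Gemma 4's actual implementation: the expert count (8), the top-K value (2), and the choice to softmax over all experts and then renormalize over the selected ones are assumptions made for illustration, since this guide does not specify Gemma 4's router configuration.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits: List[float], top_k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the top_k experts.

    Weights are the softmax probabilities of the selected experts,
    renormalized so that they sum to 1.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


if __name__ == "__main__":
    # Hypothetical layer: 8 experts, 2 active per token (NOT Gemma 4's real config).
    logits = [0.1, 2.3, -1.0, 0.7, 1.9, -0.4, 0.0, 0.5]
    selected = route_token(logits, top_k=2)
    print("active experts:", [index for index, _ in selected])
    print("idle experts:", len(logits) - len(selected))
```

Only the selected experts run their FFN for this token; the others stay in VRAM unused, which is why compute scales with active parameters while memory scales with total parameters.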

For a deep dive on expert parallelism strategy and MoE VRAM math, see our MoE inference optimization guide.

H100 SXM5, BF16:

bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-26B-A4B-it \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --port 8000

For single-GPU deployments of the MoE model, tensor parallelism is not required as long as the full model fits in VRAM. Because only ~4B parameters are active per token, per-token compute is low, and a single GPU handles 30+ concurrent requests without batching degradation. Add --tensor-parallel-size 2 only if you want to split the model across two GPUs to free VRAM for longer context windows.

Current Spheron pricing for Gemma 4 26B MoE deployments:

| GPU | Config | On-Demand $/hr | Spot $/hr |
| --- | --- | --- | --- |
| RTX 5090 PCIe | 1x, FP8 | $0.86 | - |
| H100 SXM5 | 1x, BF16 | $2.57 | $0.80 |
| A100 80GB SXM4 | 1x, FP8 | $1.64 | $0.45 |

Pricing fluctuates based on GPU availability. The prices above are as of 05 Apr 2026 and may have changed. Check current GPU pricing for live rates.

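The RTX 5090 at $0.86/hr is the best-value on-demand option for the MoE model when running FP8, at roughly one-third the on-demand cost of an H100. The A100 80GB at $0.45/hr spot is the cheapest option overall. It is priced here with FP8, which leaves more VRAM for KV cache, although its 80GB is also enough to hold the ~60 GB BF16 model. For full BF16 quality, an H100 SXM5 is the recommended single-GPU choice.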

Quantized Gemma 4 on Consumer GPUs

INT4 and AWQ quantization bring Gemma 4 within reach of consumer GPUs for development and low-traffic use cases.

AWQ INT4 for the MoE model:

bash
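# Note: --quantization awq expects a checkpoint that has already been
# quantized with AWQ. Replace the model ID below with an AWQ export of
# Gemma 4 26B MoE; the base repository ID is shown only as a placeholder.
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-26B-A4B-it \
    --quantization awq \
    --dtype float16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --port 8000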

The AWQ INT4 MoE variant requires roughly 15GB VRAM for model weights. An RTX 4090 (24GB) can run it with headroom for KV cache. An RTX 5090 (32GB) runs it with room for larger batches.

GGUF/llama.cpp for RTX 4090:

If you prefer llama.cpp for CPU offloading or simpler setup:

bash
# Download a Q4_K_M GGUF from HuggingFace
llama-cli -m gemma-4-26B-A4B-it.Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --ctx-size 4096

Q4_K_M on the 26B MoE model uses roughly 16GB, fitting an RTX 4090 for development use.

The RTX 4090 is not currently listed in the Spheron GPU pricing API. For current RTX 4090 availability on Spheron, check the GPU pricing page. The RTX 5090 at $0.86/hr is available and runs Gemma 4 MoE with FP8 quantization (~30GB VRAM). See the RTX 5090 rental page for configuration details.

For context on FP4 quantization methods and Blackwell GPU cost efficiency, see our FP4 quantization guide.

Performance Benchmarks

Approximate throughput estimates for Gemma 4 models compared to similar-tier alternatives (batch=1, context=512 tokens):

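| Model | Active Params | VRAM (BF16) | Tokens/sec (H100) | License |
| --- | --- | --- | --- | --- |
| Gemma 4 26B MoE | ~4B | ~60 GB | ~120 | Apache 2.0 |
| Gemma 4 31B Dense | 31B | ~71 GB | ~65 | Apache 2.0 |
| Llama 4 Scout | 17B active | ~216 GB (3×H100) | ~90† | Custom |
| Qwen 3 32B | 32B | ~64 GB | ~60 | Apache 2.0 |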

These are approximate inference throughput estimates. Actual throughput varies with context length, batch size, and concurrency. The MoE model's 120 tok/sec figure reflects its low active parameter count per token; the Dense model at 65 tok/sec reflects full 31B computation per pass. †Llama 4 Scout requires ~216 GB VRAM in BF16, which exceeds a single H100 (80 GB); its ~90 tok/sec figure is for a 3×H100 tensor-parallel setup.

For Llama 4 deployment specifics, see our Llama 4 GPU cloud guide.

Cost Comparison: Self-Hosted vs. API

Running Gemma 4 on Spheron vs. paying per-token API rates:

| Deployment | Config | Cost/1M tokens (input) | Cost/1M tokens (output) |
| --- | --- | --- | --- |
| Gemma 4 MoE on RTX 5090 | $0.86/hr, ~2M tok/hr | ~$0.43 | ~$0.43 |
| Gemma 4 MoE on H100 spot | $0.80/hr, ~4M tok/hr | ~$0.20 | ~$0.20 |
| Gemma 4 Dense on H100 spot | $0.80/hr, ~2M tok/hr | ~$0.40 | ~$0.40 |

Throughput estimates at batch=8; actual varies with context length and concurrency.
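The per-token figures in the table follow from dividing the hourly price by the hourly throughput. The sketch below reproduces that arithmetic; the function name and the `utilization` parameter are illustrative additions, included because an instance that sits partly idle still bills for the full hour.

```python
def cost_per_million_tokens(
    price_per_hour: float,
    tokens_per_hour: float,
    utilization: float = 1.0,
) -> float:
    """Dollar cost per one million tokens at a sustained throughput.

    utilization is the fraction of each billed hour spent serving traffic
    (1.0 means the GPU is always busy). It is an illustrative assumption,
    not a figure from the pricing table.
    """
    if tokens_per_hour <= 0 or not 0 < utilization <= 1:
        raise ValueError("tokens_per_hour must be > 0 and utilization in (0, 1]")
    effective_tokens = tokens_per_hour * utilization
    return price_per_hour / (effective_tokens / 1_000_000)


if __name__ == "__main__":
    # (label, $/hr, tokens/hr) taken from the table above.
    rows = [
        ("MoE on RTX 5090", 0.86, 2_000_000),
        ("MoE on H100 spot", 0.80, 4_000_000),
        ("Dense on H100 spot", 0.80, 2_000_000),
    ]
    for label, price, throughput in rows:
        cost = cost_per_million_tokens(price, throughput)
        print(f"{label:20s} ~${cost:.2f} per 1M tokens")
```

At 50% utilization the effective cost per token doubles, so the table's numbers are best read as a floor that assumes the GPU is kept saturated.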

Pricing fluctuates based on GPU availability. The prices above are as of 05 Apr 2026 and may have changed. Check current GPU pricing for live rates.

At high throughput, self-hosted Gemma 4 MoE on Spheron spot instances undercuts most per-token API pricing. The break-even point depends on your traffic volume: at 10M+ tokens per day, self-hosting on spot H100 is almost always cheaper. For cost optimization strategies, see our GPU cost optimization playbook and serverless vs. on-demand vs. reserved GPU comparison.

Production Deployment Checklist

Before serving traffic, confirm these settings:

  • Set --max-model-len based on available VRAM after model weights: 8192 is safe for most single-GPU configs with BF16 Dense
  • Enable --enable-chunked-prefill for better throughput under concurrent load
  • Use --gpu-memory-utilization 0.90 to leave headroom for system processes
  • Set up health checks against GET /health on port 8000 before routing traffic
  • Use spot instances for dev and testing; switch to on-demand for production SLAs
  • Monitor GPU memory with nvidia-smi dmon -s mu or Prometheus/Grafana integration
  • Cache HuggingFace model weights on a persistent volume to avoid re-downloading on restart
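The health-check item in the checklist can be automated with a small polling helper. This is a minimal sketch using only the Python standard library; it assumes the server exposes GET /health on port 8000 as described above, and the function names, retry count, and delay are illustrative choices rather than vLLM defaults.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_healthy(
    url: str,
    attempts: int = 60,
    delay_seconds: float = 5.0,
    probe: Callable[[str], bool] = probe_health,
) -> bool:
    """Poll the endpoint until it reports healthy or attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Calling wait_until_healthy("http://localhost:8000/health") from a deploy script blocks until the model has finished loading, which can take several minutes for the larger checkpoints, and returns False if the server never comes up. The probe argument is injectable so the retry logic can be exercised without a running server.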

For serving behind a reverse proxy with the OpenAI-compatible API, see our self-hosted OpenAI-compatible API guide.


Gemma 4's MoE architecture means you get frontier-quality inference at a fraction of the compute cost, and Spheron's spot H100 instances at $0.80/hr make it one of the cheapest ways to run a production-grade open model today.


Quick Setup Guide

  1. Provision a Spheron GPU instance

    Log in to app.spheron.ai, navigate to GPU Instances, and select your target GPU (H100 SXM5 for Dense or MoE BF16, RTX 5090 for MoE FP8). Choose Ubuntu 22.04 with CUDA 12.x. Both Dense BF16 and MoE BF16 need at least 80GB VRAM (H100 SXM5); for MoE FP8, a 32GB RTX 5090 is sufficient.

  2. Install vLLM and authenticate with Hugging Face

    Run: pip install vllm (latest release or nightly build) and then huggingface-cli login with your HF token that has access to the gated google/gemma-4 repositories. Confirm CUDA is available with python -c 'import torch; print(torch.cuda.is_available())'.

  3. Deploy Gemma 4 31B Dense on H100 or B200

    Start the vLLM server: python -m vllm.entrypoints.openai.api_server --model google/gemma-4-31B-it --dtype bfloat16 --tensor-parallel-size 1 --max-model-len 8192 --port 8000. For FP8 on H100: add --quantization fp8. For B200: no extra flags needed since 192GB VRAM fits the full model comfortably.

  4. Deploy Gemma 4 26B MoE on a single GPU

    Start the vLLM server: python -m vllm.entrypoints.openai.api_server --model google/gemma-4-26B-A4B-it --dtype bfloat16 --tensor-parallel-size 1 --max-model-len 8192 --port 8000. The MoE model's ~4B active params mean it uses far less compute per token than the name suggests. On an H100 SXM5 (80GB), you can serve 30+ concurrent requests without batching degradation. For FP8 on RTX 5090 (32GB), add --quantization fp8 (keep --dtype bfloat16 unchanged).

  5. Run quantized Gemma 4 on consumer GPUs (RTX 4090, RTX 5090)

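    For INT4 quantized inference: python -m vllm.entrypoints.openai.api_server --model google/gemma-4-26B-A4B-it --quantization awq --dtype float16 --port 8000. Note that --quantization awq expects a checkpoint that has already been quantized with AWQ, so point --model at an AWQ export rather than the base repository. The AWQ INT4 MoE variant requires roughly 15 GB VRAM, leaving headroom on RTX 4090 (24GB) or RTX 5090 (32GB) for KV cache and larger batches.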

  6. Test the deployment with the OpenAI-compatible API

    Send a test request: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "google/gemma-4-26B-A4B-it", "messages": [{"role": "user", "content": "Explain MoE routing in two sentences."}], "max_tokens": 256}'. A valid JSON response confirms the deployment is working.
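The same smoke test can be scripted in Python with only the standard library, which is convenient inside a deploy pipeline. This is a sketch under the assumption that the server returns the usual OpenAI-style chat completion body (choices[0].message.content); the helper names are hypothetical and not part of vLLM or any client library.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 256
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    parsed = json.loads(response_body)
    return parsed["choices"][0]["message"]["content"]


def send_test_request(base_url: str = "http://localhost:8000") -> str:
    """Send the same prompt as the curl example and return the reply text."""
    request = build_chat_request(
        base_url,
        "google/gemma-4-26B-A4B-it",
        "Explain MoE routing in two sentences.",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_reply(response.read())
```

With the server running, print(send_test_request()) should show the model's answer; a KeyError or connection error indicates the deployment is not serving the expected API.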

Frequently Asked Questions

Gemma 4 31B Dense in BF16 requires ~71 GB VRAM (including framework overhead). A single H100 SXM5 (80GB) fits with some headroom, or a B200 (192GB) for long contexts. Two A100 80GB cards with tensor parallelism also work.

Yes. The 26B MoE model loads all 26B expert weights into VRAM. BF16 requires ~60 GB (including framework overhead), so a single H100 SXM5 (80GB) is the practical single-GPU BF16 option. An RTX 5090 (32GB) can run it with FP8 quantization (~30GB). An RTX 4090 (24GB) can run it quantized to INT4 or GGUF Q4.

Gemma 4 31B Dense activates all 31 billion parameters on every forward pass. The 26B MoE (Mixture of Experts) model routes each token through a subset of experts, activating only ~4B parameters per step (matching the A4B in the model ID). This means the MoE model is much faster and cheaper to run while delivering competitive output quality.

Yes, vLLM supports Gemma 4 architecture natively (install the latest release or nightly build) via the transformers integration. Use --model google/gemma-4-26B-A4B-it (MoE) or --model google/gemma-4-31B-it (Dense) with the appropriate --dtype and --tensor-parallel-size flags.

On Spheron, an H100 SXM5 instance starts at $0.80/hr for spot pricing. Running Gemma 4 31B Dense in BF16 on a single H100 costs approximately $0.80-$2.57/hr depending on spot vs. on-demand. The 26B MoE on an RTX 5090 at $0.86/hr is the most cost-efficient option for moderate throughput.
