
Deploy Gemma 4 QAT on GPU Cloud: 31B Dense at ~66% Less VRAM


Google released QAT checkpoints for all five Gemma 4 variants on June 5, 2026. The headline result: the 31B Dense at w4a16 uses roughly 66% less VRAM than BF16 during inference, which changes which GPU you need. A model that previously required an H100 80GB fits on an L40S 48GB or A100 80GB. The 26B-A4B MoE has no w4a16 checkpoint (its small expert dimensions cause excessive quality loss at 4-bit), but INT8 quantization gives roughly 47% memory savings and is what Google's vLLM recipe recommends. The cost difference is substantial either way. For the full BF16 deployment context, see the Gemma 4 base deployment guide.

What QAT Is and Why It Beats Post-Training Quantization

Post-training quantization (PTQ), which covers AWQ, GPTQ, and GGUF K-quants, takes a fully trained model and compresses the weights after training is done. A calibration step runs representative samples through the model, records activation statistics, and uses those to set quantization scales that minimize rounding error. The weights get compressed; the model loses some quality.

QAT works differently. The model is re-trained with quantization noise injected during every forward pass. Instead of patching the weights after training, the model learns to work within the constraints of 4-bit precision. By the end of training, the weights have internalized the precision loss and compensate for it. You end up with a 4-bit model that performs closer to the BF16 original than a post-hoc quantized version of the same model at the same bit-width.
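To make "quantization noise injected during every forward pass" concrete, here is a minimal sketch of the quantize-then-dequantize ("fake quantization") step that QAT pipelines commonly apply to weights. It assumes a symmetric, per-tensor 4-bit grid; the function name and sample weights are illustrative, and this is not a description of Google's actual training recipe.

```python
def fake_quantize(weights, bits=4):
    """Snap weights to a symmetric signed integer grid, then map back to floats.

    Returns the dequantized weights and the scale that was used.
    """
    qmax = 2 ** (bits - 1) - 1      # largest integer level (7 for 4-bit)
    qmin = -(2 ** (bits - 1))       # smallest integer level (-8 for 4-bit)
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 1.0   # nothing to quantize
    scale = max_abs / qmax
    levels = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return [level * scale for level in levels], scale


# Illustrative weights, not taken from any real checkpoint.
weights = [0.82, -0.31, 0.05, -0.77, 0.40]
dequantized, scale = fake_quantize(weights)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]

print(f"scale = {scale:.4f}")
print("dequantized:", [round(d, 4) for d in dequantized])
print(f"max rounding error = {max(errors):.4f}")
```

During QAT the forward pass sees the dequantized values, so the loss already reflects the rounding error, and training nudges the weights toward values that survive the 4-bit grid. PTQ applies the same rounding only once, after training has finished.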

The practical result at 4-bit: QAT typically matches or beats AWQ at equivalent weight precision on downstream task quality. For a deep look at how AWQ calibration works and where it falls short, see our AWQ quantization guide. For NVIDIA ModelOpt's own QAT pipeline (which covers FP8 and FP4 alongside INT4), see the TensorRT Model Optimizer guide.

The compressed-tensors format stores w4a16 weights (4-bit integer weights, bfloat16 activations) with quantization metadata in a Hugging Face-native checkpoint. vLLM reads it directly via --quantization compressed-tensors. No separate conversion step.

Gemma 4 QAT Model Sizes and VRAM Requirements

Google released QAT checkpoints for all five Gemma 4 variants:

| Model ID | Type | Quant. Weight VRAM | BF16 Weight VRAM | Weight Savings† | Min GPU |
|---|---|---|---|---|---|
| google/gemma-4-E2B-it-qat-w4a16-ct | Dense 2B (w4a16) | ~1.5 GB | ~5 GB | ~70% | Any |
| google/gemma-4-E4B-it-qat-w4a16-ct | Dense 4B (w4a16) | ~2.5 GB | ~8 GB | ~69% | Any |
| google/gemma-4-12B-it-qat-w4a16-ct | Dense 12B (w4a16) | ~7 GB | ~25 GB | ~72% | RTX 4090 |
| google/gemma-4-26B-A4B-it (INT8‡) | MoE 26B (~4B active) | ~28 GB | ~52 GB | ~47% | A100 40GB |
| google/gemma-4-31B-it-qat-w4a16-ct | Dense 31B (w4a16) | ~17 GB | ~62 GB | ~73% | RTX 4090 |

†These figures are checkpoint parameter-storage savings vs BF16. Realized VRAM reduction during vLLM inference is lower because embedding and norm layers stay at BF16: E2B ~26%, E4B ~36%, 12B ~64%, 31B ~66%, per the official vLLM Gemma 4 recipe. Small-model savings diverge most from the ceiling because embedding layers make up a larger share of total VRAM at low parameter counts.

‡The 26B-A4B MoE has no w4a16 compressed-tensors checkpoint. Google excluded it because the model's small expert dimensions (704) cause excessive quality loss at 4-bit. The recommended approach is --quantization int8_per_channel_weight_only applied on-the-fly to the base model, giving ~47% weight savings with acceptable quality.
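The weight-storage columns follow from simple arithmetic. The sketch below is a back-of-the-envelope check, not an exact accounting: `weight_gb` counts only raw parameter bytes (1 GB taken as 10^9 bytes) and ignores quantization scales and any layers kept at BF16, which is presumably why the table's ~17 GB sits above the raw 4-bit lower bound.

```python
def weight_gb(params_billions, bits_per_weight):
    """Raw weight storage in GB, ignoring scales, metadata, and unquantized layers."""
    return params_billions * bits_per_weight / 8


def savings(quantized_gb, bf16_gb):
    """Fractional reduction in weight storage relative to BF16."""
    return 1 - quantized_gb / bf16_gb


bf16_31b = weight_gb(31, 16)    # BF16 baseline for 31B parameters
int4_31b = weight_gb(31, 4)     # raw 4-bit lower bound

print(f"31B at BF16:  {bf16_31b:.1f} GB")
print(f"31B at 4-bit: {int4_31b:.1f} GB (lower bound)")
print(f"Savings from the table's ~17 GB vs ~62 GB: {savings(17, 62):.0%}")
```

Treat the result as a floor when sizing a GPU: realized VRAM during inference also includes KV cache and activation memory.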

VRAM figures are for model weights only (not including KV cache or activation overhead). KV cache adds on top based on context length and batch size. At 16K context with batch size 8, expect 4-8 GB additional VRAM per model beyond the weight footprint. For KV cache sizing guidance, see the KV cache optimization guide. For detailed VRAM math across quantization levels, see GPU memory requirements for LLMs.

The MoE model (26B-A4B) loads all 26B INT8-quantized weights into VRAM, but each forward pass activates only ~4B parameters. You need VRAM for all 26B weights at inference time, not just the active 4B.

Serving QAT Checkpoints with vLLM

Gemma 4 QAT 31B Dense

On an A100 80GB or L40S 48GB, the 31B QAT model runs comfortably at 16K+ context:

```bash
vllm serve google/gemma-4-31B-it-qat-w4a16-ct \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --port 8000
```

On an H100 80GB, the same command works and gives more KV cache headroom:

```bash
vllm serve google/gemma-4-31B-it-qat-w4a16-ct \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --port 8000
```

The --dtype bfloat16 flag sets activation precision. vLLM auto-detects compressed-tensors from the checkpoint config, so --quantization compressed-tensors is optional, but you can pass it explicitly if you prefer.

Gemma 4 26B-A4B MoE (INT8)

The 26B-A4B has no w4a16 compressed-tensors checkpoint. Google's vLLM recipe recommends INT8 quantization applied on-the-fly to the base model, which reduces weight VRAM by ~47% vs BF16 (from ~52 GB to ~28 GB):

```bash
vllm serve google/gemma-4-26B-A4B-it \
    --quantization int8_per_channel_weight_only \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --port 8000
```

This requires an A100 40GB or larger. The MoE model loads all 26B INT8-quantized weights into VRAM but routes each token through only ~4B active parameters per forward pass. Per-token compute is cheap; VRAM holds all expert weights.

SGLang Alternative

SGLang supports compressed-tensors natively as of mid-2026:

```bash
python -m sglang.launch_server \
    --model google/gemma-4-31B-it-qat-w4a16-ct \
    --quantization compressed-tensors \
    --port 30000
```

Both vLLM and SGLang give comparable throughput for compressed-tensors inference. vLLM is the default recommendation for most deployments. For a throughput comparison across frameworks, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Quality vs Memory: What the Benchmarks Show

QAT at 4-bit consistently outperforms AWQ at 4-bit because the model weights have been adjusted during training to compensate for precision loss, not patched after training.

| Benchmark | BF16 31B | QAT 31B (w4a16) | AWQ 31B (INT4) | Delta: QAT vs BF16 |
|---|---|---|---|---|
| MMLU (5-shot) | ~84% | ~82.5% | ~81.5% | ~-1.5% |
| HumanEval (pass@1) | ~68% | ~67% | ~65% | ~-1% |
| GSM8K (8-shot CoT) | ~82% | ~80.5% | ~79% | ~-1.5% |

Figures are approximate deltas based on typical QAT vs AWQ patterns at w4a16. Run your own evaluation before production decisions. The practical rule: if your task involves multi-step reasoning, benchmark explicitly. For instruction following, summarization, and retrieval tasks, a 1-2% MMLU gap is rarely noticeable in production.

The advantage QAT has over AWQ is most visible on reasoning tasks where error accumulates across steps. A single quantization error in a reasoning chain compounds differently than in a summarization task.
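A toy model shows why small per-step differences grow over a reasoning chain. The sketch below assumes every step succeeds independently with the same probability, which is a simplification, and the 0.99 and 0.98 per-step accuracies are hypothetical values chosen for illustration, not measurements of any Gemma 4 checkpoint.

```python
def chain_success(per_step_accuracy, steps):
    """Probability that every step in a chain is correct, assuming independent steps."""
    return per_step_accuracy ** steps


# Hypothetical per-step accuracies for two quantized variants of the same model.
for steps in (1, 5, 10, 20):
    higher = chain_success(0.99, steps)
    lower = chain_success(0.98, steps)
    print(f"{steps:>2} steps: {higher:.3f} vs {lower:.3f} (gap {higher - lower:.3f})")
```

Under these assumptions a one-point gap per step widens to several points over ten steps, which is why multi-step reasoning workloads deserve their own evaluation.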

Right-Sizing the GPU: Which to Rent for QAT Inference

| GPU | VRAM | 31B QAT (w4a16) fits? | 26B-A4B INT8 fits? | Max context (est.) |
|---|---|---|---|---|
| RTX 4090 | 24 GB | Yes (~7 GB headroom) | No (needs ~28 GB) | |
| RTX 5090 | 32 GB | Yes (~15 GB headroom) | Not recommended (< 8 GB KV cache headroom) | |
| L40S 48GB | 48 GB | Yes (~31 GB headroom) | Yes (~20 GB headroom) | 32K+ |
| A100 40GB | 40 GB | Yes (~23 GB headroom) | Yes (~12 GB headroom) | 8K-16K |
| A100 80GB | 80 GB | Yes (~63 GB headroom) | Yes (~52 GB headroom) | Full 256K |
| H100 80GB | 80 GB | Yes (~63 GB headroom) | Yes (~52 GB headroom) | Full 256K |

The critical point: BF16 Gemma 4 31B needs an H100 80GB (or two A100 80GB cards with tensor parallelism). The QAT 31B at w4a16 fits on a single L40S 48GB with substantial KV cache headroom. The 26B-A4B at INT8 uses ~28 GB, so an A100 40GB is the practical minimum for production use. The headroom column accounts for ~17 GB of QAT 31B weight storage and ~28 GB of 26B-A4B INT8 weight storage.
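The headroom figures are just GPU VRAM minus weight storage. Here is a small sketch that reproduces them from the numbers in this section; the `fits` helper and its minimum-headroom threshold are illustrative assumptions, so pick a threshold that matches your own context length and batch size.

```python
GPU_VRAM_GB = {
    "RTX 4090": 24,
    "RTX 5090": 32,
    "L40S 48GB": 48,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
}

# Approximate weight storage from this guide.
WEIGHTS_GB = {"31B QAT (w4a16)": 17, "26B-A4B INT8": 28}


def headroom_gb(gpu, model):
    """VRAM left for KV cache and activations after loading weights (negative = no fit)."""
    return GPU_VRAM_GB[gpu] - WEIGHTS_GB[model]


def fits(gpu, model, min_headroom_gb):
    """True if the weights load and at least min_headroom_gb remains."""
    return headroom_gb(gpu, model) >= min_headroom_gb


for gpu in GPU_VRAM_GB:
    row = ", ".join(f"{model}: {headroom_gb(gpu, model):+d} GB" for model in WEIGHTS_GB)
    print(f"{gpu:<10} {row}")
```

For example, `fits("RTX 5090", "26B-A4B INT8", 8)` is false because only 4 GB would remain, which matches the "not recommended" entry in the table.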

For context lengths beyond 32K on an L40S, watch VRAM usage with nvidia-smi dmon. At 32K context with batch 16, KV cache on the 31B model can consume 10-20 GB depending on sequence characteristics.
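If you would rather log VRAM usage from a script than watch a terminal, one option is to poll nvidia-smi in query mode and parse its CSV output. This is a sketch under assumptions: it uses `--query-gpu` rather than `dmon`, the helper names are made up for illustration, and the sample line at the bottom contains invented values so the block runs without a GPU.

```python
import subprocess

# Query mode prints one CSV row per GPU: used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Turn 'used, total' rows into a list of (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Live reading; requires nvidia-smi on PATH."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)


# Offline demo on a sample row shaped like nvidia-smi's output (values are made up).
for index, (used, total) in enumerate(parse_memory("30500, 46068\n")):
    print(f"GPU {index}: {used} / {total} MiB ({used / total:.0%})")
```

On a GPU instance, call `read_gpu_memory()` in a loop while you raise context length or batch size to see how quickly KV cache eats into the headroom.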

Cost-Per-Token Comparison

Live pricing from the Spheron API (dedicated on-demand offers):

On-demand pricing (DEDICATED offers, per GPU):

| GPU | On-Demand $/hr | QAT 31B tok/s (est.) | $/1M tokens (est.) |
|---|---|---|---|
| RTX 5090 PCIe | $0.92 | ~65 | ~$3.93 |
| L40S PCIe | $0.96 | ~45 | ~$5.93 |
| A100 80G PCIe | $1.43 | ~80 | ~$4.97 |
| A100 80G SXM4 | $1.69 | ~80 | ~$5.87 |
| H100 SXM5 (QAT 31B) | $2.54 | ~100 | ~$7.06 |
| H100 SXM5 (BF16 31B, for comparison) | $2.54 | ~65 | ~$10.83 |

The QAT 31B on RTX 5090 costs roughly $3.93/M tokens on-demand, and on A100 PCIe about $4.97/M tokens, versus $10.83/M for BF16 31B on H100. That is a 54-64% reduction depending on the GPU.

Throughput estimates are for batch=1, 512-token context. At higher batch sizes (32+), throughput increases and cost per token drops further.
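The cost column is derived from hourly price and throughput. A minimal sketch of that calculation, using the table's estimated figures (the function name is illustrative), lets you plug in your own measured tokens per second:

```python
def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    """Dollars to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


# Hourly prices and throughput estimates from the on-demand table.
offers = {
    "RTX 5090 PCIe": (0.92, 65),
    "A100 80G PCIe": (1.43, 80),
    "H100 SXM5 (QAT 31B)": (2.54, 100),
}

for gpu, (price, tok_s) in offers.items():
    print(f"{gpu:<22} ${cost_per_million_tokens(price, tok_s):.2f} per 1M tokens")
```

Because cost scales inversely with throughput, doubling tokens per second through batching halves the cost per token at the same hourly rate.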

Spot pricing:

| GPU | Spot $/hr | QAT 31B tok/s (est.) | $/1M tokens (est.) |
|---|---|---|---|
| A100 80G SXM4 | $0.85 | ~80 | ~$2.95 |
| H100 SXM5 | $1.49 | ~100 | ~$4.14 |

Spot instances can be reclaimed at any time without notice. Use spot for batch inference jobs and dev environments; use on-demand for production endpoints with SLA requirements.

Pricing fluctuates based on GPU availability. The prices above are as of 15 Jun 2026 and may have changed. Check current GPU pricing for live rates.

For a broader cost modeling framework that covers how to pick between spot, on-demand, and reserved capacity, see AI inference cost economics.

Production Checklist

Before routing traffic to a QAT deployment:

  • Use --max-model-len 16384 as the starting point on A100 or L40S; increase if VRAM usage after model load leaves enough headroom
  • Enable --enable-chunked-prefill for better concurrent request throughput under mixed batch sizes
  • Set --gpu-memory-utilization 0.90 to leave headroom for OS and driver overhead
  • Monitor queue depth with vLLM's /metrics endpoint (see the vLLM production guide)
  • Cache Hugging Face weights on a persistent volume; QAT checkpoints are roughly 17 GB for the 31B versus 62 GB for BF16 weights, so cold starts are faster and transfer costs are lower
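  • For multimodal input, Gemma 4 QAT supports image plus text; add --limit-mm-per-prompt image=4 to cap images per request (some newer vLLM releases expect the JSON form '{"image": 4}' instead, so check vllm serve --help for your version)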
  • Use on-demand for SLA-backed production endpoints and spot for dev and batch jobs
  • If per-token throughput matters more than cost, consider the 26B-A4B INT8 MoE: it activates only ~4B parameters per forward pass, so it should generate tokens faster per request than the 31B Dense (requires an A100 40GB or larger for its ~28 GB weight footprint)
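For the queue-depth item in the checklist, here is a minimal sketch of reading a gauge from vLLM's Prometheus-format /metrics endpoint. The metric name `vllm:num_requests_waiting` is what recent vLLM releases expose, but names can change between versions, so confirm against your server's actual output. The helper names and the sample text are illustrative.

```python
import urllib.request


def parse_gauge(metrics_text, metric_name):
    """Sum all samples of one metric in Prometheus text format; None if absent."""
    total = 0.0
    found = False
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            total += float(value)
            found = True
    return total if found else None


def fetch_metrics(base_url="http://localhost:8000"):
    """Download the raw metrics text from a running vLLM server."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as response:
        return response.read().decode("utf-8")


# Offline sample shaped like vLLM's output (values are made up).
sample = """# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="google/gemma-4-31B-it-qat-w4a16-ct"} 3.0
vllm:num_requests_running{model_name="google/gemma-4-31B-it-qat-w4a16-ct"} 8.0
"""
print("waiting:", parse_gauge(sample, "vllm:num_requests_waiting"))
```

Against a live server, pass `fetch_metrics()` to `parse_gauge`. A waiting count that keeps growing suggests the instance is saturated at the current batch and context settings.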

The Gemma 4 QAT checkpoints are released under Apache 2.0 and do not require a separate license agreement on Hugging Face. Run huggingface-cli login to authenticate and download model weights.


QAT checkpoints make the Gemma 4 31B viable on a single mid-tier GPU. Spheron's GPU instances are available on-demand, billed per minute, with SSH root access and no setup fees.



Quick Setup Guide

  1. Provision a Spheron GPU instance for QAT inference

    Log into app.spheron.ai and select an A100 80GB or L40S 48GB from the GPU catalog. Both handle Gemma 4 QAT 31B comfortably. Choose Ubuntu 22.04 with CUDA 12.x. SSH in and verify with nvidia-smi.

  2. Install vLLM and set your Hugging Face token

    Run: pip install vllm --upgrade. Set your Hugging Face token so vLLM can download model weights: huggingface-cli login. The Gemma 4 QAT checkpoints are published under Apache 2.0 and do not require a separate license acceptance on Hugging Face. Confirm CUDA is available with: python -c 'import torch; print(torch.cuda.is_available())'.

  3. Serve Gemma 4 QAT 31B with vLLM

    Start the OpenAI-compatible server: vllm serve google/gemma-4-31B-it-qat-w4a16-ct --dtype bfloat16 --gpu-memory-utilization 0.90 --max-model-len 16384 --port 8000. vLLM auto-detects the compressed-tensors format from the checkpoint config; you may also pass --quantization compressed-tensors explicitly.

  4. Serve Gemma 4 26B-A4B MoE with vLLM using INT8 quantization

    The 26B-A4B MoE has no w4a16 compressed-tensors checkpoint; its small expert dimensions cause excessive quality loss at 4-bit. Use INT8 quantization on the base model instead: vllm serve google/gemma-4-26B-A4B-it --quantization int8_per_channel_weight_only --dtype bfloat16 --gpu-memory-utilization 0.90 --max-model-len 16384 --port 8000. This reduces weight VRAM by ~47% vs BF16 (from ~52 GB to ~28 GB), requiring an A100 40GB or larger. The MoE activates only ~4B parameters per forward pass at runtime.

  5. Test the deployment

    Send a test request: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "google/gemma-4-31B-it-qat-w4a16-ct", "messages": [{"role": "user", "content": "Explain quantization-aware training in two sentences."}], "max_tokens": 200}'. You should receive a JSON response with the model's reply under choices[0].message.content.
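The same test request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper names are illustrative, and it assumes the server from step 3 is listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_request(prompt, model="google/gemma-4-31B-it-qat-w4a16-ct",
                       base_url="http://localhost:8000", max_tokens=200):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server, so you can inspect it first.
request = build_chat_request("Explain quantization-aware training in two sentences.")
print(request.get_method(), request.full_url)
```

Once the server is up, `send_chat("Explain quantization-aware training in two sentences.")` returns the reply text directly.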


Frequently Asked Questions

Gemma 4 QAT (Quantization-Aware Training) checkpoints are model weights where the quantization is baked in during a fine-tuning pass rather than applied post-hoc. Google released these on June 5, 2026 in the w4a16 compressed-tensors format: 4-bit weights, 16-bit activations. This differs from AWQ (post-training quantization that identifies and protects salient weights) and GGUF (a container format for llama.cpp). Because the model is re-trained to compensate for quantization error, QAT checkpoints typically retain more quality at 4-bit precision than equivalently sized AWQ checkpoints.

The Gemma 4 QAT 31B at w4a16 requires approximately 17 GB for checkpoint parameter storage, roughly 73% less than the 62 GB BF16 checkpoint. Realized VRAM reduction during vLLM inference is approximately 66%, since embedding and norm layers remain at BF16 precision. A single L40S 48GB handles it comfortably with headroom for 16K+ context windows. An A100 80GB works well for production context lengths. RTX 5090 (32GB) or RTX 4090 (24GB) are viable for development, but KV cache headroom is tighter at longer contexts.

Yes, Gemma 4 QAT 31B runs on a single A100 80GB. At w4a16 the model has approximately 17 GB of weight storage, leaving over 60 GB for KV cache. With --max-model-len 16384, it fits comfortably with production-level concurrency. At BF16, the same 31B model requires roughly 71 GB on an H100 80GB (including framework overhead), so QAT changes the economics substantially.

w4a16 means weights are stored in 4-bit integer format and activations are computed in 16-bit (bfloat16 or float16). The compressed-tensors format is a Hugging Face-native checkpoint format introduced by Neural Magic that stores quantization metadata alongside weights in a way that vLLM and SGLang can load directly with --quantization compressed-tensors.

Google's QAT checkpoints show approximately 1-2% degradation on standard benchmarks (MMLU, HumanEval) compared to BF16. This is better than typical AWQ INT4 post-training quantization on the same model at similar bit-widths, because QAT error correction is baked into the weights rather than applied after training. For most production inference workloads, the ~66% realized VRAM reduction (31B Dense) and resulting cost savings outweigh the marginal accuracy cost.
