Tutorial

Deploy Gemma 3 on GPU Cloud: Complete Guide for All Variants

Written by Mitrasish, Co-founder · Mar 20, 2026
Gemma 3 · Google AI · GPU Cloud · vLLM · Ollama · LLM Deployment · Open Source AI · Multimodal AI

Gemma 3 27B is Google's strongest open model. It fits on a single H100 without quantization, handles images alongside text, and at launch ranked in the global Chatbot Arena top 10, outperforming much larger open models including Llama 3 405B and DeepSeek V3 671B while running on a single GPU. This guide covers all five variants (270M through 27B), GPU requirements for each, step-by-step vLLM and Ollama deployment, and a full cost breakdown for the 27B on Spheron.

Gemma 3 Model Variants

Gemma 3 comes in five sizes: 270M, 1B, 4B, 12B, and 27B. The 270M and 1B are text-only with a 32K context window, suited for edge devices and embedded inference. The 4B, 12B, and 27B support multimodal input and a 128K context window, making them the primary choices for cloud GPU deployment. All variants below are instruction-tuned (-it); base variants also exist but are primarily useful for fine-tuning.

| Variant | Parameters | BF16 VRAM | Multimodal | Context | Best For |
|---|---|---|---|---|---|
| 270M | 270M | ~0.5 GB | No | 32K | Edge devices, embedded fine-tuning |
| 1B | 1B | ~2 GB | No | 32K | Edge inference, embedded |
| 4B | 4B | ~8 GB | Yes | 128K | Dev, low-traffic APIs |
| 12B | 12B | ~24 GB | Yes | 128K | Mid-range production |
| 27B | 27B | ~54 GB | Yes | 128K | High-quality production |

VRAM figures are for BF16 model weights, including the SigLIP vision encoder on multimodal variants. Actual serving VRAM will be higher due to KV cache, activations, and context length; plan for an 80 GB GPU when running the 27B at production context lengths.

HuggingFace model IDs: google/gemma-3-270m-it, google/gemma-3-1b-it, google/gemma-3-4b-it, google/gemma-3-12b-it, google/gemma-3-27b-it.

The 1B is text-only and suited for edge devices or embedded inference where VRAM is limited. The 4B through 27B models all support interleaved image and text input with a 128K context window.
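The BF16 VRAM column follows a simple rule of thumb: parameter count times bytes per parameter (2 for BF16, 1 for INT8, 0.5 for INT4), before KV cache and activation overhead. A quick sketch of that arithmetic:

```shell
# Weight footprint in GB: params (billions) x bytes per parameter.
# Usage: weights_gb <params_in_billions> <bytes_per_param>
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b }'
}

weights_gb 27 2     # 27B in BF16  -> 54.0 GB
weights_gb 27 1     # 27B in INT8  -> 27.0 GB
weights_gb 27 0.5   # 27B in INT4  -> 13.5 GB
```

These numbers line up with the quantization options discussed below: BF16 needs an 80 GB card, 8-bit fits a 48 GB L40S, and 4-bit squeezes onto a 24 GB RTX 4090.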

GPU Requirements Per Variant

Gemma 3 1B

Minimum GPU: Any GPU with 4+ GB VRAM. Cards like the NVIDIA L4 (24 GB) or RTX 4090 (24 GB) run it with room to spare.

Recommended: A single RTX 4090 or L4 for development; any cloud GPU with 8+ GB for production.

Budget option: Consumer cards with 8 GB VRAM run the 1B model in FP16 without issue.

This is the edge inference variant. Choosing a cloud GPU for the 1B usually reflects very high concurrency requirements, not a VRAM constraint.

Gemma 3 4B

Minimum GPU: Any GPU with 8+ GB VRAM.

Recommended: RTX 4090 24 GB or L40S 48 GB for production serving.

Budget option: The RTX 4090 runs 4B in FP16 with headroom for large KV caches.

The 4B hits a useful balance of quality and speed for chatbots, classification, and lightweight generation APIs.

Gemma 3 12B

Minimum GPU: A100 40 GB for comfortable single-GPU production. The 12B model weights are ~24 GB in BF16, which fills a 24 GB GPU entirely, leaving no room for KV cache. An RTX 4090 can only run 12B with INT4 or INT8 quantization.

Recommended: A100 80 GB or H100 80 GB for production with longer contexts.

Budget option: A100 40 GB is the practical starting point for BF16/FP16 inference. For RTX 4090 users, Q4 or INT8 quantization brings the model down to 6-12 GB. See the Spheron Gemma 3 guide for hardware-specific deployment scripts.

At ~24 GB BF16 weights, the 12B requires at least a 40 GB GPU for actual inference. A 40+ GB GPU is the sweet spot for cost-sensitive APIs where quality matters but the 27B is over-budget.

Gemma 3 27B

BF16 production: 1x H100 80GB or 1x A100 80GB. Both handle the ~54 GB model weights with room for KV cache at moderate context lengths.

8-bit quantization: 1x L40S 48GB. Eight-bit weights (an INT8 checkpoint or vLLM's online FP8) compress the model to ~27 GB, fitting the L40S with space for KV cache.

INT4 development: 1x RTX 4090 24GB. INT4 drops the model to ~14 GB, viable for development and low-traffic use. Not recommended for production due to throughput limits.

For a deeper breakdown of VRAM calculations across quantization levels and context lengths, see our GPU memory requirements guide.

Deploying Gemma 3 27B with vLLM

vLLM is the recommended production serving engine. It provides continuous batching, PagedAttention for efficient KV cache management, and an OpenAI-compatible API. The current stable series as of March 2026 is v0.17.x, with full Gemma 3 multimodal support. For a full server setup walkthrough on Spheron, see the Spheron vLLM guide.

Step 1: Set Up Your GPU Server

Provision a GPU server on Spheron with your target GPU. Log into app.spheron.ai, select an H100 or A100 from the catalog, and deploy an instance with Ubuntu 22.04 and CUDA 12.x. See the Spheron SSH connection guide and CUDA driver setup if this is your first time on a Spheron GPU instance. For a ready-to-use cloud-init deployment script for Gemma 3, see the Spheron Gemma 3 guide.

SSH in and confirm GPU access:

bash
nvidia-smi

You should see your GPU listed with the full VRAM capacity. For an H100 80GB, the memory column reads roughly 81,000 MiB (the exact figure varies slightly by SKU).

Step 2: Install vLLM

bash
pip install vllm --upgrade

Verify the version:

bash
python -c "import vllm; print(vllm.__version__)"

This guide assumes the v0.17.x series (current as of March 2026), which includes full Gemma 3 multimodal support. For step-by-step vLLM server setup and performance tuning on Spheron, see the Spheron vLLM inference guide.
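Deployment scripts can fail fast when an older vLLM without Gemma 3 multimodal support is installed. A small version gate using GNU `sort -V` (the 0.17.0 floor here reflects the series this guide assumes, not a hard requirement published by vLLM):

```shell
# version_ge A B: true when version A >= B under version ordering (GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | tail -n1)" = "$1" ]
}

# In a deployment script, gate on the installed version, e.g.:
#   INSTALLED="$(python -c 'import vllm; print(vllm.__version__)')"
#   version_ge "$INSTALLED" 0.17.0 || { echo "vLLM too old" >&2; exit 1; }
version_ge 0.17.3 0.17.0 && echo "0.17.3 meets the 0.17.0 floor"
```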

Step 3: Authenticate with HuggingFace

Gemma 3 models are gated on HuggingFace. Before vLLM can download the weights, you must accept Google's license and provide a valid token.

  1. Go to huggingface.co/google/gemma-3-27b-it and accept the license terms while logged in to your HuggingFace account.
  2. Generate an access token at huggingface.co/settings/tokens.
  3. Export the token on your GPU instance:
bash
export HF_TOKEN=your_token_here

Skipping this step causes vLLM to fail with a 401 authentication error when attempting to download the model. For Spheron-specific token setup including persistent storage via /etc/vllm/hf-token and systemd configuration, see the Spheron Gemma 3 deployment guide.

Step 4: Serve Gemma 3 27B in BF16

For an H100 or A100 80GB, run the model in BF16 without quantization:

bash
vllm serve google/gemma-3-27b-it \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --port 8000

For multimodal use (image + text), add:

bash
vllm serve google/gemma-3-27b-it \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --limit-mm-per-prompt image=4 \
    --port 8000

The --max-model-len 16384 leaves good KV cache headroom. Increase to 32768 or higher if your application needs longer contexts and your GPU has the VRAM.
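KV cache grows linearly with context length, which is why capping --max-model-len matters. A rough per-request estimator (the layer and head counts below are illustrative placeholders for a 27B-class dense model, not Gemma 3's exact architecture; Gemma 3 also interleaves local and global attention layers, which shrinks the real cache):

```shell
# Per-token KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes/elem.
# Usage: kv_cache_gb <layers> <kv_heads> <head_dim> <dtype_bytes> <context_tokens>
kv_cache_gb() {
    awk -v l="$1" -v h="$2" -v d="$3" -v b="$4" -v t="$5" \
        'BEGIN { printf "%.2f\n", 2 * l * h * d * b * t / 1e9 }'
}

# Illustrative 27B-class values: 60 layers, 16 KV heads, head_dim 128, BF16 (2 bytes).
kv_cache_gb 60 16 128 2 16384   # -> ~8.05 GB for one 16K-token request
```

Under these assumptions, ~54 GB of weights plus ~8 GB of cache still fits comfortably inside the ~72 GB that --gpu-memory-utilization 0.90 allows on an 80 GB card.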

Step 5: Serve with 8-bit Quantization for the L40S

On an L40S 48GB, the 27B needs 8-bit weights to fit. Note that vLLM's --quantization flag has no plain int8 value; the practical routes are online FP8 quantization (supported on the L40S's Ada architecture) or serving a pre-quantized INT8 W8A8 checkpoint. For online FP8:

bash
vllm serve google/gemma-3-27b-it \
    --dtype bfloat16 \
    --quantization fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --port 8000

Eight-bit weights compress the model from ~54 GB to ~27 GB. Quality loss is minimal for most workloads, typically under 1-2% on standard benchmarks.

Step 6: Test the Endpoint

Send a request to verify the server is responding:

bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "google/gemma-3-27b-it",
        "messages": [{"role": "user", "content": "What is GPU tensor parallelism?"}],
        "max_tokens": 200
    }'

For multimodal input, pass images as base64-encoded content blocks:

bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "google/gemma-3-27b-it",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<BASE64_IMAGE>"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }],
        "max_tokens": 300
    }'

Monitor GPU utilization while the server handles requests:

bash
watch -n1 nvidia-smi

For advanced vLLM configuration including FP8, chunked prefill, and production monitoring, see our vLLM production deployment guide.
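For unattended production use, the vllm serve command from Step 4 can run as a systemd service so it restarts on failure and survives reboots. A minimal sketch, assuming an illustrative unit name and the /etc/vllm/hf-token convention mentioned earlier (neither is a Spheron default):

```ini
# /etc/systemd/system/vllm-gemma3.service  (hypothetical unit name)
[Unit]
Description=vLLM server for Gemma 3 27B
After=network-online.target

[Service]
# Read the HF token from a root-owned file (mode 0600) at start time.
ExecStart=/bin/sh -c 'HF_TOKEN=$(cat /etc/vllm/hf-token) exec vllm serve google/gemma-3-27b-it --dtype bfloat16 --gpu-memory-utilization 0.90 --max-model-len 16384 --port 8000'
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl daemon-reload followed by systemctl enable --now vllm-gemma3.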

Deploying Gemma 3 with Ollama

Ollama is the simpler path for development and single-user inference. It handles model download, quantization, and the API server in one command.

bash
ollama pull gemma3:27b    # ~17 GB download (Q4_K_M quantization)
ollama run gemma3:27b

Once running, the Ollama API is available at localhost:11434:

bash
curl http://localhost:11434/api/chat \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gemma3:27b",
        "messages": [{"role": "user", "content": "Explain attention mechanisms."}],
        "stream": false
    }'

Ollama vs vLLM: which to use

Use Ollama for quick iteration during development, single-machine setups, or testing Gemma 3 without production infrastructure. It defaults to Q4_K_M quantization (~17 GB for 27B), runs on an A100 40GB, L40S, or H100, and requires zero configuration.

Use vLLM when you need concurrent request handling, token streaming with OpenAI-compatible clients, 8-bit instead of 4-bit quantization for better quality, or a production SLA. vLLM's continuous batching makes a meaningful difference above 5-10 concurrent users.

For a full Ollama setup walkthrough including model management and API integration, see our guide to running LLMs locally with Ollama. For Ollama on Spheron specifically (including Open WebUI setup), see the Spheron Ollama guide.

Cost to Serve Gemma 3 27B on Spheron

| GPU | VRAM | Precision | Tokens/sec (est.) | On-Demand $/hr | Monthly (24/7) |
|---|---|---|---|---|---|
| RTX 4090 | 24 GB | INT4 | ~35 | $0.58 | ~$418 |
| L40S | 48 GB | INT8 | ~60 | $0.91 | ~$655 |
| A100 80GB | 80 GB | BF16 | ~80 | $1.07 | ~$770 |
| H100 80GB | 80 GB | BF16 | ~120 | $2.01 | ~$1,447 |
| H200 | 141 GB | BF16 | ~160 | $4.54 | ~$3,269 |

Tokens/sec estimates are for single-request inference at typical output lengths. Batch throughput is higher. Monthly figures use 720 hours (24/7) of on-demand pricing.
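The monthly column is just the hourly rate times 720, which makes it easy to re-run the math when rates change:

```shell
# Monthly 24/7 cost at an hourly on-demand rate (720 hours per month).
# Usage: monthly_cost <dollars_per_hour>
monthly_cost() {
    awk -v rate="$1" 'BEGIN { printf "%.0f\n", rate * 720 }'
}

monthly_cost 0.91   # L40S on-demand  -> 655
monthly_cost 2.01   # H100 on-demand  -> 1447
```

Swap in the spot rates below to see the savings, e.g. the L40S drops from ~$655 to ~$295 per month at $0.41/hr.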

Spot instances are available at lower rates: H200 at $1.78/hr, H100 at $0.99/hr, A100 at $0.61/hr, and L40S at $0.41/hr. RTX 4090 spot pricing is not available. Spot is suitable for development and batch workloads where interruptions are acceptable.

All prices are based on Spheron on-demand rates as of 19 Mar 2026 and can fluctuate over time based on GPU availability. For current rates, check Spheron GPU pricing. To reduce costs further, see the Spheron cost optimization guide for spot instance strategies and reserved pricing options.

For a high-quality API with moderate traffic (under 50 concurrent users), the L40S with INT8 at ~$655/month is the most cost-efficient option. For production with high concurrency or strict quality requirements, the H100 in BF16 is the standard choice.

Gemma 3 vs Llama 4 vs Qwen 3

As of March 2026, these are the top open models competing with Gemma 3 27B for production inference:

| Model | Params | Single-GPU | Multimodal | Context | Best For |
|---|---|---|---|---|---|
| Gemma 3 27B | 27B dense | Yes (H100/A100 BF16) | Yes | 128K | Structured tasks, multimodal, single-GPU simplicity |
| Llama 4 Scout | 109B MoE (17B active) | Yes (H100 INT4) | Yes | 10M | Ultra-long context, RAG, MoE efficiency |
| Qwen 3 32B | 32.8B dense | Yes (H100 FP16, tight) | No (text-only; Qwen3-VL-32B is separate) | 128K | Reasoning (thinking mode), multilingual, code |

Gemma 3 27B is a dense model. All 27B parameters load into VRAM, which means consistent memory usage and no MoE routing complexity. It handles images natively across the 4B, 12B, and 27B variants, and instruction following is strong across structured tasks. The straightforward deployment path (single H100, no quantization needed) makes it the lowest-friction option.

Llama 4 Scout is a 109B MoE that activates 17B parameters per forward pass. At INT4, it fits on a single H100 at ~55 GB. Its 10M token context window is orders of magnitude larger than Gemma 3's 128K, though a single H100 80GB cannot utilize the full 10M context in practice: after loading INT4 weights (~55 GB), only ~25 GB of VRAM remains for KV cache, limiting practical context to around 4K to 16K tokens per request at batch size 1. Multi-GPU deployments unlock the longer contexts. For document-heavy workloads or RAG pipelines indexing large corpora, Scout's context advantage is still decisive at medium context lengths. See our Llama 4 deployment guide and the Spheron Llama 4 guide for the full Scout setup.

Qwen 3 32B (released April 2025) is the current Qwen flagship for single-GPU text inference. At 32.8B dense parameters, model weights use approximately 66 GB in FP16. Total inference VRAM including KV cache runs to around 80 GB, so it is a tight fit on an H100 80GB and requires capping the maximum context length to leave room for KV cache. Its standout feature is a hybrid thinking mode that can be toggled per request, making it useful for math, coding, and logical reasoning tasks where step-by-step chain of thought matters. The base Qwen3-32B model is text-only. A separate Qwen3-VL-32B model (released October 2025) adds vision-language capabilities, though it requires more VRAM due to the vision encoder. For image understanding workloads without adding the VL model, Gemma 3 remains the simpler path. For the Spheron setup, see the Spheron Qwen 3 guide.

How to pick: Start with Gemma 3 27B if you want a single-GPU production deployment with multimodal capability and strong general quality. Move to Llama 4 Scout if you need context windows beyond 128K. Consider Qwen 3 32B if thinking-mode reasoning or broad multilingual support is the primary requirement and you want to stay on a single GPU.

Which Gemma 3 Variant Should You Deploy?

27B is the default choice for production inference where quality matters. It fits on a single H100 or A100 80GB in BF16, handles images, and produces the strongest outputs of the Gemma 3 family.

12B is the sweet spot for cost-sensitive APIs. At ~24 GB BF16 weights, it fits on a single L40S 48GB or A100 40GB. Quality is meaningfully better than the 4B, and cloud costs are roughly half the 27B.

4B is for high-throughput, low-latency use cases where query complexity is limited. Chatbots handling short queries, classification APIs, and routing layers are all good fits. Any GPU with 8+ GB VRAM handles it.

1B belongs on edge devices or embedded inference pipelines, not cloud GPU deployments. If you're running it on a cloud instance, the 4B is worth the marginal cost difference.

For a full breakdown of which GPU to pair with each variant, see our best NVIDIA GPUs for LLMs guide and the GPU requirements cheat sheet for 2026.


Gemma 3 is ready to deploy on Spheron today: no waitlist, no contracts. Rent an H100 or A100, set up vLLM in under 10 minutes, and run your own inference server on bare metal. See the Spheron Gemma 3 deployment guide to get started.

Get started on Spheron →
