Google's Gemini 3.1 Flash-Lite charges $0.25/M input tokens and $1.50/M output tokens. For low-volume workloads where you only pay per token, that can make sense. But for teams running RAG pipelines, batch summarization, or agentic systems with high input/output ratios, the blended API cost climbs well above the headline output rate. At 5:1 input/output, you're paying $2.75/M effective per output token. At 10:1, you're at $4.00/M. Self-hosting a small open SLM on GPU cloud brings that number to roughly $0.06-0.40/M depending on your GPU, model, and utilization. This post works through the math, the latency trade-offs, and the privacy case so you can make the call for your workload.
What Gemini 3.1 Flash-Lite Is
Gemini 3.1 Flash-Lite is the smallest, cheapest model in Google's Gemini 3.1 family. The lineup runs: Flash-Lite (lowest cost, highest throughput) → Flash (mid-tier) → Pro/Ultra (top-tier). Flash-Lite targets workloads that need high QPS at the lowest possible per-call cost: chatbots with short turns, agentic pipelines with many small API calls, batch document processing, and classification at scale.
It is API-only. The weights are not publicly available. If your prompt data is sensitive, every token you send goes through Google's infrastructure.
Current pricing (verify on Google AI Studio before budgeting):
- Input: ~$0.25/M tokens
- Output: ~$1.50/M tokens
The output rate is the headline. The input rate is where costs compound on input-heavy workloads.
The Hidden Cost at Scale: How Input Tokens Compound
Most production LLM workloads have more input tokens than output tokens. RAG pipelines fetch 2-5 retrieved chunks per query before generating an answer. Document summarization might send 2,000 input tokens to get 300 output tokens. Agentic pipelines pass tool results, memory, and conversation history as context with every call.
Three workload types with actual token math:
Batch document processing: 5M documents/month at 800 input + 200 output tokens each.
- Total input: 4B tokens. API cost: 4,000M × $0.25 = $1,000
- Total output: 1B tokens. API cost: 1,000M × $1.50 = $1,500
- Total: $2,500/month, effective cost per output token: $2.50/M (not $1.50/M)
Agentic pipelines: 50,000 agent runs/day at 5 API calls each, 600 input + 150 output tokens/call.
- Daily input: 50,000 × 5 × 600 = 150M tokens. Daily output: 37.5M tokens
- Monthly API cost: (150M × 30 × $0.25 + 37.5M × 30 × $1.50) / 1M = $1,125 + $1,688 = $2,813/month
- Effective blended cost per output token: $2,813/month ÷ (37.5M/day × 30 days) = roughly $2.50/M
RAG chat API: 10:1 input/output ratio (2,000 input tokens, 200 output tokens per turn).
- Effective blended cost = 10 × $0.25 + 1 × $1.50 = $4.00 per 1M output tokens
That consistent premium over the headline rate is the hidden cost. For a detailed breakdown of how these patterns compound at scale, see the AI Inference Cost Economics 2026 analysis covering the full cost lifecycle.
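The three calculations above follow one pattern, sketched here in Python. The function name and constants are illustrative, and the rates are the approximate ones quoted in this post, so treat the output as a planning estimate rather than a bill.

```python
INPUT_RATE = 0.25   # $ per 1M input tokens (approximate, from this post)
OUTPUT_RATE = 1.50  # $ per 1M output tokens (approximate, from this post)


def blended_cost_per_m_output(input_tokens: float, output_tokens: float) -> float:
    """Total API spend divided by the millions of output tokens produced."""
    total_cost = (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1e6
    return total_cost / (output_tokens / 1e6)


# Per-request token counts for the three workloads described above.
workloads = {
    "batch documents (800 in / 200 out)": (800, 200),
    "agentic calls (600 in / 150 out)": (600, 150),
    "RAG chat (2,000 in / 200 out)": (2000, 200),
}

for name, (tokens_in, tokens_out) in workloads.items():
    cost = blended_cost_per_m_output(tokens_in, tokens_out)
    print(f"{name}: ${cost:.2f}/M output tokens")
```

Because only the ratio matters, per-request counts give the same answer as monthly totals.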
The Self-Hostable Equivalents: Efficient Open SLMs
Gemini 3.1 Flash-Lite cannot be self-hosted. These models can.
| Model | Params (active) | VRAM (recommended precision) | Min GPU | Best workloads | License |
|---|---|---|---|---|---|
| Gemma 4 QAT 31B | 31B dense | ~18GB (w4a16) | A100 80GB | Instruction, RAG | Apache 2.0 |
| Phi-4 14B | 14B dense | ~15GB (FP8, H100) / ~28GB (BF16, A100) | A100/H100 80GB | Math, code, reasoning | MIT |
| Qwen3 8B | 8B dense | ~8GB (FP8, H100) / ~16GB (BF16, A100) | A100 80GB | Chat, classification | Apache 2.0 |
| Mistral Small 4 | ~6B active / 119B total | ~60GB (w4a16) | H100 80GB | Multilingual, instruction | Apache 2.0 |
Gemma 4 QAT 31B, Phi-4 14B, and Qwen3 8B run on a single A100 80GB or H100. Mistral Small 4 has 119B total expert weights that must reside in VRAM; at 4-bit (w4a16) that is ~60GB, which fits a single H100 80GB with limited KV cache headroom. A deployment guide is available for Gemma 4 QAT on GPU cloud.
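As a rough cross-check on the VRAM column, weight memory is approximately parameter count times bytes per parameter. The sketch below uses nominal parameter counts and an assumed 25% headroom for KV cache; both are simplifications, and real checkpoints run somewhat larger because of quantization scales and layers kept at higher precision.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate VRAM for weights alone: parameters x bytes per parameter."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: int,
         gpu_vram_gb: float, kv_headroom: float = 0.25) -> bool:
    """True if weights plus an assumed KV-cache headroom fit on one GPU."""
    needed = weight_vram_gb(params_billions, bits_per_weight) * (1 + kv_headroom)
    return needed <= gpu_vram_gb


# Nominal sizes from the table above: (model, params in billions, bits per weight).
models = [
    ("Qwen3 8B BF16", 8, 16),
    ("Phi-4 14B BF16", 14, 16),
    ("Gemma 4 QAT 31B w4a16", 31, 4),
    ("Mistral Small 4 w4a16", 119, 4),
]

for name, params, bits in models:
    gb = weight_vram_gb(params, bits)
    print(f"{name}: ~{gb:.1f} GB weights, fits 80 GB with headroom: {fits(params, bits, 80)}")
```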
One thing to evaluate before committing: run a task-specific quality eval on 100-500 representative prompts. A model that passes at 8B parameters is meaningfully cheaper to operate than one that needs 31B. The cost structure is completely different.
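A task-specific eval can be as small as a pass-rate loop. The sketch below is one way to structure it: `generate` stands in for a call to whichever model you are testing and `passes` for your own grading rule. Both are hypothetical placeholders, not part of any library.

```python
from typing import Callable, Iterable, Tuple


def pass_rate(
    cases: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    passes: Callable[[str, str], bool],
) -> float:
    """Fraction of (prompt, expected) cases whose model output passes the check."""
    results = [passes(generate(prompt), expected) for prompt, expected in cases]
    if not results:
        raise ValueError("no eval cases supplied")
    return sum(results) / len(results)


# Stub model for illustration only; replace with a real call to your endpoint.
canned_answers = {"Is 7 prime?": "yes", "Is 8 prime?": "yes"}


def fake_generate(prompt: str) -> str:
    return canned_answers[prompt]


def exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()


cases = [("Is 7 prime?", "yes"), ("Is 8 prime?", "no")]
print(f"pass rate: {pass_rate(cases, fake_generate, exact_match):.0%}")
```

Running the same cases against an 8B and a 31B model gives a direct read on whether the smaller one clears your bar.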
Apples-to-Apples Cost Model
The formula for self-hosted cost per million tokens:
CPM = (GPU $/hr) / (tokens_per_sec × 3600 / 1,000,000)
This produces cost per million tokens at 100% GPU utilization. Divide by your actual utilization rate (typically 60-75% for production servers with traffic variation) to get effective CPM. For the underlying cross-GPU, cross-model benchmark data behind these numbers, see GPU Cost Per Token Benchmarks 2026.
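The formula translates directly into a small helper. This is a minimal sketch with an illustrative function name; the example inputs are the H100 on-demand price and the Qwen3 8B FP8 throughput estimate quoted in this post, so substitute your own measured values.

```python
def cpm(gpu_usd_per_hour: float, tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Self-hosted cost per million tokens at a given GPU utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    million_tokens_per_hour = tokens_per_sec * 3600 / 1_000_000
    return gpu_usd_per_hour / million_tokens_per_hour / utilization


# H100 on-demand at $2.54/hr serving ~9,000 tok/s (estimates from this post).
print(f"100% utilization: ${cpm(2.54, 9000):.2f}/M")
print(f" 65% utilization: ${cpm(2.54, 9000, 0.65):.2f}/M")
```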
On-demand break-even table (live prices from Spheron API, fetched 29 Jun 2026):
| Config | GPU/hr (on-demand) | Model | Throughput est. | CPM (100% util) | CPM (65% util) | Flash-Lite output rate |
|---|---|---|---|---|---|---|
| 1x A100 80G SXM4 | $1.69 | Gemma 4 QAT 31B | ~1,800 tok/s | $0.26/M | $0.40/M | $1.50/M |
| 1x A100 80G PCIe | $1.48 | Phi-4 14B BF16 | ~1,800 tok/s | $0.23/M | $0.35/M | $1.50/M |
| 1x H100 SXM5 | $2.54 | Phi-4 14B FP8 | ~4,500 tok/s | $0.16/M | $0.24/M | $1.50/M |
| 1x H100 SXM5 | $2.54 | Qwen3 8B FP8 | ~9,000 tok/s | $0.08/M | $0.12/M | $1.50/M |
RTX 4090 (from ~$0.53/hr) and L40S (from ~$0.96/hr) are also available as on-demand instances on Spheron, making them viable options for smaller 8-14B models. Check current GPU pricing for live availability.
On-demand pricing puts self-hosted CPM well below Flash-Lite's output rate for all configurations. The cost advantage widens further on spot instances:
| Config | GPU/hr (spot) | Model | Throughput est. | CPM (100% util) | CPM (65% util) | Flash-Lite output rate |
|---|---|---|---|---|---|---|
| 1x A100 80G SXM4 spot | $0.82 | Phi-4 14B BF16 | ~1,800 tok/s | $0.13/M | $0.20/M | $1.50/M |
| 1x A100 80G SXM4 spot | $0.82 | Qwen3 8B BF16 | ~4,000 tok/s | $0.06/M | $0.09/M | $1.50/M |
Spot instances can be reclaimed without notice. Use them for async batch workloads (document pipelines, nightly enrichment, offline summarization) where your job can survive a restart. For real-time APIs with latency SLAs, use on-demand.
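Surviving a restart mostly means recording what has already been done. The sketch below shows one simple approach, assuming string document IDs and a local checkpoint file; `process` is a hypothetical callable that runs inference and stores the result. In practice the checkpoint should live on storage that outlasts the instance.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Set


def run_resumable(doc_ids: Iterable[str], process: Callable[[str], None],
                  checkpoint_path: str) -> Set[str]:
    """Process each document once, recording finished IDs so a restart skips them."""
    checkpoint = Path(checkpoint_path)
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    tmp = checkpoint.with_name(checkpoint.name + ".tmp")
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # finished before the previous interruption
        process(doc_id)  # hypothetical: call the model and persist its output
        done.add(doc_id)
        # Write to a temp file, then rename, so a kill mid-write cannot corrupt the checkpoint.
        tmp.write_text(json.dumps(sorted(done)))
        tmp.replace(checkpoint)
    return done
```

Rewriting the whole checkpoint after every document keeps the example short; for millions of documents, an append-only log or a database table is the more practical choice.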
The blended comparison is where the economics shift. At a 5:1 input/output ratio:
| Config | Self-hosted output CPM (65% util) | Flash-Lite blended CPM (5:1 i/o) | Ratio |
|---|---|---|---|
| A100 SXM4 spot + Qwen3 8B | $0.09/M | $2.75/M | ~31x cheaper self-hosted |
| A100 SXM4 spot + Phi-4 14B | $0.20/M | $2.75/M | ~14x cheaper self-hosted |
| A100 PCIe on-demand + Phi-4 14B | $0.35/M | $2.75/M | ~8x cheaper self-hosted |
| H100 SXM5 on-demand + Phi-4 14B | $0.24/M | $2.75/M | ~11x cheaper self-hosted |
Self-hosting has no separate per-token charge for input: you pay per GPU-hour regardless of the input/output split. Prefill still consumes GPU time, so measure throughput on your real prompt lengths, but the pricing structure favors self-hosting as input/output ratios climb.
Pricing fluctuates based on GPU availability. The prices above were fetched on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.
Latency and Throughput: Closing the Gap
Gemini Flash-Lite's API latency has two components: network round-trip to Google's serving infrastructure and queuing time under load. For single requests at low concurrency, Flash-Lite typically responds within 300-800ms TTFT, which is competitive for interactive applications.
A vLLM server running Phi-4 14B on a single A100 80G with continuous batching and PagedAttention at 20 concurrent requests achieves P50 TTFT of 80-150ms, comparable to or faster than a cached Flash-Lite response. The gap matters most at high concurrency: Flash-Lite queues requests on Google's side, and at peak traffic your P99 latency can spike significantly depending on region and quota limits. A self-hosted endpoint has predictable latency because you own the capacity.
Continuous batching is the single biggest lever for closing the latency gap between self-hosted and managed APIs. It eliminates idle GPU cycles between requests by immediately processing new tokens as soon as prior decode steps free up capacity. For a detailed breakdown of continuous batching implementation, see LLM Serving Optimization: Continuous Batching and PagedAttention.
FP8 halves the weight VRAM footprint relative to FP16/BF16 on Hopper-class hardware (H100, H200) with negligible quality regression on most classification, summarization, and generation tasks. On H100, native FP8 tensor cores roughly double throughput over FP16 without changing GPU count. A100 (Ampere) lacks hardware FP8 tensor cores; use w4a16 or bfloat16 on A100 instead. For a full explanation of precision formats and their throughput impact, see the FP8 Quantization guide.
One latency advantage of self-hosting that is often overlooked: your inference endpoint lives in the same VPC as your application. No internet hop, no load balancer at Google's edge. For latency-sensitive internal tooling, this alone can save 50-100ms per request.
Privacy, Data Residency, and Control
For many production workloads, the cost math is secondary. The data governance requirement makes self-hosting the only option.
EU GDPR and data residency: Every token you send to the Gemini API is processed on Google's infrastructure. GDPR Chapter V (Articles 44-50) restricts transfers of personal data outside the EEA unless an adequacy decision or appropriate safeguards under Article 46, such as standard contractual clauses, are in place. Self-hosting on GPU cloud within an EU region keeps all prompt and completion data within the required geography. For a full compliance checklist covering GDPR, EU AI Act obligations, and how to structure a self-hosted stack for regulatory review, see the EU AI Act compliance guide for GPU cloud.
HIPAA and healthcare: Sending PHI in prompts to a closed API requires a signed Business Associate Agreement (BAA) with the provider. Self-hosting removes the model provider from the data path, so PHI stays on an instance you control. The cloud provider hosting that instance can still qualify as a business associate under HHS guidance, so confirm BAA coverage with whoever runs your GPUs. For regulated medical workloads running inference on sensitive clinical text, self-hosting with encrypted storage is the standard approach. On Spheron GPU cloud, instances come with full root access and SSH isolation, not shared GPU tenancy.
Financial and proprietary data: Proprietary model outputs, trading signals, or customer financial data in agentic pipelines may be covered by data governance policies that prohibit third-party API transmission. For these workloads, a self-hosted endpoint with no external egress is the correct architecture regardless of what it costs per token.
Spheron does not log prompt or completion traffic. Instances are provisioned with SSH root access and no shared GPU tenancy.
Decision Framework: When Flash-Lite Wins vs Self-Hosting
| Use Flash-Lite API when | Self-host on GPU cloud when |
|---|---|
| Monthly output < 100M tokens | Monthly output > 200M tokens with 5:1+ i/o ratio |
| No dedicated ML infra team | Team can manage a GPU instance (SSH + Docker) |
| Bursty, unpredictable traffic spikes | Steady, predictable daily load |
| No data residency constraints | EU, HIPAA, or financial data in prompts |
| Need Google-specific model behavior | Open model passes your task eval |
| Rapid prototyping, pre-launch | Production serving at scale |
| Balanced or output-heavy workloads (1:1 i/o or lower) | Input-heavy workloads (5:1 or higher i/o) |
The blended break-even volume depends on your input/output ratio and GPU tier. For balanced workloads (1:1 i/o), the blended Flash-Lite cost is $1.75/M per output token, which is still more expensive per token than spot GPU self-hosting at moderate utilization, provided you have enough volume to keep the GPU busy. For RAG pipelines with 5:1 i/o, the blended API cost rises to $2.75/M, and self-hosting on spot GPU cloud becomes cheaper at volumes above 200-400M monthly output tokens depending on the GPU tier.
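The break-even volume can be estimated by dividing a month of GPU rent by the blended API rate. The sketch below assumes one always-on instance (730 hours per month) and does not check whether that single GPU can actually sustain the volume; the prices are the A100 figures quoted in this post.

```python
HOURS_PER_MONTH = 730  # assumption: average month, instance running 24/7


def breakeven_output_tokens_m(gpu_usd_per_hour: float, io_ratio: float,
                              input_rate: float = 0.25,
                              output_rate: float = 1.50) -> float:
    """Monthly output volume (millions of tokens) where API spend equals GPU rent."""
    gpu_monthly_usd = gpu_usd_per_hour * HOURS_PER_MONTH
    blended_usd_per_m_output = io_ratio * input_rate + output_rate
    return gpu_monthly_usd / blended_usd_per_m_output


for label, price in [("A100 SXM4 spot", 0.82), ("A100 PCIe on-demand", 1.48)]:
    volume = breakeven_output_tokens_m(price, io_ratio=5)
    print(f"{label}: break-even near {volume:.0f}M output tokens/month at 5:1")
```

Below the break-even volume the GPU sits idle often enough that per-token API billing is cheaper; above it, each additional token is effectively free until the instance saturates.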
If your workload sends sensitive data, the framework above collapses to one rule: self-host. The cost difference is irrelevant when the alternative is sending regulated data to a third-party API.
Deployment Quickstart on Spheron
1. Provision a GPU instance
Go to app.spheron.ai and select:
- A100 80GB (on-demand or spot): good for Phi-4 14B (BF16, ~28GB) or Gemma 4 QAT 31B (compressed-tensors, ~18GB)
- H100 SXM5 (on-demand or spot): best for Qwen3 8B FP8 at maximum throughput, or Phi-4 14B FP8
For SSH key setup, follow the Spheron SSH connection guide.
2. Install vLLM and serve
pip install "vllm>=0.9.0"
Note: the commands below use --quantization fp8 for maximum throughput on H100 and newer. On A100 (Ampere), replace --quantization fp8 with --dtype bfloat16. A100 lacks FP8 tensor cores, so FP8 is not hardware-accelerated there.
For Qwen3 8B on H100 (FP8) or A100 (swap --quantization fp8 for --dtype bfloat16):
vllm serve Qwen/Qwen3-8B \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000
For Phi-4 14B on H100 (FP8) or A100 (swap --quantization fp8 for --dtype bfloat16):
vllm serve microsoft/phi-4 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000
For Gemma 4 QAT 31B on A100 or H100:
vllm serve google/gemma-4-31b-it-qat \
  --quantization compressed-tensors \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000
3. Point your existing SDK at the self-hosted endpoint
Two lines change. Everything else stays the same:
from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_INSTANCE_IP:8000/v1",
    api_key="not-required",  # any non-empty string
)
All client.chat.completions.create() calls work without further changes. The vLLM server exposes an OpenAI-compatible REST API.
4. Benchmark and measure CPM
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-8B \
  --dataset-name random \
  --random-input-len 800 \
  --random-output-len 200 \
  --request-rate 20 \
  --num-prompts 500
The random dataset with 800 input and 200 output tokens mirrors the batch document workload above; adjust the lengths to your own prompt shape. Divide GPU $/hr by (measured tokens/sec × 0.0036) to get your actual CPM. Compare that against your blended Flash-Lite cost (input/output split × respective rates). If self-hosted CPM is lower, you've found your break-even.
For async batch workloads like nightly document pipelines where you can tolerate spot instance interruption, see Batch LLM Inference on GPU Cloud for the offline inference setup with vLLM. Spot GPU instances cut effective hourly cost by 40-60%, which pushes CPM well below Flash-Lite's blended rate even for larger models.
Flash-Lite API charges $0.25/M input and $1.50/M output. Input-heavy workloads pay $2.75-4.00/M effective per output token once input billing compounds. Above 200-400M monthly output tokens with a 5:1 or higher input/output ratio, self-hosting on Spheron's A100 or H100 instances with a small open SLM runs at roughly $0.09-0.20/M on A100 spot at 65% utilization. And if GDPR, HIPAA, or financial data policies apply to your prompts, self-hosting is the only viable option regardless of the math.
Quick Setup Guide
Pull your last 30 days of API usage logs. Count input and output tokens separately. Your input/output ratio is critical: if you have 5x more input tokens than output tokens (common in RAG pipelines), your effective blended cost per output token is $2.75/M, not $1.50/M. This is the number to compare against self-hosted CPM.
Match model to your quality bar. For summarization, classification, RAG: Phi-4 14B or Qwen3 8B cover most cases. For longer generation and instruction tasks: Gemma 4 QAT 31B or Mistral Small 4. Run 100-500 representative samples through your eval before committing. A model that passes at 8B saves significant GPU cost versus one that needs 31B.
Phi-4 14B needs ~15GB VRAM at FP8 on H100, or ~28GB at BF16 on A100. Qwen3 8B needs ~8GB at FP8 on H100, or ~16GB at BF16 on A100. Gemma 4 QAT 31B in w4a16 format needs ~18GB (A100 or H100). Note: FP8 is only hardware-accelerated on Hopper-class GPUs (H100, H200) and later; use BF16 on A100 (Phi-4 14B and Qwen3 8B both fit comfortably in A100 80GB at BF16; Mistral Small 4 does not and needs w4a16). Add 20-30% headroom for KV cache. Go to app.spheron.ai and select the cheapest GPU tier where your model plus cache fits.
SSH into your Spheron instance. Run: vllm serve <model-id> --quantization fp8 --max-model-len 32768 --gpu-memory-utilization 0.90 --host 0.0.0.0 --port 8000 (on A100, which lacks FP8 tensor cores, use --dtype bfloat16 instead of --quantization fp8). Benchmark at expected concurrency: vllm bench serve --backend vllm --model <model-id> --dataset-name random --request-rate 20 --num-prompts 500. Divide GPU $/hr by (tokens/sec × 0.0036) to get cost per million tokens.
Compare your self-hosted CPM against the blended Flash-Lite cost (not just output). For a RAG workload with 5:1 input/output ratio: blended API cost = 5 × $0.25 + 1 × $1.50 = $2.75 per 1M output tokens. If your self-hosted CPM is below that at your projected utilization, you save money. Also factor data residency requirements: if GDPR or HIPAA apply to your prompts, self-hosting is mandatory regardless of the math.
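The first and last steps of this guide can be combined into one pass over your usage logs. The sketch below assumes each log record is a dict with `input_tokens` and `output_tokens` keys, which is a hypothetical schema; adapt the field names to whatever your billing export actually provides.

```python
from typing import Dict, Iterable


def summarize_usage(records: Iterable[Dict[str, int]],
                    input_rate: float = 0.25,
                    output_rate: float = 1.50) -> Dict[str, float]:
    """Total the log window and derive the i/o ratio and blended $/M output tokens."""
    rows = list(records)
    total_in = sum(r["input_tokens"] for r in rows)
    total_out = sum(r["output_tokens"] for r in rows)
    if total_out == 0:
        raise ValueError("no output tokens in the log window")
    ratio = total_in / total_out
    return {
        "input_tokens": total_in,
        "output_tokens": total_out,
        "io_ratio": ratio,
        "blended_usd_per_m_output": ratio * input_rate + output_rate,
    }


# Two illustrative records; real input would be 30 days of requests.
sample = [
    {"input_tokens": 2000, "output_tokens": 400},
    {"input_tokens": 3000, "output_tokens": 600},
]
summary = summarize_usage(sample)
print(f"i/o ratio {summary['io_ratio']:.1f}:1, "
      f"blended ${summary['blended_usd_per_m_output']:.2f}/M output")
```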
Frequently Asked Questions
Gemini 3.1 Flash-Lite is Google's lowest-cost, high-throughput endpoint in the Gemini 3.1 family. For low token volumes it is hard to beat since you only pay per token with no fixed GPU cost. Above roughly 200-400M output tokens per month, depending on your input/output ratio, a self-hosted open SLM (Gemma 4 QAT, Phi-4 14B, Qwen3 8B) on GPU cloud becomes cheaper per token, while also giving you data residency control that closed APIs cannot provide.
Google prices Gemini 3.1 Flash-Lite at approximately $0.25/M input tokens and $1.50/M output tokens as of mid-2026 (verify current rates on Google AI Studio or the Vertex AI pricing page before budgeting). For input-heavy workloads with a 5:1 or 10:1 input/output ratio, the effective blended cost per output token climbs to $2.75-$4.00/M, which is where self-hosting wins decisively.
The closest self-hostable quality matches are Gemma 4 QAT 31B (4-bit precision, ~18GB VRAM, from the same Google family), Phi-4 14B (Microsoft, single H100 at FP8 or A100 80GB at BF16), Qwen3 8B (Alibaba, fits on an A100 80GB at BF16), and Mistral Small 4 (MoE, ~6B active params but 119B total expert weights, needs ~60GB at w4a16). Run a task-specific eval on representative samples before committing to a model size.
Gemini 3.1 Flash-Lite cannot be self-hosted. It is a closed, API-only model, and the weights are not publicly available. Self-hosting requires open-weight models. The models in this guide (Gemma 4 QAT, Phi-4, Qwen3, Mistral Small 4) are all open-weight and can be deployed on your own GPU cloud instances.
The blended break-even depends on your input/output ratio and GPU utilization. For output-only workloads, Flash-Lite's $1.50/M rate is still more expensive than self-hosted small models at moderate utilization. For RAG and summarization workloads with 5:1 or higher input/output ratios, the effective blended API cost reaches $2.75-$4.00/M, and self-hosting on spot GPU instances with small models (Qwen3 8B) brings CPM to $0.06-0.09/M. Beyond cost, data residency requirements often make self-hosting the only viable option regardless of the math.
