Tutorial

Deploy MiMo-V2-Flash on GPU Cloud: Xiaomi's 309B MoE Model Setup Guide (2026)

Written by Mitrasish, Co-founder · Apr 8, 2026
Tags: MiMo-V2-Flash · GPU Cloud · MoE Models · vLLM · Reasoning Models · LLM Deployment

Xiaomi released MiMo-V2-Flash in early 2026 as part of a wave of Chinese open-source reasoning models following DeepSeek. It is a 309B parameter MoE model with 15B active parameters per token and a 256K context window, built specifically for mathematical reasoning and coding. If you are familiar with MoE inference patterns or have already worked through the DeepSeek R2 deployment guide, the deployment approach here is similar but with some key differences around the hybrid thinking mode and memory planning at the 256K context scale.

What Is MiMo-V2-Flash

MiMo-V2-Flash is Xiaomi's reasoning-focused MoE model. The architecture looks like this: 309B total parameters, with a sparse routing mechanism that activates only 15B of them per forward pass. The "Flash" designation signals a design optimized for inference speed relative to model size.

Three things make it distinctive for deployment:

256K context window. Most 300B+ MoE models ship with 128K or shorter context. 256K means longer document ingestion, multi-turn conversation, and complex multi-step reasoning chains without truncation. It also means significantly higher VRAM requirements for KV cache at long contexts.

Hybrid thinking mode. MiMo-V2-Flash supports two distinct inference modes. Direct response mode skips chain-of-thought and answers immediately, suitable for factual Q&A and classification. Chain-of-thought mode generates an extended internal reasoning trace before responding, useful for math problems, multi-step coding, and logical deduction. You switch between them via system prompt or a model parameter, giving you control over the cost and latency tradeoff per request.

Reasoning-first training. Xiaomi trained MiMo-V2-Flash with reinforcement learning focused on mathematical and coding problems, similar to DeepSeek's R-series approach. This differs from general instruction-tuning: the model is optimized specifically for correctness on structured problem types rather than broad conversational fluency.

Model comparison:

| Model | Total Params | Active Params | Context | Thinking Mode |
|---|---|---|---|---|
| MiMo-V2-Flash | 309B | 15B | 256K | Hybrid (direct + CoT) |
| DeepSeek R2 | ~1.2T (est.) | ~78B (est.) | 128K | Reasoning |
| Qwen 3 235B | 235B | ~22B | 131K | Thinking |

Note on model availability: The Hugging Face repository for MiMo-V2-Flash may be listed under XiaomiMiMo/MiMo-V2-Flash or a text-only variant. Verify the exact repository slug on Hugging Face before downloading, as naming conventions for new model releases sometimes change in the first weeks after launch.

GPU Hardware Requirements

With MoE models, you pay the VRAM cost for all expert weights regardless of how many activate per token. The router loads everything into memory and then picks which experts to run. For 309B parameters:

| Precision | Weight VRAM | Minimum GPU config (weights + activation headroom) |
|---|---|---|
| BF16/FP16 | ~618 GB | 10x H100 80GB (800 GB total) |
| FP8 | ~309 GB | 4x H100 80GB (320 GB) or 3x H200 141GB (423 GB) |
| INT4 | ~155 GB | 2x H200 141GB (282 GB) or 2x B200 192GB (384 GB) |

These figures include headroom for activations. For weights alone at FP16, 8x H100 80GB offers 640 GB raw, but the recommended --gpu-memory-utilization 0.90 vLLM setting caps the effective allocation at 576 GB (640 × 0.90). The FP16 weights (~618 GB) exceed that cap, so vLLM OOMs during weight initialization before any KV cache is allocated: 8x H100 cannot run FP16 under the recommended settings. Plan on 10x H100 (800 GB) or 5x H200 141GB (705 GB) for a real FP16 deployment. You also need headroom for:

  • Activation memory: 10-15% overhead above weight VRAM is typical, pushing the FP16 minimum to around 680 GB before any KV cache allocation.
  • KV cache: This is where the 256K context window bites. At FP16, a 32K context sequence might consume 2-4 GB of KV cache per request. At 256K context, that grows 8x. At lower precisions or with FP8 KV cache quantization, you can recover half of that.

Practical guidance on context length: Start with --max-model-len 32768 and verify stability before increasing. Jumping directly to 256K without profiling KV cache headroom first will result in OOM errors. Each GPU config has a practical ceiling for simultaneous context based on VRAM available after weights.
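To size this before provisioning, a rough per-request KV cache estimate helps. The sketch below uses the standard 2 × layers × KV-heads × head-dim × dtype-bytes formula; the layer and head counts are placeholder assumptions (this guide does not cover MiMo-V2-Flash's exact attention config), so substitute the real values from the model's config.json before trusting the numbers:

```python
# Rough KV cache estimator for planning context-length ceilings.
# num_layers / num_kv_heads / head_dim below are ASSUMED placeholders,
# not published MiMo-V2-Flash architecture numbers.

def kv_cache_gb(context_len, num_layers=48, num_kv_heads=4,
                head_dim=128, dtype_bytes=2):
    """Bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype * tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * context_len / 1024**3

for ctx in (32_768, 131_072, 262_144):
    fp16 = kv_cache_gb(ctx)                 # FP16 KV cache
    fp8 = kv_cache_gb(ctx, dtype_bytes=1)   # with --kv-cache-dtype fp8
    print(f"{ctx:>7} tokens: {fp16:5.1f} GB fp16 | {fp8:5.1f} GB fp8")
```

With these placeholder values a 32K request lands at ~3 GB of FP16 KV cache, inside the 2-4 GB range quoted above, and 256K is 8x that, which is why the jump needs profiling first.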

Memory bandwidth matters more than compute for MoE. Expert routing is bandwidth-bound: the router picks experts, those expert weights are read from HBM, and the computation runs. With only 15B active parameters, the compute per token is modest, but the bandwidth required to load the right expert weights on every forward pass is not. H100 SXM5 (3.35 TB/s HBM bandwidth) and H200 SXM5 (4.8 TB/s) handle this well. PCIe-attached H100s run at lower effective bandwidth and will show throughput degradation at high concurrency.
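To see why bandwidth dominates, a back-of-envelope ceiling: at batch size 1, every decoded token must stream the ~15B active parameters out of HBM at least once. The sketch below computes that upper bound only; real throughput is far lower once routing, attention, KV reads, and inter-GPU communication are counted, and batching amortizes the weight reads across requests:

```python
# Bandwidth-bound decode ceiling for batch-1 generation: aggregate
# HBM bandwidth divided by the bytes of active weights read per token.
# Treat this as an optimistic upper bound, not a throughput prediction.

def decode_ceiling_tok_s(active_params_b, bytes_per_param, num_gpus, hbm_tb_s):
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    aggregate_bw = num_gpus * hbm_tb_s * 1e12
    return aggregate_bw / bytes_per_token

# 8x H100 SXM5 (3.35 TB/s each), FP8 weights (1 byte/param)
print(round(decode_ceiling_tok_s(15, 1, 8, 3.35)))  # → 1787
```

Halving the bytes per parameter (FP8 vs FP16) doubles this ceiling, which is another reason quantization pays off disproportionately on MoE models.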

Recommended configurations on Spheron with live pricing (fetched 08 Apr 2026):

| Configuration | GPUs | VRAM | On-Demand ($/hr) | Spot ($/hr) |
|---|---|---|---|---|
| 8x H100 SXM5 80GB (FP8 only†) | 8 | 640 GB | ~$23.23 | ~$6.41 |
| 4x H100 SXM5 80GB (FP8) | 4 | 320 GB | ~$11.62 | ~$3.21 |
| 3x H200 SXM5 141GB | 3 | 423 GB | ~$13.50 | ~$3.57 |
| 2x B200 SXM6 192GB | 2 | 384 GB | N/A | ~$4.12 |

†8x H100 80GB cannot run FP16 weights under the recommended --gpu-memory-utilization 0.90 setting: the effective VRAM cap is 576 GB (640 × 0.90), which is less than the 618 GB required, causing OOM during model load. Use FP8 quantization with this config; for full FP16, provision 10x H100 or 5x H200.

Pricing fluctuates based on GPU availability. The prices above are based on 08 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Step-by-Step Deployment with vLLM on GPU Cloud

Step 1: Provision Your GPU Cluster on Spheron

Go to app.spheron.ai and provision a multi-GPU instance. For production deployment:

  • Primary recommendation for FP16: 10x H100 SXM5 (800 GB) or 5x H200 SXM5 (705 GB) to accommodate weights plus activation and KV cache headroom.
  • Cost-optimized: 4x H100 SXM5 with FP8 quantization. This halves the hourly cost with under 2% accuracy loss on math and coding benchmarks.
  • Instance config: Ubuntu 22.04, CUDA 12.1+, NVMe storage for weight files. You need at least 700 GB of free NVMe for the ~618 GB of BF16 weights plus the operating system.

NVLink interconnect is important here. MoE expert routing requires all-to-all communication between GPUs on every forward pass. NVLink at 900 GB/s bidirectional is significantly faster than PCIe, which shows up directly in per-request latency.

Step 2: Install vLLM and Dependencies

```bash
pip install "vllm>=0.8.0" transformers accelerate
nvidia-smi  # verify GPU visibility
python -c "import torch; print(torch.cuda.device_count())"
```

For expert parallelism in vLLM, use the --enable-expert-parallel flag rather than a size argument. When enabled, vLLM calculates EP size automatically as EP_SIZE = TP_SIZE x DP_SIZE. If your vLLM version does not support this flag, tensor parallelism alone covers the same use case with slightly higher cross-GPU communication overhead.

Step 3: Download Model Weights

```bash
pip install huggingface_hub
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash \
  --local-dir ./mimo-v2-flash \
  --local-dir-use-symlinks False
```

The BF16 weights are approximately 618GB. Use --resume-download if the connection drops mid-transfer. Verify the exact repository slug on Hugging Face before running; the naming convention may have changed since this guide was written.
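Before launching vLLM, it is worth confirming the shards on disk actually add up to the expected footprint, since a partial download fails at load time with a less obvious error. A minimal stdlib-only check (the directory path matches the download command above):

```python
# Sanity check: sum the downloaded safetensors shards and compare
# against the expected ~618 GB BF16 footprint before starting vLLM.
from pathlib import Path

def total_weight_gb(model_dir, pattern="*.safetensors"):
    return sum(f.stat().st_size for f in Path(model_dir).glob(pattern)) / 1024**3

size = total_weight_gb("./mimo-v2-flash")
print(f"downloaded: {size:.1f} GB")
if size < 600:  # well under ~618 GB -> shards likely missing
    print("WARNING: download looks incomplete, re-run the download command")
```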

Step 4: Launch vLLM Server

FP8 on 8x H100 (recommended for 8-GPU config):

```bash
vllm serve ./mimo-v2-flash \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --served-model-name mimo-v2-flash \
  --host 0.0.0.0 \
  --port 8000
```

FP8 on 4x H100 (cost-optimized):

```bash
vllm serve ./mimo-v2-flash \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 16384 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.97 \
  --served-model-name mimo-v2-flash \
  --host 0.0.0.0 \
  --port 8000
```

Note the lower --max-model-len on the 4x H100 FP8 config. The 4x H100 setup is at its absolute memory limit: 320 GB total VRAM, but --gpu-memory-utilization 0.97 caps vLLM's allocation at 320 × 0.97 = ~310.4 GB. With ~309 GB consumed by FP8 weights, that leaves only ~1.4 GB total across all four GPUs for KV cache. Setting --gpu-memory-utilization below 0.97 will cause an out-of-memory error during model loading. With only ~1.4 GB of KV cache headroom, even a 16K context request is tight and may OOM during prefill on a large MoE model. Keep --max-model-len at 16K or lower and monitor GPU memory closely. Do not increase context length without first confirming available memory via GPU monitoring.

For vLLM with mixed TP + EP on 8 GPUs (TP_SIZE=4, DP_SIZE=2, so EP_SIZE = 4 x 2 = 8):

```bash
vllm serve ./mimo-v2-flash \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --quantization fp8 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --served-model-name mimo-v2-flash \
  --host 0.0.0.0 \
  --port 8000
```

Step 5: Test the Endpoint

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[
        {"role": "user", "content": "Solve: If f(x) = x^2 + 3x + 2, find all roots."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

A working MiMo-V2-Flash instance should return a complete solution showing x = -1 and x = -2, with intermediate steps if chain-of-thought mode is active. If you get truncated output or no thinking tokens, check that --max-model-len is large enough for the reasoning trace.

Step 6: Verify and Monitor

```bash
# Check GPU memory allocation after model load
nvidia-smi

# Monitor throughput via vLLM metrics
curl http://localhost:8000/metrics | grep vllm_

# Check model is serving
curl http://localhost:8000/health
```

MiMo-V2-Flash vs DeepSeek R2 vs Qwen 3: Benchmark Comparison

Public third-party benchmarks for MiMo-V2-Flash are limited given how recently it launched. The table below combines figures from Xiaomi's model card with projections based on architecture:

| Benchmark | MiMo-V2-Flash | DeepSeek R2 (est.) | Qwen 3 235B |
|---|---|---|---|
| MATH | 71.0% (base, 4-shot) | ~75%* | ~85% |
| HumanEval+ | 70.7% (base) | ~75%* | ~80% |
| MMLU | 86.7% (base, 5-shot) | ~85%* | ~84% |
| GPQA-Diamond | 55.1% (base) / 83.7% (post-train) | ~68%* | ~65% |
| Context window | 256K | 128K | 131K |

*DeepSeek R2 figures are provisional estimates based on pre-release architecture information. Verify against official technical report. MiMo-V2-Flash scores are from the base model evaluation unless noted; post-training reasoning variant shows substantially higher GPQA-Diamond scores.

Performance per dollar is where MiMo-V2-Flash's 15B active parameters matter. The compute cost per token is roughly equivalent to a 15B dense model. DeepSeek R2 with ~78B active parameters costs roughly 5x more compute per token at the same inference rate. For workloads where you care about throughput per dollar and can accept slightly lower MATH accuracy, MiMo-V2-Flash is worth benchmarking against R2 before committing to the larger cluster.

For a comparison of inference frameworks for serving these models, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Optimizing MoE Inference: Expert Parallelism and Memory Planning

Expert parallelism vs tensor parallelism. With MiMo-V2-Flash on 8 GPUs, you have options:

  • Pure tensor parallelism (--tensor-parallel-size 8): Every GPU participates in every forward pass. All-to-all communication happens at the attention layer. Low latency for interactive serving.
  • Mixed TP + EP (--tensor-parallel-size 4 --data-parallel-size 2 --enable-expert-parallel): Expert layers are sharded across EP groups (EP_SIZE = TP_SIZE x DP_SIZE = 4 x 2 = 8, using all 8 GPUs), reducing per-layer communication. Better throughput for batch jobs but slightly higher TTFT.
  • Pure expert parallelism (--tensor-parallel-size 1 --enable-expert-parallel): Each GPU owns specific experts. High communication for attention layers. Not recommended for latency-sensitive use.

For interactive serving, pure tensor parallelism is the safe default. For offline batch reasoning jobs where you are processing documents or running evaluations, mixed TP + EP can improve throughput.

KV cache planning for long context. The 256K context window is the most memory-intensive aspect of MiMo-V2-Flash. At FP16, a single 256K-length request can consume 40-60 GB of KV cache depending on model depth and attention head configuration. That exceeds a single H100 80GB's available KV headroom after weights. Practical limits by config:

  • 8x H100 FP8: usable KV headroom ~267 GB total (~33 GB per GPU) after FP8 weights (~309 GB), given --gpu-memory-utilization 0.90 caps vLLM's allocation at 576 GB (640 × 0.90). 64K-128K+ contexts per request are feasible. Note: running FP16 weights on 8x H100 is not viable. The FP16 weights (618 GB) exceed the 576 GB utilization cap (640 × 0.90), causing OOM during model load before any KV cache is allocated.
  • 4x H100 FP8: ~1.4 GB total KV headroom (320 × 0.97 - 309 GB FP8 weights); keep --max-model-len at 16K or lower. 32K context is not viable with this config.
  • 3x H200 FP8: more headroom; 64K context is feasible per request.
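The bullet arithmetic above is easy to reproduce for other configs. Note the sketch ignores activation overhead, so real headroom is somewhat lower than it reports:

```python
# vLLM's allocatable pool is total VRAM * --gpu-memory-utilization;
# whatever remains after weights is shared by the KV cache.

def kv_headroom_gb(total_vram_gb, gpu_mem_util, weight_gb):
    return total_vram_gb * gpu_mem_util - weight_gb

configs = {  # (total VRAM GB, --gpu-memory-utilization, FP8 weight GB)
    "8x H100 FP8": (640, 0.90, 309),
    "4x H100 FP8": (320, 0.97, 309),
    "3x H200 FP8": (423, 0.90, 309),
}
for name, (vram, util, weights) in configs.items():
    print(f"{name}: ~{kv_headroom_gb(vram, util, weights):.1f} GB KV headroom")
```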

FP8 KV cache. Add --kv-cache-dtype fp8 to halve KV memory with negligible quality impact. This is worth enabling for any config where you want longer effective context or higher concurrent request count.

For deeper treatment of KV cache optimization, see the KV cache optimization guide. For the vLLM-specific techniques that make high-concurrency serving practical (continuous batching and paged attention), see the LLM serving optimization guide.

Enabling Hybrid Thinking Mode

MiMo-V2-Flash's hybrid thinking mode is the most user-visible architectural feature. The two modes:

Direct response mode: The model answers immediately without generating a reasoning trace. Faster, lower token count, lower cost. Use for factual Q&A, classification, summarization, and any task where the answer is short and the question is unambiguous.

Chain-of-thought mode: The model generates an internal reasoning trace before producing the final answer. Slower, higher token count, higher cost. Necessary for math problems, multi-step coding, logical deduction, and anything requiring structured intermediate reasoning.

System prompt approach (works with any serving framework):

For direct mode:

System: Answer directly and concisely without showing your reasoning process.

For chain-of-thought mode:

System: Think through this problem step by step before giving your final answer.

API parameter approach (if supported by your vLLM version):

```python
# Direct mode
response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"thinking_mode": "direct"},
    max_tokens=100,
)

# Chain-of-thought mode
response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"thinking_mode": "chain_of_thought"},
    max_tokens=2048,
)
```

Note on the API parameter name: The exact parameter name for thinking_mode may differ depending on how the model was trained and what vLLM version you are running. The system prompt approach is the most reliable fallback since it works regardless of framework version.

Token cost comparison. Chain-of-thought mode generates 5-10x more tokens than direct mode on the same question. A math problem that takes 50 tokens to state might require 2,000 thinking tokens and 200 response tokens in CoT mode, versus 80 tokens total in direct mode. If you are serving mixed workloads, route simple queries to direct mode and complex reasoning to CoT. This is the same pattern covered in the reasoning model inference cost guide.
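The routing pattern can be sketched as below. The keyword heuristic is a deliberately naive placeholder (a production router would use a small classifier model), and the system-prompt strings are the ones from earlier in this section:

```python
# Illustrative request router: send "reasoning-shaped" queries to
# chain-of-thought mode and everything else to direct mode. The
# REASONING_HINTS list is a toy heuristic, not a real classifier.

REASONING_HINTS = ("prove", "solve", "derive", "step by step", "debug", "why does")

def pick_mode(user_message: str) -> str:
    text = user_message.lower()
    return "chain_of_thought" if any(h in text for h in REASONING_HINTS) else "direct"

def system_prompt(mode: str) -> str:
    # System-prompt control works regardless of serving framework version.
    if mode == "chain_of_thought":
        return "Think through this problem step by step before giving your final answer."
    return "Answer directly and concisely without showing your reasoning process."

print(pick_mode("What is the capital of France?"))    # → direct
print(pick_mode("Prove that sqrt(2) is irrational."))  # → chain_of_thought
```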

Cost Analysis: Spheron vs Self-Hosted vs API

Live pricing table (fetched from Spheron GPU API, 08 Apr 2026):

| Configuration | GPUs | VRAM | On-Demand ($/hr) | Spot ($/hr) | Monthly On-Demand | Monthly Spot |
|---|---|---|---|---|---|---|
| 8x H100 SXM5 80GB (FP8 only†) | 8 | 640 GB | ~$23.23 | ~$6.41 | ~$16,726 | ~$4,615 |
| 4x H100 SXM5 80GB (FP8) | 4 | 320 GB | ~$11.62 | ~$3.21 | ~$8,366 | ~$2,311 |
| 3x H200 SXM5 141GB | 3 | 423 GB | ~$13.50 | ~$3.57 | ~$9,720 | ~$2,570 |
| 2x B200 SXM6 192GB | 2 | 384 GB | N/A | ~$4.12 | N/A | ~$2,966 |

Pricing fluctuates based on GPU availability. The prices above are based on 08 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

vs self-hosted on-premise. An 8x H100 SXM5 server runs $250K-$350K in hardware at current GPU prices. Add rack space, power (a server drawing 10+ kW costs $8-12K/year in electricity at $0.10/kWh), cooling, networking, and maintenance. The total 3-year cost of ownership for an 8x H100 server is typically $400K-$500K. At $23.23/hr on Spheron on-demand, you reach $400K after roughly 17,200 hours of runtime, or about 2 years at 24/7 usage. The cloud breaks even faster if you have variable workloads, since you only pay while running.

vs API pricing. No public MiMo-V2-Flash API is available from Xiaomi or third parties at the time of writing. The closest reference points are DeepSeek API ($0.55-2.19/M tokens) and OpenAI o3 ($2-8/M tokens). At Spheron H100 spot pricing of $3.21/hr for 4x H100 FP8, and a practical throughput of 800-1,200 tokens/sec, you generate roughly 70-104M tokens per 24-hour day. The per-million-token cost ranges from ~$1.11/M at 800 tokens/sec down to ~$0.74/M at 1,200 tokens/sec, cheaper than most API options for high-volume workloads.
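The per-million-token arithmetic above, as a reusable helper for plugging in your own measured throughput:

```python
# Cost per million generated tokens for a rented GPU config running 24/7.

def cost_per_million_tokens(hourly_usd, tokens_per_sec):
    daily_cost = hourly_usd * 24
    daily_tokens_m = tokens_per_sec * 86_400 / 1e6
    return daily_cost / daily_tokens_m

# 4x H100 FP8 at spot pricing, across the quoted throughput range
for tps in (800, 1_200):
    print(f"{tps} tok/s: ${cost_per_million_tokens(3.21, tps):.2f}/M tokens")
```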

MoE cost efficiency argument. The key differentiator for MiMo-V2-Flash is that 15B active parameters means each token costs roughly 15B-equivalent compute, not 309B. Compared to a dense 70B model, MiMo-V2-Flash runs more expensive hardware (needs more GPUs for weight storage) but processes each token faster. The net result is that for high-quality reasoning tasks, the cost per correct answer is competitive with much smaller dense models. For more on reasoning model cost patterns, see reducing reasoning model inference costs.

Production Checklist: Scaling, Monitoring, and API Configuration

Before going live with MiMo-V2-Flash:

  • Health check endpoint: Verify GET /health returns 200 before routing traffic. Add this to your load balancer's health check config.
  • Request timeouts: CoT mode generates long token sequences. Set longer timeouts for reasoning requests (120-300 seconds) versus direct mode requests (10-30 seconds). A mixed-mode API needs per-request timeout logic.
  • Autoscaling policy: Scale on queue depth or GPU memory utilization, not CPU. Reasoning model requests are GPU-bound, not CPU-bound.
  • Prometheus metrics: vLLM exposes /metrics with token throughput, request queue depth, KV cache utilization, and per-request latency. Wire these into Grafana before you go live.
  • Token cost attribution: Log thinking tokens separately from output tokens. CoT mode thinking tokens cost the same compute as visible tokens but users do not see them. Attribution matters for billing and capacity planning.
  • Load balancing: MiMo-V2-Flash serving is stateless. Round-robin load balancing across multiple vLLM instances works correctly. No sticky sessions needed.
  • Rate limiting: Apply at the API gateway level. MoE models are more sensitive to concurrent request spikes than dense models because VRAM is always fully allocated for weights.
  • Model version pinning: Pin the exact model commit hash in your deployment config. Model repositories sometimes update weights without a version bump.
  • CoT mode fallback: If P99 latency for chain-of-thought requests exceeds your SLA, consider routing to direct mode as a fallback. A simple query classifier upstream can make this decision automatically.
  • Expert routing overhead at scale: At high concurrency (50+ simultaneous long reasoning requests or 10+ requests/second sustained), expert routing adds latency overhead compared to dense models. The all-to-all GPU communication on each MoE forward pass can become the bottleneck, not compute. Profile under realistic load before finalizing your scaling target.
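The per-request timeout item above can be implemented as a small lookup. The with_options pattern in the comment is one way to apply it with the OpenAI client used earlier; treat that exact call as an assumption to verify against your client version:

```python
# Per-mode request deadlines from the checklist: 10-30 s for direct
# responses, 120-300 s for chain-of-thought. Values here pick the
# upper end of each range.

def request_timeout(mode: str) -> int:
    return {"direct": 30, "chain_of_thought": 300}[mode]

# Assumed usage with the OpenAI Python client from Step 5:
# client.with_options(timeout=request_timeout(mode)).chat.completions.create(...)
print(request_timeout("chain_of_thought"))  # → 300
```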

For Spheron-specific deployment configuration and monitoring setup, refer to the documentation available through your Spheron dashboard.


MiMo-V2-Flash's MoE design means you get 309B parameter knowledge at 15B parameter compute cost. Spheron's multi-GPU H100 and H200 configurations are sized for exactly this kind of large MoE workload.

Rent H100 80GB → | Rent H200 → | View all pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.