Tutorial

Deploy MiMo-V2-Flash on GPU Cloud: Xiaomi's 309B MoE Model Setup Guide (2026)

Written by Mitrasish, Co-founder · Apr 8, 2026
Tags: MiMo-V2-Flash · GPU Cloud · MoE Models · vLLM · Reasoning Models · LLM Deployment

Xiaomi released MiMo-V2-Flash in early 2026 as part of a wave of Chinese open-source reasoning models following DeepSeek. It is a 309B parameter MoE model with 15B active parameters per token and a 256K context window, built specifically for mathematical reasoning and coding. If you are familiar with MoE inference patterns or have already worked through the DeepSeek R2 deployment guide, the deployment approach here is similar but with some key differences around the hybrid thinking mode and memory planning at the 256K context scale.

What Is MiMo-V2-Flash

MiMo-V2-Flash is Xiaomi's reasoning-focused MoE model. The architecture looks like this: 309B total parameters, with a sparse routing mechanism that activates only 15B of them per forward pass. The "Flash" designation signals a design optimized for inference speed relative to model size.

Three things make it distinctive for deployment:

256K context window. Most 300B+ MoE models ship with 128K or shorter context. 256K means longer document ingestion, multi-turn conversation, and complex multi-step reasoning chains without truncation. It also means significantly higher VRAM requirements for KV cache at long contexts.

Hybrid thinking mode. MiMo-V2-Flash supports two distinct inference modes. Direct response mode skips chain-of-thought and answers immediately, suitable for factual Q&A and classification. Chain-of-thought mode generates an extended internal reasoning trace before responding, useful for math problems, multi-step coding, and logical deduction. You switch between them via system prompt or a model parameter, giving you control over the cost and latency tradeoff per request.

Reasoning-first training. Xiaomi trained MiMo-V2-Flash with reinforcement learning focused on mathematical and coding problems, similar to DeepSeek's R-series approach. This differs from general instruction-tuning: the model is optimized specifically for correctness on structured problem types rather than broad conversational fluency.

Model comparison:

| Model | Total Params | Active Params | Context | Thinking Mode |
|---|---|---|---|---|
| MiMo-V2-Flash | 309B | 15B | 256K | Hybrid (direct + CoT) |
| DeepSeek R2 | ~1.2T (est.) | ~78B (est.) | 128K | Reasoning |
| Qwen 3 235B | 235B | ~22B | 131K | Thinking |

Note on model availability: The Hugging Face repository for MiMo-V2-Flash may be listed under XiaomiMiMo/MiMo-V2-Flash or a text-only variant. Verify the exact repository slug on Hugging Face before downloading, as naming conventions for new model releases sometimes change in the first weeks after launch.

GPU Hardware Requirements

With MoE models, you pay the VRAM cost for all expert weights regardless of how many activate per token. The router loads everything into memory and then picks which experts to run. For 309B parameters:

| Precision | Weight VRAM | Minimum GPU config (weights + activation headroom) |
|---|---|---|
| BF16/FP16 | ~618 GB | 10x H100 80GB (800 GB total) |
| FP8 | ~309 GB | 4x H100 80GB (320 GB) or 3x H200 141GB (423 GB) |
| INT4 | ~155 GB | 2x H200 141GB (282 GB) or 2x B200 192GB (384 GB) |

These figures include headroom for activations. For weights alone at FP16, 8x H100 80GB offers 640 GB raw, but the recommended --gpu-memory-utilization 0.90 vLLM setting caps the effective allocation at 576 GB (640 × 0.90). The FP16 weights (~618 GB) exceed that cap, so vLLM OOMs during weight initialization before any KV cache is allocated: 8x H100 cannot run FP16 under the recommended settings. Plan on 10x H100 (800 GB) or 5x H200 141GB (705 GB) for a real FP16 deployment. You also need headroom for:

  • Activation memory: 10-15% overhead above weight VRAM is typical, pushing the FP16 minimum to around 680 GB before any KV cache allocation.
  • KV cache: This is where the 256K context window bites. At FP16, a 32K context sequence might consume 2-4 GB of KV cache per request. At 256K context, that grows 8x. At lower precisions or with FP8 KV cache quantization, you can recover half of that.

Practical guidance on context length: Start with --max-model-len 32768 and verify stability before increasing. Jumping directly to 256K without profiling KV cache headroom first will result in OOM errors. Each GPU config has a practical ceiling for simultaneous context based on VRAM available after weights.
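To size this before provisioning, a rough per-request KV cache estimate helps. The sketch below uses the standard 2 × layers × KV-heads × head-dim × dtype-bytes formula; the layer and head counts are placeholder assumptions (this guide does not cover MiMo-V2-Flash's exact attention config), so substitute the real values from the model's config.json before trusting the numbers:

```python
# Rough KV cache estimator for planning context-length ceilings.
# num_layers / num_kv_heads / head_dim below are ASSUMED placeholders,
# not published MiMo-V2-Flash architecture numbers.

def kv_cache_gb(context_len, num_layers=48, num_kv_heads=4,
                head_dim=128, dtype_bytes=2):
    """Bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype * tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * context_len / 1024**3

for ctx in (32_768, 131_072, 262_144):
    fp16 = kv_cache_gb(ctx)                 # FP16 KV cache
    fp8 = kv_cache_gb(ctx, dtype_bytes=1)   # with --kv-cache-dtype fp8
    print(f"{ctx:>7} tokens: {fp16:5.1f} GB fp16 | {fp8:5.1f} GB fp8")
```

With these placeholder values a 32K request lands at ~3 GB of FP16 KV cache, inside the 2-4 GB range quoted above, and 256K is 8x that, which is why the jump needs profiling first.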

Memory bandwidth matters more than compute for MoE. Expert routing is bandwidth-bound: the router picks experts, those expert weights are read from HBM, and the computation runs. With only 15B active parameters, the compute per token is modest, but the bandwidth required to load the right expert weights on every forward pass is not. H100 SXM5 (3.35 TB/s HBM bandwidth) and H200 SXM5 (4.8 TB/s) handle this well. PCIe-attached H100s run at lower effective bandwidth and will show throughput degradation at high concurrency.
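To see why bandwidth dominates, a back-of-envelope ceiling: at batch size 1, every decoded token must stream the ~15B active parameters out of HBM at least once. The sketch below computes that upper bound only; real throughput is far lower once routing, attention, KV reads, and inter-GPU communication are counted, and batching amortizes the weight reads across requests:

```python
# Bandwidth-bound decode ceiling for batch-1 generation: aggregate
# HBM bandwidth divided by the bytes of active weights read per token.
# Treat this as an optimistic upper bound, not a throughput prediction.

def decode_ceiling_tok_s(active_params_b, bytes_per_param, num_gpus, hbm_tb_s):
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    aggregate_bw = num_gpus * hbm_tb_s * 1e12
    return aggregate_bw / bytes_per_token

# 8x H100 SXM5 (3.35 TB/s each), FP8 weights (1 byte/param)
print(round(decode_ceiling_tok_s(15, 1, 8, 3.35)))  # → 1787
```

Halving the bytes per parameter (FP8 vs FP16) doubles this ceiling, which is another reason quantization pays off disproportionately on MoE models.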

Recommended configurations on Spheron with live pricing (fetched 08 Apr 2026):

| Configuration | GPUs | VRAM | On-Demand ($/hr) | Spot ($/hr) |
|---|---|---|---|---|
| 8x H100 SXM5 80GB (FP8 only†) | 8 | 640 GB | ~$23.23 | ~$6.41 |
| 4x H100 SXM5 80GB (FP8) | 4 | 320 GB | ~$11.62 | ~$3.21 |
| 3x H200 SXM5 141GB | 3 | 423 GB | ~$13.50 | ~$3.57 |
| 2x B200 SXM6 192GB | 2 | 384 GB | N/A | ~$4.12 |

†8x H100 80GB cannot run FP16 weights under the recommended --gpu-memory-utilization 0.90 setting: the effective VRAM cap is 576 GB (640 × 0.90), which is less than the 618 GB required, causing OOM during model load. Use FP8 quantization with this config; for full FP16, provision 10x H100 or 5x H200.

Pricing fluctuates based on GPU availability. The prices above are based on 08 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Step-by-Step Deployment with vLLM on GPU Cloud

Step 1: Provision Your GPU Cluster on Spheron

Go to app.spheron.ai and provision a multi-GPU instance. For production deployment:

  • Primary recommendation for FP16: 10x H100 SXM5 (800 GB) or 5x H200 SXM5 (705 GB) to accommodate weights plus activation and KV cache headroom.
  • Cost-optimized: 4x H100 SXM5 with FP8 quantization. This halves the hourly cost with under 2% accuracy loss on math and coding benchmarks.
  • Instance config: Ubuntu 22.04, CUDA 12.1+, NVMe storage for weight files. You need at least 700 GB of free NVMe for the ~618 GB of BF16 weights plus the operating system.

NVLink interconnect is important here. MoE expert routing requires all-to-all communication between GPUs on every forward pass. NVLink at 900 GB/s bidirectional is significantly faster than PCIe, which shows up directly in per-request latency.

Step 2: Install vLLM and Dependencies

```bash
pip install "vllm>=0.8.0" transformers accelerate
nvidia-smi  # verify GPU visibility
python -c "import torch; print(torch.cuda.device_count())"
```

For expert parallelism in vLLM, use the --enable-expert-parallel flag rather than a size argument. When enabled, vLLM calculates EP size automatically as EP_SIZE = TP_SIZE x DP_SIZE. If your vLLM version does not support this flag, tensor parallelism alone covers the same use case with slightly higher cross-GPU communication overhead.

Step 3: Download Model Weights

```bash
pip install huggingface_hub
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash \
  --local-dir ./mimo-v2-flash \
  --local-dir-use-symlinks False
```

The BF16 weights are approximately 618GB. Use --resume-download if the connection drops mid-transfer. Verify the exact repository slug on Hugging Face before running; the naming convention may have changed since this guide was written.
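Before launching vLLM, it is worth confirming the shards on disk actually add up to the expected footprint, since a partial download fails at load time with a less obvious error. A minimal stdlib-only check (the directory path matches the download command above):

```python
# Sanity check: sum the downloaded safetensors shards and compare
# against the expected ~618 GB BF16 footprint before starting vLLM.
from pathlib import Path

def total_weight_gb(model_dir, pattern="*.safetensors"):
    return sum(f.stat().st_size for f in Path(model_dir).glob(pattern)) / 1024**3

size = total_weight_gb("./mimo-v2-flash")
print(f"downloaded: {size:.1f} GB")
if size < 600:  # well under ~618 GB -> shards likely missing
    print("WARNING: download looks incomplete, re-run the download command")
```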

Step 4: Launch vLLM Server

FP8 on 8x H100 (recommended for 8-GPU config):

```bash
vllm serve ./mimo-v2-flash \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --served-model-name mimo-v2-flash \
  --host 0.0.0.0 \
  --port 8000
```

FP8 on 4x H100 (cost-optimized):

```bash
vllm serve ./mimo-v2-flash \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 16384 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.97 \
  --served-model-name mimo-v2-flash \
  --host 0.0.0.0 \
  --port 8000
```

Note the lower --max-model-len on the 4x H100 FP8 config. The 4x H100 setup is at its absolute memory limit: 320 GB total VRAM, but --gpu-memory-utilization 0.97 caps vLLM's allocation at 320 × 0.97 = ~310.4 GB. With ~309 GB consumed by FP8 weights, that leaves only ~1.4 GB total across all four GPUs for KV cache. Setting --gpu-memory-utilization below 0.97 will cause an out-of-memory error during model loading. With only ~1.4 GB of KV cache headroom, even a 16K context request is tight and may OOM during prefill on a large MoE model. Keep --max-model-len at 16K or lower and monitor GPU memory closely. Do not increase context length without first confirming available memory via GPU monitoring.

For vLLM with mixed TP + EP on 8 GPUs (TP_SIZE=4, DP_SIZE=2, so EP_SIZE = 4 x 2 = 8):

```bash
vllm serve ./mimo-v2-flash \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --quantization fp8 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --served-model-name mimo-v2-flash \
  --host 0.0.0.0 \
  --port 8000
```

Step 5: Test the Endpoint

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[
        {"role": "user", "content": "Solve: If f(x) = x^2 + 3x + 2, find all roots."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

A working MiMo-V2-Flash instance should return a complete solution showing x = -1 and x = -2, with intermediate steps if chain-of-thought mode is active. If you get truncated output or no thinking tokens, check that --max-model-len is large enough for the reasoning trace.

Step 6: Verify and Monitor

```bash
# Check GPU memory allocation after model load
nvidia-smi

# Monitor throughput via vLLM metrics
curl http://localhost:8000/metrics | grep vllm_

# Check model is serving
curl http://localhost:8000/health
```

MiMo-V2-Flash vs DeepSeek R2 vs Qwen 3: Benchmark Comparison

Public third-party benchmarks for MiMo-V2-Flash are limited given how recently it launched. The table below combines figures from Xiaomi's model card with projections based on architecture:

| Benchmark | MiMo-V2-Flash | DeepSeek R2 (est.) | Qwen 3 235B |
|---|---|---|---|
| MATH | 71.0% (base, 4-shot) | ~75%* | ~85% |
| HumanEval+ | 70.7% (base) | ~75%* | ~80% |
| MMLU | 86.7% (base, 5-shot) | ~85%* | ~84% |
| GPQA-Diamond | 55.1% (base) / 83.7% (post-train) | ~68%* | ~65% |
| Context window | 256K | 128K | 131K |

*DeepSeek R2 figures are provisional estimates based on pre-release architecture information. Verify against official technical report. MiMo-V2-Flash scores are from the base model evaluation unless noted; post-training reasoning variant shows substantially higher GPQA-Diamond scores.

Performance per dollar is where MiMo-V2-Flash's 15B active parameters matter. The compute cost per token is roughly equivalent to a 15B dense model. DeepSeek R2 with ~78B active parameters costs roughly 5x more compute per token at the same inference rate. For workloads where you care about throughput per dollar and can accept slightly lower MATH accuracy, MiMo-V2-Flash is worth benchmarking against R2 before committing to the larger cluster.

For a comparison of inference frameworks for serving these models, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Optimizing MoE Inference: Expert Parallelism and Memory Planning

Expert parallelism vs tensor parallelism. With MiMo-V2-Flash on 8 GPUs, you have options:

  • Pure tensor parallelism (--tensor-parallel-size 8): Every GPU participates in every forward pass. All-to-all communication happens at the attention layer. Low latency for interactive serving.
  • Mixed TP + EP (--tensor-parallel-size 4 --data-parallel-size 2 --enable-expert-parallel): Expert layers are sharded across EP groups (EP_SIZE = TP_SIZE x DP_SIZE = 4 x 2 = 8, using all 8 GPUs), reducing per-layer communication. Better throughput for batch jobs but slightly higher TTFT.
  • Pure expert parallelism (--tensor-parallel-size 1 --enable-expert-parallel): Each GPU owns specific experts. High communication for attention layers. Not recommended for latency-sensitive use.

For interactive serving, pure tensor parallelism is the safe default. For offline batch reasoning jobs where you are processing documents or running evaluations, mixed TP + EP can improve throughput.

KV cache planning for long context. The 256K context window is the most memory-intensive aspect of MiMo-V2-Flash. At FP16, a single 256K-length request can consume 40-60 GB of KV cache depending on model depth and attention head configuration. That exceeds a single H100 80GB's available KV headroom after weights. Practical limits by config:

  • 8x H100 FP8: usable KV headroom ~267 GB total (~33 GB per GPU) after FP8 weights (~309 GB), given --gpu-memory-utilization 0.90 caps vLLM's allocation at 576 GB (640 × 0.90). 64K-128K+ contexts per request are feasible. Note: running FP16 weights on 8x H100 is not viable. The FP16 weights (618 GB) exceed the 576 GB utilization cap (640 × 0.90), causing OOM during model load before any KV cache is allocated.
  • 4x H100 FP8: ~1.4 GB total KV headroom (320 × 0.97 - 309 GB FP8 weights); keep --max-model-len at 16K or lower. 32K context is not viable with this config.
  • 3x H200 FP8: more headroom; 64K context is feasible per request.
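The bullet arithmetic above is easy to reproduce for other configs. Note the sketch ignores activation overhead, so real headroom is somewhat lower than it reports:

```python
# vLLM's allocatable pool is total VRAM * --gpu-memory-utilization;
# whatever remains after weights is shared by the KV cache.

def kv_headroom_gb(total_vram_gb, gpu_mem_util, weight_gb):
    return total_vram_gb * gpu_mem_util - weight_gb

configs = {  # (total VRAM GB, --gpu-memory-utilization, FP8 weight GB)
    "8x H100 FP8": (640, 0.90, 309),
    "4x H100 FP8": (320, 0.97, 309),
    "3x H200 FP8": (423, 0.90, 309),
}
for name, (vram, util, weights) in configs.items():
    print(f"{name}: ~{kv_headroom_gb(vram, util, weights):.1f} GB KV headroom")
```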

FP8 KV cache. Add --kv-cache-dtype fp8 to halve KV memory with negligible quality impact. This is worth enabling for any config where you want longer effective context or higher concurrent request count.

For deeper treatment of KV cache optimization, see the KV cache optimization guide. For the vLLM-specific techniques that make high-concurrency serving practical (continuous batching and paged attention), see the LLM serving optimization guide.

Enabling Hybrid Thinking Mode

MiMo-V2-Flash's hybrid thinking mode is the most user-visible architectural feature. The two modes:

Direct response mode: The model answers immediately without generating a reasoning trace. Faster, lower token count, lower cost. Use for factual Q&A, classification, summarization, and any task where the answer is short and the question is unambiguous.

Chain-of-thought mode: The model generates an internal reasoning trace before producing the final answer. Slower, higher token count, higher cost. Necessary for math problems, multi-step coding, logical deduction, and anything requiring structured intermediate reasoning.

System prompt approach (works with any serving framework):

For direct mode:

System: Answer directly and concisely without showing your reasoning process.

For chain-of-thought mode:

System: Think through this problem step by step before giving your final answer.

API parameter approach (if supported by your vLLM version):

```python
# Direct mode
response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"thinking_mode": "direct"},
    max_tokens=100,
)

# Chain-of-thought mode
response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"thinking_mode": "chain_of_thought"},
    max_tokens=2048,
)
```

Note on the API parameter name: The exact parameter name for thinking_mode may differ depending on how the model was trained and what vLLM version you are running. The system prompt approach is the most reliable fallback since it works regardless of framework version.

Token cost comparison. Chain-of-thought mode generates 5-10x more tokens than direct mode on the same question. A math problem that takes 50 tokens to state might require 2,000 thinking tokens and 200 response tokens in CoT mode, versus 80 tokens total in direct mode. If you are serving mixed workloads, route simple queries to direct mode and complex reasoning to CoT. This is the same pattern covered in the reasoning model inference cost guide.
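The routing pattern can be sketched as below. The keyword heuristic is a deliberately naive placeholder (a production router would use a small classifier model), and the system-prompt strings are the ones from earlier in this section:

```python
# Illustrative request router: send "reasoning-shaped" queries to
# chain-of-thought mode and everything else to direct mode. The
# REASONING_HINTS list is a toy heuristic, not a real classifier.

REASONING_HINTS = ("prove", "solve", "derive", "step by step", "debug", "why does")

def pick_mode(user_message: str) -> str:
    text = user_message.lower()
    return "chain_of_thought" if any(h in text for h in REASONING_HINTS) else "direct"

def system_prompt(mode: str) -> str:
    # System-prompt control works regardless of serving framework version.
    if mode == "chain_of_thought":
        return "Think through this problem step by step before giving your final answer."
    return "Answer directly and concisely without showing your reasoning process."

print(pick_mode("What is the capital of France?"))    # → direct
print(pick_mode("Prove that sqrt(2) is irrational."))  # → chain_of_thought
```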

Cost Analysis: Spheron vs Self-Hosted vs API

Live pricing table (fetched from Spheron GPU API, 08 Apr 2026):

| Configuration | GPUs | VRAM | On-Demand ($/hr) | Spot ($/hr) | Monthly On-Demand | Monthly Spot |
|---|---|---|---|---|---|---|
| 8x H100 SXM5 80GB (FP8 only†) | 8 | 640 GB | ~$23.23 | ~$6.41 | ~$16,726 | ~$4,615 |
| 4x H100 SXM5 80GB (FP8) | 4 | 320 GB | ~$11.62 | ~$3.21 | ~$8,366 | ~$2,311 |
| 3x H200 SXM5 141GB | 3 | 423 GB | ~$13.50 | ~$3.57 | ~$9,720 | ~$2,570 |
| 2x B200 SXM6 192GB | 2 | 384 GB | N/A | ~$4.12 | N/A | ~$2,966 |

Pricing fluctuates based on GPU availability. The prices above are based on 08 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

vs self-hosted on-premise. An 8x H100 SXM5 server runs $250K-$350K in hardware at current GPU prices. Add rack space, power (a server drawing 10+ kW costs $8-12K/year in electricity at $0.10/kWh), cooling, networking, and maintenance. The total 3-year cost of ownership for an 8x H100 server is typically $400K-$500K. At $23.23/hr on Spheron on-demand, you reach $400K after roughly 17,200 hours of runtime, or about 2 years at 24/7 usage. The cloud breaks even faster if you have variable workloads, since you only pay while running.

vs API pricing. No public MiMo-V2-Flash API is available from Xiaomi or third parties at the time of writing. The closest reference points are DeepSeek API ($0.55-2.19/M tokens) and OpenAI o3 ($2-8/M tokens). At Spheron H100 spot pricing of $3.21/hr for 4x H100 FP8, and a practical throughput of 800-1,200 tokens/sec, you generate roughly 70-104M tokens per 24-hour day. The per-million-token cost ranges from ~$1.11/M at 800 tokens/sec down to ~$0.74/M at 1,200 tokens/sec, cheaper than most API options for high-volume workloads.
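The per-million-token arithmetic above, as a reusable helper for plugging in your own measured throughput:

```python
# Cost per million generated tokens for a rented GPU config running 24/7.

def cost_per_million_tokens(hourly_usd, tokens_per_sec):
    daily_cost = hourly_usd * 24
    daily_tokens_m = tokens_per_sec * 86_400 / 1e6
    return daily_cost / daily_tokens_m

# 4x H100 FP8 at spot pricing, across the quoted throughput range
for tps in (800, 1_200):
    print(f"{tps} tok/s: ${cost_per_million_tokens(3.21, tps):.2f}/M tokens")
```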

MoE cost efficiency argument. The key differentiator for MiMo-V2-Flash is that 15B active parameters means each token costs roughly 15B-equivalent compute, not 309B. Compared to a dense 70B model, MiMo-V2-Flash runs more expensive hardware (needs more GPUs for weight storage) but processes each token faster. The net result is that for high-quality reasoning tasks, the cost per correct answer is competitive with much smaller dense models. For more on reasoning model cost patterns, see reducing reasoning model inference costs.

Production Checklist: Scaling, Monitoring, and API Configuration

Before going live with MiMo-V2-Flash:

  • Health check endpoint: Verify GET /health returns 200 before routing traffic. Add this to your load balancer's health check config.
  • Request timeouts: CoT mode generates long token sequences. Set longer timeouts for reasoning requests (120-300 seconds) versus direct mode requests (10-30 seconds). A mixed-mode API needs per-request timeout logic.
  • Autoscaling policy: Scale on queue depth or GPU memory utilization, not CPU. Reasoning model requests are GPU-bound, not CPU-bound.
  • Prometheus metrics: vLLM exposes /metrics with token throughput, request queue depth, KV cache utilization, and per-request latency. Wire these into Grafana before you go live.
  • Token cost attribution: Log thinking tokens separately from output tokens. CoT mode thinking tokens cost the same compute as visible tokens but users do not see them. Attribution matters for billing and capacity planning.
  • Load balancing: MiMo-V2-Flash serving is stateless. Round-robin load balancing across multiple vLLM instances works correctly. No sticky sessions needed.
  • Rate limiting: Apply at the API gateway level. MoE models are more sensitive to concurrent request spikes than dense models because VRAM is always fully allocated for weights.
  • Model version pinning: Pin the exact model commit hash in your deployment config. Model repositories sometimes update weights without a version bump.
  • CoT mode fallback: If P99 latency for chain-of-thought requests exceeds your SLA, consider routing to direct mode as a fallback. A simple query classifier upstream can make this decision automatically.
  • Expert routing overhead at scale: At high concurrency (50+ simultaneous long reasoning requests or 10+ requests/second sustained), expert routing adds latency overhead compared to dense models. The all-to-all GPU communication on each MoE forward pass can become the bottleneck, not compute. Profile under realistic load before finalizing your scaling target.
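The per-request timeout item above can be implemented as a small lookup. The with_options pattern in the comment is one way to apply it with the OpenAI client used earlier; treat that exact call as an assumption to verify against your client version:

```python
# Per-mode request deadlines from the checklist: 10-30 s for direct
# responses, 120-300 s for chain-of-thought. Values here pick the
# upper end of each range.

def request_timeout(mode: str) -> int:
    return {"direct": 30, "chain_of_thought": 300}[mode]

# Assumed usage with the OpenAI Python client from Step 5:
# client.with_options(timeout=request_timeout(mode)).chat.completions.create(...)
print(request_timeout("chain_of_thought"))  # → 300
```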

For Spheron-specific deployment configuration and monitoring setup, refer to the documentation available through your Spheron dashboard.


MiMo-V2-Flash's MoE design means you get 309B parameter knowledge at 15B parameter compute cost. Spheron's multi-GPU H100 and H200 configurations are sized for exactly this kind of large MoE workload.

Rent H100 80GB → | Rent H200 → | View all pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.