Deploy MiMo-V2.5-Pro on GPU Cloud: Xiaomi's 1T MoE Coding Model

MiMo-V2.5-Pro scores approximately 57.2% on SWE-bench Pro, frontier-class performance from a fully open-weight MIT-licensed model with 1.02T total parameters and 42B active per forward pass. If you've worked through the MiMo-V2-Flash deployment guide for the 309B predecessor, the deployment approach here follows the same expert-parallel vLLM pattern, but the VRAM math is in a different league: 1.02T parameters at FP8 means roughly 1,020 GB of weight storage before any KV cache. This guide covers the VRAM math, FP8 expert-parallel deployment on H200 and B200 nodes, and the cost comparison versus Xiaomi's hosted API.

What MiMo-V2.5-Pro Is

MiMo-V2.5-Pro is the 1T-scale successor to MiMo-V2-Flash. Where V2-Flash had 309B total parameters and 15B active, V2.5-Pro scales to 1.02T total with 42B active per token. The model ships under MIT, was trained on 27T tokens, and uses a hybrid attention architecture that makes the 1M context window usable without proportional VRAM growth.

| Property | Value |
| --- | --- |
| Total parameters | 1.02T |
| Active parameters per forward pass | 42B |
| Architecture | MoE with 6:1 sliding-window-to-global attention |
| FP8 format | E4M3 (weights and activations) |
| Context window | 1M tokens |
| Training data | 27T tokens |
| SWE-bench Pro | ~57.2% |
| License | MIT (verify on model card) |
| HuggingFace ID | XiaomiMiMo/MiMo-V2.5-Pro (verify before downloading) |

The 6:1 sliding-window-to-global attention ratio is what makes 1M context practical. For every 7 attention layers, 6 use sliding-window attention: each token attends only to a bounded local window, so the per-layer KV cache stops growing once the sequence exceeds the window. One layer in seven uses full global attention and accumulates KV across the full sequence. KV cache therefore still grows linearly with context length, but at roughly one-seventh the rate of a standard transformer in which every layer is global. At 32K context per session on 8x H200, you have headroom for several concurrent coding-agent sessions. At 128K or longer, that headroom shrinks and you need a larger config.

The FP8 E4M3 format covers both weights and activations. The tradeoffs of FP8 quantization are well documented: typically 1-2% accuracy loss on most benchmarks versus BF16, a significant throughput improvement, and full support in vLLM on H100, H200, and B200 via Hopper and Blackwell Tensor Cores.

Benchmark Performance: SWE-bench Pro and Coding Comparisons

MiMo-V2.5-Pro enters a competitive coding-model landscape. Its ~57.2% SWE-bench Pro score puts it close to GLM-5.1 at 58.4%, with a lead over both Devstral at 46.8% (SWE-bench Verified) and Cohere North Mini Code at 40.2% on SWE-bench Pro. North Mini Code's widely-cited 67.6% figure is its SWE-bench Verified score; on the harder SWE-bench Pro benchmark, MiMo-V2.5-Pro leads it by 17 points.

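| Model | SWE-bench Pro | Active params | Context | GPU minimum |
| --- | --- | --- | --- | --- |
| GLM-5.1 | 58.4% | 40B | 200K | 8x H200 |
| MiMo-V2.5-Pro | ~57.2% | 42B | 1M | 8x H200 |
| Devstral | 46.8% (Verified) | 24B (dense) | 128K | Single L40S |
| Cohere North Mini Code | 40.2% | 3B | 256K | Single H100 |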

For detailed deployment guides for the adjacent models, see the Devstral deployment guide and Cohere North Mini Code deployment guide.

Where MiMo-V2.5-Pro's case for self-hosting is strongest: the 1M context window is five times GLM-5.1's 200K window, which matters for long-horizon coding agents that need to read large repositories in a single pass. The MIT license means no restrictions on commercial use. And for teams that cannot send source code to external APIs, self-hosting is the only option regardless of per-token cost.

GPU Hardware Requirements and VRAM Math

The VRAM calculation starts with total parameter count, not active parameter count. Every expert's weights must reside in VRAM so the router can dispatch any token to any expert.

FP8 weight footprint: 1.02T parameters × 1 byte per param = ~1,020 GB

With framework and activation memory: ~30-50 GB above raw weights at FP8 inference scale

KV cache contribution with 6:1 hybrid attention: only the 1/7 of layers that use global attention keep KV for the whole sequence, so at long context the KV cache grows roughly as if the model had 1/7 the attention depth of a pure global-attention model. The KV per token per layer is small; the scaling constraint is those global layers. For practical capacity planning, budget roughly 10-20 GB of KV cache per concurrent 32K session on 8x H200 after FP8 weights are loaded.

| Configuration | GPUs | VRAM | On-Demand $/hr | Spot $/hr | Fit for |
| --- | --- | --- | --- | --- | --- |
| 8x H200 SXM5 141GB (FP8) | 8 | 1,128 GB | ~$29.58 | ~$14.08 | Up to 8K-32K context per request |
| 8x B200 SXM6 192GB (FP8) | 8 | 1,536 GB | N/A (spot only) | ~$42.70 | Production, up to 256K context |
| Multi-node 16x H200 (FP8) | 16 | 2,256 GB | ~$59.15 | ~$28.16 | 1M context with headroom |

H200 SXM5 on Spheron with 1,128 GB total VRAM is the minimum single-node config. At --gpu-memory-utilization 0.95, 8x H200 provides about 1,071 GB usable. After loading the FP8 weights (~1,020 GB), roughly 50 GB remains for KV cache and framework overhead. That is enough for short contexts (8K-32K) at low concurrency, but constrains you for longer sessions.

B200 SXM6 nodes at 192 GB per card give 1,536 GB across 8 GPUs, available as spot instances on Spheron. At --gpu-memory-utilization 0.90 that is about 1,382 GB usable; after FP8 weights (~1,020 GB) and framework overhead, roughly 310-360 GB remains for KV cache. At the 6:1 attention ratio, this supports 128K-256K context per session with meaningful concurrency.

H100 SXM5 is not viable. 8x H100 80GB provides only 640 GB total VRAM. The FP8 weights alone require ~1,020 GB, so the native checkpoint cannot load on a single H100 node. Fitting into 640 GB would take quantization well below 8 bits per weight, which this guide does not cover.
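The arithmetic in this section is easy to re-run for other node sizes. The following is a minimal sketch that uses only the figures quoted above (1.02T parameters, 1 byte per parameter at FP8, decimal GB); the function names are illustrative, and it ignores framework and activation overhead, so treat the result as an upper bound on KV cache space.

```python
# Sketch of the VRAM arithmetic from this section (decimal GB throughout).
TOTAL_PARAMS = 1.02e12       # total parameters, from the model table
BYTES_PER_PARAM_FP8 = 1      # FP8 stores one byte per weight


def weight_gb(params: float = TOTAL_PARAMS,
              bytes_per_param: float = BYTES_PER_PARAM_FP8) -> float:
    """Raw weight footprint in GB. Every expert counts, not just the active ones."""
    return params * bytes_per_param / 1e9


def headroom_gb(num_gpus: int, gb_per_gpu: float, utilization: float) -> float:
    """Memory left after weights; a negative value means the weights do not fit."""
    usable = num_gpus * gb_per_gpu * utilization
    return usable - weight_gb()


configs = {
    "8x H200 @ 0.95": (8, 141, 0.95),
    "8x B200 @ 0.90": (8, 192, 0.90),
    "8x H100 @ 0.95": (8, 80, 0.95),
}
for name, (n, gb, util) in configs.items():
    print(f"{name}: {headroom_gb(n, gb, util):+.0f} GB after weights")
```

Subtract the 30-50 GB framework allowance from a positive result before sizing KV cache against it.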

Memory bandwidth matters for MoE. Expert routing is bandwidth-bound: the router selects experts, and the GPU reads those expert weights from HBM to compute. H200 SXM5 at 4.8 TB/s HBM3e and B200 SXM6 at 8.0+ TB/s handle this efficiently. For the full MoE inference bandwidth analysis, including parallelism tradeoffs across different GPU configs, see that guide.

Pricing fluctuates based on GPU availability. The prices above are as of 03 Jul 2026 and may have changed. Check current GPU pricing for live rates.

Step-by-Step Deployment with vLLM

Step 1: Provision your GPU node

Go to app.spheron.ai and provision an 8x H200 SXM5 or 8x B200 SXM6 instance. Both use the SXM form factor with NVLink interconnect, which is required for efficient MoE all-to-all expert routing. PCIe-attached H200 variants have lower inter-GPU bandwidth and are not recommended for this model.

Storage requirements: attach at least 1.2 TB NVMe for model weights (1,020 GB at FP8 plus checksum and temp space). Deploy with the PyTorch 2.6 / CUDA 12.4 base image on H200. B200 is a Blackwell part and needs a newer stack (CUDA 12.8 or later, with a matching PyTorch build), so pick the corresponding image if one is offered. For Spheron-specific guidance on instance types (Spot vs. Dedicated vs. Cluster), multi-GPU setup, and supported inference frameworks including vLLM, see the Spheron LLM inference docs.

bash
# Verify all 8 GPUs visible and NVLink connected
nvidia-smi
nvidia-smi topo -m

Step 2: Install vLLM and dependencies

bash
pip install "vllm>=0.9.0" transformers accelerate huggingface_hub hf_transfer

MiMo-V2.5-Pro's architecture may require a specific vLLM version that adds support for the model's hybrid attention and MoE routing. Check the vLLM changelog to confirm which version added MiMo-V2.5-Pro support before installing. Rather than pinning an exact version such as vllm==0.9.0, use a lower bound (>=) at the first release that supports the model so later fixes are picked up.

For multi-node deployments (16+ GPUs), install Ray as well: pip install ray.

Step 3: Download model weights

bash
export HF_TOKEN=your_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1

huggingface-cli download XiaomiMiMo/MiMo-V2.5-Pro \
  --local-dir ./mimo-v2-5-pro \
  --local-dir-use-symlinks False

The repository slug XiaomiMiMo/MiMo-V2.5-Pro is based on Xiaomi's published naming at the time of writing. Xiaomi has changed model slugs between V2-Flash and subsequent releases, so verify the exact repository name on Hugging Face before running this command.

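The FP8 checkpoint is approximately 1.02 TB. If the connection drops mid-transfer, re-run the same command: current huggingface_hub releases resume partial downloads automatically (older releases needed --resume-download). Budget 15-30 minutes on a fast NVMe and reliable network connection.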
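To see what the 15-30 minute budget implies for your link, here is a small sketch that converts checkpoint size and sustained bandwidth into transfer time. The function name and the sample rates are illustrative assumptions; it ignores disk write speed and per-file overhead.

```python
def download_minutes(size_gb: float, gbit_per_s: float) -> float:
    """Transfer time in minutes for size_gb (decimal GB) at a sustained rate in Gbit/s."""
    seconds = size_gb * 8 / gbit_per_s   # GB -> Gbit, then divide by the link rate
    return seconds / 60


CHECKPOINT_GB = 1020  # approximate FP8 checkpoint size quoted above

for rate in (5, 10, 25):  # sample sustained rates in Gbit/s (assumed, not measured)
    print(f"{rate} Gbit/s -> {download_minutes(CHECKPOINT_GB, rate):.1f} min")
```

By this arithmetic, the 15-30 minute window corresponds to roughly 4.5-9 Gbit/s sustained; on a slower link, scale the estimate accordingly.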

Step 4: Launch vLLM server

8x H200, pure tensor parallelism (lowest TTFT for interactive serving):

bash
vllm serve ./mimo-v2-5-pro \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --served-model-name mimo-v2-5-pro \
  --host 0.0.0.0 \
  --port 8000

Start with --max-model-len 32768. The model can accept requests up to 1M tokens, but jumping straight to long contexts without profiling KV cache headroom will cause OOM errors. Verify stability at 32K first, then profile with nvidia-smi before increasing.

8x B200, mixed TP+EP for batch throughput:

bash
vllm serve ./mimo-v2-5-pro \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --served-model-name mimo-v2-5-pro \
  --host 0.0.0.0 \
  --port 8000

The mixed TP+EP config uses TP=4 and DP=2, giving EP_SIZE = TP_SIZE × DP_SIZE = 4 × 2 = 8. Expert layers are sharded across all 8 GPUs, reducing the all-to-all communication overhead for batch workloads. TTFT is slightly higher than pure TP. For interactive coding sessions, use pure TP=8. For overnight batch evaluation runs, mixed TP+EP improves throughput.
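If you launch from a script, it can help to generate both profiles from one place and check that the parallelism sizes add up before starting a long model load. The sketch below only assembles the flags already shown in this step; build_vllm_cmd and its parameters are hypothetical helper names, not part of vLLM.

```python
import shlex


def build_vllm_cmd(model_dir, tp, dp=1, expert_parallel=False,
                   max_model_len=32768, gpu_mem_util=0.95, num_gpus=8):
    """Assemble a `vllm serve` command line from the flags used in this guide."""
    if tp * dp != num_gpus:
        raise ValueError(f"TP ({tp}) x DP ({dp}) must equal the GPU count ({num_gpus})")
    cmd = ["vllm", "serve", model_dir, "--tensor-parallel-size", str(tp)]
    if dp > 1:
        cmd += ["--data-parallel-size", str(dp)]
    if expert_parallel:
        # With expert parallelism on, the EP size is TP x DP.
        cmd.append("--enable-expert-parallel")
    cmd += [
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", "fp8_e5m2",
        "--gpu-memory-utilization", str(gpu_mem_util),
        "--enable-chunked-prefill",
        "--served-model-name", "mimo-v2-5-pro",
        "--host", "0.0.0.0",
        "--port", "8000",
    ]
    return cmd


# Interactive profile: pure TP=8 on 8x H200.
interactive = build_vllm_cmd("./mimo-v2-5-pro", tp=8)
# Batch profile: TP=4, DP=2 with expert parallelism on 8x B200.
batch = build_vllm_cmd("./mimo-v2-5-pro", tp=4, dp=2, expert_parallel=True,
                       max_model_len=131072, gpu_mem_util=0.90)

print(shlex.join(interactive))
print(shlex.join(batch))
```

Pass the resulting list to subprocess.run, or paste the printed line into a shell.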

For background on vLLM expert parallelism and when to use each strategy, see DeepEP and MoE inference kernels, which treats expert routing overhead in detail.

Native FP8 checkpoint: MiMo-V2.5-Pro ships as a native block-wise FP8 (E4M3) checkpoint. vLLM detects pre-quantized FP8 from the quantization metadata stored with the checkpoint, so do NOT pass --quantization fp8 for the native checkpoint; it can conflict with that metadata. Only pass --quantization fp8 if you have a BF16 variant and want vLLM to perform online quantization at load time.

Step 5: Test the endpoint

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="mimo-v2-5-pro",
    messages=[
        {
            "role": "user",
            "content": "Implement a binary search tree in Python with insert, search, and delete operations."
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)

A working deployment returns a complete, runnable implementation with correct BST invariants. If you get truncated output, increase --max-model-len or max_tokens.

Step 6: Monitor

bash
# GPU memory allocation after model load
nvidia-smi

# vLLM Prometheus metrics (throughput, queue depth, KV cache utilization)
curl -s http://localhost:8000/metrics | grep "vllm:"

# Health check before routing traffic
curl http://localhost:8000/health
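For automated checks, the same two endpoints can be polled from Python. This is a minimal standard-library sketch: the helper names are made up for illustration, the sample text uses invented values, and the parser assumes the plain "name{labels} value" line format without trailing timestamps.

```python
import urllib.request


def parse_vllm_metrics(text: str) -> dict:
    """Parse Prometheus text output into {metric_with_labels: value} for vllm: metrics."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics from other exporters.
        if not line.startswith("vllm:"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics


def kv_cache_usage(metrics: dict):
    """Return KV cache usage (0 to 1) if a matching gauge is present, else None."""
    for name, value in metrics.items():
        if name.startswith(("vllm:gpu_cache_usage_perc", "vllm:kv_cache_usage_perc")):
            return value
    return None


def is_healthy(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """True if GET /health answers 200; any connection error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def fetch_metrics(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> dict:
    """Fetch /metrics from a running server and parse it."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
        return parse_vllm_metrics(resp.read().decode("utf-8"))


# Offline demonstration on made-up sample output (values are illustrative only).
sample = """# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mimo-v2-5-pro"} 2.0
vllm:gpu_cache_usage_perc{model_name="mimo-v2-5-pro"} 0.37
"""
parsed = parse_vllm_metrics(sample)
print(parsed)
print("KV cache usage:", kv_cache_usage(parsed))
```

Against a live server, call is_healthy() first and then fetch_metrics() on a fixed interval.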

SGLang Alternative

SGLang is worth evaluating when multiple coding-agent sessions share a common system prompt or a large repository context block. SGLang's RadixAttention caches the KV state of shared prompt prefixes and reuses it across requests, reducing prefill compute substantially when cache hit rates are above 40%.

bash
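python -m sglang.launch_server \
  --model-path ./mimo-v2-5-pro \
  --tp-size 8 \
  --context-length 32768 \
  --mem-fraction-static 0.95 \
  --host 0.0.0.0 \
  --port 8000

SGLang's --dtype flag selects a compute dtype such as bfloat16 and has no fp8 value, so it is left at its default here; as with vLLM, the FP8 quantization should be picked up from the checkpoint's own metadata. --mem-fraction-static is set to 0.95 for the same reason as --gpu-memory-utilization on vLLM: at 0.90, 8x H200 offers about 1,015 GB, less than the ~1,020 GB of weights. SGLang's MoE support is maturing quickly. For complex multi-tool agentic workloads, verify SGLang's function calling output format matches your harness's expectations before committing to production. For a full SGLang production configuration including load balancing, health checks, and multi-node setup, see the SGLang production deployment guide.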

Use vLLM as the default and switch to SGLang if you measure prefix-cache hit rates consistently above 40% in your workload.
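Before switching, you can get a rough offline estimate of how much prefix reuse your traffic offers. The sketch below assumes you have a log of prompts already tokenized into ID lists (the helper names and the toy data are hypothetical) and counts, for each prompt, the longest prefix it shares with any earlier prompt. It ignores cache eviction, so it is an optimistic bound rather than a measured hit rate.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimated_prefix_hit_rate(prompts: list) -> float:
    """Share of all prompt tokens that repeat a prefix of some earlier prompt."""
    total = sum(len(p) for p in prompts)
    if total == 0:
        return 0.0
    reused = 0
    for i, prompt in enumerate(prompts):
        # The first prompt has nothing to reuse; default=0 covers that case.
        reused += max((shared_prefix_len(prompt, earlier) for earlier in prompts[:i]),
                      default=0)
    return reused / total


# Toy workload: four requests sharing a 100-token system prompt,
# each followed by 50 request-specific tokens.
system_prompt = list(range(100))
prompts = [system_prompt + [1000 + i] * 50 for i in range(4)]

rate = estimated_prefix_hit_rate(prompts)
print(f"estimated prefix hit rate: {rate:.0%}")
print("worth evaluating SGLang" if rate > 0.40 else "stay on vLLM")
```

Run it over a representative sample of real requests; the 40% threshold is the one suggested above.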

Using MiMo-V2.5-Pro as a Coding-Agent Backend

Tool calling and function calling

MiMo-V2.5-Pro follows the OpenAI function calling schema. Any framework that speaks the OpenAI tool_calls API format works without modification. For OpenHands deployment on GPU cloud, point the base_url at your vLLM server and set the model to mimo-v2-5-pro. The same applies to SWE-Agent, Cline, Aider, and any other OpenAI-compatible coding agent.

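On vLLM, automatic tool choice is opt-in: start the server with --enable-auto-tool-choice and a --tool-call-parser that matches the model's tool-call format (check the model card for the parser name). Then verify tool-call output format before routing production sessions: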

python
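from openai import OpenAI

# Same client setup as in Step 5
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(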
    model="mimo-v2-5-pro",
    messages=[{"role": "user", "content": "Read the file at /path/to/file.py"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from disk",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"]
            }
        }
    }],
    tool_choice="auto",
    max_tokens=256,
)
# Verify response.choices[0].message.tool_calls is populated
print(response.choices[0].message.tool_calls)
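Once tool_calls is populated, the agent harness has to execute each call and send the result back as a "tool" message. Frameworks such as OpenHands do this for you; the sketch below only illustrates the shape of that step. TOOLS and run_tool_calls are hypothetical names, and the demo uses a stand-in object shaped like an SDK tool call so it runs without a server.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def read_file(path: str) -> str:
    """Local implementation of the read_file tool declared in the request above."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Registry of tool name -> Python callable (hypothetical; extend per harness).
TOOLS = {"read_file": read_file}


def run_tool_calls(tool_calls):
    """Execute each tool call and return 'tool' messages to append to the conversation."""
    messages = []
    for call in tool_calls or []:
        try:
            fn = TOOLS.get(call.function.name)
            if fn is None:
                raise ValueError(f"unknown tool: {call.function.name}")
            args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            content = str(fn(**args))
        except Exception as exc:
            # Report failures back to the model instead of crashing the agent loop.
            content = f"error: {exc}"
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages


# Offline demo with a stand-in for response.choices[0].message.tool_calls[0].
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("print('hello')\n")
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="read_file",
                             arguments=json.dumps({"path": tmp.name})),
)
demo_messages = run_tool_calls([fake_call])
os.unlink(tmp.name)
print(demo_messages)
```

In a real loop, append the assistant message and these tool messages to the conversation, then call the chat completions endpoint again.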

For tool calling benchmark details across models, including latency optimization for agentic loops, see that guide.

Long-horizon tasks and 1M context

The 1M context window means multi-file repositories can fit in a single context, removing the chunking overhead that limits shorter-context models on large codebase tasks. At 32K context per request on 8x H200 FP8, you have headroom for roughly 2-4 concurrent agent sessions after accounting for FP8 weight overhead. At 128K context on 8x B200, you can run 2-3 concurrent sessions with meaningful KV headroom.

For context engineering patterns for production AI agents, including KV cache management strategies for long-horizon multi-step coding tasks, see that guide.

Throughput tuning

Set --max-num-seqs to control concurrency and protect latency targets. For interactive coding sessions (single developer, short latency requirement), --max-num-seqs 4 is a reasonable starting point on 8x H200. For batch evaluation jobs, increase this to 16-32. Always enable --enable-chunked-prefill for coding tasks, which allows long input prefills to share GPU time with decoding for other requests.

Set longer request timeouts for long-horizon coding tasks: 120-300 seconds depending on your agentic loop's expected step duration. Default framework timeouts of 30-60 seconds will cut off multi-step coding sequences before they complete.

Expose vLLM's Prometheus metrics at /metrics and monitor vllm:gpu_cache_usage_perc to track KV cache utilization (a gauge between 0 and 1; newer vLLM releases may report it as vllm:kv_cache_usage_perc). If it consistently sits at 100%, reduce --max-num-seqs, lower --max-model-len, or add GPU capacity so that new requests do not queue indefinitely.

Multi-Token Prediction (MTP) head. MiMo-V2.5-Pro ships with a 3-layer MTP head designed for speculative decoding. When enabled, the model drafts multiple output tokens per forward pass and verifies them in a single sweep, yielding roughly 2-3x output throughput without quality regression. The 300-500 tok/s estimates in the cost table below are conservative baselines without MTP; actual output rates with speculative decoding enabled will be meaningfully higher. Check the vLLM changelog and the MiMo-V2.5-Pro model card for the current flag to enable MTP speculative decoding, as support is actively being added to vLLM.

KV Cache Planning for 1M-Context Coding Sessions

The 6:1 sliding-window-to-global attention ratio changes KV cache scaling compared to a standard transformer:

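Standard full-attention model: KV cache grows linearly with sequence length for every layer. With N layers and sequence length L, KV bytes = N × L × 2 × n_kv_heads × head_dim × bytes_per_dtype, where the factor of 2 covers keys and values and n_kv_heads is the number of key/value heads (fewer than the query heads when grouped-query attention is used).

MiMo-V2.5-Pro at 6:1 ratio: For every 7 layers, 6 use sliding-window attention with a fixed window size (bounded per-layer KV regardless of sequence length), and 1 uses full global attention (grows linearly with L). For long sequences, total KV therefore grows at roughly 1/7 the rate of a pure global-attention model of equivalent depth, plus a fixed contribution from the sliding-window layers.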

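With FP8 KV cache (--kv-cache-dtype fp8_e5m2), the per-token KV footprint is half of BF16. This is the single highest-leverage change for extending usable context on memory-constrained configs. Some vLLM versions reject fp8_e5m2 for KV cache when the model itself is loaded from an FP8 checkpoint; if startup fails with that error, try --kv-cache-dtype fp8 instead.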
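The two scaling rules can be compared directly in code. The sketch below is illustrative only: the layer count, KV head count, head dimension, and window size are placeholder values chosen to make the arithmetic easy, not MiMo-V2.5-Pro's published dimensions, so read the ratio rather than the absolute sizes.

```python
def kv_bytes_full(num_layers, seq_len, n_kv_heads, head_dim, bytes_per_elem):
    """KV cache for one sequence when every layer uses global attention."""
    # The factor of 2 accounts for storing both keys and values.
    return num_layers * seq_len * 2 * n_kv_heads * head_dim * bytes_per_elem


def kv_bytes_hybrid(num_layers, seq_len, n_kv_heads, head_dim, bytes_per_elem,
                    window, swa_per_global=6):
    """KV cache for one sequence with swa_per_global sliding-window layers per global layer."""
    global_layers = num_layers // (swa_per_global + 1)
    swa_layers = num_layers - global_layers
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    # Global layers keep every token; sliding-window layers keep at most `window` tokens.
    cached_tokens = global_layers * seq_len + swa_layers * min(seq_len, window)
    return cached_tokens * per_token_per_layer


# Placeholder dimensions (assumptions for illustration, not the model's real config).
dims = dict(num_layers=70, n_kv_heads=8, head_dim=128, bytes_per_elem=1)  # 1 byte = FP8 KV
WINDOW = 4096

for seq_len in (32_768, 131_072, 1_048_576):
    full = kv_bytes_full(seq_len=seq_len, **dims)
    hybrid = kv_bytes_hybrid(seq_len=seq_len, window=WINDOW, **dims)
    print(f"L={seq_len:>9,}: full {full / 1e9:7.2f} GB, "
          f"hybrid {hybrid / 1e9:6.2f} GB, ratio {hybrid / full:.3f}")
```

The ratio starts above 1/7 at short context, where the sliding-window layers still hold a meaningful share of the cache, and approaches 1/7 as the sequence grows.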

Practical headroom estimates with --gpu-memory-utilization 0.90:

  • 8x H200 (1,128 GB total, ~1,015 GB usable at 0.90): The FP8 weights (~1,020 GB) exceed what is usable at 0.90, so the model does not load at this setting. Set --gpu-memory-utilization 0.95 or higher. With 0.95 you get ~1,071 GB usable, leaving ~51 GB for KV cache. Keep --max-model-len 32768 or below.
  • 8x B200 (1,536 GB total, ~1,382 GB usable at 0.90): After FP8 weights (~1,020 GB), roughly 362 GB for KV cache. At FP8 KV cache and 128K context with 6:1 ratio, multiple concurrent sessions are feasible.
  • 16x H200 multi-node (2,256 GB total, ~2,030 GB usable at 0.90): After FP8 weights, roughly 1,010 GB for KV cache. Supports 1M-context sessions with headroom.

For NVMe KV cache offloading on configs where GPU KV headroom is insufficient for 1M context, that guide covers the disk offloading path with vLLM. For the full KV cache optimization guide covering eviction policies, FP8 KV tradeoffs, and prefix caching, see that reference.

Cost Analysis: Spheron vs Xiaomi MiMo API

Xiaomi's hosted MiMo API charges $1/M input tokens and $3/M output tokens. With self-hosting on Spheron, you pay for GPU time regardless of token count.

Pricing sourced from the Spheron API on 03 Jul 2026:

| Config | Hourly cost | Est. throughput | Output tokens/24hr | $/M output tokens |
| --- | --- | --- | --- | --- |
| 8x H200 on-demand | ~$29.58 | 300-500 tok/s | ~26-43M | ~$16-27/M |
| 8x H200 spot | ~$14.08 | 300-500 tok/s | ~26-43M | ~$8-13/M |
| 8x B200 spot | ~$42.70 | 500-900 tok/s | ~43-78M | ~$13-24/M |
| Xiaomi MiMo API | Input: $1/M, Output: $3/M | N/A | N/A | $3/M output |
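The $/M figures in the table follow from hourly price and sustained output throughput. A minimal sketch of that conversion, assuming the node is fully utilized at the quoted rate (the function name is illustrative):

```python
def cost_per_million_output(hourly_usd: float, tok_per_s: float) -> float:
    """Self-hosted cost per million output tokens at a sustained output rate."""
    tokens_per_hour = tok_per_s * 3600
    return hourly_usd / tokens_per_hour * 1e6


rows = {
    "8x H200 on-demand": (29.58, 300, 500),
    "8x H200 spot": (14.08, 300, 500),
    "8x B200 spot": (42.70, 500, 900),
}
for name, (price, low_tps, high_tps) in rows.items():
    best = cost_per_million_output(price, high_tps)    # higher throughput, lower cost
    worst = cost_per_million_output(price, low_tps)
    print(f"{name}: ${best:.2f}-${worst:.2f} per M output tokens")
```

Plug in your own measured throughput; idle time raises the effective cost proportionally.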

The output-token cost comparison favors the Xiaomi API at typical throughput levels. Self-hosting's cost advantage appears in two specific scenarios:

High input-to-output ratios. The Xiaomi API's $1/M input cost accumulates quickly with 1M-context sessions. If your coding agent reads 500K tokens of repository context per task and produces 10K output tokens, the API charge is $0.50 input + $0.03 output = $0.53 per session. On 8x H200 spot at $14.08/hr, each session taking 20-30 seconds at low concurrency costs roughly $0.08-0.12 in cluster time. The break-even depends heavily on your input/output ratio, but context-heavy workloads shift it significantly toward self-hosting.
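The per-session comparison above can be reproduced with a few lines. This sketch uses the prices quoted in this section and assumes the session has the cluster to itself for its full duration, which overstates self-hosted cost whenever sessions overlap; the function names are illustrative.

```python
def api_cost(input_tokens: int, output_tokens: int,
             usd_per_m_input: float = 1.0, usd_per_m_output: float = 3.0) -> float:
    """Hosted API charge for one session at per-million-token prices."""
    return input_tokens / 1e6 * usd_per_m_input + output_tokens / 1e6 * usd_per_m_output


def cluster_cost(hourly_usd: float, seconds: float) -> float:
    """GPU cost of occupying the whole cluster for `seconds`."""
    return hourly_usd * seconds / 3600


session_api = api_cost(input_tokens=500_000, output_tokens=10_000)
print(f"API: ${session_api:.2f} per session")
for seconds in (20, 30):
    print(f"8x H200 spot, {seconds}s: ${cluster_cost(14.08, seconds):.3f} per session")
```

Vary input_tokens and seconds to find where the two curves cross for your own workload.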

Data sovereignty. For teams where source code cannot leave internal infrastructure, the API is not an option regardless of cost. Self-hosting on a Spheron instance in your preferred region addresses this without code escrow or custom enterprise agreements.

Production Checklist

Before routing production coding-agent traffic to a MiMo-V2.5-Pro instance:

  • Health check at /health before routing traffic. Add this to your load balancer health check config.
  • Set long timeouts for agentic coding sessions. 120-300 seconds depending on your expected task duration. Default timeouts cut off multi-step sequences.
  • Monitor vllm:gpu_cache_usage_perc via Prometheus. If KV cache consistently hits 100%, reduce --max-num-seqs or add more GPU capacity.
  • Pin the model commit hash in your deployment config. Xiaomi has updated weights post-release before; pin to a specific commit to avoid silent behavior changes.
  • Enable FP8 KV cache by default (--kv-cache-dtype fp8_e5m2) for all configs. The memory savings are substantial and accuracy impact is negligible for coding tasks.
  • Enable chunked prefill (--enable-chunked-prefill) to prevent long input prefills from blocking decoding for other concurrent requests.
  • Autoscale on GPU memory utilization, not CPU. MiMo-V2.5-Pro serving is GPU-bound, not CPU-bound.
  • Watch for expert routing overhead at high concurrency. At 20+ simultaneous sessions, the all-to-all GPU communication on each MoE forward pass adds latency. Profile under realistic concurrent load before setting your concurrency target.
  • Multi-node for 1M context. Reliable 1M-context production serving requires 16x H200 or 16x B200 across 2 hosts. For NCCL and InfiniBand setup for multi-node MoE inference, see the DeepEP and MoE inference kernels guide linked earlier in this post.

MiMo-V2.5-Pro's 1.02T MoE weights need H200 or B200 nodes. Spheron's multi-GPU SXM clusters are provisioned in under 2 minutes with NVLink interconnect for expert-parallel serving.

Quick Setup Guide

  1. Calculate VRAM and select GPU configuration

    MiMo-V2.5-Pro at FP8 requires roughly 1,020 GB for weights. On 8x H200 SXM5 141 GB (1,128 GB total), set --gpu-memory-utilization 0.95 to load the model, leaving about 50 GB for KV cache at short context lengths. For longer contexts (up to roughly 256K), provision 8x B200 SXM6 192 GB (1,536 GB total), which leaves about 362 GB for KV cache after FP8 weights at --gpu-memory-utilization 0.90. Reliable 1M-context serving needs a multi-node setup.

  2. Provision a multi-GPU node on Spheron

    Log in to app.spheron.ai and select an 8x H200 SXM5 or 8x B200 SXM6 instance. Choose spot pricing for batch workloads or on-demand for production serving. Deploy with the PyTorch 2.6 / CUDA 12.4 base image (B200 needs an image with CUDA 12.8 or later). Attach at least 1.2 TB of NVMe persistent storage for model weights.

  3. Install vLLM and download MiMo-V2.5-Pro weights

    Run: pip install 'vllm>=0.9.0' transformers accelerate huggingface_hub hf_transfer. Export HF_TOKEN. Enable fast transfers with HF_HUB_ENABLE_HF_TRANSFER=1. Run: huggingface-cli download XiaomiMiMo/MiMo-V2.5-Pro --local-dir ./mimo-v2-5-pro. Verify the exact Hugging Face repository ID before downloading; Xiaomi's model naming has changed across releases.

  4. Launch vLLM with expert parallelism

    For 8x H200, pure tensor parallelism (lowest TTFT): vllm serve ./mimo-v2-5-pro --tensor-parallel-size 8 --max-model-len 32768 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.95 --enable-chunked-prefill --served-model-name mimo-v2-5-pro --host 0.0.0.0 --port 8000. For batch throughput on 8x B200, mixed TP+EP: change --tensor-parallel-size to 4, add --data-parallel-size 2 --enable-expert-parallel, lower --gpu-memory-utilization to 0.90, and increase --max-model-len up to 131072 once KV headroom is confirmed.

  5. Test the OpenAI-compatible endpoint

    Run: curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "mimo-v2-5-pro", "messages": [{"role": "user", "content": "Write a Python function to find the longest palindromic substring."}], "max_tokens": 1024}'. A working deployment returns a complete, runnable implementation.

  6. Wire into a coding-agent harness

    Point OpenHands, SWE-Agent, or any OpenAI-compatible coding agent at http://your-instance-ip:8000/v1 with the model name mimo-v2-5-pro. MiMo-V2.5-Pro supports the OpenAI function calling schema; verify tool-call output format by sending a test request with a tools array before routing production agent sessions.

Frequently Asked Questions

What GPUs do I need to run MiMo-V2.5-Pro?

MiMo-V2.5-Pro has 1.02T total parameters and 42B active per token. At FP8 (1 byte/param), weights require approximately 1,020 GB. On 8x H200 SXM5 141 GB (1,128 GB total), you need --gpu-memory-utilization 0.95 to load the model; at that setting, roughly 50 GB remains for KV cache, limiting you to 32K context or below at low concurrency. For substantial KV headroom, 8x B200 SXM6 192 GB (1,536 GB) is the recommended production setup, leaving about 362 GB for KV cache after FP8 weights at --gpu-memory-utilization 0.90. 8x H100 SXM5 80 GB (640 GB) cannot fit the FP8 weights alone and is not viable.

Can I run MiMo-V2.5-Pro on 8x H100?

No. At FP8, MiMo-V2.5-Pro weights require roughly 1,020 GB. 8x H100 SXM5 80GB provides only 640 GB, which cannot hold the FP8 checkpoint. You need 8x H200 141GB (1,128 GB) as the minimum single-node config, with 8x B200 192GB (1,536 GB) recommended for production serving with longer-context KV headroom. Reliable 1M-context serving needs a multi-node setup.

How does the 6:1 hybrid attention keep 1M context practical?

MiMo-V2.5-Pro uses a hybrid attention design that alternates between sliding-window attention and full global attention at a 6:1 ratio. For every 7 attention layers, 6 use sliding-window attention (each token attends to a local window of recent tokens, bounded per-layer KV cache), and 1 uses full global attention (each token attends to all prior tokens). KV cache still grows linearly up to the 1M context limit, but at roughly 1/7 the rate of a fully global model, because only 1/7 of attention layers accumulate KV cache across the full sequence length.

How do I enable expert parallelism in vLLM?

Use --enable-expert-parallel alongside --tensor-parallel-size. On an 8-GPU node with TP=4 and DP=2, the expert parallelism size is computed automatically as EP_SIZE = TP_SIZE x DP_SIZE = 4 x 2 = 8, distributing the MoE expert layers across all 8 GPUs. For latency-sensitive serving, pure TP (--tensor-parallel-size 8) minimizes TTFT. For throughput-heavy batch coding-agent jobs, mixed TP+EP (--tensor-parallel-size 4 --data-parallel-size 2 --enable-expert-parallel) improves tokens/sec at the cost of slightly higher TTFT.

How does self-hosting cost compare with Xiaomi's hosted API?

Xiaomi's hosted MiMo API is priced at $1/M input tokens and $3/M output tokens. At 8x H200 spot pricing on Spheron and a practical throughput of 300-500 tokens/sec total for the full-size 1T model, the cost comparison depends heavily on your input/output token ratio. For workloads with 1M-token context windows and high input volumes, the API's $1/M input cost accumulates fast, making self-hosting competitive or cheaper. For purely output-token-heavy workloads with modest context, the API can be cheaper at low utilization. Data sovereignty, no rate limits, and no per-token billing for input make self-hosting attractive regardless of pure cost math.
