
Deploy MiniMax M3 on GPU Cloud: Self-Host the First Open-Weight Frontier Model with MSA, 1M Context, and Native Multimodality (2026 Guide)


MiniMax M3 launched June 1, 2026 as the first open-weight model to simultaneously hit a 59.0% SWE-Bench Pro score, support a 1M-token context window, and ship native image and video understanding in a single checkpoint. For teams comparing the upgrade from the previous generation, MiniMax M2.7's deployment guide covers the 229B MoE baseline, but M3's MSA architecture and 1M context capability are a different class of deployment challenge. This guide covers the hardware requirements, vLLM and SGLang setup, multimodal serving, agentic coding workloads, and the cost math for self-hosted M3 versus the hosted API options.

What Is MiniMax M3

MiniMax M3 is the company's open-weight frontier model released June 1, 2026. It uses a Mixture-of-Experts (MoE) architecture with 229.9B total parameters and 9.8B active per token across 256 fine-grained experts. The smaller active parameter count per forward pass keeps per-token compute comparable to a mid-sized dense model while the full 229.9B parameter set resides in VRAM.

Three capabilities define M3's position relative to earlier models:

SWE-Bench Pro at 59.0%. MiniMax published this as M3's official benchmark, and it exceeds both GPT-5.5 and Gemini 3.1 Pro on the same evaluation. SWE-Bench Pro measures multi-step software engineering task completion, which is harder to game than single-pass generation benchmarks. The 59.0% figure reflects the model's ability to understand a codebase, isolate a bug, generate a patch, and verify it passes tests across multiple turns.

| Benchmark | MiniMax M3 | GPT-5.5 | Gemini 3.1 Pro | Claude Opus 4.8 |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro | 59.0% | Below M3 | Below M3 | ~69.2% |

Comparison scores for GPT-5.5 and Gemini 3.1 Pro are from MiniMax's official release announcement. M3 leads the open-weight field and outperforms several closed models, but some closed models including Claude Opus 4.8 (~69.2%) score higher on this benchmark. Verify all figures against each provider's current published numbers before using for procurement decisions.

1M token context window. The context length is 1,048,576 tokens. This is not a soft cap with degraded quality at the edges. M3's MSA architecture is specifically designed to maintain coherent retrieval and generation across the full context length at practical inference costs.

Native multimodality. Image and video understanding are built into the base checkpoint, not a separate model or vision adapter. A single endpoint handles text, image, and video inputs interchangeably.

MSA Architecture: How MiniMax M3 Handles 1M Context

Standard full attention scales quadratically: doubling context length quadruples the compute cost. At 1M tokens, full attention becomes impractical on any current GPU configuration. Most long-context models work around this with sliding window attention, retrieval-augmented approaches, or hard limits at 128K or 200K.

M3's MSA (MiniMax Sparse Attention) takes a different approach. By restricting each token's attention to a structured sparse subset of other tokens, MSA delivers more than 9x prefill speedup and more than 15x decode speedup at 1M context versus MiniMax's previous-generation M2 model. Per-token compute at 1M tokens drops to 1/20th of M2's baseline at equivalent context length.

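The practical consequence: per-token compute at 1M context on M3 is roughly what a comparable full-attention model spends at 50K context. Moving a deployment from 32K to 1M context therefore raises compute cost modestly rather than by a hardware-multiplying factor. KV cache memory still grows with context length, so VRAM rather than compute sets the ceiling: as the GPU Memory Requirements section shows, a 2x H200 FP8 config tops out well below 1M tokens, and true 1M context needs 4x H200 or larger.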
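As a rough back-of-envelope check of the scaling argument above, the sketch below compares idealized attention cost at two context lengths. It assumes pure O(n^2) full attention and treats per-token decode cost as growing linearly with context. The function names are illustrative, and the 20x reduction is MiniMax's claimed figure, not a measurement.

```python
def full_attention_cost_ratio(context_a: int, context_b: int) -> float:
    """Relative attention compute of context_b versus context_a under idealized O(n^2) scaling."""
    return (context_b / context_a) ** 2


def equivalent_full_attention_context(context: int, per_token_reduction: float) -> float:
    """Context length at which a full-attention baseline (per-token cost linear in
    context) would spend the same per-token compute as the reduced-cost model."""
    return context / per_token_reduction


print(full_attention_cost_ratio(32_768, 65_536))         # doubling context -> 4.0
print(full_attention_cost_ratio(32_768, 1_048_576))      # 32K to 1M -> 1024.0
print(equivalent_full_attention_context(1_000_000, 20))  # claimed 1/20th cost -> 50000.0
```

The 1024x figure is why full attention at 1M tokens is impractical, and the last line reproduces the "1M costs about what 50K costs" rule of thumb.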

This matters for real workloads. Full codebase context, entire legal documents, long conversation histories, multi-document research tasks - these now fit in a single request on mid-tier multi-GPU hardware rather than requiring either a smaller context window or a compute cluster that prices out most teams.

For a deeper look at how sparse attention mechanisms change the memory math at extreme context lengths, the DeepSeek sparse attention and long context LLM guide covers the architectural tradeoffs in detail.

GPU Memory Requirements

MiniMax M3 is a MoE model: the full parameter set must reside in VRAM even though only a fraction activates per forward pass. You cannot page unselected experts from CPU without introducing latency that breaks interactive inference.

VRAM requirements by precision (estimates based on published GPU configurations; verify exact figures from the official model card):

| Precision | VRAM Required | Min GPU Config |
| --- | --- | --- |
| BF16 | ~460 GB | 4x H200 SXM5 or 6x H100 SXM5 |
| FP8 | ~230 GB | 2x H200 SXM5 or 4x H100 SXM5 |
| AWQ INT4 | ~115 GB | 1x H200 SXM5 or 2x H100 SXM5 |
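The VRAM figures in this table follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming weights only (AWQ scale and zero-point overhead, activations, and runtime buffers are ignored) and using hypothetical helper names:

```python
# Bytes stored per parameter at each precision (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

M3_TOTAL_PARAMS = 229.9e9  # total parameter count stated in this guide


def weight_memory_gb(total_params: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


def headroom_gb(total_params: float, precision: str,
                gpu_count: int, vram_per_gpu_gb: float) -> float:
    """VRAM left for KV cache and activations after loading the weights."""
    return gpu_count * vram_per_gpu_gb - weight_memory_gb(total_params, precision)


for precision in ("bf16", "fp8", "int4"):
    print(precision, f"{weight_memory_gb(M3_TOTAL_PARAMS, precision):.0f} GB")

# 2x H200 (141 GB each) at FP8: roughly 52 GB left over for KV cache.
print(f"{headroom_gb(M3_TOTAL_PARAMS, 'fp8', 2, 141):.0f} GB headroom")
```

Real deployments have less headroom than this suggests, because the serving framework reserves memory for activations and CUDA graphs on top of the weights.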

KV cache memory adds on top of model weights and scales with context length. MSA reduces attention compute cost but does not eliminate KV cache memory growth entirely - the KV states for attended positions still need to be stored. At 32K context in FP16 on a 2x H200 config (282 GB total), expect roughly 6-8 GB of KV cache overhead. At 128K context, that grows to 20-30 GB. For 1M context inference, FP8 KV cache via --kv-cache-dtype fp8_e5m2 is effectively required to stay within VRAM limits.

| Context Length | FP16 KV Cache (est.) | FP8 KV Cache (est.) |
| --- | --- | --- |
| 32K | ~8 GB | ~4 GB |
| 128K | ~30 GB | ~15 GB |
| 512K | ~120 GB | ~60 GB |
| 1M | ~240 GB | ~120 GB |

At 1M context, the FP8 KV cache alone consumes roughly 120 GB. On a 2x H200 FP8 config with 282 GB total and ~230 GB of model weights, that leaves roughly 52 GB of headroom. This is still not enough to hold a full 1M context FP8 KV cache (120 GB). True 1M-context inference requires at minimum 4x H200 SXM5 (564 GB total) or 8x H100 SXM5 (640 GB total), as reflected in the context budget table later in this post. For the 4x H100 FP8 config (320 GB), the ~90 GB of headroom is also insufficient for full 1M context KV cache; keep --max-model-len at or below 524288 on 4x H100 configurations, and enable FP8 KV cache to allow concurrent requests.
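The fit-or-not reasoning above can be sketched as a small check. It assumes KV cache grows linearly with context and takes this guide's ~120 GB FP8 estimate at 1M tokens as the calibration point. The helper names are hypothetical, and activation memory and the --gpu-memory-utilization margin are ignored.

```python
# Calibrated to the guide's estimate: ~120 GB of FP8 KV cache at 1,048,576 tokens.
FP8_KV_GB_PER_TOKEN = 120 / 1_048_576


def kv_cache_gb(context_tokens: int, fp8: bool = True) -> float:
    """Estimated KV cache size for one sequence; FP16 is taken as twice FP8."""
    gb = context_tokens * FP8_KV_GB_PER_TOKEN
    return gb if fp8 else gb * 2


def fits(total_vram_gb: float, weights_gb: float,
         context_tokens: int, fp8: bool = True) -> bool:
    """True if one full-length sequence's KV cache fits beside the weights."""
    return kv_cache_gb(context_tokens, fp8) <= total_vram_gb - weights_gb


configs = {"2x H200": 282, "4x H100": 320, "4x H200": 564, "8x H100": 640}
for name, vram in configs.items():
    for context in (262_144, 524_288, 1_048_576):
        print(name, context, fits(vram, 230, context))
```

Under these assumptions only the 4x H200 and 8x H100 rows report a fit at 1,048,576 tokens, which matches the minimums stated above.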

For the underlying math on parameter counts and bytes per precision format, see the LLM GPU memory requirements explainer.

Spheron GPU Configurations and Pricing

Pricing fetched from the Spheron API on 12 Jun 2026:

| Configuration | Precision | Total VRAM | Spot Price | On-Demand Price |
| --- | --- | --- | --- | --- |
| 2x H200 SXM5 | FP8 | 282 GB | $3.64/hr (2 x $1.82) | $9.68/hr (2 x $4.84) |
| 4x H100 SXM5 | FP8 | 320 GB | $5.72/hr (4 x $1.43) | $15.68/hr (4 x $3.92) |
| 1x H200 SXM5 | AWQ INT4 | 141 GB | $1.82/hr | $4.84/hr |
| 2x H100 SXM5 | AWQ INT4 | 160 GB | $2.86/hr (2 x $1.43) | $7.84/hr (2 x $3.92) |

Pricing fluctuates with GPU availability. The prices above reflect rates on 12 Jun 2026 and may have changed; check current GPU pricing for live rates. Use spot for batch or bursty workloads, on-demand for 24/7 SLA serving. The cost-per-token estimates in the comparison table below use spot rates.

For FP8 M3 inference, H200 GPU pricing on Spheron is the right starting point. The 141 GB of HBM3e per card leaves roughly 26 GB per card above the ~115 GB weight shard (230 GB split across 2 GPUs) for KV cache. Two H200s at $3.64/hr spot handle context lengths up to roughly 256K-300K comfortably (the ~52 GB of headroom after model weights fits the ~30 GB FP8 KV cache at 256K). A 512K context exceeds that headroom and requires 4x H100 FP8 or larger. True 1M context needs at least 4x H200 to accommodate the 120 GB FP8 KV cache.

Teams working with AWQ INT4 and moderate context lengths will find H100 cloud pricing useful: two H100 SXM5 cards at $2.86/hr spot ($7.84/hr on-demand) cover the INT4 footprint comfortably and bring costs down significantly for batch or experimental workloads.

Step-by-Step Deployment with vLLM

vLLM provides the OpenAI-compatible server, tensor and expert parallelism, and the FP8 KV cache support that M3's 1M context workloads require.

Step 1: Provision the GPU Node on Spheron

Select your instance in app.spheron.ai - 2x H200 SXM5 for FP8, 4x H100 SXM5 if you need the extra VRAM buffer, or 1x H200 for INT4. SSH in using the SSH connection guide.

Step 2: Install Dependencies

bash
# CUDA 12.4+ required
pip install vllm huggingface_hub
export HF_TOKEN="<your_token>"  # replace the placeholder with your Hugging Face access token

Check your CUDA version with nvidia-smi before installing. CUDA 12.4 is recommended for full H200 FP8 throughput. Before installing, check the vLLM release notes to confirm which version added MSA backend support for MiniMax M3, then pin that version or later (e.g. pip install "vllm>=<confirmed-version>"). The MSA attention mechanism requires explicit framework support that may have shipped in a point release after the initial launch.

Step 3: Download Model Weights

Before running this command, verify that MiniMaxAI/MiniMax-M3 is live and publicly accessible on HuggingFace. MiniMax indicated weights would be published within roughly 10 days of the June 1 launch. If the repository is not yet visible, check the MiniMaxAI HuggingFace organization page for the current status and the exact repository name before proceeding.

bash
# Download M3 base weights (confirm the repo exists before running)
huggingface-cli download MiniMaxAI/MiniMax-M3 \
  --local-dir ./minimax-m3 \
  --repo-type model

Also check whether a pre-quantized FP8 or AWQ INT4 variant is available in the MiniMaxAI org. The repository may be gated and require accepting MiniMax's license terms before downloading. You need approximately 230 GB of NVMe storage for the FP8 checkpoint (if available) or approximately 460 GB for BF16 weights.

If a pre-quantized FP8 checkpoint is not yet published, vLLM's on-the-fly quantization via --quantization fp8 handles the conversion from BF16 weights at load time. This adds a few minutes to startup but does not require a separate FP8 checkpoint file.

Step 4: Launch the vLLM Server

bash
python -m vllm.entrypoints.openai.api_server \
  --model ./minimax-m3 \
  --served-model-name minimax-m3 \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --max-num-seqs 16 \
  --port 8000

Flag explanations:

| Flag | Purpose |
| --- | --- |
| --tensor-parallel-size 2 | Splits model across 2 GPUs (set to 4 for 4x H100) |
| --enable-expert-parallel | Distributes MoE experts across GPUs for parallel routing |
| --kv-cache-dtype fp8_e5m2 | Halves KV cache VRAM - essential for long context |
| --enable-chunked-prefill | Prevents long 1M-token prompts from blocking shorter requests |
| --max-model-len 131072 | Context budget - increase toward 1048576 as your VRAM headroom allows |
| --max-num-seqs 16 | Concurrent request slots; reduce if running long-context sessions |

For 4x H100 FP8, set --tensor-parallel-size 4. For INT4 configs, drop --quantization fp8 and load the AWQ variant instead. In some vLLM releases the expert parallelism flag is --moe-expert-parallel-size instead of --enable-expert-parallel. Check your version's flags with python -m vllm.entrypoints.openai.api_server --help | grep expert.

For 1M context inference, incrementally increase --max-model-len. Start at 131072, verify GPU memory headroom with nvidia-smi, then step up to 262144, 524288, and finally 1048576. Each step doubles the potential KV cache memory requirement.
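One way to script the headroom check between steps is to parse nvidia-smi's CSV query output. The sketch below assumes nvidia-smi is on the PATH and reports memory in MiB when nounits is set; the helper names are hypothetical.

```python
import subprocess

# The doubling ladder described above: 131072 -> 262144 -> 524288 -> 1048576.
MAX_MODEL_LEN_STEPS = [131_072 * 2**i for i in range(4)]


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def free_gib(rows: list[tuple[int, int]]) -> float:
    """Total unallocated memory across all GPUs, in GiB."""
    return sum(total - used for used, total in rows) / 1024


def query_gpu_memory() -> str:
    """Run nvidia-smi and return its raw CSV output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(MAX_MODEL_LEN_STEPS)
    print(f"{free_gib(parse_gpu_memory(query_gpu_memory())):.1f} GiB free")
```

Treat the reading as a coarse check: vLLM typically preallocates KV cache blocks up to --gpu-memory-utilization at startup, so the startup log's reported KV cache capacity is often a more direct signal than free memory.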

Step 5: Validate

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m3",
    "messages": [
      {"role": "user", "content": "Write a Python function that merges two sorted linked lists, then write a pytest test for it."}
    ],
    "max_tokens": 512
  }'

Confirm http://localhost:8000/health returns 200 and the response includes a valid Python function and test. Monitor GPU memory with nvidia-smi dmon -s m and token throughput at http://localhost:8000/metrics.
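To turn the /metrics endpoint into a throughput number, one option is to diff a token counter between two scrapes. This sketch parses the Prometheus text format with the standard library only. The counter name vllm:generation_tokens_total is an assumption that may differ between vLLM releases, so check your own /metrics output; lines with a trailing timestamp are not handled.

```python
import time
import urllib.request


def parse_metrics(text: str) -> dict[str, float]:
    """Sum each metric's samples across label sets; comment lines are skipped."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the {label="..."} part
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def tokens_per_second(before: dict[str, float], after: dict[str, float], seconds: float,
                      counter: str = "vllm:generation_tokens_total") -> float:
    """Average generation rate between two scrapes of a monotonically increasing counter."""
    return (after[counter] - before[counter]) / seconds


def scrape(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    """Fetch and parse the metrics endpoint."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))


if __name__ == "__main__":
    first = scrape()
    time.sleep(10)
    print(f"{tokens_per_second(first, scrape(), 10):.0f} tok/s")
```

Run it while the server is under representative load; an idle server reports a rate near zero.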

Deploy MiniMax M3 with SGLang

SGLang is the better choice for M3's multi-turn agentic and long-document workloads. Its RadixAttention prefix cache reuses KV states for repeated tool schemas, system prompts, and shared conversation history, cutting time-to-first-token on subsequent turns by 30-60% when the shared prefix exceeds 4K tokens.

For SGLang to support M3's MSA attention mechanism, you need a release that explicitly includes M3 support. SGLang 0.5.12 supports M2.7's MoE expert parallelism; for M3's MSA, check the SGLang release notes for the version that added MSA backend support (expected in SGLang 0.6.x or later, released after M3's June 2026 launch):

bash
pip install 'sglang[all]>=0.6.0'

python -m sglang.launch_server \
  --model-path ./minimax-m3 \
  --tp 2 \
  --quantization fp8 \
  --enable-moe-ep \
  --context-length 131072 \
  --port 30000

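The --enable-moe-ep flag activates MoE expert parallelism, which places whole experts on individual GPUs. Without it, SGLang uses tensor parallelism for the MoE layers, which shards every expert's weight matrices across all GPUs rather than keeping each expert whole on one GPU. Expert parallelism flag names have changed across SGLang releases (some earlier releases used --enable-ep-moe or --ep-size), so confirm the spelling with python -m sglang.launch_server --help on your installed version.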

For M3's agentic workloads where every request carries the same tool definitions and system prompt, SGLang's RadixAttention hits on every request after the first. With a 4K-token shared tool schema across 100 concurrent agent sessions, the shared prefix is prefilled once and reused, so the other 99 sessions skip that 4K-token prefill entirely; within a session, each later turn also reuses the cached conversation history. For production SGLang configuration details including monitoring, load balancing, and prefix caching tuning, see the SGLang production deployment guide.
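The size of the prefix-cache saving can be estimated with simple token accounting. This is an idealized sketch that assumes every request shares an identical prefix and that the cached prefix is never evicted; the function name and the 512-token unique suffix are illustrative.

```python
def prefill_tokens(shared_prefix: int, unique_per_request: int,
                   requests: int, prefix_cached: bool) -> int:
    """Total tokens that must be prefilled across all requests."""
    if prefix_cached:
        # The shared prefix is computed once; each request prefills only its own suffix.
        return shared_prefix + unique_per_request * requests
    return (shared_prefix + unique_per_request) * requests


without_cache = prefill_tokens(4096, 512, 100, prefix_cached=False)
with_cache = prefill_tokens(4096, 512, 100, prefix_cached=True)
print(without_cache)               # 460800
print(with_cache)                  # 55296
print(without_cache - with_cache)  # 405504, i.e. 99 skipped prefills of the 4096-token prefix
```

In practice the hit rate is lower when cache memory is tight or when prompts diverge early, so measure time-to-first-token on your own traffic.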

Native Multimodal Serving: Image and Video Understanding

M3's multimodality is not a vision adapter patched onto a text model. The image and video understanding capability lives in the base checkpoint and is accessible through the standard /v1/chat/completions endpoint using the OpenAI vision message format.

Image Input

Pass image data as a content array in the user message. vLLM supports both URL and base64-encoded image inputs:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m3",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,<base64-encoded-image>"
            }
          },
          {
            "type": "text",
            "text": "Describe what you see in this image in detail."
          }
        ]
      }
    ],
    "max_tokens": 512
  }'

For production image workloads, base64 encoding keeps everything in a single request body. For large images, be aware that the image token count contributes to your --max-model-len budget.
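For application code, the same request body can be assembled in Python. A minimal sketch using only the standard library, with a hypothetical helper name; it builds the message shown in the curl example and leaves sending it to whichever client you use.

```python
import base64


def image_message(image_bytes: bytes, prompt: str, mime_type: str = "image/jpeg") -> dict:
    """Build an OpenAI-style vision message with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
            {"type": "text", "text": prompt},
        ],
    }


# Example with placeholder bytes; in practice read them from a file opened in "rb" mode.
message = image_message(b"abc", "Describe what you see in this image in detail.")
print(message["content"][0]["image_url"]["url"])  # data:image/jpeg;base64,YWJj
```

Base64 inflates the payload by about a third, so check your server's request size limits before sending very large images this way.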

Video Input

Video understanding in M3 works by passing video frames as a sequence of image inputs within the same request. Pass frames sampled at your desired temporal resolution (typically 1-2 fps for general understanding, higher for action detection). The model processes the frame sequence as a temporally ordered visual context alongside any text instructions.

Whether M3 supports a dedicated video MIME type or requires frame decomposition depends on the serving framework's implementation. With vLLM, use the multi-image message format and pass frames as an ordered list of image_url items. Verify the supported video input path for your specific vLLM release from the release notes.
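A sketch of the frame-decomposition path: pick frame indices at a target sampling rate, then assemble the ordered image_url list. Frame extraction itself (for example with ffmpeg) is out of scope here, and both helper names are hypothetical.

```python
def sample_frame_indices(total_frames: int, source_fps: float, target_fps: float) -> list[int]:
    """Indices of frames to keep when downsampling from source_fps to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = max(source_fps / target_fps, 1.0)  # never sample more often than every frame
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


def frames_to_content(frame_urls: list[str], prompt: str) -> list[dict]:
    """Ordered image_url items followed by the text instruction."""
    content = [{"type": "image_url", "image_url": {"url": url}} for url in frame_urls]
    content.append({"type": "text", "text": prompt})
    return content


# A 10-second clip at 30 fps sampled at 1 fps keeps 10 frames.
print(sample_frame_indices(300, 30, 1))
```

Every sampled frame consumes visual tokens, so the sampling rate directly affects how much of the --max-model-len budget a clip uses.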

Multimodal KV Cache

Image tokens occupy KV cache space the same as text tokens, but at a higher per-token density. A single full-resolution image typically maps to 256-1024 visual tokens depending on how M3's vision encoder tiles the input. For long-context requests that combine a large image with a 100K text context, budget accordingly: a 1000-token image in a 100K text context adds 1% to the effective context length but may add significantly more to the KV cache if the attention pattern is different from pure text.

Monitor nvidia-smi dmon -s m when serving mixed text-image workloads. If you see KV cache memory pressure, reduce --max-model-len or lower --max-num-seqs before increasing image resolution.

Agentic Coding Workloads

M3's 59.0% SWE-Bench Pro score is the headline. For context: this benchmark measures multi-step software engineering task completion across real GitHub issues with real test suites. The score exceeds GPT-5.5 and Gemini 3.1 Pro; some other closed models like Claude Opus 4.8 (~69.2%) score higher, but M3 leads the open-weight field by a substantial margin. For the self-hosted use case, this means frontier-class coding capability on hardware you control, at costs that scale with usage rather than per-token charges.

The self-hosted advantage compounds with M3's 1M context window. A 200K-token codebase fits in a single request context, which means the model reads the full project before generating a patch rather than working from retrieval results. For multi-file refactors, interface changes, and bugs with distant root causes, full-context inference produces more accurate patches with fewer iterations than retrieval-augmented approaches.

Tool Definitions

The agentic tool schema is identical to M2.7's pattern:

json
[
  {
    "type": "function",
    "function": {
      "name": "code_exec",
      "description": "Execute Python code in a sandboxed environment and return stdout, stderr, and exit code.",
      "parameters": {
        "type": "object",
        "properties": {
          "code": {"type": "string", "description": "Python code to execute"},
          "timeout": {"type": "integer", "description": "Execution timeout in seconds", "default": 30}
        },
        "required": ["code"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "file_write",
      "description": "Write content to a file, creating parent directories as needed.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {"type": "string", "description": "Relative file path"},
          "content": {"type": "string", "description": "File content to write"}
        },
        "required": ["path", "content"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "test_runner",
      "description": "Run the test suite and return a summary of passes, failures, and error messages.",
      "parameters": {
        "type": "object",
        "properties": {
          "test_file": {"type": "string", "description": "Path to the test file to run"},
          "framework": {"type": "string", "enum": ["pytest", "unittest"], "default": "pytest"}
        },
        "required": ["test_file"]
      }
    }
  }
]

Multi-Turn Agent Loop

python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MAX_ITERATIONS = 5

def dispatch_tool(name: str, args: dict) -> dict:
    raise NotImplementedError(
        f"dispatch_tool: '{name}' is not implemented. "
        "Wire up code_exec, file_write, and test_runner before running this loop."
    )

def is_stalled(messages: list) -> bool:
    raise NotImplementedError("Implement per /blog/deploy-minimax-m2-7-gpu-cloud/")

def run_agentic_loop(task: str, tools: list) -> str:
    messages = [{"role": "user", "content": task}]
    
    for iteration in range(MAX_ITERATIONS):
        response = client.chat.completions.create(
            model="minimax-m3",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            max_tokens=4096,
        )
        
        choice = response.choices[0]
        messages.append(choice.message.model_dump(exclude_none=True))
        
        if not choice.message.tool_calls:
            return choice.message.content or ''
        
        if choice.finish_reason == "length":
            return "Error: response truncated before tool arguments were complete."
        
        for tool_call in choice.message.tool_calls:
            try:
                args = json.loads(tool_call.function.arguments)
            except json.JSONDecodeError:
                return "Error: malformed tool call arguments."
            result = dispatch_tool(tool_call.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })
        
        if iteration >= 2 and is_stalled(messages):
            return "Agent stalled: same test failures persisting. Review manually."
    
    response = client.chat.completions.create(
        model="minimax-m3",
        messages=messages,
        tools=tools,
        tool_choice="none",
        max_tokens=4096,
    )
    return response.choices[0].message.content or ''

For the stall detection and sandboxed code execution patterns, the M2.7 deployment guide covers the full implementation including Docker-based isolation and the is_stalled() function.
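As a placeholder until you adopt that implementation, here is one possible stall check. It is not the M2.7 guide's version: it is a hypothetical heuristic that flags a stall when the last few test_runner results are byte-for-byte identical, and it assumes messages are plain dicts in the shape the loop above appends.

```python
def is_stalled(messages: list, tool_name: str = "test_runner", window: int = 2) -> bool:
    """Return True when the last `window` outputs of `tool_name` are identical."""
    # Map tool_call ids to tool names using the assistant messages.
    names_by_id = {}
    for message in messages:
        for call in message.get("tool_calls") or []:
            names_by_id[call["id"]] = call["function"]["name"]

    # Collect outputs of the tool of interest, in conversation order.
    outputs = [
        message["content"]
        for message in messages
        if message.get("role") == "tool"
        and names_by_id.get(message.get("tool_call_id")) == tool_name
    ]
    if len(outputs) < window:
        return False
    recent = outputs[-window:]
    return all(output == recent[0] for output in recent)
```

Exact string comparison is deliberately strict: test output that embeds timings or temp paths will never match, so you may want to normalize results (for example, compare only the set of failing test names) before using this in a real loop.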

The 1M Context Advantage

For large codebase tasks, pass the full repository as the initial context before the task description. With --max-model-len 1048576, a 200K-token codebase occupies roughly 20% of the available context, leaving 800K tokens for the conversation history, generated patches, and test output across many iterations. This is qualitatively different from M2.7's 200K context or other models with shorter windows: you are not choosing which files to retrieve, you are giving the model everything at once.

Cost: Spheron vs MiniMax API vs OpenRouter

Self-hosting M3 makes economic sense above a throughput threshold. Below that threshold, the hosted API is cheaper because you pay only for tokens used, not for idle GPU time.

Approximate cost per 1M output tokens at typical sustained throughput:

| Provider | Model | Price per 1M input | Price per 1M output | Notes |
| --- | --- | --- | --- | --- |
| MiniMax platform | M3 | ~$0.30 (promo), ~$0.60 (standard) | ~$1.20 (promo), ~$2.40 (standard) | Hosted API, per-token |
| OpenRouter | M3 | ~$0.30 (promo), ~$0.60 (standard) | ~$1.20 (promo), ~$2.40 (standard) | Aggregated, per-token |
| Spheron (2x H200 FP8, spot) | M3 self-hosted | ~$1.26 | ~$1.26 | $3.64/hr spot at ~800 tok/s sustained |
| Spheron (4x H100 FP8, spot) | M3 self-hosted | ~$1.59 | ~$1.59 | $5.72/hr spot at ~1,000 tok/s sustained |
| Spheron (1x H200 INT4, spot) | M3 self-hosted | ~$0.72 | ~$0.72 | $1.82/hr spot at ~700 tok/s, INT4 quality |

Self-hosted cost per 1M tokens = (price_per_hour / throughput_toks_per_sec / 3600) * 1_000_000. At $3.64/hr with ~800 tok/s, that is $3.64 / (800 * 3600) * 1,000,000 = $1.26/M tokens. All Spheron figures above use spot rates; on-demand rates are roughly 2.7x higher ($4.84 vs $1.82 per H200, $3.92 vs $1.43 per H100), which raises cost per token by the same factor.
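The formula is easy to re-run with your own measured throughput. A minimal sketch with a hypothetical function name; the prices and throughput figures are the estimates quoted in this guide.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1M tokens at a sustained throughput."""
    return price_per_hour / (tokens_per_second * 3600) * 1_000_000


print(round(cost_per_million_tokens(3.64, 800), 2))   # 2x H200 FP8 spot  -> 1.26
print(round(cost_per_million_tokens(5.72, 1000), 2))  # 4x H100 FP8 spot  -> 1.59
print(round(cost_per_million_tokens(1.82, 700), 2))   # 1x H200 INT4 spot -> 0.72
print(round(cost_per_million_tokens(9.68, 800), 2))   # 2x H200 on-demand -> 3.36
```

The figure assumes the GPUs are busy the whole hour; idle time raises the effective cost per token proportionally.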

Throughput estimates above are approximations based on similar model classes on the same hardware. Actual M3 throughput should be measured on your specific configuration and workload.

The breakeven picture depends on which API pricing tier you are comparing against. At the launch promo rate (~$1.20/M output tokens), self-hosting FP8 on 2x H200 spot (~$1.26/M) is at near-parity and only wins once promo pricing ends. At standard API rates (~$2.40/M output), the crossover point is roughly 420 tokens/second of sustained throughput on a 2x H200 spot config. AWQ INT4 on 1x H200 spot (~$0.72/M) already beats the promo API rate outright, with quality tradeoffs.

For workloads below 100 tok/s average throughput, the hosted API is cheaper at any pricing tier. For batch jobs, overnight processing, or shared team deployments with concurrent users, self-hosting wins on cost once average throughput clears the crossover.
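The crossover throughput comes from setting the self-hosted cost per 1M tokens equal to the API price and solving for tokens per second. A short sketch with a hypothetical function name, using the prices quoted in this guide:

```python
def breakeven_tokens_per_second(price_per_hour: float, api_price_per_million: float) -> float:
    """Sustained throughput at which self-hosting matches the API's per-token price."""
    return price_per_hour * 1_000_000 / (api_price_per_million * 3600)


print(round(breakeven_tokens_per_second(3.64, 2.40)))  # 2x H200 spot vs standard rate -> 421
print(round(breakeven_tokens_per_second(3.64, 1.20)))  # 2x H200 spot vs promo rate    -> 843
print(round(breakeven_tokens_per_second(1.82, 2.40)))  # 1x H200 spot vs standard rate -> 211
```

The comparison uses the API's output-token price only; workloads dominated by long prompts should weigh the cheaper input-token rate as well.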

Long-Context Tuning and Production Checklist

VRAM vs Context Budget Tradeoff

--max-model-len is the single most important flag for M3. Setting it to the model's theoretical maximum (1,048,576) reserves maximum KV cache allocation, which may leave insufficient VRAM for model weights on tighter configs. Set it to the 90th percentile context length of your actual workload:

| --max-model-len Setting | KV Cache VRAM (FP8, est.) | Remaining for Concurrent Requests |
| --- | --- | --- |
| 32,768 | ~4 GB | ~48 GB headroom (2x H200 FP8) |
| 131,072 | ~16 GB | ~36 GB headroom |
| 524,288 | ~64 GB | Negative on 2x H200 FP8 - needs 4x H100 FP8 or larger |
| 1,048,576 | ~120 GB | Negative on 2x H200 - needs 4x H200 or 8x H100 |

For true 1M context single-request inference, you need at minimum 4x H200 (564 GB total) or 8x H100 (640 GB total) to hold both model weights and the full 1M context KV cache with FP8 precision.
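To pick a --max-model-len from real traffic rather than the theoretical limit, one approach is a nearest-rank percentile over logged request lengths. This is a sketch under stated assumptions: the function name is hypothetical, the lengths should be prompt plus generated tokens per request, and rounding up to a multiple of 1024 is a convenience, not a vLLM requirement.

```python
import math


def recommended_max_model_len(request_lengths: list[int], percentile: int = 90,
                              multiple: int = 1024, hard_limit: int = 1_048_576) -> int:
    """Nearest-rank percentile of request lengths, rounded up and capped at the model limit."""
    if not request_lengths:
        raise ValueError("need at least one observed request length")
    ordered = sorted(request_lengths)
    rank = max(math.ceil(percentile * len(ordered) / 100) - 1, 0)
    rounded = math.ceil(ordered[rank] / multiple) * multiple
    return min(rounded, hard_limit)


# Ten logged requests of 1,000 to 10,000 tokens: the 90th percentile is 9,000 tokens.
lengths = [1000 * i for i in range(1, 11)]
print(recommended_max_model_len(lengths))  # 9216
```

Requests longer than the chosen limit are rejected by the server, so route the top 10% to a separate long-context deployment or truncate them upstream.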

Throughput and Production Checklist

  • FP8 KV cache enabled (--kv-cache-dtype fp8_e5m2)
  • --max-model-len set to workload maximum, not theoretical limit
  • --enable-chunked-prefill active for mixed-length request queues
  • Expert parallelism enabled (--enable-expert-parallel or --moe-expert-parallel-size)
  • Prometheus metrics endpoint enabled and scraped (/metrics)
  • NVMe storage for weights (~230 GB for FP8 checkpoint, ~460 GB for BF16)
  • GPU memory headroom verified with nvidia-smi before serving production traffic
  • Spot vs on-demand decision logged: spot for batch, on-demand for SLA workloads

For detailed KV cache tuning strategies including FP8 quantization, --max-num-batched-tokens, and cache eviction policies, see the KV cache optimization guide.


MiniMax M3's MSA cuts 1M-context inference cost to a fraction of what full-attention alternatives require. H200 nodes on Spheron give you the VRAM to run it at FP8 precision without multi-node networking overhead.



Quick Setup Guide

  1. Calculate VRAM requirements and choose quantization

    FP8 precision requires 2x H200 SXM5 (282 GB) or 4x H100 SXM5 (320 GB). AWQ INT4 fits in 1x H200 SXM5 (141 GB) or 2x H100 SXM5 (160 GB). For 1M context inference, add FP8 KV cache via --kv-cache-dtype fp8_e5m2 to halve KV cache memory overhead.

  2. Provision a GPU node on Spheron

    Select your GPU configuration in app.spheron.ai. Choose spot for batch or bursty workloads, on-demand for 24/7 serving. SSH into the instance using the connection guide at https://docs.spheron.ai/connecting/ssh-connection.

  3. Install vLLM and download M3 weights

    Install vLLM via pip after confirming the release notes include MSA support for MiniMax M3. Verify CUDA 12.4+ with nvidia-smi. Confirm MiniMaxAI/MiniMax-M3 is published on HuggingFace before downloading, then run huggingface-cli with HF_TOKEN set.

  4. Launch the vLLM inference server

    Run python -m vllm.entrypoints.openai.api_server with --model pointing to the M3 weights directory, --tensor-parallel-size set to your GPU count, --quantization fp8, --kv-cache-dtype fp8_e5m2 for long-context efficiency, and --enable-chunked-prefill for mixed short and long request queues.

  5. Validate multimodal input

    Send a test request with an image payload to /v1/chat/completions using the vision message format with content as an array containing text and image_url items. Confirm the model returns a coherent description before serving production traffic.

  6. Monitor and tune for long-context workloads

    Watch GPU memory with nvidia-smi dmon. For 1M context sessions, enable FP8 KV cache and set --max-model-len to the actual workload maximum, not the theoretical limit, to preserve VRAM headroom for concurrent requests.


Frequently Asked Questions

MiniMax M3 has 229.9B total parameters. At FP8 precision (~230 GB), it fits in 2x H200 SXM5 (282 GB total) or 4x H100 SXM5 (320 GB total). AWQ INT4 (~115 GB) fits in 1x H200 SXM5 (141 GB) or 2x H100 SXM5 (160 GB). BF16 (~460 GB) requires 4x H200 SXM5 or 6x H100 SXM5. True 1M-context inference requires at minimum 4x H200 or 8x H100 to hold both the model weights and the ~120 GB FP8 KV cache.

vLLM can serve MiniMax M3 once your installed version's release notes confirm MSA backend support; check them before installing. Use --tensor-parallel-size matching your GPU count. Set --max-model-len up to 1048576 for full 1M context, keeping in mind that KV cache grows linearly with context length. Enable --kv-cache-dtype fp8_e5m2 for memory-efficient long-context inference.

MSA is MiniMax M3's attention mechanism that delivers more than 9x prefill and more than 15x decode speedup at 1M context versus MiniMax's previous-generation M2 model, reducing per-token compute to 1/20th of M2's baseline at long context lengths. This makes 1M-context inference affordable on mid-tier multi-GPU configs rather than requiring massive clusters.

MiniMax published M3's open-weight SWE-Bench Pro score at 59.0%, which exceeds GPT-5.5 and Gemini 3.1 Pro on the same benchmark. M3 also ships native multimodality and a 1M-token context window in the base open-weight checkpoint.

FP8 M3 on 2x H200 SXM5 on Spheron is the most capable configuration covered here for sustained serving. For budget-constrained workloads, AWQ INT4 on 1x H200 SXM5 at $1.82/hr spot ($4.84/hr on-demand) covers the full model footprint with headroom for moderate context lengths. Self-hosting FP8 on 2x H200 spot (~$1.26/M tokens) is cost-competitive with hosted API standard rates (~$2.40/M output) above roughly 420 tokens/second of sustained throughput. AWQ INT4 on 1x H200 spot (~$0.72/M) beats even launch promo API rates (~$1.20/M output).
