Self-Host and Fine-Tune Microsoft MAI Models on GPU Cloud: The 9-Model Family Guide (2026)

Microsoft's MAI family covers five distinct AI modalities in a single product line: reasoning, code generation, image understanding, voice, and speech transcription. The catch is that only two of the nine models expose downloadable weights you can actually run and fine-tune.

This guide breaks down each model in the family, clarifies exactly which ones are open-weight versus API-only, and provides the full GPU sizing and fine-tuning walkthrough for the models you can actually deploy yourself.

The 9 MAI Models: What Each One Does

Microsoft's MAI lineup expanded significantly through late 2025 and the first half of 2026. Here is what each model does and who it is for.

MAI-DS-R1 is Microsoft's open-weight reasoning model, built on the DeepSeek V3 architecture and fine-tuned from DeepSeek R1. It has 671 billion total parameters using sparse Mixture-of-Experts routing. Microsoft released it under an MIT license with full weights on HuggingFace. It handles multi-step reasoning, math, and code with extended chain-of-thought output.

MAI-DS-R1-FP8 is the FP8-quantized version of MAI-DS-R1. Same architecture, same license, roughly half the VRAM footprint of BF16. It targets teams that want to skip the manual FP8 quantization step and serve the FP8 checkpoint directly. In practice that means 8x H200 hardware: 8x H100 (640 GB combined) falls short of the ~671 GB weight footprint.

MAI-Thinking-1 is Microsoft's flagship proprietary reasoning model, announced at Build 2026. It scores 97.0% on AIME 2025 and is preferred to Claude Sonnet 4.6 in blind human side-by-side evaluations. Microsoft describes it as a large sparse MoE with 35B active parameters (the total parameter count is undisclosed) and distributes it as an API service only. For the full deployment walkthrough, see the MAI-Thinking-1 deployment guide.

MAI-Code-1-Flash is Microsoft's compact coding model optimized for IDE agents and code review pipelines. Microsoft reports it outperforms Claude Haiku 4.5 on SWE-Bench Pro (51.2% vs 35.2%) and across core coding benchmarks while being cheaper to run. That cost efficiency matters for inner-loop agentic workflows where each coding step generates a response. It is API-only via Azure AI Foundry.

MAI-Image-2.5 is Microsoft's multimodal image model for image generation and visual understanding tasks. It handles both text-to-image generation and image understanding (visual Q&A, document parsing, diagram interpretation). API-only.

MAI-Image-2.5-Flash is the faster, lower-latency variant of MAI-Image-2.5. Same capabilities, tuned for speed over maximum quality. Suitable for real-time image generation pipelines where latency matters more than peak quality. API-only.

MAI-Transcribe-1.5 handles speech-to-text transcription with multilingual support. Positioned for meeting transcription, audio indexing, and voice-to-text pipelines. API-only, no publicly available weights.

MAI-Voice-2 covers voice synthesis and voice chat. It enables conversational voice interfaces with real-time streaming. API-only, no weights released.

MAI-Voice-2-Flash is a faster variant of MAI-Voice-2, announced at Build 2026 as "coming soon." Positioned for lower-latency voice applications. No release date confirmed at time of writing.

Summary table:

Model	Modality	Architecture	Primary use case
MAI-DS-R1	Text / reasoning	DeepSeek V3 (671B MoE)	Chain-of-thought, math, code
MAI-DS-R1-FP8	Text / reasoning	DeepSeek V3 (FP8)	Same as above, lower VRAM
MAI-Thinking-1	Text / reasoning	Large sparse MoE (params undisclosed)	Extended reasoning, AIME-class math
MAI-Code-1-Flash	Code	Undisclosed	IDE agents, code review, SWE tasks
MAI-Image-2.5	Image	Undisclosed	Image generation and understanding
MAI-Image-2.5-Flash	Image	Undisclosed	Low-latency image generation
MAI-Transcribe-1.5	Speech	Undisclosed	Speech-to-text, meeting transcription
MAI-Voice-2	Voice	Undisclosed	Voice synthesis, voice chat
MAI-Voice-2-Flash	Voice	Undisclosed	Low-latency voice synthesis (coming soon)

Which MAI Models You Can Actually Self-Host

The weight availability check is the first thing to do before planning any deployment. As of June 2026, here is the full picture:

Model	Weights available?	License	HF slug
MAI-DS-R1	Yes	MIT	`microsoft/MAI-DS-R1`
MAI-DS-R1-FP8	Yes	MIT	`microsoft/MAI-DS-R1-FP8`
MAI-Thinking-1	No	N/A (API-only)	No HuggingFace release
MAI-Code-1-Flash	No	N/A (API-only)	No HuggingFace release
MAI-Image-2.5	No	N/A (API-only)	No HuggingFace release
MAI-Image-2.5-Flash	No	N/A (API-only)	No HuggingFace release
MAI-Transcribe-1.5	No	N/A (API-only)	No HuggingFace release
MAI-Voice-2	No	N/A (API-only)	No HuggingFace release
MAI-Voice-2-Flash	Not yet	N/A (coming soon)	Announced at Build 2026, not released

MAI-DS-R1 and MAI-DS-R1-FP8 are fully self-hostable. The MIT license covers commercial use without restrictions on inference, fine-tuning, or redistribution of adapter weights.

For the API-only models, Microsoft provides access through Azure AI Foundry. MAI-Thinking-1 is in private preview as of June 2026 (broader access not yet confirmed). MAI-Code-1-Flash, MAI-Image-2.5, and MAI-Transcribe-1.5 are available through the Azure AI model catalog. MAI-Voice-2 is accessible via Azure Cognitive Services.

The practical implication: if you need full inference control, on-premises deployment, data residency compliance, or fine-tuning capability, MAI-DS-R1 is the only model in the family that delivers it.

GPU and VRAM Requirements Per Model

For the open-weight models (MAI-DS-R1 and MAI-DS-R1-FP8), VRAM requirements are based on confirmed architecture. For API-only models, the table shows minimum hardware if weights were hypothetically released, based on the capabilities Microsoft has described.

MAI-DS-R1 and MAI-DS-R1-FP8

MAI-DS-R1 uses the DeepSeek V3 architecture with 671B total parameters. It is a sparse MoE model, so each forward pass activates only a subset of experts, but all expert weights must reside in VRAM.

Variant	Total Params	VRAM BF16	VRAM FP8	VRAM INT4	Recommended Config
MAI-DS-R1 (BF16)	671B	~1,342 GB	N/A	N/A	12x H200 SXM5 or 20x H100 SXM5
MAI-DS-R1 (FP8)	671B	N/A	~671 GB	N/A	8x H200 SXM5 (8x H100 SXM5 is ~31 GB short on weights alone)
MAI-DS-R1-FP8 (pre-quantized)	671B	N/A	~671 GB	N/A	8x H200 SXM5 (8x H100 SXM5 is ~31 GB short on weights alone)
MAI-DS-R1 (INT4 QLoRA)	671B	N/A	N/A	~336 GB	4x H100 SXM5 (tight/borderline)

For practical serving on 4x H100 SXM5 (320 GB combined), INT4 quantization is the only viable option, and even then the weight-only footprint (~336 GB) slightly exceeds available VRAM, so BitsAndBytes will overflow some state to CPU RAM. The pre-quantized microsoft/MAI-DS-R1-FP8 (~671 GB) does not fit on 4x H100 (320 GB) or 4x H200 (564 GB). To serve the FP8 variant without additional quantization, use at least 8x H200 SXM5 (1,128 GB combined, comfortable headroom for weights plus KV cache). 8x H100 SXM5 (640 GB combined) is about 31 GB short of the weights alone, so it only works with CPU offload or additional GPUs, on top of a reduced context length and KV cache limits.

The minimum viable single-GPU config does not exist for this model. For the full VRAM calculation methodology including KV cache headroom, see GPU memory requirements for LLMs.
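The weight-only figures in the table follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes and ignoring KV cache, activations, and framework overhead (the function names are illustrative, not from any library):

```python
# Weight-only VRAM estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and runtime overhead, so treat results as a floor.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_footprint_gb(total_params_b: float, precision: str) -> float:
    """Weight footprint in GB for a model with total_params_b billion parameters."""
    return total_params_b * BYTES_PER_PARAM[precision]


def headroom_gb(total_params_b: float, precision: str, gpus: int, gb_per_gpu: int) -> float:
    """Combined VRAM minus weight footprint; negative means the weights alone do not fit."""
    return gpus * gb_per_gpu - weight_footprint_gb(total_params_b, precision)


if __name__ == "__main__":
    for precision in ("bf16", "fp8", "int4"):
        print(precision, weight_footprint_gb(671, precision), "GB")
    print("4x H100, INT4 headroom:", headroom_gb(671, "int4", 4, 80), "GB")
    print("8x H100, FP8 headroom:", headroom_gb(671, "fp8", 8, 80), "GB")
    print("8x H200, FP8 headroom:", headroom_gb(671, "fp8", 8, 141), "GB")
```

A negative headroom value is the amount that has to live somewhere other than GPU memory (CPU offload or more GPUs) before any KV cache is allocated.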

API-Only Models: Estimated Hardware (Hypothetical)

Model	Estimated Params	Min VRAM Estimate	Notes
MAI-Code-1-Flash	~7-8B (est.)	~16 GB FP16	Based on SWE-bench comparison to Haiku-class models
MAI-Image-2.5	Unknown	Unknown	Image diffusion requirements differ from transformer text models
MAI-Transcribe-1.5	~1-3B (est.)	~6 GB FP16	Based on ASR model precedent

These estimates are based on benchmark comparisons and category patterns, not confirmed architecture specs. Do not provision hardware based on estimates for API-only models.

Fine-Tuning MAI-DS-R1 with LoRA and PEFT

This section applies to MAI-DS-R1 only (MAI-DS-R1-FP8 is the same model in a serving-oriented precision). The other seven models do not have open weights.

MAI-DS-R1 uses the deepseek_v3 model type, which is a sparse MoE transformer with multi-head latent attention (MLA) and the DeepSeek V3 expert-routing mechanism. The fine-tuning patterns are identical to DeepSeek R1 fine-tuning.

Choosing a Fine-Tuning Method

For a 671B MoE model, full fine-tuning is not practical for most teams. The standard path is QLoRA (quantized LoRA) with 4-bit NF4 loading via BitsAndBytes. This reduces the effective VRAM footprint from ~671 GB (FP8) to ~336 GB (INT4). A 4x H100 SXM5 setup (320 GB combined) is tight, slightly under the weight-only footprint, so BitsAndBytes will spill some state to CPU RAM. Gradient checkpointing helps reduce peak activation memory on top of that.

For an overview of LoRA rank selection, QLoRA tradeoffs, and newer alternatives like DoRA, see the fine-tuning framework comparison and DoRA and advanced PEFT methods.

The key decision for MAI-DS-R1 is whether you target attention layers only (safer, lower LoRA parameter count) or also target the MLP gate and projection layers (broader adaptation, more VRAM for optimizer states):

Attention-only LoRA (q_proj, k_proj, v_proj, o_proj): ~500M trainable parameters at rank 16. Conservative VRAM budget.
Full MoE LoRA (attention + gate_proj, up_proj, down_proj): ~2-4B trainable parameters at rank 16. Stronger adaptation for task-specific fine-tuning.

For most code or instruction fine-tuning tasks, attention-only LoRA produces good results at lower hardware cost. For domain-specific reasoning adaptation, full MoE LoRA is worth the VRAM overhead.
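For a sense of where trainable-parameter counts come from: LoRA adds two low-rank matrices to each adapted linear layer, so a layer contributes r × (d_in + d_out) parameters. A minimal sketch of that count; the layer shapes below are placeholders for illustration, not MAI-DS-R1's actual dimensions:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def total_lora_params(layer_shapes, rank: int) -> int:
    """Sum LoRA parameters over an iterable of (d_in, d_out) layer shapes."""
    return sum(lora_params(d_in, d_out, rank) for d_in, d_out in layer_shapes)


if __name__ == "__main__":
    # Hypothetical shapes: four square 4096-wide projections in each of 32 layers.
    shapes = [(4096, 4096)] * 4 * 32
    print(total_lora_params(shapes, rank=16))
```

The count scales linearly with rank and with the number of adapted modules, which is why adding the MLP projections of every expert raises the total so sharply compared with attention-only targeting.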

Dataset Prep

MAI-DS-R1 follows the DeepSeek instruction format. Use the standard chat template from the tokenizer:

python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/MAI-DS-R1")

def format_sample(instruction, output):
    messages = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": output}
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)

Token efficiency note: MAI-DS-R1 generates extended reasoning traces before final answers, similar to DeepSeek R1. If you are fine-tuning for a task that does not require chain-of-thought output, format your training data to include only the final answer in the assistant turn, which teaches the model to answer directly and keeps outputs short. If you want to preserve the reasoning behavior, keep the reasoning trace in the assistant turn; answer-only data can inadvertently train the model to skip reasoning.
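If you need answer-only targets from outputs that already contain a trace, one option is to strip the reasoning block before formatting. A minimal sketch, assuming traces are wrapped in `<think>...</think>` tags as in DeepSeek R1 outputs; verify the delimiter against the MAI-DS-R1 chat template before relying on it. `strip_reasoning` is a hypothetical helper, not part of any library:

```python
import re

# Assumption: reasoning traces are delimited by <think>...</think>, as in DeepSeek R1.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_reasoning(output: str) -> str:
    """Remove any <think>...</think> block and return only the final answer text."""
    return THINK_BLOCK.sub("", output).strip()


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4 because...</think>\nThe answer is 4."
    print(strip_reasoning(raw))
```

The cleaned string can then be passed as the `output` argument of `format_sample`.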

QLoRA Fine-Tuning on 4x H100 SXM5

The following setup uses PEFT and BitsAndBytes directly, since Unsloth may not yet recognize the deepseek_v3 model type (check unsloth.models compatibility before using the Unsloth path). If Unsloth supports your version, the training speed improvement is 2-5x, making it worth checking first.

Check Unsloth compatibility:

bash

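python3 -c "
# SUPPORTED_MODELS is not a guaranteed public symbol in every Unsloth release;
# if the import or lookup fails for any reason, this prints False.
try:
    from unsloth.models._utils import SUPPORTED_MODELS
    print(any('deepseek' in m.lower() for m in SUPPORTED_MODELS))
except Exception:
    print(False)
"

If that prints False, use the PEFT + BitsAndBytes path: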

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load in 4-bit on 4x H100 (device_map="auto" distributes across GPUs)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/MAI-DS-R1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/MAI-DS-R1")

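# LoRA config targeting attention and MLP projection layers (the "full MoE" option).
# PEFT only adapts modules whose names match. DeepSeek V3's MLA attention may expose
# its projections under other names (for example q_a_proj, q_b_proj, kv_a_proj_with_mqa,
# kv_b_proj), so confirm the names with model.named_modules() before training.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Approximate output: trainable params: ~2.4B || all params: 671B || trainable%: 0.36%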

For the training loop, use TRL's SFTTrainer with gradient checkpointing enabled:

python

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./mai-ds-r1-adapter",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
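    max_seq_length=4096,  # newer TRL releases rename this to max_length
    packing=False,
)

# your_dataset is a placeholder: a datasets.Dataset of samples built with format_sample
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL releases take processing_class= instead
    train_dataset=your_dataset,
    args=training_args,
)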

trainer.train()
trainer.save_model("./mai-ds-r1-adapter")

Hardware requirement: This setup needs 4x H100 SXM5 (borderline at 320 GB; BitsAndBytes will spill some state to CPU RAM) or 4x H200 SXM5 (564 GB, more comfortable for INT4 training state). device_map="auto" distributes the model's layers across all available GPUs. For distributed training with DeepSpeed ZeRO-3, see the LLM fine-tuning guide for the full ZeRO-3 config, which further reduces per-GPU memory usage by partitioning parameters, gradients, and optimizer states.
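To budget training time and checkpoints, it helps to translate the SFTConfig values into optimizer steps. A rough sketch, assuming a single training process (as with `device_map="auto"`), no packing, and a hypothetical 5,000-example dataset; the trainer's exact step accounting may differ slightly:

```python
import math


def optimizer_steps(num_examples: int, epochs: int, per_device_batch: int,
                    grad_accum: int, num_processes: int = 1) -> int:
    """Approximate optimizer steps: batches per epoch at the effective batch size, times epochs."""
    effective_batch = per_device_batch * grad_accum * num_processes
    return math.ceil(num_examples / effective_batch) * epochs


if __name__ == "__main__":
    steps = optimizer_steps(num_examples=5000, epochs=2, per_device_batch=1, grad_accum=8)
    warmup = math.ceil(steps * 0.05)  # warmup_ratio=0.05
    checkpoints = steps // 500        # save_steps=500
    print(steps, warmup, checkpoints)
```

With only a couple of intermediate checkpoints at `save_steps=500`, a lower value may be worth the disk space on spot instances where preemption can cost hours of progress.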

Serving the Adapter with vLLM

After training, the adapter directory contains LoRA weights that vLLM can load alongside the base model:

bash

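# Assumes an 8x H200 node, which fits the ~671 GB FP8 checkpoint with KV cache headroom.
# vLLM normally reads the quantization scheme from the checkpoint config, so no --dtype
# or --quantization override is passed (fp8 is not a valid --dtype value).
# LoRA support for this architecture depends on your vLLM version; check before deploying.
vllm serve microsoft/MAI-DS-R1-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-expert-parallel \
  --enable-lora \
  --lora-modules mai-finetuned=/path/to/mai-ds-r1-adapter \
  --max-loras 4 \
  --trust-remote-code \
  --port 8000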

To use the adapter, set the model field in your request to the module alias:

python

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="mai-finetuned",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
)

For multi-tenant setups with multiple LoRA adapters in memory simultaneously, see the LoRA multi-adapter serving guide for the full vLLM configuration including per-request adapter selection and adapter cache management.

Cost: Spheron GPU Cloud vs Azure, Fireworks, and Baseten

Pricing fetched from the Spheron API on 29 Jun 2026.

On-Demand Pricing for MAI-DS-R1 Self-Hosting

MAI-DS-R1 requires multi-GPU configurations. These are the relevant on-demand rates per GPU from Spheron:

GPU	Spheron On-Demand/hr	Spheron Spot/hr
H100 SXM5 80GB	$4.06	$1.49
H100 PCIe 80GB	$2.98	N/A
H200 SXM5 141GB	$3.70	$3.31
A100 80G SXM4	$1.69	$0.85

For MAI-DS-R1 inference, cost per 1M tokens at realistic continuous-batching throughput:

Config	Quantization	On-Demand/hr	Throughput est.	Cost/1M tokens
4x H100 SXM5	INT4 (tight)	$16.24	~2,000 tok/s	~$2.26
4x H200 SXM5	INT4	$14.80	~2,500 tok/s	~$1.64
4x A100 80G SXM4	INT4	$6.76	~1,200 tok/s	~$1.56
4x H100 SXM5 (spot)	INT4 (tight)	$5.96	~2,000 tok/s	~$0.83

Note: these 4x configurations use INT4 quantization. The 4x H100 SXM5 config is borderline (320 GB available vs. ~336 GB for model weights) and relies on BitsAndBytes CPU offloading for overflow. FP8 serving of the pre-quantized MAI-DS-R1-FP8 variant needs an 8-GPU node: 8x H200 SXM5 fits comfortably, while 8x H100 SXM5 (640 GB) is below the ~671 GB weight footprint.

Throughput estimates assume vLLM continuous batching with INT4 quantization and moderate batch sizes. Actual throughput varies significantly with request batch sizes, sequence lengths, and expert-routing patterns in the MoE layers.
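The cost-per-token column is the hourly rate divided by tokens generated per hour. A minimal sketch of that conversion, using the rates and throughput estimates from the table above (the function name is illustrative):

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    configs = {
        "4x H100 SXM5 (on-demand)": (16.24, 2000),
        "4x H200 SXM5 (on-demand)": (14.80, 2500),
        "4x A100 80G SXM4 (on-demand)": (6.76, 1200),
        "4x H100 SXM5 (spot)": (5.96, 2000),
    }
    for name, (hourly, tps) in configs.items():
        print(f"{name}: ${cost_per_million_tokens(hourly, tps):.2f} per 1M tokens")
```

Because the result is inversely proportional to throughput, a node that sits at half utilization costs twice as much per token as the table suggests.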

API-Only Model Pricing (MAI-Code-1-Flash and Others)

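For API-only MAI models accessed through Azure AI Foundry, per-token pricing applies. Fireworks AI and Baseten rates for comparable small models, and the self-hosted MAI-DS-R1 figures from above, are listed for reference: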

Provider	Model	Price (input)	Price (output)
Azure AI Foundry	MAI-Code-1-Flash	Not publicly listed (check portal)	Not publicly listed
Fireworks AI	Haiku-class models	$0.20/1M tokens	$0.20/1M tokens
Baseten	Small coding models	$0.30-0.50/1M tokens	$0.30-0.50/1M tokens
Spheron (MAI-DS-R1, 4x H100)	Self-hosted	~$2.26/1M tokens (on-demand)	included
Spheron (MAI-DS-R1, 4x H100 spot)	Self-hosted	~$0.83/1M tokens	included

Note: Azure AI Foundry pricing for MAI-specific models was not publicly listed on the Foundry pricing page at time of writing. Check the Azure portal pricing calculator for current rates. The Fireworks rate ($0.20/1M for small/efficient models) is from the Fireworks AI alternatives guide.

The break-even calculation for self-hosting: at Fireworks' $0.20/1M rate for a MAI-Code-1-Flash class model, a 4x A100 setup at $3.40/hr (spot, 4 × $0.85) has to process 17M tokens per hour, about 4,700 tokens/sec sustained, before it matches the API on cost. For comparison, the 4x H100 SXM5 spot setup ($5.96/hr) running MAI-DS-R1 tops out at an estimated 7.2M tokens per hour, or $0.83/1M tokens; that figure is hardware throughput capacity, not a break-even against the Fireworks rate. Self-hosting becomes cost-competitive only for sustained production workloads that keep the hardware near those volumes.
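The break-even volume is simply the hourly hardware cost divided by the API price per token. A minimal sketch with the figures from this section (the function name is illustrative):

```python
def break_even_tokens_per_hour(hourly_usd: float, api_usd_per_million: float) -> float:
    """Tokens per hour at which renting the hardware costs the same as paying the API rate."""
    return hourly_usd / api_usd_per_million * 1_000_000


if __name__ == "__main__":
    tokens_per_hour = break_even_tokens_per_hour(3.40, 0.20)
    print(f"{tokens_per_hour / 1e6:.1f}M tokens/hour, "
          f"{tokens_per_hour / 3600:.0f} tokens/sec sustained")
```

Comparing the required tokens/sec against what your serving stack actually sustains is the quickest way to see whether a given node can reach break-even at all.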

Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.

MAI-Code-1-Flash: When It's the Right Cheap Coding Model

MAI-Code-1-Flash is API-only, but the efficiency case Microsoft makes for it is worth examining in detail, because it directly affects the economics of any system that uses it at scale.

The core claim: MAI-Code-1-Flash outperforms Claude Haiku 4.5 on SWE-Bench Pro (51.2% vs 35.2%) and across core coding benchmarks, while generating fewer tokens per task. The token efficiency advantage has two direct consequences.

Cost. If MAI-Code-1-Flash and Claude Haiku 4.5 are priced at similar per-token rates, producing fewer tokens per coding task directly reduces cost. For inner-loop code agents that make 15-30 sequential model calls per task, this compounds hard: shorter per-step outputs mean lower total cost across the full agent run.

Latency. Fewer tokens generated means lower wall-clock time per coding step. On a synchronous API call, a 1,200-token response is faster than a 3,000-token response. In agentic workflows where each step waits for the previous one, the latency reduction compounds across the full chain.
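How the compounding works across an agent run can be sketched with simple arithmetic. The per-token price and decode speed below are placeholder assumptions for illustration only, not published rates for either model:

```python
def agent_run(steps: int, tokens_per_step: int, usd_per_million: float, tokens_per_sec: float):
    """Return (total output tokens, output cost in USD, sequential decode time in seconds)."""
    total_tokens = steps * tokens_per_step
    cost = total_tokens / 1_000_000 * usd_per_million
    seconds = total_tokens / tokens_per_sec
    return total_tokens, cost, seconds


if __name__ == "__main__":
    PRICE = 1.00   # hypothetical USD per 1M output tokens
    SPEED = 100.0  # hypothetical decode speed in tokens/sec
    for label, per_step in (("shorter outputs", 1200), ("longer outputs", 3000)):
        tokens, cost, seconds = agent_run(20, per_step, PRICE, SPEED)
        print(f"{label}: {tokens} tokens, ${cost:.3f}, {seconds:.0f}s")
```

At equal price and speed, both cost and wall-clock time scale with tokens per step, so a 2.5x difference in response length carries through to the whole run.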

When to use MAI-Code-1-Flash:

Inner-loop code agents with many sequential API calls
IDE autocomplete where latency matters more than peak quality
Code review pipelines that process large volumes of PRs
Any pipeline where the cost of a 70B reasoning model is prohibitive but you still need SWE-bench competitive quality

When not to use it:

Tasks requiring 100K+ context windows (extended reasoning chains, large codebase analysis)
Multimodal inputs (MAI-Image-2.5 handles image-code tasks)
Highest-stakes reasoning where MAI-Thinking-1's 97% AIME performance level matters
Production workloads where you need to avoid API dependency and require fine-tuning control

For self-hosted alternatives in the same coding model tier, see the guide on how to self-host a coding assistant and compare against Devstral deployment patterns.

Choosing the Right GPU for Each MAI Workload

MAI-DS-R1: Reasoning Workloads

MAI-DS-R1 needs multi-GPU setups regardless of serving precision. For production serving:

4x H100 SXM5 at $16.24/hr on-demand handles INT4-quantized MAI-DS-R1 with ~2,000 tok/sec throughput and 32K context. Use for latency-sensitive production workloads.
4x A100 80G SXM4 at $6.76/hr is the budget option for batch inference at lower throughput. The lower interconnect bandwidth compared to H100 SXM5 slows multi-GPU all-reduce operations, which is the bottleneck on large MoE models.
4x H100 SXM5 spot at $5.96/hr covers asynchronous batch jobs where occasional preemption is acceptable.

H100 on Spheron and A100 GPU rental both support multi-GPU configurations with NVLink on SXM variants.

MAI-Thinking-1: Extended Reasoning (API-Only)

MAI-Thinking-1 is inference-only via Microsoft Foundry. If you are evaluating the cost of accessing it via API versus running an open alternative, the reasoning model cost breakdown and self-hostable alternatives are covered in the MAI-Thinking-1 guide.

MAI-Code-1-Flash: Coding Tasks (API-Only)

Access via Azure AI Foundry. If you need a self-hosted equivalent for full inference control, the coding assistant deployment guide covers comparable open models on L40S and A100 instances.

MAI-Image-2.5: Image Workloads (API-Only)

Image generation and understanding models have different VRAM profiles from text transformers. Diffusion-based models are GPU compute-bound rather than memory-bandwidth-bound. Until Microsoft releases weights, Azure AI is the access path.

MAI-Voice-2 and MAI-Transcribe-1.5: Audio Workloads (API-Only)

Real-time voice and transcription pipelines have strict latency requirements that standard batch LLM serving frameworks do not address. For self-hosted speech infrastructure, the voice AI GPU infrastructure guide covers streaming ASR pipelines, TTS latency requirements, and GPU sizing for real-time audio. For transcription specifically, the Whisper production deployment guide covers open-weight alternatives you can run yourself today.

Practical Path: What to Deploy Now

The realistic deployment map for the MAI family as of June 2026:

Self-host MAI-DS-R1 for reasoning and code on 4x H100 SXM5 or 4x A100 SXM4. Open weights, MIT license, fine-tunable. The Spheron deployment docs cover multi-GPU vLLM setup for MoE models.

Use Azure Foundry API for MAI-Code-1-Flash if token efficiency on coding tasks is the priority and you can accept API dependency and per-token pricing.

Use MAI-Thinking-1 via Foundry API (private preview) if you need the 97% AIME-class reasoning ceiling and can accept the access limitations. Otherwise, self-hosted MAI-DS-R1 is the open-weight reasoning alternative within the family.

Use Azure Cognitive Services for MAI-Voice-2 and MAI-Transcribe-1.5, or substitute with open-weight alternatives for voice (see the voice AI infrastructure guide) and transcription.

The open-weight MAI-DS-R1 is the right choice when you need model control, fine-tuning capability, or data residency. The API-only models are the right choice when you want zero infrastructure overhead and the access constraints are acceptable.

Microsoft's MAI family covers enough workload types that you will likely need more than one GPU tier. Spheron lets you provision different GPU sizes under one account with per-minute billing and no minimums. Start with a 4x A100 80G SXM4 node for INT4 MAI-DS-R1 experiments, then scale to H100 or H200 nodes for production reasoning workloads.

Quick Setup Guide

1. Verify which MAI models have downloadable weights
Check the microsoft/ organization on HuggingFace for MAI model cards. As of June 2026, only MAI-DS-R1 and MAI-DS-R1-FP8 have safetensors files with MIT license. MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Image-2.5-Flash, MAI-Transcribe-1.5, MAI-Voice-2, and MAI-Voice-2-Flash (coming soon, not yet released) are API-only. Check before provisioning hardware to avoid wasted setup time.

2. Size your GPU cluster to MAI-DS-R1
MAI-DS-R1 is 671B parameters using sparse MoE. At FP8 (1 byte per param), weights alone require ~671 GB of VRAM; at INT4 (~0.5 bytes per param), ~336 GB. For FP8 serving, plan for 8x H200 SXM5 (1,128 GB, comfortable); 8x H100 SXM5 (640 GB) falls about 31 GB short of the weights alone and needs CPU offload or more GPUs. For INT4 inference or QLoRA fine-tuning in 4-bit, 4x H100 SXM5 (320 GB) is borderline and relies on BitsAndBytes CPU overflow, while 4x H200 SXM5 (564 GB) is more comfortable.

3. Fine-tune MAI-DS-R1 with LoRA and PEFT
Install PEFT, TRL, and BitsAndBytes. Load MAI-DS-R1 in 4-bit NF4 quantization via BitsAndBytes. Apply a LoRA config targeting the attention and MLP projection layers. Run on 4x H100 SXM5 with device_map="auto", or use DeepSpeed ZeRO-3 or FSDP for distributed training. Training on 5,000 examples takes roughly 8-12 hours at this scale.

4. Serve the adapter with vLLM
Start vLLM with vllm serve microsoft/MAI-DS-R1 --tensor-parallel-size N --enable-lora --lora-modules your-adapter=/path/to/adapter, where N matches the GPU count of the node you sized in step 2. Send requests with the adapter alias in the model field. Multi-adapter setups cache up to 8 adapters with --max-loras 8 to avoid adapter reload latency.

5. Compare costs against Azure and token-based APIs
Use the cost tables in this guide to calculate your break-even token volume versus Azure or Fireworks per-token pricing. On Spheron, 4x H100 SXM5 on-demand costs $16.24/hr. At 2,000 tokens/sec continuous throughput, that works out to roughly $2.26/1M tokens. Spot H100 SXM5 instances lower the hourly rate by about 63% ($1.49 vs $4.06 per GPU) for batch workloads that tolerate preemption; the spot discount is smaller on A100 and H200.

Frequently Asked Questions

1. Which Microsoft MAI models can I actually self-host?
As of June 2026, MAI-DS-R1 and MAI-DS-R1-FP8 are the only MAI models with publicly downloadable weights on Hugging Face (MIT license, 671B parameters using the DeepSeek V3 architecture). MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Image-2.5-Flash, MAI-Transcribe-1.5, MAI-Voice-2, and MAI-Voice-2-Flash (announced at Build 2026, not yet released) are API-only via Microsoft Foundry or Azure AI. Always verify at the official HuggingFace model card before provisioning hardware.

2. What GPU do I need to fine-tune MAI-DS-R1?
MAI-DS-R1 is a 671B sparse MoE model using the DeepSeek V3 architecture. QLoRA fine-tuning in 4-bit requires approximately 336 GB of VRAM just for weights, which exceeds 4x H100 SXM5 (320 GB combined). In practice, BitsAndBytes overflows some state to CPU RAM, making 4x H100 SXM5 workable though tight; 4x H200 SXM5 141GB (564 GB combined) is more comfortable. For FP8 inference of the pre-quantized variant (~671 GB), plan for 8x H200 SXM5; 8x H100 SXM5 (640 GB) is below the weight footprint.

3. How does MAI-Code-1-Flash compare to Claude Haiku 4.5 on coding benchmarks?
Microsoft reports MAI-Code-1-Flash outperforms Claude Haiku 4.5 across core coding benchmarks, including a 16-point SWE-Bench Pro lead (51.2% vs 35.2%), while being cheaper to run. Fewer tokens per request means higher throughput per dollar at the same per-token API rate. However, MAI-Code-1-Flash is API-only as of June 2026, with no downloadable weights.

4. Can I serve a fine-tuned MAI-DS-R1 adapter alongside other LoRA adapters on one node?
Yes, if you are running on a multi-GPU node with enough VRAM for the base model. Load MAI-DS-R1 once in 4-bit or FP8 quantization, then register your fine-tuned adapter with vLLM's --enable-lora --lora-modules flags. Per-request adapter switching adds sub-millisecond latency. A 4x H100 SXM5 node handles INT4-quantized MAI-DS-R1 plus multiple adapters in memory simultaneously (tight memory budget; monitor GPU usage).

5. Is it cheaper to run MAI-DS-R1 on Spheron than on Azure?
For sustained inference workloads, yes. Azure charges per token, which compounds fast at scale. On Spheron, you rent the GPU by the hour and keep whatever throughput your serving stack delivers. At typical vLLM continuous-batching throughput for a quantized 671B MoE model (2,000-4,000 tokens/sec on 4x H100), the effective per-million-token cost on Spheron drops to $1-3, compared to higher per-token API rates for equivalent capability models.