Microsoft's MAI family covers five distinct AI modalities in a single product line: reasoning, code generation, image understanding, voice, and speech transcription. The catch is that only two of the nine models expose downloadable weights you can actually run and fine-tune.
This guide breaks down each model in the family, clarifies exactly which ones are open-weight versus API-only, and provides the full GPU sizing and fine-tuning walkthrough for the models you can actually deploy yourself.
The 9 MAI Models: What Each One Does
Microsoft's MAI lineup expanded significantly through late 2025 and the first half of 2026. Here is what each model does and who it is for.
MAI-DS-R1 is Microsoft's open-weight reasoning model, built on the DeepSeek V3 architecture and fine-tuned from DeepSeek R1. It has 671 billion total parameters using sparse Mixture-of-Experts routing. Microsoft released it under an MIT license with full weights on HuggingFace. It handles multi-step reasoning, math, and code with extended chain-of-thought output.
MAI-DS-R1-FP8 is the FP8-quantized version of MAI-DS-R1. Same model, same license, roughly half the VRAM footprint of BF16. It targets teams with 8x H200-class hardware that want to skip the manual FP8 quantization step and serve at FP8 precision without additional post-processing; 8x H100 (640 GB combined) falls short of the ~671 GB of FP8 weights.
MAI-Thinking-1 is Microsoft's flagship proprietary reasoning model, announced at Build 2026. Microsoft reports that it scores 97.0% on AIME 2025 and is preferred to Claude Sonnet 4.6 in blind human side-by-side evaluations. It is a large sparse MoE with 35B active parameters (Microsoft has not disclosed the total parameter count), and Microsoft distributes it as an API service only. For the full deployment walkthrough, see the MAI-Thinking-1 deployment guide.
MAI-Code-1-Flash is Microsoft's compact coding model optimized for IDE agents and code review pipelines. Microsoft reports it outperforms Claude Haiku 4.5 on SWE-Bench Pro (51.2% vs 35.2%) and across core coding benchmarks while being cheaper to run. That cost efficiency matters for inner-loop agentic workflows where each coding step generates a response. It is API-only via Azure AI Foundry.
MAI-Image-2.5 is Microsoft's multimodal image model for image generation and visual understanding tasks. It handles both text-to-image generation and image understanding (visual Q&A, document parsing, diagram interpretation). API-only.
MAI-Image-2.5-Flash is the faster, lower-latency variant of MAI-Image-2.5. Same capabilities, tuned for speed over maximum quality. Suitable for real-time image generation pipelines where latency matters more than peak quality. API-only.
MAI-Transcribe-1.5 handles speech-to-text transcription with multilingual support. Positioned for meeting transcription, audio indexing, and voice-to-text pipelines. API-only, no publicly available weights.
MAI-Voice-2 covers voice synthesis and voice chat. It enables conversational voice interfaces with real-time streaming. API-only, no weights released.
MAI-Voice-2-Flash is a faster variant of MAI-Voice-2, announced at Build 2026 as "coming soon." Positioned for lower-latency voice applications. No release date confirmed at time of writing.
Summary table:
| Model | Modality | Architecture | Primary use case |
|---|---|---|---|
| MAI-DS-R1 | Text / reasoning | DeepSeek V3 (671B MoE) | Chain-of-thought, math, code |
| MAI-DS-R1-FP8 | Text / reasoning | DeepSeek V3 (FP8) | Same as above, lower VRAM |
| MAI-Thinking-1 | Text / reasoning | Large sparse MoE (params undisclosed) | Extended reasoning, AIME-class math |
| MAI-Code-1-Flash | Code | Undisclosed | IDE agents, code review, SWE tasks |
| MAI-Image-2.5 | Image | Undisclosed | Image generation and understanding |
| MAI-Image-2.5-Flash | Image | Undisclosed | Low-latency image generation |
| MAI-Transcribe-1.5 | Speech | Undisclosed | Speech-to-text, meeting transcription |
| MAI-Voice-2 | Voice | Undisclosed | Voice synthesis, voice chat |
| MAI-Voice-2-Flash | Voice | Undisclosed | Low-latency voice synthesis (coming soon) |
Which MAI Models You Can Actually Self-Host
The weight availability check is the first thing to do before planning any deployment. As of June 2026, here is the full picture:
| Model | Weights available? | License | HF slug |
|---|---|---|---|
| MAI-DS-R1 | Yes | MIT | microsoft/MAI-DS-R1 |
| MAI-DS-R1-FP8 | Yes | MIT | microsoft/MAI-DS-R1-FP8 |
| MAI-Thinking-1 | No | N/A (API-only) | No HuggingFace release |
| MAI-Code-1-Flash | No | N/A (API-only) | No HuggingFace release |
| MAI-Image-2.5 | No | N/A (API-only) | No HuggingFace release |
| MAI-Image-2.5-Flash | No | N/A (API-only) | No HuggingFace release |
| MAI-Transcribe-1.5 | No | N/A (API-only) | No HuggingFace release |
| MAI-Voice-2 | No | N/A (API-only) | No HuggingFace release |
| MAI-Voice-2-Flash | Not yet | N/A (coming soon) | Announced at Build 2026, not released |
MAI-DS-R1 and MAI-DS-R1-FP8 are fully self-hostable. The MIT license covers commercial use without restrictions on inference, fine-tuning, or redistribution of adapter weights.
For the API-only models, Microsoft provides access through Azure AI Foundry. MAI-Thinking-1 is in private preview as of June 2026 (broader access not yet confirmed). MAI-Code-1-Flash, MAI-Image-2.5, and MAI-Transcribe-1.5 are available through the Azure AI model catalog. MAI-Voice-2 is accessible via Azure Cognitive Services.
The practical implication: if you need full inference control, on-premises deployment, data residency compliance, or fine-tuning capability, MAI-DS-R1 is the only model in the family that delivers it.
GPU and VRAM Requirements Per Model
For the open-weight models (MAI-DS-R1 and MAI-DS-R1-FP8), VRAM requirements are based on confirmed architecture. For API-only models, the table shows minimum hardware if weights were hypothetically released, based on the capabilities Microsoft has described.
MAI-DS-R1 and MAI-DS-R1-FP8
MAI-DS-R1 uses the DeepSeek V3 architecture with 671B total parameters. It is a sparse MoE model, so each forward pass activates only a subset of experts, but all expert weights must reside in VRAM.
| Variant | Total Params | VRAM BF16 | VRAM FP8 | VRAM INT4 | Recommended Config |
|---|---|---|---|---|---|
| MAI-DS-R1 (BF16) | 671B | ~1,342 GB | N/A | N/A | 12x H200 SXM5 or 20x H100 SXM5 |
| MAI-DS-R1 (FP8) | 671B | N/A | ~671 GB | N/A | 8x H200 SXM5 (8x H100 SXM5 at 640 GB is ~31 GB short on weights alone) |
| MAI-DS-R1-FP8 (pre-quantized) | 671B | N/A | ~671 GB | N/A | 8x H200 SXM5 (8x H100 SXM5 at 640 GB is ~31 GB short on weights alone) |
| MAI-DS-R1 (INT4 QLoRA) | 671B | N/A | N/A | ~336 GB | 4x H100 SXM5 (tight/borderline) |
For practical serving on 4x H100 SXM5 (320 GB combined), INT4 quantization is the only viable option, and even then the weight-only footprint (~336 GB) slightly exceeds available VRAM, so part of the model has to be offloaded to CPU RAM. The pre-quantized microsoft/MAI-DS-R1-FP8 (~671 GB) does not fit on 4x H100 (320 GB) or 4x H200 (564 GB). To serve the FP8 variant without additional quantization, use at least 8x H200 SXM5 (1,128 GB combined, comfortable headroom for weights plus KV cache). 8x H100 SXM5 (640 GB combined) is about 31 GB short of the weights alone, before any KV cache is allocated, so it only works with CPU offloading or additional GPUs.
No single-GPU configuration can hold this model. For the full VRAM calculation methodology including KV cache headroom, see GPU memory requirements for LLMs.
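The weight-only figures in the table follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes and ignoring KV cache, activations, and framework overhead (the function names are illustrative, not from any library):

```python
# Weight-only VRAM estimate: parameters x bytes per parameter.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_vram_gb(total_params: float, precision: str) -> float:
    """Return weight-only memory in decimal GB."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


def fits(total_params: float, precision: str, gpus: int, gb_per_gpu: float) -> bool:
    """True if the weights alone fit in combined VRAM (no KV cache headroom)."""
    return weight_vram_gb(total_params, precision) <= gpus * gb_per_gpu


if __name__ == "__main__":
    for precision in ("bf16", "fp8", "int4"):
        print(precision, weight_vram_gb(671e9, precision), "GB")
    print("INT4 on 4x H100 80GB:", fits(671e9, "int4", 4, 80))
    print("FP8 on 8x H200 141GB:", fits(671e9, "fp8", 8, 141))
```

Treat a passing `fits` check as a lower bound only: real deployments also need room for KV cache and runtime buffers.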
API-Only Models: Estimated Hardware (Hypothetical)
| Model | Estimated Params | Min VRAM Estimate | Notes |
|---|---|---|---|
| MAI-Code-1-Flash | ~7-8B (est.) | ~16 GB FP16 | Based on SWE-bench comparison to Haiku-class models |
| MAI-Image-2.5 | Unknown | Unknown | Image diffusion requirements differ from transformer text models |
| MAI-Transcribe-1.5 | ~1-3B (est.) | ~6 GB FP16 | Based on ASR model precedent |
These estimates are based on benchmark comparisons and category patterns, not confirmed architecture specs. Do not provision hardware based on estimates for API-only models.
Fine-Tuning MAI-DS-R1 with LoRA and PEFT
This section applies to MAI-DS-R1 only. Apart from its FP8 variant, the other seven models do not have open weights.
MAI-DS-R1 uses the deepseek_v3 model type, which is a sparse MoE transformer with multi-head latent attention (MLA) and the DeepSeek V3 expert-routing mechanism. The fine-tuning patterns are identical to DeepSeek R1 fine-tuning.
Choosing a Fine-Tuning Method
For a 671B MoE model, full fine-tuning is not practical for most teams. The standard path is QLoRA (quantized LoRA) with 4-bit NF4 loading via BitsAndBytes. This reduces the effective VRAM footprint from ~671 GB (FP8) to ~336 GB (INT4). A 4x H100 SXM5 setup (320 GB combined) is tight, slightly under the weight-only footprint, so BitsAndBytes will spill some state to CPU RAM. Gradient checkpointing helps reduce peak activation memory on top of that.
For an overview of LoRA rank selection, QLoRA tradeoffs, and newer alternatives like DoRA, see the fine-tuning framework comparison and DoRA and advanced PEFT methods.
The key decision for MAI-DS-R1 is whether you target attention layers only (safer, lower LoRA parameter count) or also target the MLP gate and projection layers (broader adaptation, more VRAM for optimizer states):
- Attention-only LoRA (q_proj, k_proj, v_proj, o_proj): ~500M trainable parameters at rank 16. Conservative VRAM budget.
- Full MoE LoRA (attention + gate_proj, up_proj, down_proj): ~2-4B trainable parameters at rank 16. Stronger adaptation for task-specific fine-tuning.
For most code or instruction fine-tuning tasks, attention-only LoRA produces good results at lower hardware cost. For domain-specific reasoning adaptation, full MoE LoRA is worth the VRAM overhead.
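For intuition on where the trainable-parameter counts come from: a LoRA adapter on one linear layer adds two low-rank matrices, A (r × d_in) and B (d_out × r), so it contributes r × (d_in + d_out) parameters. A small sketch of that count follows; the layer shapes in the example are placeholders, not MAI-DS-R1's actual dimensions, and the helper names are hypothetical:

```python
from typing import Iterable, Tuple


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in)."""
    return rank * (d_in + d_out)


def adapter_params(layer_shapes: Iterable[Tuple[int, int]], rank: int) -> int:
    """Sum LoRA parameters over every targeted linear layer."""
    return sum(lora_params(d_in, d_out, rank) for d_in, d_out in layer_shapes)


if __name__ == "__main__":
    # Hypothetical model: 4 square projections per block, 32 blocks, hidden size 4096.
    shapes = [(4096, 4096)] * (4 * 32)
    print(adapter_params(shapes, rank=16))
```

In a MoE model every expert carries its own gate_proj, up_proj, and down_proj, so adding the MLP projections to the target list multiplies the number of adapted layers, which is consistent with the larger parameter count quoted for the full MoE option.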
Dataset Prep
MAI-DS-R1 follows the DeepSeek instruction format. Use the standard chat template from the tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/MAI-DS-R1")

def format_sample(instruction, output):
    # Render one user/assistant pair with the tokenizer's built-in chat template
    messages = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": output},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)

Token efficiency note: MAI-DS-R1 generates extended reasoning traces before final answers, similar to DeepSeek R1. If you are fine-tuning for a task that does not require chain-of-thought output, format your training data to include only the final answer in the assistant turn, which keeps outputs short. If you want to preserve chain-of-thought behavior, keep the reasoning trace in the assistant turn; answer-only samples can inadvertently train the model to skip reasoning.
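One way to keep the choice between reasoning-trace and answer-only samples explicit in a data pipeline is a small helper that builds the message list. This is a sketch: the `<think>...</think>` wrapper follows the DeepSeek R1 convention and is assumed, not confirmed, for MAI-DS-R1, and `build_messages` is a hypothetical helper name.

```python
from typing import Optional


def build_messages(instruction: str, answer: str, reasoning: Optional[str] = None) -> list:
    """Build a chat-format sample; include the reasoning trace only when provided."""
    if reasoning:
        # Assumed R1-style layout: reasoning inside <think> tags, then the answer.
        assistant = f"<think>\n{reasoning.strip()}\n</think>\n\n{answer.strip()}"
    else:
        assistant = answer.strip()
    return [
        {"role": "user", "content": instruction.strip()},
        {"role": "assistant", "content": assistant},
    ]


if __name__ == "__main__":
    sample = build_messages("What is 12 * 12?", "144", reasoning="12 * 12 = 144")
    print(sample[1]["content"])
```

Some R1-style chat templates may drop everything before `</think>` when rendering assistant turns, so it is worth printing one rendered sample from `apply_chat_template` and confirming the trace survives before training on it.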
QLoRA Fine-Tuning on 4x H100 SXM5
The following setup uses PEFT and BitsAndBytes directly, since Unsloth may not yet recognize the deepseek_v3 model type (check unsloth.models compatibility before using the Unsloth path). If Unsloth supports your version, the training speed improvement is 2-5x, making it worth checking first.
Check Unsloth compatibility:
python3 -c "
try:
    from unsloth.models._utils import SUPPORTED_MODELS
    print(any('deepseek' in m.lower() for m in SUPPORTED_MODELS))
except ImportError:
    print(False)
"

If that prints False or fails, use the PEFT + BitsAndBytes path:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load in 4-bit on 4x H100 (device_map="auto" distributes across GPUs)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/MAI-DS-R1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/MAI-DS-R1")

# LoRA config targeting attention and MLP projection layers (the full MoE option).
# Verify these names against model.named_modules() first: DeepSeek V3's MLA attention
# typically exposes q_a_proj, q_b_proj, kv_a_proj_with_mqa, and kv_b_proj rather than
# q_proj/k_proj/v_proj, and PEFT only raises an error if none of the names match.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: ~2.4B || all params: 671B || trainable%: 0.36%

For the training loop, use TRL's SFTTrainer with gradient checkpointing enabled:
from trl import SFTTrainer, SFTConfig

# Argument names depend on the installed TRL release; some newer releases rename
# max_seq_length to max_length and tokenizer to processing_class.
training_args = SFTConfig(
    output_dir="./mai-ds-r1-adapter",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    max_seq_length=4096,
    packing=False,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=your_dataset,  # placeholder: your formatted training dataset
    args=training_args,
)
trainer.train()
trainer.save_model("./mai-ds-r1-adapter")

Hardware requirement: This setup needs 4x H100 SXM5 (tight/borderline at 320 GB; part of the model spills to CPU RAM) or 4x H200 SXM5 (564 GB, more comfortable for INT4 training state). device_map="auto" distributes the model across all available GPUs. For distributed training with DeepSpeed ZeRO-3, see the LLM fine-tuning guide for the full ZeRO-3 config that further reduces per-GPU memory usage by partitioning parameters, gradients, and optimizer states.
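It can help to work out the step count implied by these settings before launching a long run. The sketch below assumes a single training process (as with `device_map="auto"`), so the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps`. The 5,000-example dataset size is only an illustration, and the result is an approximation of what the trainer will report:

```python
import math


def optimizer_steps(num_examples: int, epochs: int, per_device_batch: int,
                    grad_accum: int, num_processes: int = 1) -> int:
    """Approximate optimizer steps, rounding up the partial final batch each epoch."""
    effective_batch = per_device_batch * grad_accum * num_processes
    return math.ceil(num_examples / effective_batch) * epochs


if __name__ == "__main__":
    steps = optimizer_steps(5000, epochs=2, per_device_batch=1, grad_accum=8)
    print(steps, "optimizer steps;", steps // 500, "checkpoints at save_steps=500")
```

If the step count is small relative to `save_steps`, consider lowering `save_steps` so a preempted run does not lose most of its progress.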
Serving the Adapter with vLLM
After training, the adapter directory contains LoRA weights that vLLM can load alongside the base model:
# Serves the pre-quantized FP8 checkpoint on an 8-GPU node. vLLM normally reads the
# quantization scheme from the checkpoint config, so no --dtype or --quantization
# override is set here (--dtype does not accept fp8).
vllm serve microsoft/MAI-DS-R1-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-expert-parallel \
  --enable-lora \
  --lora-modules mai-finetuned=/path/to/mai-ds-r1-adapter \
  --max-loras 4 \
  --trust-remote-code \
  --port 8000

To use the adapter, set the model field in your request to the module alias:
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="mai-finetuned",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
)

For multi-tenant setups with multiple LoRA adapters in memory simultaneously, see the LoRA multi-adapter serving guide for the full vLLM configuration including per-request adapter selection and adapter cache management.
Cost: Spheron GPU Cloud vs Azure, Fireworks, and Baseten
Pricing fetched from the Spheron API on 29 Jun 2026.
On-Demand Pricing for MAI-DS-R1 Self-Hosting
MAI-DS-R1 requires multi-GPU configurations. These are the relevant on-demand rates per GPU from Spheron:
| GPU | Spheron On-Demand/hr | Spheron Spot/hr |
|---|---|---|
| H100 SXM5 80GB | $4.06 | $1.49 |
| H100 PCIe 80GB | $2.98 | N/A |
| H200 SXM5 141GB | $3.70 | $3.31 |
| A100 80G SXM4 | $1.69 | $0.85 |
For MAI-DS-R1 inference, cost per 1M tokens at realistic continuous-batching throughput:
| Config | Quantization | On-Demand/hr | Throughput est. | Cost/1M tokens |
|---|---|---|---|---|
| 4x H100 SXM5 | INT4 (tight) | $16.24 | ~2,000 tok/s | ~$2.26 |
| 4x H200 SXM5 | INT4 | $14.80 | ~2,500 tok/s | ~$1.64 |
| 4x A100 80G SXM4 | INT4 (tight) | $6.76 | ~1,200 tok/s | ~$1.56 |
| 4x H100 SXM5 (spot) | INT4 (tight) | $5.96 | ~2,000 tok/s | ~$0.83 |
Note: these 4x configurations use INT4 quantization. The 4x H100 SXM5 and 4x A100 80G configs are borderline (320 GB available vs. ~336 GB for model weights) and rely on CPU offloading for overflow. FP8 serving of the pre-quantized MAI-DS-R1-FP8 variant (~671 GB) requires 8x H200 SXM5; 8x H100 SXM5 (640 GB) cannot hold the weights without offloading.
Throughput estimates assume vLLM continuous batching with INT4 quantization and moderate batch sizes. Actual throughput varies significantly with request batch sizes, sequence lengths, and expert-routing patterns in the MoE layers.
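The Cost/1M tokens column is the hourly price divided by tokens generated per hour. A short sketch that reproduces the table from its own inputs (the function name is illustrative; the throughput values are the estimates listed above, not measurements):

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    configs = {
        "4x H100 SXM5": (16.24, 2000),
        "4x H200 SXM5": (14.80, 2500),
        "4x A100 80G SXM4": (6.76, 1200),
        "4x H100 SXM5 (spot)": (5.96, 2000),
    }
    for name, (hourly, tps) in configs.items():
        print(f"{name}: ${cost_per_million_tokens(hourly, tps):.2f} per 1M tokens")
```

Because cost scales inversely with throughput, halving sustained utilization doubles the effective per-token cost.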
API-Only Model Pricing (MAI-Code-1-Flash and Others)
For API-only MAI models accessed through Azure AI Foundry or Fireworks AI, per-token pricing applies:
| Provider | Model | Price (input) | Price (output) |
|---|---|---|---|
| Azure AI Foundry | MAI-Code-1-Flash | Not publicly listed (check portal) | Not publicly listed |
| Fireworks AI | Haiku-class models | $0.20/1M tokens | $0.20/1M tokens |
| Baseten | Small coding models | $0.30-0.50/1M tokens | $0.30-0.50/1M tokens |
| Spheron (MAI-DS-R1, 4x H100) | Self-hosted | ~$2.26/1M tokens (on-demand) | included |
| Spheron (MAI-DS-R1, 4x H100 spot) | Self-hosted | ~$0.83/1M tokens | included |
Note: Azure AI Foundry pricing for MAI-specific models was not publicly listed in the Foundry pricing page at time of writing. Check the Azure portal pricing calculator for current rates. The Fireworks rate ($0.20/1M for small/efficient models) is from the Fireworks AI alternatives guide.
The break-even calculation for self-hosting: at Fireworks' $0.20/1M rate for a MAI-Code-1-Flash class model, a 4x A100 spot setup at $3.40/hr (4 × $0.85) has to process 17M tokens per hour, about 4,700 tokens/sec sustained, to break even. That is well above the ~1,200 tok/s estimated for MAI-DS-R1 on that hardware, so the break-even is only reachable with a much smaller model. For comparison, the 4x H100 SXM5 spot setup ($5.96/hr) tops out at roughly 7.2M tokens per hour at full utilization, or $0.83/1M tokens. That figure reflects hardware throughput capacity, not a break-even against the Fireworks rate. For sustained production workloads that keep the hardware at those volumes, self-hosting becomes cost-competitive against API rates above the resulting per-token cost.
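The 17M tokens/hour figure is the hourly rental cost divided by the API price per token. A minimal sketch of that break-even arithmetic, with hypothetical function names:

```python
def break_even_tokens_per_hour(hourly_usd: float, api_usd_per_million: float) -> float:
    """Tokens per hour at which renting costs the same as paying per token."""
    return hourly_usd / api_usd_per_million * 1_000_000


def break_even_tokens_per_sec(hourly_usd: float, api_usd_per_million: float) -> float:
    """Same break-even expressed as sustained tokens per second."""
    return break_even_tokens_per_hour(hourly_usd, api_usd_per_million) / 3600


if __name__ == "__main__":
    hourly = 4 * 0.85  # 4x A100 spot, USD per hour
    print(f"{break_even_tokens_per_hour(hourly, 0.20):,.0f} tokens/hour")
    print(f"{break_even_tokens_per_sec(hourly, 0.20):,.0f} tokens/sec")
```

Comparing the tokens-per-second result with the throughput your serving stack actually sustains shows whether the break-even is reachable for a given model.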
Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.
MAI-Code-1-Flash: When It's the Right Cheap Coding Model
MAI-Code-1-Flash is API-only, but the efficiency case Microsoft makes for it is worth examining in detail, because it directly affects the economics of any system that uses it at scale.
The core claim: MAI-Code-1-Flash outperforms Claude Haiku 4.5 on SWE-Bench Pro (51.2% vs 35.2%) and across core coding benchmarks, while generating fewer tokens per task. The token efficiency advantage has two direct consequences.
Cost. If MAI-Code-1-Flash and Claude Haiku 4.5 are priced at similar per-token rates, producing fewer tokens per coding task directly reduces cost. For inner-loop code agents that make 15-30 sequential model calls per task, this compounds hard: shorter per-step outputs mean lower total cost across the full agent run.
Latency. Fewer tokens generated means lower wall-clock time per coding step. On a synchronous API call, a 1,200-token response is faster than a 3,000-token response. In agentic workflows where each step waits for the previous one, the latency reduction compounds across the full chain.
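To see how per-step output length compounds over an agent run, the sketch below multiplies calls per task by output tokens per call and a per-token price. The $1.00 per 1M output tokens rate is a hypothetical placeholder (MAI-Code-1-Flash pricing is not public), input tokens are ignored, and the 1,200 and 3,000 token figures are the illustrative response sizes from the latency example:

```python
def agent_run_cost(calls: int, output_tokens_per_call: int,
                   usd_per_million_output: float) -> float:
    """Output-token cost of one agent run (input tokens ignored)."""
    return calls * output_tokens_per_call * usd_per_million_output / 1_000_000


if __name__ == "__main__":
    price = 1.00  # hypothetical USD per 1M output tokens, same for both models
    for calls in (15, 30):
        short = agent_run_cost(calls, 1200, price)
        long = agent_run_cost(calls, 3000, price)
        print(f"{calls} calls: ${short:.3f} vs ${long:.3f} per task")
```

At equal per-token prices the cost ratio equals the token ratio, so the absolute saving per task grows linearly with the number of sequential calls.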
When to use MAI-Code-1-Flash:
- Inner-loop code agents with many sequential API calls
- IDE autocomplete where latency matters more than peak quality
- Code review pipelines that process large volumes of PRs
- Any pipeline where the cost of a 70B reasoning model is prohibitive but you still need SWE-bench competitive quality
When not to use it:
- Tasks requiring 100K+ context windows (extended reasoning chains, large codebase analysis)
- Multimodal inputs (MAI-Image-2.5 handles image-code tasks)
- Highest-stakes reasoning where MAI-Thinking-1's 97% AIME performance level matters
- Production workloads where you need to avoid API dependency and require fine-tuning control
For self-hosted alternatives in the same coding model tier, see the guide on how to self-host a coding assistant and compare against Devstral deployment patterns.
Choosing the Right GPU for Each MAI Workload
MAI-DS-R1: Reasoning Workloads
MAI-DS-R1 needs multi-GPU setups regardless of serving precision. For production serving:
- 4x H100 SXM5 at $16.24/hr on-demand handles INT4-quantized MAI-DS-R1 with ~2,000 tok/sec throughput and 32K context. Use for latency-sensitive production workloads.
- 4x A100 80G SXM4 at $6.76/hr is the budget option for batch inference at lower throughput. The lower interconnect bandwidth compared to H100 SXM5 slows multi-GPU all-reduce operations, which is the bottleneck on large MoE models.
- 4x H100 SXM5 spot at $5.96/hr covers asynchronous batch jobs where occasional preemption is acceptable.
H100 on Spheron and A100 GPU rental both support multi-GPU configurations with NVLink on SXM variants.
MAI-Thinking-1: Extended Reasoning (API-Only)
MAI-Thinking-1 is inference-only via Microsoft Foundry. If you are evaluating the cost of accessing it via API versus running an open alternative, the reasoning model cost breakdown and self-hostable alternatives are covered in the MAI-Thinking-1 guide.
MAI-Code-1-Flash: Coding Tasks (API-Only)
Access via Azure AI Foundry. If you need a self-hosted equivalent for full inference control, the coding assistant deployment guide covers comparable open models on L40S and A100 instances.
MAI-Image-2.5: Image Workloads (API-Only)
Image generation and understanding models have different VRAM profiles from text transformers. Diffusion-based models are GPU compute-bound rather than memory-bandwidth-bound. Until Microsoft releases weights, Azure AI is the access path.
MAI-Voice-2 and MAI-Transcribe-1.5: Audio Workloads (API-Only)
Real-time voice and transcription pipelines have strict latency requirements that standard batch LLM serving frameworks do not address. For self-hosted speech infrastructure, the voice AI GPU infrastructure guide covers streaming ASR pipelines, TTS latency requirements, and GPU sizing for real-time audio. For transcription specifically, the Whisper production deployment guide covers open-weight alternatives you can run yourself today.
Practical Path: What to Deploy Now
The realistic deployment map for the MAI family as of June 2026:
- Self-host MAI-DS-R1 for reasoning and code on 4x H100 SXM5 or 4x A100 SXM4. Open weights, MIT license, fine-tunable. The Spheron deployment docs cover multi-GPU vLLM setup for MoE models.
- Use Azure Foundry API for MAI-Code-1-Flash if token efficiency on coding tasks is the priority and you can accept API dependency and per-token pricing.
- Use MAI-Thinking-1 via Foundry API (private preview) if you need the 97% AIME-class reasoning ceiling and can accept the access limitations. Otherwise, self-hosted MAI-DS-R1 is the open-weight sparse-MoE alternative.
- Use Azure Cognitive Services for MAI-Voice-2 and MAI-Transcribe-1.5, or substitute with open-weight alternatives for voice (see the voice AI infrastructure guide) and transcription.
The open-weight MAI-DS-R1 is the right choice when you need model control, fine-tuning capability, or data residency. The API-only models are the right choice when you want zero infrastructure overhead and the access constraints are acceptable.
Microsoft's MAI family covers enough workload types that you will likely need more than one GPU tier. Spheron lets you provision different GPU sizes under one account with per-minute billing and no minimums. Start with a 4x A100 80G node for INT4 MAI-DS-R1 experiments, then scale to H100 or H200 nodes for production reasoning workloads.
Quick Setup Guide
Check the microsoft/ organization on HuggingFace for MAI model cards. As of June 2026, only MAI-DS-R1 and MAI-DS-R1-FP8 have safetensors files with MIT license. MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Image-2.5-Flash, MAI-Transcribe-1.5, MAI-Voice-2, and MAI-Voice-2-Flash (coming soon, not yet released) are API-only. Check before provisioning hardware to avoid wasted setup time.
MAI-DS-R1 is 671B parameters using sparse MoE. At FP8 (1 byte per param), weights alone require ~671 GB of VRAM; at INT4 (~0.5 bytes per param), ~336 GB. For FP8 serving, plan for 8x H200 SXM5 (1,128 GB, comfortable); 8x H100 SXM5 (640 GB) is short of the weight footprint and needs CPU offloading. For INT4 inference or QLoRA fine-tuning in 4-bit, 4x H100 SXM5 (320 GB) is borderline and relies on CPU offloading, while 4x H200 SXM5 (564 GB) leaves headroom.
Install PEFT, TRL, and BitsAndBytes. Load MAI-DS-R1 in 4-bit NF4 quantization via BitsAndBytes. Apply a LoRA config targeting the attention and MLP projection layers. Run on 4x H100 SXM5 using DeepSpeed ZeRO-3 or FSDP for distributed training. Training on 5,000 examples takes roughly 8-12 hours at this scale.
Start vLLM with microsoft/MAI-DS-R1-FP8 on an 8-GPU node, using --tensor-parallel-size 8 --enable-lora --lora-modules your-adapter=/path/to/adapter. Send requests with the adapter alias in the model field. Multi-adapter setups cache up to 8 adapters with --max-loras 8 to avoid adapter reload latency.
Use the cost tables in this guide to calculate your break-even token volume versus Azure or Fireworks per-token pricing. On Spheron, 4x H100 SXM5 on-demand costs $16.24/hr. At 2,000 tokens/sec continuous throughput, that works out to roughly $2.26/1M tokens. H100 SXM5 spot instances lower the hourly rate by about 63% ($1.49 vs $4.06 per GPU) for batch workloads that tolerate preemption; the spot discount on the other GPUs in the pricing table is smaller.
Frequently Asked Questions
As of June 2026, MAI-DS-R1 and MAI-DS-R1-FP8 are the only MAI models with publicly downloadable weights on Hugging Face (MIT license, 671B parameters using the DeepSeek V3 architecture). MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Image-2.5-Flash, MAI-Transcribe-1.5, MAI-Voice-2, and MAI-Voice-2-Flash (announced at Build 2026, not yet released) are API-only via Microsoft Foundry or Azure AI. Always verify at the official HuggingFace model card before provisioning hardware.
MAI-DS-R1 is a 671B sparse MoE model using the DeepSeek V3 architecture. QLoRA fine-tuning in 4-bit requires approximately 336 GB of VRAM just for weights, which exceeds 4x H100 SXM5 (320 GB combined). In practice, offloading part of the model to CPU RAM makes 4x H100 SXM5 workable though tight, while 4x H200 SXM5 141GB (564 GB combined) holds the weights with headroom. For FP8 inference of the pre-quantized variant (~671 GB), plan for 8x H200 SXM5; 8x H100 SXM5 (640 GB) falls short of the weights alone.
Microsoft reports MAI-Code-1-Flash outperforms Claude Haiku 4.5 across core coding benchmarks, including a 16-point SWE-Bench Pro lead (51.2% vs 35.2%), while being cheaper to run. Fewer tokens per request means higher throughput per dollar at the same per-token API rate. However, MAI-Code-1-Flash is API-only as of June 2026, with no downloadable weights.
Yes, if you are running on a multi-GPU node with enough VRAM for the base model. Load MAI-DS-R1 once in 4-bit or FP8 quantization, then register your fine-tuned adapter with vLLM's --enable-lora --lora-modules flags. Per-request adapter switching adds sub-millisecond latency. A 4x H100 SXM5 node handles INT4-quantized MAI-DS-R1 plus multiple adapters in memory simultaneously (tight memory budget; monitor GPU usage).
For sustained inference workloads, yes. Azure charges per token, which compounds fast at scale. On Spheron, you rent the GPU by the hour and absorb whatever throughput your serving stack generates. At typical vLLM continuous-batching throughput for a quantized 671B MoE model (2,000-4,000 tokens/sec on 4x H100), the effective per-million-token cost on Spheron drops to $1-3, compared to higher per-token API rates for equivalent capability models.
