Magistral is Mistral's first reasoning-native model family, released in June 2025. Magistral Small (24B parameters) is open-source under Apache 2.0 and fully self-hostable; Magistral Medium is proprietary and accessible only via Mistral's API, and its weights are not publicly available. Unlike Mistral Small 4 (which adds reasoning as an optional capability via a reasoning_effort parameter), Magistral generates chain-of-thought traces by default on every complex query. The Mistral Small 4 deployment guide covers the configurable reasoning approach; this guide covers Magistral Small's always-on reasoning architecture and self-hosting with vLLM. For context on why reasoning models change GPU provisioning economics, see the reasoning model inference cost analysis.
What Is Magistral and Why It Matters for Open-Source Reasoning
Magistral is a family of two models: Magistral Small (24B parameters, open-source Apache 2.0) and Magistral Medium (proprietary, available via Mistral API only; parameter count not publicly disclosed and weights not released). Both are reasoning-native, meaning the architecture generates an internal chain-of-thought scratchpad before producing the visible response, rather than treating reasoning as an optional mode you toggle per request.
The chain-of-thought mechanism works like this: the model emits thinking tokens wrapped in <think>...</think> tags as an internal trace, then produces the final answer. The thinking tokens are fully autoregressive and consume the same compute and KV cache as output tokens. A Magistral response to a math competition problem might generate 8,000-20,000 thinking tokens followed by a 200-token answer. An application that strips the trace shows the user only the answer, but the GPU worked through the full trace.
This differs from Mistral Small 4's reasoning_effort API parameter, which controls whether reasoning is invoked and at what depth. Magistral does not have a reasoning on/off switch at the model level. If you need a model that handles both fast instruct responses and deep reasoning on the same checkpoint, Mistral Small 4 is the better fit. If reasoning quality on math, science, and logic problems is the primary requirement and you can accept reasoning tokens on every query, Magistral is purpose-built for that.
Magistral Small ships under the Apache 2.0 license, making it suitable for commercial self-hosted deployments without usage restrictions. Magistral Medium is proprietary and accessible only through Mistral's API or Le Chat. For context on how reinforcement learning shapes reasoning model behavior, the GRPO fine-tuning guide walks through one widely-used RL approach for reasoning models.
Magistral Small vs Magistral Medium
| Property | Magistral Small | Magistral Medium |
|---|---|---|
| Parameters | 24B | Not publicly disclosed |
| Context length | 128K tokens (40K recommended) | 128K tokens |
| VRAM at FP8 | ~28 GB | N/A (API only) |
| Recommended GPU | 1x H100 SXM5 | Mistral API only |
| AIME 2024 pass@1 | ~70.7% | ~73.6% |
| License | Apache 2.0 | Proprietary |
| Availability | Self-hostable | Mistral API / Le Chat |
The choice between the two comes down to whether self-hosting matters. Magistral Small is open-source, runs on a single GPU, and handles the large majority of production reasoning tasks (code review, data analysis, structured problem-solving). It is the right starting point for most teams. Magistral Medium is available via Mistral's API when you need higher reasoning accuracy. It scores approximately 73.6% on AIME 2024 versus Small's 70.7%, a meaningful gap for competition-level math problems but negligible for most production workloads.
For context window requirements, Magistral Small nominally supports 128K tokens, but the HuggingFace model card notes that performance degrades past 40K tokens, with 40K as the recommended maximum for production use. For long-context reasoning workloads, keep --max-model-len at 32K-40K and consider a single H200 SXM5 instance for the extra KV cache headroom.
Hardware Requirements: VRAM, GPU Recommendations, and Quantization
The VRAM math is direct. At FP8 (1 byte per parameter), Magistral Small at 24B parameters needs 24 GB for weights. Add 15% for activations and framework overhead and you get roughly 27.6 GB. On an 80 GB H100 SXM5 rental that leaves about 52 GB of raw capacity, or roughly 47 GB for KV cache once a few gigabytes are held back for the CUDA context and memory-utilization headroom. That is enough to hold many long reasoning chains concurrently. Keep --max-model-len at 32K-40K in production; the model card recommends not exceeding 40K for reliable output quality, despite the 128K nominal support.
| Config | Precision | Weights + overhead | KV Cache Headroom | Notes |
|---|---|---|---|---|
| 1x H100 SXM5 (80 GB) | FP8 | ~28 GB | ~47 GB | Recommended for Magistral Small |
| 1x H200 SXM5 (141 GB) | FP8 | ~28 GB | ~108 GB | Best single-GPU option for Small |
| 1x RTX PRO 6000 (96 GB) | INT4 | ~14 GB | ~77 GB | Budget option for Small; validate accuracy |
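The arithmetic behind the table can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a measurement: the layer count, KV-head count, and head dimension below are illustrative values for a 24B Mistral-style model with grouped-query attention and should be read from the checkpoint's own config, and the 5 GB reserve is an assumed figure chosen to match the table's ~47 GB headroom on an H100.

```python
# Back-of-envelope VRAM budget for Magistral Small at FP8.
# ASSUMPTIONS (verify against the model's config file before relying on this):
#   num_layers=40, num_kv_heads=8, head_dim=128 are illustrative architecture values.
#   reserve_gb=5 is an assumed allowance for CUDA context and utilization headroom.

GB = 1e9  # decimal gigabytes, matching the figures in the table


def weight_footprint_gb(params_billion: float, bytes_per_param: float,
                        overhead: float = 0.15) -> float:
    """Weights plus a flat activation/framework overhead fraction."""
    return params_billion * bytes_per_param * (1 + overhead)


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: float) -> float:
    """Keys and values (the factor 2) stored for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_token_capacity(gpu_gb: float, weights_gb: float, reserve_gb: float,
                      per_token_bytes: float) -> int:
    """How many tokens of KV state fit in the memory left after weights."""
    free_bytes = (gpu_gb - weights_gb - reserve_gb) * GB
    return int(free_bytes // per_token_bytes)


weights = weight_footprint_gb(24, 1.0)            # 24B params at 1 byte each, +15%
per_token = kv_bytes_per_token(40, 8, 128, 1.0)   # FP8 KV cache, 1 byte per value
capacity = kv_token_capacity(80, weights, 5, per_token)

print(f"weights + overhead: {weights:.1f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"token capacity on an 80 GB GPU: {capacity:,}")
print(f"concurrent 16,384-token chains: {capacity // 16_384}")
```

Swapping in 141 or 96 for the GPU size reproduces the other rows; the point of the sketch is that KV capacity, not weight size, decides how many long traces a single GPU can serve at once.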
For INT4 quantization: Magistral Small fits comfortably on an RTX PRO 6000 rental at INT4, but reasoning accuracy degrades more than on standard generation tasks. Quantization errors compound through long thinking chains; a small precision error early in a 15,000-token reasoning trace can shift logical inferences and produce wrong answers. FP8 on H100-class hardware costs roughly 1-2% or less in accuracy on AIME benchmarks and is the correct default for production deployments.
For H200 instances on Spheron, single-GPU configs work well for Magistral Small with extended context. The H200's 141 GB VRAM gives roughly 2x the KV cache headroom of an H100, which matters when serving longer reasoning chains.
Step-by-Step vLLM Deployment on Spheron GPU Cloud
Step 1: Provision a Spheron GPU instance
Log in at app.spheron.ai and go to GPU Cloud. Select H100 SXM5 with a count of 1 for Magistral Small. Use the PyTorch 2.5 / CUDA 12.4 base image. Enable spot pricing to reduce hourly cost. Provision at least 200 GB of persistent storage for model weights and vLLM cache.
# Verify GPU setup after SSH access
nvidia-smi
nvidia-smi topo -m # confirm NVLink topology on multi-GPU configs
Step 2: Install vLLM and dependencies
pip install "vllm>=0.8.0"
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=your_hf_token_here
If the Magistral repository is gated, accept the license terms at the model card on HuggingFace before running the download.
Step 3: Download Magistral weights
# Magistral Small (24B; ~24 GB of weights at FP8)
huggingface-cli download mistralai/Magistral-Small-2506 \
  --local-dir /models/magistral-small
Note: mistralai/Magistral-Small-2506 is a provisional model ID based on Mistral's naming convention. Verify the exact repository name on the official HuggingFace model card before downloading, as IDs sometimes change between announcement and final release. Magistral Medium weights are not publicly available (the repository returns 401 Unauthorized); use the Mistral API to access Medium.
Use persistent storage on your Spheron instance to avoid re-downloading on restarts.
Step 4: Launch vLLM with reasoning template
Magistral uses a chain-of-thought reasoning architecture. vLLM does not currently ship a built-in reasoning parser named mistral or magistral; parse <think>...</think> blocks client-side (as shown in Step 5) rather than relying on a server-side parser flag. Check the vLLM reasoning outputs documentation for any newly added Magistral parser at your installed version.
For Magistral Small on 1x H100:
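vllm serve /models/magistral-small \
  --served-model-name magistral-small \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8_e5m2 \
  --port 8000
--quantization fp8 selects FP8 weights; vLLM's --dtype flag accepts only floating-point types such as auto, bfloat16, and float16, so --dtype fp8 is rejected at startup. Check the model card for any additional recommended vllm serve flags (for example, Mistral-specific tokenizer options such as --tokenizer-mode mistral). --max-model-len 32768 keeps context within the 40K recommended ceiling from the Magistral Small model card; do not push to 128K in production. --enable-chunked-prefill is important for reasoning workloads. The prefill phase must process the entire system prompt and conversation context before generating reasoning tokens, and chunked prefill breaks long prefills into smaller chunks so other requests are not blocked.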
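Loading a 24B model can take several minutes, so it helps to wait for the server before sending the first query. The following is a minimal sketch that polls the OpenAI-compatible /v1/models route until it answers; the function names, attempt count, and delay are illustrative choices rather than anything vLLM prescribes.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, probe=http_probe, attempts=60, delay_s=5.0,
                     sleep=time.sleep):
    """Poll until the probe succeeds.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. `probe` and `sleep` are injectable so the loop can be
    exercised without a live server.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay_s)
    return None


# Example (requires the server from the command above to be starting up):
#   ready = wait_until_ready("http://localhost:8000/v1/models")
#   if ready is None:
#       raise SystemExit("vLLM did not come up; check the server logs")
```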
Step 5: Configure chain-of-thought output
Magistral generates <think>...</think> tokens before the visible response. Here is how to work with the reasoning output using the OpenAI-compatible client:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused"
)

# Reasoning query: math problem
response = client.chat.completions.create(
    model="magistral-small",
    messages=[{"role": "user", "content": "Prove that the sum of angles in a triangle is 180 degrees."}],
    max_tokens=8192,  # budget for thinking + response
)
full_output = response.choices[0].message.content

# Parse thinking trace from response
# full_output can be None when vLLM places reasoning into reasoning_content
if full_output and "</think>" in full_output:
    thinking_trace, final_answer = full_output.split("</think>", 1)
    thinking_trace = thinking_trace.replace("<think>", "").strip()
    final_answer = final_answer.strip()
    print(f"Thinking ({len(thinking_trace.split())} words):")
    print(thinking_trace[:500] + "..." if len(thinking_trace) > 500 else thinking_trace)
    print("\nFinal answer:")
    print(final_answer)
else:
    reasoning = getattr(response.choices[0].message, 'reasoning_content', None)
    if reasoning:
        print("Reasoning content:")
        print(reasoning)
    else:
        print(full_output)
For token budget control, set max_tokens based on the expected reasoning depth. Competition math problems typically need 8,000-16,000 tokens total. For simpler tasks, 2,000-4,000 tokens is sufficient. Unconstrained reasoning chains on hard problems can exceed 20,000 tokens; set an appropriate cap to avoid blocking the serving queue.
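One way to apply that guidance consistently is a small helper that picks max_tokens per request. This is a sketch only: the tier names are hypothetical labels, the budgets take the upper end of the ranges quoted above, and the 32,768 default mirrors the --max-model-len used in Step 4.

```python
# Hypothetical task tiers; the numbers are the upper ends of the ranges above.
TOKEN_BUDGETS = {
    "simple": 4_000,
    "competition_math": 16_000,
}


def pick_max_tokens(tier: str, prompt_tokens: int,
                    max_model_len: int = 32_768,
                    hard_cap: int = 20_000) -> int:
    """Choose a max_tokens value that respects the tier budget, a global cap,
    and the room left in the context window after the prompt."""
    budget = min(TOKEN_BUDGETS[tier], hard_cap)
    room = max_model_len - prompt_tokens
    if room <= 0:
        raise ValueError("prompt already fills the context window")
    return min(budget, room)


print(pick_max_tokens("simple", prompt_tokens=500))
print(pick_max_tokens("competition_math", prompt_tokens=30_000))
```

Clamping to the remaining context matters because vLLM counts prompt and generated tokens against the same --max-model-len limit.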
Step 6: Test the deployment
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "magistral-small",
    "messages": [{"role": "user", "content": "If a train travels 120 km in 1.5 hours, what is its average speed in m/s?"}],
    "max_tokens": 4096
  }'
A working deployment returns a response containing a <think> block with the step-by-step calculation followed by the final numerical answer. If you see only a short response without any reasoning, verify that the model loaded correctly and that you are using the reasoning-native Magistral model rather than a fine-tuned instruct-only variant.
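The same check can be automated. For this prompt the expected result is 120 km / 1.5 h = 80 km/h, which is about 22.2 m/s, so a smoke test can look for both the closing </think> tag and that value. The helper below is an illustrative sketch that operates on the response text; the function name and the substring match are assumptions, and a stricter test would parse the number.

```python
def check_smoke_response(content: str, expected_fragment: str = "22.2") -> dict:
    """Report whether a response carries a reasoning trace and the expected answer.

    Only the text after </think> is searched for the answer, so a correct
    number that appears solely inside the trace does not count.
    """
    has_trace = "</think>" in content
    answer = content.split("</think>", 1)[1] if has_trace else content
    return {
        "has_trace": has_trace,
        "answer_ok": expected_fragment in answer,
    }


sample = (
    "<think>120 km / 1.5 h = 80 km/h; 80 / 3.6 = 22.22 m/s</think>"
    "The average speed is about 22.2 m/s."
)
print(check_smoke_response(sample))
```

If your vLLM version routes the trace into reasoning_content instead of the message content, has_trace will be False even on a healthy deployment; check that field separately in that case.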
# Verify GPU utilization during active inference
nvidia-smi dmon -s u
Expect GPU utilization above 80% during active reasoning generation. If utilization drops below 60%, increase batch size with --max-num-seqs 64 or higher.
Benchmarks: Magistral vs DeepSeek R2 and Nemotron Ultra
| Model | AIME 2024 pass@1 | MATH-500 | HumanEval | MMLU |
|---|---|---|---|---|
| Magistral Small (24B) | ~70.7% | ~93.7% | ~87% | ~85% |
| Magistral Medium (API only) | ~73.6% | ~95.1% | ~92% | ~89% |
| DeepSeek R2 Distill 70B | ~79.8% | ~94.9% | ~86% | ~83% |
| Nemotron Ultra 253B | ~83.2% | ~94.8% | ~93% | ~88% |
Note: AIME 2024 pass@1 figures for Magistral are from Mistral's official release. MATH-500, HumanEval, and MMLU figures for Magistral are approximate community evaluations not independently verified by Mistral's published release; treat them as directional estimates. The non-Magistral figures combine community evaluations; verify against formal publications before making model selection decisions.
On the AIME 2024 figures above, Magistral Medium (~73.6%) trails both DeepSeek R2 Distill 70B (~79.8%) and Nemotron Ultra (~83.2%), while roughly matching them on MATH-500; it is also accessible only via Mistral's API. For teams needing an open-source self-hosted alternative to DeepSeek R2, Magistral Small at 24B parameters is notably strong for its size on math benchmarks, ships under Apache 2.0, and is the practical self-hosted choice when GPU cost is the constraint.
For general production use, benchmark results matter less than latency on your actual query distribution. Reasoning models generate widely varying output lengths; run your own benchmark with representative queries before committing to a hardware config.
Inference Cost Analysis: Spheron vs RunPod vs Together AI
Prices fetched from the Spheron live pricing API. RunPod and Together AI rates change frequently; verify at their respective pricing pages before making deployment decisions.
| Provider | GPU | On-Demand $/hr | Spot $/hr | Est. $/1M tokens (spot) |
|---|---|---|---|---|
| Spheron | 1x H100 SXM5 | $3.90 | $1.73 | ~$0.43-0.58 |
| Spheron | 1x H200 SXM5 | $2.51 | $1.40 | ~$0.35-0.50 |
| RunPod | H100 SXM5 | Check runpod.io | Check runpod.io | varies |
| Together AI | Serverless | Per-token billing | n/a | check pricing |
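Token throughput assumptions: Magistral Small on 1x H100 at FP8 sustains approximately 1,200-1,800 tok/s; at 70% utilization that yields roughly 3-4.5M tokens per hour (1,200 × 3,600 × 0.7 ≈ 3.0M; 1,800 × 3,600 × 0.7 ≈ 4.5M). The estimated $/1M tokens column reflects spot pricing and, for the H100 row, a conservative 3-4M tokens per hour ($1.73 / 4M ≈ $0.43; $1.73 / 3M ≈ $0.58).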
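The per-token estimate is simple enough to recompute whenever prices or throughput change. The sketch below uses the hourly rates and throughput range quoted in this section; the function names are illustrative. At the fast end of the range the result comes out slightly below the table's lower bound, which corresponds to about 4M tokens per hour.

```python
def tokens_per_hour(tok_per_s: float, utilization: float) -> float:
    """Tokens generated in one hour at the given average utilization."""
    return tok_per_s * 3600 * utilization


def usd_per_million_tokens(hourly_usd: float, tok_per_s: float,
                           utilization: float = 0.70) -> float:
    """Hourly GPU price divided by millions of tokens produced per hour."""
    millions = tokens_per_hour(tok_per_s, utilization) / 1e6
    return hourly_usd / millions


for label, hourly in [("H100 spot", 1.73), ("H100 on-demand", 3.90)]:
    slow = usd_per_million_tokens(hourly, 1_200)
    fast = usd_per_million_tokens(hourly, 1_800)
    print(f"{label}: ${fast:.2f}-${slow:.2f} per 1M tokens")
```

Measured throughput on your own query mix should replace the 1,200-1,800 tok/s assumption before the numbers are used for budgeting.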
For bursty reasoning workloads, Spheron's per-second billing works in your favor. Reasoning chains generate variable-length outputs, from 1,000 tokens on simple tasks to 20,000+ tokens on hard math problems. You pay only for the seconds the GPU is active, rather than a minimum block charge. Spot instances bring the H200 rate to $1.40/hr, the lowest per-token cost for Magistral Small in a single-GPU config. For production workloads where spot preemption is not acceptable, on-demand H100 SXM5 at $3.90/hr or H200 SXM5 at $2.51/hr keeps Magistral Small serving continuously.
Pricing fluctuates based on GPU availability. The prices above are as of 27 May 2026 and may have changed. Check current GPU pricing for live rates.
Production Tips: KV Cache, Speculative Decoding, and Monitoring
KV cache for long reasoning chains. Magistral reasoning traces run 5,000-25,000 tokens on math and science problems. Each token holds KV state in memory until the request completes. Enable --kv-cache-dtype fp8_e5m2 to cut KV cache memory by half with negligible accuracy loss on H100 hardware. Set --max-model-len to your 95th-percentile trace length: if 95% of your reasoning chains complete within 16,384 tokens, use that value rather than 32,768 or higher. Over-provisioning max-model-len reserves memory for worst-case requests that may never arrive, reducing available slots for other concurrent users.
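Picking the 95th-percentile value is easy to do from logged request lengths. The sketch below assumes you have a list of total token counts per request (prompt plus generated tokens, since both count against --max-model-len); the nearest-rank percentile, the rounding to a multiple of 1,024, and the 40,000-token ceiling are illustrative choices, the last one reflecting the model card's recommended maximum.

```python
import math


def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_max_model_len(total_lengths, pct: int = 95,
                          multiple: int = 1024, ceiling: int = 40_000) -> int:
    """Round the chosen percentile up to a tidy multiple, capped at the ceiling."""
    p = percentile_nearest_rank(total_lengths, pct)
    rounded = math.ceil(p / multiple) * multiple
    return min(rounded, ceiling)


# Hypothetical log of per-request totals: 1,000, 2,000, ..., 20,000 tokens.
observed = list(range(1_000, 21_000, 1_000))
print(suggest_max_model_len(observed))
```

Requests longer than the chosen value will be rejected or truncated, so pair this setting with request-level max_tokens caps.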
Speculative decoding. As of vLLM v0.8+, speculative decoding works for standard generation but compatibility with reasoning trace tokens is model-dependent. The structured pattern of Magistral's thinking tokens (mathematical notation, step-by-step logical chains) makes them good candidates for speculative decoding, but verify against your specific vLLM version before enabling in production. A mismatch between the draft and main model's handling of <think> tokens can corrupt the reasoning trace.
Monitoring. Expose the vLLM Prometheus metrics endpoint at /metrics. Key metrics for reasoning workloads:
- vllm:gpu_cache_usage_perc: KV cache utilization. If this stays above 90%, reduce --max-model-len or add GPUs.
- vllm:num_requests_running: concurrent active requests. High counts with low GPU utilization indicate scheduling overhead.
- Use nvidia-smi dmon -s u for real-time per-GPU utilization.
Production readiness checklist:
- Enable --kv-cache-dtype fp8_e5m2 to maximize KV cache capacity for long thinking chains
- Set --max-model-len to 95th-percentile reasoning trace length, not theoretical maximum
- Enable --enable-chunked-prefill for long-context workloads
- Add request-level max_tokens caps to prevent runaway 30,000+ token reasoning chains from blocking the queue
- Monitor vllm:gpu_cache_usage_perc and alert above 85%
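If you are not yet running a full Prometheus and Alertmanager stack, the last checklist item can be approximated with a short script. This sketch parses the text returned by the /metrics endpoint and flags the cache gauge above a threshold. It assumes the gauge is reported as a fraction between 0 and 1 and that lines carry no trailing timestamp; confirm both, and the exact metric name, against the /metrics output of your installed vLLM version.

```python
def parse_prometheus_gauges(text: str, metric_name: str) -> list[float]:
    """Collect every sample value for one metric from Prometheus text output."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            values.append(float(value))
    return values


def cache_alert(text: str, threshold: float = 0.85,
                metric: str = "vllm:gpu_cache_usage_perc") -> bool:
    """True when any sample of the cache gauge exceeds the threshold."""
    values = parse_prometheus_gauges(text, metric)
    return bool(values) and max(values) > threshold


# Illustrative payload; a real one comes from GET http://localhost:8000/metrics
sample_metrics = (
    "# TYPE vllm:gpu_cache_usage_perc gauge\n"
    'vllm:gpu_cache_usage_perc{model_name="magistral-small"} 0.91\n'
    'vllm:num_requests_running{model_name="magistral-small"} 12.0\n'
)
print(cache_alert(sample_metrics))
```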
Magistral brings reasoning-native inference to open-source deployment. On Spheron's H100 and H200 bare-metal instances, per-second billing means you pay for thinking time, not idle time between queries.
Quick Setup Guide
Select Magistral Small (open-source, Apache 2.0) for self-hosted deployments on H100 SXM5 or H200 SXM5. For budget workloads with shorter contexts, Magistral Small with FP8 on a single H100 (80 GB) covers most production reasoning tasks. To access Magistral Medium, use the Mistral API. It is not available for self-hosting.
Log in to app.spheron.ai, navigate to GPU Cloud, select H100 SXM5 or H200 SXM5, and deploy with the PyTorch 2.5 / CUDA 12.4 base image. Enable spot pricing to reduce hourly cost. Provision at least 200 GB of persistent storage for model weights and vLLM cache.
Run pip install 'vllm>=0.8.0' and pip install huggingface_hub hf_transfer. Export HF_TOKEN with your HuggingFace access token and set HF_HUB_ENABLE_HF_TRANSFER=1. Visit the Magistral model card on HuggingFace and accept the license terms before downloading weights.
Use huggingface-cli download mistralai/Magistral-Small-2506 for the 24B self-hostable model. Verify the exact model ID on the official HuggingFace page at publish time. Magistral Medium weights are not publicly available; access Medium via the Mistral API instead. Weights for Small occupy approximately 24 GB at FP8.
For Magistral Small on 1x H100: vllm serve mistralai/Magistral-Small-2506 --quantization fp8 --max-model-len 32768 --port 8000. FP8 is selected with --quantization; vLLM's --dtype flag takes floating-point types such as bfloat16, not fp8. vLLM does not currently ship a built-in reasoning parser for Magistral; parse <think>...</think> blocks client-side as shown in Step 5 of this guide, or check the vLLM reasoning outputs docs for any newly added parser at your installed version.
Send a math or logic prompt to /v1/chat/completions and verify the response contains a <think>...</think> block. Set max_tokens to at least 8192 to allow full reasoning depth. Use nvidia-smi dmon -s u to verify GPU utilization stays above 80% under load.
Add --kv-cache-dtype fp8_e5m2 to halve KV cache memory with near-zero accuracy impact. Set --max-model-len to the 95th-percentile reasoning trace length for your workload. Reasoning chains on Magistral typically run 5,000-25,000 tokens on math and science problems; size the KV cache budget accordingly.
Frequently Asked Questions
How much GPU memory does Magistral need?
Magistral Small (24B parameters) at FP8 needs roughly 28 GB of GPU memory for weights plus activation and framework overhead, before KV cache. A single H100 SXM5 (80 GB) handles it with plenty of headroom for long reasoning chains. For budget workloads, Magistral Small with INT4 quantization runs on a single RTX PRO 6000 (96 GB), though FP8 is recommended for reasoning accuracy. Magistral Medium is available only via the Mistral API and cannot be self-hosted.
What is the difference between Magistral Small and Magistral Medium?
Magistral Small (24B parameters) is open-source under Apache 2.0 and fully self-hostable. It scores approximately 70.7% on AIME 2024 (pass@1). Magistral Medium is proprietary. Mistral has not publicly disclosed its parameter count, and the weights are not available for download. It is accessible only via Mistral's API or Le Chat, not for self-hosting. Magistral Medium scores approximately 73.6% on AIME 2024 (pass@1). For self-hosted production reasoning tasks, Magistral Small on a single H100 spot instance at $1.73/hr on Spheron covers the large majority of use cases.
How do I handle Magistral's chain-of-thought output in vLLM?
Magistral generates internal reasoning tokens wrapped in <think>...</think> tags before the visible response. vLLM does not currently ship a dedicated reasoning parser for Magistral; parse the <think>...</think> blocks client-side by splitting the response on '</think>' and taking the second element for the final answer. Check the vLLM reasoning outputs docs for any newly added parsers at your installed version. Set max_tokens high enough to cover both the thinking trace and the response. For most math and logic problems, 8192-16384 tokens total covers 95th-percentile reasoning depth. To shorten the thinking trace, use a system prompt instructing the model to keep reasoning concise; hard enforcement requires a max_tokens cap at the request level, and hiding the trace from users means stripping it client-side.
Is self-hosting Magistral Small cheaper than using Mistral's API?
At moderate query volumes, yes. Mistral's API pricing for Magistral runs around $0.50-2.00 per million tokens depending on model size and reasoning depth. Self-hosted Magistral Small on a single H100 spot instance at $1.73/hr generates roughly 3-4 million tokens per hour at 70% utilization, putting the cost at $0.43-0.58 per million tokens. The crossover point is approximately 300k-500k queries per month for Magistral Small, assuming average reasoning chains of 3,000-6,000 tokens. The main operational cost is managing the deployment rather than the GPU hours themselves. Magistral Medium is not available for self-hosting; it is accessible only via Mistral's API.
Can Magistral Small run at INT4?
Magistral Small can run at INT4 on hardware with sufficient VRAM, such as the RTX PRO 6000 (96 GB). However, INT4 is not recommended for reasoning workloads. Quantization errors compound through long thinking chains: a precision error that causes a minor output difference in a 500-token standard response can cascade through a 10,000-token reasoning trace and produce an incorrect final answer. FP8 on H100-class hardware reduces VRAM by 50% compared to BF16 at a cost of roughly 1-2% or less in accuracy on MATH-500 and AIME benchmarks, and is the correct default for production Magistral deployments. If hardware constraints force INT4, validate accuracy on your specific task set before deploying to production.
