Most teams deploying AI for production use cases end up running four separate model servers: one for general instruction following, one for reasoning, one for vision, one for code. Four instances to provision, four sets of weights to store, four inference servers to monitor. Mistral Small 4 changes this calculus. It's a 119B MoE model that handles all four capability classes in a single deployment, with roughly 6B parameters active per forward pass (8B including embeddings).
This guide covers everything you need to run it: exact GPU and VRAM requirements, step-by-step vLLM setup on Spheron, how to tune the reasoning budget per request, and a side-by-side cost comparison against running four separate models.
What Makes Mistral Small 4 Different
Mistral Small 4 uses a Mixture of Experts architecture: 119B total parameters with roughly 6B active per token during inference (8B including embedding and output layers). The router selects a subset of expert networks for each forward pass, keeping compute cost similar to a dense 6-8B model while retaining the capacity of a much larger network across diverse tasks.
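The compute saving is easy to sanity-check with the widely used ~2 × parameters FLOPs-per-token approximation. This is a rule of thumb, not a measured figure; the parameter counts are the ones quoted above:

```python
# Back-of-the-envelope: per-token compute of the MoE vs a hypothetical
# dense model of the same total size, using the ~2 * params FLOPs rule
# of thumb (an approximation, not a benchmark).
TOTAL_PARAMS = 119e9
ACTIVE_PARAMS = 8e9  # ~6B routed experts + embedding/output layers

flops_moe = 2 * ACTIVE_PARAMS
flops_dense = 2 * TOTAL_PARAMS
print(f"MoE per-token FLOPs:   {flops_moe:.2e}")
print(f"Dense per-token FLOPs: {flops_dense:.2e}")
print(f"Compute ratio: ~{flops_dense / flops_moe:.0f}x cheaper per token")
```

The catch: every one of the 119B parameters still has to sit in VRAM so the router can reach any expert, which is why memory sizing is driven by the total count, not the active count.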
What makes it worth attention in 2026 is the breadth of what a single checkpoint covers:
Instruction following. Standard chat and completion workloads. The model follows complex multi-turn instructions without needing special prompting or system prompt tricks.
Configurable reasoning. Mistral Small 4 supports think tokens: an internal chain-of-thought scratchpad the model uses before generating its final answer. You control the reasoning depth per request via the reasoning_effort field. Use "low" for fast answers, "high" for complex problem-solving.
Vision inputs. The model processes image inputs alongside text, covering use cases like document understanding, chart reading, and visual question answering in a single deployment.
Code generation. Competitive on standard coding benchmarks across multiple languages. Sufficient for most production code generation workloads without a separate code-specialized model.
| Property | Value |
|---|---|
| Total parameters | 119B |
| Active parameters per token | ~6B (8B with embeddings) |
| Context window | 256K tokens |
| Quantization support | BF16, FP8, INT4 |
| License | Apache 2.0 |
Our MoE inference optimization guide covers expert routing overhead and parallelism strategies in depth.
GPU Hardware Requirements
The VRAM math is straightforward. At BF16, 119B parameters × 2 bytes = 238 GB for weights alone. With the standard 15% activation and framework overhead, you're looking at roughly 275 GB runtime footprint before any KV cache. That rules out most single-node setups at BF16.
At FP8, 119B × 1 byte = approximately 120 GB for weights. Add 15% overhead and you're at around 138 GB before KV cache. Two H200 SXM5 GPUs give you 282 GB total VRAM, which fits the FP8 weights comfortably and leaves roughly 130-140 GB for KV cache at max-model-len 32K.
| Config | Precision | Weights + overhead | KV Cache Headroom | Notes |
|---|---|---|---|---|
| 2x H200 SXM5 (282 GB) | FP8 | ~138 GB weights + overhead | ~130 GB | Recommended |
| 4x H100 SXM5 (320 GB) | FP8 | ~138 GB weights + overhead | ~170 GB | Same weights, more KV cache room |
| 2x H100 SXM5 (160 GB) | INT4 | ~60 GB weights + overhead | ~90 GB | Budget option, shorter contexts |
| 1x H200 SXM5 (141 GB) | INT4 | ~60 GB weights + overhead | ~70 GB | Minimum viable, limited concurrency |
Why 2x H200 over 4x H100? The H200 SXM5 pair communicates over NVLink, and tensor parallelism across two GPUs involves far less all-reduce traffic than across four, so communication overhead stays low. For a 2-GPU setup, that bandwidth advantage translates to meaningful throughput gains. Four H100s give you more total VRAM (320 GB) and are the better choice if you need longer contexts or larger batch sizes.
For a general VRAM sizing reference, see the GPU memory requirements guide for LLMs.
Reserve your compute now: H200 instances on Spheron support 2x and 4x configs with NVLink. For H100 multi-GPU setups, see H100 GPU rental on Spheron.
Step-by-Step Deployment with vLLM on Spheron GPU Cloud
Step 1: Provision a Spheron GPU instance
Log in at app.spheron.ai and navigate to GPU Cloud. Select H200 SXM5 with a count of 2, or H100 SXM5 with a count of 4. Use the PyTorch 2.5 / CUDA 12.4 base image. Enable spot pricing to reduce your hourly cost. Provision at least 300 GB of persistent storage for model weights and vLLM cache.
```bash
# Verify GPU setup after SSH access
nvidia-smi
nvidia-smi topo -m  # confirm NVLink topology on H200
```
Step 2: Install vLLM and dependencies
```bash
pip install "vllm>=0.8.0"
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=your_hf_token_here
```
If the Mistral Small 4 repository is gated, accept the license at huggingface.co/mistralai/Mistral-Small-4 before running the download.
Step 3: Download weights
```bash
huggingface-cli download mistralai/Mistral-Small-4 \
  --local-dir /models/mistral-small-4
```
The FP8 checkpoint size depends on the release format; verify it from the model card. BF16 weights are approximately 238 GB. Use persistent storage to avoid re-downloading on instance restarts.
Step 4: Launch vLLM (2x H200 config)
```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 2 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --port 8000
```
--enable-chunked-prefill helps throughput on long-context requests without blocking shorter ones. --kv-cache-dtype fp8 cuts KV cache memory by half, giving you more room for concurrent requests.
Step 5: Launch vLLM (4x H100 config)
```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 4 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --port 8000
```
With 4x H100 and more total VRAM, you can push --max-model-len to 65K. For 128K context, change --max-model-len to 131072 and reduce --max-num-seqs to 64 to avoid KV cache pressure.
Step 6: Test the endpoint
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-4",
    "messages": [{"role": "user", "content": "Write a Python function to parse JSON with error handling."}],
    "max_tokens": 512
  }'
```
Expect the first request to be slower while vLLM warms up its CUDA graphs. Subsequent requests at similar lengths will be significantly faster.
See the vLLM multi-GPU production deployment guide for load balancing, health checks, and monitoring setup.
Configurable Reasoning Effort: Tuning Think Tokens Per Request
Mistral Small 4 includes a configurable reasoning mode. When enabled, the model generates an internal chain-of-thought before producing its final response. The thinking tokens are hidden in the API response by default.
You control the reasoning depth via reasoning_effort in the request body. Set it to "low" for fast instruct-style replies, "medium" for balanced reasoning, or "high" for problems that benefit from extended chain-of-thought. Omit the field to disable thinking entirely.
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",
)

# Fast mode: minimal thinking
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Summarize this document in three sentences."}],
    extra_body={"reasoning_effort": "low"},
)

# Deep reasoning mode
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Find the bug in this algorithm and explain why it fails."}],
    extra_body={"reasoning_effort": "high"},
)

# Pure instruct mode: no thinking
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Translate this to Spanish."}],
)
```
Here is the rough latency and quality trade-off by effort level:
| reasoning_effort | Use case | Latency impact |
|---|---|---|
| Not set | Translation, summarization, chat | Minimal |
| "low" | Code review, short explanations | Low |
| "medium" | Debugging, data analysis | Moderate |
| "high" | Complex reasoning, proofs, long-form analysis | High |
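A pattern that falls out of this table: route requests to an effort level by task category at the application layer. A minimal sketch, where the categories and the mapping are this guide's suggestion, not part of the model's API:

```python
# Illustrative task -> reasoning_effort routing. None means omit the
# field entirely (pure instruct mode). Mapping is a suggestion, not an API.
EFFORT_BY_TASK = {
    "translation": None,
    "summarization": None,
    "code_review": "low",
    "debugging": "medium",
    "proof": "high",
}

def build_extra_body(task: str) -> dict:
    """Return the extra_body dict for a chat.completions request."""
    effort = EFFORT_BY_TASK.get(task)
    return {"reasoning_effort": effort} if effort else {}

print(build_extra_body("debugging"))    # {'reasoning_effort': 'medium'}
print(build_extra_body("translation"))  # {}
```

Pass the result as extra_body=build_extra_body(task) in client calls, so a summarization request never pays the latency cost of a high-effort chain-of-thought.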
Benchmark Results: Mistral Small 4 vs Qwen 3.5 vs Llama 4 Scout
For official quality benchmarks (MMLU, HumanEval, MATH, vision tasks), see the Mistral Small 4 model card and the official Mistral release announcement. Published numbers for competing models: Qwen 3.5-35B-A3B scores approximately 79 on MMLU, 85 on HumanEval, and 87 on MATH. Llama 4 Scout lands around 79.2 on MMLU and 74 on HumanEval.
Throughput on 2x H200 (vLLM FP8, TP=2, batch=32, input=512 tokens, output=256 tokens) varies significantly by sequence length mix and batch composition. Run your own benchmark with your actual request distribution before committing to a hardware config.
For a broader model comparison, see DeepSeek vs Llama 4 vs Qwen3 for the current model landscape and benchmark context. For inference framework trade-offs across these model families, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
Cost Analysis: One Model vs Four Separate Deployments
Here is the practical comparison. The "before" scenario assumes each model runs on its own dedicated H100 SXM5 instance, using the lowest available on-demand rate.
Before: four separate model servers
| Role | Model | Config | $/hr | $/month |
|---|---|---|---|---|
| Instruct | Llama 3.3 70B | 1x H100 SXM5 | $2.54 | $1,855 |
| Reasoning | DeepSeek R1 70B | 1x H100 SXM5 | $2.54 | $1,855 |
| Vision | Qwen2.5-VL 32B | 1x H100 SXM5 | $2.54 | $1,855 |
| Code | Qwen2.5-Coder 32B | 1x H100 SXM5 | $2.54 | $1,855 |
| Total | — | 4x H100 SXM5 | $10.16 | $7,420 |
After: single Mistral Small 4 deployment
| Config | $/hr on-demand | $/hr spot | $/month on-demand | $/month spot |
|---|---|---|---|---|
| 2x H200 SXM5 | $7.92 | $2.38 | $5,782 | $1,738 |
| 4x H100 SXM5 | $10.16 | $3.20 | $7,420 | $2,336 |
The 2x H200 spot configuration at $1,738/month undercuts the four separate H100 spot deployments ($2,336/month) by approximately $598/month. On-demand, 2x H200 at $7.92/hr ($5,782/month) is also cheaper than four separate H100s at $10.16/hr ($7,420/month), saving roughly $1,638/month without touching spot pricing.
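The savings arithmetic is easy to reproduce at 730 hours/month, using the hourly rates from the tables above. Totals differ from the quoted monthly savings by a few dollars because the monthly figures in the tables are rounded:

```python
# Monthly cost deltas at 730 hours/month, using the hourly rates above.
HOURS_PER_MONTH = 730

four_h100_od   = 2.54 * 4 * HOURS_PER_MONTH  # four separate H100 servers
two_h200_od    = 7.92 * HOURS_PER_MONTH
four_h100_spot = 3.20 * HOURS_PER_MONTH
two_h200_spot  = 2.38 * HOURS_PER_MONTH

print(f"On-demand saving: ${four_h100_od - two_h200_od:,.0f}/month")
print(f"Spot saving:      ${four_h100_spot - two_h200_spot:,.0f}/month")
```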
Beyond the monthly cost, there are operational savings that are harder to quantify: one inference server to monitor instead of four, one set of weights to download and update, one API endpoint, and roughly half the total GPU count in the recommended config.
The 4x H100 config runs at the same on-demand cost as four separate H100 deployments ($10.16/hr) but collapses four services into one: one deployment, one API surface, one set of weights to manage.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Compare current GPU rental pricing on Spheron for the latest rates before making a deployment decision.
Production Tips
Context length scaling
Start with --max-model-len 32768. This is the sweet spot for most workloads: low KV cache pressure, good concurrency, and enough context for the majority of real-world tasks. For applications that need 64K+ context, move to the 4x H100 config for more total VRAM, or add --max-num-seqs 32 to limit concurrent requests and free KV cache for longer sequences.
For full 256K context on 2x H200, use --max-model-len 262144. Expect fewer concurrent sessions at maximum context depth:
```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 2 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --port 8000
```
Keep --kv-cache-dtype fp8 here: FP8 stores each KV entry in 1 byte instead of 2 bytes (BF16), halving KV cache memory per token. At 256K context each sequence consumes an enormous KV cache, so the memory saving is at its most valuable in this config.
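To see why the FP8 KV cache matters so much at 256K, here is the standard per-token KV formula. The layer and head counts below are placeholders, not Mistral Small 4's published architecture; substitute the real values from the model card:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# HYPOTHETICAL architecture numbers -- replace with the model card's values.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128  # illustrative GQA config
CONTEXT = 262_144  # 256K tokens

for dtype, nbytes in [("BF16", 2), ("FP8", 1)]:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * nbytes
    per_seq_gb = per_token * CONTEXT / 1e9
    print(f"{dtype}: {per_token} bytes/token, ~{per_seq_gb:.1f} GB per full-context sequence")
```

Whatever the real dimensions turn out to be, the FP8 line is exactly half the BF16 line, which is the halving that buys you extra concurrent sequences at full context.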
Batch size tuning
Set --max-num-seqs 256 for high-throughput batch inference. For latency-sensitive production serving, reduce to 32-64. Watch nvidia-smi dmon -s u to keep all GPUs above 80% utilization under load. If you see GPU utilization dropping below 60%, increase batch size. If you see frequent OOM warnings in vLLM logs, reduce it.
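The tuning rule above can be written down as a simple policy. A sketch where the thresholds mirror the prose and the clamping bounds (8 to 256) are this guide's suggestion:

```python
# Adjust --max-num-seqs from observed GPU utilization and OOM warnings,
# following the rule of thumb above. Bounds (8..256) are illustrative.
def adjust_max_num_seqs(current: int, gpu_util_pct: float,
                        oom_warnings: int) -> int:
    if oom_warnings > 0:
        return max(8, current // 2)    # memory pressure: back off first
    if gpu_util_pct < 60:
        return min(256, current * 2)   # GPUs starved: feed larger batches
    return current                     # healthy utilization: hold steady

print(adjust_max_num_seqs(64, 45.0, 0))   # prints 128
print(adjust_max_num_seqs(256, 90.0, 3))  # prints 128
```

Run the check on a slow cadence (minutes, not seconds), since restarting vLLM with a new --max-num-seqs is disruptive.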
Spot instance savings
For development, staging, and non-interactive batch jobs, spot instances cut the hourly rate by 30-60%. Spot preemption on Spheron gives advance notice; configure your vLLM wrapper to checkpoint the current request queue gracefully before shutdown. See the speculative decoding production guide for techniques to improve per-token latency on top of this setup.
FP8 KV cache
Adding --kv-cache-dtype fp8 cuts KV cache memory by roughly half with minimal quality impact on most benchmarks. This frees VRAM for longer sequences or larger concurrent batches without changing the GPU count. Note it is incompatible with --enable-prefix-caching, so choose based on your workload: prefix caching for high-repetition chat, FP8 KV cache for diverse or long-context workloads. The KV cache optimization guide covers eviction strategies and when each approach wins.
Mistral Small 4 on 2x H200 runs cleanly on Spheron's bare-metal multi-GPU instances: no virtualization overhead, full NVLink bandwidth, and spot pricing that cuts the hourly rate significantly below on-demand.
Rent H200 on Spheron → | Rent H100 → | View all GPU pricing →
