Most teams deploying AI for production use cases end up running four separate model servers: one for general instruction following, one for reasoning, one for vision, one for code. Four instances to provision, four sets of weights to store, four inference servers to monitor. Mistral Small 4 changes this calculus. It's a 119B MoE model that handles all four capability classes in a single deployment, with roughly 6B parameters active per forward pass (8B including embeddings).
This guide covers everything you need to run it: exact GPU and VRAM requirements, step-by-step vLLM setup on Spheron, how to tune the reasoning budget per request, and a side-by-side cost comparison against running four separate models.
What Makes Mistral Small 4 Different
Mistral Small 4 uses a Mixture of Experts architecture: 119B total parameters with roughly 6B active per token during inference (8B including embedding and output layers). The router selects a subset of expert networks for each forward pass, keeping compute cost similar to a dense 6-8B model while retaining the capacity of a much larger network across diverse tasks.
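The compute saving is easy to sanity-check with the widely used ~2 × parameters FLOPs-per-token approximation. This is a rule of thumb, not a measured figure; the parameter counts are the ones quoted above:

```python
# Back-of-the-envelope: per-token compute of the MoE vs a hypothetical
# dense model of the same total size, using the ~2 * params FLOPs rule
# of thumb (an approximation, not a benchmark).
TOTAL_PARAMS = 119e9
ACTIVE_PARAMS = 8e9  # ~6B routed experts + embedding/output layers

flops_moe = 2 * ACTIVE_PARAMS
flops_dense = 2 * TOTAL_PARAMS
print(f"MoE per-token FLOPs:   {flops_moe:.2e}")
print(f"Dense per-token FLOPs: {flops_dense:.2e}")
print(f"Compute ratio: ~{flops_dense / flops_moe:.0f}x cheaper per token")
```

The catch: every one of the 119B parameters still has to sit in VRAM so the router can reach any expert, which is why memory sizing is driven by the total count, not the active count.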
What makes it worth attention in 2026 is the breadth of what a single checkpoint covers:
Instruction following. Standard chat and completion workloads. The model follows complex multi-turn instructions without needing special prompting or system prompt tricks.
Configurable reasoning. Mistral Small 4 supports think tokens: an internal chain-of-thought scratchpad the model uses before generating its final answer. You control the reasoning depth per request via the reasoning_effort field. Use "low" for fast answers, "high" for complex problem-solving.
Vision inputs. The model processes image inputs alongside text, covering use cases like document understanding, chart reading, and visual question answering in a single deployment.
Code generation. Competitive on standard coding benchmarks across multiple languages. Sufficient for most production code generation workloads without a separate code-specialized model.
| Property | Value |
|---|---|
| Total parameters | 119B |
| Active parameters per token | ~6B (8B with embeddings) |
| Context window | 256K tokens |
| Quantization support | BF16, FP8, INT4 |
| License | Apache 2.0 |
Our MoE inference optimization guide covers expert routing overhead and parallelism strategies in depth.
GPU Hardware Requirements
The VRAM math is straightforward. At BF16, 119B parameters × 2 bytes = 238 GB for weights alone. With the standard 15% activation and framework overhead, you're looking at roughly 275 GB runtime footprint before any KV cache. That rules out most single-node setups at BF16.
At FP8, 119B × 1 byte = approximately 120 GB for weights. Add 15% overhead and you're at around 138 GB before KV cache. Two H200 SXM5 GPUs give you 282 GB total VRAM, which fits the FP8 weights comfortably and leaves roughly 130-140 GB for KV cache at max-model-len 32K.
| Config | Precision | Weights + overhead | KV Cache Headroom | Notes |
|---|---|---|---|---|
| 2x H200 SXM5 (282 GB) | FP8 | ~138 GB weights + overhead | ~130 GB | Recommended |
| 4x H100 SXM5 (320 GB) | FP8 | ~138 GB weights + overhead | ~170 GB | Same weights, more KV cache room |
| 2x H100 SXM5 (160 GB) | INT4 | ~60 GB weights + overhead | ~90 GB | Budget option, shorter contexts |
| 1x H200 SXM5 (141 GB) | INT4 | ~60 GB weights + overhead | ~70 GB | Minimum viable, limited concurrency |
Why 2x H200 over 4x H100? The H200 SXM5 pair communicates over NVLink, and tensor parallelism across two GPUs involves far less all-reduce traffic than across four, so communication overhead stays low. For a 2-GPU setup, that bandwidth advantage translates to meaningful throughput gains. Four H100s give you more total VRAM (320 GB) and are the better choice if you need longer contexts or larger batch sizes.
For a general VRAM sizing reference, see the GPU memory requirements guide for LLMs.
Reserve your compute now: H200 instances on Spheron support 2x and 4x configs with NVLink. For H100 multi-GPU setups, see H100 GPU rental on Spheron.
Step-by-Step Deployment with vLLM on Spheron GPU Cloud
Step 1: Provision a Spheron GPU instance
Log in at app.spheron.ai and navigate to GPU Cloud. Select H200 SXM5 with a count of 2, or H100 SXM5 with a count of 4. Use the PyTorch 2.5 / CUDA 12.4 base image. Enable spot pricing to reduce your hourly cost. Provision at least 300 GB of persistent storage for model weights and vLLM cache.
```bash
# Verify GPU setup after SSH access
nvidia-smi
nvidia-smi topo -m  # confirm NVLink topology on H200
```
Step 2: Install vLLM and dependencies
```bash
pip install "vllm>=0.8.0"
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=your_hf_token_here
```
If the Mistral Small 4 repository is gated, accept the license at huggingface.co/mistralai/Mistral-Small-4 before running the download.
Step 3: Download weights
```bash
huggingface-cli download mistralai/Mistral-Small-4 \
  --local-dir /models/mistral-small-4
```
The FP8 checkpoint size depends on the release format; verify it from the model card. BF16 weights are approximately 238 GB. Use persistent storage to avoid re-downloading on instance restarts.
Step 4: Launch vLLM (2x H200 config)
```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 2 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --port 8000
```
--enable-chunked-prefill helps throughput on long-context requests without blocking shorter ones. --kv-cache-dtype fp8 cuts KV cache memory by half, giving you more room for concurrent requests.
Step 5: Launch vLLM (4x H100 config)
```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 4 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --port 8000
```
With 4x H100 and more total VRAM, you can push --max-model-len to 65K. For 128K context, change --max-model-len to 131072 and reduce --max-num-seqs to 64 to avoid KV cache pressure.
Step 6: Test the endpoint
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-4",
    "messages": [{"role": "user", "content": "Write a Python function to parse JSON with error handling."}],
    "max_tokens": 512
  }'
```
Expect the first request to be slower while vLLM warms up its CUDA graphs. Subsequent requests at similar lengths will be significantly faster.
See the vLLM multi-GPU production deployment guide for load balancing, health checks, and monitoring setup.
Configurable Reasoning Effort: Tuning Think Tokens Per Request
Mistral Small 4 includes a configurable reasoning mode. When enabled, the model generates an internal chain-of-thought before producing its final response. The thinking tokens are hidden in the API response by default.
You control the reasoning depth via reasoning_effort in the request body. Set it to "low" for fast instruct-style replies, "medium" for balanced reasoning, or "high" for problems that benefit from extended chain-of-thought. Omit the field to disable thinking entirely.
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",
)

# Fast mode: minimal thinking
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Summarize this document in three sentences."}],
    extra_body={"reasoning_effort": "low"},
)

# Deep reasoning mode
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Find the bug in this algorithm and explain why it fails."}],
    extra_body={"reasoning_effort": "high"},
)

# Pure instruct mode: no thinking
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Translate this to Spanish."}],
)
```
Here is the rough latency and quality trade-off by effort level:
| reasoning_effort | Use case | Latency impact |
|---|---|---|
| Not set | Translation, summarization, chat | Minimal |
| "low" | Code review, short explanations | Low |
| "medium" | Debugging, data analysis | Moderate |
| "high" | Complex reasoning, proofs, long-form analysis | High |
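A pattern that falls out of this table: route requests to an effort level by task category at the application layer. A minimal sketch, where the categories and the mapping are this guide's suggestion, not part of the model's API:

```python
# Illustrative task -> reasoning_effort routing. None means omit the
# field entirely (pure instruct mode). Mapping is a suggestion, not an API.
EFFORT_BY_TASK = {
    "translation": None,
    "summarization": None,
    "code_review": "low",
    "debugging": "medium",
    "proof": "high",
}

def build_extra_body(task: str) -> dict:
    """Return the extra_body dict for a chat.completions request."""
    effort = EFFORT_BY_TASK.get(task)
    return {"reasoning_effort": effort} if effort else {}

print(build_extra_body("debugging"))    # {'reasoning_effort': 'medium'}
print(build_extra_body("translation"))  # {}
```

Pass the result as extra_body=build_extra_body(task) in client calls, so a summarization request never pays the latency cost of a high-effort chain-of-thought.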
Benchmark Results: Mistral Small 4 vs Qwen 3.5 vs Llama 4 Scout
For official quality benchmarks (MMLU, HumanEval, MATH, vision tasks), see the Mistral Small 4 model card and the official Mistral release announcement. Published numbers for competing models: Qwen 3.5-35B-A3B scores approximately 79 on MMLU, 85 on HumanEval, and 87 on MATH. Llama 4 Scout lands around 79.2 on MMLU and 74 on HumanEval.
Throughput on 2x H200 (vLLM FP8, TP=2, batch=32, input=512 tokens, output=256 tokens) varies significantly by sequence length mix and batch composition. Run your own benchmark with your actual request distribution before committing to a hardware config.
For a broader model comparison, see DeepSeek vs Llama 4 vs Qwen3 for the current model landscape and benchmark context. For inference framework trade-offs across these model families, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
Cost Analysis: One Model vs Four Separate Deployments
Here is the practical comparison. The "before" scenario assumes each model runs on its own dedicated H100 SXM5 instance, using the lowest available on-demand rate.
Before: four separate model servers
| Role | Model | Config | $/hr | $/month |
|---|---|---|---|---|
| Instruct | Llama 3.3 70B | 1x H100 SXM5 | $2.54 | $1,855 |
| Reasoning | DeepSeek R1 70B | 1x H100 SXM5 | $2.54 | $1,855 |
| Vision | Qwen2.5-VL 32B | 1x H100 SXM5 | $2.54 | $1,855 |
| Code | Qwen2.5-Coder 32B | 1x H100 SXM5 | $2.54 | $1,855 |
| Total | — | 4x H100 SXM5 | $10.16 | $7,420 |
After: single Mistral Small 4 deployment
| Config | $/hr on-demand | $/hr spot | $/month on-demand | $/month spot |
|---|---|---|---|---|
| 2x H200 SXM5 | $7.92 | $2.38 | $5,782 | $1,738 |
| 4x H100 SXM5 | $10.16 | $3.20 | $7,420 | $2,336 |
The 2x H200 spot configuration at $1,738/month undercuts the four separate H100 spot deployments ($2,336/month) by approximately $598/month. On-demand, 2x H200 at $7.92/hr ($5,782/month) is also cheaper than four separate H100s at $10.16/hr ($7,420/month), saving roughly $1,638/month without touching spot pricing.
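The savings arithmetic is easy to reproduce at 730 hours/month, using the hourly rates from the tables above. Totals differ from the quoted monthly savings by a few dollars because the monthly figures in the tables are rounded:

```python
# Monthly cost deltas at 730 hours/month, using the hourly rates above.
HOURS_PER_MONTH = 730

four_h100_od   = 2.54 * 4 * HOURS_PER_MONTH  # four separate H100 servers
two_h200_od    = 7.92 * HOURS_PER_MONTH
four_h100_spot = 3.20 * HOURS_PER_MONTH
two_h200_spot  = 2.38 * HOURS_PER_MONTH

print(f"On-demand saving: ${four_h100_od - two_h200_od:,.0f}/month")
print(f"Spot saving:      ${four_h100_spot - two_h200_spot:,.0f}/month")
```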
Beyond the monthly cost, there are operational savings that are harder to quantify: one inference server to monitor instead of four, one set of weights to download and update, one API endpoint, and roughly half the total GPU count in the recommended config.
The 4x H100 config runs at the same on-demand cost as four separate H100 deployments ($10.16/hr) but collapses four services into one: one deployment, one API surface, one set of weights to manage.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Compare current GPU rental pricing on Spheron for the latest rates before making a deployment decision.
Production Tips
Context length scaling
Start with --max-model-len 32768. This is the sweet spot for most workloads: low KV cache pressure, good concurrency, and enough context for the majority of real-world tasks. For applications that need 64K+ context, move to the 4x H100 config for more total VRAM, or add --max-num-seqs 32 to limit concurrent requests and free KV cache for longer sequences.
For full 256K context on 2x H200, use --max-model-len 262144. Expect fewer concurrent sessions at maximum context depth:
```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 2 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --port 8000
```
Keep --kv-cache-dtype fp8 here: FP8 stores each KV entry in 1 byte instead of 2 bytes (BF16), halving KV cache memory per token. At 256K context each sequence consumes an enormous KV cache, so the memory saving is at its most valuable in this config.
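To see why the FP8 KV cache matters so much at 256K, here is the standard per-token KV formula. The layer and head counts below are placeholders, not Mistral Small 4's published architecture; substitute the real values from the model card:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# HYPOTHETICAL architecture numbers -- replace with the model card's values.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128  # illustrative GQA config
CONTEXT = 262_144  # 256K tokens

for dtype, nbytes in [("BF16", 2), ("FP8", 1)]:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * nbytes
    per_seq_gb = per_token * CONTEXT / 1e9
    print(f"{dtype}: {per_token} bytes/token, ~{per_seq_gb:.1f} GB per full-context sequence")
```

Whatever the real dimensions turn out to be, the FP8 line is exactly half the BF16 line, which is the halving that buys you extra concurrent sequences at full context.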
Batch size tuning
Set --max-num-seqs 256 for high-throughput batch inference. For latency-sensitive production serving, reduce to 32-64. Watch nvidia-smi dmon -s u to keep all GPUs above 80% utilization under load. If you see GPU utilization dropping below 60%, increase batch size. If you see frequent OOM warnings in vLLM logs, reduce it.
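The tuning rule above can be written down as a simple policy. A sketch where the thresholds mirror the prose and the clamping bounds (8 to 256) are this guide's suggestion:

```python
# Adjust --max-num-seqs from observed GPU utilization and OOM warnings,
# following the rule of thumb above. Bounds (8..256) are illustrative.
def adjust_max_num_seqs(current: int, gpu_util_pct: float,
                        oom_warnings: int) -> int:
    if oom_warnings > 0:
        return max(8, current // 2)    # memory pressure: back off first
    if gpu_util_pct < 60:
        return min(256, current * 2)   # GPUs starved: feed larger batches
    return current                     # healthy utilization: hold steady

print(adjust_max_num_seqs(64, 45.0, 0))   # prints 128
print(adjust_max_num_seqs(256, 90.0, 3))  # prints 128
```

Run the check on a slow cadence (minutes, not seconds), since restarting vLLM with a new --max-num-seqs is disruptive.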
Spot instance savings
For development, staging, and non-interactive batch jobs, spot instances cut the hourly rate by 30-60%. Spot preemption on Spheron gives advance notice; configure your vLLM wrapper to checkpoint the current request queue gracefully before shutdown. See the speculative decoding production guide for techniques to improve per-token latency on top of this setup.
FP8 KV cache
Adding --kv-cache-dtype fp8 cuts KV cache memory by roughly half with minimal quality impact on most benchmarks. This frees VRAM for longer sequences or larger concurrent batches without changing the GPU count. Note it is incompatible with --enable-prefix-caching, so choose based on your workload: prefix caching for high-repetition chat, FP8 KV cache for diverse or long-context workloads. The KV cache optimization guide covers eviction strategies and when each approach wins.
Mistral Small 4 on 2x H200 runs cleanly on Spheron's bare-metal multi-GPU instances: no virtualization overhead, full NVLink bandwidth, and spot pricing that cuts the hourly rate significantly below on-demand.
Rent H200 on Spheron → | Rent H100 → | View all GPU pricing →
