Tutorial

Deploy Mistral Small 4: Self-Host the 119B MoE Model on GPU Cloud (2026)

Written by Mitrasish, Co-founder · Apr 21, 2026

Most teams deploying AI for production use cases end up running four separate model servers: one for general instruction following, one for reasoning, one for vision, one for code. Four instances to provision, four sets of weights to store, four inference servers to monitor. Mistral Small 4 changes this calculus. It's a 119B MoE model that handles all four capability classes in a single deployment, with roughly 6B parameters active per forward pass (8B including embeddings).

This guide covers everything you need to run it: exact GPU and VRAM requirements, step-by-step vLLM setup on Spheron, how to tune the reasoning budget per request, and a side-by-side cost comparison against running four separate models.

What Makes Mistral Small 4 Different

Mistral Small 4 uses a Mixture of Experts architecture: 119B total parameters with roughly 6B active per token during inference (8B including embedding and output layers). The router selects a subset of expert networks for each forward pass, keeping compute cost similar to a dense 6-8B model while retaining the capacity of a much larger network across diverse tasks.

What makes it worth attention in 2026 is the breadth of what a single checkpoint covers:

Instruction following. Standard chat and completion workloads. The model follows complex multi-turn instructions without needing special prompting or system prompt tricks.

Configurable reasoning. Mistral Small 4 supports think tokens: an internal chain-of-thought scratchpad the model uses before generating its final answer. You control the reasoning depth per request via the reasoning_effort field. Use "low" for fast answers, "high" for complex problem-solving.

Vision inputs. The model processes image inputs alongside text, covering use cases like document understanding, chart reading, and visual question answering in a single deployment.

Code generation. Competitive on standard coding benchmarks across multiple languages. Sufficient for most production code generation workloads without a separate code-specialized model.

| Property | Value |
| --- | --- |
| Total parameters | 119B |
| Active parameters per token | ~6B (8B with embeddings) |
| Context window | 256K tokens |
| Quantization support | BF16, FP8, INT4 |
| License | Apache 2.0 |

Our MoE inference optimization guide covers expert routing overhead and parallelism strategies in depth.

GPU Hardware Requirements

The VRAM math is straightforward. At BF16, 119B parameters × 2 bytes = 238 GB for weights alone. Add a typical ~15% for activations and framework overhead and you're looking at a roughly 275 GB runtime footprint before any KV cache. That rules out most single-node setups at BF16.

At FP8, 119B × 1 byte = approximately 120 GB for weights. Add 15% overhead and you're at around 138 GB before KV cache. Two H200 SXM5 GPUs give you 282 GB total VRAM, which fits the FP8 weights comfortably and leaves roughly 130-140 GB for KV cache at max-model-len 32K.
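The sizing arithmetic above is worth keeping as a reusable few lines; note the 15% overhead factor and per-GPU capacities are this guide's working assumptions, not measured values:

```python
# Back-of-envelope VRAM sizing. The 15% overhead factor is the rough
# allowance used in this guide, not a measured number.
def runtime_footprint_gb(params_b: float, bytes_per_param: float,
                         overhead: float = 0.15) -> float:
    """Weights plus activation/framework overhead, before KV cache."""
    return params_b * bytes_per_param * (1 + overhead)

bf16 = runtime_footprint_gb(119, 2)   # ~274 GB: beyond most single nodes
fp8 = runtime_footprint_gb(119, 1)    # ~137 GB

# Headroom left for KV cache on 2x H200 SXM5 (141 GB each), before
# vLLM's own memory reservations
headroom = 2 * 141 - fp8              # ~145 GB
```

Re-run it with the real checkpoint size from the model card once you know which precision you're deploying.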

| Config | Precision | Weights + overhead | KV cache headroom | Notes |
| --- | --- | --- | --- | --- |
| 2x H200 SXM5 (282 GB) | FP8 | ~138 GB | ~130 GB | Recommended |
| 4x H100 SXM5 (320 GB) | FP8 | ~138 GB | ~170 GB | Same weights, more KV cache room |
| 2x H100 SXM5 (160 GB) | INT4 | ~60 GB | ~90 GB | Budget option, shorter contexts |
| 1x H200 SXM5 (141 GB) | INT4 | ~60 GB | ~70 GB | Minimum viable, limited concurrency |

Why 2x H200 over 4x H100? The H200 SXM5 uses NVLink interconnect between GPUs, and splitting the model across two GPUs instead of four significantly reduces tensor parallelism communication overhead. For a 2-GPU setup, the bandwidth advantage translates to meaningful throughput gains. Four H100s give more total VRAM (320 GB) and are the better choice if you need longer contexts or larger batch sizes.

For a general VRAM sizing reference, see the GPU memory requirements guide for LLMs.

Reserve your compute now: H200 instances on Spheron support 2x and 4x configs with NVLink. For H100 multi-GPU setups, see H100 GPU rental on Spheron.

Step-by-Step Deployment with vLLM on Spheron GPU Cloud

Step 1: Provision a Spheron GPU instance

Log in at app.spheron.ai and navigate to GPU Cloud. Select H200 SXM5 with a count of 2, or H100 SXM5 with a count of 4. Use the PyTorch 2.5 / CUDA 12.4 base image. Enable spot pricing to reduce your hourly cost. Provision at least 300 GB of persistent storage for model weights and vLLM cache.

```bash
# Verify GPU setup after SSH access
nvidia-smi
nvidia-smi topo -m  # confirm NVLink topology on H200
```

Step 2: Install vLLM and dependencies

```bash
pip install "vllm>=0.8.0"
pip install huggingface_hub hf_transfer

export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=your_hf_token_here
```

If the Mistral Small 4 repository is gated, accept the license at huggingface.co/mistralai/Mistral-Small-4 before running the download.

Step 3: Download weights

```bash
huggingface-cli download mistralai/Mistral-Small-4 \
  --local-dir /models/mistral-small-4
```

The FP8 checkpoint size depends on the release format; verify from the model card. BF16 weights are approximately 238 GB. Use persistent storage to avoid re-downloading on instance restarts.

Step 4: Launch vLLM (2x H200 config)

```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 2 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --port 8000
```

--enable-chunked-prefill helps throughput on long-context requests without blocking shorter ones. --kv-cache-dtype fp8 cuts KV cache memory by half, giving you more room for concurrent requests.

Step 5: Launch vLLM (4x H100 config)

```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 4 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --port 8000
```

With 4x H100 and more total VRAM, you can push --max-model-len to 65K. For 128K context, change --max-model-len to 131072 and reduce --max-num-seqs to 64 to avoid KV cache pressure.

Step 6: Test the endpoint

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-4",
    "messages": [{"role": "user", "content": "Write a Python function to parse JSON with error handling."}],
    "max_tokens": 512
  }'
```

Expect the first request to be slower as vLLM warms up the CUDA graphs. Subsequent requests at similar lengths will be significantly faster.
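To quantify the warmup effect, a small timing wrapper works with any HTTP client; `send_request` below is a placeholder for whatever call you make to the endpoint:

```python
import statistics
import time

def warmup_profile(send_request, n: int = 5) -> dict:
    """Time n identical calls; the first one includes warmup cost."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return {
        "first_s": latencies[0],
        "steady_s": statistics.median(latencies[1:]),
    }
```

Run it once per representative request shape (prompt length, max_tokens) so the steady-state number reflects your actual traffic.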

See the vLLM multi-GPU production deployment guide for load balancing, health checks, and monitoring setup.

Configurable Reasoning Effort: Tuning Think Tokens Per Request

Mistral Small 4 includes a configurable reasoning mode. When enabled, the model generates an internal chain-of-thought before producing its final response. The thinking tokens are hidden in the API response by default.

You control the reasoning depth via reasoning_effort in the request body. Set it to "low" for fast instruct-style replies, "medium" for balanced reasoning, or "high" for problems that benefit from extended chain-of-thought. Omit the field to disable thinking entirely.

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused"
)

# Fast mode: minimal thinking
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Summarize this document in three sentences."}],
    extra_body={"reasoning_effort": "low"}
)

# Deep reasoning mode
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Find the bug in this algorithm and explain why it fails."}],
    extra_body={"reasoning_effort": "high"}
)

# Pure instruct mode: no thinking
response = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Translate this to Spanish."}],
)
```

Here is the rough latency and quality trade-off by effort level:

| reasoning_effort | Use case | Latency impact |
| --- | --- | --- |
| Not set | Translation, summarization, chat | Minimal |
| "low" | Code review, short explanations | Low |
| "medium" | Debugging, data analysis | Moderate |
| "high" | Complex reasoning, proofs, long-form analysis | High |

Benchmark Results: Mistral Small 4 vs Qwen 3.5 vs Llama 4 Scout

For official quality benchmarks (MMLU, HumanEval, MATH, vision tasks), see the Mistral Small 4 model card and the official Mistral release announcement. Published numbers from competing models: Qwen 3.5-35B-A3B scores approximately 79 on MMLU, 85 on HumanEval, and 87 on MATH. Llama 4 Scout lands around 79.2 on MMLU and 74 on HumanEval.

Throughput on 2x H200 (vLLM FP8, TP=2, batch=32, input=512 tokens, output=256 tokens) varies significantly by sequence length mix and batch composition. Run your own benchmark with your actual request distribution before committing to a hardware config.

For a broader model comparison, see DeepSeek vs Llama 4 vs Qwen3 for the current model landscape and benchmark context. For inference framework trade-offs across these model families, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Cost Analysis: One Model vs Four Separate Deployments

Here is the practical comparison. The "before" scenario assumes each model runs on its own dedicated H100 SXM5 instance, using the lowest available on-demand rate.

Before: four separate model servers

| Role | Model | Config | $/hr | $/month |
| --- | --- | --- | --- | --- |
| Instruct | Llama 3.3 70B | 1x H100 SXM5 | $2.54 | $1,855 |
| Reasoning | DeepSeek R1 70B | 1x H100 SXM5 | $2.54 | $1,855 |
| Vision | Qwen2.5-VL 32B | 1x H100 SXM5 | $2.54 | $1,855 |
| Code | Qwen2.5-Coder 32B | 1x H100 SXM5 | $2.54 | $1,855 |
| Total | | 4x H100 SXM5 | $10.16 | $7,420 |

After: single Mistral Small 4 deployment

| Config | $/hr on-demand | $/hr spot | $/month on-demand | $/month spot |
| --- | --- | --- | --- | --- |
| 2x H200 SXM5 | $7.92 | $2.38 | $5,782 | $1,738 |
| 4x H100 SXM5 | $10.16 | $3.20 | $7,420 | $2,336 |

The 2x H200 spot configuration at $1,738/month undercuts the four separate H100 spot deployments ($2,336/month) by approximately $598/month. On-demand, 2x H200 at $7.92/hr ($5,782/month) is also cheaper than four separate H100s at $10.16/hr ($7,420/month), saving roughly $1,638/month without touching spot pricing.
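The savings arithmetic is easy to re-run as rates move. The ~730 hours/month factor below is an assumption that roughly matches the monthly figures quoted above; small rounding differences against the quoted numbers come from that factor:

```python
HOURS_PER_MONTH = 730  # approximate; the tables above round similarly

def monthly(rate_per_hr: float) -> float:
    return rate_per_hr * HOURS_PER_MONTH

# Spot: 2x H200 ($2.38/hr) vs 4x H100 ($3.20/hr)
spot_savings = monthly(3.20) - monthly(2.38)   # ~ $599/month
# On-demand: 2x H200 ($7.92/hr) vs four separate H100s ($10.16/hr)
od_savings = monthly(10.16) - monthly(7.92)    # ~ $1,635/month
```

Plug in the live hourly rates from the pricing page before making the call.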

Beyond the monthly cost, there are operational savings that are harder to quantify: one inference server to monitor instead of four, one set of weights to download and update, one API endpoint, and roughly half the total GPU count in the recommended config.

The 4x H100 config runs at the same on-demand cost as four separate H100 deployments ($10.16/hr) but collapses four services into one: one deployment, one API surface, one set of weights to manage.

Pricing fluctuates with GPU availability. The rates above were captured on 21 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Compare current GPU rental pricing on Spheron for the latest rates before making a deployment decision.

Production Tips

Context length scaling

Start with --max-model-len 32768. This is the sweet spot for most workloads: low KV cache pressure, good concurrency, and enough context for the majority of real-world tasks. For applications that need 64K+ context, move to the 4x H100 config for more total VRAM, or add --max-num-seqs 32 to limit concurrent requests and free KV cache for longer sequences.

For full 256K context on 2x H200, use --max-model-len 262144. Expect fewer concurrent sessions at maximum context depth:

```bash
vllm serve /models/mistral-small-4 \
  --tensor-parallel-size 2 \
  --served-model-name mistral-small-4 \
  --dtype fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --port 8000
```

Keep --kv-cache-dtype fp8 here: FP8 stores each KV entry in 1 byte instead of 2 bytes (BF16), halving KV cache memory per token. At 256K context each sequence consumes enormous KV cache, so the memory saving is at its most valuable in this config.
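To see why the saving matters at this scale, a back-of-envelope calculation helps. The layer count, KV head count, and head dimension below are placeholders, not the real Mistral Small 4 architecture; take the actual values from the checkpoint's config.json:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # K and V each store kv_heads * head_dim elements per layer
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Placeholder dims -- NOT the published model config
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
CONTEXT = 262_144  # full 256K window

fp8_gb = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1) * CONTEXT / 1e9
bf16_gb = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2) * CONTEXT / 1e9
# With these dims a single full-context sequence needs ~21 GB at FP8
# vs ~43 GB at BF16 -- which is why --max-num-seqs 8 at 256K context.
```

Whatever the real dims turn out to be, the FP8 cache halves the per-sequence cost, which directly doubles how many max-context sessions fit in the headroom.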

Batch size tuning

Set --max-num-seqs 256 for high-throughput batch inference. For latency-sensitive production serving, reduce to 32-64. Watch nvidia-smi dmon -s u to keep all GPUs above 80% utilization under load. If you see GPU utilization dropping below 60%, increase batch size. If you see frequent OOM warnings in vLLM logs, reduce it.
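Beyond eyeballing `nvidia-smi dmon`, you can poll utilization programmatically and act on the 60% threshold from above. A sketch using `nvidia-smi`'s query interface:

```python
import subprocess

def gpu_utilizations() -> list:
    """Per-GPU utilization percentages via nvidia-smi's query interface."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.splitlines() if line.strip()]

def should_raise_batch(utils: list, floor: int = 60) -> bool:
    """True when every GPU sits under the floor -- batch size is too small."""
    return bool(utils) and max(utils) < floor
```

Sample over a window rather than a single reading; utilization dips briefly between prefill and decode phases even when the batch size is right.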

Spot instance savings

For development, staging, and non-interactive batch jobs, spot instances cut the hourly rate by 30-60%. Spot preemption on Spheron gives advance notice; configure your vLLM wrapper to checkpoint the current request queue gracefully before shutdown. See the speculative decoding production guide for techniques to improve per-token latency on top of this setup.
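A minimal drain pattern for that wrapper, assuming the preemption notice arrives as SIGTERM; confirm the actual signal or webhook mechanism in Spheron's docs:

```python
import signal
import threading

draining = threading.Event()

def handle_preemption(signum, frame):
    # Stop accepting new work; in-flight requests finish normally.
    draining.set()

signal.signal(signal.SIGTERM, handle_preemption)

def accept_request(req):
    if draining.is_set():
        # Let the load balancer retry against another replica
        raise RuntimeError("draining: instance is being preempted")
    # ... forward req to the local vLLM endpoint ...
    return req
```

Pair this with a readiness probe that flips to unhealthy once `draining` is set, so traffic drains at the load balancer rather than erroring at the instance.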

FP8 KV cache

Adding --kv-cache-dtype fp8 cuts KV cache memory by roughly half with minimal quality impact on most benchmarks. This frees VRAM for longer sequences or larger concurrent batches without changing the GPU count. Note it is incompatible with --enable-prefix-caching, so choose based on your workload: prefix caching for high-repetition chat, FP8 KV cache for diverse or long-context workloads. The KV cache optimization guide covers eviction strategies and when each approach wins.


Mistral Small 4 on 2x H200 runs cleanly on Spheron's bare-metal multi-GPU instances: no virtualization overhead, full NVLink bandwidth, and spot pricing that cuts the hourly rate significantly below on-demand.

Rent H200 on Spheron → | Rent H100 → | View all GPU pricing →
