Not every production workload needs a 119B MoE model. If you want Mistral's instruction, reasoning, and vision capabilities without the multi-GPU overhead of Mistral Small 4, the Ministral 3 family covers the same capability surface in dense 3B, 8B, and 14B checkpoints that fit on a single GPU. This guide covers GPU sizing, vLLM deployment, AWQ quantization, multimodal inference, and an edge-to-cloud routing pattern for all three variants.
The Ministral 3 Family
Ministral 3 is Mistral's December 2025 multi-SKU dense model release. Unlike Small 4's MoE approach, all three size tiers are fully dense transformers, which simplifies serving infrastructure and reduces communication overhead on single-GPU deployments.
| Model | Variant | Vision | Parameters | Context Window |
|---|---|---|---|---|
| Ministral 3 3B | base, instruct | Yes | 3B | 128K tokens |
| Ministral 3 3B | reasoning | Yes | 3B | 128K tokens |
| Ministral 3 8B | base, instruct | Yes | 8B | 128K tokens |
| Ministral 3 8B | reasoning | Yes | 8B | 128K tokens |
| Ministral 3 14B | base, instruct | Yes | 14B | 128K tokens |
| Ministral 3 14B | reasoning | Yes | 14B | 128K tokens |
All variants include image understanding.
The 3B fits in the edge tier: on-device or fractional-GPU serving for latency-sensitive queries. The 8B is the balanced production option: single L40S, good throughput, covers most instruction and vision workloads. The 14B Reasoning is for workloads that need explicit chain-of-thought: complex multi-step tasks, structured reasoning, and agentic pipelines.
For context on SLM economics broadly, see the small language models deployment guide.
Why a Small Reasoning Family Matters in 2026
Two years ago, getting useful chain-of-thought out of a 14B model took careful prompt engineering and still produced brittle results. Practical small reasoning models topped out around the 7B class, and their output quality sat noticeably below 70B-class models. Ministral 3's 14B Reasoning variant changes that. The reasoning scratchpad gives you structured chain-of-thought output at a fraction of the cost of a 70B inference call.
The cost difference compounds fast. A single L40S at $1.07/hr on-demand (or $0.72/hr spot) can serve the 14B Reasoning variant for tasks that previously required a $1.66/hr H100 SXM5 spot instance or a $4.50+/hr multi-GPU setup. For a steady workload of 10,000 requests per day at 1,000 tokens each, the $0.94/hr gap between L40S spot and H100 spot works out to roughly $22/day saved without touching model quality for most workloads.
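As a quick back-of-the-envelope check of that arithmetic (the hourly rates are the Spheron spot prices cited in this guide; substitute whatever you actually pay):

```python
# Daily and monthly gap between serving the 14B on L40S spot vs H100 SXM5 spot.
l40s_spot = 0.72   # $/hr, Spheron spot rate quoted in this guide
h100_spot = 1.66   # $/hr, Spheron spot rate quoted in this guide

hourly_gap = h100_spot - l40s_spot   # $0.94/hr
daily_gap = hourly_gap * 24          # ~$22.56/day
monthly_gap = daily_gap * 30         # ~$676.80/month

print(f"hourly: ${hourly_gap:.2f}  daily: ${daily_gap:.2f}  monthly: ${monthly_gap:.2f}")
```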
The vision capability across all tiers also changes the architecture calculus. Instead of deploying a separate vision model for image queries and a text model for everything else, a single Ministral 3 8B checkpoint handles both. One deployment, one VRAM budget, one serving binary.
For a detailed breakdown of how inference costs scale with model size and request volume, see the AI inference cost economics guide.
GPU Sizing Per Ministral 3 Variant
To estimate VRAM requirements at BF16, multiply the parameter count by 2 bytes per parameter, then add roughly 20% for framework overhead and KV cache at standard context lengths.
| Model | Precision | VRAM for Weights | KV Cache at 32K | Recommended GPU | Notes |
|---|---|---|---|---|---|
| Ministral 3 3B | BF16 | ~6 GB | ~4 GB | RTX 4090 (24 GB) | Abundant headroom; also runs on fractional L40S |
| Ministral 3 3B | INT4 AWQ | ~2 GB | ~4 GB | Any GPU with 8+ GB | Edge-friendly; runs on smaller hardware |
| Ministral 3 8B | BF16 | ~16 GB | ~8 GB | L40S 48GB | Fits single GPU with room for batch KV cache |
| Ministral 3 8B | INT4 AWQ | ~5 GB | ~8 GB | RTX 4090 (24 GB) | Tight on KV cache at 128K context |
| Ministral 3 14B | BF16 | ~28 GB | ~12 GB | L40S 48GB or H100 80GB | Single GPU recommended; 2x GPUs for longer context |
| Ministral 3 14B | INT4 AWQ | ~8-10 GB | ~12 GB | RTX 4090 or L40S | AWQ drops quality slightly; validate for your task |
For the 3B variant, an RTX 4090 on Spheron gives you 24 GB of VRAM, about 4x the model's weight footprint, leaving plenty of room for concurrent requests. You can also run the 3B on a fractional GPU partition if you need to share the GPU across multiple services.
For the 8B and 14B in BF16, L40S GPU instances on Spheron are the cost-optimal choice: 48 GB VRAM at the lowest on-demand rate among the recommended GPUs. The L40S covers the 14B with roughly 20 GB left after weights, enough for the ~12 GB KV cache at 32K context plus about 8 GB of headroom.
The 14B Reasoning variant on a single H100 on Spheron gives you more KV cache headroom (80 GB total, 52 GB free after weights) which is worth it if you need full 128K context or large batch sizes.
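If you want to sanity-check the sizing table against a different precision or parameter count, the rule of thumb above is easy to script. A minimal sketch (the 20% overhead figure is a planning margin, not a measurement, and the INT4 byte count is approximate):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate for model weights plus framework/KV-cache overhead.

    bytes_per_param: 2.0 for BF16/FP16; roughly 0.5-0.6 for INT4 AWQ once group
    scales are included (approximate). overhead: margin for CUDA context,
    activations, and a modest KV cache at standard context lengths.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params * 1 byte = 1 GB
    return weights_gb * (1 + overhead)

for name, params in [("Ministral 3 3B", 3), ("Ministral 3 8B", 8), ("Ministral 3 14B", 14)]:
    print(f"{name}: ~{estimate_vram_gb(params):.0f} GB at BF16")
```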
Step-by-Step: Deploy Ministral 3 with vLLM on Spheron
Step 1: Provision a Spheron instance
Log in at app.spheron.ai and navigate to GPU Cloud. Select your target GPU based on the sizing table above.
For the 14B Reasoning variant, pick L40S PCIe or H100 SXM5. For the 3B or 8B, an RTX 4090 or L40S works. Spot instances suit development and batch inference workloads on the SKUs that offer them; on the L40S, spot runs $0.72/hr versus $1.07/hr on-demand on Spheron, so prefer spot there for batch and dev work. Deploy with the PyTorch 2.5 / CUDA 12.4 base image.
Mount persistent storage before downloading weights. Minimum sizes per variant:
- Ministral 3 3B BF16: 10-15 GB
- Ministral 3 8B BF16: 20-25 GB
- Ministral 3 14B BF16: 50 GB minimum (add buffer for vLLM cache)
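Before starting the download, it is worth confirming the mounted volume actually has that much free space. A quick check (the mount path is illustrative; use wherever you attached persistent storage):

```python
import shutil

# Free space on the volume where model weights will be downloaded.
# Replace /mnt/models with your actual persistent storage mount point.
free_gb = shutil.disk_usage("/mnt/models").free / 1e9
print(f"Free space: {free_gb:.1f} GB")
assert free_gb > 50, "Not enough room for the 14B BF16 checkpoint plus vLLM cache"
```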
Step 2: Install vLLM and dependencies
pip install "vllm>=0.8.4"
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=your_hf_token_here
Step 3: Download model weights
Ministral 3 checkpoints may require license acceptance per variant. Before running huggingface-cli download, visit the relevant model card at huggingface.co/mistralai and accept the terms if the repository is gated.
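If you are unsure whether your token already has access, you can probe the repository before committing to a multi-gigabyte download. A minimal sketch using huggingface_hub, shown here for the 14B Reasoning checkpoint:

```python
from huggingface_hub import HfApi

repo_id = "mistralai/Ministral-3-14B-Reasoning-2512"
try:
    # model_info raises if the repo is gated and your token has not been granted access
    HfApi().model_info(repo_id)
    print(f"Access OK: {repo_id}")
except Exception as exc:  # typically a gated-repo or repository-not-found error
    print(f"No access yet: visit https://huggingface.co/{repo_id}, accept the terms ({exc})")
```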
# Ministral 3 3B instruct
huggingface-cli download mistralai/Ministral-3-3B-Instruct-2512
# Ministral 3 8B instruct
huggingface-cli download mistralai/Ministral-3-8B-Instruct-2512
# Ministral 3 14B reasoning (recommended for complex workloads)
huggingface-cli download mistralai/Ministral-3-14B-Reasoning-2512
Step 4: Launch the vLLM server
Ministral 3 3B on RTX 4090 (24 GB VRAM):
vllm serve mistralai/Ministral-3-3B-Instruct-2512 \
--dtype bfloat16 \
--max-model-len 65536 \
--port 8000
The 65K context limit gives you generous KV cache headroom on the RTX 4090. The 3B model leaves ~18 GB free for cache.
Ministral 3 8B on L40S (48 GB VRAM):
vllm serve mistralai/Ministral-3-8B-Instruct-2512 \
--dtype bfloat16 \
--max-model-len 65536 \
--port 8000
Ministral 3 14B Reasoning on L40S (48 GB VRAM, single GPU):
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
--dtype bfloat16 \
--max-model-len 32768 \
--reasoning-parser mistral \
--port 8000
Ministral 3 14B Reasoning on 2x GPUs for longer context:
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
--dtype bfloat16 \
--max-model-len 65536 \
--reasoning-parser mistral \
--tensor-parallel-size 2 \
--port 8000
Tensor parallelism across 2 GPUs doubles your VRAM budget and lets you push to 65K+ context or increase batch size significantly.
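Once the server is up, a quick readiness check from the same instance confirms it is serving and shows the model ID it registered (this uses the default port from the launch commands above):

```python
import openai

# vLLM's OpenAI-compatible server lists the loaded model(s) at /v1/models
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="token")
for model in client.models.list():
    print("Serving:", model.id)
```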
Step 5: Send requests
Reasoning request:
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="mistralai/Ministral-3-14B-Reasoning-2512",
    messages=[
        {"role": "user", "content": "Explain why merge sort is more efficient than bubble sort for large datasets."}
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
Vision request (multimodal with image):
response = client.chat.completions.create(
    model="mistralai/Ministral-3-8B-Instruct-2512",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
                {"type": "text", "text": "Describe what this chart shows."},
            ],
        }
    ],
    max_tokens=512,
)
Deploying Ministral 3 Vision: Multimodal Requests
All Ministral 3 variants load a vision encoder alongside the language model. When you serve via vLLM, no extra flags are needed for multimodal support: vLLM detects the vision encoder from the model config and loads it automatically.
For image preprocessing, keep input images at 1024px max on the longer edge before sending. Larger images increase tokenization time and can saturate memory on smaller GPUs. You can resize in Python before encoding to base64:
from PIL import Image
import base64
import io
def prepare_image(path: str, max_size: int = 1024) -> str:
    img = Image.open(path)
    img.thumbnail((max_size, max_size))
    img = img.convert('RGB')
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode()
b64_image = prepare_image("your_image.jpg")
response = client.chat.completions.create(
    model="mistralai/Ministral-3-14B-Reasoning-2512",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"},
                },
                {"type": "text", "text": "What does this image show?"},
            ],
        }
    ],
    max_tokens=256,
)
Latency expectations: Image tokenization adds roughly 50-200ms per image depending on resolution and GPU. For the 14B on L40S at 32K context, expect first-token latency of 300-600ms with a single image. Batch multiple images only if your use case genuinely needs it: each additional image adds VRAM pressure proportional to its token count.
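To verify those figures on your own hardware rather than taking them on faith, time the first streamed chunk of a request. A minimal text-only probe (add an image block to the message, as in the example above, to reproduce the multimodal numbers):

```python
import time
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="token")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="mistralai/Ministral-3-14B-Reasoning-2512",
    messages=[{"role": "user", "content": "Summarize the trade-offs of INT4 quantization."}],
    max_tokens=128,
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None:
        ttft = time.perf_counter() - start  # first streamed chunk is a close proxy for TTFT
print(f"Approximate time to first token: {ttft * 1000:.0f} ms")
```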
Quantization: AWQ for the 14B Reasoning Variant
AWQ INT4
AWQ drops the 14B's VRAM requirement from ~28 GB to ~8-10 GB, putting it on a single RTX 4090. Quality loss is minimal on standard instruction tasks but can be more noticeable on precise reasoning chains. Test against your specific workload before committing.
To quantize from BF16:
pip install autoawq
python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'mistralai/Ministral-3-14B-Reasoning-2512'
quant_path = './ministral-3-14b-awq'
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
"
Then serve with vLLM:
vllm serve ./ministral-3-14b-awq \
--quantization awq \
--dtype float16 \
--max-model-len 32768 \
--port 8000
For a comprehensive walkthrough of AWQ quantization across model families, see the AWQ quantization deployment guide.
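Before cutting traffic over to the AWQ checkpoint, spot-check both models on prompts that look like your real workload. A rough sketch, assuming the BF16 and AWQ servers are running side by side on ports 8000 and 8001 (both the port split and the prompts are illustrative):

```python
import openai

PROMPTS = [
    "Walk through the logic of a two-phase commit step by step.",
    "A train leaves at 9:14 and arrives at 11:02. How long is the trip?",
]

def ask(port: int, model: str, prompt: str) -> str:
    client = openai.OpenAI(base_url=f"http://localhost:{port}/v1", api_key="token")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], max_tokens=512
    )
    return resp.choices[0].message.content

for prompt in PROMPTS:
    bf16 = ask(8000, "mistralai/Ministral-3-14B-Reasoning-2512", prompt)
    awq = ask(8001, "./ministral-3-14b-awq", prompt)  # served model name defaults to the path
    print(f"--- {prompt}\nBF16: {bf16[:200]}\nAWQ:  {awq[:200]}\n")
```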
Blackwell (B200/B300)
As of May 2026, no dedicated Blackwell-optimized checkpoint (MXFP4 or NVFP4) has been published for Ministral 3 14B. You can run BF16 on B200 or B300 instances today using the same vLLM command with --dtype bfloat16. The 192 GB HBM3e on B200 leaves ample room for large batch sizes and long context. Watch Mistral's HuggingFace page for quantized Blackwell variants when they are released.
Throughput and Cost: Ministral 3 vs Mistral Small 4 and Llama 4 Scout
| Model | GPU Config | Tokens/sec (approx) | On-demand $/hr | Spot $/hr | Best Use Case |
|---|---|---|---|---|---|
| Ministral 3 3B | RTX 4090 (24 GB) | 1,200-1,800 | $0.53 | N/A | Edge serving, simple queries, high throughput |
| Ministral 3 8B | L40S 48GB | 600-900 | $1.07 | $0.72 | Balanced: instruction + vision, moderate reasoning |
| Ministral 3 14B Reasoning | L40S 48GB | 350-500 | $1.07 | $0.72 | Complex reasoning, chain-of-thought, agents |
| Ministral 3 14B Reasoning | H100 SXM5 | 500-700 | $3.70 | $1.66 | Same model, more KV cache, higher concurrency |
| Mistral Small 4 | 2x H200 SXM5 | 400-600 | ~$8.72 | ~$3.52 | All-in-one: vision + reasoning + code, 119B MoE |
| Llama 4 Scout | H100 SXM5 | 500-750 | $3.70 | $1.66 | Llama 4 community license, Meta ecosystem |
When does Ministral 3 14B beat Mistral Small 4? When you need lower GPU cost, simpler serving infrastructure, or strict single-GPU constraints. Mistral Small 4 is the right call when a single model needs to handle extreme task diversity without routing, or when you need 256K context windows. See the Mistral Small 4 deployment guide for multi-GPU setup details.
Llama 4 Scout covers use cases where Meta's ecosystem (LlamaIndex, Meta AI tools) is already in place, or where the Llama 4 community license terms work for your organization.
Live Spheron Pricing for Recommended GPU SKUs
| GPU | On-demand $/hr | Spot $/hr | Recommended For |
|---|---|---|---|
| RTX 4090 PCIe | $0.53 | N/A | Ministral 3 3B, edge serving, high-throughput small models |
| L40S PCIe | $1.07 | $0.72 | Ministral 3 8B and 14B (best cost-per-token in this range) |
| H100 SXM5 | $3.70 | $1.66 | Ministral 3 14B Reasoning with large KV cache or high concurrency |
| H200 SXM5 | $4.36 | $1.76 | Multi-model serving, very long context, future headroom |
| B200 SXM6 | $6.76 | $3.50 | Ministral 3 14B BF16, large-scale deployment, Blackwell performance |
Pricing fluctuates based on GPU availability. The prices above were captured on 12 May 2026 and may have changed; check the current GPU pricing page for live rates.
Edge-to-Cloud Routing Pattern
A practical deployment pattern for Ministral 3 pairs the 3B at the edge with the 14B Reasoning on cloud, with a lightweight router dispatching based on query complexity.
The routing logic:
- Simple queries (short, factual, single-turn): Ministral 3 3B on a low-cost fractional GPU
- Complex queries (multi-step, code, long context, or explicit reasoning requested): Ministral 3 14B Reasoning on Spheron cloud
For a complete LLM routing implementation with classification approaches and cost analysis, see the LLM inference router guide.
Here is a simplified Python example of the routing logic:
import openai
TIER1_URL = "http://edge-node:8000/v1" # Ministral 3 3B
TIER2_URL = "http://cloud-node:8000/v1" # Ministral 3 14B Reasoning
def classify_complexity(query: str) -> str:
    """Returns 'simple' or 'complex' based on query heuristics."""
    complex_signals = [
        len(query) > 400,
        any(kw in query.lower() for kw in ["step by step", "explain why", "compare", "analyze"]),
        query.count("?") > 2,
    ]
    return "complex" if any(complex_signals) else "simple"

def route_query(query: str) -> str:
    tier = classify_complexity(query)
    if tier == "simple":
        client = openai.OpenAI(base_url=TIER1_URL, api_key="token")
        model = "mistralai/Ministral-3-3B-Instruct-2512"
    else:
        client = openai.OpenAI(base_url=TIER2_URL, api_key="token")
        model = "mistralai/Ministral-3-14B-Reasoning-2512"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        max_tokens=1024,
    )
    return response.choices[0].message.content
For production, replace the heuristic classifier with an embedding-based classifier (e.g., sentence-transformers/all-MiniLM-L6-v2) for better accuracy without adding significant latency. The LLM inference router guide covers embedding classifiers, NGINX proxy setup, and multi-tier monitoring in depth.
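A rough sketch of what that swap could look like: embed a few labeled exemplar queries once, then route each incoming query toward the tier of its closest exemplar. The exemplars are illustrative, not tuned; in practice sample them from real traffic.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hand-labeled exemplars per tier; replace with queries sampled from production logs.
SIMPLE = ["What time zone is Tokyo in?", "Define idempotency in one sentence."]
COMPLEX = ["Compare Raft and Paxos and explain when each fails.",
           "Analyze this stack trace step by step and propose a fix."]

simple_emb = encoder.encode(SIMPLE, convert_to_tensor=True)
complex_emb = encoder.encode(COMPLEX, convert_to_tensor=True)

def classify_complexity(query: str) -> str:
    """Drop-in replacement for the heuristic classifier above."""
    q = encoder.encode(query, convert_to_tensor=True)
    simple_score = util.cos_sim(q, simple_emb).max().item()
    complex_score = util.cos_sim(q, complex_emb).max().item()
    return "complex" if complex_score > simple_score else "simple"
```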
For use cases that span cloud and on-device deployment, see the hybrid cloud-edge AI inference guide.
Production Checklist
Before moving a Ministral 3 deployment to production, cover these areas:
- Observability. vLLM exposes Prometheus metrics at `/metrics` by default. Track `vllm:num_requests_running`, `vllm:gpu_cache_usage_perc`, and `vllm:e2e_request_latency_seconds`. Alert on cache usage above 90% and p95 latency above your SLA threshold.
- Guardrails. For user-facing deployments, add input/output content filtering before and after the model. A lightweight classifier (e.g., a fine-tuned BERT-class model) running on CPU is sufficient for most content policies and adds under 5ms per request.
- Fine-tuning. The instruct variants support LoRA fine-tuning. For domain adaptation on the 8B, Unsloth is an efficient option: it reduces memory overhead by 60-70% compared to standard Hugging Face PEFT, letting you fine-tune on a single L40S in a few hours.
- Structured output. For agentic or data extraction workloads, start vLLM with `--guided-decoding-backend xgrammar` (or `outlines` for older versions) and pass `response_format` or `guided_json` fields in each request to enforce output schemas; see the sketch after this list. The reasoning variant handles JSON schema constraints well because the scratchpad phase can plan the structure before outputting.
- Spot preemption handling. If using spot instances on Spheron for batch workloads, implement checkpoint saves after each request or batch segment. Store checkpoints on persistent storage, not ephemeral instance storage. The router's fallback chain should automatically requeue preempted requests.
- Auto-scaling. For variable traffic, Spheron's per-second billing makes burst scaling practical. Keep a warm single-GPU Tier 1 (3B) instance running permanently and scale the Tier 2 (14B) deployment up during peak hours. Scale down during off-peak to save 70%+ on the 14B GPU cost.
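As referenced in the structured-output item above, here is a minimal sketch of a schema-constrained request against the 14B Reasoning server, using vLLM's guided_json extension to the OpenAI API (the schema itself is illustrative):

```python
import json
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="token")

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
}

response = client.chat.completions.create(
    model="mistralai/Ministral-3-14B-Reasoning-2512",
    messages=[{
        "role": "user",
        "content": "Extract the vendor, total, and currency from: "
                   "'Acme Corp invoice, amount due 1,280.50 EUR.'",
    }],
    max_tokens=256,
    extra_body={"guided_json": invoice_schema},  # vLLM-specific guided decoding parameter
)
print(json.loads(response.choices[0].message.content))
```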
Ministral 3's multi-SKU design means you can start with the 3B for low-cost inference and graduate to the 14B Reasoning variant as your workload grows, without rewriting your serving stack.
