Tutorial

Deploy Mistral Large 3 on GPU Cloud: Self-Host the 675B MoE with vLLM and Expert Parallelism (2026)


Mistral Large 3 is a 675B-parameter Mixture of Experts model with only 41B parameters active per token, released in December 2025 under the Apache 2.0 license. It supports both text and image inputs, with a native vision encoder for multimodal understanding. The model positions itself as a Western-lab alternative to the DeepSeek family at similar active-parameter scale, with a smaller total footprint than DeepSeek V4's 1T parameters. The Apache 2.0 license removes the deployment restrictions that come with many frontier models, which matters if you're building commercial products or need to modify the serving stack. This guide covers everything from VRAM planning to production vLLM configuration on multi-GPU instances.

Mistral Large 3 at a Glance

Before touching any configuration, it helps to know the model's architecture numbers:

| Property | Value |
| --- | --- |
| Total parameters | 675B |
| Active parameters per token | 41B |
| License | Apache 2.0 |
| Context window | 256K tokens (262,144) |
| Release date | December 2025 |
| Modality | Text + Images (native multimodal, native vision encoder) |
| Architecture | Mixture of Experts (MoE) |

Mistral has not published official benchmark results for Mistral Large 3 at time of writing. Check the official Mistral release post for updated numbers as they become available.

The Apache 2.0 license is the practical differentiator for many teams. Unlike models with custom non-commercial or research-only licenses, Apache 2.0 allows commercial use, modification, and redistribution, subject only to its attribution and notice requirements. You can embed Mistral Large 3 in a product, modify the model, or build derivative works without negotiating a separate license.

VRAM Requirements: FP8, Q4, and GGUF Footprints

A common misconception about MoE models is that VRAM scales with active parameters rather than total parameters. It does not. Mistral Large 3 has 675B total parameters, and all 675B must reside in GPU VRAM at inference time. The router selects which experts (about 41B parameters' worth) activate for each token, but any expert can be selected for any token, so none of them can be left out of memory. Size your cluster for the full model.

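VRAM formula: total_vram_GB = (total_params_B × bytes_per_param) + (total_params_B × bytes_per_param × 0.05) + kv_cache_budget_GB, where total_params_B is the total parameter count in billions (675 here) and bytes_per_param is 1 for FP8 or 0.5 for 4-bit formats.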
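The formula can be sketched in a few lines of Python. This is a planning aid only: the function name and the 5% default overhead are illustrative choices taken from the formula above, not part of any library.

```python
def estimate_vram_gb(total_params_b: float, bytes_per_param: float,
                     kv_cache_gb: float = 0.0, overhead: float = 0.05) -> float:
    """Estimate total VRAM in GB for an MoE model.

    total_params_b is the TOTAL parameter count in billions (not the active count).
    """
    weights_gb = total_params_b * bytes_per_param   # 1e9 params x bytes = GB
    runtime_gb = weights_gb * overhead              # activations, framework state
    return weights_gb + runtime_gb + kv_cache_gb

fp8_gb = estimate_vram_gb(675, 1.0)    # FP8: 1 byte per parameter
int4_gb = estimate_vram_gb(675, 0.5)   # 4-bit: 0.5 bytes per parameter
print(f"FP8:  {fp8_gb:.1f} GB")
print(f"INT4: {int4_gb:.1f} GB")
```

Pass a non-zero kv_cache_gb to see how much headroom a given context length and concurrency target would add on top of the weights.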

| Quantization | Weight size | Runtime overhead | Min VRAM needed | Min GPU config |
| --- | --- | --- | --- | --- |
| FP8 | 675 GB | ~34 GB | ~710 GB | 8x H200 141GB (1128 GB) or 4x B200 192GB (768 GB) |
| INT4/GPTQ | ~338 GB | ~17 GB | ~355 GB | 5x H100 SXM5 80GB (400 GB), or 3x H200 141GB (423 GB) |
| Q4_K_M (GGUF) | ~338 GB | ~17 GB | ~355 GB | 3x H200 141GB (423 GB) or 5x H100 80GB (400 GB) |
| NVFP4 | ~338 GB | ~17 GB | ~355 GB | 3x B200 SXM6 192GB (576 GB), Blackwell only, experimental |

A few notes on this table:

FP8 is the preferred quantization for H200 and B200 hardware. It uses native hardware tensor cores, delivers near-lossless quality (less than 1% degradation on most benchmarks), and cuts memory to roughly half of BF16.

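INT4/GPTQ reduces the footprint to roughly 355 GB, which 5x H100 can hold on paper. In practice, vLLM tensor parallelism needs a TP size that divides the model's attention-head count evenly, so plan for 8x H100 as listed in the prerequisites. You need a pre-quantized checkpoint from Hugging Face and should expect 3-5% quality degradation on reasoning-heavy tasks. It works well for batch inference or use cases where you care more about cost than peak accuracy.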

NVFP4 is experimental as of mid-2026 and requires Blackwell hardware specifically (B200 or B300). It is not available on H100 or H200. If you are on Blackwell, it roughly halves the memory requirement versus FP8, but the quantization tooling is still maturing. Verify model support before betting production traffic on it.
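To turn a footprint into a GPU count, a rough sketch like the one below can help. It computes the capacity floor and then rounds up to a power of two. The power-of-two rounding is an assumption (a common convention for tensor-parallel sizes), not a vLLM rule; the actual constraint depends on the model's attention-head count, so check the checkpoint config.

```python
import math

def min_gpus(required_gb: float, gpu_gb: float) -> int:
    """Smallest GPU count whose combined VRAM covers required_gb."""
    return math.ceil(required_gb / gpu_gb)

def next_power_of_two(n: int) -> int:
    """Round up to 1, 2, 4, 8, ... (assumed tensor-parallel-friendly sizes)."""
    return 1 << (n - 1).bit_length()

# Footprints (~710 GB FP8, ~355 GB 4-bit) come from the table above.
for label, required_gb, gpu_gb in [
    ("FP8 on H200", 710, 141),
    ("FP8 on B200", 710, 192),
    ("INT4 on H100", 355, 80),
    ("INT4 on H200", 355, 141),
]:
    floor = min_gpus(required_gb, gpu_gb)
    print(f"{label}: capacity floor {floor}, rounded for TP {next_power_of_two(floor)}")
```

The floor ignores KV cache, so treat the rounded figure as the starting point rather than the floor.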

Expert Parallelism and Tensor Parallelism: When to Use Each

Tensor parallelism (TP) splits individual weight matrices across GPUs. Every forward pass requires an all-reduce collective across all GPUs in the TP group, which synchronizes every layer. This is efficient on fast interconnects (NVLink) but becomes a communication bottleneck when scaling across multiple nodes connected over InfiniBand.

Expert parallelism (EP) assigns entire MoE expert layers to specific GPUs. A given GPU holds a subset of the experts, and tokens are routed to whichever GPU owns the relevant expert. The communication pattern shifts from all-reduce per layer to all-to-all per routing step. For large MoE models where most computation happens inside expert layers, EP scales better across nodes because you only communicate at routing boundaries, not inside each layer.

| Config | Use case | Communication pattern | When to prefer |
| --- | --- | --- | --- |
| TP only | Single-node, up to 8 GPUs | All-reduce per layer | Low-latency serving, small batch sizes, NVLink-connected GPUs |
| EP only | Multi-node, high throughput | All-to-all at routing | Batch inference, large concurrent request queues |
| TP + EP hybrid | Very large clusters | Both | High throughput + multi-node scale |

For most Mistral Large 3 deployments, single-node tensor parallelism on 4x B200 or 8x H200 is the practical choice. Multi-node EP adds significant operational complexity (Ray cluster setup, network topology requirements) and is only worth it when your throughput requirements exceed what a single node provides. For deeper coverage of MoE parallelism strategies, the MoE inference optimization guide covers expert routing, memory planning, and framework comparisons in detail.
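To make the expert-parallel communication pattern concrete, here is a toy sketch of how routed tokens fan out to the ranks that own their experts. The expert count, rank count, contiguous expert placement, and routing decisions are all made up for illustration; real frameworks handle placement and dispatch internally.

```python
from collections import Counter

def expert_owner(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Map an expert to the rank that holds it (contiguous blocks per rank).

    Assumes num_experts is evenly divisible by ep_size.
    """
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank

def all_to_all_counts(routed, num_experts, ep_size):
    """Count how many (token, expert) pairs each rank receives.

    routed: one list of selected expert ids per token (top-k routing output).
    """
    counts = Counter()
    for expert_ids in routed:
        for expert_id in expert_ids:
            counts[expert_owner(expert_id, num_experts, ep_size)] += 1
    return [counts[rank] for rank in range(ep_size)]

# Hypothetical toy setup: 8 experts over 4 ranks, top-2 routing for 4 tokens.
routed = [[0, 5], [1, 4], [6, 7], [0, 1]]
per_rank = all_to_all_counts(routed, num_experts=8, ep_size=4)
print(per_rank)
```

In this toy batch one rank receives most of the work and another receives none, which is the load-imbalance risk that makes EP better suited to large batches, where routing tends to even out.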

Step-by-Step: Deploy Mistral Large 3 with vLLM

Prerequisites

  • GPU instance with 8x H200 SXM5 141GB or 4x B200 SXM6 192GB for FP8, or 8x H100 SXM5 80GB for INT4 only
  • CUDA 12.1 or later, Python 3.10 or later
  • ~700 GB persistent storage for FP8 weights (or ~350 GB for INT4)
  • Hugging Face account; model access if the repository is gated

Install vLLM

bash
pip install -U vllm  # always install the latest version

# Verify version
python -c "import vllm; print(vllm.__version__)"

A recent vLLM release is required for native support of Mistral Large 3's tokenizer format and configuration. Earlier versions may fail to load the model or produce incorrect outputs, so if you see tokenizer or model-loading errors, upgrade with pip install -U vllm first.

Download Model Weights

bash
huggingface-cli download mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --local-dir /data/models/mistral-large-3

# Verify download completed
du -sh /data/models/mistral-large-3/

Weight download takes 1-2 hours depending on bandwidth. The FP8 checkpoint is approximately 675 GB. If you plan to use GPTQ INT4, find a pre-quantized checkpoint on Hugging Face (search for Mistral-Large-3-675B-Instruct-2512-GPTQ variants) rather than quantizing at runtime.

Launch: 8x H200 FP8

bash
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --quantization fp8 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --gpu-memory-utilization 0.88 \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --port 8000

Flag notes:

  • --quantization fp8: activates FP8 weight loading. Requires hardware that supports FP8 tensor cores (H100, H200, B200).
  • --tensor-parallel-size 8: splits every attention and FFN layer across 8 GPUs. Requires NVLink-connected GPUs for efficient communication. The official recipe from Mistral recommends TP=8 to match the model's attention-head configuration.
  • --tokenizer_mode mistral --config_format mistral --load_format mistral: required flags for correct Mistral model loading. Without these the server will not load the model correctly.
  • --gpu-memory-utilization 0.88: leaves 12% of each GPU's VRAM for CUDA context and framework overhead. If you see OOM errors on startup, lower to 0.85.
  • --max-model-len 32768: context window cap. Increase to 65536 or 131072 if your application needs longer contexts; KV cache will grow accordingly.
  • --kv-cache-dtype fp8: stores the KV cache in FP8 rather than BF16, cutting KV cache memory by roughly half.
  • --enable-chunked-prefill: processes long prompts in chunks rather than blocking the decode queue. Improves latency under mixed-length workloads.
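The effect of --max-model-len and --kv-cache-dtype on memory can be estimated with the standard per-token KV formula. The sketch below assumes a conventional attention layout that stores full keys and values per KV head; the layer count, KV-head count, and head dimension are placeholders, not Mistral Large 3's published values. Read the real ones from the checkpoint config, and note that models with compressed KV representations will not follow this formula.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_gb(tokens: int, bytes_per_token: int) -> float:
    """KV cache size in GB for a given number of cached tokens."""
    return tokens * bytes_per_token / 1e9

# PLACEHOLDER architecture values, not Mistral Large 3's real config.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for name, dtype_bytes in [("bf16", 2), ("fp8", 1)]:
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, dtype_bytes)
    full_seq_gb = kv_cache_gb(32768, per_token)
    print(f"{name}: {per_token} bytes/token, {full_seq_gb:.2f} GB per 32K sequence")
```

Multiply the per-sequence figure by the number of concurrent full-length sequences you expect to get a worst-case KV budget.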

Launch: 4x B200 FP8

bash
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --port 8000

The B200 has 192 GB per GPU, so 4 GPUs total 768 GB. With --gpu-memory-utilization 0.92, vLLM caps its entire allocation at ~706 GB. FP8 weights consume roughly 675 GB of that budget, leaving ~32 GB for KV cache and activation workspace. Use --max-model-len 16384 for safe operation at this utilization level. If you need the 32K cap used in the 8-GPU launch command above, raise --gpu-memory-utilization to 0.96 instead, which lifts the allocation to ~737 GB.
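The arithmetic behind these numbers is easy to rerun for other utilization settings. A minimal sketch (the helper name is illustrative; it ignores activation workspace, so the real KV budget is somewhat smaller):

```python
def kv_budget_gb(num_gpus: int, gpu_gb: float, utilization: float,
                 weights_gb: float) -> float:
    """VRAM left under vLLM's utilization cap once the weights are loaded."""
    return num_gpus * gpu_gb * utilization - weights_gb

for util in (0.92, 0.96):
    budget = kv_budget_gb(num_gpus=4, gpu_gb=192, utilization=util, weights_gb=675)
    print(f"utilization {util}: ~{budget:.1f} GB for KV cache")
```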

Launch: Multi-Node with Expert Parallelism

For deployments that need to span multiple physical nodes, set up a Ray cluster first:

bash
# On the head node
ray start --head --port 6379

# On each worker node
ray start --address='<head_node_ip>:6379'

Then launch vLLM with combined tensor and expert parallelism:

bash
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --gpu-memory-utilization 0.88 \
  --max-model-len 32768 \
  --port 8000

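--enable-expert-parallel activates expert parallelism, placing whole MoE experts on individual GPUs instead of sharding every expert across the tensor-parallel group. The expert-parallel degree is derived automatically from the TP and DP configuration. Note that the command above sets only --tensor-parallel-size 4, so on its own it addresses a single 4-GPU group; to span several nodes, the parallel configuration must cover their GPUs too, for example by adding data parallelism (vLLM exposes --data-parallel-size) or raising the TP size. Check the vLLM documentation for your version. This requires a recent vLLM and a working Ray cluster. For most teams, single-node deployment on 4x B200 or 8x H200 is simpler and covers the majority of throughput requirements.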
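As a rough illustration of how the expert-parallel degree falls out of the other settings, the sketch below assumes the degree is the product of the TP and DP sizes. That product rule and the expert count of 128 are assumptions for illustration; confirm both against the vLLM documentation and the model config.

```python
def expert_parallel_layout(tp_size: int, dp_size: int, num_experts: int) -> dict:
    """Derive EP degree and experts per rank, ASSUMING EP = TP x DP."""
    ep_size = tp_size * dp_size
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must divide evenly across EP ranks")
    return {"ep_size": ep_size, "experts_per_rank": num_experts // ep_size}

# Hypothetical: 2 nodes x 4 GPUs, with a placeholder expert count of 128.
layout = expert_parallel_layout(tp_size=4, dp_size=2, num_experts=128)
print(layout)
```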

Test the API

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512",
    messages=[{"role": "user", "content": "Explain MoE routing in 3 sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Or with curl:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
    "messages": [{"role": "user", "content": "Explain MoE routing in 3 sentences."}],
    "max_tokens": 256
  }'
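Once the endpoint responds, time to first token (TTFT) is the next thing worth checking. The helper below times any iterable of text pieces; a stand-in generator is used so the sketch runs without a server. The function names are illustrative, not part of the OpenAI client.

```python
import time
from typing import Iterable, Iterator, Tuple

def time_stream(chunks: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, chunk_count) for a text stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for piece in chunks:
        if not piece:
            continue  # skip empty deltas (e.g. role-only chunks)
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total, count

def fake_stream() -> Iterator[str]:
    """Stand-in generator so the sketch runs without a server."""
    time.sleep(0.05)  # simulated prefill delay
    for word in ["MoE", " routing", " picks", " experts"]:
        yield word

ttft, total, n = time_stream(fake_stream())
print(f"TTFT {ttft * 1000:.0f} ms, {n} chunks in {total * 1000:.0f} ms")
```

Against a live server, pass a generator that calls client.chat.completions.create(..., stream=True) and yields chunk.choices[0].delta.content or "" for each chunk that has choices.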

Throughput and Latency Benchmarks

The numbers below are derived from vLLM benchmarks on similar MoE architectures at comparable active-parameter scale (41B active parameters). Actual throughput varies with sequence length, batch size, and request concurrency. Treat these as planning estimates, not guarantees.

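| GPU config | Quantization | Batch size | Throughput (tok/s) | TTFT (ms) | Cost per M tokens (spot) |
| --- | --- | --- | --- | --- | --- |
| 4x B200 SXM6 | FP8 | 32 | ~3,000 | ~120 | ~$2.00 |
| 8x H200 SXM5 | FP8 | 32 | ~2,500 | ~150 | ~$1.57 |
| 8x H100 SXM5 | INT4 | 32 | ~2,000 | ~250 | ~$1.59 |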

Throughput figures assume 512-token average output length and NVLink-connected GPUs. For workloads with longer outputs (2K+ tokens), throughput drops proportionally. For short-context completion tasks (under 256 tokens), throughput is higher.

Pricing fluctuates based on GPU availability. The prices above are based on 09 Jun 2026 and may have changed. Check current GPU pricing for live rates.
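The cost column follows directly from hourly price and sustained throughput, so it is easy to recompute when either changes. A small sketch using the spot rates and throughput estimates quoted in this guide (the function name is illustrative):

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

# (spot $/hr, estimated tok/s) taken from the tables in this guide.
configs = {
    "4x B200 FP8": (21.36, 3000),
    "8x H200 FP8": (14.16, 2500),
    "8x H100 INT4": (11.44, 2000),
}
costs = {name: cost_per_million_tokens(rate, tps) for name, (rate, tps) in configs.items()}
for name, usd in costs.items():
    print(f"{name}: ${usd:.2f} per million tokens")
```

The result assumes the GPUs stay saturated; at lower utilization the effective cost per token rises in proportion.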

Quantization Options to Cut the GPU Bill

FP8 is the default recommendation for H200 and B200 hardware. It uses native tensor cores available on Hopper and Blackwell GPUs, delivers quality within 1% of BF16 on most benchmarks, and halves the memory footprint versus FP16. Enable it with --quantization fp8.

GPTQ INT4 cuts memory in half again versus FP8, but you need a pre-quantized checkpoint from Hugging Face. Runtime quantization with GPTQ is slow. Use --quantization gptq and point at a pre-quantized model ID. Quality drops 3-5% on reasoning tasks but is acceptable for many production use cases. This is the practical path if you need to fit on A100 80GB or older H100 PCIe hardware.

GGUF via llama.cpp works for CPU-GPU hybrid inference. Throughput is lower than vLLM but memory flexibility is higher since you can offload layers to CPU RAM. Useful for local experiments or edge deployments where GPU VRAM is the bottleneck. Not recommended for serving more than a few concurrent users.

NVFP4 is Blackwell-specific (B200, B300) and experimental as of mid-2026. It halves memory versus FP8, which means a small number of B200 GPUs could serve the model. The tooling is maturing rapidly, but verify that Mistral Large 3 supports NVFP4 in your vLLM version before relying on it for production.

| Format | VRAM saving vs FP16 | Quality delta (MMLU) | Recommended hardware | vLLM flag |
| --- | --- | --- | --- | --- |
| FP8 | ~50% | <1% | H100, H200, B200 | --quantization fp8 |
| GPTQ INT4 | ~75% | 3-5% | All NVIDIA, requires pre-quantized checkpoint | --quantization gptq |
| GGUF Q4_K_M | ~75% | 3-5% | CPU+GPU hybrid | Use llama.cpp, not vLLM |
| NVFP4 | ~75% | TBD | B200, B300 only | Experimental |

Production Serving: OpenAI-Compatible API, Autoscaling, and KV Cache Tuning

To expose an OpenAI-compatible endpoint with a specific model name:

bash
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --quantization fp8 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name mistral-large-3

--served-model-name sets the model field in API responses. Any client using model: "mistral-large-3" will route to this endpoint without needing the full HuggingFace path.

For KV cache tuning, three flags have the most impact:

  • --kv-cache-dtype fp8: halves KV cache memory on H100/H200/B200. Enables longer contexts or more concurrent requests on the same hardware. Some older vLLM releases could not combine this with --enable-prefix-caching; check the documentation for your version before enabling both.
  • --max-num-batched-tokens 8192: controls how many tokens vLLM processes in a single scheduler step. Higher values improve throughput at the cost of first-token latency.
  • --enable-chunked-prefill: breaks long prompts into smaller chunks processed across multiple scheduler steps. Prevents long prefill operations from blocking the decode queue.

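For horizontal scaling, run multiple vLLM instances behind an NGINX upstream. NGINX only balances load across the instances you list; adding or removing instances in response to demand is up to your orchestration layer: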

nginx
upstream mistral_large {
  least_conn;
  server gpu-node-1:8000;
  server gpu-node-2:8000;
}

Each instance should run on its own multi-GPU node. NGINX least_conn distributes requests to the instance with the fewest active connections. For full setup details, the vLLM production deployment guide covers Docker configuration, load balancer setup, and systemd service management.
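The least-connections idea is simple enough to simulate. The sketch below is a simplified model of the policy, not NGINX's implementation (NGINX also accounts for server weights and breaks ties differently); the in-flight counts are hypothetical.

```python
def pick_least_conn(active: dict) -> str:
    """Return the backend with the fewest in-flight requests (ties: first listed)."""
    return min(active, key=active.get)

# Hypothetical in-flight request counts for the two upstream servers.
active = {"gpu-node-1:8000": 3, "gpu-node-2:8000": 1}

assignments = []
for _ in range(4):
    backend = pick_least_conn(active)
    active[backend] += 1          # request stays open (long generation)
    assignments.append(backend)
print(assignments)
```

Because LLM requests are long-lived and vary widely in length, balancing on open connections tends to track real load better than plain round-robin.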

The key Prometheus metrics to monitor:

  • vllm:kv_cache_usage_perc: KV cache fill rate. Keep below 85% to avoid scheduling stalls where new requests queue behind a full cache.
  • vllm:num_requests_waiting: requests queued but not yet scheduled. Sustained values above 10 indicate you need more throughput capacity.
  • vllm:time_to_first_token_seconds: TTFT latency histogram. Track the p95 and p99 to catch tail latency issues under load.
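The first two thresholds above can be checked with a small script that parses the Prometheus text format. This is a minimal sketch: the parser drops labels and ignores optional timestamps, the sample payload is made up, and it assumes the KV cache gauge is reported as a fraction between 0 and 1. Verify that against your server's /metrics output.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {metric_name: value}.

    Labels are dropped; if a name repeats, the last sample wins. Good enough
    for a single-model server, not a general-purpose parser.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value_part = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        try:
            values[name] = float(value_part)
        except ValueError:
            continue
    return values

def check_alerts(values: dict) -> list:
    """Apply the thresholds from the list above."""
    alerts = []
    if values.get("vllm:kv_cache_usage_perc", 0.0) > 0.85:
        alerts.append("KV cache above 85%")
    if values.get("vllm:num_requests_waiting", 0.0) > 10:
        alerts.append("more than 10 requests waiting")
    return alerts

# Made-up sample payload in the shape of a /metrics response.
sample = """
# HELP vllm:kv_cache_usage_perc KV-cache usage.
vllm:kv_cache_usage_perc{model_name="mistral-large-3"} 0.91
vllm:num_requests_waiting{model_name="mistral-large-3"} 4.0
"""
alerts = check_alerts(parse_metrics(sample))
print(alerts)
```

In production these checks belong in Prometheus alerting rules; a script like this is mainly useful for quick spot checks during load tests.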

Spheron GPU Pricing for Mistral Large 3

Prices below are derived from the Spheron GPU pricing API as of 09 Jun 2026. The on-demand figures use the lowest available per-GPU rate for non-spot offers; spot figures use the lowest per-GPU spot rate across available bundles.

| GPU config | On-demand/hr | Spot/hr | Notes |
| --- | --- | --- | --- |
| 4x B200 SXM6 | ~$34.44 | ~$21.36 | FP8, recommended; combine two 2-GPU bundles |
| 8x H200 SXM5 141GB | ~$36.32 | ~$14.16 | FP8 with KV cache headroom; four 2-GPU bundles |
| 8x H100 SXM5 80GB | ~$20.32 | ~$11.44 | INT4 only for Mistral Large 3; 8-GPU bundle available |

Spot instances reduce hourly cost by 40-60% versus on-demand. They are suitable for batch inference, offline evaluation, and workloads that can handle preemption. On-demand is the right choice for production APIs where uptime matters.


Mistral Large 3 vs DeepSeek V4: Which to Deploy?

| Property | Mistral Large 3 | DeepSeek V4 |
| --- | --- | --- |
| Total parameters | 675B | ~1T |
| Active parameters | 41B | ~37B |
| License | Apache 2.0 | Permissive (check specific terms) |
| Min GPU config (FP8) | 4x B200 or 8x H200 | 8x H200 or larger |
| Approx. on-demand/hr | ~$34.44 (4x B200) | ~$36.32+ (8x H200) |
| Context window | 256K | ~1M |

Choose Mistral Large 3 when the smaller GPU footprint matters, when Apache 2.0 licensing is a requirement for your deployment, or when you want a Western-lab model without export control concerns. The 675B total parameter count fits on fewer GPUs than DeepSeek V4, which translates directly to lower hourly cost for the same request volume.

Choose DeepSeek V4 when you need a longer context window (1M tokens versus 256K), when benchmark performance on specific tasks like math or reasoning is the primary criterion, or when you already have 8x H200 nodes provisioned for other workloads. For a full setup walkthrough with that model, see the DeepSeek V4 deployment guide.

For most teams deciding between these two, hardware cost is the deciding factor. Mistral Large 3 fits on 4x B200 ($34.44/hr on-demand) versus DeepSeek V4's minimum of 8x H200 ($36.32/hr), a modest saving of about $1.88/hr at the minimum configuration. The 4x B200 config provides a modest KV cache budget sufficient for 16K-context workloads at the recommended --max-model-len setting; for longer context headroom, the 8x H200 config is the better fit.


Mistral Large 3 fits on 4x B200 or 8x H200 at FP8, both configs available on-demand or as spot on Spheron.



Quick Setup Guide

  1. Calculate VRAM requirements for your target quantization

    Mistral Large 3 has 675B total parameters. At FP8 (1 byte/param), weights alone require 675 GB. Add roughly 5% runtime overhead for inference-time activations and framework state to get ~710 GB minimum (excluding KV cache budget). At 4-bit (0.5 bytes/param), weights are ~338 GB plus overhead, giving ~355 GB minimum. Choose your quantization based on quality requirements: FP8 delivers near-identical quality to FP16, while Q4_K_M saves ~50% VRAM versus FP8 at a 3-5% quality penalty on most benchmarks.

  2. Choose a multi-GPU configuration on Spheron

    Provision your GPU instance at app.spheron.ai. Recommended configs: 4x B200 SXM6 (768 GB) for FP8 single-node; 8x H200 SXM5 141GB (1128 GB) for FP8 single-node with generous KV headroom; 4x H200 SXM5 141GB (564 GB) for Q4_K_M. Ensure at least 700 GB persistent storage for weight downloads. SSH in and verify connectivity with nvidia-smi -L.

  3. Install vLLM and download model weights

    Install with pip install -U vllm (always install the latest version). Download weights from Hugging Face: huggingface-cli download mistralai/Mistral-Large-3-675B-Instruct-2512 --local-dir /data/models/mistral-large-3. This may take 1-2 hours depending on your bandwidth.

  4. Launch the vLLM inference server with expert parallelism

    For an 8x H200 FP8 deployment: vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 --quantization fp8 --tensor-parallel-size 8 --tokenizer_mode mistral --config_format mistral --load_format mistral --gpu-memory-utilization 0.88 --max-model-len 32768 --port 8000. For a 4x B200 deployment: use --tensor-parallel-size 4. For multi-node with expert parallelism: add --enable-expert-parallel (a boolean flag) to enable expert routing across nodes; the expert-parallel degree is derived automatically from your TP and DP configuration.

  5. Test the OpenAI-compatible API endpoint

    Send a test request to http://localhost:8000/v1/chat/completions with model set to your model ID. Monitor GPU utilization with nvidia-smi dmon and tune --max-num-seqs (start at 32, increase until GPU memory pressure appears). Verify that TTFT and tokens per second meet your latency SLA before directing production traffic.

  6. Tune KV cache and enable continuous batching

    For production workloads, add --enable-chunked-prefill --max-num-batched-tokens 8192 to improve latency under load. Set --kv-cache-dtype fp8 on H100/H200/B200 hardware to cut KV cache memory by 50% and fit longer context windows. Monitor vllm:kv_cache_usage_perc in Prometheus; keep it below 85% to avoid scheduling stalls.


Frequently Asked Questions

How much VRAM does Mistral Large 3 need?

At FP8, the 675B total parameter footprint is approximately 675 GB. With roughly 5% runtime overhead for inference-time activations and framework state, you need around 710 GB of VRAM minimum, which fits on 4x B200 192GB (768 GB total) with limited headroom or on 8x H200 141GB (1128 GB total) comfortably. At 4-bit (Q4_K_M), the footprint drops to roughly 340 GB, putting 4x H200 141GB (564 GB total) within reach. Always budget for KV cache on top of model weights.

What is the difference between tensor parallelism and expert parallelism?

Tensor parallelism splits each weight matrix across GPUs and requires all GPUs to synchronize on every forward pass via an all-reduce collective. Expert parallelism routes tokens to different GPUs based on which expert handles them; each GPU holds a subset of the MoE experts, and only the GPUs that own a token's selected experts process that token. For MoE models like Mistral Large 3, where only a small fraction of total parameters activate per token (41B of 675B), expert parallelism avoids the communication cost of splitting large matrices and scales more efficiently across nodes.

Can I run Mistral Large 3 on H100 GPUs?

Yes, but only with INT4 quantization. The 675 GB FP8 weight footprint exceeds the 640 GB available on 8x H100 SXM5 80GB, so FP8 is not feasible on H100. With INT4/Q4_K_M, the footprint drops to roughly 340 GB, fitting on 8x H100 (640 GB) with about 300 GB of KV cache headroom. H100 configurations work best for low-concurrency or batch workloads due to INT4's 3-5% quality penalty versus FP8. For FP8 serving, the practical minimum is 4x B200 192GB (768 GB) or 8x H200 141GB (1128 GB).

Does vLLM support Mistral Large 3?

vLLM supports Mixtral-class MoE architectures natively, and Mistral Large 3 uses the same architecture family. Install the latest vLLM (pip install -U vllm). Pass --tensor-parallel-size to set GPU count and --enable-expert-parallel (a boolean flag) to activate expert parallelism for multi-node deployments; the expert-parallel degree is derived automatically. You must also pass --tokenizer_mode mistral --config_format mistral --load_format mistral for correct model loading. If the model class is not recognized, add --trust-remote-code.

How much does it cost to serve Mistral Large 3?

Cost per million tokens depends on your throughput and GPU configuration. An 8x H200 SXM5 configuration on Spheron costs roughly $36.32/hr on-demand or $14.16/hr on spot (as of June 2026). At approximately 2,500 tokens/second of sustained aggregate throughput (9M tokens per hour), that works out to roughly $4.04/M tokens on-demand or ~$1.57/M tokens on spot. Spot instances typically reduce cost by 40-60% for workloads that can tolerate preemption.

How does Mistral Large 3 compare to DeepSeek V4 on deployment cost?

Mistral Large 3 (675B total, 41B active) and DeepSeek V4 (1T total, 37B active) have similar active parameter counts, so compute per token is comparable. However, Mistral Large 3's smaller total parameter footprint (675B vs 1T) means it fits on fewer GPUs. A 4x B200 config (~$34.44/hr on-demand) handles Mistral Large 3 at FP8, while DeepSeek V4's 1T parameter footprint requires at least 8x H200 (~$36.32/hr), with a more practical recommendation of 10+ H200 for adequate KV cache headroom. The hardware difference favors Mistral Large 3 for teams optimizing deployment cost.
