Deploy Mistral Large 3 on GPU Cloud: Self-Host the 675B MoE with vLLM and Expert Parallelism (2026)

Q: How much VRAM does Mistral Large 3 require?

At FP8, the 675B total parameter footprint is approximately 675 GB. With roughly 5% runtime overhead for inference-time activations and framework state, you need around 710 GB of VRAM minimum. That fits on 4x B200 192GB (768 GB total, with limited KV cache headroom) or 8x H200 141GB (1128 GB total, with generous headroom). At 4-bit (Q4_K_M), the footprint drops to roughly 340 GB, putting 4x H200 141GB (564 GB total) within reach. Always budget for KV cache on top of model weights.

Q: What is the difference between expert parallelism and tensor parallelism for Mistral Large 3?

Tensor parallelism splits each weight matrix across GPUs and requires all GPUs to synchronize on every forward pass via an all-reduce collective. Expert parallelism routes tokens to different GPUs based on which expert handles them; each GPU holds a subset of the MoE expert layers and only the active expert's GPU contributes to a given token. For MoE models like Mistral Large 3 where only a small fraction of total parameters activate per token (41B of 675B), expert parallelism avoids the communication cost of splitting large matrices and scales more efficiently across nodes.

Q: Can I run Mistral Large 3 on H100 GPUs?

Yes, but only with INT4 quantization. The 675 GB FP8 weight footprint exceeds the 640 GB available on 8x H100 SXM5 80GB, so FP8 is not feasible on H100. With INT4/Q4_K_M, the footprint drops to roughly 340 GB, fitting on 8x H100 (640 GB) with about 300 GB of KV cache headroom. H100 configurations work best for low-concurrency or batch workloads due to INT4's 3-5% quality penalty versus FP8. For FP8 serving, the practical minimum is 4x B200 192GB (768 GB) or 8x H200 141GB (1128 GB).

Q: Does vLLM support Mistral Large 3 natively?

vLLM supports Mixtral-class MoE architectures natively, and Mistral Large 3 uses the same architecture family. Install the latest vLLM (`pip install -U vllm`). Pass --tensor-parallel-size to set GPU count and --enable-expert-parallel (a boolean flag) to activate expert parallelism for multi-node deployments; the expert-parallel degree is derived automatically. You must also pass --tokenizer_mode mistral --config_format mistral --load_format mistral for correct model loading. If the model class is not recognized, add --trust-remote-code.

Q: What is the cost per million tokens for Mistral Large 3 on Spheron?

Cost per million tokens depends on your throughput and GPU configuration. At sustained throughput, an 8x H200 SXM5 configuration on Spheron costs roughly $36.32/hr on-demand or $14.16/hr on spot (as of June 2026). At approximately 2,500 tokens/second aggregate throughput, that works out to roughly $4.04/M tokens on-demand or ~$1.57/M tokens on spot. Spot instances typically reduce cost by 40-60% for workloads that can tolerate preemption.

Q: How does Mistral Large 3 compare to DeepSeek V4 for deployment cost?

Mistral Large 3 (675B total, 41B active) and DeepSeek V4 (1T total, 37B active) have similar active parameter counts, so inference cost per token is comparable. However, Mistral Large 3's smaller total parameter footprint (675B vs 1T) means it fits on fewer GPUs. A 4x B200 config (~$34.44/hr on-demand) handles Mistral Large 3 at FP8, while DeepSeek V4's 1T parameter footprint requires at least 8x H200 (~$36.32/hr), with a more practical recommendation of 10+ H200 for adequate KV cache headroom. The hardware difference favors Mistral Large 3 for teams optimizing deployment cost.

Mistral Large 3 is a 675B-parameter Mixture of Experts model with only 41B parameters active per token, released in December 2025 under the Apache 2.0 license. It supports both text and image inputs, with a native vision encoder for multimodal understanding. The model positions itself as a Western-lab alternative to the DeepSeek family at similar active-parameter scale, with a smaller total footprint than DeepSeek V4's 1T parameters. The Apache 2.0 license removes the deployment restrictions that come with many frontier models, which matters if you're building commercial products or need to modify the serving stack. This guide covers everything from VRAM planning to production vLLM configuration on multi-GPU instances.

Mistral Large 3 at a Glance

Before touching any configuration, it helps to know the model's architecture numbers:

Property	Value
Total parameters	675B
Active parameters per token	41B
License	Apache 2.0
Context window	256K tokens (262,144)
Release date	December 2025
Modality	Text + Images (native multimodal, native vision encoder)
Architecture	Mixture of Experts (MoE)

Mistral has not published official benchmark results for Mistral Large 3 at time of writing. Check the official Mistral release post for updated numbers as they become available.

The Apache 2.0 license is the practical differentiator for many teams. Unlike models with custom non-commercial or research-only licenses, Apache 2.0 allows unrestricted commercial use, modification, and redistribution. You can embed Mistral Large 3 in a product, modify the model, or build derivative works without negotiating a separate license.

VRAM Requirements: FP8, Q4, and GGUF Footprints

A common misconception about MoE models is that VRAM scales with active parameters rather than total parameters. It does not. Mistral Large 3 has 675B total parameters, and all 675B must reside in GPU VRAM at inference time. The router selects which 41B activate for each token, but any expert can be selected for any token, so every expert's weights must stay loaded. Size your cluster for the full model.

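VRAM formula: total_vram_GB = (total_params_B × bytes_per_param) + (total_params_B × bytes_per_param × 0.05) + kv_cache_budget_GB, where total_params_B is the parameter count in billions (675), bytes_per_param is 1 for FP8 and 0.5 for 4-bit formats, and 0.05 is the ~5% runtime overhead.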
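The formula can be checked with a few lines of Python. This is a minimal sketch, not a sizing tool: the function names `min_vram_gb` and `fits` are illustrative, and the 5% overhead is the planning figure used in this guide rather than a measured value.

```python
# Sketch of the VRAM formula above. Parameter counts are in billions, so
# params_b * bytes_per_param gives gigabytes directly.
BYTES_PER_PARAM = {"fp8": 1.0, "int4": 0.5}  # bytes per weight at each precision
OVERHEAD_FRACTION = 0.05                      # the ~5% runtime overhead assumed in this guide


def min_vram_gb(params_b: float, dtype: str, kv_cache_gb: float = 0.0) -> float:
    """Return weights + runtime overhead + KV cache budget, in GB."""
    weights_gb = params_b * BYTES_PER_PARAM[dtype]
    overhead_gb = weights_gb * OVERHEAD_FRACTION
    return weights_gb + overhead_gb + kv_cache_gb


def fits(total_gpu_gb: float, params_b: float, dtype: str, kv_cache_gb: float = 0.0) -> bool:
    """True when the cluster's combined VRAM covers the estimate."""
    return total_gpu_gb >= min_vram_gb(params_b, dtype, kv_cache_gb)


if __name__ == "__main__":
    for dtype in ("fp8", "int4"):
        print(dtype, round(min_vram_gb(675, dtype), 2), "GB before KV cache")
    print("4x B200 (768 GB) at FP8:", fits(4 * 192, 675, "fp8"))
    print("8x H100 (640 GB) at FP8:", fits(8 * 80, 675, "fp8"))
```

Passing a non-zero `kv_cache_gb` shows how quickly a configuration that fits the weights alone runs out of room once a realistic KV cache budget is added.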

Quantization	Weight size	Runtime overhead	Min VRAM needed	Min GPU config
FP8	675 GB	~34 GB	~710 GB	8x H200 141GB (1128 GB) or 4x B200 192GB (768 GB)
INT4/GPTQ	~338 GB	~17 GB	~355 GB	5x H100 SXM5 80GB (400 GB), or 3x H200 141GB (423 GB)
Q4_K_M (GGUF)	~338 GB	~17 GB	~355 GB	3x H200 141GB (423 GB) or 5x H100 80GB (400 GB)
NVFP4	~338 GB	~17 GB	~355 GB	3x B200 SXM6 192GB (576 GB), Blackwell only, experimental

A few notes on this table:

FP8 is the preferred quantization for H200 and B200 hardware. It uses native hardware tensor cores, delivers near-lossless quality (less than 1% degradation on most benchmarks), and cuts memory to roughly half of BF16.

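INT4/GPTQ reduces the footprint enough to fit within the combined VRAM of 5x H100, but you need a pre-quantized checkpoint from Hugging Face and should expect 3-5% quality degradation on reasoning-heavy tasks. It works well for batch inference or use cases where you care more about cost than peak accuracy. Treat the GPU counts in the table as capacity floors: tensor parallelism requires a degree that evenly divides the model's attention-head count, so 4x or 8x configurations are the usual practical choice.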

NVFP4 is experimental as of mid-2026 and requires Blackwell hardware specifically (B200 or B300). It is not available on H100 or H200. If you are on Blackwell, it roughly halves the memory requirement versus FP8, but the quantization tooling is still maturing. Verify model support before betting production traffic on it.

Expert Parallelism and Tensor Parallelism: When to Use Each

Tensor parallelism (TP) splits individual weight matrices across GPUs. Every forward pass requires an all-reduce collective across all GPUs in the TP group, which synchronizes every layer. This is efficient on fast interconnects (NVLink) but becomes a communication bottleneck when scaling across multiple nodes connected over InfiniBand.

Expert parallelism (EP) assigns entire MoE expert layers to specific GPUs. A given GPU holds a subset of the experts, and tokens are routed to whichever GPU owns the relevant expert. The communication pattern shifts from all-reduce per layer to all-to-all per routing step. For large MoE models where most computation happens inside expert layers, EP scales better across nodes because you only communicate at routing boundaries, not inside each layer.

Config	Use case	Communication pattern	When to prefer
TP only	Single-node, up to 8 GPUs	All-reduce per layer	Low-latency serving, small batch sizes, NVLink-connected GPUs
EP only	Multi-node, high throughput	All-to-all at routing	Batch inference, large concurrent request queues
TP + EP hybrid	Very large clusters	Both	High throughput + multi-node scale
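To make the all-to-all pattern concrete, here is a toy sketch of expert-parallel dispatch. The expert count, GPU count, contiguous placement, and routing choices are all hypothetical values chosen for illustration; they are not Mistral Large 3's real configuration, and this is not how vLLM implements routing internally.

```python
from collections import Counter

NUM_EXPERTS = 8   # hypothetical; not Mistral Large 3's real expert count
NUM_GPUS = 4      # hypothetical expert-parallel group size


def owner_gpu(expert_id: int) -> int:
    """Contiguous placement: experts 0-1 on GPU 0, 2-3 on GPU 1, and so on."""
    experts_per_gpu = NUM_EXPERTS // NUM_GPUS
    return expert_id // experts_per_gpu


def dispatch_counts(token_experts: list[list[int]]) -> Counter:
    """Count how many (token, expert) pairs each GPU receives in the all-to-all."""
    counts = Counter()
    for experts in token_experts:
        for expert_id in experts:
            counts[owner_gpu(expert_id)] += 1
    return counts


if __name__ == "__main__":
    # Each inner list is the router's top-2 expert choice for one token (made-up values).
    routed = [[0, 5], [1, 2], [5, 7], [0, 1]]
    print(dict(dispatch_counts(routed)))
```

The uneven counts per GPU illustrate why expert parallelism favors large batches: with many tokens in flight the load tends to even out, while a handful of tokens can leave some GPUs idle.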

For most Mistral Large 3 deployments, single-node tensor parallelism on 4x B200 or 8x H200 is the practical choice. Multi-node EP adds significant operational complexity (Ray cluster setup, network topology requirements) and is only worth it when your throughput requirements exceed what a single node provides. For deeper coverage of MoE parallelism strategies, the MoE inference optimization guide covers expert routing, memory planning, and framework comparisons in detail.

Step-by-Step: Deploy Mistral Large 3 with vLLM

Prerequisites

GPU instance with 8x H200 SXM5 141GB, 4x B200 SXM6 192GB, or 8x H100 SXM5 80GB at minimum
CUDA 12.1 or later (B200 and other Blackwell GPUs need CUDA 12.8 or later), Python 3.10 or later
~700 GB persistent storage for FP8 weights (or ~350 GB for INT4)
Hugging Face account; model access if the repository is gated

Install vLLM

bash

pip install -U vllm  # always install the latest version

# Verify version
python -c "import vllm; print(vllm.__version__)"

The latest vLLM is required for native support of Mistral Large 3's tokenizer format and configuration. Earlier versions may fail to load the model or produce incorrect outputs, so if you see tokenizer or model loading errors, upgrade vLLM first.

Download Model Weights

bash

huggingface-cli download mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --local-dir /data/models/mistral-large-3

# Verify download completed
du -sh /data/models/mistral-large-3/

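Weight download takes 1-2 hours depending on bandwidth. The FP8 checkpoint is approximately 675 GB. If you plan to use GPTQ INT4, find a pre-quantized checkpoint on Hugging Face (search for Mistral-Large-3-675B-Instruct-2512-GPTQ variants) rather than quantizing at runtime. The launch commands in this guide reference the Hub model ID; to serve from the local copy, pass /data/models/mistral-large-3 as the model argument instead, otherwise vLLM fetches the weights again into the Hugging Face cache.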

Launch: 8x H200 FP8 (Recommended)

bash

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --quantization fp8 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --gpu-memory-utilization 0.88 \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --port 8000

Flag notes:

--quantization fp8: activates FP8 weight loading. Requires hardware that supports FP8 tensor cores (H100, H200, B200).
--tensor-parallel-size 8: splits every attention and FFN layer across 8 GPUs. Requires NVLink-connected GPUs for efficient communication. The official recipe from Mistral recommends TP=8 to match the model's attention-head configuration.
--tokenizer_mode mistral --config_format mistral --load_format mistral: required flags for correct Mistral model loading. Without these the server will not load the model correctly.
--gpu-memory-utilization 0.88: leaves 12% of each GPU's VRAM for CUDA context and framework overhead. If you see OOM errors on startup, lower to 0.85.
--max-model-len 32768: context window cap. Increase to 65536 or 131072 if your application needs longer contexts; KV cache will grow accordingly.
--kv-cache-dtype fp8: stores the KV cache in FP8 rather than BF16, cutting KV cache memory by roughly half.
--enable-chunked-prefill: processes long prompts in chunks rather than blocking the decode queue. Improves latency under mixed-length workloads.

Launch: 4x B200 FP8

bash

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --port 8000

The B200 has 192 GB per GPU, so 4 GPUs total 768 GB. With --gpu-memory-utilization 0.92, vLLM caps its entire allocation at ~706 GB. FP8 weights consume roughly 675 GB of that budget, leaving ~32 GB for KV cache. Use --max-model-len 16384 for safe operation at this utilization level. If you need a 32K context window, matching the 8x H200 example, raise --gpu-memory-utilization to 0.96 instead, which lifts the cap to ~737 GB and leaves ~62 GB for KV cache.
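The budget arithmetic is easy to rerun for other utilization values. The sketch below is a simplification: the helper name `kv_budget_gb` is illustrative, and it treats everything under the allocation cap that is not weights as KV cache, whereas in practice activations and framework state also take a share.

```python
def kv_budget_gb(gpus: int, gb_per_gpu: float, utilization: float, weights_gb: float) -> float:
    """VRAM left for KV cache after vLLM's allocation cap and the model weights."""
    allocation_cap = gpus * gb_per_gpu * utilization
    return allocation_cap - weights_gb


if __name__ == "__main__":
    # 4x B200 at 192 GB each, FP8 weights of roughly 675 GB.
    for util in (0.92, 0.96):
        print(util, round(kv_budget_gb(4, 192, util, 675), 1), "GB")
```

A negative result means the weights alone exceed the cap, so the server would fail at load time regardless of --max-model-len.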

Launch: Multi-Node with Expert Parallelism

For deployments that need to span multiple physical nodes, set up a Ray cluster first:

bash

# On the head node
ray start --head --port 6379

# On each worker node
ray start --address='<head_node_ip>:6379'

Then launch vLLM with combined tensor and expert parallelism:

bash

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --gpu-memory-utilization 0.88 \
  --max-model-len 32768 \
  --port 8000

--enable-expert-parallel activates expert parallelism, distributing whole MoE experts across the GPUs in the deployment instead of slicing each expert with tensor parallelism. The expert-parallel degree is derived automatically from the TP and DP configuration rather than set directly. Note that the command above sets only --tensor-parallel-size 4, which on its own occupies 4 GPUs; to span additional nodes you also need a data-parallel or pipeline-parallel degree, so check the vLLM multi-node documentation for the flags your version expects. This requires the latest vLLM and a working Ray cluster. For most teams, single-node deployment on 4x B200 or 8x H200 is simpler and covers the majority of throughput requirements.

Test the API

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512",
    messages=[{"role": "user", "content": "Explain MoE routing in 3 sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Or with curl:

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
    "messages": [{"role": "user", "content": "Explain MoE routing in 3 sentences."}],
    "max_tokens": 256
  }'

Throughput and Latency Benchmarks

The numbers below are derived from vLLM benchmarks on similar MoE architectures at comparable active-parameter scale (41B active parameters). Actual throughput varies with sequence length, batch size, and request concurrency. Treat these as planning estimates, not guarantees.

GPU Config	Quantization	Batch Size	Throughput (tok/s)	TTFT (ms)	Cost/M tokens (spot)
4x B200 SXM6	FP8	32	~3,000	~120	~$2.00
8x H200 SXM5	FP8	32	~2,500	~150	~$1.57
8x H100 SXM5	INT4	32	~2,000	~250	~$1.59

Throughput figures assume 512-token average output length and NVLink-connected GPUs. For workloads with longer outputs (2K+ tokens), throughput drops proportionally. For short-context completion tasks (under 256 tokens), throughput is higher.
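The cost column follows from the hourly price and the aggregate throughput. As a minimal sketch (the function name is illustrative; the hourly rates and throughput values are the planning estimates quoted in this guide, not measurements):

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly instance price and aggregate throughput to USD per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # (config, hourly price in USD, estimated tokens/second)
    scenarios = [
        ("8x H200 on-demand", 36.32, 2500),
        ("8x H200 spot", 14.16, 2500),
        ("4x B200 spot", 21.36, 3000),
        ("8x H100 INT4 spot", 11.44, 2000),
    ]
    for name, price, tps in scenarios:
        print(f"{name}: ${cost_per_million_tokens(price, tps):.2f}/M tokens")
```

Because throughput sits in the denominator, a node that runs at half its estimated tokens per second doubles the effective cost per token, which is why sustained utilization matters more than the hourly rate alone.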

Pricing fluctuates based on GPU availability. The prices above are based on 09 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Quantization Options to Cut the GPU Bill

FP8 is the default recommendation for H200 and B200 hardware. It uses native tensor cores available on Hopper and Blackwell GPUs, delivers quality within 1% of BF16 on most benchmarks, and halves the memory footprint versus FP16. Enable it with --quantization fp8.

GPTQ INT4 cuts memory in half again versus FP8, but you need a pre-quantized checkpoint from Hugging Face. Runtime quantization with GPTQ is slow. Use --quantization gptq and point at a pre-quantized model ID. Quality drops 3-5% on reasoning tasks but is acceptable for many production use cases. This is the practical path if you need to fit on A100 80GB or older H100 PCIe hardware.

GGUF via llama.cpp works for CPU-GPU hybrid inference. Throughput is lower than vLLM but memory flexibility is higher since you can offload layers to CPU RAM. Useful for local experiments or edge deployments where GPU VRAM is the bottleneck. Not recommended for serving more than a few concurrent users.

NVFP4 is Blackwell-specific (B200, B300) and experimental as of mid-2026. It halves memory versus FP8, which means a small number of B200 GPUs could serve the model. The tooling is maturing rapidly, but verify that Mistral Large 3 supports NVFP4 in your vLLM version before relying on it for production.

Format	VRAM saving vs FP16	Quality delta (MMLU)	Recommended hardware	vLLM flag
FP8	~50%	<1%	H100, H200, B200	`--quantization fp8`
GPTQ INT4	~75%	3-5%	All NVIDIA, requires pre-quantized checkpoint	`--quantization gptq`
GGUF Q4_K_M	~75%	3-5%	CPU+GPU hybrid	Use llama.cpp, not vLLM
NVFP4	~75%	TBD	B200, B300 only	Experimental

Production Serving: OpenAI-Compatible API, Autoscaling, and KV Cache Tuning

To expose an OpenAI-compatible endpoint with a specific model name:

bash

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --quantization fp8 \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name mistral-large-3

--served-model-name sets the model field in API responses. Any client using model: "mistral-large-3" will route to this endpoint without needing the full HuggingFace path.

For KV cache tuning, three flags have the most impact:

--kv-cache-dtype fp8: halves KV cache memory on H100/H200/B200. Enables longer contexts or more concurrent requests on the same hardware. Some older vLLM releases could not combine an FP8 KV cache with --enable-prefix-caching; confirm compatibility in the release notes for your version before enabling both.
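--max-num-batched-tokens 8192: caps how many tokens vLLM processes in a single scheduler step. With chunked prefill enabled, higher values improve throughput and time to first token because more prompt tokens fit in each step, at the cost of higher inter-token latency for requests that are already decoding.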
--enable-chunked-prefill: breaks long prompts into smaller chunks processed across multiple scheduler steps. Prevents long prefill operations from blocking the decode queue.

For horizontal autoscaling, run multiple vLLM instances behind an NGINX upstream:

nginx

upstream mistral_large {
  least_conn;
  server gpu-node-1:8000;
  server gpu-node-2:8000;
}

Each instance should run on its own multi-GPU node. NGINX least_conn distributes requests to the instance with the fewest active connections. For full setup details, the vLLM production deployment guide covers Docker configuration, load balancer setup, and systemd service management.
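The selection rule behind least_conn can be illustrated with a toy balancer. This is a sketch of the idea only, not NGINX's implementation: the class name is hypothetical, it ignores server weights and health checks, and it breaks ties by list order.

```python
class LeastConnBalancer:
    """Toy model of least-connections selection; not NGINX's actual implementation."""

    def __init__(self, backends: list[str]) -> None:
        # Track the number of in-flight requests per backend.
        self.active = {backend: 0 for backend in backends}

    def acquire(self) -> str:
        # min() returns the first backend with the lowest count, so ties go to list order.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        # Call when the request finishes so the backend becomes eligible again.
        self.active[backend] -= 1


if __name__ == "__main__":
    lb = LeastConnBalancer(["gpu-node-1:8000", "gpu-node-2:8000"])
    first = lb.acquire()    # both idle, so list order picks gpu-node-1
    second = lb.acquire()   # gpu-node-1 is busy, so gpu-node-2 is chosen
    lb.release(first)       # gpu-node-1 finishes its request
    print(first, second, lb.acquire())
```

Least-connections suits LLM serving better than round-robin because request durations vary widely with output length, so connection count is a closer proxy for load than request count.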

The key Prometheus metrics to monitor:

vllm:kv_cache_usage_perc: KV cache fill rate. Keep below 85% to avoid scheduling stalls where new requests queue behind a full cache.
vllm:num_requests_waiting: requests queued but not yet scheduled. Sustained values above 10 indicate you need more throughput capacity.
vllm:time_to_first_token_seconds: TTFT latency histogram. Track the p95 and p99 to catch tail latency issues under load.
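A quick way to act on the first two metrics is to parse the Prometheus text output and compare it against the thresholds above. The sketch below makes several assumptions: the sample text is illustrative rather than captured from a real server, the KV cache gauge is assumed to be reported as a fraction between 0 and 1, and the parser handles only simple `name{labels} value` lines without timestamps.

```python
# Illustrative sample in Prometheus text format; values and labels are made up.
SAMPLE = """\
# HELP vllm:kv_cache_usage_perc KV cache usage (illustrative sample, not real output)
vllm:kv_cache_usage_perc{model_name="mistral-large-3"} 0.91
vllm:num_requests_waiting{model_name="mistral-large-3"} 14.0
"""

# Thresholds from this guide: 85% cache fill (as a fraction) and 10 waiting requests.
THRESHOLDS = {"vllm:kv_cache_usage_perc": 0.85, "vllm:num_requests_waiting": 10.0}


def parse_gauges(text: str) -> dict[str, float]:
    """Map metric name to value, dropping labels and comment lines."""
    gauges = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)   # value is the last space-separated field
        name = series.split("{", 1)[0]        # strip the {label="..."} part
        gauges[name] = float(value)
    return gauges


def breaches(gauges: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their threshold."""
    return [name for name, limit in THRESHOLDS.items() if gauges.get(name, 0.0) > limit]


if __name__ == "__main__":
    print(breaches(parse_gauges(SAMPLE)))
```

In a real deployment the text would come from the server's /metrics endpoint and the alerting would normally live in Prometheus rules; the script is only meant to show what the thresholds check.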

Spheron GPU Pricing for Mistral Large 3

Prices below are derived from the Spheron GPU pricing API as of 09 Jun 2026. The on-demand figures use the lowest available per-GPU rate for non-spot offers; spot figures use the lowest per-GPU spot rate across available bundles.

GPU Config	On-Demand/hr	Spot/hr	Notes
4x B200 SXM6	~$34.44	~$21.36	FP8, recommended; combine two 2-GPU bundles
8x H200 SXM5 141GB	~$36.32	~$14.16	FP8 with KV cache headroom; four 2-GPU bundles
8x H100 SXM5 80GB	~$20.32	~$11.44	INT4 only for Mistral Large 3; 8-GPU bundle available

Spot instances reduce hourly cost by 40-60% versus on-demand. They are suitable for batch inference, offline evaluation, and workloads that can handle preemption. On-demand is the right choice for production APIs where uptime matters.
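The discount for each configuration can be computed directly from the table. A minimal sketch, using the 09 Jun 2026 prices quoted above (the dictionary and function names are illustrative):

```python
# (on-demand USD/hr, spot USD/hr) from the pricing table in this guide.
PRICES = {
    "4x B200 SXM6": (34.44, 21.36),
    "8x H200 SXM5": (36.32, 14.16),
    "8x H100 SXM5": (20.32, 11.44),
}


def spot_discount_pct(on_demand: float, spot: float) -> float:
    """Percentage saved by running on spot instead of on-demand."""
    return (1 - spot / on_demand) * 100


if __name__ == "__main__":
    for config, (on_demand, spot) in PRICES.items():
        print(f"{config}: {spot_discount_pct(on_demand, spot):.0f}% cheaper on spot")
```

The discount differs noticeably between GPU types, so it is worth recomputing per configuration rather than assuming a flat rate when comparing batch workloads.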

Mistral Large 3 vs DeepSeek V4: Which to Deploy?

Property	Mistral Large 3	DeepSeek V4
Total parameters	675B	~1T
Active parameters	41B	~37B
License	Apache 2.0	Permissive (check specific terms)
Min GPU config (FP8)	4x B200 or 8x H200	8x H200 or larger
Approx. on-demand/hr	~$34.44 (4x B200)	~$36.32+ (8x H200)
Context window	256K	~1M

Choose Mistral Large 3 when the smaller GPU footprint matters, when Apache 2.0 licensing is a requirement for your deployment, or when you want a Western-lab model without export control concerns. The 675B total parameter count fits on fewer GPUs than DeepSeek V4, which translates directly to lower hourly cost for the same request volume.

Choose DeepSeek V4 when you need a longer context window (1M tokens versus 256K), when benchmark performance on specific tasks like math or reasoning is the primary criterion, or when you already have 8x H200 nodes provisioned for other workloads. For a full setup walkthrough with that model, see the DeepSeek V4 deployment guide.

For most teams deciding between these two, the hardware cost difference is the deciding factor. Mistral Large 3 fits on 4x B200 ($34.44/hr) versus DeepSeek V4's minimum of 8x H200 ($36.32/hr), a saving of about $1.88/hr (roughly 5%) at the minimum configuration, and the gap widens if DeepSeek V4 needs more than 8 GPUs for adequate KV cache headroom. The 4x B200 config provides a modest KV cache budget sufficient for 16K-context workloads at the recommended --max-model-len setting; for longer context headroom, the 8x H200 config is the better fit.

Mistral Large 3 fits on 4x B200 or 8x H200 at FP8, both configs available on-demand or as spot on Spheron.
Quick Setup Guide

Calculate VRAM requirements for your target quantization
Mistral Large 3 has 675B total parameters. At FP8 (1 byte/param), weights alone require 675 GB. Add roughly 5% runtime overhead for inference-time activations and framework state to get ~710 GB minimum (excluding KV cache budget). At 4-bit (0.5 bytes/param), weights are ~338 GB plus overhead, giving ~355 GB minimum. Choose your quantization based on quality requirements: FP8 delivers near-identical quality to FP16, while Q4_K_M saves ~50% VRAM at a 3-5% quality penalty on most benchmarks.
Choose a multi-GPU configuration on Spheron
Provision your GPU instance at app.spheron.ai. Recommended configs: 4x B200 SXM6 (768 GB) for FP8 single-node; 8x H200 SXM5 141GB (1128 GB) for FP8 single-node with generous KV headroom; 4x H200 SXM5 141GB (564 GB) for Q4_K_M. Ensure at least 700 GB persistent storage for weight downloads. SSH in and verify connectivity with nvidia-smi -L.
Install vLLM and download model weights
Install with pip install -U vllm (always install the latest version). Download weights from Hugging Face: huggingface-cli download mistralai/Mistral-Large-3-675B-Instruct-2512 --local-dir /data/models/mistral-large-3. This may take 1-2 hours depending on your bandwidth.
Launch the vLLM inference server with expert parallelism
For an 8x H200 FP8 deployment: vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 --quantization fp8 --tensor-parallel-size 8 --tokenizer_mode mistral --config_format mistral --load_format mistral --gpu-memory-utilization 0.88 --max-model-len 32768 --port 8000. For a 4x B200 deployment: use --tensor-parallel-size 4. For multi-node with expert parallelism: add --enable-expert-parallel (a boolean flag) to enable expert routing across nodes; the expert-parallel degree is derived automatically from your TP and DP configuration.
Test the OpenAI-compatible API endpoint
Send a test request to http://localhost:8000/v1/chat/completions with model set to your model ID. Watch GPU utilization and memory with nvidia-smi dmon, read token throughput from the vLLM server logs or its Prometheus metrics, and tune --max-num-seqs (start at 32, increase until GPU memory pressure appears). Verify that TTFT and tokens per second meet your latency SLA before directing production traffic.
Tune KV cache and enable continuous batching
For production workloads, add --enable-chunked-prefill --max-num-batched-tokens 8192 to improve latency under load. Set --kv-cache-dtype fp8 on H100/H200/B200 hardware to cut KV cache memory by 50% and fit longer context windows. Monitor vllm:kv_cache_usage_perc in Prometheus; keep it below 85% to avoid scheduling stalls.
