Mixture of Experts is the architecture behind virtually all major frontier open models released in 2026. DeepSeek V3.2 Speciale (685B total, 37B active), Llama 4 Maverick (400B total, 17B active), Kimi K2 (1T total, ~32B active), and Mixtral 8x22B (141B total, 39B active) all use MoE to deliver dense-model quality at a fraction of the compute cost per token. The GPU challenge is specific: you need enough VRAM to hold all expert weights in memory, but only a small subset activate on each forward pass. Getting the memory math right and choosing the correct parallelism strategy is what separates a working deployment from one that OOMs or runs at 10% efficiency. For general VRAM planning, see the GPU memory requirements guide. For production vLLM setup beyond MoE specifics, see vLLM production deployment.
Why MoE Models Dominate in 2026
Dense models activate every parameter on every token. A 70B dense model runs 70B parameter computations per forward pass. MoE models route tokens through a small subset of experts, so a 685B MoE model with 37B active parameters only runs 37B parameter computations per forward pass, roughly the same compute as a 37B dense model but with the representational capacity of 685B parameters.
The routing mechanism works like this: each MoE layer contains N expert FFN networks. A router network scores each token against all experts and selects the top-K highest-scoring experts to process that token. For DeepSeek V3.2 Speciale, that's top-8 from 256 experts. For Llama 4 Maverick, it's top-1 from 128 experts.
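The selection step can be sketched in a few lines of NumPy. This is an illustrative toy (a plain softmax gate with top-K selection; the function name, shapes, and random weights are invented for the example), not the fused routing kernel a production inference engine uses:

```python
import numpy as np

def route_tokens(hidden, router_w, top_k):
    """Score each token against all experts and pick the top-K.

    hidden:   (tokens, d_model) token representations
    router_w: (d_model, n_experts) router weight matrix
    """
    logits = hidden @ router_w                      # (tokens, n_experts)
    # softmax over experts to get gating probabilities
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # indices of the K highest-scoring experts per token
    top_idx = np.argsort(probs, axis=-1)[:, -top_k:]
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    # renormalize the selected gates so they sum to 1 per token
    gates = top_p / top_p.sum(axis=-1, keepdims=True)
    return top_idx, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 64))      # 4 tokens, d_model=64
router_w = rng.standard_normal((64, 256))  # 256 experts, as in DeepSeek
idx, gates = route_tokens(hidden, router_w, top_k=8)
print(idx.shape, gates.shape)  # (4, 8) (4, 8)
```

Each token ends up with 8 expert indices and 8 normalized gate weights; only those 8 expert FFNs run for that token.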
Here are the five major MoE models worth knowing for 2026 deployments:
| Model | Total Params | Active Params | Experts | Top-K | Architecture |
|---|---|---|---|---|---|
| DeepSeek V3.2 Speciale | 685B | 37B | 256 | 8 | DSA + MoE |
| Llama 4 Maverick | 400B | 17B | 128 | 1 | MoE |
| Kimi K2 | 1T | ~32B | 384 | 8 | MoE |
| Mixtral 8x22B | 141B | 39B | 8 | 2 | MoE |
| Qwen2-57B-A14B | 57B | 14B | 64 | 8 | MoE |
For more on Kimi K2 deployment specifics, see the Kimi K2 deployment guide. For Llama 4 Scout and Maverick setup, see Llama 4 on GPU cloud.
The practical consequence is that MoE models are efficient to run but expensive to fit. You're paying for VRAM to hold all experts even though most of them sit idle for any given token. This changes the GPU selection math entirely.
GPU Memory Math for MoE: What Actually Fits in VRAM
This is the part most deployment guides get wrong. Active parameter count has nothing to do with how much GPU memory you need. Every expert's weights must be resident in VRAM at all times so the router can dispatch tokens to any of them.
The formula for GPU memory planning:
```
total_vram = (total_params x bytes_per_dtype x 1.15) + kv_cache_budget
```

Bytes per data type:
| Dtype | Bytes per Parameter |
|---|---|
| BF16 | 2 |
| FP8 | 1 |
| INT8 | 1 |
| INT4 | 0.5 |
The 1.15 multiplier covers activation memory, framework overhead, and routing buffers. KV cache budget depends on your target context length and batch size. For a detailed KV cache calculation method, see the KV cache optimization guide.
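The formula is simple enough to script. A small helper (the function name and defaults are ours, not from any particular tool) reproduces the table rows below within rounding:

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def vram_gb(total_params_b, dtype, kv_cache_gb=0.0, overhead=1.15):
    """total_vram = total_params x bytes_per_dtype x 1.15 + kv_cache_budget.

    total_params_b is in billions of parameters; 1B params at 1 byte
    each is 1 GB, so the unit conversion is built in. Result is in GB.
    """
    weights_gb = total_params_b * BYTES_PER_PARAM[dtype]
    return weights_gb * overhead + kv_cache_gb

# Mixtral 8x22B (141B) at FP8: ~141 GB of weights, ~162 GB with overhead
print(round(vram_gb(141, "fp8")))                  # 162
# DeepSeek V3.2 Speciale (685B) raw BF16 weights, no overhead or KV cache
print(round(vram_gb(685, "bf16", overhead=1.0)))   # 1370
```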
Here's what the major MoE models need across quantization levels:
| Model | BF16 Weights | FP8 Weights | INT4 Weights | Min GPU Config (FP8) |
|---|---|---|---|---|
| DeepSeek V3.2 Speciale (685B) | ~1,370 GB | ~685 GB | ~343 GB | 8x H200 141GB |
| Kimi K2 (1T) | ~2,000 GB | ~1,000 GB | ~500 GB | 16x H100 80GB |
| Llama 4 Maverick (400B) | ~800 GB | ~400 GB | ~200 GB | 6x H100 80GB |
| Mixtral 8x22B (141B) | ~282 GB | ~141 GB | ~70 GB | 3x H100 80GB |
| Qwen2-57B-A14B (57B) | ~114 GB | ~57 GB | ~28 GB | 1x H100 80GB |
Three constraints stand out from these figures:
First, Kimi K2 at 1T parameters breaks the single-node boundary on H100 and H200 hardware even at FP8. The FP8 weights are approximately 1 TB, or ~1,150 GB with overhead. Sixteen H100 80GB GPUs give 1,280 GB total, enough for the weights, but sixteen GPUs span two nodes. Eight H200 141GB GPUs, a full single node, give only 1,128 GB, which falls short. Both configurations therefore require a multi-node Ray cluster. The exception is 8x B200 SXM6, which provides 1,536 GB total and can fit Kimi K2 FP8 on a single node. More on that in the multi-GPU section.
Second, the "min GPU config" figures assume you have almost no KV cache headroom. For production use with real context lengths and concurrent requests, add 20-40% more VRAM budget. The minimum configs work for smoke tests and single-request validation, not high-throughput serving.
Third, KV cache for MoE models is the same size as for a dense model with the same active architecture. DeepSeek V3.2 Speciale's KV cache is sized by its 37B active parameter architecture, not its 685B total. This is one place where the MoE efficiency advantage shows up directly in memory budgeting.
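To put a number on the KV cache budget, the standard sizing is two tensors (K and V) per layer, each `kv_heads x head_dim` wide, per token, per sequence. A sketch with illustrative architecture numbers (the GQA shape below is an assumption chosen for the example, not any specific model's config):

```python
def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, batch, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, per token, per sequence."""
    total_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes * ctx_len * batch
    return total_bytes / 1e9

# Illustrative only: a GQA model with 60 layers, 8 KV heads of dim 128,
# 32k context, 16 concurrent sequences, BF16 cache (2 bytes/value).
print(round(kv_cache_gb(60, 8, 128, 32768, 16, dtype_bytes=2), 1))  # 128.8
# An FP8 KV cache (dtype_bytes=1) would halve this figure.
```

Note how quickly long context and concurrency dominate the budget; this is why the "min GPU config" column above leaves no room for production serving.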
Expert Parallelism vs Tensor Parallelism: Choosing the Right Strategy
Dense models only need tensor parallelism or pipeline parallelism. MoE models add a third option: expert parallelism. Choosing between them has a direct effect on throughput and GPU utilization.
Expert Parallelism
Expert parallelism assigns entire experts to specific GPUs. GPU 0 holds experts 0-31, GPU 1 holds experts 32-63, and so on. When the router selects an expert, the token is dispatched to the GPU holding that expert via all-to-all communication.
The advantage: each GPU does focused, cache-friendly compute on its own experts. The communication pattern is all-to-all at the routing step, but individual GPU compute is efficient.
Best conditions for expert parallelism:
- GPU count is 2-8 (matches the expert distribution)
- High-bandwidth NVLink interconnect (SXM variants)
- Individual expert weight matrices fit comfortably in one GPU's VRAM
- High-traffic workloads where routing overhead amortizes well
vLLM flag: `--enable-expert-parallel`
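To make the expert-to-GPU mapping concrete, here is a minimal sketch of how tokens are bucketed by destination GPU before the all-to-all exchange, assuming the contiguous sharding described above (`plan_dispatch` is our illustrative name, not a vLLM API):

```python
from collections import defaultdict

def plan_dispatch(expert_ids, experts_per_gpu):
    """Group token indices by destination GPU for the all-to-all step.

    expert_ids[i] is the expert chosen for token i; with contiguous
    sharding, the GPU holding expert e is e // experts_per_gpu.
    """
    buckets = defaultdict(list)
    for tok, expert in enumerate(expert_ids):
        buckets[expert // experts_per_gpu].append(tok)
    return dict(buckets)

# 8 tokens routed among 64 experts sharded over 2 GPUs (32 experts each)
print(plan_dispatch([3, 40, 31, 63, 0, 35, 12, 59], experts_per_gpu=32))
# {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```

In a real engine this bucketing happens per top-K selection and the buckets are exchanged over NVLink, which is why interconnect bandwidth matters so much here.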
Tensor Parallelism
Tensor parallelism splits each layer's weight matrices across GPUs column-wise or row-wise. All GPUs participate in every layer computation, synchronizing via all-reduce at each layer boundary.
For MoE models, tensor parallelism applies to both attention layers and expert FFN layers. It's a good choice when individual expert weights are too large for a single GPU's VRAM.
Best conditions for tensor parallelism:
- Single expert layers exceed single-GPU VRAM (rare except for giant MoE models)
- You want to minimize time-to-first-token by distributing prefill
- PCIe-connected GPUs where all-to-all expert dispatch is especially costly
vLLM flag: `--tensor-parallel-size N`
Pipeline Parallelism
Pipeline parallelism assigns transformer layers to GPU stages. GPU 0 handles layers 0-15, GPU 1 handles layers 16-31, etc. Tokens flow through stages sequentially.
The downsides: sequential execution creates pipeline bubbles (idle GPUs waiting for previous stages), and latency per token is higher than tensor parallelism. For MoE models, pipeline parallelism also complicates expert routing since experts may span stage boundaries.
Use pipeline parallelism only when memory is the binding constraint and you have more GPUs than you can fit under the expert or tensor parallelism models. Avoid it for interactive inference.
vLLM flag: `--pipeline-parallel-size N`
Decision Matrix
| Scenario | Recommended Strategy | vLLM Flags |
|---|---|---|
| Mixtral 8x7B on 2x GPU | Expert parallelism | --tensor-parallel-size 2 --enable-expert-parallel |
| DeepSeek V3.2 on 8x H200 141GB | Expert parallelism | --tensor-parallel-size 8 --enable-expert-parallel |
| Kimi K2 on 16x H100 | Tensor + Pipeline combined | --tensor-parallel-size 8 --pipeline-parallel-size 2 |
| Mixtral 8x22B on 4x A100 | Tensor + Expert parallelism | --tensor-parallel-size 4 --enable-expert-parallel |
For head-to-head inference framework benchmarks, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
Deploying MoE Models with vLLM: Configuration Deep Dive
Environment Setup
Before launching vLLM, verify your GPU configuration. NVLink topology matters for expert parallelism performance.
```bash
# Verify GPU configuration
nvidia-smi topo -m   # Check NVLink topology
nvidia-smi --query-gpu=name,memory.total --format=csv

# Install vLLM with full dependencies
pip install "vllm>=0.6.0" --upgrade
pip install flash-attn --no-build-isolation  # Optional: faster attention

# For DeepSeek models only
pip install git+https://github.com/deepseek-ai/DeepGEMM
```

The `nvidia-smi topo -m` output tells you whether your GPUs have NVLink connections. Look for "NV" labels in the connection matrix. SXM H100s in the same node will show NVLink; PCIe variants will show PCIe or SYS connections with much lower bandwidth.
Mixtral 8x22B on 4x A100: Recommended Entry Point
Mixtral 8x22B is the best starting point for MoE deployment. At FP8, the weights are approximately 141 GB; with 15% framework overhead that's roughly 162 GB, leaving around 158 GB for the KV cache across 4x A100 PCIe 80GB (320 GB total). A100 PCIe GPUs are available at $1.07/hr on-demand, so four GPUs come to $4.28/hr total.
```bash
vllm serve mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 32 \
  --port 8000
```

Flag breakdown:

- `--tensor-parallel-size 4`: splits the model across 4 GPUs
- `--enable-expert-parallel`: activates expert parallelism for the MoE layers
- `--dtype fp8`: FP8 is required to fit Mixtral 8x22B on 4x A100 80GB. BF16 weights are ~282 GB and with 15% overhead (~324 GB) exceed the 320 GB available. FP8 on A100 falls back to software emulation rather than hardware acceleration (A100 lacks native FP8 Tensor Cores), so throughput will be lower than on H100/H200. If BF16 performance is a requirement, use 5x A100 80GB (400 GB total) instead.
- `--max-model-len 16384`: sets the maximum context length; lower values leave more VRAM for the KV cache pool
- `--gpu-memory-utilization 0.92`: vLLM pre-allocates 92% of free VRAM for the KV cache after weights load
- `--max-num-seqs 32`: maximum concurrent sequences in flight
DeepSeek V3.2 Speciale on 8x H200 141GB
DeepSeek V3.2 Speciale requires at minimum 8x H200 141GB (1,128 GB total VRAM). At FP8, the weights are approximately 685 GB plus activation and routing overhead, which exceeds what 8x H100 80GB (640 GB) can hold. H200 141GB is the minimum viable single-node configuration, leaving around 340 GB headroom for the KV cache.
```bash
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 16 \
  --port 8000
```

Two Hopper-specific additions here:
`--dtype fp8` works on H200 (and H100) because Hopper GPUs have hardware FP8 Tensor Cores. This halves the model weight memory footprint compared to BF16.

`--kv-cache-dtype fp8` stores KV cache tensors in FP8 instead of BF16, roughly halving KV cache VRAM consumption. This flag is also Hopper-specific; do not use it on A100 or other Ampere GPUs.
For DeepSeek specifically, also set VLLM_USE_DEEP_GEMM=1 as an environment variable before launching:
```bash
export VLLM_USE_DEEP_GEMM=1
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 16 \
  --port 8000
```

DeepGEMM provides optimized matrix multiplication kernels for DeepSeek's MoE layers. Without it, vLLM falls back to generic GEMM implementations that are noticeably slower. See the DeepSeek V3.2 deployment guide for the full model download and setup walkthrough.
SGLang Alternative
SGLang is a competing inference server that sometimes outperforms vLLM for MoE models in throughput and token generation speed. The launch command for the same DeepSeek workload:
```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Speciale \
  --tp-size 8 \
  --dp-size 1 \
  --dtype fp8 \
  --context-length 32768 \
  --port 8000
```

SGLang's `--tp-size` maps to vLLM's `--tensor-parallel-size`. The `--dp-size` parameter controls data parallelism across router replicas. SGLang 0.4.x introduced a cache-aware load balancer that optimizes KV cache hit rates across router replicas, improving throughput for chat workloads with shared prefixes. Expert parallelism-specific features for MoE were added in later SGLang versions beyond 0.4.x.
Where SGLang tends to win: workloads with uneven expert routing, high concurrency with short sequences, and benchmarks that emphasize tokens/sec at batch size >1.
Where vLLM tends to win: model compatibility breadth, prefix caching for chat workloads, and operational maturity. If you're deploying something other than DeepSeek or Mixtral, vLLM likely has better support. See vLLM vs TensorRT-LLM vs SGLang benchmarks for head-to-head numbers.
Note: SGLang configuration examples here use SGLang 0.4.x. Earlier versions have different flag names and behavior.
Testing the Deployment
Once the server is running, test with a basic completions request:
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2-Speciale",
    "prompt": "Explain the difference between expert parallelism and tensor parallelism.",
    "max_tokens": 200
  }'
```

For a Python client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2-Speciale",
    messages=[{"role": "user", "content": "What is expert parallelism?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

To benchmark throughput:
```bash
python benchmarks/benchmark_serving.py \
  --model deepseek-ai/DeepSeek-V3.2-Speciale \
  --num-prompts 100 \
  --request-rate 4
```

Then run `nvidia-smi dmon -d 1` in a separate terminal to watch per-GPU utilization. With expert parallelism working correctly, all 8 GPUs should show similar utilization. If one GPU is consistently at 95% while others idle at 30%, expert routing is imbalanced.
Multi-GPU Strategies: 2x, 4x, and 8x Configurations
The right GPU count depends on three factors: model size, context length, and whether you're using SXM or PCIe variants. SXM variants use NVLink (900 GB/s bidirectional bandwidth), which is critical for expert parallelism performance. PCIe variants use PCIe Gen5 (~128 GB/s), which adds significant all-to-all communication latency for expert dispatch.
For MoE inference, always prefer SXM variants when deploying across 4+ GPUs.
Here are the practical GPU configurations for the major MoE models:
| Model | Min Config | Recommended | Optimal | Notes |
|---|---|---|---|---|
| Mixtral 8x7B (47B) | 1x H100 80GB (FP8) | 2x H100 | 2x H100 SXM5 | Fits in 1 GPU at FP8 |
| Mixtral 8x22B (141B) | 3x H100 80GB (FP8) | 4x H100 | 4x H100 SXM5 | NVLink preferred |
| Qwen2-57B-A14B | 1x H100 80GB | 2x H100 | 2x H200 | Easily single-GPU at FP8 |
| Llama 4 Maverick (400B) | 6x H100 80GB (FP8) | 8x H100 | 4x H200 141GB | Expert parallelism critical |
| DeepSeek V3.2 Speciale (685B) | 8x H200 141GB (FP8) | 8x H200 | 8x B200 | DeepGEMM required |
| Kimi K2 (1T) | 16x H100 80GB (FP8) | 16x H200 141GB | 8x B200 SXM6 | Multi-node required for H100/H200; single-node on 8x B200 SXM6 |
The Kimi K2 row deserves special attention. At 1T parameters and FP8, the weights alone require approximately 1 TB of VRAM plus ~15% overhead, totaling roughly 1,150 GB. A single node of 8x H100 80GB gives you 640 GB total, and 8x H200 141GB gives 1,128 GB, both of which fall short. Those configurations require a multi-node Ray cluster. However, 8x B200 SXM6 provides 1,536 GB total VRAM (192 GB per GPU), which is enough to fit Kimi K2 at FP8 on a single node without multi-node setup. For Ray cluster networking and multi-node inference setup when using H100 or H200 configs, see multi-node GPU networking without InfiniBand.
For single-node deployments, the NVLink bandwidth difference between SXM and PCIe is significant enough to affect whether expert parallelism is worth enabling. On PCIe setups, the all-to-all communication overhead for expert dispatch can eat into the throughput gains from parallel expert compute. Run benchmarks before committing to a parallelism strategy on PCIe hardware.
To check per-GPU utilization in real time and verify expert load distribution:
```bash
nvidia-smi dmon -d 1
```

This shows per-GPU utilization, memory usage, and temperature at 1-second intervals. For expert parallelism to work correctly, you want roughly equal utilization across all GPUs. Significant imbalance indicates skewed expert routing.
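The imbalance check is easy to script against `nvidia-smi` output. A sketch (the 40-point spread threshold is our arbitrary choice; tune it for your workload):

```python
import subprocess

def gpu_utilizations():
    """Read per-GPU utilization (%) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return [int(line) for line in out.strip().splitlines()]

def is_imbalanced(utils, max_spread=40):
    """Flag routing imbalance: e.g. one GPU at 95% while another idles at 30%."""
    return max(utils) - min(utils) > max_spread

# The skewed pattern described above trips the check:
print(is_imbalanced([95, 32, 30, 31, 33, 30, 29, 34]))  # True
print(is_imbalanced([88, 90, 85, 91, 87, 89, 92, 86]))  # False
```

On a live node, `is_imbalanced(gpu_utilizations())` polled on a timer makes a cheap alerting hook.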
MoE vs Dense Models: Cost-per-Token on H100 and A100
The argument for MoE economics is that you pay more in VRAM (more GPUs) but get more tokens per second because active compute is lower than a comparably capable dense model. Here's the math on real configurations.
Dense baseline: Llama 3 70B on 1x H100 PCIe
- Hardware cost: $2.01/hr on-demand
- Throughput: ~2,500 tokens/sec (typical vLLM BF16, moderate batch)
- Cost per million output tokens: ($2.01 / 3600) / 2,500 x 1,000,000 = ~$0.223/M tokens
MoE comparison: Mixtral 8x22B on 4x A100 PCIe 80G
- Hardware cost: $1.07/hr x 4 = $4.28/hr on-demand
- Throughput: ~4,000 tokens/sec (39B active params across 4 GPUs)
- Cost per million output tokens: ($4.28 / 3600) / 4,000 x 1,000,000 = ~$0.297/M tokens
These are illustrative calculations. Real throughput depends on batch size, context length distribution, and routing efficiency. With optimal batching, MoE throughput advantages compound because the lower per-token compute allows larger effective batch sizes before GPU saturation.
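The cost math above is one line of arithmetic, shown here as a helper you can rerun against current pricing and your own measured throughput:

```python
def cost_per_million_tokens(hourly_rate_usd, tokens_per_sec):
    """$/M tokens = (hourly rate / 3600 s) / throughput x 1e6."""
    return hourly_rate_usd / 3600 / tokens_per_sec * 1e6

# Dense baseline: Llama 3 70B on 1x H100 PCIe at $2.01/hr, ~2,500 tok/s
print(round(cost_per_million_tokens(2.01, 2500), 3))  # 0.223
# MoE: Mixtral 8x22B on 4x A100 PCIe at $4.28/hr, ~4,000 tok/s
print(round(cost_per_million_tokens(4.28, 4000), 3))  # 0.297
```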
Full comparison table:
| Setup | GPUs | $/hr (on-demand) | $/hr (spot) | Est. Throughput | $/M tokens (best available) |
|---|---|---|---|---|---|
| Llama 3 70B dense | 1x H100 PCIe | $2.01 | N/A | ~2,500 tok/s | ~$0.223 (on-demand) |
| Mixtral 8x22B MoE | 4x A100 PCIe 80G | $4.28 | $4.56 | ~4,000 tok/s | ~$0.297 (on-demand) |
| DeepSeek V3.2 MoE | 8x B200 SXM6 | $59.44 | $13.36 | ~8,000 tok/s | ~$0.464 (spot) |
Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing for live rates.
The DeepSeek row shows that very large MoE models cost more per token than smaller dense models, even on spot pricing. The value proposition for DeepSeek V3.2 Speciale is quality per token, not cost per token. You're paying for frontier-model output quality, and the MoE architecture makes that output achievable at ~$0.46/M tokens (B200 SXM6 spot) instead of the $1+/M that a comparable dense model would cost.
Spheron GPU Pricing for MoE Workloads: Spot Instance Strategy
Current Spheron pricing for the GPUs most commonly used for MoE inference:
| GPU | On-Demand (lowest) | Spot (lowest) | Spot Savings | Best For |
|---|---|---|---|---|
| H100 PCIe | $2.01/hr | N/A | - | Mixtral, Llama 4, smaller MoE |
| A100 80G PCIe | $1.07/hr | $1.14/hr | -6.5% (spot costlier) | Mixtral 8x22B, cost-optimized |
| B200 SXM6 | $7.43/hr | $1.67/hr | ~78% | DeepSeek V3.2, peak throughput workloads |
Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Note: A100 80G PCIe spot pricing is currently slightly above on-demand, which is atypical. This reflects current market demand. Check live pricing before choosing spot over on-demand for A100 workloads.
Spot instance strategy for MoE workloads:
Use spot for: batch inference jobs (document processing, offline analysis, dataset annotation), dev and staging environments, evaluation pipelines, and any workload that can checkpoint and restart. See the Spheron docs for configuration options and interruption handling.
Use on-demand for: production interactive APIs where guaranteed availability matters, multi-day training jobs where interruption cost is high, and latency-sensitive applications where cold restarts are unacceptable.
Cost math for a batch job: 8x B200 SXM6 running 10 hours per day:
- On-demand: $7.43 x 8 x 10 = $594.40/day
- Spot: $1.67 x 8 x 10 = $133.60/day
- Savings: $460.80/day, ~$13,824/month
For more on structuring GPU spend across workload types, see serverless GPU vs on-demand vs reserved and the GPU cost optimization playbook.
Production Checklist: Load Balancing, Autoscaling, and Monitoring
Load Balancing Multiple MoE Instances
For high-traffic production deployments, run multiple vLLM instances behind a load balancer. MoE models have an advantage over LoRA setups here: there are no per-request adapter weights, so any instance can handle any request. Session affinity is not required.
Basic nginx upstream configuration:
```nginx
upstream vllm_moe {
    least_conn;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
    keepalive 32;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://vllm_moe;
        proxy_read_timeout 300s;
        proxy_set_header Connection "";
        proxy_http_version 1.1;
    }
}
```

Use `least_conn` load balancing so nginx routes new requests to the instance with the fewest active connections. This roughly balances throughput across instances without needing GPU utilization metrics in the proxy layer.
Key Monitoring Metrics for MoE Inference
| Metric | Tool | Healthy Range | Alert Threshold |
|---|---|---|---|
| GPU utilization per GPU | nvidia-smi | 80-95% | <60% or >98% |
| Expert load balance (per-expert token count) | vLLM Prometheus | Uniform +/-20% | >50% skew |
| All-to-all communication time | NCCL profiler | <5ms per forward pass | >20ms |
| Tokens/sec per GPU | vLLM /metrics endpoint | Model-dependent | -20% from baseline |
| VRAM used | nvidia-smi | 85-92% | >95% |
| KV cache hit rate | vLLM /metrics | >40% for chat | <20% |
The expert load balance metric is specific to MoE. If one expert is handling >40% of tokens while others handle <5%, you have expert collapse. This causes GPU utilization imbalance when using expert parallelism: the GPU holding the overloaded expert runs hot while others idle. Monitor this with vLLM's Prometheus endpoint at /metrics.
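A quick way to quantify collapse from per-expert token counts (scraped from the /metrics endpoint) is to compute the busiest expert's share of total traffic. This helper is an illustrative sketch, not part of vLLM:

```python
def expert_load_skew(token_counts):
    """Fraction of tokens handled by the busiest expert.

    Uniform routing over N experts gives ~1/N; a value above 0.4
    (one expert handling >40% of traffic) indicates expert collapse.
    """
    return max(token_counts) / sum(token_counts)

healthy   = [120, 98, 105, 110, 95, 102, 99, 101]  # roughly uniform, 8 experts
collapsed = [700, 20, 25, 30, 15, 10, 20, 10]      # one expert dominates
print(round(expert_load_skew(healthy), 2))    # 0.14
print(round(expert_load_skew(collapsed), 2))  # 0.84
```

Alert when the skew crosses your threshold for several consecutive scrape intervals rather than on a single sample, since routing is bursty at small batch sizes.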
For a full monitoring setup guide, see GPU monitoring for ML.
Autoscaling Triggers
Scale up when:
- Request queue depth exceeds 10
- p95 latency exceeds 2 seconds
- GPU utilization sustained above 90% for 5+ minutes
Scale down when:
- GPU utilization below 30% for 15 consecutive minutes
- Queue depth at 0 for 10 consecutive minutes
Pre-warm strategy: keep one instance always running to absorb sudden traffic spikes. MoE model load times are significant (DeepSeek V3.2 Speciale takes 8-12 minutes to load on 8x H200 141GB), so cold-starting an instance on demand is not viable for latency-sensitive traffic.
Production Deployment Checklist
- Verify NVLink topology with `nvidia-smi topo -m` before launch
- Test model load time: DeepSeek 685B FP8 takes 8-12 minutes to load on 8x H200 141GB
- Set `--gpu-memory-utilization 0.90`, not 0.95+, to leave headroom for routing overhead
- Enable `--enable-prefix-caching` for chat workloads to reduce KV cache pressure (note: incompatible with `--kv-cache-dtype fp8`; use one or the other)
- Monitor per-expert token routing to detect expert collapse (one expert handles >40% of tokens)
- Set hard `--max-model-len` limits: MoE KV cache at long context can OOM unexpectedly
- Run `python benchmarks/benchmark_serving.py --model <your-model>` to establish a throughput baseline before production traffic
- Use `--kv-cache-dtype fp8` on H100/H200 for 2x KV cache capacity (note: incompatible with `--enable-prefix-caching`; use one or the other)
For speculative decoding techniques, see the speculative decoding production guide.
MoE models are the practical way to access frontier-model quality without the compute bill of a dense 400B+ model. The right GPU setup - enough VRAM for all experts, fast enough interconnect for routing - is what makes or breaks MoE inference economics. Spheron's H100 and A100 spot instances give you the multi-GPU clusters you need, with per-second billing so you only pay for active inference time.
Rent H100 → | Rent A100 → | View all GPU pricing → | Get started on Spheron →
