Engineering

MoE Model Inference on GPU Cloud: Expert Parallelism, Memory, and Cost (2026)

Written by Mitrasish, Co-founder · Apr 2, 2026

Mixture of Experts is the architecture behind virtually all major frontier open models released in 2026. DeepSeek V3.2 Speciale (685B total, 37B active), Llama 4 Maverick (400B total, 17B active), Kimi K2 (1T total, ~32B active), and Mixtral 8x22B (141B total, 39B active) all use MoE to deliver dense-model quality at a fraction of the compute cost per token. The GPU challenge is specific: you need enough VRAM to hold all expert weights in memory, but only a small subset activate on each forward pass. Getting the memory math right and choosing the correct parallelism strategy is what separates a working deployment from one that OOMs or runs at 10% efficiency. For general VRAM planning, see the GPU memory requirements guide. For production vLLM setup beyond MoE specifics, see vLLM production deployment.

Why MoE Models Dominate in 2026

Dense models activate every parameter on every token. A 70B dense model runs 70B parameter computations per forward pass. MoE models route tokens through a small subset of experts, so a 685B MoE model with 37B active parameters only runs 37B parameter computations per forward pass, roughly the same compute as a 37B dense model but with the representational capacity of 685B parameters.

The routing mechanism works like this: each MoE layer contains N expert FFN networks. A router network scores each token against all experts and selects the top-K highest-scoring experts to process that token. For DeepSeek V3.2 Speciale, that's top-8 from 256 experts. For Llama 4 Maverick, it's top-1 from 128 experts.
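The selection step can be sketched in a few lines. This is an illustrative top-K router for a single token, not the DeepSeek or Llama implementation; the `scores` list stands in for the router network's logits, and weights are softmax-normalized over only the selected experts, a common MoE convention.

```python
import math
import random

def route_token(scores: list[float], top_k: int) -> tuple[list[int], list[float]]:
    """Pick the top-K experts for one token; return (expert_ids, weights).

    scores: one router logit per expert. Weights are softmax-normalized
    over the selected experts only.
    """
    ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    picked = [scores[i] for i in ids]
    m = max(picked)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in picked]
    total = sum(exps)
    return ids, [e / total for e in exps]

# DeepSeek-style shape: top-8 out of 256 experts
scores = [random.gauss(0, 1) for _ in range(256)]
experts, weights = route_token(scores, top_k=8)
assert len(experts) == 8 and abs(sum(weights) - 1.0) < 1e-9
```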

Here are the five major MoE models worth knowing for 2026 deployments:

| Model | Total Params | Active Params | Experts | Top-K | Architecture |
|---|---|---|---|---|---|
| DeepSeek V3.2 Speciale | 685B | 37B | 256 | 8 | DSA + MoE |
| Llama 4 Maverick | 400B | 17B | 128 | 1 | MoE |
| Kimi K2 | 1T | ~32B | 384 | 8 | MoE |
| Mixtral 8x22B | 141B | 39B | 8 | 2 | MoE |
| Qwen2-57B-A14B | 57B | 14B | 64 | 8 | MoE |

For more on Kimi K2 deployment specifics, see the Kimi K2 deployment guide. For Llama 4 Scout and Maverick setup, see Llama 4 on GPU cloud.

The practical consequence is that MoE models are efficient to run but expensive to fit. You're paying for VRAM to hold all experts even though most of them sit idle for any given token. This changes the GPU selection math entirely.

GPU Memory Math for MoE: What Actually Fits in VRAM

This is the part most deployment guides get wrong. Active parameter count has nothing to do with how much GPU memory you need. Every expert's weights must be resident in VRAM at all times so the router can dispatch tokens to any of them.

The formula for GPU memory planning:

total_vram = (total_params x bytes_per_dtype x 1.15) + kv_cache_budget

Bytes per data type:

| Dtype | Bytes per Parameter |
|---|---|
| BF16 | 2 |
| FP8 | 1 |
| INT8 | 1 |
| INT4 | 0.5 |

The 1.15 multiplier covers activation memory, framework overhead, and routing buffers. KV cache budget depends on your target context length and batch size. For a detailed KV cache calculation method, see the KV cache optimization guide.
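Putting the formula and the dtype table together gives a minimal sizing helper. This is an illustrative sketch of the arithmetic above, nothing more; one billion parameters at 1 byte each is 1 GB, so billions-of-params times bytes-per-param yields weight size in GB directly.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def moe_vram_gb(total_params_billions: float, dtype: str,
                kv_cache_gb: float = 0.0) -> float:
    """total_vram = (total_params x bytes_per_dtype x 1.15) + kv_cache_budget."""
    weights_gb = total_params_billions * BYTES_PER_PARAM[dtype]
    return weights_gb * 1.15 + kv_cache_gb

# DeepSeek V3.2 Speciale at FP8: ~685 GB of weights, ~788 GB with overhead
print(round(moe_vram_gb(685, "fp8")))  # 788
```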

Here's what the major MoE models need across quantization levels:

| Model | BF16 Weights | FP8 Weights | INT4 Weights | Min GPU Config (FP8) |
|---|---|---|---|---|
| DeepSeek V3.2 Speciale (685B) | ~1,370 GB | ~685 GB | ~343 GB | 8x H200 141GB |
| Kimi K2 (1T) | ~2,000 GB | ~1,000 GB | ~500 GB | 16x H100 80GB |
| Llama 4 Maverick (400B) | ~800 GB | ~400 GB | ~200 GB | 6x H100 80GB |
| Mixtral 8x22B (141B) | ~282 GB | ~141 GB | ~70 GB | 3x H100 80GB |
| Qwen2-57B-A14B (57B) | ~114 GB | ~57 GB | ~28 GB | 1x H100 80GB |

Three constraints stand out from these figures:

First, Kimi K2 at 1T parameters breaks the single-node boundary on H100 and H200 hardware even at FP8. The FP8 weights plus overhead come to roughly 1,150 GB. Sixteen H100 80GB GPUs give 1,280 GB total, enough to hold the weights, but sixteen GPUs span two nodes; a single node of 8x H200 141GB gives only 1,128 GB, which falls short. Either way, H100 and H200 deployments require a multi-node Ray cluster. The exception is 8x B200 SXM6, which provides 1,536 GB total and can fit Kimi K2 FP8 on a single node. More on that in the multi-GPU section.

Second, the "min GPU config" figures assume you have almost no KV cache headroom. For production use with real context lengths and concurrent requests, add 20-40% more VRAM budget. The minimum configs work for smoke tests and single-request validation, not high-throughput serving.

Third, KV cache for MoE models is the same size as for a dense model with the same active architecture. DeepSeek V3.2 Speciale's KV cache is sized by its 37B active parameter architecture, not its 685B total. This is one place where the MoE efficiency advantage shows up directly in memory budgeting.

Expert Parallelism vs Tensor Parallelism: Choosing the Right Strategy

Dense models only need tensor parallelism or pipeline parallelism. MoE models add a third option: expert parallelism. Choosing between them has a direct effect on throughput and GPU utilization.

Expert Parallelism

Expert parallelism assigns entire experts to specific GPUs. GPU 0 holds experts 0-31, GPU 1 holds experts 32-63, and so on. When the router selects an expert, the token is dispatched to the GPU holding that expert via all-to-all communication.

The advantage: each GPU does focused, cache-friendly compute on its own experts. The communication pattern is all-to-all at the routing step, but individual GPU compute is efficient.
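With the contiguous block placement described above (experts 0-31 on GPU 0, and so on), mapping an expert id to its GPU is simple integer arithmetic. An illustrative sketch, not vLLM's actual placement code:

```python
def expert_to_gpu(expert_id: int, num_experts: int, num_gpus: int) -> int:
    """Contiguous block placement: each GPU owns num_experts // num_gpus experts.

    Assumes num_experts divides evenly across num_gpus.
    """
    experts_per_gpu = num_experts // num_gpus
    return expert_id // experts_per_gpu

# 256 experts across 8 GPUs: 32 experts per GPU
assert expert_to_gpu(0, 256, 8) == 0     # experts 0-31 live on GPU 0
assert expert_to_gpu(32, 256, 8) == 1    # experts 32-63 live on GPU 1
assert expert_to_gpu(255, 256, 8) == 7
```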

Best conditions for expert parallelism:

  • GPU count is 2-8 (matches the expert distribution)
  • High-bandwidth NVLink interconnect (SXM variants)
  • Individual expert weight matrices fit comfortably in one GPU's VRAM
  • High-traffic workloads where routing overhead amortizes well

vLLM flag: --enable-expert-parallel

Tensor Parallelism

Tensor parallelism splits each layer's weight matrices across GPUs column-wise or row-wise. All GPUs participate in every layer computation, synchronizing via all-reduce at each layer boundary.

For MoE models, tensor parallelism applies to both attention layers and expert FFN layers. It's a good choice when individual expert weights are too large for a single GPU's VRAM.

Best conditions for tensor parallelism:

  • Single expert layers exceed single-GPU VRAM (rare except for giant MoE models)
  • You want to minimize time-to-first-token by distributing prefill
  • PCIe-connected GPUs where all-to-all expert dispatch is especially costly

vLLM flag: --tensor-parallel-size N

Pipeline Parallelism

Pipeline parallelism assigns transformer layers to GPU stages. GPU 0 handles layers 0-15, GPU 1 handles layers 16-31, etc. Tokens flow through stages sequentially.

The downsides: sequential execution creates pipeline bubbles (idle GPUs waiting for previous stages), and latency per token is higher than tensor parallelism. For MoE models, pipeline parallelism also complicates expert routing since experts may span stage boundaries.

Use pipeline parallelism only when memory is the binding constraint and you have more GPUs than you can fit under the expert or tensor parallelism models. Avoid it for interactive inference.
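The bubble cost can be estimated with the standard GPipe-style approximation: with p pipeline stages and m microbatches in flight, the idle fraction is roughly (p - 1) / (m + p - 1). A quick sketch of that rule of thumb:

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """GPipe-style idle-time estimate: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# 2 stages hide bubbles reasonably well; 8 stages do not
print(round(pipeline_bubble_fraction(2, 16), 3))  # 0.059
print(round(pipeline_bubble_fraction(8, 16), 3))  # 0.304
```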

vLLM flag: --pipeline-parallel-size N

Decision Matrix

| Scenario | Recommended Strategy | vLLM Flags |
|---|---|---|
| Mixtral 8x7B on 2x GPU | Expert parallelism | `--tensor-parallel-size 2 --enable-expert-parallel` |
| DeepSeek V3.2 on 8x H200 141GB | Expert parallelism | `--tensor-parallel-size 8 --enable-expert-parallel` |
| Kimi K2 on 16x H100 | Tensor + Pipeline combined | `--tensor-parallel-size 8 --pipeline-parallel-size 2` |
| Mixtral 8x22B on 4x A100 | Tensor + Expert parallelism | `--tensor-parallel-size 4 --enable-expert-parallel` |

For head-to-head inference framework benchmarks, see vLLM vs TensorRT-LLM vs SGLang benchmarks.

Deploying MoE Models with vLLM: Configuration Deep Dive

Environment Setup

Before launching vLLM, verify your GPU configuration. NVLink topology matters for expert parallelism performance.

```bash
# Verify GPU configuration
nvidia-smi topo -m  # Check NVLink topology
nvidia-smi --query-gpu=name,memory.total --format=csv

# Install vLLM with full dependencies
pip install "vllm>=0.6.0" --upgrade
pip install flash-attn --no-build-isolation  # Optional: faster attention

# For DeepSeek models only
pip install git+https://github.com/deepseek-ai/DeepGEMM
```

The nvidia-smi topo -m output tells you whether your GPUs have NVLink connections. Look for "NV" labels in the connection matrix. SXM H100s in the same node will show NVLink; PCIe variants will show PCIe or SYS connections with much lower bandwidth.

Mixtral 8x22B on 4x A100: Recommended Entry Point

Mixtral 8x22B is the best starting point for MoE deployment. At FP8, the weights are approximately 141 GB; with 15% framework overhead that's roughly 162 GB, leaving around 158 GB for the KV cache across 4x A100 PCIe 80GB (320 GB total). A100 PCIe GPUs are available at $1.07/hr on-demand. Four GPUs at $1.07/hr runs you $4.28/hr total.

```bash
vllm serve mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 32 \
  --port 8000
```

Flag breakdown:

  • --tensor-parallel-size 4: splits model across 4 GPUs
  • --enable-expert-parallel: activates expert parallelism for the MoE layers
  • --dtype fp8: FP8 is required to fit Mixtral 8x22B on 4x A100 80GB. BF16 weights are ~282 GB and with 15% overhead (~324 GB) exceed the 320 GB available. FP8 on A100 falls back to software emulation rather than hardware acceleration (A100 lacks native FP8 Tensor Cores), so throughput will be lower than on H100/H200. If BF16 performance is a requirement, use 5x A100 80GB (400 GB total) instead.
  • --max-model-len 16384: sets the maximum context length; lower values leave more VRAM for the KV cache pool
  • --gpu-memory-utilization 0.92: vLLM pre-allocates 92% of free VRAM for the KV cache after weights load
  • --max-num-seqs 32: maximum concurrent sequences in flight

DeepSeek V3.2 Speciale on 8x H200 141GB

DeepSeek V3.2 Speciale requires at minimum 8x H200 141GB (1,128 GB total VRAM). At FP8, the weights are approximately 685 GB plus activation and routing overhead, which exceeds what 8x H100 80GB (640 GB) can hold. H200 141GB is the minimum viable single-node configuration, leaving around 340 GB headroom for the KV cache.

```bash
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 16 \
  --port 8000
```

Two Hopper-specific additions here:

--dtype fp8 works on H200 (and H100) because Hopper GPUs have hardware FP8 Tensor Cores. This halves the model weight memory footprint compared to BF16.

--kv-cache-dtype fp8 stores KV cache tensors in FP8 instead of BF16, roughly halving KV cache VRAM consumption. This flag is also Hopper-specific; do not use it on A100 or Ampere GPUs.

For DeepSeek specifically, also set VLLM_USE_DEEP_GEMM=1 as an environment variable before launching:

```bash
export VLLM_USE_DEEP_GEMM=1
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 16 \
  --port 8000
```

DeepGEMM provides optimized matrix multiplication kernels for DeepSeek's MoE layers. Without it, vLLM falls back to generic GEMM implementations that are noticeably slower. See the DeepSeek V3.2 deployment guide for the full model download and setup walkthrough.

SGLang Alternative

SGLang is a competing inference server that sometimes outperforms vLLM for MoE models in throughput and token generation speed. The launch command for the same DeepSeek workload:

```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Speciale \
  --tp-size 8 \
  --dp-size 1 \
  --dtype fp8 \
  --context-length 32768 \
  --port 8000
```

SGLang's --tp-size maps to vLLM's --tensor-parallel-size. The --dp-size parameter controls data parallelism across router replicas. SGLang 0.4.x introduced a cache-aware load balancer that optimizes KV cache hit rates across router replicas, improving throughput for chat workloads with shared prefixes. Expert parallelism-specific features for MoE were added in later SGLang versions beyond 0.4.x.

Where SGLang tends to win: workloads with uneven expert routing, high concurrency with short sequences, and benchmarks that emphasize tokens/sec at batch size >1.

Where vLLM tends to win: model compatibility breadth, prefix caching for chat workloads, and operational maturity. If you're deploying something other than DeepSeek or Mixtral, vLLM likely has better support. See vLLM vs TensorRT-LLM vs SGLang benchmarks for head-to-head numbers.

Note: SGLang configuration examples here use SGLang 0.4.x. Earlier versions have different flag names and behavior.

Testing the Deployment

Once the server is running, test with a basic completions request:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2-Speciale",
    "prompt": "Explain the difference between expert parallelism and tensor parallelism.",
    "max_tokens": 200
  }'
```

For a Python client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2-Speciale",
    messages=[{"role": "user", "content": "What is expert parallelism?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

To benchmark throughput:

```bash
python benchmarks/benchmark_serving.py \
  --model deepseek-ai/DeepSeek-V3.2-Speciale \
  --num-prompts 100 \
  --request-rate 4
```

Then run nvidia-smi dmon -d 1 in a separate terminal to watch per-GPU utilization. With expert parallelism working correctly, all 8 GPUs should show similar utilization. If one GPU is consistently at 95% while others idle at 30%, expert routing is imbalanced.

Multi-GPU Strategies: 2x, 4x, and 8x Configurations

The right GPU count depends on three factors: model size, context length, and whether you're using SXM or PCIe variants. SXM variants use NVLink (900 GB/s bidirectional bandwidth), which is critical for expert parallelism performance. PCIe variants use PCIe Gen5 (~128 GB/s), which adds significant all-to-all communication latency for expert dispatch.

For MoE inference, always prefer SXM variants when deploying across 4+ GPUs.

Here are the practical GPU configurations for the major MoE models:

| Model | Min Config | Recommended | Optimal | Notes |
|---|---|---|---|---|
| Mixtral 8x7B (47B) | 1x H100 80GB (FP8) | 2x H100 | 2x H100 SXM5 | Fits in 1 GPU at FP8 |
| Mixtral 8x22B (141B) | 3x H100 80GB (FP8) | 4x H100 | 4x H100 SXM5 | NVLink preferred |
| Qwen2-57B-A14B | 1x H100 80GB | 2x H100 | 2x H200 | Easily single-GPU at FP8 |
| Llama 4 Maverick (400B) | 6x H100 80GB (FP8) | 8x H100 | 4x H200 141GB | Expert parallelism critical |
| DeepSeek V3.2 Speciale (685B) | 8x H200 141GB (FP8) | 8x H200 | 8x B200 | DeepGEMM required |
| Kimi K2 (1T) | 16x H100 80GB (FP8) | 16x H200 141GB | 8x B200 SXM6 | Multi-node required for H100/H200; single-node on 8x B200 SXM6 |

The Kimi K2 row deserves special attention. At 1T parameters and FP8, the weights alone require approximately 1 TB of VRAM plus ~15% overhead, totaling roughly 1,150 GB. A single node of 8x H100 80GB gives you 640 GB total, and 8x H200 141GB gives 1,128 GB, both of which fall short. Those configurations require a multi-node Ray cluster. However, 8x B200 SXM6 provides 1,536 GB total VRAM (192 GB per GPU), which is enough to fit Kimi K2 at FP8 on a single node without multi-node setup. For Ray cluster networking and multi-node inference setup when using H100 or H200 configs, see multi-node GPU networking without InfiniBand.

For single-node deployments, the NVLink bandwidth difference between SXM and PCIe is significant enough to affect whether expert parallelism is worth enabling. On PCIe setups, the all-to-all communication overhead for expert dispatch can eat into the throughput gains from parallel expert compute. Run benchmarks before committing to a parallelism strategy on PCIe hardware.

To check per-GPU utilization in real time and verify expert load distribution:

```bash
nvidia-smi dmon -d 1
```

This shows per-GPU utilization, memory usage, and temperature at 1-second intervals. For expert parallelism to work correctly, you want roughly equal utilization across all GPUs. Significant imbalance indicates skewed expert routing.

MoE vs Dense Models: Cost-per-Token on H100 and A100

The argument for MoE economics is that you pay more in VRAM (more GPUs) but get more tokens per second because active compute is lower than a comparably capable dense model. Here's the math on real configurations.

Dense baseline: Llama 3 70B on 1x H100 PCIe

  • Hardware cost: $2.01/hr on-demand
  • Throughput: ~2,500 tokens/sec (typical vLLM BF16, moderate batch)
  • Cost per million output tokens: ($2.01 / 3600) / 2,500 x 1,000,000 = ~$0.223/M tokens

MoE comparison: Mixtral 8x22B on 4x A100 PCIe 80G

  • Hardware cost: $1.07/hr x 4 = $4.28/hr on-demand
  • Throughput: ~4,000 tokens/sec (39B active params across 4 GPUs)
  • Cost per million output tokens: ($4.28 / 3600) / 4,000 x 1,000,000 = ~$0.297/M tokens

These are illustrative calculations. Real throughput depends on batch size, context length distribution, and routing efficiency. With optimal batching, MoE throughput advantages compound because the lower per-token compute allows larger effective batch sizes before GPU saturation.
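The per-token arithmetic above generalizes to a one-liner. Illustrative only; the throughput figures are the estimates from this section, not guarantees:

```python
def cost_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    """($/hr / 3600 s) / (tok/s) x 1,000,000 tokens."""
    return usd_per_hour / 3600.0 / tokens_per_second * 1_000_000

print(round(cost_per_million_tokens(2.01, 2500), 3))  # 0.223  (Llama 3 70B dense)
print(round(cost_per_million_tokens(4.28, 4000), 3))  # 0.297  (Mixtral 8x22B MoE)
```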

Full comparison table:

| Setup | GPUs | $/hr (on-demand) | $/hr (spot) | Est. Throughput | $/M tokens (best available) |
|---|---|---|---|---|---|
| Llama 3 70B dense | 1x H100 PCIe | $2.01 | N/A | ~2,500 tok/s | ~$0.223 (on-demand) |
| Mixtral 8x22B MoE | 4x A100 PCIe 80G | $4.28 | $4.56 | ~4,000 tok/s | ~$0.297 (on-demand) |
| DeepSeek V3.2 MoE | 8x B200 SXM6 | $59.44 | $13.36 | ~8,000 tok/s | ~$0.464 (spot) |

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing for live rates.

The DeepSeek row shows that very large MoE models cost more per token than smaller dense models, even on spot pricing. The value proposition for DeepSeek V3.2 Speciale is quality per token, not cost per token. You're paying for frontier-model output quality, and the MoE architecture makes that output achievable at ~$0.46/M tokens (B200 SXM6 spot) instead of the $1+/M that a comparable dense model would cost.

Spheron GPU Pricing for MoE Workloads: Spot Instance Strategy

Current Spheron pricing for the GPUs most commonly used for MoE inference:

| GPU | On-Demand (lowest) | Spot (lowest) | Spot Savings | Best For |
|---|---|---|---|---|
| H100 PCIe | $2.01/hr | N/A | - | Mixtral, Llama 4, smaller MoE |
| A100 80G PCIe | $1.07/hr | $1.14/hr | -6.5% (spot costlier) | Mixtral 8x22B, cost-optimized |
| B200 SXM6 | $7.43/hr | $1.67/hr | ~78% | DeepSeek V3.2, peak throughput workloads |

Pricing fluctuates based on GPU availability. The prices above are based on 02 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Note: A100 80G PCIe spot pricing is currently slightly above on-demand, which is atypical. This reflects current market demand. Check live pricing before choosing spot over on-demand for A100 workloads.

Spot instance strategy for MoE workloads:

Use spot for: batch inference jobs (document processing, offline analysis, dataset annotation), dev and staging environments, evaluation pipelines, and any workload that can checkpoint and restart. See the Spheron docs for configuration options and interruption handling.

Use on-demand for: production interactive APIs where guaranteed availability matters, multi-day training jobs where interruption cost is high, and latency-sensitive applications where cold restarts are unacceptable.

Cost math for a batch job: 8x B200 SXM6 running 10 hours per day:

  • On-demand: $7.43 x 8 x 10 = $594.40/day
  • Spot: $1.67 x 8 x 10 = $133.60/day
  • Savings: $460.80/day, ~$13,824/month

For more on structuring GPU spend across workload types, see serverless GPU vs on-demand vs reserved and the GPU cost optimization playbook.

Production Checklist: Load Balancing, Autoscaling, and Monitoring

Load Balancing Multiple MoE Instances

For high-traffic production deployments, run multiple vLLM instances behind a load balancer. MoE models have an advantage over LoRA setups here: there are no per-request adapter weights, so any instance can handle any request. Session affinity is not required.

Basic nginx upstream configuration:

```nginx
upstream vllm_moe {
    least_conn;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
    keepalive 32;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://vllm_moe;
        proxy_read_timeout 300s;
        proxy_set_header Connection "";
        proxy_http_version 1.1;
    }
}
```

Use least_conn load balancing so nginx routes new requests to the instance with the fewest active connections. This roughly balances throughput across instances without needing GPU utilization metrics in the proxy layer.

Key Monitoring Metrics for MoE Inference

| Metric | Tool | Healthy Range | Alert Threshold |
|---|---|---|---|
| GPU utilization per GPU | nvidia-smi | 80-95% | <60% or >98% |
| Expert load balance (per-expert token count) | vLLM Prometheus | Uniform +/-20% | >50% skew |
| All-to-all communication time | NCCL profiler | <5ms per forward pass | >20ms |
| Tokens/sec per GPU | vLLM /metrics endpoint | Model-dependent | -20% from baseline |
| VRAM used | nvidia-smi | 85-92% | >95% |
| KV cache hit rate | vLLM /metrics | >40% for chat | <20% |

The expert load balance metric is specific to MoE. If one expert is handling >40% of tokens while others handle <5%, you have expert collapse. This causes GPU utilization imbalance when using expert parallelism: the GPU holding the overloaded expert runs hot while others idle. Monitor this with vLLM's Prometheus endpoint at /metrics.
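Detecting collapse from per-expert token counts is straightforward once you scrape them. A hedged sketch using made-up counts, not real vLLM metric names:

```python
def busiest_expert_share(per_expert_tokens: list[int]) -> float:
    """Fraction of all routed tokens handled by the single busiest expert."""
    total = sum(per_expert_tokens)
    return max(per_expert_tokens) / total if total else 0.0

COLLAPSE_THRESHOLD = 0.40  # the >40% rule of thumb from above

balanced = [500, 480, 510, 490, 505, 495, 502, 498]   # each expert ~12.5%
collapsed = [4100, 120, 130, 110, 140, 125, 135, 140]  # one expert takes 82%
assert busiest_expert_share(balanced) < COLLAPSE_THRESHOLD
assert busiest_expert_share(collapsed) > COLLAPSE_THRESHOLD
```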

For a full monitoring setup guide, see GPU monitoring for ML.

Autoscaling Triggers

Scale up when:

  • Request queue depth exceeds 10
  • p95 latency exceeds 2 seconds
  • GPU utilization sustained above 90% for 5+ minutes

Scale down when:

  • GPU utilization below 30% for 15 consecutive minutes
  • Queue depth at 0 for 10 consecutive minutes
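The triggers above can be encoded directly. This is a sketch of the decision logic only; wiring it to real queue-depth and utilization metrics is deployment-specific, and the thresholds are the rules of thumb from this section:

```python
def should_scale_up(queue_depth: int, p95_latency_s: float,
                    util_5min_avg: float) -> bool:
    """Any one trigger fires a scale-up."""
    return queue_depth > 10 or p95_latency_s > 2.0 or util_5min_avg > 0.90

def should_scale_down(util_15min_avg: float, minutes_queue_empty: int) -> bool:
    """Either sustained-idle condition fires a scale-down."""
    return util_15min_avg < 0.30 or minutes_queue_empty >= 10

assert should_scale_up(12, 0.8, 0.50)       # queue backed up
assert not should_scale_up(2, 0.8, 0.50)    # all metrics healthy
assert should_scale_down(0.20, 0)           # 15 min of low utilization
```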

Pre-warm strategy: keep one instance always running to absorb sudden traffic spikes. MoE model load times are significant (DeepSeek V3.2 Speciale takes 8-12 minutes to load on 8x H200 141GB), so cold-starting an instance on demand is not viable for latency-sensitive traffic.

Production Deployment Checklist

  1. Verify NVLink topology with nvidia-smi topo -m before launch
  2. Test model load time: DeepSeek 685B FP8 takes 8-12 minutes to load on 8x H200 141GB
  3. Set --gpu-memory-utilization 0.90 not 0.95+ to leave headroom for routing overhead
  4. Enable --enable-prefix-caching for chat workloads to reduce KV cache pressure (note: incompatible with --kv-cache-dtype fp8 — use one or the other)
  5. Monitor per-expert token routing to detect expert collapse (one expert handles >40% of tokens)
  6. Set hard --max-model-len limits: MoE KV cache at long context can OOM unexpectedly
  7. Run python benchmarks/benchmark_serving.py --model <your-model> to establish throughput baseline before production traffic
  8. Use --kv-cache-dtype fp8 on H100/H200 for 2x KV cache capacity (note: incompatible with --enable-prefix-caching — use one or the other)

For speculative decoding techniques, see the speculative decoding production guide.


MoE models are the practical way to access frontier-model quality without the compute bill of a dense 400B+ model. The right GPU setup (enough VRAM for all experts, a fast enough interconnect for routing) is what makes or breaks MoE inference economics. Spheron's H100 and A100 instances give you the multi-GPU clusters you need, with per-second billing so you only pay for active inference time.

Rent H100 → | Rent A100 → | View all GPU pricing → | Get started on Spheron →
