Fine-Tune MoE LLMs on GPU Cloud: Expert Parallelism, Load Balancing, and MoE-LoRA for DeepSeek V4, Qwen3, and GLM-5.2 (2026 Guide)

Every major 2026 frontier open-weight model is a sparse Mixture of Experts. DeepSeek V4 (1T total, 37B active), Qwen3-235B-A22B (235B total, 22B active), GLM-5.2 (744B total, 40B active), Mistral Large 3, Kimi K2: they all use MoE routing to deliver frontier quality without proportional compute cost per token. But fine-tuning them is not the same as fine-tuning a dense model. Naively applying standard SFT to a 685B+ MoE without expert parallelism hits OOM. Applying vanilla LoRA without accounting for the router causes dead-expert collapse or load imbalance after a few hundred steps.

For an introduction to MoE inference patterns and VRAM sizing at serving time, start with the MoE inference guide. For the dense-model SFT baseline before tackling MoE-specific challenges, see how to fine-tune LLMs in 2026. This post covers what changes when the model you are fine-tuning is sparse.

Why MoE Fine-Tuning Differs from Dense Model Training

The MoE training challenge has four components that do not exist in dense model fine-tuning.

Sparse activation with full-weight residency. The router picks top-K experts per token, but all expert weights must stay in GPU VRAM during training. This is the same constraint as inference, but training adds gradient buffers and optimizer states on top of the weight footprint. For a 235B MoE model at FP8, you are loading 235 GB of weights before adding any training overhead. Active parameter count is irrelevant to memory planning.

Dead-expert collapse. Experts that the router consistently routes tokens away from receive no gradient signal. Over a fine-tuning run, their weights degrade or stagnate. The base model was pre-trained with an auxiliary load-balancing loss to prevent this, but fine-tuning shifts the input distribution and can break the router's learned balance. Without explicit auxiliary loss during fine-tuning, you can lose 10-30% of your expert capacity by step 1000.

Router instability. The gate network's weights shift during fine-tuning. Without z-loss stabilization, the router's softmax output can become peaky: all probability mass on the same 1-2 experts. Once that happens, every token routes to those experts regardless of input, effectively converting your MoE into a very small dense model. Recovery requires resetting the router weights and restarting fine-tuning from a checkpoint.

Load imbalance and capacity overflow. Each expert has a capacity buffer: the maximum number of tokens it can process in a given batch. Without capacity factor tuning, a single expert can receive all tokens in a batch when the router assigns high probability to it. Tokens that overflow an expert's capacity are either dropped (hard cap) or buffered (soft cap). Dropped tokens produce incorrect gradients and can corrupt fine-tuning runs silently.

VRAM Sizing for MoE Fine-Tuning

The formula for LoRA-based fine-tuning memory:

VRAM = base_model_weights + adapter_weights + optimizer_state + gradient_buffers + activations

For LoRA-based fine-tuning (the most common approach), the base model is frozen and loaded at FP8, or at 4-bit (INT4/NF4) via BitsAndBytes. Only the adapter weights are trained in BF16:

Base model: total_params × bytes_per_dtype (FP8 = 1 byte, INT4 = 0.5 bytes)
Adapter weights: rank × layer_count × hidden_dim × 2 (BF16 = 2 bytes) × 2 matrices
Optimizer state: AdamW on adapter params only, 3× adapter params at FP32
Gradient buffers: 1× adapter params at BF16

For adapter-only training, the optimizer state is small relative to the base model footprint. A rank-32 adapter on 235B Qwen3 has roughly 50M trainable parameters: optimizer state adds ~600 MB, not GB.
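The sizing rules above can be turned into a small calculator. This is a minimal sketch, assuming decimal gigabytes and the byte counts listed above; the function and class names are illustrative, and activations are left out because they depend on batch size and sequence length.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes, so 235B params at 1 byte each = 235 GB


@dataclass
class LoraMemoryEstimate:
    base_weights_gb: float
    adapter_weights_gb: float
    optimizer_state_gb: float
    gradient_buffers_gb: float

    @property
    def static_total_gb(self) -> float:
        # Excludes activations, which vary with batch size and sequence length.
        return (self.base_weights_gb + self.adapter_weights_gb
                + self.optimizer_state_gb + self.gradient_buffers_gb)


def estimate_lora_memory(total_params: float, base_bytes_per_param: float,
                         adapter_params: float) -> LoraMemoryEstimate:
    """BF16 adapters, 3x FP32 AdamW state, BF16 gradients, frozen base weights."""
    return LoraMemoryEstimate(
        base_weights_gb=total_params * base_bytes_per_param / GB,
        adapter_weights_gb=adapter_params * 2 / GB,      # BF16 = 2 bytes
        optimizer_state_gb=adapter_params * 3 * 4 / GB,  # 3 FP32 copies = 12 bytes/param
        gradient_buffers_gb=adapter_params * 2 / GB,     # BF16 = 2 bytes
    )


# Qwen3-235B-A22B at FP8 with roughly 50M trainable adapter parameters
qwen = estimate_lora_memory(total_params=235e9, base_bytes_per_param=1.0,
                            adapter_params=50e6)
print(f"base {qwen.base_weights_gb:.0f} GB, "
      f"optimizer {qwen.optimizer_state_gb * 1000:.0f} MB, "
      f"static total {qwen.static_total_gb:.1f} GB")
```

Treat the static total as a floor: add activation memory and framework overhead before picking a GPU count.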

Important distinction from inference VRAM: inference-only VRAM figures do not include gradient buffers or optimizer state. Always size your fine-tuning cluster from the training formula above, not the inference sizing guide.

Model	Total Params	FP8 Base Weights	INT4 (QLoRA) Base	Min GPU Config (LoRA r=32)
Qwen3-235B-A22B	235B	~235 GB	~120 GB	2x H200 SXM5 (282 GB) at FP8 LoRA
DeepSeek V4	~1T (provisional)	~1000 GB	~500 GB	8x H200 SXM5 (1128 GB) at FP8 LoRA
GLM-5.2	744B	~744 GB	~375 GB	6x H200 SXM5 (846 GB) at FP8 LoRA

DeepSeek V4 specs are based on pre-release information (1T total parameters, ~37B active). Verify against the official model card before provisioning hardware. The ~1000 GB FP8 figure is consistent with what we use in the DeepSeek V4 deployment guide.

GLM-5.2 at INT4 needs ~375 GB for weights alone. Even with QLoRA (INT4 base, BF16 adapters), the gradient computation runs in BF16 internally, so effective VRAM during a forward-backward pass is higher than the static weight footprint. Budget at least 6x H200 for GLM-5.2 QLoRA.

GLM-5.2 figures are based on pre-release specifications (744B total, 40B active). Verify against the official model card before provisioning hardware.

For Spheron H200 instances at $4.54/hr on-demand or $3.31/hr spot, a 4x H200 node for DeepSeek V4 QLoRA costs roughly $18/hr on-demand or $13/hr spot. B200 SXM6 is available on spot at $5.34/hr per GPU when you need the higher HBM3e bandwidth for expert dispatch, though on-demand B200 availability has been limited.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Jul 2026 and may have changed. Check current GPU pricing for live rates.

Parallelism Strategy: Expert Parallelism for Training, Not Just Inference

Dense model training uses data parallelism, tensor parallelism, or pipeline parallelism. MoE fine-tuning adds a fourth: expert parallelism. The choice between them has a direct effect on memory usage, communication overhead, and gradient quality.

Expert parallelism (EP): each GPU holds a complete set of expert weights for a subset of the expert groups. During forward pass, all-to-all dispatch routes tokens to the GPU holding the selected expert. During backward pass, gradients flow back via the same all-to-all path. EP is appropriate when the total expert weight footprint fits across your GPU cluster and your GPU count roughly matches the number of expert groups.

Tensor parallelism (TP): splits individual expert weight matrices across GPUs column-wise or row-wise. All GPUs participate in every layer computation with all-reduce at each layer boundary. Use TP when individual expert layers are too large for a single GPU's VRAM. This is uncommon for most current MoE models but relevant for hypothetical models with very large per-expert FFN dimensions.

Pipeline parallelism (PP): assigns transformer layers to GPU stages sequentially. Generally avoid for MoE fine-tuning. Pipeline bubbles compound with expert dispatch stalls, and expert routing across pipeline stage boundaries adds correctness complexity for backward pass gradient accumulation.

FSDP2 with expert grouping: for single-node or 2-node setups, FSDP2's fully_shard() applied per expert group keeps each expert's weights on one GPU without the full all-to-all overhead of EP. This is simpler to configure than pure expert parallelism and works well for Qwen3-235B on 2x-4x H200 nodes.

Recommended hybrid for multi-node: EP × TP. On a 2-node 16x H200 cluster, use EP=8 × TP=2 so each node holds a full expert set and attention layers are split within the node via NVLink (900 GB/s) while expert dispatch crosses InfiniBand between nodes.

ZeRO-3 incompatibility: DeepSpeed ZeRO-3 shards all parameters, including expert weights, across data-parallel ranks. This conflicts directly with expert parallelism, which requires each GPU to own a complete expert set. When using DeepSpeed with MoE models, use ZeRO-2 (gradient and optimizer state sharding only) rather than ZeRO-3. ZeRO-3 + MoE is an active area of DeepSpeed development, but the default config as of mid-2026 scatters expert weights in a way that breaks expert parallel dispatch.
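For reference, a ZeRO-2 configuration is short. The sketch below builds one as a Python dict and serializes it to JSON; the keys are standard DeepSpeed options, but the "auto" values assume you launch through the Hugging Face Trainer integration (or a framework built on it), which fills them in. The variable names are illustrative.

```python
import json

# Minimal ZeRO-2 config: optimizer state and gradients are sharded,
# parameters (including expert weights) stay whole on each rank.
zero2_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # stage 3 would also shard parameters, which is what we avoid
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    # "auto" is resolved by the Hugging Face Trainer integration (assumption).
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

zero2_json = json.dumps(zero2_config, indent=2)
print(zero2_json)
```

Write the JSON to a file and pass its path to your framework's deepspeed setting.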

For FSDP2 and DeepSpeed ZeRO configuration basics, see the distributed LLM training guide.

Load Balancing During Training: Auxiliary Loss, Router Z-Loss, and Capacity Factors

This is the section most MoE fine-tuning guides skip, and it is where most failed fine-tuning runs originate.

Auxiliary Load-Balancing Loss

The auxiliary load-balancing loss penalizes routing imbalance by computing a weighted sum over all experts of (expert usage fraction × router probability). A coefficient of zero means no load balancing enforcement; experts drift freely toward whatever the router prefers. A coefficient of 0.01 (the DeepSeek training default) is the standard starting point.

python

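from transformers import MixtralConfig

config = MixtralConfig.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
config.output_router_logits = True  # the aux loss is only added to the training loss when this is on
config.router_aux_loss_coef = 0.01  # standard starting point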

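For DeepSeek-family models, the field name depends on the training stack: Megatron-style configs use moe_aux_loss_coeff, while the Hugging Face remote-code configs for DeepSeek-V2/V3 expose aux_loss_alpha. For Qwen2-MoE models, it is router_aux_loss_coef. Always check the specific model's config schema.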
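To make the "usage fraction × router probability" sum concrete, here is a framework-free sketch of a Switch-style auxiliary loss. It assumes the common formulation coef × N × Σ f_i × P_i; individual model implementations differ in normalization details, and all names here are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits: List[List[float]], top_k: int, coef: float) -> float:
    """coef * N * sum_i(f_i * P_i).

    f_i: fraction of routing slots assigned to expert i.
    P_i: mean router probability for expert i over all tokens.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    slot_counts = [0] * num_experts
    prob_sums = [0.0] * num_experts

    for logits in router_logits:
        probs = softmax(logits)
        # Select the top_k experts for this token.
        chosen = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        for e in chosen:
            slot_counts[e] += 1
        for e in range(num_experts):
            prob_sums[e] += probs[e]

    f = [c / (num_tokens * top_k) for c in slot_counts]
    p = [s / num_tokens for s in prob_sums]
    return coef * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced: each token prefers a different expert. Collapsed: all prefer expert 0.
balanced_logits = [[2.0 if e == t % 4 else 0.0 for e in range(4)] for t in range(8)]
collapsed_logits = [[2.0, 0.0, 0.0, 0.0] for _ in range(8)]
balanced = load_balancing_loss(balanced_logits, top_k=1, coef=0.01)
collapsed = load_balancing_loss(collapsed_logits, top_k=1, coef=0.01)
print(balanced, collapsed)
```

The collapsed router produces a larger penalty than the balanced one, which is the gradient signal that pushes tokens back toward underused experts.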

Too low (< 0.005): dead experts appear after 500-1000 training steps. You will see some experts receiving 0% of tokens in routing histograms.

Too high (> 0.05): routing collapses to near-uniform distribution. Every token gets the same expert assignment probabilities regardless of content, which defeats the purpose of MoE routing and reduces the effective model capacity.

Monitor per-expert token frequency every 50 steps. A healthy distribution has every expert receiving between 1/(2N) and 3/(2N) of routed token slots (where N is the number of routed experts in the layer), with occasional deviations for highly specialized inputs.
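A monitoring check along these lines can be sketched in a few lines of Python. It assumes you already collect the expert index chosen for each routed token slot (how you extract that depends on your framework); the function names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List


def expert_token_shares(assignments: Iterable[int], num_experts: int) -> List[float]:
    """Fraction of routed token slots that each expert received."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no routing assignments recorded")
    return [counts.get(e, 0) / total for e in range(num_experts)]


def routing_report(shares: List[float]) -> Dict[str, List[int]]:
    """Flag experts outside the 1/(2N) to 3/(2N) band."""
    n = len(shares)
    low, high = 1 / (2 * n), 3 / (2 * n)
    return {
        "dead": [e for e, s in enumerate(shares) if s == 0.0],
        "starved": [e for e, s in enumerate(shares) if 0.0 < s < low],
        "overloaded": [e for e, s in enumerate(shares) if s > high],
    }


# Toy example: 4 experts, expert 0 overloaded, expert 3 never selected
assignments = [0] * 7 + [1] * 2 + [2] * 1
shares = expert_token_shares(assignments, num_experts=4)
report = routing_report(shares)
print(report)
```

Log the report alongside the training loss so that a drifting router shows up before it becomes a collapsed one.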

Router Z-Loss

Introduced by STMoE (Zoph et al. 2022), router z-loss penalizes large logits from the router gate network:

z_loss = z_loss_coeff * mean(log(sum(exp(router_logits)))^2)

Large router logits make the softmax output peaky, pushing the router toward always selecting the same few experts. Z-loss regularizes this by penalizing high router logit magnitude, keeping the softmax numerically stable and the routing more diverse.
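The formula above is easy to check by hand. This is a framework-free sketch of the same computation, using a numerically stable log-sum-exp; the function names are illustrative, and in a real training loop you would compute this on the router logits tensor and add it to the loss.

```python
import math
from typing import List


def logsumexp(logits: List[float]) -> float:
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in logits))


def router_z_loss(router_logits: List[List[float]], coef: float) -> float:
    """coef * mean over tokens of logsumexp(logits)^2."""
    squared = [logsumexp(row) ** 2 for row in router_logits]
    return coef * sum(squared) / len(squared)


small = router_z_loss([[0.1, -0.2, 0.0, 0.3]] * 4, coef=0.001)
large = router_z_loss([[10.0, -2.0, 0.0, 3.0]] * 4, coef=0.001)
print(small, large)
```

The row with a logit of 10.0 is penalized far more heavily than the row of small logits, which is what keeps the gate from saturating.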

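Standard coefficient: router_z_loss_coef = 0.001. In Hugging Face Transformers this field is exposed on SwitchTransformersConfig. Mixtral-family configs do not implement z-loss as far as we are aware, so for those models the attribute below only takes effect if your training loop reads it and adds the z-loss term itself: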

python

config.router_z_loss_coef = 0.001

Z-loss is especially important during the first 200-500 training steps when the fine-tuning objective is pulling the model distribution away from the pre-trained state. Without it, router logit magnitudes can spike early and set a bad routing pattern that persists throughout training.

Capacity Factors

Each expert processes at most capacity_factor × (tokens_per_batch / num_experts) tokens. Tokens routed to an overloaded expert are either dropped (hard capacity) or processed at the cost of batch efficiency (soft capacity).

During fine-tuning, set capacity factor slightly higher than inference defaults:

Training: capacity_factor = 1.25-1.5 to absorb routing instability during the first few hundred steps
Inference (deployed model): capacity_factor = 1.0-1.1 for production efficiency

python

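# Illustrative field name: capacity settings are framework-specific, and dropless
# implementations (such as Mixtral in Transformers) ignore them entirely
config.expert_capacity_factor = 1.25  # training default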

Setting it too low during early training causes token drops during the instability phase, producing incorrect batch gradients. Setting it too high wastes memory on expert buffers that never fill.
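The effect of the capacity factor on dropped tokens can be shown with a toy batch. This sketch assumes top-1 routing and a hard cap, and rounds capacity up to a whole token; the function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Tuple


def expert_capacity(tokens_per_batch: int, num_experts: int, capacity_factor: float) -> int:
    """Max tokens one expert accepts: capacity_factor * tokens / experts."""
    return math.ceil(capacity_factor * tokens_per_batch / num_experts)


def count_dropped(assignments: List[int], num_experts: int,
                  capacity_factor: float) -> Tuple[int, int]:
    """Return (capacity, dropped_tokens) under a hard cap with top-1 routing."""
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = Counter(assignments)
    dropped = sum(max(0, n - capacity) for n in load.values())
    return capacity, dropped


# 16 tokens, 4 experts, skewed routing: expert 0 receives half the batch
assignments = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
tight = count_dropped(assignments, num_experts=4, capacity_factor=1.0)
roomy = count_dropped(assignments, num_experts=4, capacity_factor=1.5)
print(tight, roomy)
```

Raising the factor from 1.0 to 1.5 halves the drops in this batch but does not eliminate them, which is why the capacity factor complements the auxiliary loss rather than replacing it.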

Expert Temperature

The router softmax temperature controls routing sharpness:

Default (1.0): standard routing behavior from pre-training
Lower (0.7-0.8): sharpens routing, reduces expert coverage, faster convergence on narrow domains
Higher (1.2-1.5): smooths routing, improves expert utilization coverage, better for general-purpose adaptation

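For task-specific fine-tuning (coding, math, instruction following on a specific domain), start at 1.0. Reduce to 0.8 only if routing stays balanced and you want sharper specialization; if you see load imbalance after 500 steps, raise the temperature toward 1.2 instead, since lowering it sharpens routing further. For broad instruction tuning, keep at 1.0 or raise slightly to 1.1.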
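To see what temperature does to the routing distribution, the sketch below applies a temperature-scaled softmax to one token's router logits. The logit values are made up for illustration, and where the temperature is actually applied depends on your model's gate implementation.

```python
import math
from typing import List


def routing_probs(logits: List[float], temperature: float) -> List[float]:
    """Softmax over router logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical router logits for one token
for t in (0.7, 1.0, 1.5):
    probs = routing_probs(logits, t)
    print(f"T={t}: top expert probability {max(probs):.3f}")
```

Lower temperatures concentrate probability on the top expert, and higher temperatures spread it across the others.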

Parameter-Efficient MoE Fine-Tuning: MoELoRA, MixLoRA, and LoRAMoE

Several approaches exist for parameter-efficient MoE fine-tuning, each targeting a different problem.

Standard LoRA on Attention Layers Only

The simplest approach: target q_proj, k_proj, v_proj, o_proj. Expert FFN weights stay frozen. Cheapest in VRAM and training time.

python

from peft import LoraConfig, get_peft_model, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, config)

Use this when: you are adapting style, format, or instruction-following behavior where routing patterns do not need to change. The fine-tuned model will use the same expert specializations as the base model but with adapted attention patterns.

Limitation: if the task requires different expert coverage than the base model learned (e.g., adapting a general-purpose model to a narrow domain), attention-only LoRA cannot change how the router distributes tokens across experts.

MoELoRA: Per-Expert FFN Adapters

MoELoRA applies separate LoRA adapters to each expert's FFN layers. The A matrices (down-projection into bottleneck) are shared within an expert group; the B matrices (up-projection from bottleneck) are independent per expert.

First, enumerate the actual layer names in your model. Naming conventions differ between architectures:

python

# Verify layer names before setting target_modules
for name, module in model.named_modules():
    if "expert" in name.lower() or "mlp" in name.lower():
        print(name)

For Mixtral-family: look for block_sparse_moe.experts.N.w1, w2, w3. For DeepSeek-family: mlp.experts.N.gate_proj, up_proj, down_proj. For Qwen2-MoE: mlp.experts.N.gate_proj, up_proj, down_proj.

python

# MoELoRA targeting expert FFN layers (Qwen2-MoE / DeepSeek naming)
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
)

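PEFT applies a separate adapter per matching named module, so this automatically creates per-expert adapters for each expert's FFN layers. Note that with plain PEFT each matched module gets its own A and B matrices; the shared-A arrangement described above requires a custom implementation. The adapter parameter count scales with the number of experts: for a 64-expert model with rank-16 adapters on 3 FFN layers, you get 64 × 3 = 192 adapter pairs per MoE layer.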
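The scaling is easy to estimate before you launch. This sketch counts trainable parameters for independent per-expert LoRA adapters on the three FFN projections; the hidden and intermediate sizes are hypothetical values chosen for readable arithmetic, not any specific model's dimensions.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """One LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def per_expert_ffn_adapter_params(hidden: int, expert_intermediate: int, rank: int,
                                  num_experts: int, num_moe_layers: int) -> int:
    per_expert = (
        lora_params(hidden, expert_intermediate, rank)    # gate_proj
        + lora_params(hidden, expert_intermediate, rank)  # up_proj
        + lora_params(expert_intermediate, hidden, rank)  # down_proj
    )
    return per_expert * num_experts * num_moe_layers


# Hypothetical dimensions for a single MoE layer with 64 experts
per_layer = per_expert_ffn_adapter_params(hidden=4096, expert_intermediate=1536,
                                          rank=16, num_experts=64, num_moe_layers=1)
adapter_pairs_per_layer = 64 * 3
print(per_layer, adapter_pairs_per_layer)
```

Multiply the per-layer figure by the number of MoE layers in your model, then feed the result into the optimizer-state estimate from the VRAM sizing section.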

VRAM overhead vs attention-only LoRA: roughly 5-10% more for the additional adapter weights and their optimizer states. Still a fraction of the base model footprint.

MixLoRA: Building Sparse MoE from a Dense Model

MixLoRA (TUDB-Labs, arXiv 2404.15159) works differently from MoELoRA: rather than adding adapters to an existing sparse MoE model, it inserts LoRA-based expert layers into a frozen dense model's FFN blocks to construct a new sparse MoE structure from scratch. A top-k router is trained alongside the LoRA experts to select which expert activates per token. The base model's original FFN weights stay frozen.

This is a different use case from fine-tuning an already-sparse model like DeepSeek V4 or Qwen3-235B-A22B. MixLoRA is the right tool when you want to inject sparse routing into a dense base, for example converting a dense Qwen2-7B into a sparse MoE with domain-specialized expert groups, not when adapting an existing MoE architecture.

For API usage and supported dense base models, check the official TUDB-Labs/MixLoRA repository before using it in production.

LoRAMoE: Preserving World Knowledge During Task Fine-Tuning

LoRAMoE (arXiv 2312.09979) targets a different problem: catastrophic forgetting. When fine-tuning on a narrow task dataset, models tend to lose the general world knowledge they acquired during pre-training. LoRAMoE addresses this by inserting multiple LoRA experts per layer with a lightweight router, assigning some experts to the fine-tuning task and regularizing others to retain general knowledge.

Two expert groups train with separate objectives. Task experts minimize the loss on your fine-tuning dataset. Knowledge-preservation experts are regularized to stay close to the base model's predictions on general text, preventing drift toward task-specific behavior. The router learns to dispatch each token to the appropriate expert type.

This approach is most useful when the training dataset is small enough to cause overfitting, or when off-task queries must still work after fine-tuning: coding agents that also answer general questions, narrow-domain assistants where general reasoning must stay intact, or instruction tuning where world knowledge loss would degrade model usefulness.

LoRAMoE is not a standalone training CLI. Implementations typically modify PEFT's LoRA adapter structure to use multiple adapter pairs per layer with a shared router. Refer to arXiv 2312.09979 for implementation details.

Decision Matrix

Method	Adapts Routing	VRAM Overhead vs Base	Best For
Standard LoRA (attention only)	No	~1-3%	Style, format, instruction tuning
MoELoRA (per-expert MLP)	Partially	~5-10%	Domain adaptation, expert specialization
LoRAMoE	No	~8-15%	Task fine-tuning without world-knowledge forgetting
MixLoRA	N/A (dense → MoE)	~8-15%	Converting a dense model into a sparse MoE
Full fine-tune	Yes	3-4× base weights	Maximum quality, maximum cost

For non-MoE PEFT method alternatives (DoRA, GaLore, PiSSA), see the PEFT methods guide. Standard LoRA applied only to attention layers cannot shift expert routing; for MoE models where routing patterns need to change, MoELoRA is the right tool.

Framework Setup: Axolotl, LLaMA-Factory, and MixLoRA

Axolotl

For Mixtral-family and Qwen2-MoE models, Axolotl handles MoE fine-tuning with minimal config:

yaml

# axolotl_moe_config.yaml
base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: false  # use FP8 base weights if supported, else BF16
bf16: true

adapter: lora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - w1  # Mixtral expert FFN names; for Qwen2-MoE or DeepSeek use gate_proj, up_proj, down_proj
  - w2
  - w3

sequence_len: 4096
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3

deepspeed: /path/to/zero2_config.json  # NOT zero3 for MoE

Note that the deepspeed entry points at a ZeRO-2 config: use ZeRO-2, not ZeRO-3, for MoE. ZeRO-3 shards expert weights in a way that conflicts with expert parallel dispatch.

Multi-GPU launch:

bash

accelerate launch --num_processes 8 --config_file accelerate_zero2.yaml \
  -m axolotl.cli.train axolotl_moe_config.yaml

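Patch in the auxiliary loss coefficient via a model config override in your script. The z-loss attribute only has an effect if the model or your training loop implements router z-loss:

python

model.config.output_router_logits = True  # needed for the aux loss to reach the training loss
model.config.router_aux_loss_coef = 0.01
model.config.router_z_loss_coef = 0.001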

LLaMA-Factory

LLaMA-Factory supports DeepSeek-V3 and Qwen-MoE families out of the box with finetuning_type: lora:

bash

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train \
  --model_name_or_path Qwen/Qwen2-57B-A14B-Instruct \
  --finetuning_type lora \
  --lora_target all \
  --lora_rank 16 \
  --lora_alpha 32 \
  --template qwen \
  --dataset your_dataset \
  --cutoff_len 4096 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 3 \
  --bf16 true \
  --deepspeed examples/deepspeed/ds_z2_config.json

LLaMA-Factory exposes the auxiliary loss coefficient as a model argument: append --moe_aux_loss_coef 0.01 to the command above, and confirm the argument name against your installed version.

MixLoRA

MixLoRA is appropriate for the dense→sparse use case described above, not for fine-tuning already-sparse models. If you are using it, verify the current CLI and flag names against the TUDB-Labs/MixLoRA repository before running, as the interface may differ from what is shown below:

bash

# Verify flags against repo before use
python -m mixlora.train \
  --config mixlora_config.yaml \
  --dataset_path your_dataset.jsonl \
  --output_dir ./mixlora_output \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-4

For framework selection beyond the MoE-specific considerations above, see the Axolotl vs Unsloth vs TorchTune comparison.

Multi-Node Setup on Spheron for Expert-Parallel Training

Step 1: Provision and Verify Hardware

Provision 2x 8-GPU H200 SXM5 nodes at app.spheron.ai. After you SSH in, verify NVLink topology within each node:

bash

nvidia-smi topo -m

You want NVLink (NV) connections between all 8 GPUs on the same node. PCIe connections (PIX, PXB) are significantly slower for all-to-all expert dispatch. Verify InfiniBand between nodes:

bash

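ibstat
# Look for Port State: Active with Physical state: LinkUp
# Cross-node bandwidth test: start the server side on node 0, then connect from node 1
ib_send_bw --duration 5                      # on node 0
ib_send_bw --duration 5 <node-0-private-ip>  # on node 1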

See the Spheron docs for SSH configuration, private networking, and node IP assignment details.

Step 2: Configure NCCL for Expert Dispatch

bash

export MASTER_ADDR=<node-0-private-ip>
export MASTER_PORT=29500
export NCCL_IB_HCA=<infiniband-hca-device>  # e.g., mlx5_0
export NCCL_IB_GID_INDEX=3  # for RoCE; set to 0 for standard IB
export NCCL_DEBUG=INFO  # for initial runs; remove after verifying

Step 3: Launch Expert-Parallel Training

On both nodes simultaneously:

bash

# Node 0
torchrun \
  --nnodes 2 \
  --nproc_per_node 8 \
  --node_rank 0 \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  train_moe.py \
  --expert_parallel_size 8 \
  --tensor_parallel_size 2

# Node 1
torchrun \
  --nnodes 2 \
  --nproc_per_node 8 \
  --node_rank 1 \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  train_moe.py \
  --expert_parallel_size 8 \
  --tensor_parallel_size 2

With EP=8 and TP=2, each of the 8 GPUs per node holds one expert group. Attention layers are split across pairs of GPUs within a node via NVLink (fast). Expert dispatch crosses InfiniBand between nodes (slower, but only at expert routing boundaries).

For H200 SXM5 nodes on Spheron at $4.54/hr on-demand, a 2x 8-GPU (16 total) cluster for a 72-hour DeepSeek V4 fine-tuning run costs approximately $5,230. On H200 spot at $3.31/hr that drops to $3,813 if you run with checkpointing.
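The arithmetic behind those figures is rate × GPUs × hours. A small helper, using the per-GPU rates quoted above as inputs, makes it easy to rerun with current prices; the function name is illustrative.

```python
def cluster_cost(rate_per_gpu_hour: float, num_gpus: int, hours: float) -> float:
    """Total cost in dollars for a fixed-size cluster over a run."""
    return rate_per_gpu_hour * num_gpus * hours


on_demand = cluster_cost(4.54, num_gpus=16, hours=72)
spot = cluster_cost(3.31, num_gpus=16, hours=72)
savings_fraction = 1 - spot / on_demand
print(f"on-demand ${on_demand:,.0f}, spot ${spot:,.0f}, savings {savings_fraction:.0%}")
```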


Checkpointing and Spot-Instance Resilience for Long Expert-Parallel Runs

Expert-parallel fine-tuning runs on MoE models are long: 72-240 hours for a full SFT pass on 100K+ examples with a 1T parameter model. Running on spot instances cuts costs significantly, but requires robust checkpointing.

SIGTERM Handler for Preemption

python

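import signal
from transformers import TrainerCallback

class PreemptionCallback(TrainerCallback):
    """On SIGTERM, checkpoint at the end of the current step, then stop."""
    def __init__(self):
        self.preempted = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, sig, frame):
        self.preempted = True  # only set a flag; the save happens in on_step_end

    def on_step_end(self, args, state, control, **kwargs):
        if self.preempted:
            control.should_save = True
            control.should_training_stop = True
        return control

# Register with Trainer(..., callbacks=[PreemptionCallback()])

On Spheron spot instances, preemption webhooks fire with 30-120 seconds of warning. The SIGTERM handler above sets a flag, and the Trainer then writes a regular, resumable checkpoint at the end of the current step before exiting. Measure how long a checkpoint write takes on your storage so you know it fits inside the warning window.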

Checkpoint Frequency and Location

python

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/mnt/persistent-storage/checkpoints",  # mounted NFS or S3
    save_strategy="steps",
    save_steps=100,          # checkpoint every 100 steps
    save_total_limit=5,      # keep last 5 checkpoints to limit storage
)

Save to a network-mounted path (NFS or S3-compatible via s3fs), not local NVMe. Local storage does not survive instance preemption.

Expert-Parallel Checkpoint Specifics

Expert-parallel checkpoints are rank-sharded by default in FSDP2. This means a checkpoint saved from a 16-GPU EP=8 TP=2 configuration cannot be loaded on a different GPU topology without reconstruction.

To avoid this constraint, save the full model state dict instead of rank-sharded shards:

python

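import torch
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import StateDictOptions, get_model_state_dict

# After training or at checkpoint time: gather an unsharded state dict.
# With cpu_offload=True, only rank 0 materializes the full copy.
options = StateDictOptions(full_state_dict=True, cpu_offload=True)
state_dict = get_model_state_dict(model, options=options)
if dist.get_rank() == 0:
    torch.save(state_dict, "/mnt/persistent-storage/full_model.pt")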

This produces a single file containing all expert weights, resumable on any topology. The tradeoff is that the save takes longer (all GPUs gather to rank 0 before writing). For 235B Qwen3 at FP8, expect a full checkpoint save to take 3-8 minutes.

Cost Math with Spot Checkpointing

B200 SXM6 spot is available at $5.34/hr per GPU. For an 8-GPU B200 node running a 72-hour Qwen3-235B-A22B fine-tuning run:

B200 spot: $5.34 × 8 × 72 = $3,076
H200 on-demand (same task, same time): $4.54 × 8 × 72 = $2,615

B200 provides higher HBM3e bandwidth (8.0 TB/s vs 4.8 TB/s on H200) which speeds up expert dispatch. For a run where expert dispatch is the bottleneck, B200 may finish in 60 hours vs H200's 72 hours, making total cost $5.34 × 8 × 60 = $2,563. That is cheaper than H200 on-demand and faster.

With checkpointing every 100 steps at roughly 30 seconds/step, a preemption event costs at most 100 steps (50 minutes) of wasted work, or 50 steps on average, plus 5-8 minutes for checkpoint load on restart. At 30s/step and 72-hour total runtime, even 100 wasted steps is less than 2% overhead for the ~27% cost reduction of spot over on-demand. For spot instance checkpointing patterns and async offload strategies, see the spot GPU training resilience guide.
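The overhead estimate can be reproduced with a few lines. This sketch uses the worst case (a full checkpoint interval lost) plus the restart time, and treats the number of preemptions as an input you would estimate from your own spot history; the function names are illustrative.

```python
def worst_case_lost_minutes(save_every_steps: int, seconds_per_step: float) -> float:
    """Work lost if preemption hits just before the next checkpoint."""
    return save_every_steps * seconds_per_step / 60


def preemption_overhead(save_every_steps: int, seconds_per_step: float,
                        restart_minutes: float, run_hours: float,
                        preemptions: int = 1) -> float:
    """Fraction of total runtime spent on redone steps and restarts."""
    lost = worst_case_lost_minutes(save_every_steps, seconds_per_step)
    return preemptions * (lost + restart_minutes) / (run_hours * 60)


lost = worst_case_lost_minutes(save_every_steps=100, seconds_per_step=30)
overhead = preemption_overhead(save_every_steps=100, seconds_per_step=30,
                               restart_minutes=8, run_hours=72)
print(f"worst-case loss {lost:.0f} min, overhead {overhead:.2%} per preemption")
```

If your instances are preempted several times per run, multiply accordingly and compare the result against the spot discount.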


Serving the Fine-Tuned MoE: vLLM and SGLang Expert-Parallel Handoff

Merge the LoRA Adapter

For attention-only LoRA:

python

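from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the same base checkpoint the adapter was trained on (placeholder path)
base_model = AutoModelForCausalLM.from_pretrained("path/to/base-model", torch_dtype="auto")
model = PeftModel.from_pretrained(base_model, "path/to/adapter")
model = model.merge_and_unload()  # folds the adapter deltas into the base weights
model.save_pretrained("path/to/merged-model")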

For MoELoRA with per-expert adapters, use safe_merge=True. It checks the merged weights for NaNs before committing them, which catches numerical issues when merging adapters into experts with very different activation scales:

python

model = model.merge_and_unload(safe_merge=True)

For MixLoRA (dense→MoE conversion), consult the TUDB-Labs/MixLoRA repository for the merge API and verify the call signature against the current release before use.

vLLM with Expert Parallelism

bash

vllm serve /path/to/merged-model \
  --enable-expert-parallel \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --port 8000

For the DeepGEMM kernel optimizations that improve expert dispatch throughput on Hopper and Blackwell, see the DeepEP and DeepGEMM deployment guide.

For speculative decoding acceleration on the serving side after fine-tuning, see speculative decoding for MoE models. Draft-head speculation can stack on top of expert parallelism, but the draft model's maximum length and the number of speculative tokens must be set together to avoid draft head VRAM conflicts with expert dispatch buffers. The flag names for these settings have changed across vLLM releases, so check vllm serve --help for your version.

SGLang with Expert Parallelism

bash

# Flag names vary across SGLang releases; verify with --help before use
python -m sglang.launch_server \
  --model-path /path/to/merged-model \
  --tp 8 \
  --ep-size 8 \
  --quantization fp8 \
  --max-total-tokens 32768 \
  --port 8080

SGLang with DeepEP/DeepGEMM installed gives better all-to-all throughput for expert dispatch than vanilla vLLM on Hopper hardware. Follow the SGLang expert-parallel setup in the MoE inference guide for the full environment setup.

For V4-specific inference configuration including the DeepSeek Sparse Attention setup, see deploy DeepSeek V4 on GPU cloud. For GLM-5.2-specific inference config with 1M-token context, see deploy GLM-5.2 on GPU cloud.

Validate Expert Routing Health

After deploying the fine-tuned model, verify that the auxiliary loss worked and expert routing is healthy:

python

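import requests

# test_prompts: replace with your own held-out prompts (placeholder list)
test_prompts = ["Explain mixture-of-experts routing in one sentence."] * 100

# Run 100 test prompts against the OpenAI-compatible endpoint
for prompt in test_prompts[:100]:
    response = requests.post(
        "http://localhost:8000/v1/completions",
        # "model" must match --served-model-name (defaults to the model path)
        json={"model": "fine-tuned-moe", "prompt": prompt, "max_tokens": 50},
        timeout=60,
    )
    response.raise_for_status()
    # Log expert routing info if your vLLM build exposes it via metrics

# Check vLLM metrics endpoint for expert utilization
metrics = requests.get("http://localhost:8000/metrics", timeout=10).text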

A healthy fine-tuned MoE distributes tokens across at least 80% of its experts over a 100-request batch. If you see 50%+ of tokens routing to the same 3 experts on a 256-expert model, the auxiliary loss coefficient was too low during training. Retrain from a checkpoint where routing was still balanced, with the auxiliary loss coefficient (moe_aux_loss_coeff or your model's equivalent) increased to 0.02-0.05.
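Once you have per-expert token counts for one MoE layer, the two thresholds above reduce to a short check. This sketch assumes you have already collected those counts from your own routing logs, since a stock metrics endpoint may not expose them; the function names and sample counts are illustrative.

```python
from typing import Dict, List


def routing_health(expert_token_counts: List[int], top_n: int = 3) -> Dict[str, float]:
    """Coverage = share of experts that saw any token; top_share = share of
    tokens handled by the top_n busiest experts."""
    total = sum(expert_token_counts)
    if total == 0:
        raise ValueError("no routed tokens recorded")
    used = sum(1 for c in expert_token_counts if c > 0)
    top = sum(sorted(expert_token_counts, reverse=True)[:top_n])
    return {"coverage": used / len(expert_token_counts), "top_share": top / total}


def is_healthy(stats: Dict[str, float], min_coverage: float = 0.8,
               max_top_share: float = 0.5) -> bool:
    return stats["coverage"] >= min_coverage and stats["top_share"] < max_top_share


healthy_stats = routing_health([10, 12, 9, 11, 10, 8, 0, 10])
collapsed_stats = routing_health([90, 5, 5, 0, 0, 0, 0, 0])
print(is_healthy(healthy_stats), is_healthy(collapsed_stats))
```

The top-3 threshold is only meaningful for models with many experts; for an 8-expert model like the toy example, a uniform router already sends 37.5% of tokens to any three experts.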

MoE fine-tuning on multi-node GPU cloud works well when the interconnect, pricing, and checkpoint infrastructure are all in the same place. Spheron's H200 and B200 SXM nodes have the NVLink bandwidth expert parallelism needs, per-minute spot pricing with preemption webhooks for checkpoint-resilient workflows, and a clean path to vLLM/SGLang expert-parallel inference on the same cluster.


Quick Setup Guide

Calculate VRAM requirements for MoE fine-tuning
Use total parameter count (not active parameter count) multiplied by bytes per precision for base model VRAM. Add optimizer state (3x adapter params at FP32 for AdamW), gradient buffers (1x adapter params at BF16), and 10-15% framework overhead. For Qwen3-235B-A22B at FP8 LoRA with rank-32, that is ~235 GB weights + ~800 MB adapter state (adapter weights ~100 MB + optimizer ~600 MB + gradients ~100 MB) = ~2x H200 minimum. For DeepSeek V4 at INT4 QLoRA, ~500 GB weights + ~15 GB adapter state = ~4x H200 minimum with rank 16 on attention only.
Choose your parallelism strategy
For models that fit on one node (Qwen3-235B-A22B at FP8 on 2x H200), use FSDP2 with fully_shard() over transformer layers. For models requiring multi-node (DeepSeek V4, GLM-5.2 at FP8), use expert parallelism EP=num_gpus_per_node combined with tensor parallelism TP=2 for attention layers. Avoid DeepSpeed ZeRO-3 with expert parallelism: ZeRO-3 shards all parameters across data-parallel ranks, which conflicts with expert parallelism's assumption that each GPU owns a complete expert set.
Configure auxiliary load-balancing loss and router z-loss
In your model config or Trainer call, set moe_aux_loss_coeff=0.01 (DeepSeek-style default) and router z_loss_coeff=0.001. Monitor per-expert token share every 50 steps. If any expert receives less than 1/(2N) of tokens over a 200-step window (where N is the number of experts), the auxiliary loss coefficient is too low and you risk dead-expert collapse. If all experts receive nearly equal tokens regardless of input type, the coefficient is too high and routing quality degrades.
Set up MoELoRA or MixLoRA adapters with PEFT
For attention-only LoRA: target_modules=['q_proj','k_proj','v_proj','o_proj'] in LoraConfig. For MoELoRA: extend target_modules to include 'gate_proj','up_proj','down_proj' (or 'w1','w2','w3' for Mixtral-family). Verify actual layer names for your model with model.named_modules() since DeepSeek-family uses different naming than Mixtral-family. For MixLoRA: pip install mixlora, then configure it through the config file passed to the training command in the Framework Setup section, after checking the repository for the current interface.
Launch multi-node expert-parallel training on Spheron
Provision 2x 8-GPU H200 SXM5 nodes at app.spheron.ai. Verify NVLink with nvidia-smi topo -m on each node. Set MASTER_ADDR to node-0's private IP, MASTER_PORT=29500. Launch: torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT train_moe.py. Set expert_parallel_size=8 in your training config.
Checkpoint to persistent storage for spot resilience
Mount an NFS or S3-compatible volume before training. Set checkpoint_dir to the mounted path, not local NVMe. Add a SIGTERM handler that triggers a checkpoint save and a clean exit. Set save_steps=100 for incremental checkpoints. For expert-parallel training, save a full (unsharded) state dict rather than FSDP's rank-sharded format if you need to resume on a different topology.
Serve the fine-tuned MoE with vLLM or SGLang expert parallelism
Merge the LoRA adapter with model.merge_and_unload() from PEFT (pass safe_merge=True so PEFT checks the merged weights for NaNs, which is worth the extra time when merging per-expert MoELoRA adapters). Load in vLLM: vllm serve merged-model --enable-expert-parallel --tensor-parallel-size N --quantization fp8. For SGLang: python -m sglang.launch_server --model-path merged-model --enable-ep --tp N. Verify expert routing health by checking that at least 80% of experts receive some tokens over a 100-request test batch.
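The 80% coverage check can be expressed in a few lines. This sketch assumes you have already logged, for each request in the test batch, the set of expert indices the router selected (for example via forward hooks on the merged model). How those indices are captured depends on your serving stack and is not shown; routing_coverage and routing_is_healthy are hypothetical helper names.

```python
def routing_coverage(expert_ids_per_request, num_experts):
    """Fraction of experts that received at least one token across all requests."""
    seen = set()
    for expert_ids in expert_ids_per_request:
        seen.update(expert_ids)
    return len(seen) / num_experts


def routing_is_healthy(expert_ids_per_request, num_experts, min_coverage=0.8):
    """True if at least min_coverage of experts were used at least once."""
    return routing_coverage(expert_ids_per_request, num_experts) >= min_coverage
```

For a model with many layers, running the check per MoE layer rather than over the pooled indices gives a clearer picture, since a dead expert in one layer can be masked by same-numbered experts elsewhere.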

Frequently Asked Questions

How is fine-tuning a MoE model different from fine-tuning a dense LLM?

Dense model fine-tuning requires gradient storage for every parameter. MoE fine-tuning carries that same VRAM footprint for all expert weights, because every expert must stay resident in GPU memory even though only a few activate per forward pass. It also adds challenges of its own: the router gate can collapse, load imbalance between experts causes dropped tokens, and vanilla LoRA applied only to attention layers cannot adapt expert routing behavior. You need auxiliary load-balancing loss, router z-loss stabilization, capacity factor tuning, and MoE-aware PEFT methods (MoELoRA or LoRAMoE) to get stable results.

What is expert parallelism for MoE fine-tuning and when do I need it?

Expert parallelism assigns whole experts to specific GPUs, so each GPU holds the complete weights of its assigned experts rather than a slice of every expert. During training, tokens are dispatched via all-to-all communication to the GPU holding the chosen expert, in both the forward and backward pass. You need expert parallelism when your model's total expert weight footprint exceeds a single GPU's VRAM. For DeepSeek V4 at FP8, that is true on any GPU available today: the 1T parameter weight footprint requires ~1000 GB minimum across 8x H200 SXM5. Expert parallelism also improves throughput on NVLink-connected nodes by keeping expert compute cache-local.
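To make the dispatch step concrete, here is a toy sketch of how routed tokens map to destination ranks before the all-to-all. It assumes experts are assigned to ranks in contiguous blocks, which is one common layout but not the only one; owner_rank and plan_dispatch are hypothetical names, and no actual communication happens here.

```python
from collections import defaultdict


def owner_rank(expert_id, num_experts, ep_size):
    """Rank that owns an expert, assuming contiguous block assignment."""
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def plan_dispatch(topk_ids, num_experts, ep_size):
    """Group (token_index, expert_id) pairs by the rank that must process them.

    topk_ids holds one list of chosen expert indices per local token.
    The result is what an all-to-all would send to each destination rank.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    sends = defaultdict(list)
    for token_index, experts in enumerate(topk_ids):
        for expert_id in experts:
            rank = owner_rank(expert_id, num_experts, ep_size)
            sends[rank].append((token_index, expert_id))
    return dict(sends)


# 8 experts over 4 ranks: experts 0-1 on rank 0, 2-3 on rank 1, and so on.
plan = plan_dispatch([[0, 7], [2, 3]], num_experts=8, ep_size=4)
print(plan)
```

Uneven list lengths across ranks in the plan are the load imbalance that the auxiliary loss is meant to limit: the slowest rank sets the pace for the whole all-to-all.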

What is MoELoRA and how does it differ from standard LoRA on a MoE model?

Standard LoRA targets attention projection layers (q_proj, k_proj, v_proj, o_proj) and leaves expert FFN weights frozen. This is cheap but cannot change how the router assigns tokens to experts. MoELoRA applies a separate LoRA adapter to each expert's gate_proj, up_proj, and down_proj layers, with a shared A matrix per expert group and independent B matrices per expert. This enables per-expert specialization while keeping trainable parameters proportional to the number of experts. For tasks that need domain coverage different from the base model's, MoELoRA outperforms attention-only LoRA at the cost of 5-10% more VRAM.

How much VRAM does fine-tuning DeepSeek V4 with QLoRA require?

DeepSeek V4 at INT4 via BitsAndBytes (QLoRA) requires approximately 500 GB for base model weights. On top of that, LoRA adapter weights in BF16, optimizer states for adapter parameters, and gradient buffers add another 10-20 GB depending on rank. The practical minimum is 4x H200 SXM5 (564 GB total) for QLoRA with rank 16 on attention layers only. For MoELoRA that also adapts the expert FFN layers, budget 8x H200 to give the optimizer room during the first few hundred steps. Note: these are provisional figures based on pre-release DeepSeek V4 specs. Verify against the official model card once the weights are public.
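The arithmetic behind those figures is easy to reproduce. The sketch below uses only numbers stated above (1T parameters, 4 bits per weight, up to 20 GB of adapter overhead, 141 GB per H200 as implied by 4x = 564 GB). The function names are hypothetical, and activation memory is deliberately left out, so treat the result as a lower bound.

```python
import math

H200_GB = 141  # per-GPU memory implied by the 4x H200 = 564 GB figure


def weight_gb(total_params, bits_per_param):
    """Weight footprint in decimal GB for a parameter count and precision."""
    return total_params * bits_per_param / 8 / 1e9


def min_gpus(required_gb, gpu_gb=H200_GB):
    """Smallest GPU count whose combined memory covers required_gb."""
    return math.ceil(required_gb / gpu_gb)


base = weight_gb(1e12, 4)   # 1T parameters at INT4
worst_case = base + 20      # upper end of the 10-20 GB adapter overhead
gpus = min_gpus(worst_case)
headroom = gpus * H200_GB - worst_case
print(f"{base:.0f} GB weights, {gpus} GPUs, {headroom:.0f} GB headroom")
```

This gives 500 GB of weights, 4 GPUs, and 44 GB of headroom to be shared by activations and any framework overhead, which is why the attention-only configuration is described as the practical minimum rather than a comfortable fit.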

Can I fine-tune a MoE model on spot GPU instances?

Yes, with the right checkpointing setup. Set a SIGTERM handler in your training loop so the trainer saves a checkpoint synchronously when a preemption signal arrives. Spheron spot instances fire a preemption webhook with 30-120 seconds of warning, which is enough time to checkpoint an FSDP2 state dict to a network-mounted path. Write incremental checkpoints every 100-200 steps and full model state every 500 steps. With expert-parallel training, rank-sharded checkpoints must be resumed on the same GPU topology that wrote them. Save the full state_dict rather than rank-only shards unless you control both the save and resume topology.
