Engineering

Distributed LLM Training on GPU Cloud: FSDP, DeepSpeed ZeRO-3, and Megatron-Core Multi-Node Setup Guide (2026)

Written by Mitrasish, Co-founder · Apr 29, 2026

Training a 70B+ model is a different engineering discipline than running inference. You are not tuning a single GPU kernel or swapping quantization formats. You are coordinating dozens or hundreds of GPUs across multiple machines, managing memory across sharded parameters, and fighting communication overhead that compounds with every additional node. This post covers the full stack: choosing a parallelism strategy, configuring FSDP2 and DeepSpeed ZeRO-3, using Megatron-Core for very large models, launching multi-node jobs, tuning NCCL for your fabric, and estimating what it actually costs. Aimed at researchers and infra engineers who are ready to run real multi-node jobs.

Choosing Your Parallelism Strategy

Parallelism strategy is the first architectural decision, not a tuning knob you revisit after writing training code. Getting it wrong means either running out of GPU memory on the first attempt or leaving half your cluster idle due to communication bottlenecks. Pick the strategy before writing a single line of training code.

| Model Size | GPU VRAM | Recommended Strategy | Notes |
| --- | --- | --- | --- |
| < 7B | 40GB+ | DDP or ZeRO-1 | Single node, no sharding needed |
| 7B-13B | 80GB | FSDP2 (reshard_after_forward=False) or ZeRO-2 | Fits in 80GB with grad checkpointing |
| 30B-70B | 8x 80GB | FSDP2 (reshard_after_forward=True) or ZeRO-3 | Shards across 8 GPUs; requires CPUOffloadPolicy for optimizer states on 50B+ models |
| 70B+ | Multi-node | Megatron-Core TP+PP or ZeRO-3+DP | Communication-bound, needs fast fabric |
| 405B+ | 16+ nodes | Megatron-Core 3D + ZeRO-1 | Expert parallelism for MoE |

Data Parallelism (DP): Each GPU gets a full model copy and trains on different data. Gradients are synchronized at the end of each step via all-reduce. Simple to implement. The problem: a 70B model in BF16 takes 140GB of memory per GPU just for weights. That exceeds any single GPU today. DP alone cannot fit large models.

Tensor Parallelism (TP): Splits individual weight matrices across GPUs. A matrix multiply that would normally run on one GPU is distributed across multiple GPUs that each compute a shard, then all-reduce the partial results. Keeps communication within a node (NVLink bandwidth is 900 GB/s on H100 SXM5), which is why TP degree rarely exceeds 8. Cross-node TP hits InfiniBand instead of NVLink and the communication overhead eats into compute time fast.

Pipeline Parallelism (PP): Assigns different model layers to different GPU groups (stages). Forward activations pass forward through stages; backward gradients pass backward. Introduces pipeline bubbles at the boundaries where one stage waits for another to finish. Megatron-Core's interleaved 1F1B schedule reduces bubble ratio significantly compared to a naive implementation.

Expert Parallelism (EP): Specific to mixture-of-experts (MoE) architectures. Different expert subnetworks run on different GPUs. Requires all-to-all communication to route tokens to the right experts. Adds complexity but is required for 405B+ MoE models.

FSDP (Fully Sharded Data Parallel): Shards model parameters, gradients, and optimizer states across all GPUs in a group. Each GPU holds only a fraction of the parameters at rest and reconstructs the full parameters via all-gather only for the duration of its local forward and backward computation, releasing them again afterward. Achieves near-ZeRO-3 memory savings with a cleaner PyTorch-native API.

For very large models (70B+ on multi-node), combining strategies is standard. Megatron-Core's 3D parallelism stacks TP + PP + DP together. The typical configuration for a 70B model on 64x H100: TP=8 within each node (NVLink), PP=4 across nodes (InfiniBand), DP=2 for data throughput. This keeps the expensive all-reduce within the NVLink domain and uses InfiniBand only for the smaller pipeline boundary activations.
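
A minimal sketch of how those three process-group axes can be expressed in PyTorch, assuming PyTorch 2.3+ with init_device_mesh (the dimension names and shapes below are illustrative, matching the 64-GPU example above; this is not a Megatron-Core API):

python
# Illustrative 3D process-group layout for 64 GPUs: DP=2 x PP=4 x TP=8
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")

# TP is the fastest-varying dimension, so its 8 ranks land on GPUs within
# one node (NVLink); PP and DP groups span nodes (InfiniBand).
mesh_3d = init_device_mesh(
    "cuda",
    mesh_shape=(2, 4, 8),
    mesh_dim_names=("dp", "pp", "tp"),
)

tp_group = mesh_3d["tp"].get_group()  # 8-way group inside each node
pp_group = mesh_3d["pp"].get_group()  # 4-way group across nodes
dp_group = mesh_3d["dp"].get_group()  # 2-way group for data parallelism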

FSDP2 in PyTorch 2.6: Setup and Sharding

FSDP2, available in PyTorch 2.2+ and the recommended path in 2.6+, replaces the original FSDP API with a DTensor-based implementation. The API is cleaner: instead of wrapping modules with FullyShardedDataParallel(), you call fully_shard() directly on model submodules. The underlying sharding is now expressed as DTensors with Shard and Replicate placements over a device mesh, which makes it composable with torch.compile and other DTensor-aware tools.

python
# Illustrative FSDP2 setup - PyTorch 2.6
from torch.distributed._composable.fsdp import fully_shard, CPUOffloadPolicy
import torch.distributed as dist

dist.init_process_group(backend="nccl")

model = MyTransformerModel(...)

# Shard each transformer layer independently
# reshard_after_forward=True is the FSDP2 equivalent of FULL_SHARD
for layer in model.layers:
    fully_shard(layer, reshard_after_forward=True)

# Shard the root module last
fully_shard(model, reshard_after_forward=True)

# Optional: CPU offload for models that barely fit even when sharded
# fully_shard(model, reshard_after_forward=True,
#             offload_policy=CPUOffloadPolicy())

FSDP2 sharding options:

| Option | Memory | Communication | Use case |
| --- | --- | --- | --- |
| reshard_after_forward=True | Lowest | All-gather + reduce-scatter every step | 70B+ models, memory-bound (equivalent to FSDP1 FULL_SHARD) |
| reshard_after_forward=False | Medium | Params kept assembled after forward; reduce-scatter on grads | 13B-30B on 80GB GPUs (equivalent to FSDP1 SHARD_GRAD_OP) |
| Device mesh with sub-groups | Medium | Shards within a node mesh, replicates across nodes | Multi-node clusters where intra-node NVLink is fast; replaces FSDP1 HYBRID_SHARD |

Memory math for a 70B model: BF16 weights take 140GB. Gradients in BF16 add another 140GB. AdamW optimizer states (fp32 params + fp32 momentum + fp32 variance) add 840GB. Total naive memory: ~1,120GB across the full training stack. With reshard_after_forward=True across 8x H100 80GB GPUs and CPUOffloadPolicy for optimizer states, each GPU holds ~35GB in params and gradients (17.5GB each). That leaves ~45GB free for activations, which is comfortable for most batch sizes without gradient checkpointing. Without optimizer offload, the per-GPU footprint grows to ~140GB (adding 105GB of optimizer states sharded across 8 GPUs), which exceeds the 80GB VRAM limit and requires CPUOffloadPolicy (or NVMe offload) to stay within bounds. Gradient checkpointing reduces activation memory but has no effect on optimizer state footprint.
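
The arithmetic condenses into a small back-of-the-envelope helper (a sketch under the assumptions above: BF16 weights and gradients, FP32 AdamW states, activations and fragmentation ignored):

python
# Back-of-the-envelope per-GPU memory for fully sharded training (FSDP2 / ZeRO-3)
def per_gpu_memory_gb(params_billion: float, num_gpus: int, offload_optimizer: bool) -> float:
    weights = params_billion * 2      # BF16 weights: 2 bytes/param -> GB
    grads = params_billion * 2        # BF16 gradients
    optim = params_billion * 12       # FP32 master params + momentum + variance
    total = weights + grads + (0 if offload_optimizer else optim)
    return total / num_gpus           # everything is sharded across the group

print(per_gpu_memory_gb(70, 8, offload_optimizer=True))   # ~35 GB -> fits in 80 GB
print(per_gpu_memory_gb(70, 8, offload_optimizer=False))  # ~140 GB -> needs offload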

CPUOffloadPolicy moves optimizer states to CPU RAM between optimization steps. On a node with 2TB of system RAM (common on HGX configurations), this extends practical model size significantly at the cost of CPU-GPU transfer time during the optimizer step.

DeepSpeed ZeRO-3: Configuration and NVMe Offload

Choose ZeRO-3 over FSDP2 when: you are already using a DeepSpeed codebase, you need NVMe offload for models that exceed even CPU RAM, or you are using Hugging Face Accelerate (which has a mature ZeRO-3 integration path). The memory savings are equivalent to FSDP2 with reshard_after_forward=True. The API surface is different.

A production-ready ds_config.json:

json
{
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false
  },
  "gradient_clipping": 1.0,
  "bf16": { "enabled": true }
}

The offload_optimizer setting moves optimizer states to CPU RAM. The offload_param setting with device: "nvme" is ZeRO-Infinity: model parameters rest on NVMe storage and are brought into GPU VRAM only when needed. This lets you run models far larger than your aggregate GPU VRAM, at the cost of NVMe I/O throughput becoming the bottleneck. Bare-metal nodes with local NVMe (available on Spheron's dedicated GPU nodes) are required for this: a network-attached NFS mount is too slow.
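
For completeness, a minimal sketch of how this config is consumed in a training loop via deepspeed.initialize (the model class and dataloader names are placeholders):

python
# Illustrative ZeRO-3 training loop driven by the ds_config.json above
import deepspeed

model = MyTransformerModel(...)  # placeholder model class

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for batch in train_dataloader:           # placeholder dataloader
    loss = model_engine(batch)           # forward: params all-gathered on demand
    model_engine.backward(loss)          # backward: gradient reduce-scatter
    model_engine.step()                  # optimizer step on (offloaded) states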

ZeRO stage comparison:

| Stage | What is sharded | Per-GPU memory reduction | Communication overhead |
| --- | --- | --- | --- |
| ZeRO-1 | Optimizer states only | 4x at 64 GPUs | Minimal (scatter/gather only at optimizer step) |
| ZeRO-2 | Optimizer states + gradients | 8x at 64 GPUs | Moderate (gradient reduce-scatter) |
| ZeRO-3 | Optimizer states + gradients + parameters | 64x at 64 GPUs | High (param all-gather every forward/backward) |

ZeRO-3's communication overhead comes from reconstructing parameter shards on every forward and backward pass. This is why overlap_comm: true matters: DeepSpeed prefetches the next parameter shard while computing on the current one, hiding much of the all-gather latency behind useful compute. Disable this only if you are debugging correctness issues.

Megatron-Core 0.16+: Tensor, Pipeline, and Context Parallelism

Megatron-Core is the library layer underneath NVIDIA NeMo, NVIDIA's own pretraining stack, and several community projects. It provides the three parallelism modes you need for models above 70B on multi-node clusters.

Tensor Parallelism

TP splits individual weight matrices across GPUs. A self-attention layer with a 4096-dim QKV projection can be split across 8 GPUs: each GPU handles 512 of the 4096 columns, computes its partial output, and then an all-reduce synchronizes the result. This all-reduce happens within the NVLink domain (TP always stays within a node), so the bandwidth is 900 GB/s on H100 SXM5 rather than the 25-400 GB/s of the inter-node fabric.

The rule: tensor_model_parallel_size should not exceed the number of GPUs per node unless you accept cross-node TP. Keep TP at 4 or 8 for H100/H200 8-GPU nodes.
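
A single-process sketch of what the sharded matmul actually computes: the contracted dimension is split across a pretend 8-way TP group, each shard produces a partial output, and summing the partials stands in for the all-reduce a real implementation performs over NVLink (pure PyTorch, illustrative only):

python
# Single-process illustration of how tensor parallelism partitions a matmul
import torch

tp_degree = 8
x = torch.randn(4, 4096)     # (batch, hidden)
w = torch.randn(4096, 4096)  # full projection weight

y_full = x @ w               # reference: un-partitioned matmul

x_shards = x.chunk(tp_degree, dim=1)   # each (4, 512)
w_shards = w.chunk(tp_degree, dim=0)   # each (512, 4096)
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]

y_tp = torch.stack(partials).sum(dim=0)          # the "all-reduce" step
print(torch.allclose(y_full, y_tp, atol=1e-3))   # True up to float accumulation error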

Pipeline Parallelism

PP assigns contiguous layer groups to different GPU groups (stages). Stage 0 might hold layers 0-15, stage 1 holds layers 16-31, and so on. Forward activations flow forward through stages; backward gradients flow backward. The inefficiency is pipeline bubbles: stage 1 sits idle while stage 0 is processing the first microbatch.

Megatron-Core 0.16+'s interleaved 1F1B schedule (enabled via virtual_pipeline_model_parallel_size) interleaves multiple microbatches to reduce bubble ratio. With 8 pipeline stages and virtual_pipeline_model_parallel_size=2, the bubble ratio drops from ~12% to ~6%.
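
A rough sketch of the bubble arithmetic, using the standard 1F1B estimate of an idle fraction around (p-1)/m and treating interleaving as dividing it by the virtual-stage count (exact numbers depend on the schedule and microbatch timing):

python
# Approximate pipeline bubble fraction for 1F1B and interleaved 1F1B
def bubble_fraction(pp_stages: int, num_microbatches: int, virtual_pp: int = 1) -> float:
    # 1F1B bubble ~ (p - 1) / m; interleaving with v virtual stages divides it by v
    return (pp_stages - 1) / (virtual_pp * num_microbatches)

print(bubble_fraction(8, 64))                 # ~0.11 -> roughly 11-12% idle
print(bubble_fraction(8, 64, virtual_pp=2))   # ~0.05 -> roughly 5-6% idle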

Context Parallelism

CP splits the sequence dimension across GPUs. Instead of one GPU handling a full 32K-token sequence, 8 GPUs each handle 4K tokens. This is necessary for long-context training (32K+ sequences) where activation memory from attention matrices becomes prohibitive. Enable it by setting context_parallel_size to the number of GPUs that will share sequence chunks.

python
# Key Megatron-Core parallelism config variables
model_args = {
    "tensor_model_parallel_size": 8,     # within-node TP
    "pipeline_model_parallel_size": 4,   # across-node PP
    "context_parallel_size": 1,          # CP for long sequences
    "virtual_pipeline_model_parallel_size": 2,  # interleaved PP for lower bubble ratio
}

Recommended configurations:

| Model | GPUs | TP | PP | DP | CP | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 70B | 64x H100 | 8 | 4 | 2 | 1 | All TP within node, PP across nodes |
| 405B | 256x H100 | 8 | 8 | 2 | 2 | CP=2 for 32K+ sequences |
| 405B MoE | 256x H100 | 8 | 8 | 4 | 1 | Add EP for expert routing |
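
One sanity check worth baking into any launch script: the parallelism degrees must multiply out to the GPU count (Megatron-Core derives DP from the remaining ranks, so equivalently TP × PP × CP must divide the world size). A minimal sketch:

python
# Verify the parallelism degrees multiply out to the world size before launch
def check_parallelism(world_size: int, tp: int, pp: int, dp: int, cp: int = 1) -> None:
    required = tp * pp * dp * cp
    assert required == world_size, (
        f"TP({tp}) x PP({pp}) x DP({dp}) x CP({cp}) = {required}, "
        f"but the job has {world_size} GPUs"
    )

check_parallelism(world_size=64, tp=8, pp=4, dp=2)         # 70B config above
check_parallelism(world_size=256, tp=8, pp=8, dp=2, cp=2)  # 405B config above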

Launching Multi-Node Jobs: torchrun, SLURM, Ray Train, and SkyPilot

torchrun

The standard PyTorch launcher for multi-node jobs:

bash
torchrun \
  --nnodes $NUM_NODES \
  --node_rank $NODE_RANK \
  --nproc_per_node 8 \
  --master_addr $MASTER_ADDR \
  --master_port 29500 \
  train.py

NODE_RANK is 0 on the first node, 1 on the second, and so on. MASTER_ADDR is the IP of the rank-0 node. Run the same command on every node at roughly the same time; only the NODE_RANK value differs per node. For elastic training (dynamic node join/leave), add --rdzv_backend=c10d and --rdzv_endpoint=$MASTER_ADDR:29500.
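
On the script side, torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for every worker process; a minimal sketch of consuming them inside train.py:

python
# Minimal distributed setup inside train.py when launched via torchrun
import os
import torch
import torch.distributed as dist

def setup_distributed() -> int:
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index within this node
    torch.cuda.set_device(local_rank)
    # RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT are read from the environment
    dist.init_process_group(backend="nccl")
    return local_rank

if __name__ == "__main__":
    local_rank = setup_distributed()
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} on cuda:{local_rank}")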

SLURM

For managed cluster environments, SLURM handles the launch orchestration:

bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=48:00:00

# SLURM sets SLURM_JOB_NODELIST; extract the first node as master
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
  --nnodes $SLURM_NNODES \
  --node_rank $SLURM_NODEID \
  --nproc_per_node 8 \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  train.py

SLURM allocates one task per node; torchrun manages the 8 GPU worker processes within each node. SLURM_NODEID gives each node its rank automatically.

Ray Train

For clusters already running Ray, TorchTrainer provides a higher-level interface:

python
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=32, use_gpu=True),
)
trainer.fit()

Ray Train handles process group initialization and NCCL setup automatically. Useful when you are already using Ray for experiment management or hyperparameter search.
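
The train_func passed to TorchTrainer is an ordinary per-worker training loop; a hedged sketch (the model class and dataloader factory are placeholders, and prepare_model / prepare_data_loader wrap them for the distributed setup Ray configured):

python
# Illustrative per-worker training loop for Ray Train's TorchTrainer
import torch
from ray.train.torch import prepare_model, prepare_data_loader

def train_func(config):
    model = MyTransformerModel(...)                   # placeholder model class
    model = prepare_model(model)                      # move to GPU + distributed wrap
    loader = prepare_data_loader(build_dataloader())  # placeholder dataloader factory
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])

    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()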

SkyPilot

For launching multi-node jobs across cloud GPU providers, SkyPilot provides a YAML task definition:

yaml
resources:
  accelerators: H100:8
  cloud: aws
  region: us-east-1  # pick a region with IB-capable instances (e.g. p4d/p4de clusters)

num_nodes: 4

run: |
  torchrun \
    --nnodes $(echo "$SKYPILOT_NODE_IPS" | wc -l) \
    --node_rank $SKYPILOT_NODE_RANK \
    --nproc_per_node 8 \
    --master_addr $(echo "$SKYPILOT_NODE_IPS" | head -n 1) \
    --master_port 29500 \
    train.py

SkyPilot sets SKYPILOT_NODE_RANK and SKYPILOT_NODE_IPS automatically. Using $(echo "$SKYPILOT_NODE_IPS" | wc -l) for --nnodes derives the actual provisioned node count at launch time, so the value stays correct if you change num_nodes without touching the run block. To use InfiniBand-capable instances, set cloud and region to a provider/region that offers IB-enabled instance types (e.g. AWS p4d.24xlarge in us-east-1), then set NCCL_IB_DISABLE=0 in your run environment.

NCCL Tuning for Cloud GPU Clusters

NCCL sits between PyTorch and the network fabric. The right environment variables can double effective bandwidth; the wrong ones leave you at 30% of theoretical throughput with no visible error.

| Variable | Value | Purpose |
| --- | --- | --- |
| NCCL_IB_DISABLE | 0 | Enable InfiniBand RDMA (set to 1 to force Ethernet) |
| NCCL_IB_HCA | mlx5_0:1 | Select the HCA device (verify with ibstat) |
| NCCL_IB_GID_INDEX | 3 | RoCE GID index for IPv4 routing (RoCEv2) |
| NCCL_SOCKET_IFNAME | eth0 or ens5 | Ethernet interface for the control plane |
| NCCL_NET_GDR_READ | 1 | Enable GPUDirect RDMA reads |
| NCCL_TOPO_FILE | /path/to/topo.xml | Custom topology hints for unusual node configs |
| NCCL_ALGO | Tree | Tree-based algorithm, often better for small messages |
| NCCL_PROTO | Simple | Override protocol selection (Simple favors bandwidth; LL/LL128 favor small-message latency) |
| NCCL_BUFFSIZE | 16777216 | 16MB buffer; increase for high-bandwidth links |
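
These are normally exported in the launch environment (shell, SLURM prolog, or container spec); setting them from Python also works as long as it happens before the first NCCL communicator is created. A minimal sketch with an example HCA name (verify yours with ibstat):

python
# Example NCCL settings for an InfiniBand-connected H100 cluster.
# Must be set before dist.init_process_group() creates the first communicator.
import os

os.environ.setdefault("NCCL_IB_DISABLE", "0")        # use InfiniBand RDMA
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0:1")     # example HCA; confirm with ibstat
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # control-plane interface
os.environ.setdefault("NCCL_NET_GDR_READ", "1")      # GPUDirect RDMA reads
os.environ.setdefault("NCCL_DEBUG", "INFO")          # logs transport choice (NET/IB vs NET/Socket)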

FSDP2 prefetches the next parameter shard while computing on the current one, using NCCL all-gathers that overlap with the forward pass. DeepSpeed ZeRO-3's overlap_comm: true achieves the same effect for the backward pass. This overlap is what makes both frameworks viable at scale: without it, training throughput would be cut in half waiting for all-gather calls to complete.

Topology awareness: within a node, NCCL detects NVLink automatically via topology discovery and uses it for intra-node collective operations. Across nodes, it falls back to InfiniBand or Ethernet based on what the environment variables tell it. If NCCL is choosing Ethernet when InfiniBand is available (check NCCL_DEBUG=INFO output for NET/IB vs NET/Socket), the fix is usually setting NCCL_IB_DISABLE=0 and confirming the HCA name matches ibstat output.

For a detailed comparison of InfiniBand NDR, RoCEv2, and Spectrum-X for AI clusters, see GPU Networking for AI Clusters: InfiniBand vs RoCE vs Spectrum-X.

Cost Economics: Training 70B and 405B Models

Live pricing fetched from the Spheron API on 29 Apr 2026:

| GPU | On-Demand (/hr/GPU) | Spot (/hr/GPU) |
| --- | --- | --- |
| H100 SXM5 | $2.90 | $0.80 |
| H200 SXM5 | $3.69 | $1.19 |
| B200 SXM6 | $7.43 | $1.71 |
| A100 SXM4 80GB | $1.64 | $0.45 |

Training a 70B Model

A 70B BF16 model carries 140GB of weights. AdamW optimizer states in FP32 add another 840GB. Total memory pressure without sharding: ~1,120GB. FSDP2 (reshard_after_forward=True) or ZeRO-3 across 8x H100 80GB GPUs brings per-GPU footprint to roughly 35GB for params and gradients combined (17.5GB each). Adding sharded optimizer states without CPU offloading grows the total to ~140GB per GPU, which exceeds the 80GB VRAM limit; CPUOffloadPolicy (or NVMe offload) is required to move optimizer states off the GPU. Gradient checkpointing reduces activation memory but cannot eliminate the 105GB/GPU optimizer state footprint.

On a well-tuned 8x H100 SXM5 cluster with FSDP2 and flash attention, expect roughly 8,500-9,400 tokens/sec aggregate throughput. H100 SXM5 peaks at 989 TFLOPs BF16; a 70B model requires ~4.2×10¹¹ FLOPs/token (6 × parameter count); at 45-50% MFU across 8 GPUs that is ~8,500-9,400 tokens/sec. Pre-training at 1T tokens on 8 GPUs would take roughly 30,000 wall-clock hours. That scale needs a 1,000-16,000 GPU cluster (Meta used 16,384 H100s for Llama 3.1 70B). The 8x H100 node is the right tool for fine-tuning:

  • Fine-tuning on 1B tokens: ~30 hours wall clock
  • H100 cluster cost/hr: 8 × $2.90 = $23.20 on-demand, 8 × $0.80 = $6.40 spot
  • On-demand cost for 1B-token fine-tune: $23.20 × 30h = ~$696
  • Spot cost: $6.40 × 30h = ~$192

For full fine-tuning (not pre-training from scratch) on a supervised dataset, token counts are much smaller. A 50K-example dataset at 2K tokens each is 100M tokens total, finishing in roughly 3 hours on 8x H100s. For framework comparisons, see the guide on FSDP2 fine-tuning frameworks and the end-to-end fine-tuning workflow guide.
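
The throughput and cost arithmetic above condenses into a small estimator (a sketch under the stated MFU and pricing assumptions; real throughput depends on sequence length, parallelism config, and interconnect):

python
# Back-of-the-envelope throughput and cost for a fine-tuning run
def finetune_cost(params_billion: float, tokens: float, num_gpus: int,
                  peak_tflops: float, mfu: float, price_per_gpu_hr: float):
    flops_per_token = 6 * params_billion * 1e9                        # ~6 x params
    tokens_per_sec = num_gpus * peak_tflops * 1e12 * mfu / flops_per_token
    hours = tokens / tokens_per_sec / 3600
    return tokens_per_sec, hours, hours * num_gpus * price_per_gpu_hr

# 70B on 8x H100 SXM5 (989 TFLOPs BF16), 1B tokens, on-demand $2.90/GPU/hr
print(finetune_cost(70, 1e9, 8, 989, 0.475, 2.90))  # ~8,900 tok/s, ~31 h, ~$720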

Training a 405B Model

A 405B model requires at least 32x H100 80GB with ZeRO-3, or 16x H200 141GB with ZeRO-3. The H200's larger VRAM and higher memory bandwidth make it the cleaner option when available.

With Megatron-Core 3D parallelism on 32x H100 (TP=8, PP=4), a 405B model requires ~2.43×10¹² FLOPs/token; at 35-40% MFU on 32 H100s that is ~4,600-5,200 tokens/sec aggregate throughput. For fine-tuning on a 1B-token dataset:

  • Wall clock: ~50-60 hours
  • H100 cluster cost/hr: 32 × $2.90 = $92.80 on-demand
  • On-demand cost for 1B-token fine-tune: $92.80 × 55h midpoint = ~$5,104

At 4+ weeks of continuous compute (pre-training at 15T tokens), reserved pricing makes the economics significantly better than on-demand rates. Reference current GPU pricing for reserved cluster rates.

Pricing fluctuates based on GPU availability. The prices above are based on 29 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Deploying Multi-Node Training on Spheron

Spheron's GPU fleet includes bare-metal H100 SXM5, H200, and B200 nodes interconnected via InfiniBand or RoCE fabric, available through data center partners globally. No shared tenancy on the GPU layer. The node is dedicated to your workload for the duration of the rental.

For multi-node training jobs that run days to weeks, you want bare-metal nodes with a high-bandwidth interconnect, not shared cloud instances with shared NICs. H100 SXM5 clusters on Spheron come with NVLink within the node and InfiniBand between nodes on reserved configurations, which is the setup the cost numbers above assume.

No long-term commitment is required for short runs. Spot pricing cuts costs significantly for jobs that can checkpoint and resume: FSDP2 and DeepSpeed both support periodic checkpoint saves. H200 on Spheron is available for jobs where the 141GB HBM3e per card changes the memory math enough to avoid ZeRO-3 altogether.

For performance comparisons across networking configurations, see Multi-Node GPU Training Without InfiniBand: Tradeoffs and Cost Analysis.

B200 instances for pre-training are now available on Spheron for workloads that can benefit from Blackwell's FP4 support and higher HBM3e bandwidth. At $1.71/hr spot, B200 offers strong value for jobs that checkpoint regularly and can tolerate preemption. On-demand at $7.43/hr reflects Blackwell's significant hardware premium over H100, so spot is the right default unless guaranteed availability is required for a long run. For a full breakdown of when the B200 hardware premium pays off, see the NVIDIA B200 complete guide.

Ready to run multi-node distributed training without a long-term cluster commitment? Spheron provides bare-metal H100, H200, and B200 nodes with InfiniBand fabric, available on-demand or on reserved terms.

Rent H100 clusters → | Rent H200 → | View all pricing →
