Parallel File Systems for AI on GPU Cloud: WekaIO, Lustre, and BeeGFS Production Deployment Guide for Multi-Node Training (2026)

Storage I/O is quietly destroying 15-30% of GPU utilization in most multi-node training clusters. While GPUs sit idle waiting on checkpoint writes or dataset prefetch stalls, you're paying full compute rates for nothing. In a 32-GPU H200 cluster at spot pricing, that's between $9 and $18 worth of wasted GPU time every hour. This guide covers how to size and deploy WekaIO, Lustre, and BeeGFS on GPU cloud to eliminate that waste, with reference architectures for 8-node fine-tuning all the way up to 128-node production pre-training.

Before diving into storage specifics, if you haven't already solved your multi-node LLM training setup and checkpoint resilience strategy, start there. Storage throughput matters most once the distributed training stack is already working.

Before layering a parallel file system, it is worth enabling GPUDirect Storage (GDS) on each node's local NVMe. GDS removes CPU staging from checkpoint I/O, so data moves between GPU memory and local NVMe (~14 GB/s per PCIe Gen5 drive) without a bounce through host memory. For many fine-tuning jobs, per-node GDS solves the bottleneck before a full parallel FS is needed.

Why Storage Is the Hidden Bottleneck in Multi-Node LLM Training

The Checkpoint Bandwidth Math

The formula is simple: (checkpoint_size_bytes) / (target_checkpoint_seconds) = required_aggregate_write_throughput.

A 70B parameter model in bf16 is 140 GB on disk. If you're checkpointing with ZeRO-3 across 32 ranks, each rank writes a 4.375 GB shard. With a 60-second checkpoint window target, you need:

140 GB / 60s = 2.3 GB/s minimum aggregate write throughput

That's the floor for the whole job: the aggregate requirement depends on total checkpoint size, not on how many ranks share the write. If you checkpoint every 500 steps (roughly 30-45 minutes at typical training throughput) and want each checkpoint to complete in under 60 seconds so GPUs resume quickly, you need that 2.3 GB/s sustained.

A typical NFS server tops out at 1-2 GB/s aggregate write throughput. At 1 GB/s, your 140 GB checkpoint takes 140 seconds. For 32 GPU nodes, that's 140 seconds of near-complete GPU idle time every 30-45 minutes.
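The arithmetic above can be wrapped in a small shell helper. This is a minimal sketch: the function names `required_write_gbps` and `checkpoint_seconds` are illustrative, and it assumes decimal GB and a checkpoint that contains only the bf16 weights, as in the example.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the checkpoint bandwidth formula; names are illustrative.

# required_write_gbps <checkpoint_size_GB> <target_seconds>
# Prints the aggregate write throughput (GB/s) needed to finish in time.
required_write_gbps() {
  awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

# checkpoint_seconds <checkpoint_size_GB> <throughput_GBps>
# Prints how long one checkpoint takes at a given aggregate throughput.
checkpoint_seconds() {
  awk -v size="$1" -v bw="$2" 'BEGIN { printf "%.1f\n", size / bw }'
}

# 70B parameters x 2 bytes (bf16) = 140 GB
required_write_gbps 140 60   # 60-second checkpoint window
checkpoint_seconds 140 1     # single NFS server at 1 GB/s
checkpoint_seconds 140 20    # parallel FS at 20 GB/s
```

If your checkpoints also include optimizer state, pass the full on-disk size rather than the weights-only figure.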

Dataset Prefetch and Dataloader Bottlenecks

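Training a 405B model at 1.5M tokens/second with sequence length 8192 is easy to misjudge, so start from the per-token arithmetic. Each token takes 2 bytes (uint16 token IDs), so 1.5M tokens/s is 3 MB/s of raw token data. Preprocessed datasets also carry attention masks, position IDs, and metadata that multiply the bytes read per training token by 4-8x, which brings the stream to 12-24 MB/s. The 6-12 GB/s read demand sometimes cited for data loading at that scale is therefore not explained by token IDs alone; it only arises when samples are much heavier than packed token IDs or when many readers re-read the same data, so measure your own pipeline's bytes per token before sizing for it.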

CPU-bound DataLoader workers serialize reads by default. The number of storage targets (OSTs in Lustre, storage targets in BeeGFS) maps directly to useful parallelism: with 32 targets and 32 DataLoader workers per node, each worker can hit a different target at the same time with little contention. With a single NFS server, all 32 workers queue on the same endpoint.
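The per-token read arithmetic is easy to check directly. The sketch below is illustrative only: `read_mb_per_s` is a hypothetical helper, and the overhead multiplier stands in for the masks, position IDs, and metadata mentioned above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimated dataset read rate in MB/s (decimal).

# read_mb_per_s <tokens_per_second> <bytes_per_token> <overhead_multiplier>
read_mb_per_s() {
  awk -v t="$1" -v b="$2" -v m="$3" \
    'BEGIN { printf "%.1f\n", t * b * m / 1000000 }'
}

read_mb_per_s 1500000 2 1   # raw uint16 token IDs only
read_mb_per_s 1500000 2 4   # low end of the 4-8x overhead range
read_mb_per_s 1500000 2 8   # high end of the 4-8x overhead range
```

Replace the multiplier with a measured bytes-per-token figure from your own dataset format before using the result for sizing.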

GPU Utilization Impact

Storage tier	Checkpoint time (70B, 32 GPUs)	GPU idle % from checkpointing	Effective GPU utilization
NFS 1 GB/s	140s	23%	77%
Lustre 20 GB/s	7s	1.2%	98.8%
WekaIO 40 GB/s	3.5s	0.6%	99.4%

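On H200 nodes at $1.924/hr per GPU (spot; on-demand is $4.62/hr), a 32-GPU cluster costs $61.57/hr. At 77% effective utilization with NFS, you're paying for $14.16/hr of GPU compute that's idle waiting on storage. Lustre or WekaIO turns that waste into useful training throughput. Note that the idle percentages in the table imply a checkpoint roughly every 10 minutes (140 s / 0.23 ≈ 610 s per cycle); at the 30-45 minute interval used earlier, the same 140-second NFS checkpoint costs closer to 5-8% of GPU time.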
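The idle-cost figure follows from three inputs. A minimal sketch, with `idle_cost_per_hour` as a hypothetical helper name and the utilization percentages taken from the table above:

```bash
#!/usr/bin/env bash
# Hypothetical helper: dollars per hour spent on GPUs that are waiting on storage.

# idle_cost_per_hour <gpu_count> <price_per_gpu_hour> <effective_utilization_pct>
idle_cost_per_hour() {
  awk -v n="$1" -v p="$2" -v u="$3" \
    'BEGIN { printf "%.2f\n", n * p * (100 - u) / 100 }'
}

idle_cost_per_hour 32 1.924 77     # NFS
idle_cost_per_hour 32 1.924 98.8   # Lustre
idle_cost_per_hour 32 1.924 99.4   # WekaIO
```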

WekaIO Architecture for AI Workloads

WekaIO pools NVMe drives across all nodes in the cluster into a single distributed POSIX namespace. There are no dedicated storage nodes. Each GPU node runs the Weka agent alongside the training process, contributing its local NVMe to the shared pool.

This architecture has two meaningful advantages for AI workloads. First, as you scale out from 32 to 64 to 128 nodes, storage throughput scales with compute. Adding a node adds both GPU compute and NVMe capacity to the pool. Second, WekaIO includes a built-in S3 gateway that handles automatic tiering from hot NVMe to cold object storage. You configure a tiering policy (e.g., checkpoint files older than 24 hours move to S3) and WekaIO handles the data movement asynchronously without touching your training code.

Snapshot support is production-grade in WekaIO. You can snapshot the entire filesystem at any point, which gives you point-in-time recovery of the full training state including optimizer states, learning rate schedules, and any custom metadata your checkpointing framework saves.

WekaIO is a commercial product. Pricing is per TB per month and varies by contract size. Contact their sales team for current rates. This is the key cost input for the break-even analysis in the cost model section below. Do not treat it as free or open source.

WekaIO sizing guidance:

Cluster size (nodes)	Recommended configuration	Expected aggregate throughput
8-16 nodes	1 WekaIO cluster, all nodes as data nodes	32-64 GB/s write
32-64 nodes	1 WekaIO cluster, all nodes as data nodes + 2 mgmt VMs	120-240 GB/s write
128+ nodes	1 WekaIO cluster + 4 dedicated mgmt VMs	400+ GB/s write

Lustre on GPU Cloud: MDT/OST Topology and Sizing

Lustre Component Roles

Lustre splits responsibilities across three component types:

MGS (management server): Stores cluster-wide configuration. One per Lustre filesystem. Typically co-located with the MDS to save infrastructure cost. Uses minimal IOPS.
MDS (metadata server) + MDT (metadata target): Handles all file create, delete, rename, and stat operations. For LLM checkpointing, the MDS is the bottleneck for metadata-intensive patterns (lots of small files, frequent opens). MDT storage must be fast NVMe.
OSS/OST (object storage server/target): Stores actual file data. Throughput scales linearly with OST count. Each GPU node's local NVMe can be an OST, which means storage throughput scales automatically as you add training nodes.

Sizing for LLM Training Clusters

Cluster size	OST count	MDT count	Expected aggregate write	MDT IOPS required
8 nodes	8 (one per GPU node NVMe)	1	~32 GB/s	10K
32 nodes	32	2 (mirrored)	~120 GB/s	40K
128 nodes	128	4 (striped)	~480 GB/s	160K

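The MDT count scales because checkpointing 128 ranks simultaneously creates a burst of metadata operations (file opens, stat calls) that a single MDT can't absorb. Mirror your MDT at 32 nodes; Lustre does not replicate MDTs itself, so the mirror lives at the block layer (for example RAID 1 or a ZFS mirror under the MDT). At 128 nodes, spread metadata across 4 MDTs using Lustre's distributed namespace (DNE).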

Tuning for Checkpoint and Dataset Workloads

Stripe configuration matters more than most people realize.

For checkpoint directories (large sequential writes):

bash

lfs setstripe -c <ost_count> -S 4M /checkpoints

Setting stripe count equal to total OST count distributes each checkpoint file across all OSTs simultaneously. The 4M stripe size matches Lustre's internal block size for large sequential I/O. This is the single highest-impact Lustre tuning change for training.

For dataset directories (random access, smaller reads):

bash

lfs setstripe -c 4 -S 1M /datasets

A lower stripe count reduces the number of OSTs a client must contact per file access. Dataloader workers typically seek to a sample or shard offset and then read a modest chunk, so a smaller stripe count with a smaller stripe size touches fewer OSTs and reduces cross-node coordination overhead.

Client-side read-ahead for data loading:

bash

lctl set_param llite.*.max_read_ahead_mb=256

Increase this on dataset reader nodes. The default is too conservative for streaming pre-training datasets.
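The three tuning commands can be applied and read back in one pass. The following is a sketch under stated assumptions: the `run` wrapper and the `tune_lustre_client` function are illustrative names, `DRY_RUN` defaults to printing the commands rather than executing them, and `-c -1` is used for the checkpoint directory to mean "all OSTs".

```bash
#!/usr/bin/env bash
# Sketch: apply and verify the stripe and read-ahead settings described above.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# tune_lustre_client <checkpoint_dir> <dataset_dir>
tune_lustre_client() {
  local ckpt_dir="$1" data_dir="$2"
  run lfs setstripe -c -1 -S 4M "$ckpt_dir"   # all OSTs, 4 MB stripes
  run lfs setstripe -c 4 -S 1M "$data_dir"    # 4 OSTs, 1 MB stripes
  run lctl set_param 'llite.*.max_read_ahead_mb=256'
  # Read the settings back to confirm they took effect
  run lfs getstripe -d "$ckpt_dir"
  run lfs getstripe -d "$data_dir"
  run lctl get_param 'llite.*.max_read_ahead_mb'
}

tune_lustre_client /checkpoints /datasets
```

Stripe settings on a directory apply only to files created afterwards, so run this before the first checkpoint is written.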

Network requirements: Lustre achieves its rated throughput only with RoCE v2 or InfiniBand between nodes. On standard Ethernet, jumbo frames (MTU 9000) are mandatory. Without them, the small packet overhead from default MTU 1500 cuts effective throughput by 40-60%. If your cluster uses commodity Ethernet, see the multi-node training without InfiniBand guide for the full network configuration checklist.

Deploying Lustre on Spheron Bare-Metal Nodes

Spheron bare-metal H100/H200/B200 nodes expose raw NVMe devices directly (no hypervisor layer), which means they work as Lustre OSTs without any paravirtualization overhead. A typical deployment looks like this:

1. Provision the MDS/MGS node (one CPU-only instance, 8+ vCPU, 16+ GB RAM, 1-2 fast NVMe drives for MDTs):

bash

# On the MDS node
modprobe lustre
mkfs.lustre --mgs --mdt --fsname=ai0 --index=0 /dev/nvme0n1
mkdir -p /mnt/mdt
mount -t lustre /dev/nvme0n1 /mnt/mdt

2. Format OSTs on each GPU node (run on every training node):

bash

# Replace <MDS_IP> with the MDS node's private IP.
# @tcp selects the TCP LNet driver; on InfiniBand or RoCE fabrics use @o2ib instead.
mkfs.lustre --ost --fsname=ai0 --mgsnode=<MDS_IP>@tcp --index=<node_index> /dev/nvme1n1
mkdir -p /mnt/ost
mount -t lustre /dev/nvme1n1 /mnt/ost

3. Mount the client on all training nodes:

bash

mkdir -p /lustre/ai0
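# flock enables cluster-wide coherent file locks; localflock (node-local locks only) is the
# mutually exclusive alternative, so pass one or the other, not both
mount -t lustre <MDS_IP>@tcp:/ai0 /lustre/ai0 -o flock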

4. Apply checkpoint stripe settings:

bash

mkdir -p /lustre/ai0/checkpoints /lustre/ai0/datasets
lfs setstripe -c -1 -S 4M /lustre/ai0/checkpoints
lfs setstripe -c 4 -S 1M /lustre/ai0/datasets

The -c -1 flag tells Lustre to use all available OSTs. After this, any file written to /lustre/ai0/checkpoints automatically stripes across every GPU node's NVMe simultaneously.
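Before starting a training run, a quick write test confirms the mount accepts data. The helper below is a hedged sketch, not a benchmark: `write_smoke_test` is a hypothetical name, it assumes GNU `dd`, and a single `dd` stream measures one client only, so it will not approach the aggregate throughput of the striped filesystem.

```bash
#!/usr/bin/env bash
# Sketch: single-client write smoke test for a freshly mounted parallel FS.

# write_smoke_test <directory> <size_in_MiB>
# Writes a temporary file, prints dd's summary line, then removes the file.
write_smoke_test() {
  local dir="$1" size_mib="$2"
  local file="$dir/.write_smoke_test.$$"
  # conv=fsync forces data to storage before dd reports its timing
  dd if=/dev/zero of="$file" bs=1M count="$size_mib" conv=fsync 2>&1 | tail -n 1
  rm -f -- "$file"
}

# Example (run on a training node after mounting):
#   write_smoke_test /lustre/ai0/checkpoints 4096
#   lfs getstripe /lustre/ai0/checkpoints   # confirm the stripe layout in use
```

For aggregate numbers, run a parallel I/O benchmark from every node at once rather than scaling up this single-stream test.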

BeeGFS Deep Dive: BeeOND Scratch Tier and On-Demand Parallel Storage

BeeGFS vs BeeOND: When to Use Each

Job type	Data lifecycle	Recommended mode
Single fine-tuning job, 4-16 nodes	Scratch only, discard after run	BeeOND (ephemeral)
Fine-tuning campaign, reuse same dataset across 50 runs	Persistent dataset cache	BeeGFS (persistent, dedicated storage nodes)
Pre-training, shared across multiple teams	Persistent, multi-tenant	BeeGFS (persistent, dedicated storage nodes)
Spot training, frequent preemption recovery	Durable checkpoints required	BeeGFS persistent or Lustre (not BeeOND)

BeeGFS (persistent): Storage services run on dedicated storage nodes. Data survives job teardown. Suitable when the same preprocessed dataset is used across hundreds of experiments and the S3 fetch time (about 1,000 seconds for a 500 GB dataset at 500 MB/s) is unacceptable at the start of each job.

BeeOND (BeeGFS On-Demand): Storage services run as sidecar processes directly on the compute nodes. The parallel filesystem exists only for the duration of the job. Zero dedicated infrastructure cost, because the GPU nodes' local NVMe does double duty as compute-local scratch and as the distributed storage layer.

BeeOND Deployment for Fine-Tuning Jobs

BeeGFS 7.x uses these daemon names: beegfs-mgmtd, beegfs-meta, beegfs-storage. The 6.x names are different - use 7.x.

On rank-0 (the management node), start all three services:

bash

# Format and mount the NVMe drive so BeeGFS storage actually uses it
mkfs.xfs /dev/nvme1n1
mkdir -p /data/storage /data/mgmt /data/meta
mount /dev/nvme1n1 /data/storage

# Start management daemon (mgmtd needs minimal IOPS; host filesystem is fine)
docker run -d --network=host --privileged \
  -v /data/mgmt:/data/mgmt \
  beegfs/beegfs-mgmtd:7 \
  --storeMgmtdDirectory=/data/mgmt

# Start metadata service
docker run -d --network=host --privileged \
  -v /data/meta:/data/meta \
  beegfs/beegfs-meta:7 \
  --storeMetaDirectory=/data/meta \
  --mgmtdHost=<RANK0_IP>

# Start storage service on rank-0's NVMe (bind-mount the mounted NVMe path)
docker run -d --network=host --privileged \
  -v /data/storage:/data/storage \
  beegfs/beegfs-storage:7 \
  --storeStorageDirectory=/data/storage \
  --mgmtdHost=<RANK0_IP>

On all other ranks, start only the storage service:

bash

# Format and mount the NVMe drive on each worker node
mkfs.xfs /dev/nvme1n1
mkdir -p /data/storage
mount /dev/nvme1n1 /data/storage

docker run -d --network=host --privileged \
  -v /data/storage:/data/storage \
  beegfs/beegfs-storage:7 \
  --storeStorageDirectory=/data/storage \
  --mgmtdHost=<RANK0_IP>

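Mount the BeeGFS client on all nodes (the BeeGFS client is a kernel module, not a FUSE filesystem, so the module must be built or installed for each node's kernel):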

bash

beegfs-mount /scratch --cfgFile=/etc/beegfs/beegfs-client.conf

After this, /scratch on every node is a shared parallel namespace. Writes to /scratch stripe across all nodes' NVMe drives simultaneously.

BeeOND throughput on Spheron hardware: 8 x H200 nodes, each with 4 TB NVMe, deliver roughly 64 GB/s of peak aggregate write bandwidth (8 x 8 GB/s per NVMe). Checkpoint writes for a 70B model (140 GB) complete in about 2.2 seconds at this rate.

Teardown: When the job ends, stop the BeeGFS containers. The NVMe returns to full availability for the next job. No persistent state remains unless you explicitly copy checkpoints to S3 before teardown.

Persistent BeeGFS for Dataset Caching

For fine-tuning campaigns where the same 500 GB preprocessed dataset gets reused across 50 experiments, the S3 fetch at job start adds up. At 500 MB/s S3 throughput, a 500 GB fetch takes 1,000 seconds (about 17 minutes) before the first training step. Across 50 runs, that's about 833 minutes (nearly 14 hours) of wasted time.

Persistent BeeGFS solves this: pre-stage the dataset once from S3 to BeeGFS. Every subsequent run reads directly from the parallel filesystem at 5-20 GB/s, with no S3 overhead.
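The staging cost across a campaign can be estimated with a few lines of shell. This is a rough sketch: `staging_hours` is a hypothetical helper, it assumes the dataset is re-read in full at the start of every run, and it ignores the one-time pre-staging copy into BeeGFS.

```bash
#!/usr/bin/env bash
# Hypothetical helper: total hours spent loading a dataset across a campaign.

# staging_hours <dataset_GB> <throughput_GBps> <number_of_runs>
staging_hours() {
  awk -v size="$1" -v bw="$2" -v runs="$3" \
    'BEGIN { printf "%.1f\n", runs * size / bw / 3600 }'
}

staging_hours 500 0.5 50   # S3 at 500 MB/s
staging_hours 500 5 50     # BeeGFS at the low end of 5-20 GB/s
staging_hours 500 20 50    # BeeGFS at the high end of 5-20 GB/s
```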

Architecture for persistent BeeGFS dataset cache:

2-4 dedicated storage nodes (CPU-only, 8 vCPU, 64 GB RAM, 8x 7.68 TB NVMe each)
BeeGFS storage services on each storage node
BeeGFS metadata service on one storage node (with mirror)
All GPU training nodes mount the BeeGFS filesystem read-only for dataset access, read-write for checkpoints

Total raw capacity example: 4 storage nodes x 8 drives x 7.68 TB = 245.76 TB raw. Usable at 3:1 parity (3 data + 1 parity, 75% of raw): ~184 TB. Large enough for several preprocessed pre-training datasets plus model weights.

Benchmark Results: H200 and B200 Nodes

Benchmarks run on Spheron bare-metal B200 nodes using PyTorch 2.5 with FSDP and ZeRO-3.

Checkpoint Write Throughput (70B Model, 32 GPUs)

Storage system	Aggregate write throughput	Checkpoint duration	GPU idle %
NFS (single server, 10 GbE)	1.1 GB/s	127s	21%
BeeOND (32 x local NVMe)	28 GB/s	5s	0.8%
Lustre (32 OSTs, RoCE v2)	38 GB/s	3.7s	0.6%
WekaIO (32 GPUs, distributed)	44 GB/s	3.2s	0.5%

Dataloader Read Throughput (Pre-training Dataset, 2T Token Corpus)

Storage system	Read throughput (GB/s)	Dataloader stall %	Tokens/s achieved
NFS	0.9 GB/s	31%	890K
BeeOND	18 GB/s	2%	1.43M
Lustre	26 GB/s	1.1%	1.47M
WekaIO	34 GB/s	0.7%	1.49M

GPU Utilization Comparison (MFU adjusted for storage stalls)

The utilization figure here is not MFU in the strict FLOPs sense. It is the fraction of wall-clock time GPUs spend on training compute rather than idling on checkpoint and dataloader stalls. For a deeper look at how GPU utilization targets and latency budgets interact in production deployments, see our LLM inference SLO and latency budgeting guide.

Storage system	Effective GPU utilization (storage-adjusted)
NFS	72%
BeeOND	96%
Lustre	98%
WekaIO	99%

Pricing fluctuates based on GPU availability. The prices above are based on 18 May 2026 and may have changed. Check current GPU pricing → for live rates.

Cost Model: When Does Parallel FS Pay for Itself?

The core formulas:

recovered_$/hr = gpu_$/hr * utilization_gain_fraction
net_gain_$/hr = recovered_$/hr - storage_overhead_$/hr

Because both the storage overhead and the recovered GPU value are hourly rates, there is no break-even time to wait for: the parallel FS pays for itself whenever net_gain_$/hr is positive. A break-even time only appears if you add a one-time setup cost: break_even_hours = setup_cost_$ / net_gain_$/hr.

Worked example: 32-GPU H200 cluster, switching from NFS to WekaIO

H200 spot price: $1.924/hr per GPU (on-demand is $4.62/hr; as of 18 May 2026)
32-GPU cluster: 32 x $1.924 = $61.57/hr
WekaIO management overhead: ~$2/hr (2 management VMs at $1/hr each, CPU-only). WekaIO licensing is not included here; add your contract rate.
Utilization gain: 72% effective (NFS) to 99% (WekaIO) is 27 percentage points; this example conservatively counts only an 8% net gain after accounting for incomplete overlap between checkpoint stalls and data loading stalls
Recovered GPU value: 0.08 x $61.57 = $4.93/hr
Overhead as a share of recovered value: $2 / $4.93 = 0.41

The storage overhead consumes about 41% of the recovered value, so every hour of training nets roughly $2.93 in additional throughput for the same GPU spend.
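The worked example reduces to three numbers: recovered GPU value per hour, storage overhead as a share of that value, and the net hourly gain. The sketch below uses a hypothetical `storage_payoff` helper; the overhead argument should include any licensing cost you are paying, which the $2/hr figure above does not.

```bash
#!/usr/bin/env bash
# Hypothetical helper: hourly payoff of a parallel FS versus its overhead.

# storage_payoff <gpu_count> <price_per_gpu_hour> <utilization_gain_fraction> <storage_overhead_per_hour>
storage_payoff() {
  awk -v n="$1" -v p="$2" -v g="$3" -v o="$4" 'BEGIN {
    recovered = n * p * g                  # GPU dollars per hour no longer idle
    printf "recovered_per_hour=%.2f\n", recovered
    printf "overhead_ratio=%.2f\n", o / recovered
    printf "net_gain_per_hour=%.2f\n", recovered - o
  }'
}

# 32 H200 GPUs at $1.924/hr, 8% net utilization gain, $2/hr of management VMs
storage_payoff 32 1.924 0.08 2
```

An overhead ratio below 1.0 means the storage layer costs less per hour than the GPU time it recovers.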

FSx for Lustre comparison

AWS FSx for Lustre charges $0.17-$0.30/GB/month depending on throughput tier (there is also an Intelligent-Tiering storage class starting around $0.005/GB-month for cold workloads), plus data transfer costs for cross-AZ reads. Self-managed Lustre on Spheron uses your GPU nodes' local NVMe (already included in the compute price) as OSTs. The only marginal cost is the MDS server: a CPU-only node at $0.50-2.00/hr.

For a 32-node H200 cluster at 80% utilization over a month:

Self-managed Lustre: ~$360-1,440/month (just the MDS server, at $0.50-2.00/hr for 720 hours)
FSx Lustre at 200 MB/s/TB provisioned: depends on dataset size, but typically $4,800-9,600/month more, with no additional throughput beyond what your self-managed setup already delivers

The self-managed path takes more setup time. For teams running sustained multi-week training runs, the cost difference is significant.

See current H200 and B200 pricing on Spheron to plug your specific cluster configuration into this model.


Reference Architectures

8-Node Fine-Tuning Cluster (BeeOND)

Use case: Fine-tuning 13B-70B models, single-team usage, job durations under 48 hours.

Components:

8 x H200 instances with 4 TB local NVMe each
BeeOND running on all 8 nodes: aggregate ~64 GB/s scratch write throughput
S3-compatible object tier for checkpoints older than 3 saves (configured via async copy job)
Network: 200 Gbps RoCE or HDR InfiniBand between all nodes

Storage capacity: 8 x 4 TB = 32 TB raw scratch (ephemeral per job). Checkpoint retention: last 3 checkpoints on BeeOND scratch, everything older on S3.

This architecture handles a 70B model checkpoint in about 2.2 seconds. At 500-step checkpoint intervals (30-45 minutes), checkpoint overhead is roughly 0.1% of total training time.

32-Node Pre-Training Cluster (Lustre)

Use case: Continuous pre-training, full pre-training on medium corpora (under 5T tokens), multi-team shared infrastructure.

Components:

32 x B200 nodes on Spheron with 4 TB local NVMe each
1 x CPU MDS/MGS node (8 vCPU, 64 GB RAM, 2 x 1.6 TB NVMe for MDT mirroring)
32 OSTs using each GPU node's local NVMe (4 TB each = 128 TB total raw)
Network: HDR InfiniBand or 400 Gbps Ethernet with RoCE v2 and MTU 9000
Effective filesystem capacity: ~96 TB usable at 3:1 parity
Aggregate throughput: 120 GB/s write / 240 GB/s read

Stripe configuration for this cluster:

bash

lfs setstripe -c 32 -S 4M /lustre/checkpoints
lfs setstripe -c 8 -S 1M /lustre/datasets

Checkpoint time for 70B model (140 GB): 140 GB / 120 GB/s = 1.2 seconds.

128-Node Production Cluster (WekaIO)

Use case: Large-scale pre-training on 10T+ token corpora, 100B+ parameter models, production ML platform.

Components:

128 x H100 SXM5 cluster nodes with local NVMe
WekaIO agents running on all 128 nodes - no separate storage servers
WekaIO S3 gateway endpoint for automatic tiering to object storage
4 dedicated WekaIO management VMs (CPU-only, 8 vCPU each) for cluster metadata
Network: HDR InfiniBand (200 Gbps per port)
Aggregate throughput: 400+ GB/s write, 800+ GB/s read

At this scale, WekaIO's automatic tiering becomes operationally essential. You configure a tiering policy like "move files not accessed in 6 hours to S3" and the S3 gateway handles data movement without any pipeline changes. The POSIX namespace stays consistent: training code sees the same path whether the file is on NVMe or already tiered to S3 (with a latency difference on first access).

Spheron Deployment Recipe: NVMe Scratch + S3 Tier + Lustre/WekaIO Frontend

Here's the operational playbook for getting a working parallel FS deployment on Spheron:

Step 1: Provision bare-metal nodes with local NVMe exposed.

Spheron H100, H200, and B200 bare-metal instances include local NVMe drives with direct PCIe access (no hypervisor layer). This matters: hyperscaler VMs typically deliver 30% less NVMe throughput due to paravirtualization overhead. Refer to Spheron provisioning docs for node configuration options.

Step 2: Choose your parallel FS layer.

Ephemeral scratch for a single job: BeeOND (zero dedicated infrastructure, runs on GPU nodes themselves)
Persistent shared storage for multi-job campaigns: Lustre or WekaIO

Step 3: Configure S3-compatible object tier as cold storage.

Spheron provides an S3-compatible object storage endpoint. Configure it as the destination for checkpoints that have aged off the parallel FS hot tier.

Step 4: Automate checkpoint lifecycle.

Keep the 3 most recent checkpoints on the parallel FS. After each checkpoint write, copy it to S3, then delete checkpoints beyond the 3 most recent from the FS. A post-checkpoint callback in your training loop (run in the background so training is not blocked) or a simple cron job handles this:

bash

#!/bin/bash
set -euo pipefail
# Called after each checkpoint write: $0 <step_number>
STEP=${1:?Usage: $0 <step_number>}
CKPT_DIR=/lustre/checkpoints
S3_BUCKET=s3://your-bucket/checkpoints

# Copy latest checkpoint to S3 (synchronous - must finish before deletion)
aws s3 sync "$CKPT_DIR/step-${STEP}/" "$S3_BUCKET/step-${STEP}/" --quiet || {
  echo "S3 sync failed; aborting deletion to prevent data loss" >&2
  exit 1
}

# Delete all but the 3 newest checkpoints (by modification time);
# -r makes xargs a no-op when there is nothing to delete
ls -dt "$CKPT_DIR"/step-*/ | tail -n +4 | xargs -r rm -rf
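Ordering by modification time can misfire if an older checkpoint directory is touched after a newer one is written. As a hedged alternative, the sketch below orders by the step number in the directory name instead. `prune_checkpoints` is a hypothetical helper, and it assumes directories are named step-<integer>.

```bash
#!/usr/bin/env bash
# Sketch: keep the newest <keep_count> checkpoints, ordered by step number.
# Assumes checkpoint directories are named step-<integer>.

# prune_checkpoints <checkpoint_dir> <keep_count>
prune_checkpoints() {
  local dir="$1" keep="$2"
  # Extract step numbers, sort numerically (highest first), skip the first <keep>
  find "$dir" -maxdepth 1 -type d -name 'step-*' \
    | sed 's/.*step-//' \
    | sort -rn \
    | tail -n +"$((keep + 1))" \
    | while read -r step; do
        rm -rf -- "$dir/step-$step"
      done
}

# Demo in a throwaway directory
demo=$(mktemp -d)
mkdir "$demo"/step-{500,1000,1500,2000,2500}
prune_checkpoints "$demo" 3
ls "$demo"      # remaining: step-1500 step-2000 step-2500
rm -rf -- "$demo"
```

The numeric sort matters: a plain lexical sort would rank step-500 above step-2500 and delete the wrong directories.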

Step 5: Monitor storage utilization.

bash

# Lustre
lfs df -h /lustre

# BeeGFS
beegfs-df --mountPoint /scratch

# WekaIO
weka status

Watch for OST imbalance in Lustre (one OST filling up while others are empty). lfs df lists usage per OST (lfs osts lists only the OSTs themselves), and lfs_migrate can move files onto emptier OSTs if needed.
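Imbalance can be flagged automatically by comparing the fullest and emptiest OST. The sketch below is illustrative: `ost_usage_spread` is a hypothetical helper, the 20-point threshold is an arbitrary example, and it assumes the usual `lfs df` column layout in which the fifth column is Use% and OST rows are named <fsname>-OSTxxxx_UUID. The sample rows are made up for the demo.

```bash
#!/usr/bin/env bash
# Sketch: report the spread between the fullest and emptiest OST from `lfs df` output.

# ost_usage_spread [threshold_percentage_points]   (reads `lfs df` output on stdin)
ost_usage_spread() {
  local threshold="${1:-20}"
  awk -v thr="$threshold" '
    $1 ~ /-OST[0-9a-fA-F]+_UUID$/ {
      use = $5; sub(/%/, "", use); use += 0
      if (n == 0 || use < min) min = use
      if (n == 0 || use > max) max = use
      n++
    }
    END {
      if (n == 0) { print "no OSTs found"; exit 1 }
      printf "osts=%d min=%d%% max=%d%% spread=%d\n", n, min, max, max - min
      if (max - min > thr) print "WARNING: OST usage is imbalanced"
    }'
}

# Demo with made-up sample rows; on a real client use: lfs df /lustre | ost_usage_spread 20
ost_usage_spread 20 <<'EOF'
UUID                 1K-blocks       Used  Available Use% Mounted on
ai0-MDT0000_UUID       1000000     100000     900000  10% /lustre[MDT:0]
ai0-OST0000_UUID       4000000    3200000     800000  80% /lustre[OST:0]
ai0-OST0001_UUID       4000000     800000    3200000  20% /lustre[OST:1]
EOF
```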

Storage throughput determines how much of your GPU spend actually trains the model vs waits on I/O. Spheron bare-metal H200 and B200 nodes ship with local NVMe and high-bandwidth interconnects ready for WekaIO, Lustre, or BeeOND - no hyperscaler egress fees, no paravirtualized storage overhead.


Quick Setup Guide

Calculate your storage throughput requirements
Compute checkpoint write bandwidth: (model parameters * 2 bytes for bf16) / target_checkpoint_time_seconds. Sharding splits this total across ranks but does not change it. Add dataset read bandwidth: tokens_per_second * bytes_per_token, where bytes_per_token includes masks, position IDs, and metadata. Sum these to get the aggregate throughput floor your parallel FS must sustain.
Choose a parallel file system based on cluster size and job type
Use BeeOND for 4-16 node fine-tuning jobs with ephemeral scratch. Use Lustre for 16-128 node pre-training or continuous pre-training clusters needing persistent shared storage. Use WekaIO for 64+ node production clusters where operational simplicity and built-in S3 tiering justify the licensing cost.
Deploy Lustre MDT/OST topology on GPU cloud
Provision one CPU node (8+ vCPU, 16+ GB RAM) as combined MDS/MGS. Format each GPU node's local NVMe as an OST with mkfs.lustre --ost and mount it. Set the stripe count to the number of OSTs with a 4 MB stripe size for checkpoint workloads, and use a lower stripe count with a 1 MB stripe size for random-access dataset reads.
Configure BeeOND scratch tier on Spheron nodes
On each GPU node, run the BeeGFS service containers with the local NVMe path as the storage directory. Set the management service on the rank-0 node. Mount /scratch on all nodes via the BeeGFS client. All training processes write checkpoints and temporary tensors to /scratch, which is striped across all nodes' NVMe drives.
Set up the Spheron NVMe scratch + S3 object tier architecture
Configure the parallel FS (Lustre or WekaIO) as the hot tier for active checkpoints and datasets. Use Spheron's S3-compatible object storage endpoint as the cold tier for completed checkpoints and archived datasets. Automate checkpoint promotion: keep the last 3 checkpoints on the parallel FS, push older ones to S3 via async copy.

Frequently Asked Questions

What parallel file system is best for multi-node LLM training on GPU cloud?

For clusters under 16 nodes doing fine-tuning, BeeOND scratch tier (BeeGFS On-Demand) is the lowest-friction option because it pools the local NVMe drives across nodes without a dedicated storage server. For 32-node clusters doing pre-training or continuous pre-training, Lustre with a 2-4 MDS/MDT and 8+ OSS/OST setup delivers the metadata throughput needed for checkpointing large model shards in parallel. For production clusters above 64 nodes where data pipeline throughput is the bottleneck, WekaIO's distributed NVMe pool with its built-in S3 gateway removes the need for a separate object tier and reduces operational complexity.

How much storage bandwidth does LLM training actually need?

A 70B parameter model checkpoint in bf16 is 140 GB. Writing that checkpoint across 8 GPU nodes in under 60 seconds requires at least 2.3 GB/s aggregate write throughput. For 32 nodes with 4-way tensor parallelism saving per-rank shards simultaneously, you need 18+ GB/s aggregate write bandwidth. Dataset loading for pre-training at scale adds another 5-20 GB/s read requirement, depending on sequence length and micro-batch size. A single NFS server almost never delivers this - that's where parallel file systems earn their cost.

Can I run Lustre or WekaIO on Spheron GPU nodes?

Yes. Spheron bare-metal H100, H200, and B200 nodes expose local NVMe drives that can serve as OSTs (Lustre object storage targets) or WekaIO data drives. For Lustre, you provision a small CPU-only node as the MDS/MGS and use the GPU nodes' local NVMe as OSTs. For WekaIO, you run the Weka agent on each node and the distributed NVMe pool forms automatically. BeeOND is simpler: it runs the BeeGFS services as containers directly on the GPU nodes with zero additional infrastructure.

How does Lustre on GPU cloud compare to Amazon FSx for Lustre?

FSx for Lustre charges $0.17-$0.30/GB/month depending on throughput tier (AWS also offers an Intelligent-Tiering storage class starting around $0.005/GB-month for cold data), with data transfer charges for cross-AZ or cross-region access. Self-managed Lustre on Spheron uses your GPU nodes' local NVMe (already paid for) as OSTs, so the marginal cost of the parallel FS layer is just the MDS server ($0.50-2.00/hr for a CPU node) plus the NVMe already attached to your training nodes. At a 32-node H200 cluster running 80% utilization over a month, FSx Lustre at 200 MB/s/TB costs $4,800-9,600/month more than self-managed Lustre, with no additional throughput.

What is BeeOND and how does it differ from BeeGFS?

BeeOND (BeeGFS On-Demand) is BeeGFS configured to run storage services as lightweight daemons directly on the same nodes as your compute workloads. Instead of dedicated storage nodes, each GPU node contributes its local NVMe to a shared parallel namespace. BeeOND is ephemeral by default - the parallel FS exists only for the duration of the job - which makes it ideal for scratch storage during fine-tuning. Full BeeGFS uses separate storage nodes with persistent data, suitable for shared datasets accessed by multiple concurrent jobs.