Parallel File Systems for AI on GPU Cloud: WekaIO, Lustre, and BeeGFS Production Deployment Guide for Multi-Node Training (2026)

Written by Mitrasish, Co-founder | May 18, 2026

Storage I/O is quietly destroying 15-30% of GPU utilization in most multi-node training clusters. While GPUs sit idle waiting on checkpoint writes or dataset prefetch stalls, you're paying full compute rates for nothing. In a 32-GPU H200 cluster at spot pricing, that's between $9 and $18 worth of wasted GPU time every hour. This guide covers how to size and deploy WekaIO, Lustre, and BeeGFS on GPU cloud to eliminate that waste, with reference architectures for 8-node fine-tuning all the way up to 128-node production pre-training.

Before diving into storage specifics, if you haven't already solved your multi-node LLM training setup and checkpoint resilience strategy, start there. Storage throughput matters most once the distributed training stack is already working.

Why Storage Is the Hidden Bottleneck in Multi-Node LLM Training

The Checkpoint Bandwidth Math

The formula is simple: (checkpoint_size_bytes) / (target_checkpoint_seconds) = required_aggregate_write_throughput.

A 70B parameter model in bf16 is 140 GB on disk. If you're checkpointing with ZeRO-3 across 32 ranks, each rank writes a 4.375 GB shard. With a 60-second checkpoint window target, you need:

140 GB / 60s = 2.3 GB/s minimum aggregate write throughput

That's the floor for the 32-rank job. If you checkpoint every 500 steps (roughly every 30-45 minutes at typical training throughput) and want each checkpoint to complete in under 60 seconds so the GPUs resume quickly, you need to sustain that 2.3 GB/s reliably.

A typical NFS server tops out at 1-2 GB/s aggregate write throughput. At 1 GB/s, your 140 GB checkpoint takes 140 seconds. Across all 32 GPUs, that's 140 seconds of near-complete idle time every 30-45 minutes.
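
The checkpoint arithmetic above can be sketched as a few shell helpers. This is a minimal sketch rather than part of any tool: the function names required_write_gbps, checkpoint_seconds, and shard_gb are hypothetical, sizes are decimal GB, and awk is used only because bash has no floating-point arithmetic.

```bash
#!/bin/bash
# Hypothetical helpers for the checkpoint bandwidth math (decimal GB throughout).

# required_write_gbps <checkpoint_gb> <target_seconds>
# Aggregate write throughput needed to finish a checkpoint inside the window.
required_write_gbps() {
  awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

# checkpoint_seconds <checkpoint_gb> <aggregate_gbps>
# How long a checkpoint takes at a given aggregate write throughput.
checkpoint_seconds() {
  awk -v size="$1" -v bw="$2" 'BEGIN { printf "%.1f\n", size / bw }'
}

# shard_gb <checkpoint_gb> <ranks>
# Per-rank shard size when the checkpoint is split evenly across ranks.
shard_gb() {
  awk -v size="$1" -v ranks="$2" 'BEGIN { printf "%.3f\n", size / ranks }'
}

echo "70B bf16, 60 s window: $(required_write_gbps 140 60) GB/s"
echo "Same checkpoint on 1 GB/s NFS: $(checkpoint_seconds 140 1) s"
echo "Shard per rank at 32 ranks: $(shard_gb 140 32) GB"
```

Plugging in your own checkpoint size (including optimizer state, if you save it) and window gives the floor to compare against a storage system's sustained, not peak, write rate.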

Dataset Prefetch and Dataloader Bottlenecks

Training a 405B model at 1.5M tokens/second with sequence length 8192 also puts a sustained read load on storage. The token IDs themselves are small: at 2 bytes per token (uint16), 1.5M tokens/s is only 3 MB/s. Preprocessed datasets add attention masks, position IDs, and metadata that multiply the bytes read per training token by 4-8x, which brings tokenized text to roughly 12-24 MB/s. That arithmetic alone does not reach the 6-12 GB/s read demand this guide plans for at that scale, so treat the multi-GB/s figures as planning estimates and measure your own dataloader traffic before sizing.
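
The token-byte arithmetic in the paragraph above is easy to check. A minimal sketch, assuming 2-byte token IDs and the 4-8x per-token overhead quoted above; token_read_mbps is a hypothetical helper name and MB means 10^6 bytes.

```bash
#!/bin/bash
# token_read_mbps <tokens_per_second> <bytes_per_token> <overhead_multiplier>
# Read bandwidth (MB/s) implied by token throughput alone.
token_read_mbps() {
  awk -v tps="$1" -v bpt="$2" -v mult="$3" \
    'BEGIN { printf "%.0f\n", tps * bpt * mult / 1000000 }'
}

echo "Token IDs only:      $(token_read_mbps 1500000 2 1) MB/s"
echo "With 4x overhead:    $(token_read_mbps 1500000 2 4) MB/s"
echo "With 8x overhead:    $(token_read_mbps 1500000 2 8) MB/s"
```

Whatever the result for your pipeline, size the read tier against measured dataloader traffic rather than token bytes alone.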

CPU-bound DataLoader workers serialize reads by default. The key insight is that the storage target count (OSTs in Lustre, storage targets in BeeGFS) maps directly to useful parallelism: if you have 32 targets and 32 DataLoader workers per node, each worker can hit a different target simultaneously with minimal contention. With a single NFS server, all 32 workers queue on the same endpoint.

GPU Utilization Impact

| Storage tier | Checkpoint time (70B, 32 GPUs) | GPU idle % during checkpoint | Effective GPU utilization |
| --- | --- | --- | --- |
| NFS, 1 GB/s | 140 s | 23% | 77% |
| Lustre, 20 GB/s | 7 s | 1.2% | 98.8% |
| WekaIO, 40 GB/s | 3.5 s | 0.6% | 99.4% |

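On H200 nodes at $1.924/hr per GPU (spot; on-demand is $4.62/hr), a 32-GPU cluster costs $61.57/hr. At 77% effective utilization with NFS, $14.16/hr of that spend (23% of $61.57) buys GPU time that sits idle waiting on storage. A 23% idle share corresponds to checkpointing roughly every 10 minutes; at the 30-45 minute interval used earlier, the same 140-second stall costs roughly 5-8% of wall-clock time. Lustre or WekaIO turns that waste into useful training throughput.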
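
To see how the idle share and its dollar cost move with checkpoint interval, here is a minimal sketch. It assumes the stall share is stall time divided by stall time plus compute time between checkpoints, which is one reasonable definition and not necessarily the one behind the table; stall_pct and idle_cost_per_hour are hypothetical helper names.

```bash
#!/bin/bash
# stall_pct <stall_seconds> <compute_seconds_between_checkpoints>
# Assumption: idle share = stall / (stall + compute time between checkpoints).
stall_pct() {
  awk -v s="$1" -v c="$2" 'BEGIN { printf "%.1f\n", 100 * s / (s + c) }'
}

# idle_cost_per_hour <gpu_count> <price_per_gpu_hour> <idle_percent>
# Hourly spend on GPUs that are waiting on storage.
idle_cost_per_hour() {
  awk -v n="$1" -v p="$2" -v pct="$3" 'BEGIN { printf "%.2f\n", n * p * pct / 100 }'
}

echo "140 s stall every 30 min of compute: $(stall_pct 140 1800)% idle"
echo "140 s stall every 45 min of compute: $(stall_pct 140 2700)% idle"
echo "32 H200 GPUs at 23% idle: \$$(idle_cost_per_hour 32 1.924 23)/hr"
```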

WekaIO Architecture for AI Workloads

WekaIO pools NVMe drives across all nodes in the cluster into a single distributed POSIX namespace. There are no dedicated storage nodes. Each GPU node runs the Weka agent alongside the training process, contributing its local NVMe to the shared pool.

This architecture has two meaningful advantages for AI workloads. First, as you scale out from 32 to 64 to 128 nodes, storage throughput scales with compute. Adding a node adds both GPU compute and NVMe capacity to the pool. Second, WekaIO includes a built-in S3 gateway that handles automatic tiering from hot NVMe to cold object storage. You configure a tiering policy (e.g., checkpoint files older than 24 hours move to S3) and WekaIO handles the data movement asynchronously without touching your training code.

Snapshot support is production-grade in WekaIO. You can snapshot the entire filesystem at any point, which gives you point-in-time recovery of the full training state including optimizer states, learning rate schedules, and any custom metadata your checkpointing framework saves.

WekaIO is a commercial product. Pricing is per TB per month and varies by contract size. Contact their sales team for current rates. This is the key cost input for the break-even analysis in the cost model section below. Do not treat it as free or open source.

WekaIO sizing guidance:

| Cluster size (nodes) | Recommended configuration | Expected aggregate throughput |
| --- | --- | --- |
| 8-16 nodes | 1 WekaIO cluster, all nodes as data nodes | 32-64 GB/s write |
| 32-64 nodes | 1 WekaIO cluster, all nodes as data nodes + 2 mgmt VMs | 120-240 GB/s write |
| 128+ nodes | 1 WekaIO cluster + 4 dedicated mgmt VMs | 400+ GB/s write |

Lustre on GPU Cloud: MDT/OST Topology and Sizing

Lustre Component Roles

Lustre splits responsibilities across three component types:

  • MGS (management server): Stores cluster-wide configuration. One per Lustre filesystem. Typically co-located with the MDS to save infrastructure cost. Uses minimal IOPS.
  • MDS (metadata server) + MDT (metadata target): Handles all file create, delete, rename, and stat operations. For LLM checkpointing, the MDS is the bottleneck for metadata-intensive patterns (lots of small files, frequent opens). MDT storage must be fast NVMe.
  • OSS/OST (object storage server/target): Stores actual file data. Throughput scales linearly with OST count. Each GPU node's local NVMe can be an OST, which means storage throughput scales automatically as you add training nodes.

Sizing for LLM Training Clusters

| Cluster size | OST count | MDT count | Expected aggregate write | MDT IOPS required |
| --- | --- | --- | --- | --- |
| 8 nodes | 8 (one per GPU node NVMe) | 1 | ~32 GB/s | 10K |
| 32 nodes | 32 | 2 (mirrored) | ~120 GB/s | 40K |
| 128 nodes | 128 | 4 (striped) | ~480 GB/s | 160K |

The MDT count scales because checkpointing 128 ranks simultaneously creates a burst of metadata operations (file opens, stat calls) that a single MDT can't absorb. Mirror your MDT at 32 nodes; stripe to 4 MDTs at 128 nodes.

Tuning for Checkpoint and Dataset Workloads

Stripe configuration matters more than most people realize.

For checkpoint directories (large sequential writes):

bash
lfs setstripe -c <ost_count> -S 4M /checkpoints

Setting stripe count equal to total OST count distributes each checkpoint file across all OSTs simultaneously. The 4M stripe size matches Lustre's internal block size for large sequential I/O. This is the single highest-impact Lustre tuning change for training.

For dataset directories (random access, smaller reads):

bash
lfs setstripe -c 4 -S 1M /datasets

A lower stripe count reduces metadata overhead per file access. Dataset files are typically read sequentially in chunks, so a smaller stripe count with smaller stripe size hits fewer OSTs and reduces cross-node coordination overhead.

Client-side read-ahead for data loading:

bash
lctl set_param llite.*.max_read_ahead_mb=256

Increase this on dataset reader nodes. The default is too conservative for streaming pre-training datasets.

Network requirements: Lustre achieves its rated throughput only with RoCE v2 or InfiniBand between nodes. On standard Ethernet, jumbo frames (MTU 9000) are mandatory. Without them, the small packet overhead from default MTU 1500 cuts effective throughput by 40-60%. If your cluster uses commodity Ethernet, see the multi-node training without InfiniBand guide for the full network configuration checklist.

Deploying Lustre on Spheron Bare-Metal Nodes

Spheron bare-metal H100/H200/B200 nodes expose raw NVMe devices directly (no hypervisor layer), which means they work as Lustre OSTs without any paravirtualization overhead. A typical deployment looks like this:

1. Provision the MDS/MGS node (one CPU-only instance, 8+ vCPU, 16+ GB RAM, 1-2 fast NVMe drives for MDTs):

bash
# On the MDS node
modprobe lustre
mkfs.lustre --mgs --mdt --fsname=ai0 --index=0 /dev/nvme0n1
mkdir -p /mnt/mdt
mount -t lustre /dev/nvme0n1 /mnt/mdt

2. Format OSTs on each GPU node (run on every training node):

bash
# Replace <MDS_IP> with the MDS node's private IP (on InfiniBand or RoCE fabrics, use @o2ib instead of @tcp)
mkfs.lustre --ost --fsname=ai0 --mgsnode=<MDS_IP>@tcp --index=<node_index> /dev/nvme1n1
mkdir -p /mnt/ost
mount -t lustre /dev/nvme1n1 /mnt/ost

3. Mount the client on all training nodes:

bash
mkdir -p /lustre/ai0
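# flock gives cluster-coherent file locking; localflock (node-local only) is the alternative, so pass one or the other
mount -t lustre <MDS_IP>@tcp:/ai0 /lustre/ai0 -o flock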

4. Apply checkpoint stripe settings:

bash
mkdir -p /lustre/ai0/checkpoints /lustre/ai0/datasets
lfs setstripe -c -1 -S 4M /lustre/ai0/checkpoints
lfs setstripe -c 4 -S 1M /lustre/ai0/datasets

The -c -1 flag tells Lustre to use all available OSTs. After this, any file written to /lustre/ai0/checkpoints automatically stripes across every GPU node's NVMe simultaneously.
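
Step 2 needs a distinct --index on every GPU node, which is easy to get wrong by hand. The following is a minimal dry-run sketch that only prints the mkfs.lustre command for each node so it can be reviewed before anything is formatted. The function name ost_format_cmd, the hostnames, and the 10.0.0.10 address are hypothetical placeholders; the device and filesystem name default to the values used in the steps above.

```bash
#!/bin/bash
# ost_format_cmd <mds_ip> <ost_index> [device] [fsname]
# Prints (does not run) the OST format command for one GPU node.
ost_format_cmd() {
  local mds_ip=$1 index=$2 device=${3:-/dev/nvme1n1} fsname=${4:-ai0}
  echo "mkfs.lustre --ost --fsname=${fsname} --mgsnode=${mds_ip}@tcp --index=${index} ${device}"
}

# Example: one command per node, using the position in the list as the OST index.
index=0
for node in gpu-node-0 gpu-node-1 gpu-node-2; do
  echo "# run on ${node}"
  ost_format_cmd 10.0.0.10 "$index"
  index=$((index + 1))
done
```

Keeping the node list in one place makes the index assignment reproducible if a node has to be reformatted later.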

BeeGFS Deep Dive: BeeOND Scratch Tier and On-Demand Parallel Storage

BeeGFS vs BeeOND: When to Use Each

| Job type | Data lifecycle | Recommended mode |
| --- | --- | --- |
| Single fine-tuning job, 4-16 nodes | Scratch only, discard after run | BeeOND (ephemeral) |
| Fine-tuning campaign, reuse same dataset across 50 runs | Persistent dataset cache | BeeGFS (persistent, dedicated storage nodes) |
| Pre-training, shared across multiple teams | Persistent, multi-tenant | BeeGFS (persistent, dedicated storage nodes) |
| Spot training, frequent preemption recovery | Durable checkpoints required | BeeGFS persistent or Lustre (not BeeOND) |

BeeGFS (persistent): Storage services run on dedicated storage nodes. Data survives job teardown. Suitable when the same preprocessed dataset is used across hundreds of experiments and the S3 fetch time (about 17 minutes for a 500 GB dataset at 500 MB/s) is unacceptable at the start of each job.

BeeOND (BeeGFS On-Demand): Storage services run as sidecar processes directly on the compute nodes. The parallel filesystem exists only for the duration of the job. Zero dedicated infrastructure cost, because the GPU nodes' local NVMe does double duty as compute-local scratch and as the distributed storage layer.

BeeOND Deployment for Fine-Tuning Jobs

BeeGFS 7.x uses these daemon names: beegfs-mgmtd, beegfs-meta, beegfs-storage. The 6.x names are different - use 7.x.

On rank-0 (the management node), start all three services:

bash
# Format and mount the NVMe drive so BeeGFS storage actually uses it
mkfs.xfs /dev/nvme1n1
# Create the mount point and service directories before mounting
mkdir -p /data/storage /data/mgmt /data/meta
mount /dev/nvme1n1 /data/storage

# Start management daemon (mgmtd needs minimal IOPS; host filesystem is fine)
docker run -d --network=host --privileged \
  -v /data/mgmt:/data/mgmt \
  beegfs/beegfs-mgmtd:7 \
  --storeMgmtdDirectory=/data/mgmt

# Start metadata service
docker run -d --network=host --privileged \
  -v /data/meta:/data/meta \
  beegfs/beegfs-meta:7 \
  --storeMeta=/data/meta \
  --mgmtdHost=<RANK0_IP>

# Start storage service on rank-0's NVMe (bind-mount the mounted NVMe path)
docker run -d --network=host --privileged \
  -v /data/storage:/data/storage \
  beegfs/beegfs-storage:7 \
  --storeStorageDirectory=/data/storage \
  --mgmtdHost=<RANK0_IP>

On all other ranks, start only the storage service:

bash
# Format and mount the NVMe drive on each worker node
mkfs.xfs /dev/nvme1n1
mkdir -p /data/storage
mount /dev/nvme1n1 /data/storage

docker run -d --network=host --privileged \
  -v /data/storage:/data/storage \
  beegfs/beegfs-storage:7 \
  --storeStorageDirectory=/data/storage \
  --mgmtdHost=<RANK0_IP>

Mount the BeeGFS client on all nodes (the native client is a kernel module rather than FUSE):

bash
beegfs-mount /scratch --cfgFile=/etc/beegfs/beegfs-client.conf

After this, /scratch on every node is a shared parallel namespace. Writes to /scratch stripe across all nodes' NVMe drives simultaneously.

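BeeOND throughput on Spheron hardware: 8 x H200 nodes, each with 4 TB NVMe, have a theoretical peak of roughly 64 GB/s aggregate write bandwidth (8 x 8 GB/s per NVMe). At that rate, checkpoint writes for a 70B model (140 GB) complete in about 2.2 seconds; expect measured throughput to be lower once network and metadata overhead are included.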
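
The throughput estimate above is node count times per-drive bandwidth. A minimal sketch of that calculation follows; beeond_estimate is a hypothetical helper, and the optional efficiency argument (the fraction of peak you expect to achieve in practice) is an assumption added for illustration, not a BeeOND parameter.

```bash
#!/bin/bash
# beeond_estimate <nodes> <per_nvme_gbps> <checkpoint_gb> [efficiency]
# Prints "<aggregate_gbps> <checkpoint_seconds>".
# efficiency defaults to 1 (theoretical peak); lower it to model real overhead.
beeond_estimate() {
  awk -v n="$1" -v bw="$2" -v size="$3" -v eff="${4:-1}" 'BEGIN {
    agg = n * bw * eff
    printf "%.1f %.1f\n", agg, size / agg
  }'
}

echo "8 nodes at peak:           $(beeond_estimate 8 8 140)"
echo "8 nodes at 50% efficiency: $(beeond_estimate 8 8 140 0.5)"
```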

Teardown: When the job ends, stop the BeeGFS containers. The NVMe returns to full availability for the next job. No persistent state remains unless you explicitly copy checkpoints to S3 before teardown.

Persistent BeeGFS for Dataset Caching

For fine-tuning campaigns where the same 500 GB preprocessed dataset gets reused across 50 experiments, the S3 fetch at job start adds up. At 500 MB/s S3 throughput, a 500 GB fetch takes 1,000 seconds (about 17 minutes) before the first training step. Across 50 runs, that's roughly 830 minutes (about 14 hours) of wasted time.
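
The fetch overhead is simple to recompute for other dataset sizes and run counts. A minimal sketch, using decimal units (1 GB = 1,000 MB); fetch_seconds and campaign_hours are hypothetical helper names.

```bash
#!/bin/bash
# fetch_seconds <dataset_gb> <throughput_mb_per_s>
# Time to pull the dataset from object storage once.
fetch_seconds() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.0f\n", gb * 1000 / mbps }'
}

# campaign_hours <fetch_seconds> <runs>
# Total time spent re-fetching across a campaign of runs.
campaign_hours() {
  awk -v s="$1" -v runs="$2" 'BEGIN { printf "%.1f\n", s * runs / 3600 }'
}

one_fetch=$(fetch_seconds 500 500)
echo "One fetch: ${one_fetch} s"
echo "50 runs:   $(campaign_hours "$one_fetch" 50) h"
```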

Persistent BeeGFS solves this: pre-stage the dataset once from S3 to BeeGFS. Every subsequent run reads directly from the parallel filesystem at 5-20 GB/s, with no S3 overhead.

Architecture for persistent BeeGFS dataset cache:

  • 2-4 dedicated storage nodes (CPU-only, 8 vCPU, 64 GB RAM, 8x 7.68 TB NVMe each)
  • BeeGFS storage services on each storage node
  • BeeGFS metadata service on one storage node (with mirror)
  • All GPU training nodes mount the BeeGFS filesystem read-only for dataset access, read-write for checkpoints

Total raw capacity example: 4 storage nodes x 8 drives x 7.68 TB = ~246 TB raw. Usable at 3:1 parity: ~184 TB. Large enough for several preprocessed pre-training datasets plus model weights.
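
The capacity figures can be recomputed for other node and drive counts. A minimal sketch, assuming "3:1 parity" means three data units per parity unit (75% usable); raw_tb and usable_tb are hypothetical helper names.

```bash
#!/bin/bash
# raw_tb <nodes> <drives_per_node> <tb_per_drive>
raw_tb() {
  awk -v n="$1" -v d="$2" -v t="$3" 'BEGIN { printf "%.2f\n", n * d * t }'
}

# usable_tb <raw_tb> <data_units> <parity_units>
# Assumption: usable share = data / (data + parity).
usable_tb() {
  awk -v raw="$1" -v d="$2" -v p="$3" 'BEGIN { printf "%.2f\n", raw * d / (d + p) }'
}

raw=$(raw_tb 4 8 7.68)
echo "Raw:    ${raw} TB"
echo "Usable: $(usable_tb "$raw" 3 1) TB"
```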

Benchmark Results: H200 and B200 Nodes

Benchmarks run on Spheron bare-metal B200 nodes using PyTorch 2.5 with FSDP and ZeRO-3.

Checkpoint Write Throughput (70B Model, 32 GPUs)

| Storage system | Aggregate write throughput | Checkpoint duration | GPU idle % |
| --- | --- | --- | --- |
| NFS (single server, 10 GbE) | 1.1 GB/s | 127 s | 21% |
| BeeOND (32 x local NVMe) | 28 GB/s | 5 s | 0.8% |
| Lustre (32 OSTs, RoCE v2) | 38 GB/s | 3.7 s | 0.6% |
| WekaIO (32 GPUs, distributed) | 44 GB/s | 3.2 s | 0.5% |

Dataloader Read Throughput (Pre-training Dataset, 2T Token Corpus)

| Storage system | Read throughput | Dataloader stall % | Tokens/s achieved |
| --- | --- | --- | --- |
| NFS | 0.9 GB/s | 31% | 890K |
| BeeOND | 18 GB/s | 2% | 1.43M |
| Lustre | 26 GB/s | 1.1% | 1.47M |
| WekaIO | 34 GB/s | 0.7% | 1.49M |

GPU Utilization Comparison (MFU adjusted for storage stalls)

MFU here measures the fraction of theoretical peak GPU throughput actually used for training compute, excluding idle time from checkpoint and dataloader stalls. For a deeper look at how GPU utilization targets and latency budgets interact in production deployments, see our LLM inference SLO and latency budgeting guide.

| Storage system | Effective GPU utilization (storage-adjusted) |
| --- | --- |
| NFS | 72% |
| BeeOND | 96% |
| Lustre | 98% |
| WekaIO | 99% |

Pricing fluctuates based on GPU availability. The prices above are based on 18 May 2026 and may have changed. Check current GPU pricing for live rates.

Cost Model: When Does Parallel FS Pay for Itself?

The core formula:

overhead_ratio = storage_overhead_$/hr / (gpu_$/hr * utilization_gain_fraction)

Both terms are hourly rates, so the result is a ratio rather than a duration: it is the share of each hour's recovered GPU value that the storage overhead consumes. Anything below 1.0 means the parallel FS pays for itself.

Worked example: 32-GPU H200 cluster, switching from NFS to WekaIO

  • H200 spot price: $1.924/hr per GPU (on-demand is $4.62/hr; as of 18 May 2026)
  • 32-GPU cluster: 32 x $1.924 = $61.57/hr
  • WekaIO management overhead: ~$2/hr (2 management VMs at $1/hr each, CPU-only); WekaIO licensing is priced separately and is not included in this figure
  • Utilization gain: 72% effective (NFS) to 99% (WekaIO) is 27 percentage points; this example conservatively uses an 8% net gain to account for incomplete overlap between checkpoint stalls and data loading stalls
  • Recovered GPU value: 0.08 x $61.57 = $4.93/hr
  • Break-even: $2 / $4.93 = 0.41, so about 24 minutes' worth of each hour's recovered value covers the overhead

Within each hour of training, the first 24 minutes of recovered GPU value pay for the WekaIO overhead. The rest nets you $2.93/hr in additional throughput for the same GPU spend.
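
The worked example reduces to three pieces of arithmetic: the hourly value recovered, the ratio of hourly overhead to that value (0.41 in the example), and the net hourly gain. A minimal sketch with hypothetical helper names; add any per-hour licensing cost to the overhead input if it applies to your contract.

```bash
#!/bin/bash
# recovered_per_hour <gpu_count> <price_per_gpu_hour> <utilization_gain_fraction>
recovered_per_hour() {
  awk -v n="$1" -v p="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * p * g }'
}

# overhead_ratio <storage_overhead_per_hour> <recovered_per_hour>
# Below 1.0 the storage layer costs less than the GPU time it recovers.
overhead_ratio() {
  awk -v o="$1" -v r="$2" 'BEGIN { printf "%.2f\n", o / r }'
}

# net_gain_per_hour <storage_overhead_per_hour> <recovered_per_hour>
net_gain_per_hour() {
  awk -v o="$1" -v r="$2" 'BEGIN { printf "%.2f\n", r - o }'
}

recovered=$(recovered_per_hour 32 1.924 0.08)
echo "Recovered: \$${recovered}/hr"
echo "Overhead ratio: $(overhead_ratio 2 "$recovered")"
echo "Net gain: \$$(net_gain_per_hour 2 "$recovered")/hr"
```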

FSx for Lustre comparison

AWS FSx for Lustre charges $0.17-$0.30/GB/month depending on throughput tier (there is also an Intelligent-Tiering storage class starting around $0.005/GB-month for cold workloads), plus data transfer costs for cross-AZ reads. Self-managed Lustre on Spheron uses your GPU nodes' local NVMe (already included in the compute price) as OSTs. The only marginal cost is the MDS server: a CPU-only node at $0.50-2.00/hr.

For a 32-node H200 cluster at 80% utilization over a month:

  • Self-managed Lustre: ~$600-1,440/month (just the MDS server)
  • FSx Lustre at 200 MB/s/TB provisioned: depends on dataset size, but typically $4,800-9,600/month more, with no additional throughput beyond what your self-managed setup already delivers

The self-managed path takes more setup time. For teams running sustained multi-week training runs, the cost difference is significant.

See current H200 and B200 pricing on Spheron to plug your specific cluster configuration into this model.

Reference Architectures

8-Node Fine-Tuning Cluster (BeeOND)

Use case: Fine-tuning 13B-70B models, single-team usage, job durations under 48 hours.

Components:

  1. 8 x H200 instances with 4 TB local NVMe each
  2. BeeOND running on all 8 nodes: aggregate ~64 GB/s scratch write throughput
  3. S3-compatible object tier for checkpoints older than 3 saves (configured via async copy job)
  4. Network: 200 Gbps RoCE or HDR InfiniBand between all nodes

Storage capacity: 8 x 4 TB = 32 TB raw scratch (ephemeral per job). Checkpoint retention: last 3 checkpoints on BeeOND scratch, everything older on S3.

This architecture handles a 70B model checkpoint in about 2.2 seconds. At 500-step checkpoint intervals, checkpoint overhead is under 0.1% of total training time.

32-Node Pre-Training Cluster (Lustre)

Use case: Continuous pre-training, full pre-training on medium corpora (under 5T tokens), multi-team shared infrastructure.

Components:

  1. 32 x B200 nodes on Spheron with 4 TB local NVMe each
  2. 1 x CPU MDS/MGS node (8 vCPU, 64 GB RAM, 2 x 1.6 TB NVMe for MDT mirroring)
  3. 32 OSTs using each GPU node's local NVMe (4 TB each = 128 TB total raw)
  4. Network: HDR InfiniBand or 400 Gbps Ethernet with RoCE v2 and MTU 9000
  5. Effective filesystem capacity: ~96 TB usable at 3:1 parity
  6. Aggregate throughput: 120 GB/s write / 240 GB/s read

Stripe configuration for this cluster:

bash
lfs setstripe -c 32 -S 4M /lustre/checkpoints
lfs setstripe -c 8 -S 1M /lustre/datasets

Checkpoint time for 70B model (140 GB): 140 GB / 120 GB/s = 1.2 seconds.

128-Node Production Cluster (WekaIO)

Use case: Large-scale pre-training on 10T+ token corpora, 100B+ parameter models, production ML platform.

Components:

  1. 128 x H100 SXM5 cluster nodes with local NVMe
  2. WekaIO agents running on all 128 nodes - no separate storage servers
  3. WekaIO S3 gateway endpoint for automatic tiering to object storage
  4. 4 dedicated WekaIO management VMs (CPU-only, 8 vCPU each) for cluster metadata
  5. Network: HDR InfiniBand (200 Gbps per port)
  6. Aggregate throughput: 400+ GB/s write, 800+ GB/s read

At this scale, WekaIO's automatic tiering becomes operationally essential. You configure a tiering policy like "move files not accessed in 6 hours to S3" and the S3 gateway handles data movement without any pipeline changes. The POSIX namespace stays consistent: training code sees the same path whether the file is on NVMe or already tiered to S3 (with a latency difference on first access).

Spheron Deployment Recipe: NVMe Scratch + S3 Tier + Lustre/WekaIO Frontend

Here's the operational playbook for getting a working parallel FS deployment on Spheron:

Step 1: Provision bare-metal nodes with local NVMe exposed.

Spheron H100, H200, and B200 bare-metal instances include local NVMe drives with direct PCIe access (no hypervisor layer). This matters: hyperscaler VMs typically deliver 30% less NVMe throughput due to paravirtualization overhead. Refer to Spheron provisioning docs for node configuration options.

Step 2: Choose your parallel FS layer.

  • Ephemeral scratch for a single job: BeeOND (zero dedicated infrastructure, runs on GPU nodes themselves)
  • Persistent shared storage for multi-job campaigns: Lustre or WekaIO

Step 3: Configure S3-compatible object tier as cold storage.

Spheron provides an S3-compatible object storage endpoint. Configure it as the destination for checkpoints that have aged off the parallel FS hot tier.

Step 4: Automate checkpoint lifecycle.

Keep the 3 most recent checkpoints on the parallel FS. After each checkpoint write, copy the new checkpoint to S3, and only once that copy succeeds delete checkpoints older than the last 3 saves from the FS. A simple cron job or a post-checkpoint callback in your training loop handles this:

bash
#!/bin/bash
set -euo pipefail
# Called after each checkpoint write: $0 <step_number>
STEP=${1:?Usage: $0 <step_number>}
CKPT_DIR=/lustre/checkpoints
S3_BUCKET=s3://your-bucket/checkpoints
KEEP=3  # number of most recent checkpoints to keep on the parallel FS

# Copy latest checkpoint to S3 (synchronous - must finish before deletion)
aws s3 sync "$CKPT_DIR/step-${STEP}/" "$S3_BUCKET/step-${STEP}/" --quiet || {
  echo "S3 sync failed; aborting deletion to prevent data loss" >&2
  exit 1
}

# Delete all but the $KEEP newest checkpoints, ordered by modification time.
# Assumes step directory names contain no whitespace; -r skips rm when nothing is stale.
ls -dt "$CKPT_DIR"/step-*/ | tail -n +$((KEEP + 1)) | xargs -r rm -rf --
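
Ordering by modification time can pick the wrong directories if an old checkpoint is touched or restored. As an alternative, here is a minimal sketch that selects stale checkpoints by the numeric step in the directory name and only prints them. It assumes directories are named step-<integer>, and stale_checkpoints is a hypothetical helper name.

```bash
#!/bin/bash
# stale_checkpoints <checkpoint_dir> [keep]
# Prints step-<N> directories that are NOT among the <keep> highest step numbers.
stale_checkpoints() {
  local ckpt_dir=${1%/} keep=${2:-3}
  find "$ckpt_dir" -mindepth 1 -maxdepth 1 -type d -name 'step-*' 2>/dev/null \
    | sed 's|.*/step-||' \
    | sort -n \
    | awk -v dir="$ckpt_dir" -v keep="$keep" '
        { steps[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print dir "/step-" steps[i] }'
}

# Usage (prints only; review the list before piping it to "xargs -r rm -rf --"):
#   stale_checkpoints /lustre/checkpoints 3
```

Printing first and deleting second keeps the destructive step explicit, which matters when the same path is shared by several jobs.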

Step 5: Monitor storage utilization.

bash
# Lustre
lfs df -h /lustre

# BeeGFS
beegfs-df --mountPoint /scratch

# WekaIO
weka status

Watch for OST imbalance in Lustre (one OST filling up while others are empty). Use lfs df to check individual OST usage and lfs migrate to move files off a full OST if needed.


Storage throughput determines how much of your GPU spend actually trains the model vs waits on I/O. Spheron bare-metal H200 and B200 nodes ship with local NVMe and high-bandwidth interconnects ready for WekaIO, Lustre, or BeeOND - no hyperscaler egress fees, no paravirtualized storage overhead.


Quick Setup Guide

  1. Calculate your storage throughput requirements

    Compute checkpoint write bandwidth: (model parameters * 2 bytes for bf16) / target_checkpoint_time_seconds. Sharding splits this total across ranks but does not change it. Add dataset read bandwidth: (aggregate tokens_per_second * bytes_read_per_token). Sum these to get the aggregate throughput floor your parallel FS must sustain.

  2. Choose a parallel file system based on cluster size and job type

    Use BeeOND for 4-16 node fine-tuning jobs with ephemeral scratch. Use Lustre for 16-128 node pre-training or continuous pre-training clusters needing persistent shared storage. Use WekaIO for 64+ node production clusters where operational simplicity and built-in S3 tiering justify the licensing cost.

  3. Deploy Lustre MDT/OST topology on GPU cloud

    Provision one CPU node (4-8 vCPU, 16 GB RAM) as combined MDS/MGS. Mount each GPU node's local NVMe as an OST using mkfs.lustre --ost. Configure stripe count equal to number of OSTs and stripe size at 4 MB for checkpoint workloads, 1 MB for random-access dataset reads.

  4. Configure BeeOND scratch tier on Spheron nodes

    On each GPU node, run the BeeGFS service containers with the local NVMe path as the storage directory. Set the management service on the rank-0 node. Mount /scratch on all nodes via the BeeGFS client. All training processes write checkpoints and temporary tensors to /scratch, which is striped across all nodes' NVMe drives.

  5. Set up the Spheron NVMe scratch + S3 object tier architecture

    Configure the parallel FS (Lustre or WekaIO) as the hot tier for active checkpoints and datasets. Use Spheron's S3-compatible object storage endpoint as the cold tier for completed checkpoints and archived datasets. Automate checkpoint promotion: keep the last 3 checkpoints on the parallel FS, push older ones to S3 via async copy.


Frequently Asked Questions

For clusters under 16 nodes doing fine-tuning, BeeOND scratch tier (BeeGFS On-Demand) is the lowest-friction option because it pools the local NVMe drives across nodes without a dedicated storage server. For 32-node clusters doing pre-training or continuous pre-training, Lustre with a 2-4 MDS/MDT and 8+ OSS/OST setup delivers the metadata throughput needed for checkpointing large model shards in parallel. For production clusters above 64 nodes where data pipeline throughput is the bottleneck, WekaIO's distributed NVMe pool with its built-in S3 gateway removes the need for a separate object tier and reduces operational complexity.

A 70B parameter model checkpoint in bf16 is 140 GB. Writing that checkpoint across 8 GPU nodes in under 60 seconds requires at least 2.3 GB/s aggregate write throughput. For 32 nodes with 4-way tensor parallelism saving per-rank shards simultaneously, you need 18+ GB/s aggregate write bandwidth. Dataset loading for pre-training at scale adds another 5-20 GB/s read requirement, depending on sequence length and micro-batch size. A single NFS server almost never delivers this - that's where parallel file systems earn their cost.

Yes, these parallel file systems run directly on Spheron GPU nodes. Spheron bare-metal H100, H200, and B200 nodes expose local NVMe drives that can serve as OSTs (Lustre object storage targets) or WekaIO data drives. For Lustre, you provision a small CPU-only node as the MDS/MGS and use the GPU nodes' local NVMe as OSTs. For WekaIO, you run the Weka agent on each node and the distributed NVMe pool forms automatically. BeeOND is simpler: it runs the BeeGFS services as containers directly on the GPU nodes with zero additional infrastructure.

FSx for Lustre charges $0.17-$0.30/GB/month depending on throughput tier (AWS also offers an Intelligent-Tiering storage class starting around $0.005/GB-month for cold data), with data transfer charges for cross-AZ or cross-region access. Self-managed Lustre on Spheron uses your GPU nodes' local NVMe (already paid for) as OSTs, so the marginal cost of the parallel FS layer is just the MDS server ($0.50-2.00/hr for a CPU node) plus the NVMe already attached to your training nodes. At a 32-node H200 cluster running 80% utilization over a month, FSx Lustre at 200 MB/s/TB costs $4,800-9,600/month more than self-managed Lustre, with no additional throughput.

BeeOND (BeeGFS On-Demand) is BeeGFS configured to run storage services as lightweight daemons directly on the same nodes as your compute workloads. Instead of dedicated storage nodes, each GPU node contributes its local NVMe to a shared parallel namespace. BeeOND is ephemeral by default - the parallel FS exists only for the duration of the job - which makes it ideal for scratch storage during fine-tuning. Full BeeGFS uses separate storage nodes with persistent data, suitable for shared datasets accessed by multiple concurrent jobs.
