Tutorial

Slurm for AI Workloads on GPU Cloud: HPC-Style Job Scheduling for LLM Training and Batch Inference (2026 Guide)

Written by Mitrasish, Co-founder · May 11, 2026
Tags: Slurm GPU Cloud · Slurm for AI Training · Slurm vs Kubernetes AI · HPC Scheduler LLM Training · Slurm Multi-Node GPUs · Batch GPU Training · Slurm Cluster Setup · GPU Cloud HPC

Slurm is the default job scheduler at Meta, most national supercomputing centers, and nearly every university HPC cluster running serious AI workloads, and it is used in production at AI labs including Mistral. Yet almost every GPU cloud tutorial assumes Kubernetes. If you are migrating from an on-prem HPC cluster, or just want a simpler batch scheduling model for training jobs, this guide shows you how to run Slurm on cloud GPU nodes: architecture, cluster setup, sbatch patterns for multi-node LLM training, Pyxis containers, topology-aware scheduling, and cost optimization with spot instances.

Why Frontier Labs Run Slurm

The core reason is gang scheduling. When you submit a 4-node training job to Slurm, all four nodes start simultaneously or none start. The job does not begin until the full allocation is available. This matters because distributed training is intolerant of partial starts: a PyTorch dist.barrier() call blocks indefinitely if one rank never shows up.
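
You can see the gang semantics directly from the shell. As a minimal illustration (using the cluster built later in this guide), an interactive allocation request simply blocks until every requested node is free at once:

bash
# salloc does not return a prompt until all 4 nodes (32 GPUs) can be
# allocated together; nothing runs on a partial allocation
salloc --nodes=4 --gres=gpu:8 --exclusive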

Kubernetes does not have native gang scheduling. It schedules pods independently, and a 4-node training job can end up with 3 pods running and the fourth stuck in pending because no node has free GPUs. This causes silent hangs and wasted billing time. KAI Scheduler and Volcano add gang scheduling to Kubernetes, but they add operational complexity that Slurm users do not need.

Beyond gang scheduling:

  • Fair-share scheduling. Slurm's multifactor priority system automatically decays the priority of teams that over-consume GPU resources, redistributing capacity to under-served users. No manual queue management needed.
  • Native MPI integration. mpirun binds to srun task slots naturally. For non-PyTorch HPC workloads, OpenMPI over Slurm is a solved problem.
  • Simple job semantics. A Slurm job is a shell script with #SBATCH directives. There is no YAML object graph, no custom resource definition, no controller to debug. Submit a script, get results, read logs.
  • Zero toolchain change. Researchers moving from a university cluster to cloud GPUs can copy their sbatch scripts with minor modifications. The learning curve is flat.

This is not a claim that Slurm is universally better. Slurm solves a different problem set than Kubernetes. The right choice depends on what you are building.

Slurm vs Kubernetes for AI: When Each Wins

Dimension              | Slurm                            | Kubernetes
Job type               | Batch training, HPC, MPI         | Always-on inference, serving
Gang scheduling        | Native                           | Requires KAI Scheduler or Volcano
Container support      | Via Pyxis + Enroot               | Native
Auto-scaling           | Elastic plugins (cloud bursting) | KEDA, Knative
Multi-tenancy          | Fair-share queues                | Namespaces + resource quotas
Topology awareness     | topology.conf (native)           | Node affinity + labels
Existing HPC migration | Zero toolchain change            | Full rewrite
Inference serving      | Not designed for it              | Designed for it

The decision is usually straightforward: if you run training jobs that start, run for hours, checkpoint, and terminate, Slurm wins on simplicity. If you need auto-scaling HTTP inference endpoints or microservice architectures alongside AI, Kubernetes wins on ecosystem. For teams doing both, running Slurm for training and Kubernetes for inference serving is a common split. For a deep look at the Kubernetes side, the Kubernetes GPU scheduling with DRA and KAI Scheduler guide covers the full stack.

Slurm Architecture for GPU Clusters

A Slurm cluster has three main components:

slurmctld (the controller daemon) runs on the head node. It manages the job queue, allocates resources, and dispatches jobs to compute nodes. There is typically one active controller with an optional standby for high availability.

slurmd (the compute node daemon) runs on every GPU node. It receives job steps from the controller, launches processes, and reports resource usage and health back to the controller.

slurmdbd (the database daemon) stores job accounting data in MariaDB or MySQL. Required for fair-share scheduling and usage reporting. Runs on the controller node or a dedicated host.
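
A few standard commands, run where noted, confirm that each daemon is healthy:

bash
# On the controller: is slurmctld responding?
scontrol ping

# On the controller: is slurmdbd reachable and the cluster registered?
sacctmgr show cluster

# On the controller: are all compute nodes reporting in?
sinfo -N -l

# On a compute node: is the local slurmd running?
sudo systemctl status slurmd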

The GRES (Generic Resource) system is how Slurm tracks GPUs. You declare GPU resources in two files:

# /etc/slurm/slurm.conf (controller)
ClusterName=gpu-cluster
SlurmctldHost=controller-node
GresTypes=gpu

NodeName=gpu-node-[001-004] Gres=gpu:h100:8 CPUs=128 RealMemory=2048000 State=UNKNOWN

PartitionName=train Nodes=gpu-node-[001-004] Default=YES MaxTime=168:00:00 State=UP
# /etc/slurm/gres.conf (each compute node)
NodeName=gpu-node-001 Name=gpu Type=h100 File=/dev/nvidia[0-7]
NodeName=gpu-node-002 Name=gpu Type=h100 File=/dev/nvidia[0-7]
NodeName=gpu-node-003 Name=gpu Type=h100 File=/dev/nvidia[0-7]
NodeName=gpu-node-004 Name=gpu Type=h100 File=/dev/nvidia[0-7]

The File=/dev/nvidia[0-7] binding tells Slurm to set CUDA_VISIBLE_DEVICES correctly for each job, preventing GPU conflicts between concurrent jobs on the same node.
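
A quick way to see that isolation in action (illustrative one-liners against the cluster above):

bash
# Two concurrent 4-GPU jobs on the same node each see a disjoint GPU set
sbatch --gres=gpu:4 --wrap='echo $CUDA_VISIBLE_DEVICES'   # e.g. 0,1,2,3
sbatch --gres=gpu:4 --wrap='echo $CUDA_VISIBLE_DEVICES'   # e.g. 4,5,6,7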

Provisioning a Slurm Cluster on GPU Cloud: 4-8 Node H100 Walkthrough

This walkthrough provisions a 4-node H100 SXM5 cluster. Adjust node counts and GPU types as needed.

Step 1: Provision the controller node. Rent one CPU-only instance (8-16 cores, 32-64 GB RAM) as the controller. It does not need GPUs. This node will run slurmctld and slurmdbd.

Step 2: Provision compute nodes. Rent 4 H100 SXM5 bare-metal instances on Spheron. All nodes must be on the same private subnet so they can reach each other without NAT. Note the private IPs of each compute node.

Step 3: Set up shared storage. Create an NFS server (or use a managed NFS service) and export /home and /scratch to all nodes. Shared home directories are required so that sbatch scripts and dataset paths resolve identically on every node. For /scratch, use fast local NVMe for dataset reads and NFS only for checkpoints.
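
A minimal sketch of the export and mount steps, assuming a dedicated NFS host (the nfs-server hostname and the 10.0.0.0/24 subnet are placeholders):

bash
# On the NFS server: export /home and /scratch to the cluster subnet
echo "/home    10.0.0.0/24(rw,sync,no_root_squash)" | sudo tee -a /etc/exports
echo "/scratch 10.0.0.0/24(rw,sync,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On every cluster node: mount the shares (add to /etc/fstab for persistence)
sudo apt-get install -y nfs-common
sudo mount -t nfs nfs-server:/home /home
sudo mount -t nfs nfs-server:/scratch /scratch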

Step 4: Install Slurm. On Ubuntu 22.04, the slurm-wlm package installs both slurmctld and slurmd:

bash
sudo apt-get update && sudo apt-get install -y slurm-wlm slurmdbd mariadb-server munge

Step 5: Generate and distribute the Munge key. Munge is the authentication system Slurm uses. The key must be byte-identical on every node:

bash
# On the controller (Ubuntu 22.04 / munge < 0.5.15):
sudo create-munge-key
# If your system has munge >= 0.5.15 use: sudo mungekey --create
sudo systemctl enable --now munge

# Copy to each compute node (use a secrets manager in production)
scp /etc/munge/munge.key user@gpu-node-001:/tmp/munge.key
ssh gpu-node-001 "sudo mv /tmp/munge.key /etc/munge/munge.key && sudo chown munge:munge /etc/munge/munge.key && sudo chmod 400 /etc/munge/munge.key && sudo systemctl enable --now munge"

Step 6: Write slurm.conf. Populate slurm.conf with your node names, specs, and partition definitions. Use the template from the Architecture section above, substituting your actual IP-resolved hostnames.
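
To avoid guessing CPU and memory values, slurmd can print the hardware line it detects on each node; paste that into slurm.conf (output abridged, values will differ):

bash
# Run on each compute node; copy the detected specs into slurm.conf
slurmd -C
# NodeName=gpu-node-001 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=1 RealMemory=2063731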

Step 7: Start the daemons. On the controller: sudo systemctl enable --now slurmctld slurmdbd. On each compute node: sudo systemctl enable --now slurmd.

Step 8: Verify the cluster. Run sinfo on the controller. If nodes show idle status, the cluster is ready. Run a sanity check:

bash
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00

srun nvidia-smi

Submit with sbatch sanity.sh, check its status with squeue, and review accounting with sacct -j $JOBID once it completes. The nvidia-smi output lands in the job's slurm-<jobid>.out file.

Topology-Aware Scheduling for Multi-Node Training

On a GPU cluster with InfiniBand, multi-node all-reduce traffic routes through a leaf-spine fabric. Nodes under the same leaf switch can communicate at full IB bandwidth. Nodes under different leaf switches cross the spine, adding latency and potentially sharing bandwidth.

Slurm's topology scheduling places jobs on nodes that minimize cross-switch hops. You configure it in topology.conf:

# /etc/slurm/topology.conf
# Two racks, four nodes per rack, connected through a spine switch
SwitchName=spine1 Switches=leaf1,leaf2
SwitchName=leaf1  Nodes=gpu-node-[001-004]
SwitchName=leaf2  Nodes=gpu-node-[005-008]

Enable it in slurm.conf:

TopologyPlugin=topology/tree

With topology/tree, Slurm attempts to place a 4-node job entirely under leaf1 before considering nodes across leaves. For single-rack clusters where all nodes share one switch, you can skip the topology plugin entirely.

Test placement without actually running a job:

bash
srun --nodes=4 --gres=gpu:8 --test-only /bin/true

The output shows which nodes Slurm would allocate.

On the NCCL side, align your environment variables with the IB fabric topology:

bash
# Pin NCCL to the correct InfiniBand HCAs (check with ibstat)
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1

# Use GID index 3 for RoCEv2 (or 0 for IB)
export NCCL_IB_GID_INDEX=3

# Enable GPU Direct RDMA reads
export NCCL_NET_GDR_READ=1

# Set the socket interface for inter-node rendezvous
export NCCL_SOCKET_IFNAME=ib0

On cloud bare-metal nodes, IB device names may differ from on-prem clusters. Always check with ibstat before setting NCCL_IB_HCA. For more detail on IB vs RoCE tradeoffs, see the InfiniBand vs RoCE fabric selection guide. For the full set of NCCL environment variables, see NCCL tuning for multi-node training.
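
Before committing to a 48-hour run, it is worth validating the fabric with nccl-tests. A short sketch, assuming the all_reduce_perf binary from github.com/NVIDIA/nccl-tests is built and on $PATH on every node:

bash
# Measure all-reduce bus bandwidth across the full 4-node, 32-GPU allocation
srun --nodes=4 --ntasks-per-node=8 --gres=gpu:8 \
  all_reduce_perf -b 1G -e 8G -f 2 -g 1
# Compare the reported busbw against the expected NVLink / InfiniBand bandwidth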

Running LLM Training Jobs: torchrun, FSDP, and DeepSpeed Under sbatch

Here is a complete sbatch script for a 4-node, 32-GPU FSDP training job:

bash
#!/bin/bash
#SBATCH --job-name=llm-fsdp-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --exclusive
#SBATCH --time=48:00:00
#SBATCH --output=logs/%j/train.out
#SBATCH --error=logs/%j/train.err

# Extract the first node as the rendezvous master
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# InfiniBand settings - check ibstat for your HCA names
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_READ=1
export NCCL_SOCKET_IFNAME=ib0

# Launch torchrun on each node via srun
# srun starts one task per node; torchrun starts 8 GPU workers within each task
srun torchrun \
  --nnodes=$SLURM_JOB_NUM_NODES \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  --rdzv_id=$SLURM_JOB_ID \
  train.py \
  --model_name meta-llama/Llama-3-70b \
  --fsdp_sharding_strategy FULL_SHARD \
  --gradient_checkpointing

--rdzv_backend=c10d with --rdzv_endpoint uses PyTorch's C10d rendezvous for node discovery; --rdzv_id just needs to be unique per job, and $SLURM_JOB_ID works well. It is more reliable than the default static backend, which needs a fixed --master_addr/--master_port plus an explicit --node_rank per node, in cluster environments where nodes may have slightly different startup times.

For DeepSpeed, let srun handle distribution and skip the DeepSpeed CLI launcher entirely:

bash
srun --ntasks-per-node=8 python train_ds.py --deepspeed ds_config.json

The script calls deepspeed.initialize() internally, so srun assigns ranks and manages inter-node communication without a conflicting second launcher. If you prefer using the DeepSpeed CLI as the sole outer launcher (not wrapped in srun), use a hostfile instead:

bash
# Build a hostfile from the nodes SLURM allocated (8 slots = 8 GPUs per node)
scontrol show hostnames "$SLURM_JOB_NODELIST" \
  | awk '{print $1 " slots=8"}' > /tmp/hostfile
HOSTFILE=/tmp/hostfile

deepspeed \
  --hostfile=$HOSTFILE \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  train_ds.py \
  --deepspeed ds_config.json

Monitoring Running Jobs

Check the queue and job status:

bash
# See all running and pending jobs
squeue -u $USER

# Detailed accounting for a completed or running job
sacct -j $JOBID --format=JobID,Elapsed,CPUTime,NCPUS,AllocTRES,State

# Check GPU utilization on allocated nodes without SSH
srun --jobid=$JOBID --overlap nvidia-smi

For the full multi-node FSDP and DeepSpeed ZeRO-3 setup, including memory math and checkpoint strategies, see the FSDP and DeepSpeed multi-node setup guide.

Pyxis and Enroot: Containerized Slurm Without Performance Loss

The standard way to run containers in Slurm is via Pyxis and Enroot. The combination gives you full OCI container support, rootless execution, and GPU passthrough with near-zero performance overhead.

Why not Docker? Docker requires a root daemon, which is a security risk on shared HPC clusters. Most cluster admins do not allow it. Enroot is rootless: it unpacks Docker/OCI images into squashfs files and mounts them without a daemon. NVIDIA Container Toolkit hooks inside Enroot handle GPU device access.

How Pyxis works. Pyxis is a Slurm SPANK plugin that extends srun and sbatch with container flags. Once installed, your sbatch scripts gain --container-image and --container-mounts options that work identically to regular job flags.

Installation (abbreviated):

bash
# On all compute nodes: install Enroot
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot_3.5.0-1_amd64.deb
sudo apt-get install -y ./enroot_3.5.0-1_amd64.deb

# Install the NVIDIA container toolkit; Enroot's bundled GPU hook
# (hooks.d/98-nvidia.sh) uses nvidia-container-cli to expose GPUs in containers
sudo apt-get install -y nvidia-container-toolkit

# On all nodes: install Pyxis
# Download from github.com/NVIDIA/pyxis and build against your Slurm headers
# Then register in /etc/slurm/plugstack.conf:
# required /usr/local/lib/slurm/spank_pyxis.so

Using containers in sbatch:

bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --exclusive

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun \
  --container-image=nvcr.io/nvidia/pytorch:24.12-py3 \
  --container-mounts=/data:/data,/scratch:/scratch \
  torchrun \
  --nnodes=$SLURM_JOB_NUM_NODES \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29500 \
  --rdzv_id=$SLURM_JOB_ID \
  train.py

Enroot imports the image on first use and caches a squashfs on each node. Subsequent runs mount from cache with near-zero startup time. For GPU-bound training workloads, squashfs mounts add less than 5% overhead vs bare-metal execution.
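
If the first-use import delay matters, you can warm each node's cache before the real job. A sketch using enroot import directly; the output path is illustrative and should point at node-local storage:

bash
# Pre-pull the image into a node-local squashfs on every node, then point
# --container-image at the file path instead of the registry URI
srun --nodes=4 --ntasks-per-node=1 \
  enroot import --output /var/tmp/pytorch-24.12.sqsh \
  docker://nvcr.io#nvidia/pytorch:24.12-py3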

Fair-Share Scheduling, GPU Partitions, and Preemption

For multi-team clusters, Slurm's fair-share scheduler prevents any single team from monopolizing GPU resources.

Partition setup for a research cluster:

# slurm.conf partition definitions
PartitionName=debug  Nodes=gpu-node-001 MaxTime=02:00:00 MaxCPUsPerNode=8 State=UP
PartitionName=train  Nodes=gpu-node-[001-008] MaxTime=168:00:00 State=UP Default=YES
PartitionName=priority Nodes=gpu-node-[001-008] MaxTime=720:00:00 PriorityJobFactor=2 State=UP

Enable fair-share scheduling:

# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
PriorityType=priority/multifactor
PriorityWeightFairshare=50000
PriorityWeightAge=1000
PriorityWeightJobSize=1000

Fair-share works by comparing each user's or account's historical usage against their allocated share. Teams that have consumed more GPU-hours than their share get lower priority; teams that have used less get a boost. The PriorityWeightFairshare parameter controls how aggressively historical usage influences the queue.
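
Shares are assigned per account with sacctmgr, and sshare shows how historical usage translates into the current fair-share factor (account names and share values here are illustrative):

bash
# Give team-a 40% and team-b 60% of the cluster's fair-share weight
sacctmgr modify account team-a set fairshare=40
sacctmgr modify account team-b set fairshare=60

# Inspect raw usage vs. shares and the resulting priority factor
sshare -a -o Account,RawShares,RawUsage,FairShare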

Preemption allows high-priority jobs to evict lower-priority ones:

# slurm.conf
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

With REQUEUE, evicted jobs re-enter the queue and restart from their last checkpoint when resources free up. This pairs well with frequent checkpointing in training scripts.

QOS per team sets hard limits:

bash
# Create accounts for each team
sacctmgr add account team-a Description="Team A" Organization=research
sacctmgr add user alice Account=team-a

# Create a QOS limiting GPU-hours per month
sacctmgr add qos team-a-qos GrpTRESMins="gres/gpu=43200"  # 43200 GPU-minutes = 720 GPU-hours
sacctmgr modify account team-a set QOS=team-a-qos

Cost Optimization: Spot Instances and Elastic Slurm

Spot pricing cuts GPU costs significantly on workloads that can tolerate preemption. Here are current Spheron prices for the most common Slurm training configurations:

GPU       | Type          | On-Demand (per GPU/hr) | 8-GPU Node (on-demand/hr)
H100 SXM5 | NVIDIA Hopper | $4.21                  | $33.68
A100 80GB | NVIDIA Ampere | $1.04                  | $8.32

Pricing fluctuates with GPU availability. The prices above were captured on May 11, 2026 and may have changed since. Check current GPU pricing → for live rates.

Spot-safe job patterns for Slurm. Add --requeue to your sbatch script so that if a spot node is reclaimed, Slurm requeues the job automatically:

bash
#SBATCH --requeue

Pair this with frequent checkpointing in your training script. Save a checkpoint every 100-500 steps to shared NFS. On restart, load from the latest checkpoint:

python
# In your training loop
if step % checkpoint_interval == 0:
    torch.save({"step": step, "model": model.state_dict(), ...}, f"/scratch/ckpt/step_{step}.pt")
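
On the restart side, a small wrapper in the sbatch script can find the newest checkpoint and hand it to the trainer, reusing the rendezvous variables from the training script above (the --resume flag and paths are illustrative; wire them to your own argument parser):

bash
# Pick the newest checkpoint, if any, and resume from it
LATEST=$(ls -1v /scratch/ckpt/step_*.pt 2>/dev/null | tail -n 1)
srun torchrun --nnodes=$SLURM_JOB_NUM_NODES --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT --rdzv_id=$SLURM_JOB_ID \
  train.py ${LATEST:+--resume "$LATEST"}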

For automatic restart after preemption:

bash
# Submit job and capture the job ID
JOBID=$(sbatch --parsable train.sh)

# Set up a dependency job that restarts if the first fails (exit code != 0)
sbatch --dependency=afternotok:$JOBID train.sh

Elastic Slurm (cloud bursting) adds and removes nodes dynamically via ResumeProgram and SuspendProgram hooks in slurm.conf. These hooks call cloud provider APIs to provision or terminate nodes as the queue grows or shrinks. SchedMD's cloud scheduling documentation covers the configuration; the key points are that burstable nodes are predefined in slurm.conf with State=CLOUD, and the resume hook must register each booted node's address with the controller (scontrol update NodeName=... NodeAddr=...) before its slurmd comes up.
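
A sketch of the relevant slurm.conf knobs, with illustrative script paths and a predefined pool of burstable nodes:

# slurm.conf (power-save / cloud bursting sketch)
ResumeProgram=/opt/slurm/bin/resume_nodes.sh     # calls the cloud API to boot nodes
SuspendProgram=/opt/slurm/bin/suspend_nodes.sh   # terminates nodes once idle
SuspendTime=600                                  # power down after 10 idle minutes
ResumeTimeout=900                                # allow 15 minutes for boot and slurmd registration
NodeName=burst-[001-016] State=CLOUD Gres=gpu:h100:8 CPUs=128 RealMemory=2048000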

Cost attribution with sacct:

bash
# H100 GPU-hours consumed by each user this month
sacct --allocations --starttime=$(date -d "1 month ago" +%Y-%m-%d) \
  --format=User,AllocTRES,ElapsedRaw --state=COMPLETED --noheader --parsable2 \
  | awk -F'|' '$2 ~ /gres\/gpu:h100=/ {
      n = $2; sub(/.*gres\/gpu:h100=/, "", n); sub(/,.*/, "", n)
      gpuhours[$1] += n * $3 / 3600
    } END {for (u in gpuhours) print u, gpuhours[u], "GPU-hrs"}'

Run the same accounting query on A100 clusters or H100 nodes to get per-team GPU spend for chargebacks. For a detailed comparison of when spot vs on-demand vs reserved makes sense, see on-demand vs spot vs reserved GPU instances. For a real case study, the spot GPU training cost analysis shows how a 70B training run was completed for $11,200 on spot GPUs.

Slurm for Batch Inference: When It Beats Always-On Kubernetes

Batch inference is an underrated Slurm use case. If you are running overnight embedding generation, batch LLM evaluation, or dataset scoring pipelines, you are probably paying 24/7 for a Kubernetes deployment that is idle 18 hours a day.

Slurm array jobs handle embarrassingly parallel inference efficiently:

bash
#!/bin/bash
#SBATCH --job-name=batch-embed
#SBATCH --array=0-999          # 1000 shards
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00

# Each task processes one shard
SHARD_ID=$SLURM_ARRAY_TASK_ID

python embed.py \
  --shard $SHARD_ID \
  --total-shards 1000 \
  --input /data/corpus \
  --output /scratch/embeddings/shard_${SHARD_ID}.npy

Slurm schedules array tasks across available GPUs as they free up. A 1000-shard job on a 10-GPU cluster runs as roughly 100 back-to-back waves of 10 tasks, keeping every GPU busy until the queue drains. Compare that to an always-on Kubernetes deployment: even with 10 replicas running 24/7, idle time during off-hours still bills at the full rate.
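
If the cluster is shared, you can also throttle how many array tasks run at once with the % suffix, a one-line change to the script above:

bash
#SBATCH --array=0-999%40   # at most 40 array tasks run concurrently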

The economics flip when you need low p50 latency or auto-scaling HTTP endpoints. Slurm has no native HTTP serving layer. For production inference serving with latency SLAs, Kubernetes wins. For batch LLM inference scheduling where throughput matters more than latency, Slurm array jobs are the simpler choice.

RLHF training workflows fit naturally into Slurm: reward-model inference and policy training each become separate sbatch jobs with --dependency linking them. For an overview of the major RLHF frameworks, see the verl, OpenRLHF, and TRL training infrastructure guide.
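
A minimal sketch of that chaining, with illustrative script names:

bash
# Score rollouts with the reward model, then run the policy update only if scoring succeeds
SCORE_ID=$(sbatch --parsable score_rollouts.sh)
sbatch --dependency=afterok:$SCORE_ID update_policy.sh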

Migrating from On-Prem Slurm to GPU Cloud

Engineers moving from on-prem HPC clusters to cloud Slurm hit a predictable set of gotchas.

Storage. On-prem clusters usually have Lustre or IBM Storage Scale with hundreds of GB/s aggregate bandwidth. On cloud, you are working with NFS over network block storage. Shared storage I/O is often the bottleneck, not compute. Mitigate this by mounting training datasets to fast local NVMe on each compute node and only using NFS for checkpoints and model outputs. Many cloud providers offer local NVMe on bare-metal GPU nodes.
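
A common pattern is to stage the dataset from NFS to node-local NVMe at the top of the sbatch script, once per node. A sketch with illustrative mount points:

bash
# Copy the dataset to node-local NVMe on every allocated node before training starts
srun --ntasks-per-node=1 \
  rsync -a /nfs/datasets/corpus/ /mnt/nvme/corpus/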

Networking. InfiniBand is available on bare-metal H100 and A100 nodes from some cloud providers. Verify IB availability before provisioning. If IB is unavailable, RoCEv2 over 100GbE substitutes for most training workloads at moderate scale. The multi-node training without InfiniBand on cloud guide covers the tradeoffs and configuration in detail.

Licensing. Slurm is open source under the GNU GPLv2 license. No license cost. The database backend (slurmdbd) requires MariaDB or MySQL; budget a small instance for that, or use a managed database service.

Node naming. Cloud instances use dynamic hostnames or IP-based names that change on reprovision. Automate NodeName entries in slurm.conf via Terraform or a startup script that registers each node with the controller on boot. The controller must be able to resolve every compute node hostname.
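
One workable pattern (a sketch, assuming the node names are already defined in slurm.conf) is a boot-time script on each compute node that registers its current address with the controller and then starts slurmd:

bash
# Run at node boot, after slurm.conf and the Munge key are in place
NODE=$(hostname -s)
ADDR=$(hostname -I | awk '{print $1}')
scontrol update NodeName=$NODE NodeAddr=$ADDR NodeHostname=$NODE
sudo systemctl restart slurmd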

MPI. On-prem clusters often run OpenMPI extensively. For PyTorch-based LLM training, NCCL replaces MPI entirely. OpenMPI still works for non-PyTorch HPC codes via srun; install libopenmpi-dev on all nodes and it works the same as on-prem.

Munge key distribution. The Munge authentication key must be byte-identical on all nodes. In production, store it in a secrets manager (AWS Secrets Manager, HashiCorp Vault, etc.) and retrieve it during node initialization before slurmd starts. Do not copy keys over SSH manually in scripts that run on node boot.
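
With HashiCorp Vault, for example, the node-init script can pull the key before enabling munge. A sketch; the secret path and field name are illustrative and assume the key was stored base64-encoded:

bash
# Fetch the shared Munge key from Vault and install it with correct permissions
vault kv get -field=munge_key secret/slurm | base64 -d | sudo tee /etc/munge/munge.key > /dev/null
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl enable --now munge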

Spheron's bare-metal H100 and A100 instances with InfiniBand networking are built for exactly this kind of workload. You bring your own scheduler, whether Slurm, Ray, or something custom, and get raw HPC performance without managed-Kubernetes overhead. No lock-in, per-minute billing, full root access.

Rent H100 on Spheron → | Rent A100 → | View all GPU pricing →
