Engineering

RLHF Training Infrastructure on GPU Cloud: verl, OpenRLHF, and TRL for Production Reward Modeling (2026)

Written by Mitrasish, Co-founder · May 8, 2026

RLHF tooling in 2026 has split into three distinct frameworks with meaningfully different architectures. verl, OpenRLHF, and TRL each solve the same problem - reward model training plus PPO optimization - but make different tradeoffs on scale, heterogeneous hardware, and setup complexity. Picking the wrong one for your cluster size wastes days of debugging.

This guide is for teams that have already completed SFT and need to add the full RLHF layer on top. If you're evaluating whether you need PPO at all, GRPO eliminates the critic model and is the right choice for verifiable reasoning tasks, while DPO handles preference alignment without any online generation. PPO with a trained reward model is the right choice when neither of those apply: you need a general-purpose scalar reward signal and can tolerate the infrastructure overhead.

The infrastructure overhead is real. A full PPO run simultaneously manages four model copies: actor, critic, reference, and reward. Each has a different memory footprint, different compute shape, and different checkpoint requirements. Getting this wired up on multi-node GPU clusters requires decisions about framework selection, VRAM allocation, spot vs. on-demand placement, and rollout disaggregation that aren't obvious from framework documentation. This guide covers all of it.

The RLHF Pipeline in 2026

The canonical RLHF pipeline runs in three sequential phases. Phase one trains the SFT base model on a curated instruction dataset. Phase two trains the reward model on human preference data. Phase three runs the PPO loop using that reward model to optimize the policy.

Most guides focus on phase two and three together, but they're distinct workloads with different GPU requirements. Reward model training is standard fine-tuning: one forward pass, one backward pass, gradient update. PPO is fundamentally different: it requires simultaneously running four models, generating rollouts, scoring them, computing advantages, and updating two models.

The four roles in a PPO run:

| Stage | Model | GPU Memory Pattern | Can Share Node? |
| --- | --- | --- | --- |
| Actor | Online policy (training) | BF16 weights + AdamW optimizer | Yes, but VRAM-constrained |
| Critic | Value model (training) | BF16 weights + AdamW optimizer | Yes, with actor if enough VRAM |
| Reference | Frozen SFT checkpoint (inference) | BF16 weights only | Yes, with reward model |
| Reward | Frozen reward model (inference) | BF16 weights only | Yes, with reference |
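
The reward and reference roles exist purely to shape the training signal: the scalar reward-model score lands on the final token of each response, and a per-token KL penalty against the frozen reference keeps the policy from drifting. A minimal sketch of that shaping step in generic PyTorch (not any framework's internal code - the tensor layout is an assumption for illustration):

python
import torch

def shaped_rewards(rm_score: torch.Tensor,        # [batch] scalar reward-model scores
                   policy_logprobs: torch.Tensor, # [batch, seq] actor log-probs
                   ref_logprobs: torch.Tensor,    # [batch, seq] reference log-probs
                   beta: float = 0.05) -> torch.Tensor:
    # Per-token KL estimate between the policy and the frozen reference
    kl = policy_logprobs - ref_logprobs
    rewards = -beta * kl            # KL penalty applied at every token
    rewards[:, -1] += rm_score      # scalar RM score added on the last token
    return rewards                  # [batch, seq], fed into advantage estimation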

GRPO eliminates the critic entirely by using group-relative advantage estimates within the rollout batch. That cuts VRAM by 30-40% and removes one unstable optimization. If your task has verifiable rewards, GRPO is almost always a better choice than PPO. PPO makes sense when you need a trained reward model that can score outputs on subjective or complex criteria that can't be expressed as a deterministic function.
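
The critic-free advantage computation that makes this possible is only a few lines. A sketch assuming rewards holds the scores for G sampled responses per prompt:

python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each response is scored against the
    mean/std of its own group of G samples, so no learned value model
    (critic) is needed. rewards: [num_prompts, G]."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)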

Framework Overview

Each framework picks a different point on the scale-vs-flexibility curve. The detailed comparison matrix is later in the post; the next three sections describe each framework's architecture and the workloads it fits.

verl

verl (Volcano Engine Reinforcement Learning) started at ByteDance and now has substantial adoption in the open-source community. Its key architectural insight is the HybridEngine: instead of running separate training and inference processes, verl swaps the weight layout in-place between FSDP training shards and vLLM inference shards on the same GPUs.

This avoids the major alternative: allocating a second set of GPUs just for rollout generation. In a standard disaggregated setup, if your actor needs 8 GPUs for training, you'd allocate another 8 for vLLM rollout inference. verl's HybridEngine reuses the same 8 GPUs for both, alternating between training and generation phases within each PPO iteration. The tradeoff is that training and rollout are serialized rather than pipelined.

The config structure for verl is YAML-based with nested sections for each role:

yaml
# verl PPO config (verl>=0.3)
actor_rollout_ref:
  model:
    path: meta-llama/Llama-3.3-70B-Instruct
  actor:
    optim:
      lr: 1e-6
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 4
    fsdp_config:
      param_offload: false
      optimizer_offload: false
  rollout:
    name: vllm
    tensor_model_parallel_size: 4
    gpu_memory_utilization: 0.85
    rollout_batch_size: 1024
  ref:
    fsdp_config:
      param_offload: true  # reference is frozen (no optimizer state); offload its params to CPU between forward passes

critic:
  model:
    path: meta-llama/Llama-3.3-70B-Instruct
  optim:
    lr: 1e-5
  cliprange_value: 0.5

data:
  train_files: /data/preference/train.parquet
  prompt_key: prompt
  max_prompt_length: 512
  max_response_length: 1024

trainer:
  n_gpus_per_node: 8
  nnodes: 2
  total_epochs: 3
  save_freq: 50

The tensor_model_parallel_size: 4 setting for vLLM rollout splits each rollout group across 4 GPUs, which leaves room for two parallel rollout groups on a single 8-GPU node. TP=8 is the more common production choice for 70B on a single node if you want the full 8 GPUs dedicated to one rollout group and maximum KV cache headroom for long sequences.
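
The KV cache headroom tradeoff is easy to make concrete, since per-token KV cost is fixed by the architecture. For Llama-3.3-70B (80 layers, 8 KV heads via GQA, head dim 128) with a BF16 cache:

python
# Per-token KV cache for Llama-3.3-70B, BF16:
# 2 (K and V) x layers x kv_heads x head_dim x 2 bytes
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"{kv_per_token / 2**20:.2f} MB/token")    # 0.31 MB/token

# The config above: rollout_batch_size=1024 at up to 1536 tokens
# (512 prompt + 1024 response), if all sequences ran concurrently:
total_gb = kv_per_token * 1024 * 1536 / 2**30
print(f"{total_gb:.0f} GB of KV cache")          # 480 GB

vLLM schedules sequences in waves rather than all at once, but the estimate shows why pooling all eight GPUs' spare VRAM into one TP=8 group handles long sequences more comfortably than two TP=4 groups.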

At 70B+ scale with Megatron-LM TP+PP, verl becomes the clearest choice. Megatron's tensor parallelism keeps inter-GPU communication within the NVLink domain, and pipeline parallelism handles the depth that FSDP sharding can't address efficiently at very large node counts.

OpenRLHF

OpenRLHF takes a different architectural approach: each role runs as an independent Ray actor pool. The actor has its own pool of GPUs, the critic has its own pool, the reference model has its own pool, and vLLM rollout workers run in a separate pool. Ray handles scheduling and data movement between pools.

The major practical advantage is heterogeneous cluster support. Your reward model only does inference, never training. You can put it on 4x H100 80GB while the actor runs on 2x 8xB200. OpenRLHF handles the placement. verl and TRL require uniform node configurations.

Launch command for a multi-node OpenRLHF run:

bash
python -m openrlhf.cli.train_ppo_ray \
  --actor_num_nodes 2 \
  --actor_num_gpus_per_node 8 \
  --ref_num_nodes 1 \
  --ref_num_gpus_per_node 4 \
  --critic_num_nodes 1 \
  --critic_num_gpus_per_node 8 \
  --reward_pretrain /checkpoints/reward_model_70b \
  --vllm_num_engines 4 \
  --vllm_tensor_parallel_size 4 \
  --pretrain meta-llama/Llama-3.3-70B-Instruct \
  --prompt_data trl-lib/ultrafeedback_binarized \
  --max_samples 100000 \
  --micro_train_batch_size 2 \
  --micro_rollout_batch_size 4 \
  --rollout_batch_size 1024 \
  --save_path /checkpoints/ppo_actor \
  --save_steps 50 \
  --logging_steps 1 \
  --train_batch_size 128 \
  --max_epochs 1 \
  --num_episodes 3 \
  --kl_target 0.02 \
  --init_kl_coef 0.01 \
  --normalize_reward

The --vllm_num_engines 4 argument spins up 4 separate vLLM engine instances as Ray actors. Each handles a shard of the rollout batch. Rollout and training run in parallel across their respective Ray pools, unlike verl's serialized HybridEngine approach.

The ceiling for OpenRLHF is cluster coordination overhead at very large node counts. Ray's scheduling latency between actor pools adds up when you have 16+ nodes with frequent cross-pool data transfers. For research workloads and medium-scale production runs (up to ~8 nodes), OpenRLHF's heterogeneous placement flexibility often outweighs the coordination cost.

TRL

TRL (Transformer Reinforcement Learning) from HuggingFace wraps the PPO algorithm behind the standard Trainer API pattern. If you ran SFT or DPO with TRL, the PPO setup looks familiar: same config pattern, same dataset format, same Accelerate-based multi-GPU launch.

The PPOTrainer is the core class. It manages the actor, reference model, and reward function internally:

python
import torch
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

config = PPOConfig(
    model_name="meta-llama/Llama-3.3-70B-Instruct",
    learning_rate=1e-6,
    batch_size=64,
    mini_batch_size=16,
    gradient_accumulation_steps=4,
    ppo_epochs=4,
    kl_penalty="kl",
    init_kl_coef=0.2,
    target=6.0,  # adaptive KL controller's target
    adap_kl_ctrl=True,
    horizon=10000,
    gamma=1.0,
    lam=0.95,
    cliprange=0.2,
    cliprange_value=0.2,
    vf_coef=0.1,
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# reward scoring happens externally in the training loop via a loaded reward model
trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

Note the ref_model=None - TRL can compute reference log-probabilities from a frozen copy of the initial policy rather than a separate model instance, saving VRAM at the cost of a second forward pass per update step.
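
The trainer above only wires up the models - the rollout, scoring, and update loop is written by hand. A sketch continuing the snippet, using TRL's legacy generate and step methods; dataloader, reward_model, and reward_tokenizer are assumed to be set up separately:

python
generation_kwargs = {"max_new_tokens": 256, "do_sample": True, "top_p": 0.95}

for batch in dataloader:                     # batch["input_ids"]: list of prompt tensors
    query_tensors = batch["input_ids"]

    # 1. Rollout: sample responses from the current policy
    response_tensors = trainer.generate(
        query_tensors, return_prompt=False, **generation_kwargs
    )

    # 2. Score: run the external reward model on decoded prompt+response text
    texts = [
        tokenizer.decode(torch.cat([q, r]), skip_special_tokens=True)
        for q, r in zip(query_tensors, response_tensors)
    ]
    inputs = reward_tokenizer(texts, padding=True, truncation=True,
                              return_tensors="pt").to(reward_model.device)
    with torch.no_grad():
        scores = reward_model(**inputs).logits.squeeze(-1)
    rewards = [s for s in scores]            # one scalar tensor per sample

    # 3. Update: one PPO optimization pass over the actor and value head
    stats = trainer.step(query_tensors, response_tensors, rewards)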

Multi-node TRL runs via accelerate launch with an FSDP config:

bash
accelerate launch \
  --config_file fsdp_config.yaml \
  --num_machines 2 \
  --num_processes 16 \
  --machine_rank $MACHINE_RANK \
  --main_process_ip $MASTER_ADDR \
  --main_process_port 29500 \
  train_ppo.py

TRL is the right choice for teams already invested in the HuggingFace ecosystem at 7B-30B scale. Beyond 30B on multi-node, the framework overhead and the lack of vLLM-native rollout scheduling become limiting factors. verl or OpenRLHF handle those scales more efficiently.

Framework Comparison Matrix

| Dimension | verl | OpenRLHF | TRL |
| --- | --- | --- | --- |
| Algorithms | PPO, GRPO, REINFORCE++, DPO | PPO, REINFORCE++, GRPO, RLOO | PPO, DPO, GRPO, KTO, SFT |
| Max tested scale | 70B+ (Megatron) | 70B+ (Ray) | ~30B (FSDP) |
| Rollout engine | vLLM (HybridEngine in-process) | vLLM (separate Ray actor pool) | vLLM (external) or HF generate |
| Training backend | FSDP2 or Megatron-LM | DeepSpeed ZeRO-3 | Accelerate (FSDP or DeepSpeed) |
| Critic model | Required for PPO | Required for PPO | Required for PPO |
| Heterogeneous cluster | No (uniform node config) | Yes (Ray actor placement) | No |
| Setup complexity | Medium | High | Low |
| Best for | Large-scale production RLHF | Research, heterogeneous sizing | SFT+RLHF teams on single stack |

GPU Sizing for Each RLHF Stage

The core VRAM formula for each role:

  • Reward model (BF16 inference): params × 2 bytes
  • Actor (BF16 + AdamW optimizer): params × 2 bytes (weights) + params × 12 bytes (FP32 master weights + first moment + second moment)
  • Critic: same as actor
  • Reference (BF16 frozen): params × 2 bytes

These are unsharded totals. FSDP/ZeRO-3 divides by GPU count. Add 15-20% headroom for activation peaks and rollout buffers.
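
Those formulas are quick to sanity-check in code. A rough per-GPU estimator under the same assumptions (BF16 weights, FP32 AdamW states, uniform sharding, 20% headroom):

python
def per_gpu_vram_gb(params_b: float, role: str, num_gpus: int,
                    headroom: float = 0.20) -> float:
    """Rough per-GPU VRAM for one RLHF role sharded across num_gpus.
    Training roles: 2 bytes/param (BF16 weights) + 12 bytes/param
    (FP32 master weights + AdamW moments). Frozen roles: 2 bytes/param."""
    bytes_per_param = {"actor": 14, "critic": 14, "reference": 2, "reward": 2}
    total_gb = params_b * bytes_per_param[role]   # params in billions -> GB
    return total_gb / num_gpus * (1 + headroom)

print(per_gpu_vram_gb(70, "actor", 8))   # ~147 GB/GPU: fits B200, not H100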

| Model Size | Reward Model VRAM | Actor + Optimizer VRAM | Critic VRAM | Recommended GPU | 8-GPU Nodes |
| --- | --- | --- | --- | --- | --- |
| 7B | 14 GB | 98 GB | 98 GB | H100 80GB | 1 (all roles) |
| 13B | 26 GB | 182 GB | 182 GB | H100 80GB | 1 (all roles, ZeRO-3) |
| 32B | 64 GB | 448 GB | 448 GB | B200 192GB | 1 (actor) + 1 (critic) |
| 70B | 140 GB | 980 GB | 980 GB | B200 192GB | 1 (actor) + 1 (critic) |

For detailed per-architecture VRAM breakdowns and activation memory math, see the GPU memory requirements guide for LLMs.

For 70B models, B200 (192 GB HBM3e) covers actor weights and optimizer when sharded 8-way at approximately 122 GB per GPU, leaving nearly 70 GB of headroom per card for activations and rollout buffers. B300 provides higher memory bandwidth for throughput-intensive runs. You can rent B200 on Spheron for RLHF runs where memory is the binding constraint, and B300 bare-metal instances for workloads demanding maximum throughput.

Multi-Node Setup

NCCL Tuning for InfiniBand

The actor and critic require gradient synchronization across nodes. Set these environment variables for InfiniBand-connected clusters:

bash
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_0,mlx5_1   # adjust to your HCA device names
export NCCL_NET_GDR_LEVEL=5
export NCCL_P2P_LEVEL=SYS
export NCCL_SOCKET_IFNAME=bond0     # or your primary network interface
export NCCL_DEBUG=WARN

For the full NCCL configuration guide including topology detection and bandwidth tuning, see NCCL tuning for multi-GPU LLM training. For the FSDP and Megatron backend setup that verl and OpenRLHF use internally, see the distributed LLM training guide.
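
Before committing to a multi-day run, a one-minute sanity check confirms that cross-node all-reduce actually traverses InfiniBand. A minimal script to launch on both nodes with the same torchrun rendezvous flags as the training job (a rough smoke test, not a benchmark - use nccl-tests for real bandwidth numbers):

python
import os, time
import torch
import torch.distributed as dist

# Launch on both nodes, e.g.:
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$NODE_RANK \
#     --master_addr=$MASTER_ADDR --master_port=29500 nccl_check.py
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.ones(1024**3 // 4, device="cuda")  # 1 GiB of FP32
dist.all_reduce(x)                           # warmup + communicator setup
torch.cuda.synchronize()

t0 = time.time()
dist.all_reduce(x)
torch.cuda.synchronize()
if dist.get_rank() == 0:
    # Multi-second timings for 1 GiB usually mean NCCL fell back to TCP.
    print(f"1 GiB all-reduce: {time.time() - t0:.3f}s")
dist.destroy_process_group()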

Rollout Node Colocation

Whether to colocate rollout with the actor trainer or run it on separate nodes depends on your framework:

  • verl HybridEngine: rollout and training share the same GPUs by design. No separate rollout nodes needed, but training and rollout are serialized within each PPO iteration.
  • OpenRLHF: vLLM rollout workers run as separate Ray actors. You can assign them to dedicated nodes to avoid TP group interference with the training backward pass.
  • TRL: typically colocated unless you configure an external vLLM server.

For large models (70B+), running rollout on separate nodes avoids the tensor-parallelism group reconfiguration that happens when verl's HybridEngine switches between training and inference layouts. Separate rollout nodes also let you use spot instances for the generation phase since they hold no optimizer state.

Batch Sizing for Stable PPO

PPO's effective batch size for advantage normalization is:

effective_batch = num_rollout_steps × rollout_batch_size

gradient_accumulation_steps splits those samples into smaller micro-batches for memory-efficient optimizer updates, but does not increase the number of unique trajectories collected. It affects gradient-update memory, not rollout sample diversity.

KL divergence estimates become noisy with small effective batches. Rule of thumb: num_rollout_steps × rollout_batch_size should be at least 512 unique samples for stable advantage normalization. With a rollout batch of 64, you need at least 8 rollout steps to reach that threshold.

| Rollout batch | Grad accum | Rollout steps | Unique rollout samples | KL estimate quality |
| --- | --- | --- | --- | --- |
| 64 | 4 | 2 | 128 | Marginal |
| 128 | 2 | 4 | 512 | Good |
| 256 | 1 | 4 | 1024 | Good |
| 512 | 1 | 2 | 1024 | Good |
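
A preflight assertion encodes the rule of thumb, so a misconfigured run fails before burning GPU hours (variable names mirror the config keys above; this is a convention sketch, not a framework API):

python
def check_ppo_batching(rollout_batch_size: int, num_rollout_steps: int,
                       min_effective: int = 512) -> None:
    """Fail fast if the effective batch is too small for stable
    advantage normalization and KL estimation."""
    effective = rollout_batch_size * num_rollout_steps
    assert effective >= min_effective, (
        f"effective batch {effective} < {min_effective}: increase "
        "rollout_batch_size or rollout steps (grad accum won't help)"
    )

check_ppo_batching(rollout_batch_size=128, num_rollout_steps=4)  # ok: 512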

torchrun Multi-Node Launch (verl)

bash
# Node 0 (rank 0, master)
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  -m verl.trainer.main_ppo \
  config=config/ppo_70b.yaml \
  trainer.nnodes=2

# Node 1 (rank 1, worker)
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=1 \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  -m verl.trainer.main_ppo \
  config=config/ppo_70b.yaml \
  trainer.nnodes=2

Cost Model: H100 vs B200 vs B300 for a 70B RLHF Run

Scenario: reward model training (24 hours, 8x GPU) followed by PPO loop (48 hours, 8x GPU for actor, 16x GPU for rollout workers, 8x GPU for critic plus reference).

| GPU | On-demand $/hr | Spot $/hr | Reward model (8x GPU, 24h) | PPO on-demand (32x GPU, 48h) | PPO mixed spot | Total (all on-demand) | Total (mixed spot PPO) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| H100 SXM5 | $4.21 | N/A | $808.32 | $6,466.56 | N/A | $7,274.88 | N/A |
| B200 SXM6 | $7.00 | $1.71 | $1,344.00 | $10,752.00 | $6,689.28 | $12,096.00 | $8,033.28 |
| B300 SXM6 | $9.77 | $2.45 | $1,875.84 | $15,006.72 | $9,384.96 | $16,882.56 | $11,260.80 |

Mixed spot assumes 16 spot GPUs for rollout workers plus 16 on-demand GPUs for actor and critic training (8 per role). Rollout workers hold no optimizer state and resume instantly on preemption. H100 SXM5 has no spot pricing available on Spheron, so mixed spot figures are not applicable for that GPU. The spot discount on B200 ($1.71 vs $7.00) and B300 ($2.45 vs $9.77) is substantial: mixed spot cuts the PPO loop cost by roughly 35-40% on either GPU compared to all on-demand.
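
The mixed-spot figures follow directly from that split. Reproducing the B200 row (rates taken from the table above):

python
on_demand, spot = 7.00, 1.71              # $/GPU-hr, B200 SXM6
ppo_hours, rm_hours = 48, 24

rollout = 16 * spot * ppo_hours           # 16 spot GPUs, rollout workers
training = 16 * on_demand * ppo_hours     # 16 on-demand GPUs, actor + critic
ppo_mixed = rollout + training
print(ppo_mixed)                          # 6689.28

reward_model = 8 * on_demand * rm_hours   # reward model phase, on-demand
print(reward_model + ppo_mixed)           # 8033.28 total, vs 12096.00 all on-demand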

For comparison, AWS p4de.24xlarge (8x A100 80GB) runs approximately $40-45/hr for the full node, or $5-6 per GPU-hour. At $4.21/hr, the H100 SXM5 on Spheron is roughly 15-30% cheaper per GPU-hour than AWS p4de. The tradeoff is that H100 (80 GB) has far less memory per card than B200 (192 GB), which forces more aggressive ZeRO-3 parameter sharding on 70B models where B200 can hold more parameters per card.

H100 GPU rental on Spheron is the most cost-effective entry point for RLHF runs where the model fits with ZeRO-3 sharding. B200 is the right upgrade when memory capacity changes the math enough to avoid aggressive sharding and increase batch sizes.

Pricing fluctuates based on GPU availability. The prices above are as of May 8, 2026 and may have changed. Check current GPU pricing → for live rates.

Reproducible Recipe: 70B Reward Model + PPO on Spheron

This section walks through a complete setup: reward model training followed by a PPO run, using verl on 2x 8xB200 nodes.

Step 1: Provision Nodes

Provision two 8xB200 SXM6 nodes via the Spheron dashboard or API. For the PPO run, use on-demand for the trainer node and spot for the rollout node. Provisioning takes a few minutes. SSH into both nodes and verify NCCL connectivity:

bash
# Check GPU availability
nvidia-smi

# Verify InfiniBand (if available on your node config)
ibstat | grep -E "State|Physical state"

See the Spheron docs for node provisioning and SSH access details.

Step 2: Install Dependencies

bash
pip install "verl>=0.3" "vllm>=0.4.0" transformers datasets trl accelerate
pip install flash-attn --no-build-isolation  # required for long-context runs

Step 3: Prepare Preference Data

The reward model needs (prompt, chosen, rejected) triples. trl-lib/ultrafeedback_binarized is a public ~63K-sample preference dataset covering instruction following, coding, and reasoning tasks:

python
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized")
# Format: {'prompt': str, 'chosen': [{'role': str, 'content': str}], 'rejected': [...]}
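
A quick look at one pair before training catches format surprises early (split and column names follow the dataset card as shown above; worth verifying against your copy):

python
sample = dataset["train"][0]
print(sample["prompt"][:200])                   # raw prompt text
print(sample["chosen"][-1]["content"][:200])    # preferred final assistant turn
print(sample["rejected"][-1]["content"][:200])  # dispreferred final assistant turn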

Step 4: Train the Reward Model

Reward model training uses TRL's RewardTrainer - it's a standard fine-tune on preference pairs:

python
import torch
from trl import RewardConfig, RewardTrainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    num_labels=1,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
model.config.pad_token_id = tokenizer.pad_token_id

training_args = RewardConfig(
    output_dir="/checkpoints/reward_model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    learning_rate=2e-5,
    bf16=True,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
    save_steps=500,
    logging_steps=50,
    max_length=2048,
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()

Launch with torchrun for multi-node using the same FSDP setup described in the Multi-Node Setup section above.

Step 5: Configure verl PPO

yaml
# config/ppo_70b.yaml
actor_rollout_ref:
  model:
    path: meta-llama/Llama-3.3-70B-Instruct
  actor:
    optim:
      lr: 1e-6
      weight_decay: 0.01
    ppo_mini_batch_size: 32
    ppo_micro_batch_size_per_gpu: 2
    clip_ratio: 0.2
  rollout:
    name: vllm
    tensor_model_parallel_size: 4
    gpu_memory_utilization: 0.80
    rollout_batch_size: 512
    temperature: 0.9
    top_p: 0.95
    max_tokens: 1024
  ref:
    fsdp_config:
      param_offload: true

critic:
  model:
    path: /checkpoints/reward_model   # initialize critic from reward model
  optim:
    lr: 1e-5
  cliprange_value: 0.5
  ppo_micro_batch_size_per_gpu: 2

reward_model:
  model:
    path: /checkpoints/reward_model
  micro_batch_size_per_gpu: 4

data:
  train_files: /data/prompts/train.parquet
  prompt_key: prompt
  max_prompt_length: 512
  max_response_length: 1024

trainer:
  n_gpus_per_node: 8
  nnodes: 2
  total_episodes: 10000
  total_epochs: 1
  ppo_epochs: 1
  save_freq: 50
  project_name: rlhf_70b
  experiment_name: ppo_run_01

Initializing the critic from the reward model checkpoint is standard practice - the reward model has already learned representations useful for value estimation.

Step 6: Launch

bash
# Node 0
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
  --master_addr=$MASTER_ADDR --master_port=29500 \
  -m verl.trainer.main_ppo config=config/ppo_70b.yaml

# Node 1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
  --master_addr=$MASTER_ADDR --master_port=29500 \
  -m verl.trainer.main_ppo config=config/ppo_70b.yaml

Step 7: Monitor

Watch these metrics during training. Log them to W&B or TensorBoard:

reward_mean        # should increase steadily, not spike
kl_div             # should stay below 0.1; rising trend signals policy drift
entropy            # should decrease slowly; collapse means policy is deterministic
val_loss           # critic loss; instability early in training is expected
ppo_clipfrac       # fraction of updates clipped by ratio bounds; >0.3 means LR too high

Common Failures and How to Fix Them

Reward Hacking

Symptom: reward_mean climbs consistently but held-out task evaluation accuracy falls or stays flat.

Cause: The policy has learned to exploit patterns in the reward model that don't generalize to actual task quality. The reward model overfit to surface features during training.

Fix: Add a format reward alongside the learned reward (e.g., penalize very short or very long outputs, reward structured formatting). Increase init_kl_coef to raise the cost of drifting from the reference policy. If the problem persists, retrain the reward model with more diverse preference pairs or add reward model ensembling.

KL Divergence Drift

Symptom: kl_div crosses 0.1 and keeps climbing across updates.

Cause: The policy is moving too far from the reference distribution per update step. Usually caused by a learning rate that's too high or a KL coefficient that's too low.

Fix: Raise init_kl_coef by 50%. Switch the KL estimator to the full-distribution form (kl_penalty="full" in TRL), which gives a lower-variance per-token estimate than the sampled-logprob difference. Verify the reference model is actually frozen and not being updated. Reduce ppo_epochs from 4 to 1 if the problem appears early.

Critic Divergence (NaN Loss)

Symptom: val_loss goes NaN in the first 100-200 steps.

Cause: The critic is being updated from random initialization with advantage estimates that are too large. The unbounded gradient signal causes overflow.

Fix: Warm up the critic for one epoch of supervised value regression before starting actor updates - verl exposes this as critic_warmup in the trainer config. Normalize advantages and clip them to [-5, 5] before computing the losses, and keep value clipping enabled (cliprange_value in the verl config).
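
The normalize-and-clip guard is two lines wherever advantages are computed - shown generically here, since frameworks typically expose it as a config switch rather than user code:

python
# Normalize, then clip advantage outliers before the policy/value losses.
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
advantages = advantages.clamp(-5.0, 5.0)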

OOM During Rollout

Symptom: CUDA OOM during the generation phase, not the training backward pass.

Cause: The rollout batch creates a large KV cache. At long sequence lengths this fills VRAM faster than expected.

Fix: Reduce rollout_batch_size by half. Set vllm_gpu_memory_utilization=0.80 (lower than the default 0.90) to give the KV cache less budget. If rollout is colocated with training (verl HybridEngine), consider disaggregating rollout to a separate node where the KV cache has the full GPU VRAM budget.

Stale Rollouts

Symptom: Training metrics look reasonable but the policy doesn't improve after 500+ steps.

Cause: The policy is updating faster than rollout workers can regenerate experience. Rollout workers are still sampling from an older version of the policy, creating off-policy samples that the PPO update treats as on-policy.

Fix: Increase the number of rollout worker engines (vllm_num_engines in OpenRLHF). Reduce ppo_epochs per rollout batch from 4 to 1 or 2, so the policy doesn't move as far before fresh rollouts are generated. In verl, check rollout_batch_size relative to ppo_mini_batch_size: if the rollout batch is small relative to the number of PPO update steps, staleness accumulates quickly.

Choosing Your Framework

Pick TRL if your team is already on the HuggingFace stack and your model is under 30B. The same Accelerate config, same dataset formats, and same checkpointing patterns as your SFT and DPO runs. The friction reduction is real.

Pick OpenRLHF if you have a heterogeneous cluster - different GPU types across roles, or a reward model that's significantly smaller than your actor. The Ray-based architecture handles mixed hardware cleanly.

Pick verl if you're running 70B+ in production and need Megatron-LM TP+PP to fit the model. HybridEngine's in-place rollout avoids the second GPU allocation and makes per-iteration latency predictable.

The infrastructure investment is substantial: reward model training plus a full PPO run for a 70B model is a 48-72 hour, multi-cluster job. Size your checkpointing accordingly. Save every 50 steps for the actor, keep the last 3 checkpoints, and always checkpoint the reward model before the PPO run starts. Once you have a reward model that generalizes, it's reusable across multiple PPO runs.


RLHF runs are 24-72 hour multi-cluster jobs that don't tolerate cold-start latency or billing surprises. Spheron's bare-metal H100, B200, and B300 nodes provision in minutes and bill per minute - no reserved commitment required for the reward model training phase, and spot rollout workers cut the PPO loop cost by roughly 35-40%.

Start your RLHF run on Spheron →
