PyTorch-Native Agentic RL on GPU Cloud: TorchForge and Monarch for Multi-Turn Agent Training at Scale (2026 Guide)

Most agentic RL post-training today runs on infrastructure built for RLHF: Ray actor pools, separate vLLM servers, HybridEngines that swap weight layouts mid-step. If you've run verl or OpenRLHF for standard PPO, you know the pattern. It works fine for single-completion reward training. It starts to strain when you move to multi-turn agent trajectories with 8-32 tool calls per episode, thousands of parallel rollout environments, and verifiable rewards from code execution sandboxes. For that workload, TorchForge and Monarch offer a different path: the entire RL loop expressed as a plain PyTorch program, scheduled by a single controller across the full cluster. See the RLHF training infrastructure guide for how verl and OpenRLHF handle standard RLHF, and the Agent RFT guide for multi-step trajectory training on top of those frameworks. This post is specifically about TorchForge and Monarch: what the stack is, how it works, and how to run multi-node agentic RL jobs on H100 and H200 clusters.

Why Agentic RL Post-Training Is Different in 2026

The shift started with reasoning models. DeepSeek-R1 showed that long chains of thought trained with RL improved math and code reasoning dramatically. The follow-on question was obvious: what happens when you extend that to actual tool use? Not just internal reasoning steps, but real actions: executing code, querying databases, calling APIs, browsing the web.

Agentic RL post-training operates on that extension. The workload profile is different in three specific ways:

Rollout-heavy. Standard RLHF generates one completion per prompt, scores it, and updates the policy. Agentic RL generates full multi-turn trajectories: 8-32 tool calls, their results, and the policy's responses to those results. Generating enough trajectories to keep the trainer busy requires an order of magnitude more rollout throughput than single-completion PPO. You need a large parallel rollout fleet, not a single colocated vLLM server.

Environment-bound. Rewards come from verifiable signals: did the generated code pass the test suite? Did the agent retrieve the right answer from the database? Verification is I/O heavy and variable in latency. A code execution verifier might take 50ms on a trivial test or 10 seconds on a complex one. The reward computation can't block the GPU; it needs to run asynchronously against the rollout fleet.

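Trajectory depth. With a 70B model generating 8-step trajectories at 1024 tokens per step, a single trajectory is 8192 tokens. With 4096 parallel environments, the token buffer alone is 4096 × 8192 × 2 bytes = 64 MiB at 2 bytes per token ID. That buffer is small; the real pressure is the KV cache that every in-flight trajectory holds on the rollout GPUs while it generates, which grows with both trajectory length and environment count. Memory management at this scale requires careful disaggregation between the rollout fleet and the trainer.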
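
The trajectory arithmetic can be checked with a few lines of Python. The token-buffer figure uses the numbers from the paragraph above. The KV-cache estimate is an illustration only: the layer count, KV-head count, and head dimension are assumed values for a 70B-class model with grouped-query attention, not figures from this post.

```python
def token_buffer_bytes(num_envs: int, tokens_per_traj: int, bytes_per_token_id: int = 2) -> int:
    """Bytes needed to store raw token IDs for every in-flight trajectory."""
    return num_envs * tokens_per_traj * bytes_per_token_id


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """KV-cache bytes per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


tokens_per_traj = 8 * 1024  # 8 steps at 1024 tokens per step
buffer = token_buffer_bytes(num_envs=4096, tokens_per_traj=tokens_per_traj)
print(f"token buffer: {buffer / 2**20:.0f} MiB")

# Assumed 70B-class shape (hypothetical values, for illustration only).
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_traj_gib = per_token * tokens_per_traj / 2**30
print(f"KV cache per full-length trajectory: {per_traj_gib:.1f} GiB")
```

Under these assumptions the KV cache for a single full-length trajectory is already far larger than the token buffer for the entire fleet, which is why rollout memory is sized separately from trainer memory.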

The 2026 batch of long-horizon agent training runs, including WideSeek-R1 and similar multi-agent setups, pushed these constraints into production at scale. The bottleneck is no longer fitting the model on GPUs. It is generating enough verified trajectories fast enough to keep the trainer busy. For a look at how GPU-native environments integrate into this architecture, the RL Environments guide covers the environment side in detail. For embodied AI and VLA-specific RL infrastructure, the RLinf guide covers that specialization.

TorchForge: PyTorch-Native Agentic RL

TorchForge is Meta's agentic RL post-training library. The core design principle is that the entire training loop should be a PyTorch program, not a collection of distributed system primitives glued together with Ray callbacks and RPC calls.

As of mid-2026, Meta has paused active development on TorchForge. LLM training work is being consolidated into TorchTitan. The architecture and mechanics described in this post are accurate, but weigh the development status before committing TorchForge to a production stack.

In a standard verl setup, the HybridEngine swaps between FSDP training layout and vLLM inference layout in-place on the same GPUs. The layout swap is engineered to avoid a second GPU allocation, but it means training and rollout are serialized: the GPUs can't train while they're running vLLM, and they can't run rollout while running FSDP. In OpenRLHF, each role (actor, critic, reference, reward) runs as a separate Ray actor pool. Ray handles scheduling and data movement between pools, but the inter-framework overhead compounds at large node counts.

TorchForge takes a different approach. Under the hood, it uses vLLM for high-throughput rollout inference and TorchTitan for training (FSDP, tensor parallelism, pipeline parallelism). The differentiator is Monarch's single-controller actor model: instead of Ray managing separate actor pools with inter-framework RPC between them, Monarch coordinates both the inference and training engines from one controller process. The in-place FSDP-to-vLLM weight-layout swap that verl's HybridEngine performs is replaced by Monarch's actor messaging layer. The controller is a single Python process that runs the RL loop as a program.

Dimension	TorchForge	verl	OpenRLHF
Scaling ceiling	Tested at 512 GPUs; designed for large clusters	70B+ (Megatron)	70B+ (Ray)
Controller architecture	Single-controller (Monarch)	HybridEngine (per-node)	Ray actor pools
Rollout disaggregation	First-class (RaaS fleet)	In-process (serialized)	Ray actor pool (pipelined)
PyTorch-native	Yes (Monarch single-controller; uses vLLM + TorchTitan)	Partial (FSDP+vLLM swap)	Partial (Ray+vLLM+DeepSpeed)
Heterogeneous hardware	Yes (Monarch device mesh)	No (uniform config)	Yes (Ray actor placement)

The single-controller design has a practical implication for debugging: you can step through the RL loop with a Python debugger. There is no Ray trace to reconstruct, no distributed state machine to reason about. The loop runs as a Python program, and you can observe its state at any breakpoint.
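
To make the "RL loop as a plain program" idea concrete, here is a toy sketch in pure Python with asyncio. It does not use the TorchForge or Monarch APIs: rollout_worker, train_step, and control_loop are hypothetical stand-ins, and the rewards are fake. The only point is the shape of the control flow, with one process fanning out rollouts, gathering them, and applying an update.

```python
import asyncio
import random


async def rollout_worker(worker_id: int, policy_version: int, rng: random.Random) -> dict:
    """Stand-in for a remote rollout worker: returns one fake trajectory."""
    await asyncio.sleep(0)  # yield to the event loop, as a real remote call would
    num_turns = rng.randint(1, 8)
    return {"worker": worker_id, "policy_version": policy_version,
            "turns": num_turns, "reward": float(num_turns % 2)}


def train_step(trajectories: list[dict], policy_version: int) -> int:
    """Stand-in for the policy update: only bumps the version counter."""
    assert all(t["policy_version"] == policy_version for t in trajectories)
    return policy_version + 1


async def control_loop(num_workers: int, num_steps: int, seed: int = 0) -> int:
    """Single controller: fan out rollouts, gather them, update, repeat."""
    rng = random.Random(seed)
    policy_version = 0
    for step in range(num_steps):
        trajectories = await asyncio.gather(
            *(rollout_worker(i, policy_version, rng) for i in range(num_workers))
        )
        mean_reward = sum(t["reward"] for t in trajectories) / len(trajectories)
        policy_version = train_step(list(trajectories), policy_version)
        print(f"step {step}: mean reward {mean_reward:.2f}, policy v{policy_version}")
    return policy_version


final_version = asyncio.run(control_loop(num_workers=4, num_steps=3))
```

Because everything runs from one process, a breakpoint inside control_loop shows the trajectories, the reward statistics, and the policy version at the same moment.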

Monarch: Single-Controller Distributed Execution

Monarch is the distributed scheduling layer that makes TorchForge's single-controller model work at scale. It sits between the Python training loop and the physical GPU cluster, handling three things: device mesh assignment, collective communication scheduling, and failure recovery.

Device mesh assignment. Monarch maps logical worker roles (rollout worker, trainer) to physical devices. You specify the number of GPUs per role in the TorchForge config YAML. Monarch handles the rest: which physical devices get which logical role, how the all-reduce groups are formed, and how devices are re-partitioned if you scale the cluster mid-run.

Collective communication scheduling. The RL loop has dynamic computation shapes. Rollout collection generates trajectories of variable length (different tool call chains produce different token counts). Advantage computation aggregates across a variable-size batch. Policy update runs on a fixed batch. torchrun-style SPMD programs built directly on NCCL assume a fixed pattern: every rank issues the same collectives in the same order, over the same set of parameters at the same point in each step. Monarch is designed for dynamic graphs. It dispatches collective operations as Monarch tasks that can vary in size and shape between steps.

This is not a minor implementation detail for RL workloads. Variable trajectory lengths mean variable communication shapes. Forcing that into a fixed-graph NCCL all-reduce either requires padding all trajectories to maximum length (wasting compute) or building a complex bucketing layer on top of NCCL. Monarch handles it natively.
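
The cost of the padding option is easy to quantify. The sketch below computes the share of a padded batch that is padding rather than real tokens; the trajectory lengths are hypothetical values chosen for illustration.

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of a padded batch that is padding rather than real tokens."""
    padded_total = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / padded_total


# Hypothetical trajectory lengths (tokens) from tool-call chains of varying depth.
lengths = [1024, 2048, 2048, 8192]
waste = padding_waste(lengths)
print(f"padding fraction: {waste:.1%}")
```

A single long trajectory in the batch is enough to push the padded share above half, which is the overhead a dynamic-shape scheduler is meant to avoid.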

Failure recovery. When a spot rollout worker is preempted, Monarch detects the disconnection, marks in-flight rollout batches as incomplete, redistributes them to the remaining fleet, and continues. The trainer does not need to restart. This failure isolation is what makes spot pricing viable for rollout workers: a preempted node costs one rollout batch, not the entire training run.
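
The redistribution step can be sketched in a few lines. This is a simplified illustration, not Monarch's implementation: reassign_batches and the worker names are hypothetical, and the policy shown (hand each orphaned batch to the least-loaded survivor) is one reasonable choice among several.

```python
def reassign_batches(assignments: dict[str, list[int]], failed: str) -> dict[str, list[int]]:
    """Move the failed worker's in-flight batches onto the least-loaded survivors."""
    survivors = {w: list(b) for w, b in assignments.items() if w != failed}
    if not survivors:
        raise RuntimeError("no rollout workers left to take over")
    for batch_id in assignments.get(failed, []):
        # Pick the survivor with the fewest batches; break ties by name.
        target = min(survivors, key=lambda w: (len(survivors[w]), w))
        survivors[target].append(batch_id)
    return survivors


in_flight = {"worker-0": [0, 1], "worker-1": [2], "worker-2": [3, 4]}
after = reassign_batches(in_flight, failed="worker-2")
print(after)
```

No batch is dropped: every batch ID that was in flight before the failure is still assigned afterwards, only to a different worker.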

For the distributed training fundamentals that underpin Monarch's approach (FSDP, NCCL tuning, multi-node launch patterns), the Distributed LLM Training guide covers those in depth for the pretraining context.

Architecture: Rollout-as-a-Service

The Rollout-as-a-Service (RaaS) pattern is TorchForge's central architectural innovation. It separates the cluster into two distinct fleets with different properties and different pricing implications.

Rollout fleet. Stateless workers that load the current policy checkpoint and generate trajectories against the environment. They run the policy in forward-only mode: no gradient computation, no optimizer state, no backward pass. Stateless means preemption-safe. A preempted rollout worker restarts, loads the latest policy checkpoint from the trainer, and resumes. The rollout fleet is spot-eligible.

Trainer. Holds optimizer state (AdamW first and second moments in FP32, master weights in FP32). Runs the policy update step on collected trajectories. Must checkpoint its state to survive a restart. Must stay on on-demand instances: a preempted trainer loses optimizer state and rolls back to the last checkpoint. Checkpointing every 50-100 steps limits the loss to manageable amounts.

Weight broadcast. After each policy update, the trainer distributes the new model weights to all rollout workers via the Monarch collective communication layer. This broadcast is the primary inter-node communication in the RL loop. Its latency relative to rollout collection time determines trainer utilization.

Reward computation. Can be co-located with rollout workers as a verifier process (deterministic, CPU-bound) or run as a separate CPU pool for learned reward models. For verifiable rewards (code execution, math checking, tool call validation), the verifier runs alongside each rollout worker. For learned reward models, a separate CPU or GPU pool is more efficient.

Environment (verifier / tool sandbox)
        |
        | (trajectories + verifiable rewards)
        v
  [RolloutWorker Fleet] --weight-broadcast-- [Trainer Node]
   (spot GPUs, stateless)                   (on-demand GPUs, holds optimizer)
        |
        v
  [RewardWorker] (optional CPU pool for learned reward)

The decoupling between the rollout fleet and the trainer is what enables spot pricing for the majority of GPU-hours in a run. For a 70B model at 256 GPUs split 8 trainer / 248 rollout, 97% of the GPU-hours are on spot-eligible rollout workers.

GPU and Interconnect Requirements

VRAM Sizing

The VRAM formula differs between trainer and rollout workers because only the trainer holds optimizer state.

Trainer VRAM per GPU (assuming FSDP full sharding across trainer GPUs):

Model weights in BF16: params × 2 bytes / num_trainer_gpus
FP32 master weights: params × 4 bytes / num_trainer_gpus
AdamW first moment: params × 4 bytes / num_trainer_gpus
AdamW second moment: params × 4 bytes / num_trainer_gpus
Total per GPU: params × 14 bytes / num_trainer_gpus (weights and optimizer state only; gradients and activations come on top)

Rollout worker VRAM per GPU (inference-only, weights in BF16):

params × 2 bytes / num_gpus_per_worker

For a 70B model (140 GB in BF16):

Trainer total: 70B × 14 bytes = 980 GB, split across 8x H200 SXM5 (141 GB each) = 122.5 GB per GPU. That leaves about 18 GB per GPU, which gradients and activations also have to share, so treat 8 GPUs as the floor rather than a comfortable fit.
Rollout worker: 70B × 2 bytes = 140 GB for weights only. A single H200 SXM5 (141 GB) can hold one BF16 copy of the 70B model, but the remaining 1 GB leaves no room for KV cache. In practice, use 2 H200s per rollout worker with tensor parallelism for stable long-trajectory generation.
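
The two sizing formulas translate directly into Python. The function names are illustrative; the byte counts are the ones listed above, and results are in GB (10^9 bytes) to match the figures in this section.

```python
def trainer_vram_gb(params: float, num_trainer_gpus: int) -> float:
    """Per-GPU static trainer state under full sharding, in GB (1e9 bytes).

    14 bytes per parameter: BF16 weights (2) + FP32 master weights (4)
    + AdamW first moment (4) + AdamW second moment (4).
    Gradients and activations are not included.
    """
    return params * 14 / num_trainer_gpus / 1e9


def rollout_vram_gb(params: float, num_gpus_per_worker: int) -> float:
    """Per-GPU BF16 weight footprint for an inference-only worker, in GB."""
    return params * 2 / num_gpus_per_worker / 1e9


print(f"trainer, 70B on 8 GPUs: {trainer_vram_gb(70e9, 8):.1f} GB per GPU")
print(f"rollout, 70B on 1 GPU:  {rollout_vram_gb(70e9, 1):.1f} GB per GPU")
print(f"rollout, 70B at TP=2:   {rollout_vram_gb(70e9, 2):.1f} GB per GPU")
```

Whatever remains on each rollout GPU after the weights is the KV-cache budget, which is why the TP=2 layout is the more practical one for long trajectories.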

GPU Tier Selection

Model	Trainer GPU	Rollout GPU	Min inter-node BW	Linear scaling GPUs
7B	2x H100 SXM5	4-8x H100 SXM5 (spot)	100 Gbps	Up to 64 GPUs
32B	6x H100 SXM5	16-32x H100 SXM5 (spot)	200 Gbps	Up to 128 GPUs
70B	8x H200 SXM5	32-64x H100/H200 (spot)	400 Gbps	Up to 256 GPUs

Interconnect and Weight Broadcast Latency

The weight broadcast step is where interconnect speed directly affects trainer utilization. After each policy update, the trainer broadcasts the updated model weights to all rollout workers. The broadcast latency is:

broadcast_time_seconds = model_size_bytes / (interconnect_gbps * 1e9 / 8)

For a 70B model in BF16 (140 GB = 1.40 × 10^11 bytes):

At 100 Gbps: 1.40e11 / (100e9 / 8) = 11.2 seconds
At 200 Gbps: 1.40e11 / (200e9 / 8) = 5.6 seconds
At 400 Gbps: 1.40e11 / (400e9 / 8) = 2.8 seconds

If rollout collection takes 10 seconds per step (typical for 70B with 512 parallel environments), the bandwidth penalty formula is:

bandwidth_penalty = broadcast_time_seconds / (broadcast_time_seconds + rollout_time_seconds)

At 100 Gbps, rollout workers are idle 52.8% of the broadcast+rollout cycle waiting to receive the weight broadcast. At 400 Gbps, that drops to 21.9% of the cycle. For 256+ GPU runs targeting production throughput, 400 Gbps InfiniBand or RoCEv2 RDMA is the practical minimum. For multi-node networking configuration and NCCL tuning on RDMA fabrics, check your cluster provider's documentation.
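
Both formulas are simple enough to script when comparing fabrics. The sketch below uses the same first-order model as the text (a single link at full line rate, with no protocol overhead or tree fan-out), so treat the results as lower bounds on broadcast time.

```python
def broadcast_time_seconds(model_size_bytes: float, interconnect_gbps: float) -> float:
    """Time to push the full weight set through one link at line rate."""
    return model_size_bytes / (interconnect_gbps * 1e9 / 8)


def bandwidth_penalty(broadcast_s: float, rollout_s: float) -> float:
    """Share of the broadcast+rollout cycle spent waiting on the broadcast."""
    return broadcast_s / (broadcast_s + rollout_s)


model_bytes = 70e9 * 2   # 70B parameters in BF16
rollout_s = 10.0         # rollout collection time per step, as assumed above

for gbps in (100, 200, 400):
    t = broadcast_time_seconds(model_bytes, gbps)
    share = bandwidth_penalty(t, rollout_s)
    print(f"{gbps} Gbps: broadcast {t:.1f} s, idle share {share * 100:.1f}%")
```

Changing rollout_s shows how sensitive the penalty is to rollout time: longer rollouts amortize the same broadcast over a longer cycle.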

Step-by-Step: Multi-Node H100/H200 Cluster for TorchForge

Step 1: Install dependencies

On all nodes, start with a clean Ubuntu 22.04 image with CUDA 12.4+. Spheron's default H100 and H200 images include CUDA 12.4.

bash

# Create a conda environment with Python 3.12
conda create -n forge python=3.12
conda activate forge

# Clone the TorchForge repository
git clone https://github.com/meta-pytorch/torchforge
cd torchforge

# Run the install script (pulls PyTorch 2.9+, Monarch, vLLM, and TorchTitan)
./scripts/install.sh

# Verify CUDA and NCCL
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA devices: {torch.cuda.device_count()}')"
python -c "import torch.distributed as dist; print('NCCL available:', dist.is_nccl_available())"

TorchForge and Monarch are experimental and not available on PyPI. Follow the install instructions in the TorchForge GitHub repository. The Monarch repo is at https://github.com/meta-pytorch/monarch if you need it separately.

Step 2: Write the TorchForge config

yaml

# torchforge_config.yaml

# Algorithm: ppo or grpo
algorithm: grpo

# Model
model:
  path: meta-llama/Meta-Llama-3-8B-Instruct  # HuggingFace checkpoint or local path
  dtype: bfloat16

# Rollout fleet (spot-eligible workers)
rollout:
  num_workers: 56            # Number of rollout GPU processes (one GPU each)
  batch_size: 512            # Parallel environments per step
  max_length: 4096           # Max total tokens per trajectory
  num_turns: 8               # Max tool call rounds per trajectory
  spot_eligible: true        # Mark rollout workers as preemption-safe
  checkpoint_on_preemption: true

# Trainer (on-demand, holds optimizer state)
trainer:
  num_gpus: 8                # GPUs for the policy update step (56 + 8 = 64 = num_nodes × gpus_per_node)
  gradient_accumulation_steps: 4
  learning_rate: 1e-6
  checkpoint_interval: 50    # Steps between trainer checkpoints

# Reward function
reward:
  function: your_reward_module.compute_reward  # Path to verifiable reward
  timeout_seconds: 10         # Max verifier wait per trajectory

# Monarch distributed config
monarch:
  master_addr: HEAD_NODE_IP  # Head node IP
  master_port: 29500
  num_nodes: 8
  gpus_per_node: 8
  coord_port: 29501          # Monarch coordination port
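
Before launching, it is worth checking that the role counts in the config add up to the cluster size. The sketch below works on a plain dict that mirrors the YAML keys above, so it has no dependencies. gpu_budget is a hypothetical helper, not part of TorchForge, and it assumes one GPU per rollout worker.

```python
def gpu_budget(cfg: dict) -> dict:
    """Compare GPUs requested by the roles against GPUs in the cluster."""
    available = cfg["monarch"]["num_nodes"] * cfg["monarch"]["gpus_per_node"]
    # Assumption: each rollout worker occupies exactly one GPU.
    requested = cfg["rollout"]["num_workers"] + cfg["trainer"]["num_gpus"]
    if requested > available:
        raise ValueError(f"config requests {requested} GPUs but the cluster has {available}")
    return {"available": available, "requested": requested, "idle": available - requested}


# A 64-GPU example: 8 nodes of 8 GPUs, 56 rollout workers, 8 trainer GPUs.
example_cfg = {
    "rollout": {"num_workers": 56},
    "trainer": {"num_gpus": 8},
    "monarch": {"num_nodes": 8, "gpus_per_node": 8},
}
budget = gpu_budget(example_cfg)
print(budget)
```

A non-zero idle count means GPUs you are paying for with no role assigned; a ValueError means the job would not fit on the cluster as configured.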

Step 3: Verify cluster health before launching

bash

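# Check device mesh from head node
monarch-status --addr HEAD_NODE_IP:29501 --num_nodes 8

# Write a small NCCL check. torchrun supplies RANK, WORLD_SIZE, and LOCAL_RANK,
# which init_process_group needs; a bare `python -c` would fail without them.
cat > nccl_check.py <<'EOF'
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
rank, world = dist.get_rank(), dist.get_world_size()

# Connectivity: every rank contributes 1, so the sum must equal the world size
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
assert t.item() == world
print(f"Rank {rank}: NCCL all-reduce OK")

# Bandwidth: time an all-reduce over a 2 GB FP16 tensor after one warm-up pass
buf = torch.zeros(int(1e9), dtype=torch.float16, device="cuda")
dist.all_reduce(buf)
torch.cuda.synchronize()
start = time.perf_counter()
dist.all_reduce(buf)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# Bus bandwidth for all-reduce: 2 * (n - 1) / n * bytes / time, reported in Gbps
busbw_gbps = 2 * (world - 1) / world * buf.numel() * 2 * 8 / elapsed / 1e9
if rank == 0:
    print(f"all-reduce bus bandwidth: {busbw_gbps:.0f} Gbps")
dist.destroy_process_group()
EOF

# Run on every node with NODE_RANK set to 0..7 (0 on the head node).
# Compare the reported figure against your fabric's rated bandwidth.
NCCL_DEBUG=INFO torchrun --nnodes 8 --nproc_per_node 8 --node_rank "$NODE_RANK" \
  --master_addr HEAD_NODE_IP --master_port 29500 nccl_check.py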

Step 4: Launch the TorchForge job

bash

# From the head node
python -m torchforge.train \
  --config torchforge_config.yaml \
  --monarch.num_nodes 8 \
  --monarch.gpus_per_node 8

# Monitor TensorBoard in a separate terminal
tensorboard --logdir ./runs --host 0.0.0.0 --port 6006

TorchForge automatically launches the RaaS fleet and the trainer as separate Monarch workers. You should see log output confirming both fleet startup and the Monarch device mesh assignment within 60 seconds of launch.

Step 5: Enable spot rollout workers

In the TorchForge config, rollout.spot_eligible: true marks rollout workers as preemption-safe. On Spheron, provision rollout workers as spot H100 or H200 instances and the trainer as on-demand.

When a spot rollout node is preempted, the sequence is:

Monarch detects the disconnection (within 10-30 seconds, depending on heartbeat config).
In-flight rollout batches on the preempted node are marked incomplete.
Monarch redistributes those batches to the remaining rollout fleet.
The trainer keeps running; it does not restart or roll back to a checkpoint.
The preempted node restarts (or you provision a replacement), rejoins the Monarch mesh, and picks up new rollout assignments.

Net training loss: one rollout batch. Net GPU cost savings: approximately 20-23% versus an all-on-demand fleet, depending on the trainer-to-rollout ratio and GPU mix.

Common failure modes

NCCL timeout on slow interconnect. If your cluster runs 100 Gbps Ethernet and you're broadcasting 70B weights, the broadcast step takes 11+ seconds at best, and longer under congestion. If collectives start timing out, raise the collective timeout to around 600 seconds (in plain PyTorch this is the timeout argument to torch.distributed.init_process_group), and increase monarch.heartbeat_interval if Monarch is mistaking slow broadcasts for node failures.

Rollout worker OOM on long trajectories. At max_length=8192 and 512 parallel environments, the KV cache for the rollout fleet can exceed VRAM unexpectedly on complex tool-using trajectories. Reduce rollout.batch_size or decrease max_length if rollout workers OOM. Alternatively, switch to paged attention (enabled by default in newer TorchForge versions) to cap KV cache growth.

Reward function timeout causing stale trajectory batches. If your verifier sometimes exceeds reward.timeout_seconds, TorchForge marks those trajectories with a None reward. The advantage computation then has a sparse reward signal, which can cause training instability. Set a deterministic upper bound on verifier runtime (e.g., subprocess kill after 10 seconds for code execution) rather than relying on the framework timeout.
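
One way to enforce that upper bound is to run each verification in a child process with a hard timeout. The sketch below is a minimal illustration using only the standard library. run_verifier is a hypothetical helper, not a TorchForge API, and mapping a timeout to a reward of 0.0 is an assumption about what your training setup wants. It also provides no isolation: a real code-execution verifier needs a proper sandbox.

```python
import subprocess
import sys


def run_verifier(code: str, timeout_seconds: float = 10.0) -> float:
    """Run candidate code in a child Python process; return 1.0 on success, else 0.0.

    A timeout and a non-zero exit both map to 0.0, so the caller never sees None.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_seconds,  # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


print(run_verifier("assert 1 + 1 == 2"))
print(run_verifier("assert 1 + 1 == 3"))
print(run_verifier("import time; time.sleep(5)", timeout_seconds=0.5))
```

Keeping the verifier's own timeout below reward.timeout_seconds means every trajectory reaches the advantage computation with a numeric reward.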

Cost Optimization: Separating Rollout and Training GPUs

Why the split works

Rollout workers are inference-shaped workloads. They run the policy in forward-only mode against the environment, collect trajectory (prompt, action, observation) tuples, and compute verifiable rewards. They hold no gradient state, no optimizer buffers, and no training checkpoints. A preempted rollout worker loses at most the in-progress rollout batch, which TorchForge's Monarch layer reassigns to the remaining fleet. This makes rollout workers fully preemption-safe, which is exactly the definition of a spot-eligible workload.

The trainer holds optimizer state: FP32 master weights, AdamW first and second moments. That state is expensive to recompute if a node is preempted (you'd roll back to the last checkpoint and re-run all steps since then). Trainers must stay on on-demand instances.

Cost math with live pricing

Using current Spheron pricing (29 Jun 2026):

70B model run, 24 hours:

8 on-demand H200 SXM5 trainer GPUs at $3.70/hr each: 8 × $3.70 × 24 = $710.40
32 spot H100 SXM5 GPUs forming 16 rollout workers (TP=2 per worker; two H100s provide 160 GB combined, fitting the 140 GB 70B model) at $2.91/hr per GPU: 32 × $2.91 × 24 = $2,234.88
Total with spot rollout: $2,945.28

Same run, all on-demand (H200 trainers + H100 rollout fleet):

8 on-demand H200 SXM5 trainer GPUs at $3.70/hr each: 8 × $3.70 × 24 = $710.40
32 on-demand H100 SXM5 GPUs (16 rollout workers at TP=2) at $3.92/hr per GPU: 32 × $3.92 × 24 = $3,010.56
Total all on-demand: $3,720.96

Savings: $775.68 (20.8%) over 24 hours. Cost is linear in time, so the savings percentage stays constant regardless of run duration. The savings grow with the share of GPUs in the rollout fleet: at the 1:4 trainer-to-rollout GPU split used here (8 trainer GPUs, 32 rollout GPUs), the savings range is approximately 20-23% depending on the GPU mix.
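
The comparison is easy to reproduce and rerun with current prices. The sketch below uses the per-GPU hourly rates quoted above; run_cost is an illustrative helper name.

```python
def run_cost(num_gpus: int, price_per_gpu_hour: float, hours: float) -> float:
    """Total cost of a homogeneous GPU group over the run."""
    return num_gpus * price_per_gpu_hour * hours


hours = 24
trainer = run_cost(8, 3.70, hours)             # on-demand H200 trainer GPUs
rollout_spot = run_cost(32, 2.91, hours)       # spot H100 rollout GPUs
rollout_on_demand = run_cost(32, 3.92, hours)  # on-demand H100 rollout GPUs

spot_total = trainer + rollout_spot
on_demand_total = trainer + rollout_on_demand
savings = on_demand_total - spot_total

print(f"spot rollout:  ${spot_total:,.2f}")
print(f"all on-demand: ${on_demand_total:,.2f}")
print(f"savings:       ${savings:,.2f} ({savings / on_demand_total * 100:.1f}%)")
```

Changing hours leaves the percentage unchanged, while raising the rollout GPU count relative to the trainer pushes it toward the full spot discount.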

Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Spheron's marketplace for agentic RL

Agentic RL has a bursty access pattern. Rollout collection is GPU-intensive and runs in parallel bursts across the fleet. The policy update is a smaller, sequential operation on the trainer. Between policy updates, the rollout fleet is idle. Between rollout collection rounds, the trainer is idle. You want to pay only for active compute, not reserved capacity.

Spheron's marketplace aggregates supply from 5+ providers with per-minute billing. You provision rollout workers when you start a run and release them when you're done, paying only for the minutes they're active. For runs that pause overnight, checkpoint at end of day, release the rollout fleet, and re-provision it in the morning. The trainer stays running (it holds optimizer state), but the rollout fleet costs nothing between sessions.

Putting It Together

TorchForge and Monarch address a real gap in the 2026 RL infrastructure landscape. The verl and OpenRLHF stacks were designed for single-completion RLHF. They work for agentic RL with adaptations, but the Ray actor pool model adds inter-framework overhead that compounds at the rollout scales agentic tasks require. TorchForge's single-controller model makes the full RL loop a plain PyTorch program, and Monarch schedules it across the cluster without a distributed state machine in between.

The practical entry point is small: an 8-GPU H100 cluster with 2 trainer GPUs and 6 rollout workers handles a 7B agentic RL run. The spot/on-demand split is meaningful from the first run. At 256 GPUs, the architecture scales without code changes, assuming 400 Gbps inter-node bandwidth.

For teams already running verl or OpenRLHF for RLHF, TorchForge is worth evaluating for any post-training workload where rollout collection (not the optimizer step) is the bottleneck.

Agentic RL at scale means a bursty rollout fleet that spins up for trajectory collection and back down during policy updates. Spheron's marketplace provisions multi-node H100 and H200 clusters on per-minute billing, so you only pay for active rollout time, not reserved capacity sitting idle.


Quick Setup Guide

Install TorchForge and Monarch dependencies
Create a conda environment: conda create -n forge python=3.12 && conda activate forge. Clone the TorchForge repository from https://github.com/meta-pytorch/torchforge and run its install script: ./scripts/install.sh (this pulls PyTorch 2.9+, Monarch, vLLM, and TorchTitan). Verify CUDA visibility with python -c 'import torch; print(torch.cuda.device_count())'. Confirm NCCL is available with python -c 'import torch.distributed as dist; print(dist.is_nccl_available())'. On Spheron H100 and H200 instances, CUDA 12.4+ ships in the default Ubuntu 22.04 image.
Configure the Monarch device mesh for your cluster
Define the device mesh in your TorchForge config YAML. For an 8-node, 64-GPU cluster: set trainer.num_gpus: 8 for the policy optimizer and rollout.num_workers: 56 for the rollout fleet. Monarch maps these to physical device addresses automatically. For multi-node setups, specify the head node IP under monarch.master_addr and confirm all nodes can reach it on port 29500 and the Monarch coordination port (default 29501). Use monarch-status to verify the device mesh before launching the training job.
Launch a TorchForge RL job with disaggregated rollout
Create a TorchForge config file specifying: algorithm (ppo or grpo), model.path (HuggingFace checkpoint or local path), rollout.num_workers, rollout.batch_size, trainer.gradient_accumulation_steps, and reward.function (path to your verifiable reward module). Run: python -m torchforge.train --config your_config.yaml --monarch.num_nodes 8 --monarch.gpus_per_node 8. TorchForge launches the Rollout-as-a-Service (RaaS) fleet and the trainer as separate Monarch workers automatically. Monitor training with the built-in TensorBoard integration: tensorboard --logdir ./runs.
Enable spot rollout workers for cost reduction
In the TorchForge config, set rollout.spot_eligible: true and rollout.checkpoint_on_preemption: true. On Spheron, provision rollout workers as spot H100 or H200 instances and the trainer as on-demand instances. TorchForge's Monarch layer handles rollout worker failure transparently: when a spot node is preempted, Monarch detects the disconnection, marks in-flight rollout batches as incomplete, and reassigns them to the remaining rollout fleet. The trainer keeps running without interruption. Net impact: at most the preempted node's in-flight rollout batch has to be regenerated; no training progress is lost.
Scale from 8 to 256 GPUs with linear-scaling tuning
TorchForge and Monarch achieve near-linear scaling when inter-node bandwidth is at least 400 Gbps. To scale: increase monarch.num_nodes in the config. Monarch re-partitions the rollout fleet across the new node count without code changes. Monitor scaling efficiency with torchforge.bench throughput - target GPU utilization above 80% on rollout workers and above 90% on trainer nodes. If GPU utilization drops below these targets at high node counts, check NCCL communication timings (enable NCCL_DEBUG=INFO) and verify your cluster's InfiniBand or RoCEv2 inter-node bandwidth is at least 400 Gbps per node (8x 50 Gbps ports or 2x 200 Gbps ports).


Frequently Asked Questions

What is TorchForge and how does it differ from verl or OpenRLHF?

TorchForge is Meta's PyTorch-native agentic RL post-training library, designed for single-controller distributed training from small clusters up to large multi-node deployments. Unlike verl (which swaps between FSDP and vLLM layouts via HybridEngine) and OpenRLHF (which uses Ray to manage role-separated actor pools), TorchForge uses vLLM for high-throughput rollout inference and TorchTitan for training (FSDP/TP/PP), all coordinated by Monarch's single-controller actor model. The key differentiator is that Monarch replaces both Ray actor orchestration and the in-place FSDP-to-vLLM weight-layout swap with a single Python controller process, which makes the training loop easier to trace, debug, and extend. Monarch is the distributed execution layer that schedules TorchForge's operations across the GPU cluster.

What is Monarch in PyTorch and what does it do for distributed RL training?

Monarch is Meta's single-controller distributed execution framework for PyTorch. In the context of agentic RL, Monarch acts as the scheduling layer that maps TorchForge's training program onto a distributed GPU cluster. It handles device mesh assignment, collective communication scheduling, and failure recovery without requiring the user to write MPI or Ray actor code. Monarch's single-controller design means all GPU operations are orchestrated from one process, which makes it possible to express complex RL dataflows (rollout generation, reward computation, policy update) as simple Python loops without callbacks, queues, or distributed state machines.

How many GPUs do I need for TorchForge agentic RL training?

For a 7B model with a moderate rollout fleet (512 parallel environments), 8-16 H100 SXM5 or H200 SXM5 GPUs covering both rollout workers and the trainer is a practical starting point. For 32B-70B models with large rollout fleets (4k-8k parallel trajectories), 64-256 GPUs split between the rollout fleet (spot-eligible, stateless) and the trainer (on-demand, holds optimizer state) is the production configuration. TorchForge and Monarch are experimental; the publicly documented run used a 512-H100 cluster on CoreWeave, and Monarch is designed to scale to thousands of GPUs. Most agentic RL post-training runs use the same infrastructure at a much smaller footprint.

Why does agentic RL require 400 Gbps+ inter-node bandwidth at scale?

The bottleneck in agentic RL at 256+ GPUs is the weight broadcast step: after each policy update, the trainer must distribute the new model weights to all rollout workers before the next round of rollout collection begins. At 256 GPUs across 32 nodes (8 GPUs each), broadcasting a 70B model in BF16 (140 GB) over 100 Gbps Ethernet introduces approximately 11-12 seconds of inter-step latency at saturated bandwidth. At 400 Gbps InfiniBand or RoCEv2 RDMA, the same broadcast completes in under 3 seconds, which holds the rollout fleet's idle share to about 22% of the cycle when rollout collection takes 10 seconds. At 100 Gbps, the weight broadcast takes longer than rollout collection itself and becomes the dominant latency in the training loop, ahead of both rollout generation and the optimizer step.

How can I reduce cost on a TorchForge multi-node RL run?

The key split is between rollout workers and the trainer. Rollout workers collect trajectories: they run the policy in forward-only mode against the environment, hold no optimizer state, and are fully preemption-safe. This makes rollout workers spot-eligible. The trainer holds optimizer state and must stay on on-demand instances, checkpointing every 50-100 steps so a restart loses little work. A typical cost-optimal split for a 32B model: 32 spot H100 GPUs for rollout (with automatic restart on preemption), 8 on-demand H100 GPUs for the trainer. Spot rollout GPUs on Spheron run at $2.91/hr per H100 SXM5; on-demand trainer GPUs at $3.92/hr per H100 SXM5. Based on those rates, this split reduces total fleet cost by approximately 20-23% compared to running the entire job on on-demand capacity.

