
Deploy RLinf on GPU Cloud: Scalable RL Infrastructure for Embodied and Agentic AI (2026 Guide)


RLinf is a distributed reinforcement learning infrastructure framework from Tsinghua University, released as v0.2 with a validation run on a 256-GPU H100 cluster. It targets embodied AI and agentic systems, the workloads that RLHF frameworks like verl and OpenRLHF were never designed for. If you're training VLA policies, running online RL with physical robots, or orchestrating multi-agent RL at scale, RLinf is the framework to know in 2026.

What RLinf Is and Where It Fits

RLinf occupies a different part of the RL infrastructure map from verl, OpenRLHF, and TRL. Those three frameworks solve language model alignment: they run reward model training and PPO (or GRPO) loops to adjust LLM behavior based on human preference signals. The input is text, the policy is a transformer language model, and the reward is a scalar from a trained model or a verifier. RLinf targets a different problem: general RL policy optimization across heterogeneous hardware, with native support for the physical observation spaces, continuous action spaces, and real-time latency budgets that embodied AI requires. Using verl for a robotics VLA fine-tuning run is like using a fine-tuning framework for physics simulation: it can sort of work, but you're fighting the abstractions every step.

RL environment frameworks (Isaac Gym, Brax, Gymnasium) are also not the same category. Those frameworks run simulation and produce (observation, reward) pairs. RLinf sits above them: it consumes those environment outputs and orchestrates the distributed policy update across worker pools. Genesis for simulation plus RLinf for distributed policy training is a complete robotics RL stack. For a detailed look at how GPU-native environments slot into this architecture, RL environments on GPU cloud covers the environment side of that same stack.

| Framework | Primary use case | Algorithm support | Heterogeneous hardware | VLA support | Multi-agent |
| --- | --- | --- | --- | --- | --- |
| RLinf | Embodied AI, agentic RL | PPO, GRPO, SAC | Yes (GPU/NPU/CPU pool) | Yes (native) | Yes (WideSeek-R1) |
| verl | LLM alignment at scale | PPO, GRPO | Limited | No | No |
| OpenRLHF | LLM alignment, heterogeneous | PPO, REINFORCE | Yes (role-level) | No | No |
| RLlib | General-purpose distributed RL | PPO, SAC, IMPALA, etc. | Partial | No | Partial |

For the RLHF side of this map, the RLHF training infrastructure guide covers verl, OpenRLHF, and TRL in detail. This post focuses entirely on RLinf.

RLinf Architecture: Macro-to-Micro Flow Transformation

The central design idea in RLinf is the macro-to-micro flow transformation. The macro flow is the high-level algorithm description: collect rollouts, compute rewards, update policy. This is the view you see in any RL textbook. The micro flow is the actual execution graph: which device handles which computation, how batches route between workers, and what communication pattern moves trajectories from rollout workers to the trainer and updated weights back.

RLinf lets you write at the macro level and handles the micro translation. You define your algorithm as a sequence of macro operations. RLinf compiles that into a device-level execution plan across whatever hardware pool you've provisioned. The Worker abstraction is how this mapping happens.

Macro flow (researcher-defined):
  collect_rollouts() -> compute_rewards() -> update_policy()

Micro flow (RLinf-generated):
  RolloutWorker [4x H100] -> RewardWorker [CPU] -> TrainerWorker [4x H100]

Worker types in RLinf:

  • RolloutWorker: generates rollouts by running the policy against the environment. Stateless. Can run on GPU, NPU, or CPU. Spot-eligible.
  • TrainerWorker: holds the optimizer state and runs the policy update. Must stay on on-demand instances.
  • RewardWorker: computes reward signals. Can be a verifier (deterministic, CPU), a learned reward model (GPU), or an environment-native signal.
  • ReferenceWorker: frozen checkpoint for KL divergence penalty. Inference-only, can run on spot.

Heterogeneous resource pooling is what sets RLinf apart from RLlib's more rigid node assignment. You can mix GPU models (H100 and H200 on different workers), put reward computation on CPU workers to save GPU budget, and resize individual worker pools without rewriting algorithm code. RLinf handles device placement, batch routing, and the communication schedule between worker types from the single config YAML.

Environment (Genesis/Isaac Lab)
        |
        | (observations, rewards)
        v
  [RolloutWorker Pool] --> batch trajectories --> [RewardWorker]
        |                                               |
        |<------------ reward signals -----------------|
        |
        v
  [TrainerWorker] <--- weight broadcast at each update step
        |
        v
  [ReferenceWorker] (KL penalty computation)
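
The macro-to-micro mapping can be pictured with a toy planner. The sketch below is plain Python and does not use RLinf's API: `WorkerPool`, `PLACEMENT`, and `compile_micro_flow` are hypothetical names chosen for illustration. It only shows the idea of keeping the algorithm as a list of macro operations while a separate placement table decides which worker pool runs each one.

```python
from dataclasses import dataclass

# Hypothetical names throughout: this is NOT RLinf's API, only a toy model
# of mapping macro operations onto worker pools.


@dataclass(frozen=True)
class WorkerPool:
    role: str    # e.g. "RolloutWorker"
    device: str  # "cuda", "npu", or "cpu"
    count: int   # number of devices or processes in the pool


# Macro flow: the algorithm as the researcher writes it.
MACRO_FLOW = ["collect_rollouts", "compute_rewards", "update_policy"]

# Placement table: which worker role executes each macro operation.
PLACEMENT = {
    "collect_rollouts": "RolloutWorker",
    "compute_rewards": "RewardWorker",
    "update_policy": "TrainerWorker",
}


def compile_micro_flow(macro_flow, pools):
    """Turn a list of macro ops into (op, role, device, count) steps."""
    by_role = {pool.role: pool for pool in pools}
    plan = []
    for op in macro_flow:
        pool = by_role[PLACEMENT[op]]
        plan.append((op, pool.role, pool.device, pool.count))
    return plan


pools = [
    WorkerPool("RolloutWorker", "cuda", 4),
    WorkerPool("RewardWorker", "cpu", 4),
    WorkerPool("TrainerWorker", "cuda", 4),
]
plan = compile_micro_flow(MACRO_FLOW, pools)
for op, role, device, count in plan:
    print(f"{op:18s} -> {role} [{count}x {device}]")
```

Resizing a pool or moving reward computation to GPU changes only the `pools` list in this sketch; `MACRO_FLOW` stays untouched, which is the property the architecture section describes.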

Supported Algorithms and Modes

Single-Agent RL

PPO is the default algorithm and the most thoroughly tested. RLinf's PPO implementation handles both LLM policies (discrete token actions) and VLA policies (continuous or hybrid action spaces). The architecture separates the rollout collection from the policy update, which lets you run more rollout workers than trainer processes and avoid the GPU idle time that colocated rollout creates.

GRPO drops the critic model and computes group-relative advantage estimates from rollout batches directly. For verifiable reward tasks (code execution, math verification, manipulation success signals), GRPO cuts VRAM requirements by 30-40% and removes critic instability. RLinf's GRPO implementation plugs directly into the same Worker abstraction as PPO, so switching algorithms is a config change.
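
To make "group-relative advantage" concrete, here is a minimal sketch of the commonly used formulation: each rollout's reward is normalized against the mean and standard deviation of the other rollouts sampled for the same prompt or task. The function name and the use of the population standard deviation are assumptions for illustration; RLinf's own implementation may differ in details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own rollout group.

    No critic is involved: the group mean serves as the baseline.
    Uses the population standard deviation (an assumption; some
    implementations use the sample standard deviation instead).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one task, scored by a binary verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Successful rollouts get a positive advantage and failed ones a negative advantage. When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.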

SAC (Soft Actor-Critic) is RLinf's primary algorithm for continuous-action embodied AI: manipulation, locomotion, real-robot online RL. SAC's entropy regularization makes it more stable than PPO for high-dimensional continuous action spaces, which is why it appears in robotics applications far more often than PPO does.
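
The entropy term shows up directly in SAC's critic target. The sketch below follows the standard SAC formulation (clipped double-Q with an entropy bonus weighted by a temperature `alpha`); it is not taken from RLinf's code, and the function name and default values are illustrative assumptions.

```python
def soft_td_target(reward, done, next_q1, next_q2, next_log_prob,
                   gamma=0.99, alpha=0.2):
    """Standard SAC critic target for a single transition.

    next_q1 / next_q2: the two target Q-network estimates for the next
    state and an action sampled from the current policy.
    next_log_prob: log-probability of that sampled action.
    """
    # Entropy bonus: subtracting alpha * log_prob rewards less certain policies.
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    # Terminal transitions do not bootstrap from the next state.
    return reward + gamma * (1.0 - float(done)) * soft_value


target = soft_td_target(reward=1.0, done=False,
                        next_q1=10.0, next_q2=9.0, next_log_prob=-2.0)
print(round(target, 3))
```

Taking the minimum of the two Q estimates counters overestimation, and the `alpha * next_log_prob` term is the entropy regularization the paragraph above refers to.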

| Algorithm | Critic needed | Reward type | Best use case |
| --- | --- | --- | --- |
| PPO | Yes | Scalar (learned or verifiable) | LLM policies, discrete-action tasks |
| GRPO | No | Verifiable (deterministic) | Reasoning, code, agentic tasks |
| SAC | Yes (separate Q-network) | Dense or sparse | Continuous control, manipulation, locomotion |

Multi-Agent RL

RLinf's multi-agent mode supports the WideSeek-R1 training setup: a hierarchical system with a lead agent and subagents that share a single LLM but maintain isolated contexts per agent. This is the "width scaling" approach from the WideSeek-R1 paper (arXiv:2602.04634), designed for broad information-seeking tasks where coordinated parallel agents with isolated contexts outperform a single agent holding the full context. The system is trained end-to-end via multi-agent RL.

The WideSeek-R1 config runs N rollout workers that all load the same LLM weights but each operate with an isolated context window. The lead agent coordinates subagent calls, and the combined multi-agent RL loss across the full hierarchical interaction trace trains the shared model to orchestrate more effectively.

Embodied AI: VLA Training and Online Robot RL

Embodied AI is the primary motivation for RLinf's design. Language model sampling can absorb latency spikes: a slow token delays the output but does not break the task. Robots generate actions against a control frequency deadline (10-50 Hz for manipulation, 100-500 Hz for locomotion), so a late action is a missed control step. The inference latency budget is different by an order of magnitude, and the observation space is completely different: RGB images, proprioceptive joint states, force-torque readings, depth maps.
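
The control frequencies above translate into per-step time budgets by simple arithmetic: the budget is the control period, 1000 ms divided by the frequency in Hz. The helper name below is hypothetical and exists only to show the conversion.

```python
def step_budget_ms(control_hz):
    """Time available per control step, in milliseconds."""
    return 1000.0 / control_hz


for task, hz in [("manipulation", 10), ("manipulation", 50),
                 ("locomotion", 100), ("locomotion", 500)]:
    print(f"{task:12s} @ {hz:3d} Hz -> {step_budget_ms(hz):6.1f} ms per step")
```

Manipulation at 10-50 Hz leaves 20-100 ms per step, while locomotion at 100-500 Hz leaves only 2-10 ms. Observation encoding, policy inference, and action transport all have to fit inside that window.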

Training VLA models in RLinf uses a split-worker architecture. The vision encoder and language backbone run on the trainer node. The action head runs on rollout workers. This disaggregation lets you run more rollout workers per trainer node than a colocated setup allows, which is important because VLA rollout generation (running the model against a robot simulation or real hardware) is slower than pure language model sampling.

For teams working with OpenVLA on GPU cloud, RLinf provides the multi-node distributed training layer on top of OpenVLA's single-GPU fine-tuning scripts. If you have a custom robot embodiment and want to scale your OpenVLA fine-tuning run past what a single H100 can handle, RLinf's VLA training mode is how you do it.

For NVIDIA GR00T N1.5/N1.6 at larger scale, the same pattern applies. GR00T N1.5/N1.6's LoRA fine-tuning scripts cover single-node and small multi-GPU setups. For runs across 16+ GPUs with disaggregated rollout and live robot feedback, RLinf's distributed RL infrastructure handles the coordination layer that the GR00T N1 family's own tooling doesn't address.

Fine-tuning with LoRA is fully supported. RLinf's trainer integrates LoRA adapters for both the vision encoder and language backbone, and the action head is fine-tuned at full precision by default (the action head is small enough that LoRA overhead isn't needed). A full LoRA config for a 7B VLA:

yaml
model:
  base_path: /data/models/openvla-7b
  lora:
    enabled: true
    rank: 32
    alpha: 64
    target_modules: ["q_proj", "v_proj"]
task: embodied_vla
env:
  sim: genesis
  scene: franka_pick_place.yaml

Online RL with real robots is where RLinf distinguishes itself from sim-only frameworks. The env.robot: franka config connects RLinf's rollout workers directly to the Franka Emika robot API. The rollout worker sends an action, waits for the robot's proprioceptive state, and produces the next trajectory step. The latency budget is tight: 100 ms per step at 10 Hz control. RLinf's rollout worker is designed to meet this deadline on H100 hardware.

For simulation-based RL, pairing Genesis physics engine on GPU cloud with RLinf gives you 10-80x sim throughput compared to CPU-based simulators, plus RLinf's distributed policy updates across the simulation worker pool. Genesis handles the physics; RLinf handles the policy.

Multi-Node Setup on H100 GPU Cloud (8 to 256 GPUs)

Installation

bash
# Clone RLinf
git clone https://github.com/RLinf/RLinf.git
cd RLinf
pip install -e '.[all]'

# Install Ray for distributed runtime
pip install 'ray[default]'

# Verify CUDA
python -c 'import torch; print(torch.cuda.device_count(), torch.version.cuda)'

8-GPU Single-Node Setup

All worker types run on one machine. This is the starting point for algorithm development and small-scale VLA fine-tuning.

yaml
# config/ppo_vla_8gpu.yaml
algorithm: ppo
ray_cluster:
  num_nodes: 1
rollout_worker:
  num_workers: 4
  num_gpus: 1
  device: cuda
trainer:
  num_gpus: 4
  gradient_accumulation_steps: 4
reward_worker:
  device: cpu
  num_workers: 4
checkpoint_interval: 100

Launch:

bash
# Start local Ray (single node)
ray start --head --port=6379 --num-gpus=8

# Run training
python -m rlinf.train config/ppo_vla_8gpu.yaml

32-GPU 4-Node Setup

Rollout workers disaggregated from the trainer across nodes. The trainer holds all optimizer state on node 0; nodes 1-3 run rollout workers.

yaml
# config/ppo_vla_32gpu.yaml
algorithm: ppo
ray_cluster:
  num_nodes: 4
  head_node_gpus: 8
rollout_worker:
  num_workers: 24    # 3 nodes * 8 GPUs
  num_gpus: 1
  device: cuda
trainer:
  num_gpus: 8        # head node only
  gradient_accumulation_steps: 8
reward_worker:
  device: cpu
  num_workers: 16

Start the Ray cluster first:

bash
# On head node (trainer)
ray start --head --port=6379 --num-gpus=8

# On each of the 3 rollout nodes
ray start --address='HEAD_IP:6379' --num-gpus=8

# Verify cluster
ray status
# Should show 4 nodes, 32 GPUs total

# Launch
python -m rlinf.train config/ppo_vla_32gpu.yaml
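
The `rollout_worker.num_workers` values in the multi-node configs follow from the layout: one head node is reserved for the trainer, and every GPU on the remaining nodes runs one rollout worker. A small sanity check along those lines, with a hypothetical helper name and the assumption of 8 GPUs per node and one GPU per rollout worker:

```python
def rollout_worker_count(num_nodes, gpus_per_node=8, trainer_nodes=1):
    """Rollout workers available when trainer nodes are set aside.

    Assumes one GPU per rollout worker and whole nodes reserved for the
    trainer, as in the multi-node configs in this guide.
    """
    if num_nodes <= trainer_nodes:
        raise ValueError("need at least one node beyond the trainer nodes")
    return (num_nodes - trainer_nodes) * gpus_per_node


print(rollout_worker_count(4))   # 4-node cluster
print(rollout_worker_count(32))  # 32-node cluster
```

The single-node 8-GPU config does not fit this formula because the trainer and rollout workers split the GPUs of one machine (4 and 4).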

256-GPU 32-Node Setup

Full validated scale from the RLinf paper. One head node (trainer) and 31 rollout worker nodes, all connected via high-speed interconnect.

yaml
# config/ppo_vla_256gpu.yaml
algorithm: ppo
ray_cluster:
  num_nodes: 32
  head_node_gpus: 8
rollout_worker:
  num_workers: 248   # 31 nodes * 8 GPUs
  num_gpus: 1
  device: cuda
trainer:
  num_gpus: 8
  gradient_accumulation_steps: 32
  tensor_parallel_size: 4     # for 70B+ policies
reward_worker:
  device: cpu
  num_workers: 64
checkpoint_interval: 50       # more frequent at this scale

At 256 GPUs, the weight broadcast step (trainer pushing updated weights to all rollout workers after each update) is the primary inter-node communication bottleneck. With 400 Gbps RDMA (InfiniBand or RoCEv2), this broadcast completes in under 2 seconds for a 7B policy. The RLinf paper's validated 256-GPU run used 400 Gbps RoCEv2 with 8x Mellanox ConnectX-7 NICs per node, demonstrating that RoCEv2 meets the bandwidth requirement at full scale. With 100 Gbps Ethernet, the same broadcast takes 8-12 seconds and becomes a significant fraction of per-update wall-clock time.
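
A back-of-envelope lower bound helps when sizing the interconnect. The sketch below computes the time to push one copy of the policy weights over a single link at line rate, assuming bf16 weights (2 bytes per parameter, an assumption not stated in the source). Real broadcasts take longer because of fan-out to many workers, serialization, and protocol overhead, which is why the measured figures quoted above are well above these ideal numbers.

```python
def broadcast_lower_bound_s(num_params, bytes_per_param, link_gbps):
    """Ideal time to move one copy of the weights over one link.

    link_gbps is the line rate in gigabits per second. This ignores
    fan-out, serialization, and protocol overhead, so treat the result
    as a floor rather than a prediction.
    """
    payload_bytes = num_params * bytes_per_param
    link_bytes_per_s = link_gbps * 1e9 / 8
    return payload_bytes / link_bytes_per_s


params_7b = 7e9
for gbps in (400, 100):
    seconds = broadcast_lower_bound_s(params_7b, 2, gbps)
    print(f"{gbps} Gbps: at least {seconds:.2f} s for a 7B bf16 policy")
```

The 4x ratio between the two link speeds carries through to the real-world numbers: whatever the overhead factor is on a given cluster, dropping from 400 Gbps to 100 Gbps multiplies the broadcast time by at least four.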

Interconnect, Node Count, and GPU Right-Sizing

| Config | GPUs | Interconnect minimum | Primary use case | Expected scaling efficiency |
| --- | --- | --- | --- | --- |
| Single-node | 8x H100 | NVLink (intra-node) | 7B VLA fine-tuning, algorithm dev | N/A (single node) |
| 4-node | 32x H100 | 100 Gbps RoCE | Multi-agent LLM RL, moderate scale | 80-85% at 32 GPUs |
| 8-node | 64x H100 | 400 Gbps RDMA (InfiniBand or RoCEv2) | WideSeek-R1 scale, production VLA | 85-90% at 64 GPUs |
| 32-node | 256x H100 | 400 Gbps RDMA (InfiniBand or RoCEv2) | Full RLinf validated scale (paper used RoCEv2) | 88-92% at 256 GPUs |

For rollout workers specifically: they communicate with the trainer over the inter-node fabric for weight broadcasts and trajectory uploads. The RolloutWorker has no gradient communication with other rollout workers, so you don't need all-reduce bandwidth between rollout nodes. The 400 Gbps RDMA requirement is about trainer-to-all-rollout bandwidth, not all-to-all. A star topology with the trainer node as the hub is sufficient.

The trainer node itself benefits most from high intra-node bandwidth (NVLink for FSDP weight sharding). For 70B+ policies running tensor parallelism on the trainer, NVLink between the trainer GPUs is mandatory. For 7B-13B policies without TP, PCIe within the trainer node is acceptable.

For Spheron's cluster instance types and their interconnect specifications (InfiniBand vs Ethernet, bandwidth per node, and which workload profile each suits), see the Spheron instance types docs.

Running RLinf on Spheron: Cost and Pricing

H100 SXM5 pricing on Spheron as of 11 Jun 2026:

| Config | GPUs | On-Demand $/hr (total) | Spot $/hr (total) | Use case |
| --- | --- | --- | --- | --- |
| Single-node | 8x H100 SXM5 | $40.08 | $11.44 | 7B VLA fine-tuning, algo dev |
| 4-node | 32x H100 SXM5 | $160.32 | $45.76 | Multi-agent RL, WideSeek-R1 |
| 8-node | 64x H100 SXM5 | $320.64 | $91.52 | Production VLA training |
| 32-node | 256x H100 SXM5 | $1,282.56 | $366.08 | Full validated scale |

Spot and on-demand totals are calculated using $5.01/GPU/hr on-demand and $1.43/GPU/hr spot (lowest rates from the Spheron API).

Spot vs on-demand allocation strategy for RLinf:

RLinf's worker disaggregation maps directly to a spot/on-demand cost split. Rollout workers are stateless: they load policy weights from the trainer at each update cycle and have no optimizer state. A preempted rollout worker restarts, pulls the latest checkpoint from the trainer, and resumes. You lose one update cycle of rollout throughput, not any training progress. Run rollout workers on spot.

The trainer worker holds optimizer state. A preempted trainer means restarting from the last checkpoint. Run the trainer on on-demand. The reference worker (frozen model for KL divergence) is inference-only and spot-eligible.

A typical 32-GPU run might allocate 8 on-demand GPUs (trainer) and 24 spot GPUs (rollout workers). At Spheron's rates, that's 8 × $5.01 + 24 × $1.43 = $74.40/hr, compared to $160.32/hr all on-demand. A ~54% cost reduction for the same compute.
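
The same arithmetic is easy to rerun for other cluster shapes or updated rates. The sketch below uses the per-GPU rates quoted in this section; the function and constant names are illustrative.

```python
ON_DEMAND_RATE = 5.01  # $/GPU/hr, rate quoted in this section
SPOT_RATE = 1.43       # $/GPU/hr, rate quoted in this section


def hourly_cost(on_demand_gpus, spot_gpus,
                on_demand_rate=ON_DEMAND_RATE, spot_rate=SPOT_RATE):
    """Total $/hr for a cluster mixing on-demand and spot GPUs."""
    return on_demand_gpus * on_demand_rate + spot_gpus * spot_rate


mixed = hourly_cost(on_demand_gpus=8, spot_gpus=24)   # trainer + rollout
all_on_demand = hourly_cost(on_demand_gpus=32, spot_gpus=0)
savings = 1 - mixed / all_on_demand
print(f"${mixed:.2f}/hr vs ${all_on_demand:.2f}/hr ({savings:.0%} lower)")
```

Calling `hourly_cost(8, 248)` gives the mixed rate for the 256-GPU layout ($394.72/hr), which is the basis for the spot-rollout figures in the duration estimates that follow.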

Experiment duration estimates:

  • 7B VLA fine-tuning on 8x H100 (single-node): 12-24 hours for a LoRA run, ~$481-$962 on-demand (single-node runs on one instance, so all GPUs share the same pricing tier; spot/on-demand mixing requires a multi-node setup)
  • Multi-agent RL on 32x H100 (4-node): 48-72 hours typical, ~$7,695-$11,543 full on-demand or ~$3,571-$5,357 with spot rollout workers
  • Full 256-GPU run (32-node): depends on dataset and convergence, ~$20,521-$30,781 per 16-24 hours on-demand, or ~$6,316-$9,473 with spot rollout workers

For long runs, Spheron's reserved multi-node blocks reduce the effective per-hour rate further. Check current GPU pricing for live rates before budgeting a multi-week training campaign.

Pricing fluctuates based on GPU availability. The prices above are as of 11 Jun 2026 and may have changed; check current GPU pricing for live rates.


Common Issues and Fixes

Ray cluster connection refused. The head node IP must be reachable from all worker nodes on ports 6379 (Ray runtime) and 8265 (dashboard). Check firewall rules on each cloud instance. On Spheron, use the instance's private IP within the same VPC for the head node address, not the public IP.

Weight broadcast timeout at 64+ nodes. The default Ray object store timeout is 600 seconds. If large weight tensors hit it, give the object store more room with ray.init(object_store_memory=100_000_000_000) (the value is a capacity in bytes, roughly 100 GB, not a timeout) and set RAY_object_store_allow_slow_storage=1. For weight tensors larger than 50 GB (70B+ policies), increase the Ray plasma store size on each node proportionally.

Rollout throughput collapse on CPU reward workers. CPU reward workers saturate on verifier-heavy tasks (code execution, complex math). Increase reward_worker.num_workers to add more CPU processes. If verifier latency is the bottleneck (each verification call takes 500ms+), increase rollout batch size to amortize the verifier calls across more completions per update cycle.

NCCL timeout during gradient sync. For multi-node runs with more than 8 nodes, set NCCL_TIMEOUT=600 and NCCL_ASYNC_ERROR_HANDLING=1. If timeouts persist, check that all nodes are on the same RDMA fabric (InfiniBand or RoCEv2) and that the NCCL topology file (if using a custom topology) matches the actual cluster interconnect.

VLA rollout latency exceeds 100ms budget. For real-robot online RL, profile where time goes: model inference (usually 60-80ms on H100), action tokenization/de-tokenization (2-5ms), robot API round-trip (network-dependent). If model inference is the bottleneck, enable Flash Attention 3 and reduce action token sequence length. If the robot API round-trip is the bottleneck, co-locate the rollout worker with the robot controller network.
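
A simple way to act on that profile is to add up the measured stage latencies and compare them with the step budget. The sketch below uses a hypothetical helper and sample numbers: the inference and tokenization values fall inside the ranges quoted above, and the 30 ms robot round-trip is a made-up placeholder to replace with your own measurement.

```python
def check_budget(stage_ms, budget_ms=100.0):
    """Summarize per-stage latencies against a per-step budget."""
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "bottleneck": max(stage_ms, key=stage_ms.get),
        "within_budget": total <= budget_ms,
    }


report = check_budget({
    "model_inference": 70.0,       # within the 60-80 ms range quoted above
    "action_tokenization": 4.0,    # within the 2-5 ms range quoted above
    "robot_api_round_trip": 30.0,  # placeholder; measure on your network
})
print(report)
```

In this sample the step overruns the 100 ms budget by 4 ms and model inference is the largest stage, so the inference-side fixes would be the first thing to try.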


RLinf has been validated at 256 GPUs across 32 H100 nodes, a scale that most research clusters can't provision quickly. Spheron provisions multi-node H100 clusters in minutes, with spot rollout workers cutting the per-experiment cost by roughly 50% or more at multi-node scale.



Quick Setup Guide

  1. Install RLinf and verify CUDA environment

    Clone the RLinf repository from the official Tsinghua repo. Install dependencies with `pip install -e '.[all]'`. Verify CUDA visibility with `python -c 'import torch; print(torch.cuda.device_count())'`. For multi-node setups, confirm that NCCL and optionally NVSHMEM are available. RLinf uses Ray as its default distributed runtime, so install Ray with `pip install 'ray[default]'`.

  2. Configure the Worker abstraction for your algorithm

    Define your worker roles in the RLinf config YAML. Each role maps to a resource type (GPU, NPU, or CPU) and a process count. For a PPO run on 8 H100s: set `rollout_worker.num_workers=4` with `rollout_worker.num_gpus=1` (one GPU per worker) and `trainer.num_gpus=4`. For heterogeneous runs, set `rollout_worker.device=cuda` and `reward_worker.device=cpu`. RLinf dispatches tasks to the correct device pool without algorithm-side changes.

  3. Set up a multi-node Ray cluster on GPU cloud

    On the head node: `ray start --head --port=6379 --num-gpus=8`. On each worker node: `ray start --address='HEAD_IP:6379' --num-gpus=8`. Verify cluster health with `ray status`. For 256-GPU runs (32 nodes), use a head node with 8 GPUs and 31 worker nodes. Ensure the head node IP is reachable from all workers on port 6379 and 8265 (Ray dashboard).

  4. Launch a PPO training job with RLinf

    Run `python -m rlinf.train config/ppo_llm.yaml`. Key config fields: `algorithm: ppo`, `rollout_worker.num_envs_per_worker: 512`, `trainer.gradient_accumulation_steps: 4`, `checkpoint_interval: 100`. For GRPO, switch to `algorithm: grpo` and set `reward_fn: your_verifier_module`. For SAC on continuous-control embodied AI, use `algorithm: sac` and connect a GPU-native environment (Isaac Lab or Genesis).

  5. Fine-tune a VLA model with RLinf embodied AI mode

    Set `task: embodied_vla` in the config. Provide the base VLA checkpoint path under `model.base_path`. RLinf loads the vision encoder and language backbone on the trainer node and the action head on rollout workers. For real-robot online RL (Franka), set `env.robot: franka` and configure the robot API endpoint. For simulation-only fine-tuning (e.g. GR00T N1.5/N1.6 on Genesis), set `env.sim: genesis` and point to your Genesis scene config.

  6. Scale from 8 to 256 GPUs without config rewrites

    RLinf's macro-to-micro flow transformation handles scaling without algorithm-level changes. To scale, increase `rollout_worker.num_workers` and `ray_cluster.num_nodes` in the config. The framework re-partitions the worker pool automatically. Monitor scaling efficiency with `rlinf.bench throughput`; linear scaling to 256 GPUs requires at least 400 Gbps inter-node bandwidth. On Spheron, verify your cluster's interconnect spec before booking reserved multi-node capacity.


Frequently Asked Questions

What is RLinf, and how does it differ from verl, OpenRLHF, and TRL?

RLinf (Reinforcement Learning Infrastructure) is a distributed RL training framework from Tsinghua University focused on embodied AI and agentic systems. Unlike verl, OpenRLHF, and TRL, which target language model alignment (RLHF), RLinf is designed for RL algorithms applied to policy optimization across heterogeneous hardware pools. It supports PPO, GRPO, and SAC natively and disaggregates rollout workers from trainer processes, making it well suited for robotic simulation environments and multi-agent training runs. verl and OpenRLHF are the better fit if your goal is reward-modeling LLM behavior; RLinf is the better fit if your goal is training or fine-tuning a VLA or embodied AI policy.

Does RLinf support VLA training and fine-tuning?

Yes. RLinf includes a dedicated embodied AI module that supports training and fine-tuning Vision-Language-Action (VLA) models. It ships with adapters for OpenVLA-style observation encoders and supports the Franka Emika and Turtle2 robot APIs for real-world online RL. For fine-tuning GR00T N1.5/N1.6 or OpenVLA on custom robot data, you mount the base VLA weights and run RLinf's VLA fine-tuning pipeline, which handles observation tokenization and action-head gradient updates within the distributed rollout framework.

How many GPUs do I need to run RLinf?

For single-agent RL on a 7B VLA (e.g. OpenVLA-7B): 4-8 H100 SXM5 GPUs cover both the rollout workers and trainer. For multi-agent RL like WideSeek-R1 with dozens of parallel policies: 32-64 H100s is a practical starting point. RLinf has been validated at 256 H100 GPUs (32 nodes of 8 GPUs each) for its largest published experiments. Rollout workers are stateless and preemption-safe, so you can run them on spot instances and restrict on-demand provisioning to the trainer node.

What interconnect does RLinf need for multi-node scaling?

RLinf's macro-to-micro flow transformation disaggregates computation across node boundaries, so inter-node bandwidth directly affects rollout throughput. For the trainer-to-rollout-worker weight broadcast step, 400 Gbps RDMA (InfiniBand or RoCEv2) achieves near-linear scaling from 8 to 256 GPUs. RLinf's published 256-GPU validation cluster used 400 Gbps RoCEv2 (8x Mellanox ConnectX-7 NICs per node), not InfiniBand. The real requirement is 400 Gbps RDMA bandwidth; both InfiniBand and RoCEv2 satisfy it. 100 Gbps Ethernet works for runs up to 32 GPUs but introduces a 15-30% scaling penalty at 64+ GPUs. Spheron's multi-node H100 and H200 clusters include high-speed interconnect to support the full RLinf scale range.

What is the macro-to-micro flow transformation?

RLinf's macro-to-micro flow transformation is the core scheduling primitive that converts a high-level RL algorithm description (macro flow: collect rollouts, compute rewards, update policy) into a distributed execution graph over heterogeneous workers (micro flow: assign GPU, NPU, or CPU workers per computation type). This abstraction lets researchers write algorithm logic at the macro level while RLinf handles device placement, batch routing, and communication schedules transparently. It is the mechanism that makes RLinf's heterogeneous GPU/NPU/CPU resource pooling possible without algorithm-specific code changes.
