RL Environments GPU Cloud: Gymnasium, Prime Intellect & Verifiers

The throughput bottleneck in agentic RL training is no longer the policy optimizer. It is the environment. In 2026, with GRPO and RLOO replacing PPO as the dominant post-training algorithms for reasoning models, the rate at which your environment can generate verified rollouts determines training wall-clock time more than your GPU count or batch size.

This post covers the full stack: why CPU environments become the bottleneck at scale, how GPU-native environments and verifiers close the gap, a detailed comparison of frameworks from standard Gymnasium to the Prime Intellect Environments Hub, and a concrete architecture for running a vectorized environment fleet on GPU cloud alongside your GRPO or RLHF trainer.

Why RL Environments Are the New Bottleneck

The shift happened because verifiable-reward RL algorithms changed what "a training step" means.

In classic PPO with a trained reward model, the bottleneck was the four-model choreography: actor, critic, reference, reward. Each gradient step required managing four sets of weights. In GRPO, the critic disappears. You sample G completions per prompt, evaluate them with a verifiable reward function, compute group-relative advantages, and update. The bottleneck moves from model weight management to rollout generation and reward verification.
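The group-relative advantage step described above can be sketched in a few lines. This is a minimal illustration using only the standard library: the function name, the use of the population standard deviation, and the epsilon value are assumptions made for this sketch, and real trainers perform the same normalization on tensors.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the G rewards sampled for one prompt to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption of this sketch)
    # eps keeps the division finite when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# G = 4 completions for one prompt, two verified correct and two wrong.
advantages = group_relative_advantages([1.0, -1.0, 1.0, -1.0])

# A group where every completion scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Because the baseline is the group mean, no critic network is needed; the cost is that every prompt requires G verified completions before an update can happen.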

That is where environments come in. GRPO needs completions scored against ground truth. For reasoning and agentic tasks, scoring means running code, checking math, validating tool calls. The environment handles all three. If your environment produces 10K verified completions per second and your GPU policy can consume 500K per second, the GPU sits idle 98% of the time.
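The idle-time figure follows from the ratio of the two rates. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def consumer_idle_fraction(produce_rate, consume_rate):
    """Fraction of time a consumer waits when the producer is the slower side."""
    if consume_rate <= 0:
        raise ValueError("consume_rate must be positive")
    # If the producer keeps up (or outpaces the consumer), idle time is zero.
    return max(0.0, 1.0 - produce_rate / consume_rate)

# Environment produces 10K completions/sec, policy could consume 500K/sec.
idle = consumer_idle_fraction(10_000, 500_000)
print(f"GPU idle: {idle:.0%}")
```

The same function shows how much each added rollout worker buys: doubling the produce rate to 20K only lowers idle time from 98% to 96%.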

For the full GRPO implementation guide covering the trainer side, see the GRPO fine-tuning guide. For the complete RLHF infrastructure stack with PPO and reward models, see the RLHF training infrastructure guide. For the full Agent RFT training loop that consumes verified trajectories from these environments, see the Agent RFT guide.

Environment Framework Comparison

Framework	GPU-native?	Best for	Verifiable rewards?	RL algo fit
Gymnasium	No (CPU)	Baseline, Atari, discrete	Manual (external)	Any (with wrapper)
RLlib	Partial	Production, heterogeneous clusters	Via custom env	PPO, SAC, IMPALA
CleanRL	No (CPU)	Clean reference implementations	Manual	PPO, DQN, SAC
envpool / SampleFactory	Partial (async CPU)	High-throughput Atari, MuJoCo	Manual	Any
Brax	Yes (JAX/GPU)	Continuous control, robotics sim	Limited	PPO, SAC
Isaac Gym / IsaacLab	Yes (CUDA)	Physical robotics, manipulation	Limited	PPO, SAC
ManiSkill3	Yes (GPU)	Dexterous manipulation, navigation	Limited	PPO, SAC
Prime Intellect Environments Hub	No (CPU-scored, agentic)	Agentic tasks, code, math, reasoning	Built-in	GRPO, RLOO, DAPO
verifiers (custom)	Configurable	Any task with ground truth	Built-in	GRPO, RLOO

The critical column is "Verifiable rewards." For GRPO and RLOO, the verifier is the reward model. Any framework without built-in verifiable rewards requires you to wire one up externally. Prime Intellect Environments Hub and custom verifiers ship with reward functions ready to plug directly into GRPOTrainer.reward_funcs. For embodied AI workloads where the environment is a physics simulator (Genesis, Isaac Lab) rather than a verifier, RLinf's distributed rollout architecture sits above these environment frameworks and handles the distributed policy update layer. For a different closed-loop pattern where eval runs produce training data directly, the HUD evals-as-training-data approach builds on the same verifier infrastructure.

The CPU-GPU Transfer Wall

Standard Gymnasium runs simulation on the CPU. Each env.step() call executes Python code, updates state, and returns an observation numpy array. Your training loop copies that array to GPU for the policy forward pass, gets an action, copies it back to CPU for the next step.

At small scale (8-32 envs), this is fine. At the scale GRPO needs, it is not.

Here is what the bottleneck looks like in a profiling trace:

python

import time
import gymnasium as gym
import torch

# Vectorized CPU environment
env = gym.make_vec("CartPole-v1", num_envs=512, vectorization_mode="async")
obs, _ = env.reset()

# Stand-in policy so the loop runs: CartPole has 4 observation features, 2 actions
policy = torch.nn.Linear(4, 2).to("cuda")

steps = 0
t0 = time.time()
for _ in range(10_000):
    # This is the bottleneck: CPU simulation + PCIe transfer
    obs_tensor = torch.from_numpy(obs).float().to("cuda")  # PCIe copy
    with torch.no_grad():
        # Greedy discrete action per env, copied back to CPU
        action = policy(obs_tensor).argmax(dim=-1).cpu().numpy()
    obs, reward, terminated, truncated, info = env.step(action)  # CPU simulation
    steps += 512

elapsed = time.time() - t0
print(f"Throughput: {steps / elapsed:.0f} steps/sec")
# Typical: 80,000-150,000 steps/sec for CartPole with 512 envs

For CartPole, you might see 100K steps/sec. For code execution or math verification, you are looking at 500-5,000 steps/sec because each step requires spawning a subprocess or calling a solver.

On an H100 SXM5, a GPU-native environment like Brax keeps the whole simulation on the device. The example uses Brax's ant locomotion task rather than CartPole:

python

import brax
from brax.envs import create
import jax
import jax.numpy as jnp

env = create("ant", batch_size=4096)  # 4096 parallel environments on GPU
jit_step = jax.jit(env.step)
jit_reset = jax.jit(env.reset)

state = jit_reset(jax.random.PRNGKey(0))
# All simulation runs in GPU memory, no PCIe transfers
# Typical: 5-50 million steps/sec for physics envs

For agentic tasks where you cannot fully parallelize the environment (code execution needs isolation, math checking needs a solver), the gain is smaller but still meaningful: 5-20x when the verifier can batch-evaluate.
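Batch evaluation can be as simple as fanning a per-completion check out over a worker pool. The sketch below is one way to do that with the standard library; `batch_verify`, `exact_match`, and the `fail_reward` default are hypothetical names chosen for illustration, not part of any framework mentioned here.

```python
from concurrent.futures import ThreadPoolExecutor

def batch_verify(score_one, completions, references, max_workers=8, fail_reward=-1.0):
    """Score completions concurrently; any verifier error maps to fail_reward."""
    def safe(pair):
        completion, reference = pair
        try:
            return float(score_one(completion, reference))
        except Exception:
            return fail_reward

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so rewards line up with completions
        return list(pool.map(safe, zip(completions, references)))

def exact_match(completion, reference):
    """Toy per-completion check used only to exercise batch_verify."""
    return 1.0 if completion.strip() == reference.strip() else -1.0

rewards = batch_verify(exact_match, ["4", " 7 ", "x"], ["4", "7", "9"])
```

Threads fit verifiers that mostly wait on a subprocess or an external solver. For checks that are CPU-bound in Python itself, `ProcessPoolExecutor` is the usual substitute.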

Prime Intellect Environments Hub

The Prime Intellect Environments Hub is an open collection of verifiable RL environments for agentic model training. The environments cover tasks where correctness can be checked programmatically: code generation with test suite execution, math problems with answer verification, tool use with execution traces, and multi-step reasoning tasks.

Unlike GPU-native physics simulators (Isaac Gym, Brax), Hub environments score completions on CPU. The verifier checks code output, evaluates math answers, or validates tool calls against a ground-truth reference. The advantage over writing your own verifier is that the environments ship with ground-truth datasets and scoring logic already paired together, so you skip the boilerplate and get a working reward signal immediately.

Installing and pulling an environment from the Hub:

python

# pip install verifiers
# prime env install PrimeIntellect/math-verify
import verifiers as vf

# Load a verifier environment from the Hub
env = vf.load_environment("PrimeIntellect/math-verify")

# Score a batch of completions against the current prompts (stateless evaluation)
prompts = ["Solve: 2x + 5 = 15", "Find x: 3x^2 - 12 = 0"]
completions = policy.generate(prompts)
rewards = env.evaluate(prompts, completions)
# rewards: list of floats, e.g. [1.0, -1.0]

The evaluate call is stateless: it scores each completion against its corresponding prompt in the same call. There is no persistent environment state to manage between batches, which makes this pattern safe to use directly inside a GRPO reward function.

For GRPO specifically, you pass a reward function wrapping the verifier environment directly to GRPOTrainer:

python

from trl import GRPOConfig, GRPOTrainer
import verifiers as vf

env = vf.load_environment("PrimeIntellect/math-verify")

def reward_fn(prompts, completions, **kwargs):
    # Stateless evaluation: scores completions against the provided prompts
    return env.evaluate(prompts, completions)

config = GRPOConfig(
    num_generations=8,
    temperature=0.8,
    max_completion_length=2048,
    beta=0.04,
)
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_fn],
    args=config,
    train_dataset=dataset,
)
trainer.train()

The Hub is designed to eliminate the custom reward function boilerplate that most GRPO implementations require. If your task fits one of the Hub's environments, use it directly rather than writing your own verifier.

Verifiers: Replacing the Reward Model

For tasks not covered by the Hub, you write a verifier. The verifier pattern is simpler than it sounds: it is just a function that takes a completion and returns a scalar.

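The key distinction from a learned reward model (what PPO uses) is that a verifier is deterministic. There is no training phase, no reward model checkpoint, no distribution shift to manage. The reward signal is exactly as reliable as the check itself, so the remaining risk is a gap in the check that the policy can exploit, such as calling sys.exit(0) before the tests run.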

Here is the minimal verifier pattern:

python

import re
import subprocess
import sympy

def math_verifier(prompts, completions, ground_truth, **kwargs):
    """
    Binary reward: 1.0 for correct final answer, -1.0 for wrong or malformed.
    """
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Parse final answer from <answer>...</answer> tags
        match = re.search(r'<answer>(.*?)</answer>', completion, re.DOTALL)
        if not match:
            rewards.append(-1.0)
            continue
        answer_str = match.group(1).strip()
        try:
            predicted = sympy.sympify(answer_str)
            expected = sympy.sympify(gt)
            correct = sympy.simplify(predicted - expected) == 0
            rewards.append(1.0 if correct else -1.0)
        except Exception:
            rewards.append(-1.0)
    return rewards


def code_verifier(prompts, completions, test_cases, **kwargs):
    """
    Run each completion against test cases in a subprocess.
    Uses an exec harness to prevent sys.exit(0) reward hacking.
    WARNING: use proper sandboxing (Docker --network none) in production.
    """
    import re
    rewards = []
    for completion, tests in zip(completions, test_cases):
        code_match = re.search(r'```python\n(.*?)```', completion, re.DOTALL)
        if not code_match:
            rewards.append(-1.0)
            continue
        code = code_match.group(1)
        # Run model code and tests in separate exec() calls so sys.exit(0) in
        # model code cannot skip the test suite. SystemExit is caught explicitly.
        harness = "\n".join([
            "import sys",
            "sys.exit = lambda *a: None",
            "_ns = {}",
            "try:",
            "    exec(compile(" + repr(code) + ", '<model>', 'exec'), _ns)",
            "except SystemExit:",
            "    pass",
            "exec(compile(" + repr(tests) + ", '<tests>', 'exec'), _ns)",
        ])
        try:
            result = subprocess.run(
                ["python3", "-c", harness],
                timeout=10,
                capture_output=True,
                text=True,
            )
            rewards.append(1.0 if result.returncode == 0 else -1.0)
        except Exception:
            rewards.append(-1.0)
    return rewards

The verifier replaces the reward model entirely. For tasks with verifiable ground truth, this is almost always the right choice over PPO's trained reward model. For tasks where reward is inherently subjective or cannot be computed from a ground truth reference, PPO with a trained reward model is still appropriate. The DPO vs PPO comparison covers the decision criteria in detail.

Architecture: Environment Fleet on GPU Cloud

A production agentic RL setup splits into three tiers:

Tier 1: Stateless rollout workers (spot-eligible)
  - vLLM policy server: serves inference for rollout generation
  - Environment + verifier: generates completions, scores rewards
  - Communicates: receives policy weights from trainer, sends (prompt, completion, reward) back

Tier 2: Rollout coordinator (optional, lightweight)
  - Load balances prompts across rollout workers
  - Aggregates trajectories before sending to trainer
  - Can be a simple Ray actor or a message queue

Tier 3: Stateful trainer (on-demand, non-preemptible)
  - Policy optimizer: holds model weights + AdamW state
  - Computes GRPO advantages, runs backward pass
  - Broadcasts updated weights to rollout workers after each update step

This maps directly to how verl and OpenRLHF organize their worker pools:

+---------------------------+           +------------------------+
|  Trainer Node (on-demand) |           | Rollout Node 1 (spot)  |
|  - Policy + optimizer     |  weights  | - vLLM server          |
|  - Reference model        | --------> | - Environment/verifier |
|  - GRPO advantage compute | <-------- | - Rollout buffer       |
+---------------------------+  rewards  +------------------------+
            |
            |                 weights   +------------------------+
            +-------------------------> | Rollout Node 2 (spot)  |
                            <---------- | - vLLM server          |
                              rewards   | - Environment/verifier |
                                        +------------------------+

The rollout workers hold no optimizer state. If they are preempted, they restart, pull the latest checkpoint from the trainer, and resume. You lose one checkpoint interval of throughput, not any gradient progress.
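The restart path for a preempted worker starts with finding the newest checkpoint. A minimal sketch, assuming checkpoints are saved as `checkpoint-<step>` directories under one root (the layout the Hugging Face Trainer family uses); the helper name and the shared directory are assumptions for illustration.

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(root):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best = None
    for path in Path(root).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir():
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, path)
    return best[1] if best else None

# Demo on a throwaway directory; a real worker would point at shared storage.
with tempfile.TemporaryDirectory() as root:
    for name in ("checkpoint-50", "checkpoint-100", "checkpoint-tmp"):
        (Path(root) / name).mkdir()
    newest = latest_checkpoint(root).name
    empty = latest_checkpoint(Path(root) / "checkpoint-tmp")
```

Comparing the step as an integer matters: a plain string sort would rank `checkpoint-50` above `checkpoint-100`.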

The PyTorch-native agentic RL guide covers how TorchForge and Monarch wire these environment rollout workers into a single-controller training loop without Ray actors.

For the RLHF parallel: rollout nodes in GRPO fill the same role as the vLLM rollout workers in PPO. The difference is what runs alongside them: in GRPO, the environment and verifier replace the trained reward model.
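The coordinator tier's aggregation step can be sketched as a small buffer that releases a prompt's trajectories only once the full group has arrived. `Trajectory` and `GroupAggregator` are hypothetical names for this illustration; verl and OpenRLHF have their own data structures for this.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt_id: int
    completion: str
    reward: float
    policy_version: int  # which weights generated this completion

class GroupAggregator:
    """Collects trajectories until a prompt has a full group of G samples."""

    def __init__(self, group_size):
        self.group_size = group_size
        self._pending = defaultdict(list)

    def add(self, traj):
        """Return the complete group when it fills, otherwise None."""
        group = self._pending[traj.prompt_id]
        group.append(traj)
        if len(group) == self.group_size:
            return self._pending.pop(traj.prompt_id)
        return None

agg = GroupAggregator(group_size=2)
first = agg.add(Trajectory(0, "a", 1.0, policy_version=5))
second = agg.add(Trajectory(0, "b", -1.0, policy_version=5))
```

Releasing whole groups keeps the trainer's advantage computation simple, since GRPO normalizes rewards within each prompt's group.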

GPU Sizing for Environment Fleets

Setup	Policy size	Env workers	Rollout GPUs	Trainer GPUs	Recommended GPU	Steps/sec (approx)
Small	7B	512-2048	1x spot	1x on-demand	H100 SXM5	50K-200K
Medium	32B	2048-8192	2-4x spot	2x on-demand	H200 SXM5	30K-100K
Large	70B	8192+	4-8x spot	4x on-demand	B200 SXM6	10K-50K

For the small setup with a 7B policy, the table lists the disaggregated layout: one spot rollout GPU and one on-demand trainer GPU. For initial experiments you can instead colocate both roles on a single H100 SXM5, with the environment and verifier running on CPU alongside the GPU policy. That is the right starting point before you scale out.

For medium scale, move to disaggregated: 2 spot Spheron H200 instances for rollout workers, 2 on-demand H200s for the trainer. The H200's 4.8 TB/s HBM3e bandwidth cuts rollout generation latency versus H100 for long-context completions.

For large scale with a 70B policy, B200 SXM6 gives 192 GB HBM3e per card: enough headroom to run the trainer and keep the rollout buffer fully in GPU memory without spilling to system RAM.

Wiring to GRPO and RLHF Training Stacks

TRL GRPOTrainer

TRL's GRPOTrainer accepts any callable that matches (prompts, completions, **kwargs) -> List[float]:

python

from trl import GRPOConfig, GRPOTrainer
import verifiers as vf

# Using Prime Intellect Environments Hub
pi_env = vf.load_environment("PrimeIntellect/math-verify")

def reward_fn(prompts, completions, **kwargs):
    # Stateless: evaluates each completion against its corresponding prompt
    return pi_env.evaluate(prompts, completions)

config = GRPOConfig(
    use_vllm=True,
    vllm_server_host="<rollout_node_ip>",  # spot H200 rollout node
    num_generations=8,
    temperature=0.8,
    max_completion_length=2048,
    beta=0.04,
    save_steps=50,  # frequent checkpoints for preemption safety
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_fn],
    args=config,
    train_dataset=dataset,
)
trainer.train()

verl with Custom Environments

For verl, the environment integrates at the reward function level. verl loads a custom scoring function from a file named in the config and calls it to score each rollout response:

python

# In your verl config (YAML or Hydra overrides). Key names follow recent verl
# releases; confirm them against the version you run:
# custom_reward_function.path: /path/to/my_reward.py   (placeholder path)
# custom_reward_function.name: compute_score

from my_verifiers import math_verifier  # defined in my_verifiers.py; see the verifiers section above

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """
    Score one response. solution_str is the decoded completion and
    ground_truth comes from the dataset. Returns a float reward.
    """
    rewards = math_verifier(
        prompts=[None],               # math_verifier does not read prompts
        completions=[solution_str],
        ground_truth=[ground_truth],
    )
    return rewards[0]

OpenRLHF with Disaggregated Rollout Workers

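For OpenRLHF, you replace the --reward_pretrain model with a custom reward function file. In recent OpenRLHF releases the Ray entry point is train_ppo_ray, GRPO-style advantages are selected with --advantage_estimator group_norm, and the reward file (which defines reward_func) is passed through --remote_rm_url. Confirm the flags against the version you run; the reward file path below is a placeholder:

bash

python3 -m openrlhf.cli.train_ppo_ray \
  --pretrain /checkpoints/policy_32b \
  --remote_rm_url /path/to/my_reward_func.py \
  --advantage_estimator group_norm \
  --actor_num_nodes 2 \
  --actor_num_gpus_per_node 4 \
  --vllm_num_engines 4 \
  --vllm_tensor_parallel_size 2 \
  --rollout_batch_size 1024 \
  --n_samples_per_prompt 8 \
  --save_steps 50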

The --vllm_num_engines 4 spawns four separate vLLM instances as Ray actors on your spot rollout nodes. Each handles a shard of the rollout batch in parallel.

Scaling and Cost: Live Pricing

Estimated costs for three environment fleet configurations. Rollout workers run on spot; trainers run on on-demand.

Configuration	Trainer (on-demand)	Rollout workers (spot)	Total/hr (est.)
Small: 7B, 512 envs	1x H100 SXM5 ($2.54/hr)	1x H100 SXM5 spot ($1.43/hr)	~$3.97/hr
Medium: 32B, 2048 envs	2x H200 SXM5 ($9.08/hr)	4x H200 SXM5 spot ($13.24/hr)	~$22.32/hr
Large: 70B, 8192 envs	4x B200 SXM6 ($34.44/hr)	4x B200 SXM6 spot ($21.36/hr)	~$55.80/hr

For a 24-hour small training run: ~$95. For a 24-hour medium run: ~$536. For a large 70B run over 72 hours: ~$4,018.

The spot savings for rollout workers are real at every scale. Running the medium configuration entirely on on-demand nodes (6x H200 at $4.54/hr each) costs ~$654/day versus ~$536/day with the mixed setup. The large configuration follows the same pattern: all on-demand comes to ~$1,653/day (8x B200 at $8.61/hr) versus ~$1,339/day with spot rollout workers.
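The totals above come from straightforward per-GPU arithmetic. The sketch below reproduces the medium configuration, using per-GPU rates derived from the table ($9.08 / 2 for on-demand H200, $13.24 / 4 for spot H200); the function name is hypothetical.

```python
def fleet_cost_per_hour(trainer_gpus, trainer_rate, rollout_gpus, rollout_rate):
    """Hourly cost in USD: trainer GPUs plus rollout GPUs at their own rates."""
    return trainer_gpus * trainer_rate + rollout_gpus * rollout_rate

# Medium configuration: 2 trainer GPUs and 4 rollout GPUs, all H200.
mixed = fleet_cost_per_hour(2, 4.54, 4, 3.31)      # rollout workers on spot
on_demand = fleet_cost_per_hour(2, 4.54, 4, 4.54)  # everything on-demand
savings = 1 - mixed / on_demand

print(f"mixed: ${mixed * 24:.0f}/day, on-demand: ${on_demand * 24:.0f}/day")
print(f"fleet-level savings: {savings:.0%}")
```

Note that the fleet-level saving is smaller than the per-GPU spot discount, because the trainer GPUs stay on on-demand pricing in both cases.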

Pricing fluctuates based on GPU availability. The prices above are based on 09 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Common Failure Modes

Rollout Throughput Bottleneck

Symptom: trainer GPU utilization is below 30%. The policy_update_time metric is low but rollout_collection_time dominates each step.

Diagnosis: the environment or verifier cannot produce completions fast enough to keep the trainer fed. Common causes: code execution verifier with slow test suites, single-threaded math checker, or CPU environment without async stepping.

Fix: add a second spot rollout node. For code execution verifiers, limit test suite execution time to 5 seconds per completion and parallelize across CPUs with multiprocessing.Pool. For math verifiers, switch from SymPy's full simplification to a faster exact-match check first, falling back to symbolic evaluation only on string mismatch.
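The fast-path idea for math verifiers can be sketched as follows. To stay dependency-free, this example swaps SymPy's symbolic check for an exact rational comparison with `fractions.Fraction`, so it only handles numeric answers; in a real verifier the slow path would be the symbolic evaluation. All function names are hypothetical.

```python
from fractions import Fraction

def normalize(answer):
    """Cheap canonical form: trim, lowercase, drop internal spaces."""
    return answer.strip().lower().replace(" ", "")

def numeric_equal(a, b):
    """Stand-in for the symbolic fallback: exact rational comparison."""
    try:
        return Fraction(a) == Fraction(b)
    except (ValueError, ZeroDivisionError):
        return False  # not a parseable number, so treat as a mismatch

def check_answer(predicted, expected):
    p, e = normalize(predicted), normalize(expected)
    if p == e:
        return 1.0  # fast path: string match, no parsing at all
    # Slow path runs only when the strings differ
    return 1.0 if numeric_equal(p, e) else -1.0

same_text = check_answer("x + 1", "x+1")
same_value = check_answer(" 1/2", "0.5")
different = check_answer("3", "4")
```

When most correct completions reproduce the reference answer verbatim, the string comparison settles them without invoking the slow path at all.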

Environment Reset Latency Spikes

Symptom: step times are inconsistent. Some batches complete in 200ms; others take 2-3 seconds. The outliers correlate with environment resets.

Diagnosis: environment reset is expensive when it requires loading new problem data (e.g., fetching a math problem from a dataset), initializing a new execution context, or clearing GPU memory from a completed episode.

Fix: pre-load the next episode's data asynchronously while the current episode is running. For GPU-native environments, use double-buffering: while the policy processes batch N, the environment pre-computes the reset state for batch N+1. For code execution environments, maintain a warm subprocess pool rather than spawning fresh subprocesses on each reset.
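Asynchronous pre-loading can be sketched with a background thread and a bounded queue. `EpisodePrefetcher` and its `load_fn` argument are hypothetical names for illustration; `load_fn` stands in for whatever fetches a problem from your dataset.

```python
import queue
import threading

class EpisodePrefetcher:
    """Loads upcoming episodes on a background thread into a bounded queue."""

    def __init__(self, load_fn, episode_ids, depth=2):
        self._queue = queue.Queue(maxsize=depth)
        self._thread = threading.Thread(
            target=self._fill, args=(load_fn, list(episode_ids)), daemon=True
        )
        self._thread.start()

    def _fill(self, load_fn, episode_ids):
        for episode_id in episode_ids:
            self._queue.put(load_fn(episode_id))  # blocks while the queue is full
        self._queue.put(None)  # sentinel: no more episodes

    def next_episode(self):
        """Return the next preloaded episode, or None when exhausted."""
        return self._queue.get()

prefetcher = EpisodePrefetcher(lambda i: {"id": i, "problem": f"task-{i}"}, range(3))
loaded = []
while (episode := prefetcher.next_episode()) is not None:
    loaded.append(episode["id"])
```

The `depth` bound caps memory use: the loader stays at most `depth` episodes ahead of the consumer instead of reading the whole dataset up front.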

Verifier Timeout Causing Reward NaN

Symptom: reward_mean shows NaN values in your training logs. The batch contains completions that scored as None or raised exceptions in the verifier.

Diagnosis: the code execution verifier timed out on completions with infinite loops or very long execution times. The timeout raises subprocess.TimeoutExpired; if the calling code swallows it and records None, that value becomes NaN when the batch is averaged.

Fix: wrap all verifier calls in a try/except and return -1.0 as the default failure reward. Never let exceptions propagate to the reward aggregation step. Add a hard 5-second timeout to all subprocess calls and score a timeout the same way as a malformed completion (-1.0).

python

def _run_code_check(completion, tests, timeout=5):
    """
    Execute a single completion against its test suite, returning 1.0 or -1.0.
    Uses the same sys.exit() trap as code_verifier. Raises on subprocess errors;
    the caller (safe_code_verifier) catches them via except Exception.
    """
    import re
    import subprocess
    code_match = re.search(r'```python\n(.*?)```', completion, re.DOTALL)
    if not code_match:
        return -1.0
    code = code_match.group(1)
    harness = "\n".join([
        "import sys",
        "sys.exit = lambda *a: None",
        "_ns = {}",
        "try:",
        "    exec(compile(" + repr(code) + ", '<model>', 'exec'), _ns)",
        "except SystemExit:",
        "    pass",
        "exec(compile(" + repr(tests) + ", '<tests>', 'exec'), _ns)",
    ])
    result = subprocess.run(
        ["python3", "-c", harness],
        timeout=timeout,
        capture_output=True,
        text=True,
    )
    return 1.0 if result.returncode == 0 else -1.0


def safe_code_verifier(prompts, completions, test_cases, **kwargs):
    rewards = []
    for completion, tests in zip(completions, test_cases):
        try:
            reward = _run_code_check(completion, tests, timeout=5)
        except Exception:
            reward = -1.0  # safe default, never NaN
        rewards.append(reward)
    return rewards
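As a second line of defense, the aggregation step itself can refuse non-finite values. A minimal sketch with a hypothetical helper name:

```python
import math

def sanitize_rewards(rewards, fail_reward=-1.0):
    """Replace None, NaN, and infinite rewards before they reach the mean."""
    clean = []
    for reward in rewards:
        # Check None first: math.isfinite(None) would raise TypeError
        if reward is None or not math.isfinite(reward):
            clean.append(fail_reward)
        else:
            clean.append(float(reward))
    return clean

batch = sanitize_rewards([1.0, None, float("nan"), -1.0])
reward_mean = sum(batch) / len(batch)
```

Logging how many values were replaced per batch is worth adding in practice, since a rising count points at a verifier problem rather than a policy problem.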

KL Collapse from Off-Policy Rollout Staleness

Symptom: KL divergence climbs past 0.1 and accelerates. The policy update frequency is lower than expected. Rollout completions in later batches score systematically worse than earlier ones.

Diagnosis: rollout workers are generating completions from a stale policy checkpoint. The weight synchronization between trainer and rollout workers has fallen behind, creating off-policy data. The advantage estimates computed on stale completions push the policy in incorrect directions.

Fix: reduce the weight synchronization interval. For TRL GRPO, vllm_server_host only tells the trainer where the vLLM server runs; it does not control sync frequency. Keep num_iterations at 1 so each generation batch is reused as little as possible before fresh rollouts are sampled from updated weights. For verl, check that actor_rollout_ref.rollout.load_format is set to "hf" and that weight broadcasts are not batched too aggressively. As a safeguard, add a staleness check: if the rollout worker's policy version lags the trainer's by more than 3 steps, discard that batch and re-generate.
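The staleness safeguard can be sketched as a filter applied before batches reach the trainer. It assumes each rollout batch is tagged with the policy version that generated it; the `policy_version` key and the function name are assumptions for illustration.

```python
def filter_stale(batches, trainer_version, max_lag=3):
    """Split rollout batches into usable ones and ones to regenerate."""
    fresh, stale = [], []
    for batch in batches:
        lag = trainer_version - batch["policy_version"]
        # Discard only when the lag exceeds max_lag steps
        (stale if lag > max_lag else fresh).append(batch)
    return fresh, stale

batches = [{"policy_version": v, "data": []} for v in (10, 8, 6)]
fresh, stale = filter_stale(batches, trainer_version=10)
```

Tracking the ratio of stale to fresh batches over time gives an early warning that weight synchronization is falling behind, before KL starts to climb.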

RL environment fleets split cleanly into two workloads: stateless rollout workers that are fully spot-eligible, and stateful trainers that need stable on-demand compute. Spheron lets you provision both separately with per-minute billing and no reserved commitment.


Quick Setup Guide

Profile your environment throughput bottleneck

Before provisioning GPUs, measure where time goes in your current training loop. Add timing around env.step(), policy.forward(), and the optimizer step. If env.step() accounts for more than 30% of wall-clock time, switching to GPU-native environments or adding rollout workers will help. As a rough throughput target: for a 7B policy doing 4096-token rollouts, you need at least 50,000 steps/sec to keep GPU utilization above 80%.

Choose the right environment framework

For agentic reasoning and code/math tasks: use Prime Intellect Environments Hub or write custom verifiers with the Gymnasium interface. For continuous control and robotics: use Isaac Gym, Brax, or ManiSkill3 for GPU-native physics. For standard Atari/discrete tasks: use Gymnasium with envpool or SampleFactory for vectorized CPU environments with async stepping. For production-scale training: use RLlib for Ray-native parallel rollout management.

Set up a vectorized environment fleet on GPU

Install required packages: gymnasium, envpool, and your chosen GPU-native env library. For agentic tasks with Prime Intellect Environments Hub: pip install verifiers, then pull an environment with prime env install PrimeIntellect/math-verify. Create a VectorEnv wrapper that batches obs/reward tensors on GPU. Target batch size: 512-4096 environments per rollout worker GPU. Configure worker count to match your trainer's expected steps/sec consumption.

Wire environment rollout workers to a GRPO/RLHF training stack

For GRPO with TRL or verl: implement a reward function that calls your verifier on the batch (TRL takes it through reward_funcs). Pass the rollout completions as a list, run the verifier (code execution, math check, etc.), and return one float reward per completion. For disaggregated rollout with verl or OpenRLHF: run environment workers as Ray actors that pull policy weights from the trainer, generate rollouts, and push (prompt, completion, reward) trajectories back. The trainer then aggregates trajectories and runs the policy update.

Provision spot rollout workers alongside an on-demand trainer

Environment rollout workers hold no optimizer state, so they are fully preemption-safe if you checkpoint the trainer every 50-100 steps. Provision the trainer on an on-demand H200 or B200 node. Provision rollout workers as spot instances (same GPU model). On preemption, the rollout worker restarts, reloads the latest policy checkpoint from the trainer node, and resumes generating rollouts. Net result: you lose at most one checkpoint interval of rollout throughput, not any training progress.
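The profiling in the first step can be sketched with a small per-phase timer. `PhaseTimer` is a hypothetical helper, and the `time.sleep` calls stand in for `env.step()` and `policy.forward()` so the example runs anywhere.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTimer:
    """Accumulates wall-clock time per named phase of the training loop."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of measured time spent in each phase."""
        total = sum(self.totals.values())
        return {k: v / total for k, v in self.totals.items()} if total else {}

timer = PhaseTimer()
for _ in range(3):
    with timer.phase("env_step"):
        time.sleep(0.02)   # stand-in for env.step()
    with timer.phase("policy_forward"):
        time.sleep(0.005)  # stand-in for policy.forward()
shares = timer.shares()
```

When timing real GPU work, call `torch.cuda.synchronize()` before leaving each phase; CUDA kernels launch asynchronously, so without it the GPU phases look faster than they are and the time gets attributed to whichever phase synchronizes next.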


Frequently Asked Questions

What is the difference between CPU-bound and GPU-native RL environments?

CPU-bound environments (standard Gymnasium) run simulation logic on the CPU and copy observations to GPU for the policy forward pass. This CPU-GPU transfer becomes the throughput bottleneck at scale: a single CPU core can simulate 1,000-5,000 env steps per second, while a GPU policy can consume 500,000+ steps per second. GPU-native environments (Isaac Gym, Brax, ManiSkill3) run the full simulation in GPU memory, eliminating the transfer wall and delivering 10-100x throughput gains on the same hardware. Agentic verifier environments (Prime Intellect Environments Hub, custom verifiers) are different: they score completions on CPU by checking code output, math answers, or tool calls. They accelerate training not by GPU-native simulation but by eliminating the learned reward model entirely.

What is the Prime Intellect Environments Hub?

The Prime Intellect Environments Hub is an open collection of verifiable RL environments designed specifically for agentic AI training. It ships environments with built-in reward functions that return binary or scalar signals without a learned reward model. This makes it directly compatible with GRPO and other verifiable-reward RL algorithms. Environments cover code execution, math verification, tool use, and reasoning tasks: the workloads most relevant to agentic model training in 2026.

How many GPUs do I need for a vectorized RL environment fleet?

For a typical agentic RL run with a 7B policy: a single H100 SXM5 running 512-2048 vectorized environments alongside the trainer is enough for initial exploration. For production-scale training with a 32B policy and tens of thousands of parallel environments, 4-8 H100/H200 GPUs dedicated to rollout workers plus 2-4 GPUs for the trainer is a practical starting point. The rollout workers hold no optimizer state and are fully spot-eligible. In the configurations priced above, spot rollout GPUs are about 27-44% cheaper than on-demand, which cuts total fleet cost by roughly 18-22% compared to all on-demand provisioning.

Can I use verifiers with GRPO training on GPU cloud?

Yes. Verifiers are the natural reward source for GRPO: they return a binary or scalar correctness signal that replaces the trained reward model PPO requires. The verifier runs as a separate process that receives rollout completions from the policy, checks them against ground-truth references (e.g., runs code, checks math answers, validates tool calls), and returns a score. This score feeds directly into the GRPO advantage computation. The key infrastructure requirement is that the verifier process should run on the same node or low-latency cluster as the rollout worker to avoid inter-node scoring latency.

How does environment throughput affect training wall-clock time?

Environment throughput (steps/second) sets the floor on how fast advantage estimates can be collected for each policy update. If your policy consumes 200k steps/sec but your environment produces only 10k steps/sec, the trainer sits idle 95% of the time. GPU-native environments close this gap by 10-100x. As a rough benchmark: 2048 parallel Gymnasium CartPole envs on CPU produce ~500k steps/sec with vectorized resets, while a GPU-native Brax equivalent on one H100 reaches 10-50 million steps/sec. For agentic tasks (code execution, reasoning verification), throughput is lower but still 5-20x better on GPU when the verifier can batch-evaluate.