
Turn Agent Evals Into RL Training Data on GPU Cloud: HUD and Closed-Loop RFT/GRPO Pipelines (2026 Guide)


Every eval run your agent team launches scores trajectories and then throws them away. The score goes into a dashboard, the trajectories get logged, and the next training run starts fresh with the same data mix it had before.

HUD's two-yield scenario pattern fixes this. Every scenario run produces a (observation, trajectory, reward) triple that feeds directly into a GRPO or RFT policy update. The eval pipeline and the training pipeline become the same pipeline.

For agent benchmarking infrastructure that already runs SWE-bench or GAIA at scale, the incremental cost of wiring in the closed loop is small. The main engineering change is keeping the scored trajectories instead of discarding them after reporting.

The Problem: Eval Compute Goes to Waste

A standard agentic eval loop looks like this:

  1. Run 128-2048 agent episodes in parallel against a benchmark (SWE-bench, GAIA, SheetBench-50)
  2. Score each episode: did the agent complete the task?
  3. Report the score to a dashboard
  4. Discard the trajectories

Step 4 is where the waste happens. Each trajectory is a full (prompt, tool calls, intermediate states, final answer) sequence that the agent generated under a known reward signal. That is exactly the format GRPO needs for a policy update. Instead it gets discarded.

The compute cost of running 2048 parallel agent episodes is not trivial. On H200 rollout workers, generating one batch of 2048 episodes at 8 steps per episode and 1024 tokens per step (about 16.8 million generated tokens) costs roughly 4-6 GPU-hours depending on model size. At the roughly $3.31/hr H200 spot rate used in the sizing table below, running that batch once a day costs around $13-20 per day. Over a month, that is about $400-600 in eval compute that produced no training signal.
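The arithmetic behind that estimate can be reproduced in a few lines. This is a back-of-envelope sketch: the 4-6 GPU-hour range comes from the paragraph above, while the $3.31/hr spot rate is borrowed from the sizing table later in this guide and is an assumption here, not a quote.

```python
# Back-of-envelope eval cost, using the figures quoted in the text.
EPISODES = 2048
STEPS_PER_EPISODE = 8
TOKENS_PER_STEP = 1024
SPOT_USD_PER_GPU_HOUR = 3.31  # assumed H200 spot rate; check live pricing

# Total tokens the policy generates for one eval batch.
tokens_per_batch = EPISODES * STEPS_PER_EPISODE * TOKENS_PER_STEP


def daily_cost(gpu_hours: float, rate: float = SPOT_USD_PER_GPU_HOUR) -> float:
    """Cost of one eval batch per day, given the GPU-hours it consumes."""
    return gpu_hours * rate


low, high = daily_cost(4), daily_cost(6)
print(f"{tokens_per_batch:,} generated tokens per batch")
print(f"${low:.2f}-${high:.2f} per day, ${low * 30:.0f}-${high * 30:.0f} per 30 days")
```

The result lands close to the range quoted above; the exact figure moves with the spot rate and with how many GPU-hours your model size actually needs.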

How HUD's Two-Yield Scenario Fixes This

HUD is an agent task environment framework built around the idea that the eval environment and the training environment should be the same thing. The core primitive is the two-yield scenario.

The Two-Yield Protocol

A HUD scenario is a Python async generator with exactly two yield points:

python
from hud import Environment

env = Environment(name="spreadsheet-task")

@env.template()
async def spreadsheet_task():
    # First yield: send the initial observation to the agent
    trajectory = yield "Open the spreadsheet at row 5 and compute the sum of column B."

    # Agent acts between the two yields
    # trajectory is everything the agent did: tool calls, intermediate states, final answer

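    # Second yield: compute and return the reward from the trajectory.
    # evaluate_spreadsheet_result is a task-specific checker you define yourself (it is not imported above).
    task_success = evaluate_spreadsheet_result(trajectory)
    yield float(task_success)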

The first yield sends the initial observation to the agent. The agent runs its full trajectory, making tool calls, reading outputs, and producing a final answer. That trajectory is sent back into the generator as the value of the first yield expression, and the second yield returns a scalar reward computed from it.

This two-phase protocol means every scenario run automatically produces a {"prompt": initial_obs, "completion": trajectory, "reward": float} triple. That triple is directly consumable by GRPOTrainer as a scored completion. No separate data collection step, no format conversion, no alignment between eval logs and training data.

python
# CLI: run evaluation and collect trajectories
# hud eval tasks.py your-model --group 8

# Or via Python API:
import asyncio
from hud.eval import Job

session = asyncio.run(Job.start("your-model", group=8))
# session.runs: list of Run objects, each with .reward float and .trace_id
# Each run maps to: {"prompt": ..., "completion": ..., "reward": float}
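To make the two-yield handshake concrete, here is a minimal sketch that drives a two-yield async generator by hand using only the standard library. It is not HUD's implementation: `scenario`, `toy_agent`, and `run_once` are hypothetical names, and the trajectory is a plain dict chosen for illustration.

```python
import asyncio


async def scenario():
    # First yield: hand the initial observation out and wait for the trajectory.
    trajectory = yield "Compute the sum of column B."
    # Second yield: score the trajectory that was sent back in.
    yield 1.0 if trajectory["final_answer"] == 42 else 0.0


def toy_agent(observation: str) -> dict:
    # Hypothetical agent: a real one would call tools; this one returns a fixed answer.
    return {"tool_calls": ["read_range('B:B')"], "final_answer": 42}


async def run_once(agent) -> dict:
    gen = scenario()
    prompt = await gen.__anext__()        # runs the generator up to the first yield
    trajectory = agent(prompt)            # the agent acts between the two yields
    reward = await gen.asend(trajectory)  # resumes the generator, runs to the second yield
    await gen.aclose()                    # close the generator cleanly
    return {"prompt": prompt, "completion": trajectory, "reward": reward}


record = asyncio.run(run_once(toy_agent))
print(record["reward"])
```

The returned dict has the same three fields as the triple described above, which is why no separate logging or conversion step is needed.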

SheetBench-50 and Autonomy-10: Human-Baselined Rewards

HUD ships two benchmarks designed specifically for the closed-loop pattern.

SheetBench-50 covers 50 spreadsheet manipulation tasks returning a continuous reward in [0, 1] anchored to human completion percentiles. Scores closer to 1.0 mean the agent completed the task near or above the human baseline, and 0.0 means it failed. The percentile anchoring is what prevents reward hacking: a model that games the task metric would need to also beat the human baseline, which requires genuine task competence.

Autonomy-10 covers 10 categories of computer-use tasks (100+ tasks total) graded the same way. Each task involves 20-40 steps, making it a better signal for trajectory-level reasoning than step-level metrics.

Both benchmarks return continuous rewards in [0, 1], which is the format GRPO's advantage computation expects. Binary success/failure (0 or 1) works but produces noisier gradient estimates. The continuous scale from SheetBench-50 and Autonomy-10 gives GRPO more signal per trajectory.
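A small sketch of group-relative advantage shows why continuous rewards carry more signal. It assumes the common formulation (reward minus group mean, divided by group standard deviation); the population standard deviation and the `eps` term are choices made here for illustration, and training libraries differ on both.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One GRPO group of 8 completions for the same prompt.
continuous = [0.9, 0.7, 0.4, 0.2, 0.8, 0.6, 0.3, 0.1]
binary_all_fail = [0.0] * 8  # hard task where every sample fails

adv_continuous = group_advantages(continuous)
adv_binary = group_advantages(binary_all_fail)

print([round(a, 2) for a in adv_continuous])
print(adv_binary)
```

With binary rewards, a group where every sample fails (or every sample succeeds) produces all-zero advantages and contributes no gradient. Partial-credit rewards still rank the samples within such a group.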

Running 2,048 Parallel Environments

HUD's environment fleet runs on CPU. Each parallel environment is an isolated Python process executing the scenario generator. For 2048 parallel environments, you need CPU instances, not GPUs.

A practical sizing for 2048 parallel HUD environments:

  • 8-16 CPU instances with 32-64 cores each
  • 128-256 envs per instance
  • No GPU required for the environment fleet itself
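The instance count follows directly from the per-instance density. A minimal sketch, assuming the 128-256 environments-per-instance range above holds for your scenarios (heavier environments will pack less densely):

```python
import math


def cpu_instances_needed(parallel_envs: int, envs_per_instance: int) -> int:
    """Round up so that every environment has a home."""
    return math.ceil(parallel_envs / envs_per_instance)


dense = cpu_instances_needed(2048, 256)   # best case: 256 envs per instance
sparse = cpu_instances_needed(2048, 128)  # conservative: 128 envs per instance
print(dense, sparse)
```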

The environment fleet is fully spot-eligible. A preempted environment instance loses its in-flight episodes, but the rollout workers simply request new episodes from the remaining instances. The HUD scheduler handles reconnection transparently.

GRPO vs RFT: Choosing the Right Training Backend

Once you have trajectories.jsonl, you have two paths: self-hosted GRPO (via verl or OpenRLHF) or managed RFT (via OpenAI RFT or Tinker). The self-hosted GRPO training setup guide has the full comparison, including memory math and disaggregated rollout configuration; read it first.

Self-Hosted: verl and OpenRLHF

verl implements GRPO natively and accepts a custom rollout_func that lets you replace the default vLLM rollout with a HUD environment call:

yaml
# verl config for HUD-sourced trajectories (grpo_hud_config.yaml)
algorithm: grpo
model:
  path: Qwen/Qwen2.5-32B-Instruct
  dtype: bfloat16

rollout:
  n: 8  # G completions per prompt (GRPO group size)
  temperature: 0.8
  max_new_tokens: 4096
  rollout_func: hud_rollout.HUDRollout  # custom rollout that calls HUD envs

trainer:
  total_epochs: 3
  save_steps: 25
  learning_rate: 1e-6
  kl_coef: 0.01  # beta in GRPO; start low for agentic tasks

The HUDRollout class uses TrainingClient and Job.start() to collect trajectories and rewards from the live HUD environment fleet:

python
import torch  # needed for torch.tensor below

from hud import TrainingClient
from hud.eval import Job

class HUDRollout:
    def __init__(self, model_name, group_size=8):
        self.model_name = model_name
        self.group_size = group_size
        self.trainer = TrainingClient(model_name)

    async def rollout(self):
        # Run HUD scenarios and collect scored trajectories
        session = await Job.start(self.model_name, group=self.group_size)
        trajectories = session.runs
        rewards = torch.tensor([r.reward for r in trajectories], dtype=torch.float32)
        return trajectories, rewards

The verl and OpenRLHF setup guide covers the full multi-node configuration, reward model training, and PPO baseline, with detailed installation steps.

Managed: OpenAI RFT and Tinker

OpenAI RFT accepts the same trajectory format. Upload trajectories.jsonl to the fine-tuning API with method: reinforcement:

bash
openai api fine_tuning.jobs.create \
  --training_file trajectories.jsonl \
  --model o4-mini-2025-04-16 \
  --method reinforcement \
  --grader '{"type": "score_model", "input": "{{item.prompt}}", "output": "{{item.completion}}"}' \
  --hyperparameters '{"n_epochs": 3}'

Note: OpenAI RFT only runs on o-series reasoning models (currently o4-mini). A grader definition is required: either a code-based grader or a model-based grader as shown above. The managed path is simpler to operate but gives you no control over rollout scheduling, checkpoint frequency, or the GRPO group size. For production closed-loop pipelines where you want to tune the curriculum (which scenarios to sample next based on current reward distribution), self-hosted verl is the right choice.

Tinker (Thinking Machines Lab) works differently from OpenAI RFT. Instead of uploading a training file, you call Tinker's Python API inside your own training loop. The core primitives are forward_backward (compute gradients and return the loss) and sample (generate completions). You write the GRPO or REINFORCE update logic yourself; Tinker handles the infrastructure underneath (weight serving, gradient communication, checkpointing). This means you do not upload a trajectories.jsonl file to Tinker. The tradeoff versus OpenAI RFT is more implementation work in exchange for direct control over the training loop.

The Three Compute Planes

The closed-loop architecture splits cleanly into three planes with different hardware and pricing characteristics.

Plane 1: HUD Environment Fleet

CPU-only. Runs the scenario generators and verifies task completion.

  • 1 CPU instance (32-64 cores) per 128-256 parallel environments
  • For 2048 parallel envs: 8-16 CPU instances
  • Fully spot-eligible: no state to lose on preemption (in-flight episodes are re-scheduled)
  • No GPU required

The RL environment infrastructure guide covers the general case for Gymnasium, Prime Intellect Environments Hub, and verifiers. HUD is a different pattern from those: it is explicitly designed to bridge the eval and training pipelines, whereas Prime Intellect Environments Hub and standard verifiers treat the environment purely as a training-time reward source.

Plane 2: Rollout Workers

GPU inference, forward-only. Runs the policy to generate agent trajectories against the HUD environment fleet.

  • 1 H200 SXM5 per vLLM rollout server for models that fit in a single GPU (7B); larger models need multi-GPU servers
  • Spot-eligible: no optimizer state, no backward pass, stateless restarts
  • For a 32B model with 4-way tensor parallelism: 4 H200 GPUs per rollout server, 2 servers total (8 H200s)

For the Agent RFT trajectory rollout pattern, the rollout architecture is the same: stateless, spot-eligible, killed and restarted freely between batches. The difference with HUD is that the rollout worker does not generate completions against a static prompt dataset; it runs live against the HUD environment fleet, so the rollout and eval happen in the same forward pass.

Plane 3: Policy Trainer

GPU training, gradient state. Runs the GRPO update step on collected trajectories.

  • H200 SXM5 for all model sizes, including 70B and above. B200 currently has no on-demand tier; see the note on B200 below the sizing table.
  • On-demand only: holds optimizer state (AdamW first/second moments in FP32). A preempted trainer loses state and rolls back.
  • Checkpoint every 25-50 steps. At 25-step checkpoints, a preemption costs at most 25 steps of training, typically under 30 minutes.

GPU Sizing Table

Practical configurations for the three-plane closed-loop architecture with live pricing as of 02 Jul 2026:

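| Model Size | Trainer GPUs | Trainer Type | Rollout GPUs | Rollout Type | Trainer $/hr | Rollout $/hr |
|------------|--------------|--------------|--------------|--------------|--------------|--------------|
| 7B | 1x H200 SXM5 | On-demand | 2x H200 SXM5 | Spot | $4.54 | ~$3.31 each |
| 32B | 4x H200 SXM5 | On-demand | 8x H200 SXM5 | Spot | ~$18.16 | ~$3.31 each |
| 70B | 8x H200 SXM5 | On-demand | 16x H200 SXM5 | Spot | ~$36.32 | ~$3.31 each |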
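The table lists rollout cost per GPU, so the combined hourly GPU bill takes one more multiplication. The sketch below uses the table's rates as constants. It leaves out the CPU environment fleet, whose price is not given here, and it assumes every rollout GPU runs for the full hour.

```python
ON_DEMAND_H200 = 4.54  # $/GPU-hour, trainer plane (from the table)
SPOT_H200 = 3.31       # $/GPU-hour, rollout plane (approximate, from the table)

# model size -> (trainer GPUs, rollout GPUs), copied from the table
CONFIGS = {"7B": (1, 2), "32B": (4, 8), "70B": (8, 16)}


def hourly_cost(trainer_gpus: int, rollout_gpus: int) -> tuple:
    """Return (trainer $/hr, rollout $/hr, combined $/hr), rounded to cents."""
    trainer = trainer_gpus * ON_DEMAND_H200
    rollout = rollout_gpus * SPOT_H200
    return round(trainer, 2), round(rollout, 2), round(trainer + rollout, 2)


costs = {size: hourly_cost(*gpus) for size, gpus in CONFIGS.items()}
for size, (trainer, rollout, total) in costs.items():
    print(f"{size}: trainer ${trainer}/hr + rollout ${rollout}/hr = ${total}/hr")
```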

Note on B200: B200 SXM6 is available on spot at approximately $5.34/hr but has no on-demand tier currently. For the trainer plane, which must stay on-demand, use H200 SXM5 for all model sizes until B200 on-demand capacity is available. The H200 on Spheron page has current availability.

Spot eligibility: Rollout workers are fully spot-eligible. A preempted rollout worker loses its in-flight trajectory batch (typically 50-200 steps), not training progress. The trainer on-demand instances hold optimizer state and must not be preempted.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Jul 2026 and may have changed. Check current GPU pricing for live rates.

Measuring Real Gains and Avoiding Reward Hacking

The closed loop only works if the training reward tracks genuine task improvement. Reward hacking, where the model learns to exploit the reward function without actually getting better at the task, breaks the loop silently.

The detection method is straightforward: track both the training reward and a held-out eval metric that the model cannot directly optimize.

python
# Log both signals per epoch
{
    "epoch": 2,
    "grpo_reward_mean": 0.73,       # training signal (what GRPO is optimizing)
    "sheetbench50_score": 0.61,     # held-out eval (human-baselined)
    "autonomy10_score": 0.44,       # held-out eval (human-baselined)
    "kl_divergence": 0.031          # policy drift from reference
}

If grpo_reward_mean climbs but sheetbench50_score stagnates or drops, you have reward hacking. Three common causes:

Format reward over-optimization. The GRPO reward includes a format component (valid JSON for tool calls, correct schema). The model learns to produce perfectly formatted trajectories that call tools in a useless order, getting 0.1 per step from format_reward while scoring 0.0 on task_success_reward. Fix: add a hard gate that returns -1.0 total reward if task_success_reward is 0.0, regardless of format score.

Trajectory length gaming. For continuous rewards, longer trajectories can accumulate more reward signal per step. The model learns to generate verbose intermediate reasoning that bumps step-level rewards without improving final task completion. Fix: normalize reward by trajectory length, or use only the terminal reward.

Distribution shift. After a few training iterations, the policy generates trajectories that the HUD environment rewards highly but that do not generalize. The SheetBench-50 human baseline catches this because human raters evaluate the actual task result, not the trajectory structure.
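The first two fixes can be sketched as small reward-shaping helpers. These are illustrations under assumptions, not HUD or verl APIs: the function names are hypothetical, and adding task and format reward together in the non-gated case is one possible combination rule.

```python
def gated_reward(task_success_reward: float, format_reward: float) -> float:
    """Hard gate: a trajectory that fails the task gets -1.0 however well it is formatted."""
    if task_success_reward == 0.0:
        return -1.0
    # Assumed combination rule: format reward only counts once the task reward is non-zero.
    return task_success_reward + format_reward


def length_normalized_reward(step_rewards: list) -> float:
    """Average per-step reward, so a longer trajectory cannot win just by having more steps."""
    if not step_rewards:
        return 0.0
    return sum(step_rewards) / len(step_rewards)


print(gated_reward(0.0, 0.8))   # well-formatted but failed task
print(gated_reward(0.6, 0.1))   # partial success with valid format
print(length_normalized_reward([0.1] * 5), length_normalized_reward([0.1] * 20))
```

Using only the terminal reward, the other fix named above, amounts to ignoring `step_rewards` entirely and returning the final task score.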

Run hud eval --benchmark sheetbench-50 after each training epoch. If SheetBench-50 score drops two epochs in a row, stop training, inspect the failure cases, and adjust the reward function before continuing.
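The stopping rule and the divergence check are easy to automate once both signals are logged per epoch. A minimal sketch with hypothetical helper names, operating on plain lists of per-epoch scores:

```python
def should_stop(heldout_scores: list, patience: int = 2) -> bool:
    """True if the held-out score dropped in each of the last `patience` epochs."""
    if len(heldout_scores) < patience + 1:
        return False
    recent = heldout_scores[-(patience + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


def looks_like_reward_hacking(train_rewards: list, heldout_scores: list) -> bool:
    """True if training reward rose over the last epoch while the held-out score did not."""
    if len(train_rewards) < 2 or len(heldout_scores) < 2:
        return False
    return train_rewards[-1] > train_rewards[-2] and heldout_scores[-1] <= heldout_scores[-2]


sheetbench = [0.55, 0.61, 0.58, 0.54]  # illustrative per-epoch held-out scores
grpo_reward = [0.52, 0.64, 0.73, 0.79]  # illustrative per-epoch training reward

print(should_stop(sheetbench))
print(looks_like_reward_hacking(grpo_reward, sheetbench))
```

The scores here are made up for illustration. In practice a single-epoch dip can be noise, which is why the rule above waits for two consecutive drops before halting.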

Reference Architecture on One GPU Cloud

The three-plane architecture runs cleanly on a single GPU cloud, which eliminates cross-provider egress costs. Each plane communicates with the next over the same internal network:

HUD Environment Fleet (CPU instances, spot)
        |  trajectories + rewards
        v
  Rollout Workers (H200 spot, vLLM)
        |  scored completions
        v
  GRPO Trainer (H200 on-demand, verl)
        |  updated weights broadcast
        v
  Rollout Workers (weight reload every N steps)

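The weight broadcast from trainer to rollout workers is the only inter-plane communication that requires significant bandwidth. For a 32B model in bfloat16, the checkpoint is about 64 GB (32 billion parameters at 2 bytes each). A 100 Gbps internal network moves 12.5 GB/s, so one 64 GB transfer takes just over 5 seconds at line rate. Reaching all 8 rollout workers in under 6 seconds assumes they receive in parallel (for example via a pipelined or tree broadcast) rather than one after another. At a checkpoint interval of 25 training steps, that is under 4% overhead as long as those 25 steps take more than about 2.5 minutes.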
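The broadcast estimate can be checked with a short calculation. This sketch assumes line-rate throughput on the link and a single 64 GB transfer on the critical path (workers receiving in parallel); real transfers will be somewhat slower.

```python
def checkpoint_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Checkpoint size in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Time to move size_gb over a link of link_gbps, at line rate."""
    return size_gb * 8 / link_gbps


def min_interval_seconds(transfer_s: float, max_overhead: float) -> float:
    """Shortest training interval that keeps the transfer below max_overhead of it."""
    return transfer_s / max_overhead


size = checkpoint_gb(32)              # 32B parameters in bfloat16
seconds = transfer_seconds(size, 100)  # 100 Gbps internal network
interval = min_interval_seconds(6.0, 0.04)

print(f"{size:.0f} GB checkpoint, {seconds:.2f} s per transfer")
print(f"25 steps must take at least {interval:.0f} s for a 6 s broadcast to stay under 4%")
```

If the workers receive one after another instead, multiply the transfer time by the number of receivers before computing overhead.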

Running all three planes on Spheron means no egress billing between planes. Egress between cloud providers for weight broadcasts on a 32B model at 25-step checkpoints adds up to several hundred GB per day, which at commercial egress rates is a meaningful cost that disappears entirely when all planes run on the same network.


Spheron hosts all three planes of the closed-loop pipeline, from the HUD environment fleet to vLLM rollout workers to the GRPO/RFT trainer, on one platform with per-minute billing and no egress between planes.


Quick Setup Guide

  1. Set up HUD and run a baseline eval to collect trajectories

    Install the HUD SDK: pip install hud-python. Define an environment using Environment and @env.template() from hud-python. The async generator first yields the initial observation to the agent, then yields a float reward after the agent acts. Run: hud eval tasks.py your-model --group 8. Each run produces a scored trajectory with prompt, completion, and reward fields. Inspect a sample to verify the reward distribution before wiring it into training.

  2. Wire HUD's two-yield output to a GRPO reward function

    Load scored runs and define a reward function that returns the per-trajectory reward float. Pass it as reward_funcs=[hud_reward] when constructing GRPOTrainer (GRPOConfig holds the training hyperparameters, not the reward functions). For online closed-loop operation (eval feeds into training directly without a file step), implement a HUDRollout class that uses TrainingClient and Job.start() to collect trajectories and rewards from the live HUD environment fleet. This eliminates the data collection file step and creates a live pipeline where each GRPO step generates fresh trajectories against the HUD environment fleet.

  3. Choose your RL training backend (verl, OpenRLHF, OpenAI RFT, or Tinker)

    For self-hosted: install verl with pip install verl and set algorithm=grpo in the config with a custom rollout_func pointing to your HUDRollout class. For OpenRLHF: use --agent_func_path pointing to a HUD environment wrapper. For managed options: upload trajectories.jsonl to OpenAI RFT, which accepts (prompt, completion, reward) triples via its fine-tuning API. Tinker (Thinking Machines Lab) is a different managed path: it is a low-level Python RL API with forward_backward and sample primitives where you implement the GRPO loop yourself, not a file-upload target. The self-hosted path gives you full control of rollout scheduling and checkpoint frequency; the managed path trades control for zero infrastructure overhead.

  4. Size the GPU cluster for the three compute planes

    Plane 1 (HUD environment fleet): 1 CPU instance per 128-256 parallel envs, fully spot-eligible, no GPU required. Plane 2 (rollout workers): 1 H200 SXM5 per vLLM rollout server for models that fit on a single GPU (larger models need multi-GPU servers), spot-eligible since rollout workers hold no optimizer state. Plane 3 (GRPO trainer): 1-4 H200 GPUs for 7B-32B models on-demand, 4-8 H200 GPUs for 70B on-demand. The trainer must stay on-demand because a preempted trainer loses optimizer state and rolls back to the last checkpoint. Checkpoint every 25-50 steps to limit rollback exposure.

  5. Launch the closed-loop pipeline on Spheron GPU cloud

    Provision the trainer on on-demand H200 instances. Provision rollout workers on spot H200 instances. Run: vllm serve <model_path> --tensor-parallel-size 4 --gpu-memory-utilization 0.85 --max-model-len 8192 on each rollout worker. Launch the verl trainer with torchrun --nproc-per-node=8 -m verl.trainer.main_ppo --config your_config.yaml. The HUD environment fleet (CPU instances) connects to the rollout workers via the HUDRollout class. With per-minute billing, you only pay for rollout workers while they are actively generating trajectories.

  6. Measure gains with human-baselined benchmarks and avoid reward hacking

    After each training epoch, re-run SheetBench-50 and Autonomy-10 using hud eval --benchmark sheetbench-50 --model http://your-vllm-endpoint/v1. Plot reward_mean (the training signal) against actual task completion rate (the eval signal). If training reward climbs but task completion plateaus or drops, you have reward hacking. Add a format reward alongside the task reward, or switch to human-baselined percentile scoring instead of binary success/failure. Three to four training iterations against the failure cases identified by SheetBench-50 typically cover the main systematic failure clusters in a 7B-32B model.
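Step 1 recommends inspecting the reward distribution before training. A minimal sketch of that check, assuming the scored runs have been written as JSON Lines with the prompt, completion, and reward fields described earlier (the one-record-per-line layout and the `reward_summary` helper are assumptions for illustration):

```python
import json
from statistics import mean


def reward_summary(jsonl_lines) -> dict:
    """Summarize the reward field across scored trajectories, one JSON record per line."""
    rewards = [float(json.loads(line)["reward"]) for line in jsonl_lines if line.strip()]
    zeros = sum(1 for r in rewards if r == 0.0)
    return {
        "count": len(rewards),
        "mean": mean(rewards),
        "min": min(rewards),
        "max": max(rewards),
        "zero_fraction": zeros / len(rewards),
    }


# In practice: with open("trajectories.jsonl") as f: summary = reward_summary(f)
sample = [
    '{"prompt": "p1", "completion": "c1", "reward": 0.0}',
    '{"prompt": "p1", "completion": "c2", "reward": 0.5}',
    '{"prompt": "p2", "completion": "c3", "reward": 1.0}',
    '{"prompt": "p2", "completion": "c4", "reward": 0.5}',
]
summary = reward_summary(sample)
print(summary)
```

A distribution where nearly every reward is 0.0, or nearly every reward is the same value, is a warning sign: group-relative advantages will be close to zero and the first GRPO steps will learn very little.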


Frequently Asked Questions

In standard pipelines, an eval run scores agent trajectories and then discards them. Evals-as-training-data keeps those scored trajectories and treats them as (prompt, completion, reward) triples for a GRPO or RFT policy update. HUD formalizes this with its two-yield pattern, which emits the reward as the final step of every scenario run: every eval scenario automatically produces a (obs, trajectory, reward) triple that is directly consumable by GRPOTrainer as a scored completion.

A HUD scenario is a Python generator that yields the initial environment state (the agent's starting observation), then yields the final reward after the agent has acted. This two-phase protocol means every scenario run automatically produces a (obs, trajectory, reward) triple. The trajectory is everything the agent did between the two yields: tool calls, intermediate states, final answer. That triple is directly consumable by GRPOTrainer as a scored completion, without any separate data collection step.

GRPO (Group Relative Policy Optimization) samples G completions per prompt, computes group-relative advantage (above-mean gets a positive gradient, below-mean gets a negative one), and updates the policy without a critic or reward model. RFT (Reinforcement Fine-Tuning) is OpenAI's hosted variant of the same idea: you supply graded trajectory pairs (successful vs failed) and their trainer fine-tunes the policy. For self-hosted closed-loop pipelines, verl and OpenRLHF implement GRPO natively. For managed options, OpenAI RFT accepts a trajectories.jsonl file upload and handles the RL update for you. Tinker (Thinking Machines Lab) is a different kind of managed service: it exposes a low-level Python API with primitives like forward_backward and sample where you implement the GRPO loop yourself. You do not upload a trajectories file to Tinker.

The architecture has three planes. The HUD environment fleet is CPU-bound: 1 CPU instance handles 128-256 parallel environments, so 2048 parallel envs need 8-16 CPU instances. The rollout workers run the policy in forward-only mode: 1 H200 SXM5 per vLLM server at around $3.31/hr on spot. The GRPO trainer holds gradient state and must stay on-demand: 1-4 H200 GPUs for 7B-32B models, 4-8 H200 GPUs for 70B, at $4.54/hr per H200 on-demand. For a 32B model with 2048 parallel envs, a practical starting point is 2 H200 rollout servers (8 GPUs total, 4-way tensor parallel each) on spot plus 4 H200 trainer GPUs on-demand.

Reward hacking occurs when a model learns to exploit a proxy reward signal without actually improving at the real task. Human-baselined benchmarks anchor the reward signal to what real humans consider task completion, making it much harder to game. SheetBench-50 and Autonomy-10 compare agent task completion against human completion percentiles. A model that games the reward function would need to also game human evaluators, which is far harder than gaming a programmatic metric. Running these benchmarks as a held-out eval set, separate from the training reward, makes reward hacking immediately visible: training reward climbs while the human-baselined score stalls or drops.
