GRPO is the post-training method behind DeepSeek-R1, R2, and most open-source reasoning models released in 2026. The implementation barrier dropped significantly: TRL 0.14 ships a GRPOTrainer you can point at a vLLM rollout server and run end-to-end. If you need the supervised fine-tuning baseline first, read the SFT fine-tuning guide before continuing here.
GRPO is not just a LoRA job with a different loss function. It eats 2-3x the VRAM of SFT, requires a rollout generation pipeline, and fails in specific ways that are not obvious until you hit them. This guide covers the memory math, the full TRL + vLLM setup on Spheron H200 and B200 nodes, disaggregated multi-node configuration, and a real cost breakdown comparing GRPO against PPO and DPO for a 32B model.
What GRPO Is (and Why PPO Breaks on LLMs)
PPO was the standard RL algorithm for LLM post-training for years. The problem is that it needs a value model (the critic) trained alongside the policy. On LLMs, the critic must learn to assign a scalar value to arbitrary token sequences of unbounded length. Credit assignment across a 1,000-token chain-of-thought is genuinely hard. The critic training introduces a second unstable optimization that diverges frequently, especially when reward signals are sparse or noisy.
DPO sidesteps the critic entirely by using a static offline preference dataset. You train on pairs of preferred and rejected completions without any online rollout generation. That works well for style alignment but cannot teach a model to reason through problems it has never seen solved correctly. The model can only learn strategies already represented in your dataset.
GRPO drops the critic by making the advantage estimate relative within the current batch. For each prompt, you sample G completions (typically 8-16) from the current policy. Each gets a reward score. The advantage for completion i is:
A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G)

The policy update uses a clipped ratio loss identical to PPO's actor loss, plus a KL penalty term to keep the policy from drifting too far from the reference model:

L = E[min(ratio × A, clip(ratio, 1-ε, 1+ε) × A)] - β × KL[π_θ || π_ref]

No critic. No value model. Just group-relative advantage from within-batch reward comparisons. DeepSeek-R1-Zero was trained entirely with GRPO on math and code problems. DeepSeek-R1 added a short cold-start SFT phase first to stabilize early training, then applied GRPO on top.
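As a minimal sketch, the advantage computation is a few lines of NumPy (the small epsilon guarding against zero variance when every completion in a group gets the same reward is an implementation detail added here, not part of the formula):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    rewards: one scalar reward per completion in the group of G
    completions sampled for a single prompt.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group of 4 completions where only one solved the problem:
adv = group_relative_advantages([1.0, -1.0, -1.0, -1.0])
# The correct completion gets a positive advantage, the rest negative,
# and the advantages are zero-mean within the group.
```

Because advantages are normalized within each group, the scale of your raw reward values matters far less than their ranking inside the group.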
Verifiable Rewards: The Three Categories That Work
The entire GRPO framework depends on having a reward function that can score rollout completions automatically. Not every task has one. Here are the three categories where GRPO consistently works.
Math and Logic
The canonical GRPO reward setup. Your training data contains problems with ground truth answers. The model generates a chain-of-thought and a final boxed answer. You parse the answer and compare it to the ground truth using symbolic evaluation:
```python
import re
import sympy

def correctness_reward(prompts, completions, ground_truth, **kwargs):
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Extract boxed answer
        match = re.search(r'\\boxed\{([^}]+)\}', completion)
        if not match:
            rewards.append(-1.0)
            continue
        try:
            predicted = sympy.sympify(match.group(1))
            expected = sympy.sympify(gt)
            rewards.append(1.0 if sympy.simplify(predicted - expected) == 0 else -1.0)
        except Exception:
            rewards.append(-1.0)
    return rewards

def format_reward(prompts, completions, **kwargs):
    rewards = []
    for completion in completions:
        has_think = bool(re.search(r'<think>.*?</think>', completion, re.DOTALL))
        has_answer = bool(re.search(r'<answer>.*?</answer>', completion, re.DOTALL))
        rewards.append(0.5 if (has_think and has_answer) else -0.5)
    return rewards
```

The format reward is just as important as the correctness reward. It teaches the model to use the expected reasoning structure, which improves readability and makes the correctness check more reliable.
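A quick sanity check of the format reward on two synthetic completions (the function is restated here so the snippet runs standalone; the example strings are illustrative, not real model output):

```python
import re

def format_reward(prompts, completions, **kwargs):
    rewards = []
    for completion in completions:
        has_think = bool(re.search(r'<think>.*?</think>', completion, re.DOTALL))
        has_answer = bool(re.search(r'<answer>.*?</answer>', completion, re.DOTALL))
        rewards.append(0.5 if (has_think and has_answer) else -0.5)
    return rewards

good = "<think>2 + 2 = 4</think><answer>4</answer>"
bad = "The answer is 4."
scores = format_reward(prompts=None, completions=[good, bad])
# scores == [0.5, -0.5]
```

Running this kind of unit check on your reward functions before a training run is cheap insurance: a reward bug silently corrupts every gradient step.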
Code Execution
For code generation tasks, run the output against a test suite in a subprocess. Warning: the snippet below provides no real sandboxing. subprocess.run inherits the parent process environment, including any API keys or credentials in env vars, and has unrestricted file and network access. Before using this in any training loop, wrap the subprocess call in a proper isolation layer (Docker with --network none --read-only, nsjail, or bubblewrap).
````python
import os
import re
import subprocess
import sys
import tempfile

def code_reward(prompts, completions, test_cases, **kwargs):
    rewards = []
    for completion, tests in zip(completions, test_cases):
        # Extract code block
        match = re.search(r'```python\n(.*?)```', completion, re.DOTALL)
        if not match:
            rewards.append(-1.0)
            continue
        code = match.group(1)
        full_code = code + "\n\n" + tests
        fname = None
        try:
            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                fname = f.name
                f.write(full_code)
            result = subprocess.run(
                [sys.executable, fname],
                timeout=10,
                capture_output=True
            )
            rewards.append(1.0 if result.returncode == 0 else -1.0)
        except subprocess.TimeoutExpired:
            rewards.append(-1.0)
        except Exception:
            rewards.append(-1.0)
        finally:
            if fname:
                os.unlink(fname)
    return rewards
````

Partial credit (pass rate across test cases) works better than binary pass/fail for diverse test suites. Start with binary until you confirm the reward signal is stable.
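If you later move beyond binary scoring, a partial-credit variant can run each test case in its own subprocess and return the pass rate mapped to [-1, 1] (a sketch under the assumption that your test cases can be split into independently runnable snippets; the same sandboxing caveats apply):

```python
import subprocess
import sys

def partial_credit_reward(code, test_snippets, timeout=10):
    """Return the fraction of test snippets that pass, mapped to [-1, 1].

    Each snippet is appended to the candidate code and run in its own
    subprocess, so one failing test cannot mask the others. This has
    the same sandboxing gaps as the binary reward above.
    """
    if not test_snippets:
        return -1.0
    passed = 0
    for snippet in test_snippets:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code + "\n\n" + snippet],
                timeout=timeout,
                capture_output=True,
            )
            passed += result.returncode == 0
        except subprocess.TimeoutExpired:
            pass  # a hung test counts as a failure
    return 2.0 * passed / len(test_snippets) - 1.0

# A correct add() passes both tests:
score = partial_credit_reward(
    "def add(a, b):\n    return a + b",
    ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"],
)
# score == 1.0
```

Mapping the pass rate onto [-1, 1] keeps the reward scale consistent with the binary version, so you can switch between the two without retuning the KL penalty.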
Structured Output and Tool Use
Schema validation works well: check that generated JSON matches a Pydantic model (reward 1.0 for valid, -1.0 for invalid). Tool-use tasks can validate that API call parameters are syntactically correct before any actual execution.
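A minimal sketch of that schema check, using a hypothetical Invoice model and the Pydantic v2 API:

```python
import json
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    # Hypothetical target schema, purely for illustration
    customer: str
    total: float
    line_items: list

def schema_reward(prompts, completions, **kwargs):
    """+1.0 if the completion parses as JSON matching the schema, else -1.0."""
    rewards = []
    for completion in completions:
        try:
            Invoice.model_validate(json.loads(completion))
            rewards.append(1.0)
        except (json.JSONDecodeError, ValidationError):
            rewards.append(-1.0)
    return rewards

valid = '{"customer": "Acme", "total": 12.5, "line_items": ["widget"]}'
invalid = '{"customer": "Acme"}'
scores = schema_reward(prompts=None, completions=[valid, invalid])
# scores == [1.0, -1.0]
```

Because validation is a few microseconds per completion, schema rewards never bottleneck rollout throughput the way judge models do.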
What does not work: LLM-as-judge scoring. It is too slow for rollout throughput. Generating 8 completions per prompt and then calling a separate LLM to score each one roughly doubles your total compute cost per step. It also introduces systematic bias (the judge model's own tendencies) that is hard to distinguish from genuine quality improvements. Avoid it.
Human feedback does not scale in a training loop either. You need thousands of reward evaluations per training step.
GPU Memory Math: Why GRPO Needs 2-3x More VRAM
The VRAM formula for a single-node GRPO run:
```
GRPO_VRAM ≈ (policy_weights × 2)    # online policy + reference model, both in bf16
          + (policy_params × 12)    # AdamW optimizer state: fp32 master weights + first moment + second moment
          + (G × T × 2 bytes)       # rollout buffer: G generations × T max tokens × 2 bytes (bf16 logits)
          + activation_peak         # ~15-20% headroom
```

For a 32B model (64 GB in bf16), with G=8 rollouts and T=1024 tokens:
- Policy × 2: 128 GB (online policy + frozen reference)
- Optimizer state: 384 GB (12 bytes/param, i.e. 6× the bf16 weight size)
- Rollout buffer: TRL stores only token IDs and reference log-probs, not full logits. A few hundred MB per batch for G=8, T=1024.
- AdamW with LoRA only: optimizer state drops to 3× LoRA param count in fp32 (12 bytes per trainable LoRA parameter)
In practice, GRPO with full-parameter training on 32B requires H200 (141 GB) or larger. With LoRA (r=64), optimizer state shrinks to a few GB and you can fit on H200 comfortably.
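The accounting above can be wrapped in a rough estimator (a back-of-the-envelope sketch, not a measurement; the flat 20% headroom and the naive double-counting of the reference copy under LoRA are simplifying assumptions):

```python
def grpo_vram_gb(params_b, full_finetune=True, lora_params_b=0.0):
    """Naive single-node GRPO VRAM estimate in GB.

    params_b: model size in billions of parameters.
    Weights in bf16 (2 bytes/param); AdamW state is 12 bytes per
    *trainable* param (fp32 master weights + two moments). The
    rollout buffer (token IDs + log-probs) is small and ignored.
    """
    policy = params_b * 2                     # online policy, bf16
    reference = params_b * 2                  # frozen reference copy, bf16
    trainable = params_b if full_finetune else lora_params_b
    optimizer = trainable * 12                # 4 + 4 + 4 bytes per trainable param
    return (policy + reference + optimizer) * 1.20  # ~20% activation headroom

full = grpo_vram_gb(32)   # (64 + 64 + 384) * 1.2 = 614.4 GB -> multi-GPU territory
lora = grpo_vram_gb(32, full_finetune=False, lora_params_b=0.5)
# Note: with LoRA via PEFT, TRL can skip the separate reference copy
# (disabling the adapters recovers the reference policy), so real LoRA
# footprints come in well under this naive estimate.
```

The estimator makes the headline point concrete: for full-parameter training, optimizer state dominates everything else, which is exactly the term LoRA collapses.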
| Model | Method | Config | Min VRAM | Recommended GPU |
|---|---|---|---|---|
| 7B | SFT LoRA | r=16 | 16 GB | RTX 4090 |
| 7B | GRPO (single-node) | G=8, T=512 | 32-40 GB | A100 40GB or L40S |
| 32B | SFT LoRA | r=16 | 40 GB | H100 or H200 |
| 32B | GRPO (single-node) | G=8, T=1024 | 100-120 GB | H200 (141 GB) |
| 32B | GRPO (disaggregated trainer only) | G=8, T=1024 | 64-80 GB | H100 or H200 |
| 70B | SFT LoRA | r=16 | 80 GB | H100/H200 |
| 70B | GRPO (single-node) | G=8, T=1024 | 200-240 GB | B200 (192 GB) x2 or B300 |
| 70B | GRPO (disaggregated trainer only) | G=8, T=1024 | 140-160 GB | H200 or B200 |
Note: disaggregated GRPO removes the rollout buffer from the trainer node's VRAM accounting. The rollout server handles generation; the trainer only sees the returned rewards and reference log-probs. This is the practical path to running 70B GRPO on a single H200 trainer.
For deeper VRAM calculation background, see the GPU memory requirements for LLMs guide.
Step-by-Step: GRPO on Spheron (TRL + vLLM Rollout Server)
1. Provision Your GPU Instance
Log into https://app.spheron.ai and rent an H200 instance as your trainer node. For a 32B model with disaggregated rollouts, you need one on-demand H200 for the trainer and one or two spot H200 nodes as rollout servers. The on-demand/spot split matters: the trainer holds gradient state and must not be preempted mid-step.
SSH into your trainer node and verify the GPU:
```bash
nvidia-smi
# Should show H200 SXM5, 141 GB VRAM
```

2. Install Dependencies

```bash
pip install "trl>=0.14" "vllm>=0.5" transformers accelerate peft
```

Verify TRL includes GRPOTrainer (added in 0.14):

```bash
python -c "from trl import GRPOTrainer; print('TRL GRPOTrainer OK')"
```

3. Launch the vLLM Rollout Server
On your spot rollout node, start vLLM pointing at the same base model checkpoint:
```bash
# On the rollout node (spot H200 instance)
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --port 8001
```

For the vLLM production deployment guide covering tensor parallelism configuration and FP8 quantization on bare metal, see that post. The GRPO rollout server is simpler: you only need single-model serving at high throughput, not the full multi-model stack.
4. Define Reward Functions and Configure GRPOTrainer
```python
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

# Reward function signatures follow TRL 0.14+ convention:
# inputs are lists, return a list of floats
def correctness_reward(prompts, completions, ground_truth, **kwargs):
    # ... implementation from Section 2 above
    pass

def format_reward(prompts, completions, **kwargs):
    # ... implementation from Section 2 above
    pass

config = GRPOConfig(
    output_dir="./grpo-32b-output",
    # vLLM rollout server connection
    use_vllm=True,
    vllm_server_host="<rollout_node_ip>",
    vllm_server_port=8001,
    # Rollout parameters
    num_generations=8,           # minimum for stable advantage estimates
    temperature=0.8,
    max_completion_length=512,   # start conservative, increase after confirming headroom
    # Training parameters
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    num_train_epochs=1,
    # KL penalty: prevents policy from drifting too far from reference
    beta=0.04,
    # Aggressive checkpointing: rollout regeneration is expensive
    save_steps=50,
    save_total_limit=3,
)

train_dataset = load_dataset("your/math-reasoning-dataset", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",
    args=config,
    reward_funcs=[correctness_reward, format_reward],
    train_dataset=train_dataset,
)
trainer.train()
```

5. Run and Monitor
```bash
accelerate launch --config_file accelerate_config.yaml train_grpo.py
```

Key metrics to watch in TRL training logs:

- reward/mean: should climb over the first 500-1000 steps, then plateau
- reward/std: high variance means the rollout quality is inconsistent; reduce temperature
- kl: should stay below 0.1; if it climbs past that and accelerates, raise beta
- loss/policy: should decrease but not go to zero (that is reward hacking)
Multi-Node GRPO with Rollout/Trainer Disaggregation
A GRPO run has two distinct compute shapes that do not belong on the same hardware.
Rollout generation is inference-shaped: pure forward passes, memory-bandwidth-bound, high throughput, no optimizer state, no backward pass. The rollout server holds no gradient information. If it gets preempted, you restart vLLM and regenerate. Total state loss: one batch of prompts.
Policy gradient updates are training-shaped: full forward + backward, compute-bound, requires AdamW optimizer state (3x parameter count in fp32: master weights + first moment + second moment = 12 bytes/param), and must survive to the end of each gradient step. If the trainer is preempted mid-backward, you lose all progress since the last checkpoint.
Running these on separate GPU pools:
```
[Trainer Node: on-demand H200]
            |
  reward batches + reference log-probs
            |
[Rollout Node 1: spot H200] + [Rollout Node 2: spot H200]
```

The trainer sends the current policy weights to rollout nodes, which generate completions and return (completion, reward) pairs. The trainer computes advantages and runs the gradient update locally.
Configuration for multi-server rollout: vllm_server_host accepts a single hostname. To distribute load across multiple vLLM servers, put them behind a load balancer (e.g., Nginx, HAProxy) and point TRL at the load balancer address:
```python
config = GRPOConfig(
    use_vllm=True,
    vllm_server_host="<load_balancer_ip>",  # single host fronting multiple vLLM servers
    vllm_server_port=8001,
    # ... other params
)
```

Network requirement: at least 100 GbE between trainer and rollout nodes. Transferring reference log-probs and policy weights for a 32B model, 8 generations, 1024-token responses is roughly 200-300 MB per synchronization step. For multi-node networking tradeoffs, see the multi-node GPU training guide.
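To see why 100 GbE is a reasonable floor, a quick sketch of per-step transfer time for the 200-300 MB payload described above (assumes no protocol overhead or link contention):

```python
def sync_seconds(payload_mb, link_gbps):
    """Wire time for one synchronization payload: MB -> bits, divided by line rate."""
    return (payload_mb * 8e6) / (link_gbps * 1e9)

t_100gbe = sync_seconds(300, 100)  # 0.024 s -- negligible next to step time
t_10gbe = sync_seconds(300, 10)    # 0.24 s -- starts to stall the trainer
```

At 100 GbE the synchronization is effectively free; on a 10 GbE link it becomes a visible fraction of every step, and the trainer GPU idles while it waits.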
Disaggregated cost table:
| Model size | Trainer | Rollout | Est. total cost/hr |
|---|---|---|---|
| 7B | 1x A100 on-demand ($1.04/hr) | 1x A100 spot ($0.45/hr) | ~$1.49/hr |
| 32B | 1x H200 on-demand ($3.96/hr) | 2x H200 spot ($2.38/hr) | ~$6.34/hr |
| 70B | 2x B200 on-demand ($11.08/hr) | 4x B200 spot ($6.84/hr) | ~$17.92/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 22 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Cost Breakdown: GRPO vs PPO vs DPO for a 32B Model
A 24-hour fine-tuning run of a 32B reasoning model, compared across three post-training methods. GPU configuration reflects the minimum practical setup for each method.
| Method | Value Model? | GPU Configuration | 24hr Cost (on-demand) | 24hr Cost (spot where eligible) |
|---|---|---|---|---|
| DPO | No | 1x H200 | $95.04 | $28.56 (fully spot-eligible) |
| PPO | Yes (32B critic) | 2x H200 | $190.08 | $190.08 (both need gradient state) |
| GRPO (single node) | No | 1x H200 | $95.04 | N/A (trainer: no spot) |
| GRPO (disaggregated) | No | 1x H200 on-demand + 2x H200 spot | $285.12 (all on-demand) | $152.16 (spot rollout nodes) |
Pricing fluctuates based on GPU availability. The prices above are based on 22 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
A few observations from this table:
PPO is the most expensive option in 2026. The 32B critic doubles VRAM requirements, forcing a second H200 or an upgrade to B200. Both the policy and critic must maintain gradient state, so neither is spot-eligible. For a 24-hour run, PPO costs roughly twice as much as single-node GRPO.
DPO is cheapest per GPU-hour and fully spot-eligible since it does no online generation. The tradeoff: DPO cannot teach reasoning chains your preference dataset does not already contain.
GRPO disaggregated lands between them. The trainer stays on on-demand, but rollout nodes run on spot. For a 24-hour run, disaggregated GRPO at $152.16 is cheaper than PPO ($190.08) and trains a reasoning model DPO cannot match.
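The totals in the table follow directly from the hourly rates (illustrative arithmetic using the 22 Apr 2026 prices quoted above; the $1.19/hr spot rate is implied by the DPO row's $28.56 over 24 hours):

```python
HOURS = 24
h200_on_demand = 3.96  # $/hr, trainer node
h200_spot = 1.19       # $/hr per rollout node (implied: 28.56 / 24)

all_on_demand = HOURS * 3 * h200_on_demand                # 1 trainer + 2 rollouts, $285.12
spot_rollouts = HOURS * (h200_on_demand + 2 * h200_spot)  # spot rollout nodes, $152.16
```

The spread between the two numbers is the whole argument for disaggregation: moving only the stateless rollout nodes to spot pricing cuts the run cost by nearly half.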
For a detailed comparison of billing models, see the spot vs on-demand billing guide. For broader GPU cost strategies, the GPU cost optimization playbook covers amortization patterns across training and inference workloads.
Common Pitfalls and How to Fix Them
Reward Hacking
Symptom: reward/mean climbs steadily but accuracy on a held-out eval set stays flat or drops.
Diagnosis: the model found a shortcut that satisfies the reward function's pattern matching without actually solving the problem. Common examples: producing a boxed answer with the right format but wrong value, or generating test-passing code via hardcoded output rather than actual logic.
Fix: add a second orthogonal reward signal. If your primary reward checks correctness, add a format reward that evaluates chain-of-thought length and structure. Periodically evaluate on a held-out benchmark that the reward function never sees during training. If hacking persists, rotate reward function variants or add adversarial test cases to the evaluation suite.
KL Collapse
Symptom: kl metric grows past 0.1 and keeps accelerating. loss/policy may go NaN.
Diagnosis: the online policy is drifting too far from the reference model. The KL penalty weight (beta) is too low relative to the reward signal magnitude.
Fix: raise beta from 0.04 to 0.08; a stronger KL penalty slows policy drift. Policy ratio clipping is already enabled via the default epsilon=0.2 in GRPOConfig. To tighten clipping during recovery from KL growth, reduce epsilon from 0.2 to 0.1. If training has already diverged, restore from the last checkpoint taken before KL exceeded 0.1. Do not try to recover from a diverged state by adjusting hyperparameters mid-run.
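Those recovery settings as a config fragment (a sketch using the values discussed here; `epsilon` is the clip-range field the text references, so confirm the exact field name against your installed TRL version):

```python
from trl import GRPOConfig

recovery_config = GRPOConfig(
    output_dir="./grpo-32b-output",
    beta=0.08,    # doubled KL penalty to slow policy drift
    epsilon=0.1,  # tighter ratio clipping during recovery (default 0.2)
)
```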
Rollout Throughput Bottlenecks
Symptom: trainer GPU sits at low utilization while waiting for rollouts. Step time is dominated by generation latency, not gradient computation.
Diagnosis: the rollout server is saturated, the vLLM request queue is backing up, or inter-node bandwidth is insufficient for the volume of logit tensors being transferred.
Fix: add a second spot rollout node with its own vLLM server and front both with a load balancer; set vllm_server_host to the load balancer address. Reduce max_completion_length by 50% as a temporary measure. Increase vLLM --tensor-parallel-size if you have spare GPUs on the rollout node. Reduce num_generations from 16 to 8 if step time is still unacceptable.
VRAM OOM on Rollout Buffer
Symptom: CUDA OOM on the trainer node during or immediately after the generate() call, in single-node GRPO mode.
Diagnosis: the rollout buffer (G generations × T max tokens × model hidden dim × dtype bytes) exceeds available VRAM on the trainer node.
Fix: reduce num_generations from 16 to 8 first. If still OOM, reduce max_completion_length by 50%. If you are already on disaggregated setup, verify the rollout buffer is materialized on the rollout node, not the trainer, and that only reward scalars and reference log-probs are being transferred back to the trainer.
Checkpointing and Fault Tolerance on Preemptible GPUs
GRPO needs more aggressive checkpointing than SFT for a specific reason: each lost training step requires regenerating G rollouts per prompt in the batch. For a batch of 64 prompts with G=8 generations at 1024 tokens, that is 524,288 tokens of regeneration per step. At vLLM throughput of 30K tokens/second on an H200, recovering one lost step takes 17 seconds. Losing 100 steps to a preemption costs nearly 30 minutes of compute.
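The regeneration arithmetic above as a reusable sketch (same numbers: 64 prompts, G=8, 1024-token completions, 30K tokens/s rollout throughput):

```python
def regen_cost_seconds(batch_prompts, num_generations, max_tokens, tokens_per_sec):
    """Seconds of rollout regeneration needed to recover one lost training step."""
    tokens = batch_prompts * num_generations * max_tokens
    return tokens / tokens_per_sec

one_step = regen_cost_seconds(64, 8, 1024, 30_000)  # 524,288 tokens -> ~17.5 s
lost_100_steps_min = 100 * one_step / 60            # ~29 minutes of regeneration
```

Plugging in your own batch size and throughput tells you how tight your save_steps interval needs to be before a preemption becomes expensive.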
Checkpointing strategy:
- Set save_steps=50 for a full trainer checkpoint (online policy weights, optimizer state, step counter).
- Save the reference model once at training start to a persistent volume. It is frozen and never changes.
- On preemption, the rollout server restarts from the same model weights (it is stateless between steps). Only trainer state needs recovery.
- Store checkpoints on persistent network-attached storage, not the ephemeral instance disk.
```python
import json
import os

from transformers import TrainerCallback

class GRPOResumeCallback(TrainerCallback):
    def __init__(self, checkpoint_dir, save_every=25):
        self.checkpoint_dir = checkpoint_dir
        self.save_every = save_every
        self.kl_history = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and 'kl' in logs:
            self.kl_history.append(logs['kl'])
            # Early warning if KL is climbing
            if len(self.kl_history) >= 10:
                recent_kl = self.kl_history[-10:]
                if recent_kl[-1] > 0.1 and recent_kl[-1] > recent_kl[0] * 1.5:
                    print(f"WARNING: KL divergence climbing ({recent_kl[-1]:.4f}). Consider raising beta.")

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.save_every == 0:
            resume_path = os.path.join(
                self.checkpoint_dir,
                f"grpo_resume_step_{state.global_step}.json"
            )
            resume_state = {
                'step': state.global_step,
                'kl_history': self.kl_history[-100:],  # last 100 KL values
                'best_metric': state.best_metric,
            }
            with open(resume_path, 'w') as f:
                json.dump(resume_state, f)
```

This pattern extends the checkpoint approach from the spot GPU training case study with GRPO-specific additions: KL history tracking and the lightweight per-25-step resume state.
Spot provisioning note: run rollout workers as spot instances. They hold no training state. If preempted, restart vLLM and continue from the last trainer checkpoint with no rollout state loss. Run the trainer on on-demand. The per-minute billing on Spheron means a short preemption on a spot rollout node costs seconds of wasted time, not minutes.
GRPO training runs split cleanly into two workloads: a stateful trainer that needs stable on-demand compute, and stateless rollout workers that are ideal for spot pricing. Spheron lets you provision both separately - rent H200 instances for your trainer, add spot H200 rollout workers at a fraction of the on-demand rate, and pay per minute with no minimum commitment.
