GRPO is the post-training method behind DeepSeek-R1, R2, and most open-source reasoning models released in 2026. The implementation barrier dropped significantly: TRL 0.14 ships a GRPOTrainer you can point at a vLLM rollout server and run end-to-end. If you need the supervised fine-tuning baseline first, read the SFT fine-tuning guide before continuing here.
GRPO is not just a LoRA job with a different loss function. It eats 2-3x the VRAM of SFT, requires a rollout generation pipeline, and fails in specific ways that are not obvious until you hit them. This guide covers the memory math, the full TRL + vLLM setup on Spheron H200 and B200 nodes, disaggregated multi-node configuration, and a real cost breakdown comparing GRPO against PPO and DPO for a 32B model.
What GRPO Is (and Why PPO Breaks on LLMs)
PPO was the standard RL algorithm for LLM post-training for years. The problem is that it needs a value model (the critic) trained alongside the policy. On LLMs, the critic must learn to assign a scalar value to arbitrary token sequences of unbounded length. Credit assignment across a 1,000-token chain-of-thought is genuinely hard. The critic training introduces a second unstable optimization that diverges frequently, especially when reward signals are sparse or noisy.
DPO sidesteps the critic entirely by using a static offline preference dataset. You train on pairs of preferred and rejected completions without any online rollout generation. That works well for style alignment but cannot teach a model to reason through problems it has never seen solved correctly. The model can only learn strategies already represented in your dataset.
GRPO drops the critic by making the advantage estimate relative within the current batch. For each prompt, you sample G completions (typically 8-16) from the current policy. Each gets a reward score. The advantage for completion i is:
A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G)

The policy update uses a clipped ratio loss identical to PPO's actor loss, plus a KL penalty term to keep the policy from drifting too far from the reference model:

L = E[min(ratio × A, clip(ratio, 1-ε, 1+ε) × A)] - β × KL[π_θ || π_ref]

No critic. No value model. Just group-relative advantage from within-batch reward comparisons. DeepSeek-R1-Zero was trained entirely with GRPO on math and code problems. DeepSeek-R1 added a short cold-start SFT phase first to stabilize early training, then applied GRPO on top.
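As a minimal sketch, the advantage computation is a few lines of NumPy (the small epsilon guarding against zero variance when every completion in a group gets the same reward is an implementation detail added here, not part of the formula):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    rewards: one scalar reward per completion in the group of G
    completions sampled for a single prompt.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group of 4 completions where only one solved the problem:
adv = group_relative_advantages([1.0, -1.0, -1.0, -1.0])
# The correct completion gets a positive advantage, the rest negative,
# and the advantages are zero-mean within the group.
```

Because advantages are normalized within each group, the scale of your raw reward values matters far less than their ranking inside the group.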
Verifiable Rewards: The Three Categories That Work
The entire GRPO framework depends on having a reward function that can score rollout completions automatically. Not every task has one. Here are the three categories where GRPO consistently works.
Math and Logic
The canonical GRPO reward setup. Your training data contains problems with ground truth answers. The model generates a chain-of-thought and a final boxed answer. You parse the answer and compare it to the ground truth using symbolic evaluation:
```python
import re
import sympy

def correctness_reward(prompts, completions, ground_truth, **kwargs):
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Extract boxed answer
        match = re.search(r'\\boxed\{([^}]+)\}', completion)
        if not match:
            rewards.append(-1.0)
            continue
        try:
            predicted = sympy.sympify(match.group(1))
            expected = sympy.sympify(gt)
            rewards.append(1.0 if sympy.simplify(predicted - expected) == 0 else -1.0)
        except Exception:
            rewards.append(-1.0)
    return rewards

def format_reward(prompts, completions, **kwargs):
    rewards = []
    for completion in completions:
        has_think = bool(re.search(r'<think>.*?</think>', completion, re.DOTALL))
        has_answer = bool(re.search(r'<answer>.*?</answer>', completion, re.DOTALL))
        rewards.append(0.5 if (has_think and has_answer) else -0.5)
    return rewards
```

The format reward is just as important as the correctness reward. It teaches the model to use the expected reasoning structure, which improves readability and makes the correctness check more reliable.
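A quick sanity check of the format reward on two synthetic completions (the function is restated here so the snippet runs standalone; the example strings are illustrative, not real model output):

```python
import re

def format_reward(prompts, completions, **kwargs):
    rewards = []
    for completion in completions:
        has_think = bool(re.search(r'<think>.*?</think>', completion, re.DOTALL))
        has_answer = bool(re.search(r'<answer>.*?</answer>', completion, re.DOTALL))
        rewards.append(0.5 if (has_think and has_answer) else -0.5)
    return rewards

good = "<think>2 + 2 = 4</think><answer>4</answer>"
bad = "The answer is 4."
scores = format_reward(prompts=None, completions=[good, bad])
# scores == [0.5, -0.5]
```

Running this kind of unit check on your reward functions before a training run is cheap insurance: a reward bug silently corrupts every gradient step.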
Code Execution
For code generation tasks, run the output against a test suite in a subprocess. Warning: the snippet below provides no real sandboxing. subprocess.run inherits the parent process environment, including any API keys or credentials in env vars, and has unrestricted file and network access. Before using this in any training loop, wrap the subprocess call in a proper isolation layer (Docker with --network none --read-only, nsjail, or bubblewrap).
````python
import os
import re
import subprocess
import sys
import tempfile

def code_reward(prompts, completions, test_cases, **kwargs):
    rewards = []
    for completion, tests in zip(completions, test_cases):
        # Extract code block
        match = re.search(r'```python\n(.*?)```', completion, re.DOTALL)
        if not match:
            rewards.append(-1.0)
            continue
        code = match.group(1)
        full_code = code + "\n\n" + tests
        fname = None
        try:
            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                fname = f.name
                f.write(full_code)
            result = subprocess.run(
                [sys.executable, fname],
                timeout=10,
                capture_output=True
            )
            rewards.append(1.0 if result.returncode == 0 else -1.0)
        except subprocess.TimeoutExpired:
            rewards.append(-1.0)
        except Exception:
            rewards.append(-1.0)
        finally:
            if fname:
                os.unlink(fname)
    return rewards
````

Partial credit (pass rate across test cases) works better than binary pass/fail for diverse test suites. Start with binary until you confirm the reward signal is stable.
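If you later move beyond binary scoring, a partial-credit variant can run each test case in its own subprocess and return the pass rate mapped to [-1, 1] (a sketch under the assumption that your test cases can be split into independently runnable snippets; the same sandboxing caveats apply):

```python
import subprocess
import sys

def partial_credit_reward(code, test_snippets, timeout=10):
    """Return the fraction of test snippets that pass, mapped to [-1, 1].

    Each snippet is appended to the candidate code and run in its own
    subprocess, so one failing test cannot mask the others. This has
    the same sandboxing gaps as the binary reward above.
    """
    if not test_snippets:
        return -1.0
    passed = 0
    for snippet in test_snippets:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code + "\n\n" + snippet],
                timeout=timeout,
                capture_output=True,
            )
            passed += result.returncode == 0
        except subprocess.TimeoutExpired:
            pass  # a hung test counts as a failure
    return 2.0 * passed / len(test_snippets) - 1.0

# A correct add() passes both tests:
score = partial_credit_reward(
    "def add(a, b):\n    return a + b",
    ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"],
)
# score == 1.0
```

Mapping the pass rate onto [-1, 1] keeps the reward scale consistent with the binary version, so you can switch between the two without retuning the KL penalty.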
Structured Output and Tool Use
Schema validation works well: check that generated JSON matches a Pydantic model (reward 1.0 for valid, -1.0 for invalid). Tool-use tasks can validate that API call parameters are syntactically correct before any actual execution.
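A minimal sketch of that schema check, using a hypothetical Invoice model and the Pydantic v2 API:

```python
import json
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    # Hypothetical target schema, purely for illustration
    customer: str
    total: float
    line_items: list

def schema_reward(prompts, completions, **kwargs):
    """+1.0 if the completion parses as JSON matching the schema, else -1.0."""
    rewards = []
    for completion in completions:
        try:
            Invoice.model_validate(json.loads(completion))
            rewards.append(1.0)
        except (json.JSONDecodeError, ValidationError):
            rewards.append(-1.0)
    return rewards

valid = '{"customer": "Acme", "total": 12.5, "line_items": ["widget"]}'
invalid = '{"customer": "Acme"}'
scores = schema_reward(prompts=None, completions=[valid, invalid])
# scores == [1.0, -1.0]
```

Because validation is a few microseconds per completion, schema rewards never bottleneck rollout throughput the way judge models do.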
What does not work: LLM-as-judge scoring. It is too slow for rollout throughput. Generating 8 completions per prompt and then calling a separate LLM to score each one roughly doubles your total compute cost per step. It also introduces systematic bias (the judge model's own tendencies) that is hard to distinguish from genuine quality improvements. Avoid it.
Human feedback does not scale in a training loop either. You need thousands of reward evaluations per training step.
GPU Memory Math: Why GRPO Needs 2-3x More VRAM
The VRAM formula for a single-node GRPO run:
```
GRPO_VRAM ≈ (policy_weights × 2)    # online policy + reference model, both in bf16
          + (policy_params × 12)    # AdamW optimizer state: fp32 master weights + first moment + second moment
          + (G × T × 2 bytes)       # rollout buffer: G generations × T max tokens × 2 bytes (bf16 logits)
          + activation_peak         # ~15-20% headroom
```

For a 32B model (64 GB in bf16), with G=8 rollouts and T=1024 tokens:
- Policy × 2: 128 GB (online policy + frozen reference)
- Optimizer state: 384 GB (12 bytes/param, i.e. 6× the bf16 weight size)
- Rollout buffer: TRL stores only token IDs and reference log-probs, not full logits. A few hundred MB per batch for G=8, T=1024.
- AdamW with LoRA only: optimizer state drops to 3× LoRA param count in fp32 (12 bytes per trainable LoRA parameter)
In practice, GRPO with full-parameter training on 32B requires H200 (141 GB) or larger. With LoRA (r=64), optimizer state shrinks to a few GB and you can fit on H200 comfortably.
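The accounting above can be wrapped in a rough estimator (a back-of-the-envelope sketch, not a measurement; the flat 20% headroom and the naive double-counting of the reference copy under LoRA are simplifying assumptions):

```python
def grpo_vram_gb(params_b, full_finetune=True, lora_params_b=0.0):
    """Naive single-node GRPO VRAM estimate in GB.

    params_b: model size in billions of parameters.
    Weights in bf16 (2 bytes/param); AdamW state is 12 bytes per
    *trainable* param (fp32 master weights + two moments). The
    rollout buffer (token IDs + log-probs) is small and ignored.
    """
    policy = params_b * 2                     # online policy, bf16
    reference = params_b * 2                  # frozen reference copy, bf16
    trainable = params_b if full_finetune else lora_params_b
    optimizer = trainable * 12                # 4 + 4 + 4 bytes per trainable param
    return (policy + reference + optimizer) * 1.20  # ~20% activation headroom

full = grpo_vram_gb(32)   # (64 + 64 + 384) * 1.2 = 614.4 GB -> multi-GPU territory
lora = grpo_vram_gb(32, full_finetune=False, lora_params_b=0.5)
# Note: with LoRA via PEFT, TRL can skip the separate reference copy
# (disabling the adapters recovers the reference policy), so real LoRA
# footprints come in well under this naive estimate.
```

The estimator makes the headline point concrete: for full-parameter training, optimizer state dominates everything else, which is exactly the term LoRA collapses.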
| Model | Method | Config | Min VRAM | Recommended GPU |
|---|---|---|---|---|
| 7B | SFT LoRA | r=16 | 16 GB | RTX 4090 |
| 7B | GRPO (single-node) | G=8, T=512 | 32-40 GB | A100 40GB or L40S |
| 32B | SFT LoRA | r=16 | 40 GB | H100 or H200 |
| 32B | GRPO (single-node) | G=8, T=1024 | 100-120 GB | H200 (141 GB) |
| 32B | GRPO (disaggregated trainer only) | G=8, T=1024 | 64-80 GB | H100 or H200 |
| 70B | SFT LoRA | r=16 | 80 GB | H100/H200 |
| 70B | GRPO (single-node) | G=8, T=1024 | 200-240 GB | B200 (192 GB) x2 or B300 |
| 70B | GRPO (disaggregated trainer only) | G=8, T=1024 | 140-160 GB | H200 or B200 |
Note: disaggregated GRPO removes the rollout buffer from the trainer node's VRAM accounting. The rollout server handles generation; the trainer only sees the returned rewards and reference log-probs. This is the practical path to running 70B GRPO on a single H200 trainer.
For deeper VRAM calculation background, see the GPU memory requirements for LLMs guide.
Step-by-Step: GRPO on Spheron (TRL + vLLM Rollout Server)
1. Provision Your GPU Instance
Log into https://app.spheron.ai and rent an H200 instance as your trainer node. For a 32B model with disaggregated rollouts, you need one on-demand H200 for the trainer and one or two spot H200 nodes as rollout servers. The on-demand/spot split matters: the trainer holds gradient state and must not be preempted mid-step.
SSH into your trainer node and verify the GPU:
```bash
nvidia-smi
# Should show H200 SXM5, 141 GB VRAM
```

2. Install Dependencies

```bash
pip install "trl>=0.14" "vllm>=0.5" transformers accelerate peft
```

Verify TRL includes GRPOTrainer (added in 0.14):

```bash
python -c "from trl import GRPOTrainer; print('TRL GRPOTrainer OK')"
```

3. Launch the vLLM Rollout Server
On your spot rollout node, start vLLM pointing at the same base model checkpoint:
```bash
# On the rollout node (spot H200 instance)
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --port 8001
```

For the vLLM production deployment guide covering tensor parallelism configuration and FP8 quantization on bare metal, see that post. The GRPO rollout server is simpler: you only need single-model serving at high throughput, not the full multi-model stack.
4. Define Reward Functions and Configure GRPOTrainer
```python
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

# Reward function signatures follow TRL 0.14+ convention:
# inputs are lists, return a list of floats
def correctness_reward(prompts, completions, ground_truth, **kwargs):
    # ... implementation from Section 2 above
    pass

def format_reward(prompts, completions, **kwargs):
    # ... implementation from Section 2 above
    pass

config = GRPOConfig(
    output_dir="./grpo-32b-output",
    # vLLM rollout server connection
    use_vllm=True,
    vllm_server_host="<rollout_node_ip>",
    vllm_server_port=8001,
    # Rollout parameters
    num_generations=8,           # minimum for stable advantage estimates
    temperature=0.8,
    max_completion_length=512,   # start conservative, increase after confirming headroom
    # Training parameters
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    num_train_epochs=1,
    # KL penalty: prevents policy from drifting too far from reference
    beta=0.04,
    # Aggressive checkpointing: rollout regeneration is expensive
    save_steps=50,
    save_total_limit=3,
)

train_dataset = load_dataset("your/math-reasoning-dataset", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",
    args=config,
    reward_funcs=[correctness_reward, format_reward],
    train_dataset=train_dataset,
)
trainer.train()
```

5. Run and Monitor
```bash
accelerate launch --config_file accelerate_config.yaml train_grpo.py
```

Key metrics to watch in TRL training logs:

- reward/mean: should climb over the first 500-1000 steps, then plateau
- reward/std: high variance means the rollout quality is inconsistent; reduce temperature
- kl: should stay below 0.1; if it climbs past that and accelerates, raise beta
- loss/policy: should decrease but not go to zero (that is reward hacking)
Multi-Node GRPO with Rollout/Trainer Disaggregation
A GRPO run has two distinct compute shapes that do not belong on the same hardware.
Rollout generation is inference-shaped: pure forward passes, memory-bandwidth-bound, high throughput, no optimizer state, no backward pass. The rollout server holds no gradient information. If it gets preempted, you restart vLLM and regenerate. Total state loss: one batch of prompts.
Policy gradient updates are training-shaped: full forward + backward, compute-bound, requires AdamW optimizer state (3x parameter count in fp32: master weights + first moment + second moment = 12 bytes/param), and must survive to the end of each gradient step. If the trainer is preempted mid-backward, you lose all progress since the last checkpoint.
Running these on separate GPU pools:
```
[Trainer Node: on-demand H200]
            |
  reward batches + reference log-probs
            |
[Rollout Node 1: spot H200] + [Rollout Node 2: spot H200]
```

The trainer sends the current policy weights to rollout nodes, which generate completions and return (completion, reward) pairs. The trainer computes advantages and runs the gradient update locally.
Configuration for multi-server rollout: vllm_server_host accepts a single hostname. To distribute load across multiple vLLM servers, put them behind a load balancer (e.g., Nginx, HAProxy) and point TRL at the load balancer address:
```python
config = GRPOConfig(
    use_vllm=True,
    vllm_server_host="<load_balancer_ip>",  # single host fronting multiple vLLM servers
    vllm_server_port=8001,
    # ... other params
)
```

Network requirement: at least 100 GbE between trainer and rollout nodes. Transferring reference log-probs and policy weights for a 32B model, 8 generations, 1024-token responses is roughly 200-300 MB per synchronization step. For multi-node networking tradeoffs, see the multi-node GPU training guide.
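To see why 100 GbE is a reasonable floor, a quick sketch of per-step transfer time for the 200-300 MB payload described above (assumes no protocol overhead or link contention):

```python
def sync_seconds(payload_mb, link_gbps):
    """Wire time for one synchronization payload: MB -> bits, divided by line rate."""
    return (payload_mb * 8e6) / (link_gbps * 1e9)

t_100gbe = sync_seconds(300, 100)  # 0.024 s -- negligible next to step time
t_10gbe = sync_seconds(300, 10)    # 0.24 s -- starts to stall the trainer
```

At 100 GbE the synchronization is effectively free; on a 10 GbE link it becomes a visible fraction of every step, and the trainer GPU idles while it waits.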
Disaggregated cost table:
| Model size | Trainer | Rollout | Est. total cost/hr |
|---|---|---|---|
| 7B | 1x A100 on-demand ($1.04/hr) | 1x A100 spot ($0.45/hr) | ~$1.49/hr |
| 32B | 1x H200 on-demand ($3.96/hr) | 2x H200 spot ($2.38/hr) | ~$6.34/hr |
| 70B | 2x B200 on-demand ($11.08/hr) | 4x B200 spot ($6.84/hr) | ~$17.92/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 22 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Cost Breakdown: GRPO vs PPO vs DPO for a 32B Model
A 24-hour fine-tuning run of a 32B reasoning model, compared across three post-training methods. GPU configuration reflects the minimum practical setup for each method.
| Method | Value Model? | GPU Configuration | 24hr Cost (on-demand) | 24hr Cost (spot where eligible) |
|---|---|---|---|---|
| DPO | No | 1x H200 | $95.04 | $28.56 (fully spot-eligible) |
| PPO | Yes (32B critic) | 2x H200 | $190.08 | $190.08 (both need gradient state) |
| GRPO (single node) | No | 1x H200 | $95.04 | N/A (trainer: no spot) |
| GRPO (disaggregated) | No | 1x H200 on-demand + 2x H200 spot | $285.12 (all on-demand) | $152.16 (spot rollout nodes) |
Pricing fluctuates based on GPU availability. The prices above are based on 22 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
A few observations from this table:
PPO is the most expensive option in 2026. The 32B critic doubles VRAM requirements, forcing a second H200 or an upgrade to B200. Both the policy and critic must maintain gradient state, so neither is spot-eligible. For a 24-hour run, PPO costs roughly twice as much as single-node GRPO.
DPO is cheapest per GPU-hour and fully spot-eligible since it does no online generation. The tradeoff: DPO cannot teach reasoning chains your preference dataset does not already contain.
GRPO disaggregated lands between them. The trainer stays on on-demand, but rollout nodes run on spot. For a 24-hour run, disaggregated GRPO at $152.16 is cheaper than PPO ($190.08) and trains a reasoning model DPO cannot match.
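The totals in the table follow directly from the hourly rates (illustrative arithmetic using the 22 Apr 2026 prices quoted above; the $1.19/hr spot rate is implied by the DPO row's $28.56 over 24 hours):

```python
HOURS = 24
h200_on_demand = 3.96  # $/hr, trainer node
h200_spot = 1.19       # $/hr per rollout node (implied: 28.56 / 24)

all_on_demand = HOURS * 3 * h200_on_demand                # 1 trainer + 2 rollouts, $285.12
spot_rollouts = HOURS * (h200_on_demand + 2 * h200_spot)  # spot rollout nodes, $152.16
```

The spread between the two numbers is the whole argument for disaggregation: moving only the stateless rollout nodes to spot pricing cuts the run cost by nearly half.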
For a detailed comparison of billing models, see the spot vs on-demand billing guide. For broader GPU cost strategies, the GPU cost optimization playbook covers amortization patterns across training and inference workloads.
Common Pitfalls and How to Fix Them
Reward Hacking
Symptom: reward/mean climbs steadily but accuracy on a held-out eval set stays flat or drops.
Diagnosis: the model found a shortcut that satisfies the reward function's pattern matching without actually solving the problem. Common examples: producing a boxed answer with the right format but wrong value, or generating test-passing code via hardcoded output rather than actual logic.
Fix: add a second orthogonal reward signal. If your primary reward checks correctness, add a format reward that evaluates chain-of-thought length and structure. Periodically evaluate on a held-out benchmark that the reward function never sees during training. If hacking persists, rotate reward function variants or add adversarial test cases to the evaluation suite.
KL Collapse
Symptom: kl metric grows past 0.1 and keeps accelerating. loss/policy may go NaN.
Diagnosis: the online policy is drifting too far from the reference model. The KL penalty weight (beta) is too low relative to the reward signal magnitude.
Fix: raise beta from 0.04 to 0.08; a stronger KL penalty slows policy drift. Policy ratio clipping is already enabled via the default epsilon=0.2 in GRPOConfig. To tighten clipping during recovery from KL growth, reduce epsilon from 0.2 to 0.1. If training has already diverged, restore from the last checkpoint taken before KL exceeded 0.1. Do not try to recover from a diverged state by adjusting hyperparameters mid-run.
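Those recovery settings as a config fragment (a sketch using the values discussed here; `epsilon` is the clip-range field the text references, so confirm the exact field name against your installed TRL version):

```python
from trl import GRPOConfig

recovery_config = GRPOConfig(
    output_dir="./grpo-32b-output",
    beta=0.08,    # doubled KL penalty to slow policy drift
    epsilon=0.1,  # tighter ratio clipping during recovery (default 0.2)
)
```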
Rollout Throughput Bottlenecks
Symptom: trainer GPU sits at low utilization while waiting for rollouts. Step time is dominated by generation latency, not gradient computation.
Diagnosis: the rollout server is saturated, the vLLM request queue is backing up, or inter-node bandwidth is insufficient for the volume of logit tensors being transferred.
Fix: add a second spot rollout node with its own vLLM server and front both with a load balancer; set vllm_server_host to the load balancer address. Reduce max_completion_length by 50% as a temporary measure. Increase vLLM --tensor-parallel-size if you have spare GPUs on the rollout node. Reduce num_generations from 16 to 8 if step time is still unacceptable.
VRAM OOM on Rollout Buffer
Symptom: CUDA OOM on the trainer node during or immediately after the generate() call, in single-node GRPO mode.
Diagnosis: the rollout buffer (G generations × T max tokens × model hidden dim × dtype bytes) exceeds available VRAM on the trainer node.
Fix: reduce num_generations from 16 to 8 first. If still OOM, reduce max_completion_length by 50%. If you are already on disaggregated setup, verify the rollout buffer is materialized on the rollout node, not the trainer, and that only reward scalars and reference log-probs are being transferred back to the trainer.
Checkpointing and Fault Tolerance on Preemptible GPUs
GRPO needs more aggressive checkpointing than SFT for a specific reason: each lost training step requires regenerating G rollouts per prompt in the batch. For a batch of 64 prompts with G=8 generations at 1024 tokens, that is 524,288 tokens of regeneration per step. At vLLM throughput of 30K tokens/second on an H200, recovering one lost step takes 17 seconds. Losing 100 steps to a preemption costs nearly 30 minutes of compute.
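The regeneration arithmetic above as a reusable sketch (same numbers: 64 prompts, G=8, 1024-token completions, 30K tokens/s rollout throughput):

```python
def regen_cost_seconds(batch_prompts, num_generations, max_tokens, tokens_per_sec):
    """Seconds of rollout regeneration needed to recover one lost training step."""
    tokens = batch_prompts * num_generations * max_tokens
    return tokens / tokens_per_sec

one_step = regen_cost_seconds(64, 8, 1024, 30_000)  # 524,288 tokens -> ~17.5 s
lost_100_steps_min = 100 * one_step / 60            # ~29 minutes of regeneration
```

Plugging in your own batch size and throughput tells you how tight your save_steps interval needs to be before a preemption becomes expensive.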
Checkpointing strategy:
- Set save_steps=50 for a full trainer checkpoint (online policy weights, optimizer state, step counter).
- Save the reference model once at training start to a persistent volume. It is frozen and never changes.
- On preemption, the rollout server restarts from the same model weights (it is stateless between steps). Only trainer state needs recovery.
- Store checkpoints on persistent network-attached storage, not the ephemeral instance disk.
```python
import json
import os

from transformers import TrainerCallback

class GRPOResumeCallback(TrainerCallback):
    def __init__(self, checkpoint_dir, save_every=25):
        self.checkpoint_dir = checkpoint_dir
        self.save_every = save_every
        self.kl_history = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and 'kl' in logs:
            self.kl_history.append(logs['kl'])
            # Early warning if KL is climbing
            if len(self.kl_history) >= 10:
                recent_kl = self.kl_history[-10:]
                if recent_kl[-1] > 0.1 and recent_kl[-1] > recent_kl[0] * 1.5:
                    print(f"WARNING: KL divergence climbing ({recent_kl[-1]:.4f}). Consider raising beta.")

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.save_every == 0:
            resume_path = os.path.join(
                self.checkpoint_dir,
                f"grpo_resume_step_{state.global_step}.json"
            )
            resume_state = {
                'step': state.global_step,
                'kl_history': self.kl_history[-100:],  # last 100 KL values
                'best_metric': state.best_metric,
            }
            with open(resume_path, 'w') as f:
                json.dump(resume_state, f)
```

This pattern extends the checkpoint approach from the spot GPU training case study with GRPO-specific additions: KL history tracking and the lightweight per-25-step resume state.
Spot provisioning note: run rollout workers as spot instances. They hold no training state. If preempted, restart vLLM and continue from the last trainer checkpoint with no rollout state loss. Run the trainer on on-demand. The per-minute billing on Spheron means a short preemption on a spot rollout node costs seconds of wasted time, not minutes.
GRPO training runs split cleanly into two workloads: a stateful trainer that needs stable on-demand compute, and stateless rollout workers that are ideal for spot pricing. Spheron lets you provision both separately - rent H200 instances for your trainer, add spot H200 rollout workers at a fraction of the on-demand rate, and pay per minute with no minimum commitment.
