GRPO optimizes single completions: one text output, one score, one gradient update. That works for math reasoning, code generation, and structured output tasks where the unit of work is a single generation. Agent RFT operates differently. The unit of optimization is a trajectory: a sequence of tool calls, their execution results, and the final task state. The distinction matters because agentic tasks do not fail at the last token. They fail mid-sequence, often at tool call 3 of 8, and a training setup that cannot assign credit across steps cannot fix those failures.
This guide covers the mechanics of Agent RFT: how trajectory-level optimization differs from the standard GRPO fine-tuning setup, how to design verifiable rewards for tool-calling agents, GPU sizing for 7B to 70B models, and the disaggregated spot/on-demand architecture that cuts training costs by 30-40% on runs longer than 8 hours.
What Agent RFT Is (and What Makes It Different)
Standard GRPO generates G completions per prompt, scores each independently, computes group-relative advantage, and updates the policy. The optimization signal is per-completion. That is fine when the prompt-response relationship is direct.
An agentic task breaks that assumption. Consider a model trying to complete a customer support task that requires four tool calls: look up the order, check the return policy, process the refund, and send a confirmation email. If the agent makes the first three calls correctly and fails on the fourth, the task fails. A single-completion GRPO setup would assign a -1 reward to the entire sequence, including the three correct steps. That is not informative gradient signal.
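To make the credit-assignment problem concrete, the sketch below scores the four-call support task two ways. The step names and the +1/-1 per-step values are illustrative assumptions, not values taken from any framework.

```python
# Illustrative only: step names and reward values are assumptions.
STEPS = ["get_order", "check_return_policy", "process_refund", "send_confirmation"]


def outcome_only_rewards(step_ok: list[bool]) -> list[float]:
    """Single-completion view: every step inherits the final outcome."""
    outcome = 1.0 if all(step_ok) else -1.0
    return [outcome] * len(step_ok)


def per_step_rewards(step_ok: list[bool]) -> list[float]:
    """Trajectory view: each tool call is scored on its own result."""
    return [1.0 if ok else -1.0 for ok in step_ok]


step_ok = [True, True, True, False]  # first three calls succeed, the fourth fails
print(dict(zip(STEPS, outcome_only_rewards(step_ok))))
print(dict(zip(STEPS, per_step_rewards(step_ok))))
```

In the outcome-only view the three correct calls are penalized exactly like the failing one, which is the uninformative signal described above.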
Agent RFT handles this by treating the trajectory as the optimization unit:

```text
<tool_call>{"name": "get_order", "parameters": {"order_id": "4722"}}</tool_call>
<tool_result>{"status": "delivered", "date": "2026-05-12", "item": "GPU A100"}</tool_result>
<tool_call>{"name": "check_return_policy", "parameters": {"item_type": "electronics"}}</tool_call>
<tool_result>{"eligible": true, "window_days": 30, "days_since_delivery": 33}</tool_result>
<tool_call>{"name": "process_refund", "parameters": {"order_id": "4722", "amount": 2400}}</tool_call>
<tool_result>{"error": "return_window_exceeded"}</tool_result>
```

The agent had all the information it needed after step 2 (the item is 33 days old, the window is 30 days). The refund attempt was the actual failure. With trajectory-level reward, you can score partial credit: +0.5 for reaching the correct tool calls in the right order, -1.0 for the final wrong action. With per-step shaped rewards, you can score each call individually and provide even tighter gradient signal.
ByteDance's verl 0.4+ supports trajectory rollouts natively via its hybrid actor-rollout engine. Unsloth added a low-VRAM Agent RFT recipe in mid-2026 that fits on a single 80GB card for 7B-8B models using its fast_inference patching and trajectory-aware GRPOTrainer. Both build on the same core insight: the GRPO advantage computation still works at the trajectory level; you just need rollout infrastructure that collects full multi-step sequences instead of single completions.
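As a minimal sketch of that trajectory-level advantage computation, the helper below normalizes one scalar reward per trajectory against the other rollouts for the same prompt. The function name, the `eps` value, and the use of the population standard deviation are assumptions; individual frameworks differ in these details.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each trajectory reward against its own group of G rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, G=4 full trajectories, one scalar reward per trajectory.
trajectory_rewards = [1.0, -1.0, 0.5, -1.0]
advantages = group_relative_advantages(trajectory_rewards)
print(advantages)
```

Nothing here depends on how long a trajectory is: the only change from single-completion GRPO is that each reward summarizes a full multi-step rollout.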
When Agent RFT Beats SFT and DPO for Agentic Tasks
SFT on demonstration trajectories is always the starting point. If you have a corpus of successful multi-step agent interactions, SFT on those trajectories will get you 60-75% of the way there with significantly less infrastructure complexity. Agent RFT layers on top to push into the failure modes that demonstrations do not cover.
| Method | Training Data Required | Online Rollouts | Agentic Trajectory Support | Task Coverage |
|---|---|---|---|---|
| SFT | Demonstration trajectories | No | Partial (imitation only) | Patterns in data |
| DPO | Preference pairs (win/loss) | No | Limited (static pairs) | Patterns in data |
| GRPO | Prompts + verifiable reward | Yes (single-turn) | No (single completion) | Full distribution |
| Agent RFT | Prompts + tool sandbox | Yes (multi-turn) | Yes (trajectory-level) | Full distribution |
SFT cannot teach the agent to recover from tool errors it has never seen in demonstrations. If the training data only contains successful trajectories (common in practice, since failed trajectories are rarely logged), the model has no signal for how to handle a tool call that returns a 429, a malformed response, or an empty result set.
DPO is wrong for multi-step agentic tasks specifically because it uses static offline preference pairs. In a multi-turn setting, the "preferred" trajectory depends on the actual tool execution results at runtime. You cannot construct a meaningful static preference pair for a trajectory that involves live API calls, because the tool results are not fixed.
Agent RFT applies RL pressure on the exact failure modes by running the agent against a real tool sandbox during training. The model sees real error responses, real partial results, and real constraint violations - and the reward function scores how it handles each one.
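One way to expose the agent to those error responses without giving up reproducibility is a mock tool that injects faults deterministically. The sketch below is an assumption-laden illustration: `mock_tool`, the `FAULTS` table, and the `fault_rate` parameter are hypothetical names, and the fault is chosen from a hash of the call so the same call always gets the same response.

```python
import hashlib
import json

# Hypothetical fault table: canned failures a call may receive.
FAULTS = [
    {"status_code": 429, "error": "rate_limited"},
    {"status_code": 200, "body": "<<malformed"},
    {"status_code": 200, "results": []},
]


def mock_tool(name: str, params: dict, fault_rate: float = 0.3) -> dict:
    """Return a canned response; identical (name, params) always get the same one."""
    key = json.dumps({"name": name, "params": params}, sort_keys=True)
    digest = hashlib.sha256(key.encode()).digest()
    bucket = digest[0] / 256  # stable pseudo-random value in [0, 1)
    if bucket < fault_rate:
        return FAULTS[digest[1] % len(FAULTS)]
    return {"status_code": 200, "results": [{"tool": name, "params": params}]}


print(mock_tool("get_order", {"order_id": "4722"}))
```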
For evaluating whether your agent needs RFT, run it on tool calling benchmarks including tau-Bench. If you see systematic patterns at turn 3-5 (state loss, wrong function selection, failure to interpret error responses), those are the signals that Agent RFT can address. If failures are random across turns, the issue is more likely context length or base model quality.
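A quick way to tell systematic failures from random ones is to tabulate the turn at which each failed trajectory first went wrong. The helper below is a sketch; it assumes you have already extracted a 1-indexed first-failure turn per failed trajectory, which tau-Bench does not hand you in this form.

```python
from collections import Counter


def failure_turn_profile(failed_turns: list[int], lo: int = 3, hi: int = 5) -> dict:
    """Summarize where trajectories first fail (1-indexed turn numbers)."""
    counts = Counter(failed_turns)
    total = len(failed_turns)
    in_band = sum(n for turn, n in counts.items() if lo <= turn <= hi)
    return {
        "counts": dict(sorted(counts.items())),
        "share_in_band": in_band / total if total else 0.0,
    }


profile = failure_turn_profile([3, 4, 3, 5, 1, 4, 3, 8])
print(profile)
```

A high `share_in_band` points at the turn 3-5 cluster described above; a flat distribution across turns suggests looking at context length or base model quality first.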
Designing Verifiable Reward Functions for Agent Trajectories
The reward function is the most important part of Agent RFT. A bad reward function produces a model that gets good at gaming the metric, not completing tasks.
Three categories of verifiable rewards that work reliably:
Code execution agents: Run the generated code in a sandbox, execute the test suite, and return pass rate.
```python
import re
import subprocess
import tempfile
import uuid
from pathlib import Path


def code_execution_reward(trajectory: str) -> float:
    # WARNING: always sandbox code execution - never run generated code outside Docker
    # Use: docker run --network none --read-only python:3.11-slim python /code.py
    # extract_code_from_trajectory, get_test_suite, and get_test_suite_source are
    # project-specific helpers that are not defined in this snippet.
    code = extract_code_from_trajectory(trajectory)
    tests = get_test_suite()
    tests_source = get_test_suite_source()
    total = len(tests)
    if total == 0:
        return 0.0
    cid = f"rft-sandbox-{uuid.uuid4().hex[:8]}"
    with tempfile.TemporaryDirectory() as tmpdir:
        Path(tmpdir, 'code.py').write_text(code)
        Path(tmpdir, 'test_code.py').write_text(tests_source)
        try:
            # NOTE: the stock python:3.11-slim image does not ship pytest; point this
            # at an image with pytest preinstalled, otherwise "python -m pytest" exits
            # with code 1 and is misread below as "tests ran but none passed".
            result = subprocess.run(
                ["docker", "run", "--rm", "--name", cid, "--network", "none",
                 "--read-only", "--tmpfs", "/tmp", "-v", f"{tmpdir}:/workspace",
                 "python:3.11-slim", "python", "-m", "pytest", "/workspace/",
                 "-q", "--tb=no"],
                capture_output=True, text=True, timeout=30
            )
        except subprocess.TimeoutExpired:
            try:
                subprocess.run(["docker", "kill", cid], capture_output=True, timeout=10)
            except subprocess.TimeoutExpired:
                pass  # Docker daemon unresponsive; abandon cleanup and continue
            return -1.0  # treat timeout as a crash/infinite loop
        m = re.search(r'(\d+) passed', result.stdout)
        passed = int(m.group(1)) if m else 0
        if passed > 0:
            return passed / total
        # returncode 0 or 1 means pytest ran (all-pass or some-fail); returncode 2+ means
        # collection/internal error: tests never executed, so gradient signal would be wrong
        if result.returncode <= 1:
            return 0.0  # tests ran but none passed: legitimate zero score, not a crash
        return -1.0  # compile error or complete crash, no tests ran
```

API call agents: Validate that the trajectory called the correct API with correct parameters and the response schema matches expectations.
```python
def api_call_reward(trajectory: str, expected_schema: dict) -> float:
    calls = extract_tool_calls(trajectory)
    results = extract_tool_results(trajectory)
    schema_valid = 0.0
    n = min(len(calls), len(results))
    if n > 0:
        for call, result in zip(calls, results):
            if validate_schema(call, expected_schema):
                schema_valid += 0.25 / n  # partial credit per pair actually evaluated
            if result.get("status_code", 0) in range(200, 300):
                schema_valid += 0.75 / n
    # task completion check
    final_state = get_task_completion_state(trajectory)
    return schema_valid * 0.5 + float(final_state) * 0.5
```

Database and search agents: Score whether the retrieved result satisfies a structured predicate.
```python
from typing import Callable


def search_reward(trajectory: str, target_entity: str,
                  predicate: Callable[..., bool]) -> float:
    results = extract_search_results(trajectory)
    if not results:
        return -1.0
    # does any retrieved result satisfy the predicate?
    for r in results:
        if predicate(r, target_entity):
            return 1.0
    return 0.0  # no useful result found
```

The composition pattern ties these together into a single reward signal:
```python
def trajectory_reward(trajectory: str, task: dict) -> float:
    # format check: valid tool call syntax
    format_score = check_tool_call_syntax(trajectory)  # [0.0, 1.0]
    # execution check: did the right calls produce the right results
    execution_score = evaluate_tool_results(trajectory, task)  # [0.0, 1.0]
    # task completion check: final task state
    completion_score = check_task_complete(trajectory, task)  # {-1.0, 0.0, 1.0}
    return 0.1 * format_score + 0.4 * execution_score + 0.5 * completion_score
```

Two things to avoid: LLM-as-judge (generating 8 trajectories per prompt and then calling a second LLM to score each one roughly doubles total compute per step and adds systematic bias from the judge model's own tendencies) and non-deterministic sandboxes (if the same tool call returns different results at different times, reward variance dominates the training signal and learning stalls).
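Sandbox determinism can be checked before training by replaying a call and comparing response hashes. This is a sketch under assumptions: `is_deterministic` is a hypothetical helper, tools are modeled as plain callables that take and return JSON-serializable dicts, and `stable_tool` / `flaky_tool` are made-up examples.

```python
import hashlib
import itertools
import json
from typing import Callable


def is_deterministic(tool: Callable[[dict], dict], call: dict, repeats: int = 3) -> bool:
    """Replay one tool call several times and compare hashes of the responses."""
    digests = set()
    for _ in range(repeats):
        payload = json.dumps(tool(call), sort_keys=True).encode()
        digests.add(hashlib.sha256(payload).hexdigest())
    return len(digests) == 1


_request_counter = itertools.count()


def stable_tool(call: dict) -> dict:
    return {"order_id": call["order_id"], "status": "delivered"}


def flaky_tool(call: dict) -> dict:
    # Leaks a per-request counter into the response, so replays differ.
    return {"order_id": call["order_id"], "request_no": next(_request_counter)}


print(is_deterministic(stable_tool, {"order_id": "4722"}))
print(is_deterministic(flaky_tool, {"order_id": "4722"}))
```

Timestamps, request IDs, and pagination cursors are typical sources of this kind of drift; stripping or freezing them in the sandbox keeps identical calls identical.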
For the code execution sandbox, use Docker with --network none --read-only. Never run generated code in an unsandboxed subprocess, even in a training environment. The model will generate code designed to access the filesystem, environment variables, or network during training.
GPU Sizing: Single 80GB vs Multi-GPU Rollouts
VRAM for Agent RFT follows the same structure as GRPO but adds the rollout buffer for full trajectories:
```text
Agent_RFT_VRAM ≈
    model_bf16 × 2        # online policy + reference model (both bf16)
  + params × 12 bytes     # AdamW: fp32 master weights + first + second moments
  + G × T × 2 bytes       # rollout buffer: G trajectories × T total tokens × 2 bytes/token
```

Concrete estimates for common configurations:
| Model | Config | VRAM Estimate | Recommended GPU |
|---|---|---|---|
| 7B | G=8, T=4096, single-node | 70-80 GB | H100 PCIe 80GB or H200 |
| 32B | G=8, T=4096, single-node | 110-130 GB | H200 141GB |
| 70B | G=8, T=4096, multi-GPU | 280-320 GB | 2x H200 or B200 192GB |
For 32B single-node RFT, the H200 141GB is the practical minimum. It gives enough headroom for G=8 trajectories at 4096 total tokens per trajectory without triggering OOM during the backward pass. If VRAM is tight, offload the reference model (frozen, never updated) to CPU: set reference_model_init_kwargs={"device_map": "cpu"} in GRPOConfig. That frees roughly 60-64 GB on 32B runs (32B parameters × 2 bytes in bf16) at a 10-15% step-time penalty.
For 70B, the B200 192GB handles single-GPU training with G=8 and T=8192 tokens. At T=16384 (16 tool calls at ~1024 tokens each), the rollout buffer stays manageable. If you increase to G=16 generations or T=32768, you will need to switch to 2-GPU tensor parallelism or reduce num_generations.
Important: the rollout buffer scales with G × T total tokens, not G × num_calls. For a 32B run with 16 tool calls at 1024 tokens each, T=16384. At G=8, the 2 bytes/token term comes to 8 × 16384 × 2 bytes ≈ 262 KB, and even at G=16 and T=32768 it is only about 1 MB, so the stored tokens themselves are negligible. What competes with optimizer state is the activation and KV-cache memory for those same tokens, which grows with the same G × T product. Reduce num_generations before reducing trajectory depth - shallower trajectories lose the training signal you are trying to capture.
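The individual terms of the sizing formula are easy to evaluate directly. The helpers below are a sketch of two of them only (bf16 weight size and the G × T × 2 bytes buffer term); the function names are assumptions, and activation or KV-cache memory is deliberately not modeled.

```python
def bf16_weights_gb(params_billion: float) -> float:
    """Size of one bf16 copy of the weights: 2 bytes per parameter."""
    return params_billion * 2.0


def total_trajectory_tokens(num_tool_calls: int, tokens_per_call: int) -> int:
    """T in the formula: total tokens across every step of one trajectory."""
    return num_tool_calls * tokens_per_call


def rollout_buffer_bytes(g: int, t: int, bytes_per_token: int = 2) -> int:
    """Third term of the formula: G trajectories x T tokens x bytes per token."""
    return g * t * bytes_per_token


t = total_trajectory_tokens(num_tool_calls=16, tokens_per_call=1024)
for g, tokens in [(8, t), (16, 2 * t)]:
    print(f"G={g} T={tokens} buffer_bytes={rollout_buffer_bytes(g, tokens)}")
for size in (7, 32, 70):
    print(f"{size}B bf16 weights: {bf16_weights_gb(size)} GB per copy")
```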
The disaggregated architecture separates the trainer from the rollout workers. The trainer holds optimizer state and runs gradient updates. Rollout workers run inference-shaped workloads, collect trajectories, and send them back. For the architecture details, see the RLHF training infrastructure guide - the verl HybridEngine and OpenRLHF Ray actor patterns transfer directly to Agent RFT.
Hands-On: Agent RFT with verl, OpenRLHF, and Unsloth
verl (best for 32B+, disaggregated rollout)
verl 0.4+ supports trajectory rollouts via a configurable rollout environment. The RolloutWorker handles the multi-turn loop and the reward_fn receives the full trajectory string.
```yaml
# verl agent rft config
actor_rollout_ref:
  model:
    # example 32B instruct checkpoint; substitute your own base model
    path: Qwen/Qwen2.5-32B-Instruct
  rollout:
    name: sglang
    multi_turn: true        # key: enables trajectory rollouts
    max_turns: 8            # max tool calls per trajectory
    temperature: 1.0
    n: 8                    # G=8 trajectories per prompt
    max_model_len: 8192
  ref:
    log_prob_micro_batch_size: 4
algorithm:
  kl_ctrl:
    kl_coef: 0.04
  adv_estimator: grpo       # group-relative advantage at trajectory level
trainer:
  n_gpus_per_node: 1
  nnodes: 1
  save_freq: 25
```

Register the verifiable reward function:
```python
from verl.trainer.ppo.ray_trainer import RayPPOTrainer


def trajectory_reward_fn(data_items):
    rewards = []
    for item in data_items:
        trajectory = item.response
        task = item.task_metadata
        reward = trajectory_reward(trajectory, task)
        rewards.append(reward)
    return rewards


trainer = RayPPOTrainer(
    config=config,
    reward_fn=trajectory_reward_fn,
)
trainer.fit()
```

OpenRLHF (best for heterogeneous hardware pools)
OpenRLHF uses Ray actors for the rollout workers. The --agent_func_path flag enables the multi-turn agent loop by passing a token-in-token-out agent function that steps through tool calls and accumulates rewards per trajectory.
```bash
# trainer node (on-demand H200); the model ID is an example 32B instruct checkpoint
python -m openrlhf.cli.train_ppo \
  --pretrain Qwen/Qwen2.5-32B-Instruct \
  --reward_pretrain none \
  --reward_fn trajectory_reward_fn \
  --agent_func_path trajectory_agent_fn \
  --max_epochs 2 \
  --micro_train_batch_size 4 \
  --num_episodes 8 \
  --rollout_batch_size 64 \
  --save_steps 25 \
  --kl_target 0.04 \
  --ref_offload  # offload reference model to CPU
```

The --ref_offload flag is worth using even if you have enough VRAM. On 32B runs, it frees the reference model's VRAM footprint (~60 GB) for use during rollout generation, allowing larger batches.
Unsloth (best for single 80GB GPU, 7B-8B models)
Unsloth's GRPOTrainer with agent GRPO support uses fast_inference patching to reduce VRAM for both the online policy and the trajectory buffer. It fits 7B-8B agent RFT on a single H100 PCIe 80GB.
```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# load with Unsloth's fast_inference patching
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=8192,
    load_in_4bit=False,  # agent rft needs bf16 for stable gradients
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=32,
)

# DockerToolSandbox, run_agent_loop, and task_dataset are user-supplied, not library APIs
sandbox = DockerToolSandbox(image="python:3.11-slim", network="none")


def agent_trajectory_rollout(prompts, **kwargs):
    """Custom rollout: run the full agent loop and return complete trajectories.

    Note: TRL's GRPOTrainer rollout_func expects a dict with keys
    prompt_ids, completion_ids, and logprobs (token ids and log-probs,
    not raw strings). The run_agent_loop helper must tokenize the trajectory
    and return that dict shape before this function returns.
    """
    batch = {"prompt_ids": [], "completion_ids": [], "logprobs": []}
    for prompt in prompts:
        traj = run_agent_loop(
            model=model,
            tokenizer=tokenizer,
            initial_prompt=prompt,
            tool_sandbox=sandbox,
            max_turns=8,
        )
        # run_agent_loop returns {"prompt_ids": ..., "completion_ids": ..., "logprobs": ...}
        # for one trajectory; collect them into the single dict the docstring describes
        for key in batch:
            batch[key].append(traj[key])
    return batch


config = GRPOConfig(
    output_dir="./agent-rft-checkpoints",
    num_generations=8,
    max_completion_length=8192,  # full trajectory length
    beta=0.04,
    save_steps=25,
    use_vllm=False,  # use Unsloth's fast_inference instead
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
)
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    args=config,
    train_dataset=task_dataset,
    reward_funcs=[trajectory_reward],
    rollout_func=agent_trajectory_rollout,  # inject the multi-turn rollout
)
trainer.train()
```

The rollout_func hook is the key addition. It replaces the default single-completion generation with a full agent loop, letting you use any tool sandbox implementation without changing the training framework.
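The agent loop itself is left undefined above. A text-level sketch of what such a loop does might look like the following. Everything in it is an assumption for illustration: `run_agent_loop_text`, the `policy` callable (a stand-in for model generation), the `tools` registry, and the scripted example. It works on strings only; a real rollout for TRL would also need to tokenize the trajectory and record log-probs, which is not shown.

```python
import json
import re
from typing import Callable

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def run_agent_loop_text(
    policy: Callable[[str], str],
    tools: dict[str, Callable[..., dict]],
    prompt: str,
    max_turns: int = 8,
) -> str:
    """Alternate policy output and tool results until the policy stops calling tools."""
    transcript = prompt
    for _ in range(max_turns):
        output = policy(transcript)
        transcript += output
        match = TOOL_CALL.search(output)
        if match is None:  # no tool call: treat the output as the final answer
            break
        try:
            call = json.loads(match.group(1))
            result = tools[call["name"]](**call.get("parameters", {}))
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Malformed JSON, unknown tool, or bad arguments become a visible error
            result = {"error": type(exc).__name__}
        transcript += f"<tool_result>{json.dumps(result)}</tool_result>"
    return transcript


def scripted_policy(transcript: str) -> str:
    """Stand-in for the model: one tool call, then a final answer."""
    if "<tool_result>" not in transcript:
        return '<tool_call>{"name": "get_order", "parameters": {"order_id": "4722"}}</tool_call>'
    return "Order 4722 was delivered."


tools = {"get_order": lambda order_id: {"status": "delivered", "order_id": order_id}}
print(run_agent_loop_text(scripted_policy, tools, "Where is order 4722? "))
```

Feeding tool errors back into the transcript, rather than aborting, is what lets the reward function score how the agent reacts to them.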
Cost Math: RFT Training Runs on Spheron Spot vs On-Demand
Agent RFT has two distinct compute phases per training step: the rollout phase (inference-shaped, stateless, runs the agent loop and collects trajectories) and the update phase (gradient-shaped, stateful, computes advantage and runs backprop). The rollout phase is spot-eligible because rollout workers hold no optimizer state. The update phase must stay on on-demand.
Live GPU pricing as of 14 Jun 2026 (all figures per GPU per hour):
| GPU | On-Demand | Spot |
|---|---|---|
| A100 80G SXM4 | $1.80/hr | $0.82/hr |
| H100 SXM5 | $3.92/hr | $1.43/hr |
| H200 SXM5 | $4.84/hr | $1.82/hr |
| B200 SXM6 | $7.41/hr | $2.71/hr |
Pricing fluctuates based on GPU availability. The prices above are as of 14 Jun 2026 and may have changed. Check current GPU pricing for live rates.
24-hour training cost by configuration:
| Method | Model | GPU Config | 24hr Trainer Cost | 24hr Rollout Cost |
|---|---|---|---|---|
| SFT only | 7B | 1x A100 SXM4 (spot) | - | $19.68 (fully spot) |
| Agent RFT | 7B | 1x H100 (single-node) | $94.08 on-demand | included (single-node) |
| Agent RFT (disaggregated) | 32B | 1x H200 trainer + 2x H200 spot rollout | $116.16 on-demand | $87.36 rollout |
| Agent RFT (disaggregated) | 70B | 1x B200 trainer + 2x B200 spot rollout | $177.84 on-demand | $130.08 rollout |
For 7B single-node runs, disaggregation is not worth the operational overhead. Run everything on one H100 or H200 with use_vllm=False. For 32B and 70B, the disaggregated split pays off: rollout workers on Spheron's 5+ providers can be provisioned and released per training batch, so you only pay for rollout compute while it is actually generating trajectories.
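The disaggregated rows in the table can be reproduced with a few lines of arithmetic. The sketch below copies the H200 and B200 rates from the pricing table and assumes one on-demand trainer GPU plus two rollout GPUs billed for the full 24 hours; `run_cost` is a hypothetical helper.

```python
# Hourly rates copied from the pricing table above (14 Jun 2026).
RATES = {
    "H200": {"on_demand": 4.84, "spot": 1.82},
    "B200": {"on_demand": 7.41, "spot": 2.71},
}


def run_cost(gpu: str, hours: float, rollout_gpus: int, disaggregated: bool) -> float:
    """One on-demand trainer GPU plus rollout GPUs on spot or on-demand."""
    rate = RATES[gpu]
    rollout_rate = rate["spot"] if disaggregated else rate["on_demand"]
    return round(hours * (rate["on_demand"] + rollout_gpus * rollout_rate), 2)


for gpu in RATES:
    split = run_cost(gpu, 24, rollout_gpus=2, disaggregated=True)
    all_on_demand = run_cost(gpu, 24, rollout_gpus=2, disaggregated=False)
    print(f"{gpu}: spot rollout ${split} vs all on-demand ${all_on_demand}")
```

If rollout workers are released between batches, the spot portion shrinks further, since this calculation bills them for every hour of the run.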
For checkpoint strategy on spot rollout workers, see the spot GPU training resilience guide. The core pattern (checkpoint the trainer every 25 steps, never checkpoint rollout workers) applies directly to Agent RFT. For scaling rollout to large vectorized environment fleets, see the RL environments on GPU cloud guide.
Serving the Fine-Tuned Agent and Closing the Eval Loop
After training, merge any LoRA adapters before deployment:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# load base + adapter (the base ID is an example; it must match the model you trained)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
model = PeftModel.from_pretrained(base_model, "./agent-rft-checkpoints/final")
merged = model.merge_and_unload()
merged.save_pretrained("./merged_agent_model")
tokenizer.save_pretrained("./merged_agent_model")
```

Serve with vLLM and tool calling enabled:
```bash
vllm serve ./merged_agent_model \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192 \
  --tensor-parallel-size 1
```

Run tau-Bench on the merged model to close the eval loop:

```bash
python run_eval.py \
  --domain retail \
  --model http://localhost:8000/v1 \
  --max-concurrency 8 \
  --num-trials 3
```

A successful Agent RFT run adds 10-25 percentage points of task completion over the SFT baseline for tool-calling tasks not covered by the demonstrations. The gain is concentrated on failure modes the training reward specifically targeted: mid-trajectory tool errors, wrong parameter types, failure to recover from empty results.
If post-RFT tau-Bench accuracy is flat or below the SFT baseline, check for reward hacking. A common pattern: the agent learns to produce trajectories that technically satisfy the format component (valid JSON syntax for all tool calls) without calling the tools in a useful order, collecting the 0.1-weighted format score on every trajectory without ever completing a task. Add a hard gate: if the task-completion component (completion_score in the composition above, trajectory_success_reward in the Quick Setup Guide) is -1.0, return -1.0 as the total reward regardless of format or execution scores. That removes the incentive to optimize low-weight reward components at the expense of task completion.
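A minimal sketch of that hard gate, reusing the 0.1 / 0.4 / 0.5 weights from the composition pattern earlier in the guide. The function name and the choice to pass the three component scores in directly are assumptions made to keep the example self-contained.

```python
def gated_trajectory_reward(format_score: float, execution_score: float,
                            completion_score: float) -> float:
    """Weighted sum, except that a failed task overrides the other components."""
    if completion_score == -1.0:
        return -1.0  # hard gate: no partial credit when the task failed
    return 0.1 * format_score + 0.4 * execution_score + 0.5 * completion_score


# Perfect syntax and tool results, but the task itself failed.
print(gated_trajectory_reward(1.0, 1.0, -1.0))
```

Without the gate, that same trajectory would score 0.1 + 0.4 - 0.5 = 0.0, which is better than a trajectory that fails with sloppy formatting and therefore worth optimizing for.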
The eval-to-train loop works directly: take the tau-Bench failure cases, add them as new training prompts with the same tool sandbox, and run another round of Agent RFT. Each round pushes the agent further into the failure modes the previous round exposed. Three to four rounds typically covers the systematic failure clusters a 7B-32B model starts with.
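The bookkeeping for one round of that loop can be sketched as below. The record fields (`task_id`, `prompt`, `success`) and the helper name are hypothetical; adapt them to whatever your eval harness actually writes out.

```python
def next_round_prompts(train_prompts: list[dict], eval_results: list[dict]) -> list[dict]:
    """Append failed eval tasks to the training set, skipping ones already present."""
    seen = {p["task_id"] for p in train_prompts}
    merged = list(train_prompts)
    for result in eval_results:
        if not result["success"] and result["task_id"] not in seen:
            merged.append({"task_id": result["task_id"], "prompt": result["prompt"]})
            seen.add(result["task_id"])
    return merged


train = [{"task_id": "t1", "prompt": "Refund order 4722"}]
evals = [
    {"task_id": "t1", "prompt": "Refund order 4722", "success": False},
    {"task_id": "t2", "prompt": "Cancel order 4723", "success": False},
    {"task_id": "t3", "prompt": "Track order 4724", "success": True},
]
print(next_round_prompts(train, evals))
```

Tasks moved into training this way no longer measure generalization, so it helps to keep a separate held-out split for the comparison against the SFT baseline.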
Agent RFT's bursty rollout-plus-train workload fits Spheron's spot pricing model: run rollout workers on spot H200s, keep the trainer on on-demand, and pay per minute with no minimum commitment. Verified live rates are listed on the H200 GPU pricing page.
Quick Setup Guide
VRAM formula for Agent RFT: (model_params_in_bf16 × 2 for online policy + reference) + (model_params × 12 for AdamW optimizer state: fp32 master weights + first and second moments) + (G × T × D bytes for rollout buffer, where G = number of trajectories per batch, T = total tokens per trajectory = num_tool_calls × avg_tokens_per_call, D = model hidden dimension × 2 bytes). For 7B agent RFT with G=8 trajectories, T=4096 tokens each: rollout buffer ≈ 4 GB. Total: ~68-75 GB. Fits on H100 80GB with gradient checkpointing enabled. For 32B with G=8, T=8192: total ≈ 110-130 GB. Use H200 141GB. For 70B: B200 192GB or two H200 with tensor parallelism.
Build a tool execution sandbox before starting RFT. For code agents, use Docker with --network none --read-only for sandboxed subprocess execution (never run generated code outside a sandbox). For API agents, proxy calls through a local mock server that records and validates call sequences. For search or database agents, use a local Elasticsearch or SQLite instance with a fixed snapshot. The sandbox must return a deterministic binary or scalar reward: 1.0 for complete task success, 0.5 for partial (e.g. correct API called, wrong parameters), -1.0 for failure or invalid call. Register the sandbox reward function as the verifiable reward in your training framework. Test it manually with a fixed trajectory before starting the training run.
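For the database case, a fixed-snapshot sandbox can be as small as an in-memory SQLite database rebuilt for every rollout. The sketch below uses made-up order rows and a hypothetical `lookup_reward` that follows the 1.0 / 0.5 / -1.0 scheme described above.

```python
import sqlite3

# Fixed snapshot: every rollout sees exactly these rows (illustrative data).
SNAPSHOT = [("4722", "GPU A100", "delivered"), ("4723", "GPU H100", "shipped")]


def make_sandbox() -> sqlite3.Connection:
    """Build a fresh in-memory database from the fixed snapshot."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, item TEXT, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", SNAPSHOT)
    return conn


def lookup_reward(conn: sqlite3.Connection, order_id: str, expected_status: str) -> float:
    """1.0 for the expected status, 0.5 if the row exists but differs, -1.0 if missing."""
    row = conn.execute(
        "SELECT status FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return -1.0
    return 1.0 if row[0] == expected_status else 0.5


print(lookup_reward(make_sandbox(), "4722", "delivered"))
```

Rebuilding the database per rollout means writes made by one trajectory cannot leak into the next, which keeps rewards reproducible.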
For TRL GRPOTrainer: implement a custom rollout function that runs the full agent loop (tool call → tool result → next call) and returns the complete trajectory, tool call XML/JSON and results included, tokenized into the prompt_ids, completion_ids, and logprobs that rollout_func expects. Pass the verifiable reward as reward_funcs=[trajectory_reward]. Set max_completion_length to the maximum total trajectory length (e.g., 8192 for 8-step trajectories at 1024 tokens per step). For verl: define a multi-turn rollout environment using multi_turn: true under the rollout config with the SGLang engine. For OpenRLHF: use the --agent_func_path flag to pass a token-in-token-out agent function that steps through tool calls and accumulates rewards. Set num_generations to 8 for stable group-relative advantage estimates.
Provision an H200 141GB on-demand instance (for 7B-32B single-node RFT). Install dependencies: pip install 'trl>=0.14' 'vllm>=0.6' transformers accelerate peft. For disaggregated rollout, provision a separate spot node with two H200s for the vLLM rollout server (the command below shards the model across both GPUs): vllm serve <model_path> --tensor-parallel-size 2 --gpu-memory-utilization 0.85 --max-model-len 8192. Configure TRL GRPOConfig with use_vllm=True, vllm_server_host=<rollout_node_ip>, num_generations=8, beta=0.04, and save_steps=25. Start training: trainer.train(). Monitor kl_divergence, reward_mean, and reward_std per step in WandB or TensorBoard. For 32B models, set reference_model_init_kwargs device_map='cpu' to offload the reference model if VRAM is tight.
Write the reward function to return float tensors in [-1.0, 1.0]. Structure it as a composition: (1) format_reward checks that the trajectory contains valid tool call syntax (JSON schema, correct tool name, required parameters present) - score 0.25 for valid syntax, -0.5 for malformed calls. (2) execution_reward runs the tool call in the sandbox and checks the result against the expected predicate - score 0.75 for success, -0.5 for execution error or wrong result. (3) trajectory_success_reward checks the final task state after all tool calls - score 1.0 for task completion, 0.0 for partial completion, -1.0 for task failure. Weight and sum these: total_reward = 0.1 × format_reward + 0.4 × execution_reward + 0.5 × trajectory_success_reward. Validate by running 50 manual trajectories through the reward function and confirming scores match human judgment.
After training, merge LoRA adapters (if using LoRA) into the base model: model.merge_and_unload(). Deploy the merged model with vLLM: vllm serve ./merged_model --enable-auto-tool-choice --tool-call-parser hermes --max-model-len 8192. Run a post-training eval on tau-Bench (retail or airline domains) or a held-out tool-calling task suite using the AI agent benchmarking setup from the SWE-bench infrastructure guide. Compare task completion rate against the SFT baseline. A successful RFT run typically adds 10-25 percentage points of task completion over SFT for tool-calling tasks the demonstrations did not cover. If post-RFT accuracy is flat or below SFT baseline, check reward hacking by inspecting whether the agent learned to exploit reward function edge cases rather than genuinely solving tasks.
Frequently Asked Questions
What is Agent RFT, and how does it differ from standard GRPO?
Agent RFT (Reinforcement Fine-Tuning) optimizes an AI agent on full multi-step tool-use trajectories - sequences of tool calls, their results, and the final task outcome - rather than single text completions. Standard GRPO evaluates completions independently: each generated text is scored, and group-relative advantage is computed across the batch. Agent RFT treats the entire trajectory (tool call 1 → result → tool call 2 → result → final answer) as the unit of optimization, assigning credit across steps. This matters because agentic tasks fail mid-trajectory: a correct step 1 followed by an incorrect step 3 should get partial credit, not zero. GRPO cannot express this without trajectory-level rollout design. The reward functions also differ: GRPO typically scores text against a ground truth answer, while Agent RFT's rewards come from actual tool execution results (API call succeeded, code ran, task completed).
When should I use Agent RFT instead of SFT or DPO?
Use Agent RFT when your agent needs to improve on tasks it cannot succeed at via imitation alone, and when the success signal is measurable. If your agent fails on real multi-step tasks (hitting 40-60% task completion on tau-Bench or SWE-bench) despite strong SFT on demonstrations, Agent RFT applies RL pressure on the exact failure patterns. SFT is the right start: train on successful trajectories first, then RFT on top to push into failure modes the demonstrations do not cover. DPO is wrong for agentic tasks with multi-step dynamics because it uses static offline preference pairs - it cannot learn from the dynamic interaction between the agent and real tool execution results.
What GPU do I need for Agent RFT on a 7B or 32B model?
For 7B agent RFT with trajectory length up to 8 tool calls and 512 tokens per step: a single H100 PCIe 80GB or H200 141GB covers the run. VRAM breakdown: 14 GB for 7B BF16 weights (online policy), 14 GB for the reference model copy, ~30 GB for AdamW optimizer states, and ~10-20 GB for the rollout buffer (8 trajectories at 8 steps each). Total: ~70-80 GB, fitting on an 80GB card with gradient checkpointing. For 32B agent RFT with the same rollout depth: use an H200 141GB or B200 192GB. The reference model (frozen) can be offloaded to CPU if VRAM is tight, at a 10-15% step-time penalty.
Can Agent RFT training run on spot instances?
Yes, with the same split used for GRPO: rollout workers (inference-shaped, stateless) are spot-eligible; the trainer (gradient state, optimizer) stays on on-demand. Rollout workers in Agent RFT hold no training state - they execute the agent loop, collect trajectories, and push reward tensors back to the trainer. If a spot rollout worker is preempted, the trainer reissues that rollout batch to another worker without loss of training progress. On Spheron, B200 spot instances were $2.71/hr versus $7.41/hr on-demand as of 14 Jun 2026, making disaggregated spot rollout the default cost strategy for runs longer than 8 hours.
How do I design verifiable rewards for tool-calling agents?
Verifiable rewards for agents require that the correct outcome is observable from tool results - not from LLM judgment. Three working categories: (1) Code execution agents: reward 1.0 if all unit tests pass, partial reward for pass rate, -1.0 for compile errors. (2) API call agents: reward 1.0 if the HTTP response status is 2xx and the response schema validates, -1.0 otherwise. (3) Database or search agents: reward based on whether the retrieved result contains the target entity or satisfies a structured predicate. Avoid rewards that require a second LLM to judge - LLM-as-judge roughly doubles compute per step and introduces systematic bias. The key constraint is that the tool environment must be deterministic and sandboxed: two identical tool calls at different times must return the same output, or reward variance will dominate the training signal.
What is the difference between trajectory-level and step-level rewards?
Trajectory-level reward assigns a single scalar to the entire sequence of tool calls (did the task complete?). Step-level reward assigns a scalar to each individual tool call (was this step syntactically valid? did it move toward the goal?). Most working Agent RFT implementations start with trajectory-level rewards for simplicity - one scalar per completed trajectory, no credit assignment between steps. Step-level rewards require a shaped reward function that evaluates intermediate states, which is harder to design but converges faster by reducing credit assignment noise across long trajectories. For tasks with more than 5 tool calls per trajectory, mixing trajectory-level success reward with a per-step format/validity reward (schema compliance, no hallucinated tool names) improves convergence stability.
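The mixed scheme for longer trajectories can be sketched as a blend of one success scalar with the mean per-step validity. The `mixed_reward` name and the 0.2 step weight are assumptions chosen for illustration, not recommended values.

```python
def mixed_reward(step_valid: list[bool], task_success: bool,
                 step_weight: float = 0.2) -> float:
    """Blend a trajectory-level success signal with the mean per-step validity."""
    success = 1.0 if task_success else -1.0
    validity = sum(step_valid) / len(step_valid) if step_valid else 0.0
    return (1.0 - step_weight) * success + step_weight * validity


# Six tool calls, one with an invalid schema, and the task ultimately failed.
print(mixed_reward([True, True, False, True, True, True], task_success=False))
```

Keeping the step weight small means the validity term breaks ties between failed trajectories without ever outweighing task success.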
