DPO vs PPO: Which RLHF Algorithm to Use for Production LLM Alignment (2026 Decision Guide)

Both DPO and PPO align LLMs to human preferences, but they make opposite trade-offs on GPU cost, feedback data requirements, and output quality ceiling. If you are picking between them for a production alignment job, the algorithm choice belongs before GPU provisioning, not after. For the full DPO implementation walkthrough, see the DPO fine-tuning guide. For complete RLHF infrastructure setup across verl, OpenRLHF, and TRL, see the RLHF training infrastructure guide.

The Core Problem Both Solve

SFT (supervised fine-tuning) teaches a model to follow instructions by training on human-written demonstrations. What it does not teach is which of two valid responses a human would prefer. A model can follow instructions accurately and still produce outputs that are too verbose, unsafe, or stylistically wrong.

RLHF is the post-training phase that closes this gap. The input is a preference signal: humans (or a proxy reward model) label which of two model outputs is better. The goal is to shift the policy toward the preferred outputs without completely breaking what SFT built. DPO and PPO are two different implementations of that shift.

Both use a reference model (the frozen SFT checkpoint) as a KL anchor to prevent the policy from diverging too far. The difference is what they do between the preference labels and the gradient update.

PPO Mechanics

PPO requires four model copies running simultaneously.

The actor is the online policy being trained. It generates completions during rollout and receives gradient updates. The critic is a value model with the same architecture as the actor, trained to estimate the expected return from any given token state. The reference is the frozen SFT checkpoint, used only to compute KL divergence against the actor's current distribution. The reward model is a separately trained model that scores each rollout completion with a scalar reward.

The training loop: sample a batch of prompts, generate completions from the actor (rollout), score each completion with the reward model, compute per-token advantage estimates using GAE (generalized advantage estimation) with the critic's value predictions, and update both the actor (policy gradient with clipped surrogate loss) and critic (value regression loss) simultaneously.
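The advantage step of that loop can be sketched in a few lines of plain Python. This is a minimal illustration of GAE for a single completion, assuming the reward model's score arrives only at the final token and the per-token values come from the critic. The function name `compute_gae` and the default `gamma` and `lam` values are illustrative assumptions, not taken from any specific framework.

```python
from typing import List


def compute_gae(rewards: List[float], values: List[float],
                gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation for one completion.

    rewards[t] is the reward received after token t (often zero everywhere
    except the last token, where the reward model's score lands).
    values[t] is the critic's value prediction at token t.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # value after the final token is zero (episode ends)
    running = 0.0      # discounted sum of TD residuals
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Three-token completion scored 1.0 by the reward model at the last token.
adv = compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.4, 0.6])
```

Earlier tokens receive credit through the discounted chain of TD residuals, which is the per-token signal the critic exists to provide.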

Two terms in the PPO objective constrain the actor: the clipped ratio limits how far a single update can move from the policy that generated the rollout, and the KL penalty limits drift from the reference:

r(θ) = π_θ(a|s) / π_old(a|s)   # ratio of new to old policy
L_CLIP = E[min(r(θ)A, clip(r(θ), 1-ε, 1+ε)A)] - β * KL(π_θ || π_ref)
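For one sampled token, the objective above can be written out directly. The sketch below is illustrative only: the helper names, the clip range `eps=0.2`, and the single-sample KL estimate (log π_θ minus log π_ref for the sampled token) are assumptions made for the example, and real trainers average these quantities over whole batches.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """Per-token clipped PPO objective (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)             # r(θ)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip(r, 1-ε, 1+ε)
    return min(ratio * advantage, clipped * advantage)


def penalized_objective(logp_new: float, logp_old: float, logp_ref: float,
                        advantage: float, eps: float = 0.2,
                        beta: float = 0.2) -> float:
    """Clipped objective minus β times a single-sample KL estimate."""
    kl_estimate = logp_new - logp_ref   # log π_θ - log π_ref for this token
    return clipped_surrogate(logp_new, logp_old, advantage, eps) - beta * kl_estimate
```

With a positive advantage, a ratio of 1.5 is credited only as 1.2, so the update gains nothing by pushing the ratio further. With a negative advantage the unclipped term is kept when it is the more pessimistic of the two.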

The critic is expensive for a specific reason: it needs the same architecture as the actor, which means it carries the same weight count and the same full AdamW optimizer state. For a 7B model, that is 98 GB for the actor-plus-optimizer and another 98 GB for the critic-plus-optimizer.

| Component | Memory (7B model) | GPU Assignment |
|---|---|---|
| Actor (training) | ~98 GB (BF16 + AdamW) | On-demand, FSDP |
| Critic (training) | ~98 GB (BF16 + AdamW) | On-demand, FSDP |
| Reference (inference) | ~14 GB (BF16 only) | Spot-eligible |
| Reward (inference) | ~14 GB (BF16 only) | Spot-eligible |
| Total | ~224 GB+ | 4x H100 80GB min |

The reference and reward models are inference-only, so they do not need optimizer state. They are also spot-eligible since they hold no training state that is expensive to regenerate on preemption.

DPO Mechanics

DPO's insight is that, for the KL-regularized RLHF objective, the optimal policy has a closed-form relationship to the reward function and the reference model. That relationship can be reparameterized: instead of training a reward model and then training the policy against it, express the reward directly in terms of the policy's own log-probabilities.

The DPO loss given a preference pair (chosen response y_w, rejected response y_l, prompt x):

L_DPO = -log σ(β * (log π_θ(y_w|x) - log π_ref(y_w|x)) - β * (log π_θ(y_l|x) - log π_ref(y_l|x)))

β is the KL penalty strength (typically 0.05-0.2). Increasing β keeps the policy closer to the reference; decreasing it allows more deviation.
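The loss for a single preference pair can be checked by hand. This is a minimal sketch that assumes the four inputs are summed sequence log-probabilities already computed by the policy and the reference. The function and argument names are illustrative, not a library API.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log σ(margin), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# At initialization the policy equals the reference, so the margin is zero.
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After training shifts mass toward the chosen response, the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The starting loss is log 2 (about 0.693) regardless of β, which is a handy sanity check on the first logged training step.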

No reward model, no critic, no rollout generation. The training loop is a standard forward-backward pass on preference pairs. Two model copies: the trainable policy and the frozen reference.

| Component | Memory (7B model) | GPU Assignment |
|---|---|---|
| Policy (training) | 14 GB weights + 84 GB AdamW = 98 GB | On-demand or spot |
| Reference (inference) | ~14 GB (BF16 only) | Shared GPU or spot |
| Total | ~112 GB | 2x H100 80GB |

DPO halves the GPU requirement of PPO at 7B scale and removes the reward model training phase entirely.

GPU Memory Footprint: PPO vs DPO

The formulas for computing total VRAM, unsharded:

DPO:

VRAM = ((params * 2 bytes)      # policy weights in BF16
     + (params * 2 bytes)       # reference weights in BF16
     + (params * 12 bytes))     # AdamW: FP32 master + first moment + second moment
     * 1.15-1.20                # activation headroom

PPO:

VRAM = ((params * 14 bytes)     # actor: BF16 weights + AdamW
     + (params * 14 bytes)      # critic: BF16 weights + AdamW
     + (params * 2 bytes)       # reference: BF16 inference only
     + (params * 2 bytes))      # reward model: BF16 inference only
     * 1.15-1.20                # activation headroom + rollout buffers

Divide by GPU count after applying FSDP or ZeRO-3 sharding.
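The two formulas translate directly into a small estimator. This is a sketch under the same assumptions as the formulas (BF16 weights, full AdamW state, parameter counts in billions so the result is in GB). The function names and the even-sharding assumption in `min_gpus` are illustrative.

```python
import math

BYTES_BF16 = 2     # weights held in BF16
BYTES_ADAMW = 12   # FP32 master copy + first moment + second moment


def dpo_vram_gb(params_billion: float, headroom: float = 1.0) -> float:
    """Policy weights + reference weights + optimizer state, in GB."""
    per_param = BYTES_BF16 + BYTES_BF16 + BYTES_ADAMW
    return params_billion * per_param * headroom


def ppo_vram_gb(params_billion: float, headroom: float = 1.0) -> float:
    """Actor + critic (both trained) + reference + reward model, in GB."""
    trained = BYTES_BF16 + BYTES_ADAMW            # 14 bytes per trained parameter
    per_param = 2 * trained + 2 * BYTES_BF16
    return params_billion * per_param * headroom


def min_gpus(total_gb: float, gpu_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined memory covers the total."""
    return math.ceil(total_gb / gpu_gb)


dpo_7b = dpo_vram_gb(7)   # raw total, no headroom
ppo_7b = ppo_vram_gb(7)
```

For a 7B model this gives 112 GB for DPO and 224 GB for PPO before headroom. Passing `headroom=1.15` lifts PPO to roughly 258 GB, which is why four 80 GB cards are the practical minimum rather than three.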

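Scaling across model sizes (raw totals before the 15-20% headroom, so treat the GPU counts as lower bounds):

| Model | DPO VRAM | DPO GPU Count (H100 80GB) | PPO VRAM | PPO GPU Count (H100 80GB) |
|---|---|---|---|---|
| 7B | ~112 GB | 2x | ~224 GB+ | 4x |
| 13B | ~208 GB | 3x | ~416 GB+ | 6x |
| 32B | ~512 GB | 7x | ~1,024 GB+ | 13x |
| 70B | ~1,120 GB | 14x | ~2,240 GB+ | 28x |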

At 70B, DPO already needs 14x H100 80GB GPUs for full-parameter training. PPO doubles that. For anything above 13B, use H200 (141 GB) or B200 (192 GB) to cut the GPU count, or switch to LoRA DPO where optimizer state covers only the adapter parameters.

PPO rollout workers (vLLM) can run on spot instances since they hold no optimizer state. The actor and critic must run on on-demand nodes to avoid losing checkpoint state on preemption.

Training Throughput and Wall-Clock on H100 and H200

DPO is compute-bound. The training loop is one forward pass through the policy (for log-probabilities), one forward pass through the reference, and a backward pass through the policy. No generation step, no rollout buffer to fill. DPO throughput scales linearly with GPU count up to the point where communication overhead dominates.

PPO is bottlenecked by rollout generation. Each PPO iteration requires generating complete responses from the actor before the gradient update can begin, and in most frameworks the rollout phase does not overlap with the update phase. With colocated rollout, the training workers sit idle while generation runs. Dedicated rollout nodes remove that idle time on the trainers but add weight-sync and data-transfer overhead between nodes.

Indicative benchmarks for a 7B model:

| Algorithm | Hardware | Time to convergence | Notes |
|---|---|---|---|
| DPO | 2x H100 SXM5 | 6-10 hours (50K preference pairs) | No rollout bottleneck |
| PPO | 4x H100 SXM5 | 24-36 hours (includes reward model training + PPO loop) | Rollout generation dominates wall-clock |

H200's advantage for PPO is the memory bandwidth. H200 SXM5 has 4.8 TB/s HBM3e bandwidth versus H100 SXM5's 3.35 TB/s. Rollout generation is memory-bandwidth-bound (KV cache reads dominate), so H200 cuts rollout latency meaningfully. For DPO, which is compute-bound, the H200 advantage is smaller.
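A rough roofline-style calculation puts a bound on that advantage. The sketch below assumes each decode step must stream at least the model's BF16 weights from HBM once and ignores KV cache reads, batching, and kernel overheads, so the numbers are floors rather than measured latencies. The function name is illustrative.

```python
H100_TB_S = 3.35   # H100 SXM5 HBM bandwidth quoted above
H200_TB_S = 4.8    # H200 SXM5 HBM3e bandwidth quoted above


def decode_floor_ms(bytes_read_gb: float, bandwidth_tb_s: float) -> float:
    """Bandwidth floor on one decode step: GB read / (TB/s) gives milliseconds."""
    return bytes_read_gb / bandwidth_tb_s


weights_gb = 14.0   # 7B model in BF16; KV cache reads would add to this
h100_ms = decode_floor_ms(weights_gb, H100_TB_S)
h200_ms = decode_floor_ms(weights_gb, H200_TB_S)
max_speedup = h100_ms / h200_ms   # equals 4.8 / 3.35
```

Under these assumptions the per-token floor falls from about 4.2 ms to about 2.9 ms, so the best case for a purely bandwidth-bound rollout is roughly a 1.43x speedup.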

Sample Efficiency and Stability

DPO has one hyperparameter that matters: beta. The loss is stable by default. Typical training runs converge without exploding gradients, NaN losses, or policy collapse. The main failure mode is a preference dataset that is inconsistent or low-quality, which shows up as a reward margin that never widens.

PPO has six hyperparameters that interact with each other: init_kl_coef, clip_ratio, gae_lambda, the actor learning rate, the critic learning rate, and ppo_epochs per rollout batch. Common failure modes:

  • Reward hacking: reward score climbs but held-out task accuracy falls. The policy has learned to exploit the reward model's scoring patterns rather than underlying quality.
  • KL collapse: kl_div crosses 0.1 and keeps climbing. The policy is drifting too far from the reference per update. Fix by raising init_kl_coef by 50%.
  • Critic divergence: val_loss goes NaN in the first 100-200 steps. The critic is receiving unbounded advantage estimates from a near-random initialization. Fix by warming up the critic for one epoch before actor updates begin.
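The three failure modes above lend themselves to an automated check over logged metrics. This is a hypothetical monitoring helper: the metric keys (`reward_mean`, `heldout_acc`, `kl_div`, `val_loss`), the three-step window, and the warning labels are assumptions for illustration and will need mapping to whatever your trainer actually logs.

```python
import math
from typing import Dict, List


def diagnose(history: List[Dict[str, float]], kl_limit: float = 0.1) -> List[str]:
    """Return warning labels for a list of per-step metric dictionaries."""
    warnings = []
    first, last = history[0], history[-1]

    # Reward hacking: reward climbs while held-out accuracy falls.
    if last["reward_mean"] > first["reward_mean"] and last["heldout_acc"] < first["heldout_acc"]:
        warnings.append("reward_hacking")

    # KL drift: above the limit and still rising over the last three logged steps.
    recent_kl = [step["kl_div"] for step in history[-3:]]
    rising = all(a < b for a, b in zip(recent_kl, recent_kl[1:]))
    if len(recent_kl) >= 2 and rising and recent_kl[-1] > kl_limit:
        warnings.append("kl_climbing")

    # Critic divergence: any NaN in the value loss.
    if any(math.isnan(step["val_loss"]) for step in history):
        warnings.append("critic_divergence")
    return warnings
```

Running a check like this on every logging interval surfaces problems before hours of rollout compute are spent on a run that has already gone wrong.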

DPO runs on a 7B model with the default TRL config typically converge without intervention. PPO on the same model needs active monitoring of kl_div and reward_mean against held-out accuracy from the first training step.

Output Quality: Where Each Algorithm Wins

| Task Type | DPO | PPO |
|---|---|---|
| Instruction following (single-turn) | Strong | Overkill |
| Style and tone alignment | Strong | Overkill |
| Safety alignment | Strong | Better with trained RM |
| Multi-turn conversation | Weaker | Stronger |
| Long-horizon credit assignment | Not designed for this | Designed for this |
| Verifiable reasoning tasks | Use GRPO instead | Use GRPO instead |
| Frontier model alignment | Ceiling may be lower | Higher ceiling |

DPO applies a single preference label to an entire completion. If a response has one incorrect reasoning step buried in an otherwise good answer, DPO's loss treats the full response as "chosen" or "rejected." PPO's per-token advantage estimates (via GAE) can credit specific tokens for specific reward increments, which makes it better for tasks where credit assignment across a long sequence matters.

For tasks with verifiable rewards (math, code, structured output), GRPO eliminates the critic and is a better alternative to PPO. GRPO keeps online rollout generation for fresh exploration but replaces the value network with group-relative advantage normalization within the rollout batch.
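Group-relative advantage normalization is simple enough to show in full. The sketch below assumes several completions were sampled for the same prompt and each received one scalar reward. It uses the population standard deviation and a small `eps` for numerical safety, both of which are choices made for this example, since implementations differ on those details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored pass/fail by a unit-test checker.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Passing completions get a positive advantage and failing ones a negative advantage, with the group mean standing in for the baseline a critic would otherwise estimate.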

GPU Cloud Cost Comparison: 7B Model End-to-End

Using live Spheron H100 SXM5 pricing ($1.49/hr spot, $4.06/hr on-demand at time of writing):

DPO run (7B model, 50K preference pairs):

| Stage | GPUs | Instance Type | Hours | Cost |
|---|---|---|---|---|
| DPO training | 2x H100 SXM5 | Spot | 8 | $23.84 |
| Evaluation | 1x H100 SXM5 | Spot | 1 | $1.49 |
| Total DPO | | | ~9 hr | ~$25 |

PPO run (7B model, including reward model training):

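| Stage | GPUs | Instance Type | Hours | Cost |
|---|---|---|---|---|
| Reward model training | 2x H100 SXM5 | Spot | 12 | $35.76 |
| PPO loop (actor + critic) | 4x H100 SXM5 | On-demand | 24 | ~$390 |
| Evaluation | 1x H100 SXM5 | Spot | 1 | $1.49 |
| Total PPO | | | ~37 hr | ~$427 |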

PPO costs roughly 17x more than DPO end-to-end for the same 7B model and dataset. Most of that difference comes from the PPO loop itself: four GPUs at on-demand rates for 24 hours account for about $390 of the $427 total, while reward model training adds about $36.
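The totals in the two tables can be reproduced with a few lines of arithmetic, which also makes it easy to substitute your own rates and durations. The rates are the figures quoted in this section, and the helper name `stage_cost` is illustrative.

```python
SPOT = 1.49        # $/GPU-hour, H100 SXM5 spot (figure quoted above)
ON_DEMAND = 4.06   # $/GPU-hour, H100 SXM5 on-demand (figure quoted above)


def stage_cost(gpus: int, hours: float, rate: float) -> float:
    """Cost of one stage: GPU count x hours x hourly rate per GPU."""
    return gpus * hours * rate


dpo_total = stage_cost(2, 8, SPOT) + stage_cost(1, 1, SPOT)
ppo_total = (stage_cost(2, 12, SPOT)          # reward model training
             + stage_cost(4, 24, ON_DEMAND)   # PPO loop (actor + critic)
             + stage_cost(1, 1, SPOT))        # evaluation

print(f"DPO ${dpo_total:.2f}, PPO ${ppo_total:.2f}, "
      f"ratio {ppo_total / dpo_total:.1f}x")
```

With these inputs the script reports $25.33 for DPO and $427.01 for PPO, a ratio of about 16.9x.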

Spheron H100 SXM5 instances are spot-eligible for DPO and for PPO's rollout workers. For PPO's actor and critic trainers, use on-demand to avoid checkpoint loss on preemption. H200 GPU capacity on Spheron reduces GPU count for larger models where H100's 80 GB per card forces aggressive sharding.

Pricing fluctuates based on GPU availability. The prices above are as of 07 Jun 2026 and may have changed. Check current GPU pricing for live rates.

When to Choose DPO

Use DPO when:

  • You have or can collect matched preference pairs (prompt, chosen response, rejected response)
  • Your task is single-turn or short-context (instruction following, style, tone, format, safety)
  • You want to minimize infrastructure complexity and avoid training a separate reward model
  • Your budget makes PPO's reward model training phase prohibitive
  • You are aligning a 7B-30B model for style, format, or instruction compliance
  • Your team is already running SFT with TRL or Axolotl and wants to stay on the same stack

DPO's default TRL configuration works for most of these cases with minimal tuning. The main knob is beta (0.05-0.2). Start at 0.1.

When to Choose PPO

Use PPO when:

  • You need a trained reward model for subjective criteria that cannot be expressed as a deterministic function (general helpfulness, nuanced safety, conversational quality)
  • Multi-turn conversation quality is the primary target and you need long-horizon credit assignment
  • You are building frontier-scale alignment where DPO's offline learning ceiling is insufficient
  • Your team has infrastructure experience with multi-node rollout disaggregation (verl, OpenRLHF)
  • You have already exhausted DPO and GRPO and observed a clear quality ceiling

The infrastructure overhead is substantial. A 70B PPO run is a 48-72 hour multi-cluster job with four model copies, separate rollout workers, and active monitoring requirements. Be certain DPO cannot solve the task before committing to that overhead.

Hybrid Strategies: IPO, KTO, ORPO

Several algorithms sit between DPO and PPO. All of them reduce infrastructure complexity versus PPO while addressing specific limitations of DPO.

IPO (Identity Preference Optimization): Fixes DPO's tendency to overfit on preference pairs by adding a regularization term that keeps the margin from growing without bound. Drop-in replacement for DPO in TRL. Use it when your training reward margin saturates early but validation win-rate has not improved.
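The difference from DPO is easiest to see in the per-pair loss. The sketch below follows the squared-error form of the IPO objective, where the log-ratio margin is pulled toward a fixed target of 1/(2β) rather than pushed up indefinitely. The function and argument names are illustrative, and library implementations may additionally length-normalize the log-probabilities.

```python
def ipo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Squared distance between the log-ratio margin and its target 1/(2β)."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    target = 1.0 / (2.0 * beta)
    return (margin - target) ** 2


at_target = ipo_loss(-5.0, -10.0, -8.0, -8.0)    # margin 5 equals the target
overshoot = ipo_loss(-5.0, -11.0, -10.0, -8.0)   # margin 8 exceeds the target
```

A margin that overshoots the target is penalized just like one that falls short, which is what stops the policy from separating chosen and rejected responses without limit.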

KTO (Kahneman-Tversky Optimization): Designed for unpaired binary feedback (thumbs up or thumbs down per completion, not matched chosen/rejected pairs). Same GPU footprint as DPO. Use KTO when you have binary labels across individual responses without a paired alternative. KTO uses prospect-theory weighting to treat gains and losses asymmetrically, matching human feedback behavior more closely than symmetric preference optimization.
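The data-format difference is worth making concrete. This illustrative helper splits matched pairs into the per-completion binary records that unpaired methods consume. The field names `prompt`, `completion`, and `label` are an assumption here, chosen to resemble the unpaired preference format commonly used with KTO trainers, so check your framework's expected schema.

```python
from typing import Dict, List


def pairs_to_unpaired(pairs: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Turn (prompt, chosen, rejected) pairs into thumbs-up/down records."""
    records: List[Dict[str, object]] = []
    for pair in pairs:
        records.append({"prompt": pair["prompt"],
                        "completion": pair["chosen"], "label": True})
        records.append({"prompt": pair["prompt"],
                        "completion": pair["rejected"], "label": False})
    return records


example = pairs_to_unpaired([
    {"prompt": "Summarize the report.", "chosen": "Short summary.",
     "rejected": "A rambling answer."},
])
```

Production thumbs-up/down logs already arrive in the unpaired shape, so no matching of alternatives per prompt is needed before training.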

ORPO (Odds-Ratio Preference Optimization): Removes the reference model entirely by incorporating the preference signal directly into the SFT loss. Single-model training with lower VRAM than DPO. Suitable when you do not have a strong SFT checkpoint to anchor against, or when you want to combine SFT and alignment in a single pass.

GRPO (Group Relative Policy Optimization): Keeps online rollout generation for fresh exploration but eliminates the critic entirely by using group-relative advantage normalization within the rollout batch. Lower memory than PPO and well-suited for tasks with verifiable rewards (math, code, structured output). Not a DPO substitute for tasks without a scalar reward function.

See the axolotl vs unsloth vs torchtune comparison for the training framework layer that runs on top of these algorithms.

Code Starters

DPO with TRL on Spheron (2x H100):

python
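import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

# Policy to be trained; the model is loaded separately because DPOConfig
# does not take a model path.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder path: a JSONL file with "prompt", "chosen", "rejected" fields.
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

config = DPOConfig(
    beta=0.1,                              # KL penalty; tune between 0.05-0.2
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
    gradient_checkpointing=True,
    save_steps=100,
    output_dir="./dpo-checkpoint",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,          # TRL creates frozen reference copy automatically
    args=config,
    train_dataset=preference_dataset,
    processing_class=tokenizer,   # needs trl>=0.12; older releases take tokenizer=
)
trainer.train()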

PPO reward model training with TRL:

python
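from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "mistralai/Mistral-7B-v0.3"

# Reward model: the base LM with a single-logit scoring head.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Placeholder path: replace with your own preference data.
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

config = RewardConfig(
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    bf16=True,
    gradient_checkpointing=True,
    output_dir="./reward-model",
)

trainer = RewardTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,   # (prompt, chosen, rejected) triplets
    processing_class=tokenizer,         # needs trl>=0.12; older releases take tokenizer=
)
trainer.train()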

PPO loop with TRL PPOTrainer (7B-13B scale):

python
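# NOTE: this block targets the legacy TRL PPO API (trl<=0.11). Newer TRL
# releases changed the PPOTrainer and PPOConfig interfaces, so check the
# installed version before reusing these arguments.
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

config = PPOConfig(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=8,
    gradient_accumulation_steps=4,
    optimize_cuda_cache=True,
    early_stopping=True,
    target_kl=0.1,
    kl_penalty="kl",
    init_kl_coef=0.2,    # monitor kl_div; raise if > 0.1 without reward gain
    adap_kl_ctrl=True,
)

# PPO needs value head on top of policy
policy = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# ref_model=None lets TRL build the frozen reference copy from the policy.
ppo_trainer = PPOTrainer(config=config, model=policy, ref_model=None, tokenizer=tokenizer)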

For verl configuration for 70B+ PPO runs, see the full verl YAML config in the RLHF training infrastructure guide.

Decision Flowchart

Do you have preference pairs (chosen/rejected)?
  NO  -> Does your feedback come as binary thumbs up/down?
           YES -> KTO
           NO  -> Can you define a verifiable reward function (code test, math checker)?
                    YES -> GRPO
                    NO  -> Collect preference pairs first, then return to DPO
  YES -> Is your task multi-turn or does it require long-horizon credit assignment?
           YES -> Does your team have multi-node RLHF infrastructure?
                    YES -> PPO + trained reward model
                    NO  -> Start with DPO; revisit PPO if quality ceiling is hit
           NO  -> DPO (or IPO for more conservative regularization)
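The same decision tree can be encoded as a function, which is convenient for documenting the choice in a project's tooling. This is a direct transcription of the flowchart above. The function and parameter names are illustrative.

```python
def choose_algorithm(has_preference_pairs: bool,
                     has_binary_feedback: bool = False,
                     has_verifiable_reward: bool = False,
                     needs_long_horizon_credit: bool = False,
                     has_multinode_rlhf_infra: bool = False) -> str:
    """Walk the decision flowchart and return the recommended starting point."""
    if not has_preference_pairs:
        if has_binary_feedback:
            return "KTO"
        if has_verifiable_reward:
            return "GRPO"
        return "Collect preference pairs first, then return to DPO"
    if needs_long_horizon_credit:
        if has_multinode_rlhf_infra:
            return "PPO + trained reward model"
        return "Start with DPO; revisit PPO if quality ceiling is hit"
    return "DPO (or IPO for more conservative regularization)"


recommendation = choose_algorithm(has_preference_pairs=True)
```

The branch order mirrors the flowchart: the available feedback format is checked before task complexity or infrastructure.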

Both DPO and PPO run well on Spheron's H100 and H200 fleet. DPO's lighter 2-model footprint fits comfortably on spot instances, while PPO's 4-model setup has access to the VRAM headroom it needs on SXM5 nodes.

Quick Setup Guide

  1. Audit your feedback data format

    Before choosing an algorithm, assess what signal you have. Matched preference pairs (prompt + chosen + rejected) point to DPO, IPO, or SimPO. Binary labels without pairs (thumbs up/down) point to KTO. A verifiable scalar reward (unit test pass, math checker) points to GRPO. Subjective multi-criteria feedback with a trained reward model points to PPO. The data format determines the algorithm, not the reverse.

  2. Estimate VRAM requirements for each algorithm

    Calculate VRAM for DPO: (policy_params * 2 bytes) + (reference_params * 2 bytes) + (trainable_params * 12 bytes for AdamW). Calculate VRAM for PPO: (actor_params * 14 bytes) + (critic_params * 14 bytes) + (reference_params * 2 bytes) + (reward_params * 2 bytes), then divide each by GPU count under FSDP/ZeRO-3. Add 15-20% activation headroom. For a 7B model: DPO needs ~112 GB (2x H100); PPO needs ~224 GB+ (4x H100 minimum).

  3. Set up DPO with TRL on Spheron

    Provision 2-8x H100 or H200 instances on Spheron. Install a TRL release that matches the code you run: the processing_class argument used in the DPO code starter needs trl>=0.12, while older releases take tokenizer= instead. Configure DPOTrainer with beta=0.1, learning_rate=5e-7, per_device_train_batch_size=1, gradient_accumulation_steps=4-8. For multi-GPU, use accelerate launch with a ZeRO-3 DeepSpeed config. The reference model defaults to a frozen copy of the policy at initialization; no separate training phase is needed.

  4. Set up PPO with verl or TRL on Spheron

    For PPO, first train a reward model on your preference dataset using TRL's RewardTrainer (2x H100 for a 7B reward model, matching the cost table above). Then configure the PPO loop: use verl for 70B+ runs (HybridEngine avoids a second GPU allocation for rollout), TRL PPOTrainer for 7B-30B (lower setup friction). Set init_kl_coef=0.2, monitor kl_div each step, and raise the coefficient if KL climbs above 0.1 without reward improvement.

  5. Run a cost-per-alignment-run comparison

    Benchmark total cost before committing to either algorithm. For DPO, track: (hours to convergence) x (GPU count) x (hourly rate). For PPO, add the reward model training phase. On Spheron H100 SXM5 spot, DPO on a 7B model typically costs $20-50 end-to-end. The equivalent PPO run (including reward model training) costs roughly $150-430, depending on reward model size, PPO loop length, and whether the actor and critic run on spot or on-demand nodes; the worked 7B example in this guide comes to ~$427 with on-demand trainers. Use Spheron spot instances for DPO and for the rollout workers in PPO.

Frequently Asked Questions

PPO (Proximal Policy Optimization) runs four model copies simultaneously: actor (the policy being trained), critic (value estimator), reference (frozen SFT checkpoint), and reward (trained scorer). DPO (Direct Preference Optimization) derives a closed-form loss directly from preference pairs, eliminating the reward model and critic entirely. DPO uses two model copies instead of four, reducing VRAM by 40-60% and removing the reward model training phase. PPO is the right choice when you need a learned, general-purpose reward signal for complex or multi-turn outputs. DPO is the right choice for style, tone, instruction-following, and safety alignment where preference pairs can be collected.

For a 7B model, DPO holds two BF16 model copies plus AdamW optimizer state: roughly 14 GB (policy weights) + 14 GB (reference weights) + 84 GB (optimizer states) = ~112 GB total, fitting on 2x H100 80GB GPUs with FSDP. Full PPO holds four model copies: actor (~98 GB with optimizer), critic (~98 GB with optimizer), reference (~14 GB inference-only), reward (~14 GB inference-only) = ~224 GB+, requiring at minimum 4x H100 80GB GPUs or 2x H200 141GB GPUs.

Use DPO when you have or can collect preference pairs (chosen/rejected completions), your task is single-turn (instruction following, style, tone, safety), you want the lowest possible GPU cost and training complexity, and you do not need fine-grained per-token credit assignment. The majority of production alignment tasks in 2026 - style tuning, format compliance, safety - fit DPO well. If you also want to evaluate GRPO (which drops the critic but keeps online generation), see the comparison in the GRPO fine-tuning guide.

PPO remains the right choice for frontier model alignment where you need a trained reward model that scores outputs on criteria too subjective or complex to express as a verifiable function (GRPO's requirement) but also too complex to capture purely from offline preference pairs (DPO's requirement). Specific cases: multi-turn conversation alignment with long-horizon credit assignment, safety alignment at scale where a reward model trained on diverse human feedback is the signal, and tasks where your preference dataset does not cover the distribution your policy explores online.

Several algorithms occupy the space between DPO and PPO. IPO (Identity Preference Optimization) fixes DPO's tendency to overfit on preference pairs by regularizing toward the reference model more aggressively. KTO works with unpaired binary feedback (thumbs-up/thumbs-down) rather than matched chosen/rejected pairs. ORPO (Odds-Ratio Preference Optimization) eliminates the reference model entirely by incorporating the preference signal directly into the SFT loss. GRPO keeps online rollout generation but eliminates the critic. The right hybrid depends on your feedback data format and whether you need online generation.
