Engineering

DPO Fine-Tuning on GPU Cloud: Direct Preference Optimization Training Guide (2026)

Written by Mitrasish, Co-founder · Apr 30, 2026
Tags: DPO Fine-Tuning, Direct Preference Optimization, RLHF, LLM Alignment, DPO vs GRPO, KTO, LoRA DPO, TRL, LLM Training, GPU Cloud, H200, B200

DPO is the standard method for preference alignment in 2026: style, tone, instruction-following, and safety tuning. If you need the supervised fine-tuning baseline first, read the SFT fine-tuning guide before continuing here. If your goal is teaching the model to reason through problems with verifiable answers, read the GRPO fine-tuning guide instead.

This post covers what TRL docs and Hugging Face tutorials skip: the GPU memory math for holding two model copies, the three reference model strategies and when to use each, multi-GPU FSDP and ZeRO-3 configuration for 70B runs, and a real cost comparison against GRPO and PPO.

DPO vs GRPO vs PPO vs KTO

Pick the method first, then size your hardware.

| Method | Needs Rollout Generation? | Needs Critic/Value Model? | Best For | GPU Cost Tier |
|---|---|---|---|---|
| DPO | No | No | Style, format, instruction-following, safety | Lowest |
| KTO | No | No | Unpaired feedback (binary thumbs up/down) | Lowest |
| GRPO | Yes | No | Verifiable reasoning (math, code) | Medium |
| PPO | Yes | Yes (32B critic for 32B policy) | Fine-grained credit assignment | Highest |

DPO is cheapest because it uses a static offline dataset with no generation cost. You do one forward pass per batch through the policy and reference model, compute the DPO loss, and update the policy weights. No sampling, no rollout buffers, no reward model in the training loop.
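
For intuition, here is a minimal sketch of that loss computed from summed per-sequence log-probs. TRL implements the same objective internally; the function and argument names here are illustrative:

python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-prob ratios of policy vs. reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO objective: push the chosen-minus-rejected reward margin up
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.mean(), rejected_rewards.mean()

The two reward terms are what TRL logs as train/rewards/chosen and train/rewards/rejected (see the monitoring section below).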

KTO (Kahneman-Tversky Optimization) is worth knowing if you only have thumbs-up/thumbs-down signals rather than paired chosen/rejected completions. It applies a prospect-theory weighting function that treats gains and losses asymmetrically, matching how humans actually give feedback. If you have binary labels without matched pairs, KTO is the right DPO variant to reach for.
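
Switching is mostly a config change: TRL ships a KTOTrainer that consumes unpaired examples with prompt, completion, and a boolean label field. A minimal sketch, assuming a dataset already in that format and the same placeholder models used elsewhere in this post:

python
from trl import KTOTrainer, KTOConfig

# Unpaired format: {"prompt": ..., "completion": ..., "label": True/False}
config = KTOConfig(
    beta=0.1,
    desirable_weight=1.0,      # raise this if thumbs-up examples are scarce
    undesirable_weight=1.0,
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = KTOTrainer(
    model=policy_model,
    ref_model=reference_model,
    args=config,
    train_dataset=unpaired_dataset,
    tokenizer=tokenizer,
)
trainer.train()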

GRPO generates rollouts online, which means you need a functioning reward function that scores model outputs automatically. The reward function can be anything computable: a math solver, a code test suite, a schema validator. But if your task doesn't have a verifiable correct answer, GRPO has nothing to train on. DPO covers that gap: preference pairs can be collected for any task, including subjective ones like writing quality or tone. For the full reasoning-RL treatment, see the GRPO fine-tuning guide.

PPO adds a critic model that must be the same size as the policy. For a 32B policy, you maintain a 32B critic trained simultaneously. The critic learns to estimate the value of arbitrary token sequences, which is genuinely hard and frequently diverges on LLMs. In 2026, PPO is rarely the right choice unless you need per-token credit assignment across very long sequences.

GPU Memory Math for DPO

The core DPO memory equation is different from SFT because you always hold two model copies:

DPO_VRAM = policy_weights_bf16          # trainable model
          + reference_weights_bf16      # frozen copy, same dtype as policy by default
          + optimizer_state             # 12 bytes/param for AdamW (fp32 master + 2 moments)
                                        # or 12 bytes/LoRA-param if LoRA-only
          + activation_peak             # ~15% headroom

For a 7B model in bf16: 7B × 2 bytes = 14 GB per copy, so 28 GB baseline before optimizer state. Full AdamW on all parameters adds another 84 GB (12 bytes × 7B params). That's why DPO of 7B with full-parameter training needs over 100 GB, and LoRA-only DPO is the practical path for single-GPU runs.
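
A back-of-envelope helper for the same arithmetic (a rough estimate only; real peak usage depends on sequence length, batch size, and implementation details, and PEFT-based DPO can share the base weights with the reference, dropping one weight copy):

python
def dpo_vram_gb(params_b, trainable_b=None):
    """Rough bf16 DPO VRAM estimate in GB, mirroring the equation above."""
    weights = 2 * params_b * 2                  # policy + reference, 2 bytes/param
    optimizer = (trainable_b or params_b) * 12  # AdamW: fp32 master + 2 moments
    return (weights + optimizer) * 1.15         # ~15% activation headroom

print(dpo_vram_gb(7))    # ~129 GB: full-param 7B, matching "over 100 GB" above
print(dpo_vram_gb(32))   # ~589 GB: full-param 32B, past the 512 GB mark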

Practical sizing table:

| Model | Method | Config | Min VRAM | Recommended GPU |
|---|---|---|---|---|
| 7B | Full-param DPO | bf16 | 112 GB+ | H200 SXM5 (single) |
| 7B | LoRA DPO | r=64 | 18-22 GB | A100 40G or L40S |
| 32B | Full-param DPO | bf16, ZeRO-3 | 512 GB+ | 4× H200 SXM5 (ZeRO-3) |
| 32B | LoRA DPO | r=64 | 50-60 GB | H100 SXM5 |
| 70B | Full-param DPO | bf16, ZeRO-3 | 1,120 GB+ | 8× H200 SXM5 (ZeRO-3) |
| 70B | LoRA DPO | r=64, bf16, ref offload | 140-160 GB | B200 (1x) or 2×H200 |

For the full model size breakdown across parameter counts and quantization formats, see GPU memory requirements for LLMs.

Full-param DPO of 7B needs 112 GB+ in total: 28 GB for the two bf16 weight copies plus 84 GB for the AdamW optimizer state (12 bytes per parameter). H100 PCIe at 80 GB is fundamentally insufficient for this, not just tight for activations. It cannot hold even the optimizer state alone. H200 SXM5 at 141 GB is the minimum single-GPU option for 7B full-param. For 32B full-param, the optimizer state alone comes to 384 GB (32B × 12 bytes), pushing the full requirement past 512 GB. No single GPU covers this. Use four or more H200 GPUs with ZeRO-3 to shard the optimizer state across cards.

Reference Model Handling: Three Strategies

TRL supports three distinct approaches to the frozen reference model. Each trades VRAM against throughput.

Strategy A: Frozen Full Copy on GPU

Default behavior. Both the policy and the reference model live in GPU VRAM. Log-probabilities for the DPO loss are computed on-device with no transfer overhead.

Use this when you have enough VRAM to hold both copies. Fastest per-step performance. For a 7B LoRA DPO run on an A100 80G, this is fine: the bf16 policy plus a same-size frozen reference take about 28 GB together, leaving comfortable headroom within 80 GB.

python
from trl import DPOTrainer, DPOConfig
import torch

config = DPOConfig(
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    warmup_ratio=0.1,
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
)

trainer = DPOTrainer(
    model=policy_model,           # trainable
    ref_model=reference_model,    # frozen, on GPU
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

Strategy B: Reference Model CPU Offload

The reference model lives on CPU. For each batch, reference log-probs are computed on CPU and transferred to GPU for the DPO loss calculation. This frees the reference model's share of GPU VRAM, roughly halving model memory on the GPU, but the overhead depends on model size.

For models up to ~13B, the CPU forward pass is fast enough relative to GPU throughput that per-step latency increases by roughly 10-20%. For 70B models, a single CPU forward pass takes tens of seconds on even a 256-core server, versus 0.2-0.5 s on GPU. That is a 10-100x slowdown per step, not the small percentage overhead that PCIe log-prob transfer alone would suggest. CPU reference offload is impractical for 70B models in production. If you need to fit a 70B model in constrained VRAM, use Strategy C (precomputed log-probs) instead.

Use Strategy B when your VRAM fits the policy model but not two full copies, and your model is 13B or smaller.

python
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM
import torch

# Load reference model explicitly on CPU to free GPU VRAM
reference_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

config = DPOConfig(
    beta=0.1,
    learning_rate=2e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    warmup_ratio=0.1,
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
)

trainer = DPOTrainer(
    model=policy_model,
    ref_model=reference_model,   # already on CPU, GPU VRAM freed
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

Strategy C: Precomputed Reference Log-Probs

Run the reference model once over the entire dataset before training starts, save the log-probs to disk as a column in your dataset, and load them during training. No reference model in GPU memory during the training loop at all.

Use this when your dataset is fixed and preprocessed. It maximizes training throughput because every step is just a policy forward pass plus the cached log-probs. The limitation: you cannot use this with dynamic data augmentation or online data pipelines.

python
from trl import DPOTrainer, DPOConfig
from datasets import load_from_disk

# Step 1: create a trainer with the reference model and let TRL precompute
# log-probs over the dataset. Note: the exact mechanics are version-dependent;
# in recent TRL releases, precompute_ref_log_probs=True triggers a one-off
# reference pass when the train dataloader is first built, caching the
# log-probs as dataset columns.
precompute_config = DPOConfig(beta=0.1, precompute_ref_log_probs=True)
precompute_trainer = DPOTrainer(
    model=policy_model,
    ref_model=reference_model,
    args=precompute_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
precompute_trainer.get_train_dataloader()  # runs the one-off reference pass
precompute_trainer.train_dataset.save_to_disk("./dataset_with_reference_logps")

# Step 2: train using precomputed log-probs, no reference model in memory
dataset_cached = load_from_disk("./dataset_with_reference_logps")

config = DPOConfig(
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    precompute_ref_log_probs=True,   # skip reference model during training
)

trainer = DPOTrainer(
    model=policy_model,
    ref_model=None,              # no reference model at training time
    args=config,
    train_dataset=dataset_cached,
    tokenizer=tokenizer,
)

Multi-GPU DPO with FSDP and DeepSpeed ZeRO-3

Single-GPU DPO runs out of room at different points depending on the strategy above. Once you exceed 141 GB (H200) or 192 GB (B200), you need to shard across GPUs. Two approaches apply here.

FSDP (PyTorch native)

Best for LoRA DPO where only a small fraction of parameters are trainable. With FULL_SHARD, FSDP shards the entire model (including the frozen reference) across GPUs. Since the optimizer state for LoRA is small, the per-GPU memory drops sharply with more GPUs.

A 70B LoRA DPO run that needs ~145 GB on a single node (bf16 policy weights plus LoRA state, with reference on CPU) drops to under 45 GB per GPU across four H200 GPUs with FSDP full_shard. FSDP communication overhead is lower than ZeRO-3 for LoRA workloads because the optimizer state sharding isn't the bottleneck.

DeepSpeed ZeRO-3

Best for full-parameter DPO on 32B-70B models. ZeRO-3 shards model parameters, gradients, and all three optimizer state components across GPUs. A 70B full-parameter DPO run requires 280+ GB of model weights alone, which exceeds any single GPU's VRAM. ZeRO-3 across 8x H200 GPUs on a single node brings per-GPU memory down to ~140 GB (~35 GB for sharded weights plus ~105 GB for sharded optimizer state).

ZeRO-3 config for full-parameter 70B DPO (zero3_dpo.yaml):

yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: "no"
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true

Launch command:

bash
accelerate launch --config_file zero3_dpo.yaml train_dpo.py

For networking tradeoffs when running Spheron multi-node jobs without InfiniBand, see multi-node GPU training without InfiniBand for a cost analysis comparing TCP vs RDMA configurations.

Dataset Prep and Preference Pair Generation

Every DPO example needs three fields: prompt, chosen, and rejected. The three sourcing paths differ in cost, quality, and setup time.

Path 1: Existing Open Dataset

Fastest path to a working DPO run. Good starting points:

  • Anthropic HH-RLHF: Helpfulness and harmlessness preference pairs from human annotators
  • UltraFeedback: GPT-4 scored preference data across diverse tasks
  • Orca-DPO-Pairs: Teacher-student DPO pairs from Orca-style prompts
  • Capybara: Diverse multi-turn preference data

Before training, validate data quality: check that chosen and rejected lengths are balanced. A dataset where chosen responses are consistently longer than rejected teaches the model to be verbose, not genuinely better. Filter pairs where the length ratio exceeds 2:1.
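
A quick filter sketch using datasets; the field names assume the standard prompt/chosen/rejected schema, so adjust for your dataset:

python
from datasets import load_dataset

ds = load_dataset("your-preference-dataset", split="train")

def length_balanced(example, max_ratio=2.0):
    # Drop pairs where one completion is more than 2x longer than the other
    lc, lr = len(example["chosen"]), len(example["rejected"])
    return max(lc, lr) / max(min(lc, lr), 1) <= max_ratio

ds = ds.filter(length_balanced)
print(f"kept {len(ds)} pairs after length filtering")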

Path 2: Sample and Score with a Reward Model

Run your SFT checkpoint at temperature=0.8, generate two completions per prompt, score both with a reward model (ArmoRM, Skywork-Reward-Llama-3.1-8B, or similar), and label the higher-scoring completion as chosen. This is the standard pipeline for domain-specific DPO where no public dataset covers your task.
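
A condensed sketch of that pipeline. The SFT checkpoint name is a placeholder, and the reward model's chat-template scoring interface is an assumption; check the RM's model card for its exact usage:

python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

gen_tok = AutoTokenizer.from_pretrained("your-sft-checkpoint")
gen = AutoModelForCausalLM.from_pretrained(
    "your-sft-checkpoint", torch_dtype=torch.bfloat16, device_map="auto")

rm_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
rm_tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(
    rm_name, torch_dtype=torch.bfloat16, device_map="auto")

def score(prompt, response):
    # Score a chat-formatted (prompt, response) pair with the reward model
    chat = rm_tok.apply_chat_template(
        [{"role": "user", "content": prompt},
         {"role": "assistant", "content": response}],
        tokenize=True, return_tensors="pt").to(rm.device)
    with torch.no_grad():
        return rm(chat).logits[0].item()

def make_pair(prompt):
    # Two samples at temperature 0.8; the higher-scoring one becomes "chosen"
    inputs = gen_tok(prompt, return_tensors="pt").to(gen.device)
    outs = gen.generate(**inputs, do_sample=True, temperature=0.8,
                        max_new_tokens=512, num_return_sequences=2)
    a, b = (gen_tok.decode(o[inputs.input_ids.shape[1]:],
                           skip_special_tokens=True) for o in outs)
    if score(prompt, a) >= score(prompt, b):
        return {"prompt": prompt, "chosen": a, "rejected": b}
    return {"prompt": prompt, "chosen": b, "rejected": a}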

The reward model quality here matters a lot. A reward model trained on generic helpfulness preferences will give you generic helpfulness alignment. If you need domain-specific style or accuracy, you either need a domain-tuned RM or move to Path 3.

Path 3: LLM-as-Judge for Pairwise Preferences

Generate two completions per prompt, then ask a strong judge model (Llama 4 Maverick, Qwen3-235B) which completion is better and why. This provides richer preference signal than a scalar reward model and handles nuanced style preferences that RMs miss. The tradeoff is cost and speed: an LLM judge is 10-50x slower per example than a reward model.
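
A minimal judge-call sketch. The endpoint, model name, and prompt format are all illustrative; any OpenAI-compatible endpoint serving your judge model works the same way. Randomizing A/B order per example guards against the position bias LLM judges are known for:

python
import random
from openai import OpenAI

# Assumption: the judge model sits behind an OpenAI-compatible endpoint
client = OpenAI(base_url="http://your-judge-endpoint/v1", api_key="none")

JUDGE_PROMPT = (
    "Which response better answers the prompt? Reply with only A or B.\n\n"
    "Prompt: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}"
)

def judge(prompt, candidate, baseline):
    # Randomize presentation order to cancel position bias, then map back
    flip = random.random() < 0.5
    a, b = (baseline, candidate) if flip else (candidate, baseline)
    verdict = client.chat.completions.create(
        model="your-judge-model",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(prompt=prompt, a=a, b=b)}],
        max_tokens=2, temperature=0.0,
    ).choices[0].message.content.strip()
    return verdict.startswith("B") if flip else verdict.startswith("A")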

For dataset preprocessing patterns shared across DPO, SFT, and RL pipelines, see axolotl-vs-unsloth-vs-torchtune for a framework comparison.

Critical data quality checks before training:

  • Remove near-identical pairs (cosine similarity > 0.95 between chosen and rejected); see the sketch after this list
  • Confirm chosen is genuinely better, not just longer
  • Keep a held-out validation split (10-15%) for win-rate evaluation during training
  • Check for distribution shift: if your preference data came from a different model family than your base, the reference log-probs can be miscalibrated
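
For the near-identical check in the first bullet, a sketch using sentence-transformers; the embedding model choice is illustrative:

python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

ds = load_dataset("your-preference-dataset", split="train")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def not_near_identical(example, threshold=0.95):
    # Cosine similarity via normalized embeddings; drop pairs above threshold
    emb = embedder.encode([example["chosen"], example["rejected"]],
                          normalize_embeddings=True)
    return float(emb[0] @ emb[1]) <= threshold

ds = ds.filter(not_near_identical)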

Full Recipe: DPO-Tuning a 70B Model on 8x H200 with TRL and Axolotl

Step 1: Provision 8x H200 on Spheron

For 70B LoRA DPO, provision a single B200 (192 GB) or at least 2x H200; rent B200 GPUs on Spheron if the 192 GB per-card headroom suits your configuration. For full-parameter DPO, provision at least 8x H200. For the single-node 8x H200 full-parameter recipe here, log into app.spheron.ai, rent an 8x H200 SXM5 on-demand node, and SSH in:

bash
# Verify your GPUs
nvidia-smi
# Expected: 8x H200 SXM5, 141 GB each

Use on-demand for a multi-day training run. DPO is spot-eligible (no rollout generation), but if a long job is preempted mid-run, you restart from the last checkpoint and re-pay for everything trained since it. For runs under 6 hours, spot is the better call.

Step 2: Install Dependencies

bash
pip install "trl>=0.8" transformers accelerate peft deepspeed datasets
# flash-attn backs attn_implementation="flash_attention_2" used below
pip install flash-attn --no-build-isolation

Step 3: DPO with TRL DPOTrainer (LoRA, Single-Node 8x H200)

python
import torch
from datasets import load_dataset
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    # No device_map="auto" here: parameter placement is handled by the
    # accelerate launch config (FSDP or ZeRO-3), and device_map conflicts
    # with distributed training launches.
)

# LoRA config for 70B - r=64 is aggressive but appropriate for strong alignment
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("your-preference-dataset")

config = DPOConfig(
    output_dir="./dpo-llama-70b",
    beta=0.05,                         # lower beta for 70B, more room to deviate
    learning_rate=2e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # effective batch size = 8 * 8 GPUs = 64
    warmup_ratio=0.1,
    num_train_epochs=1,
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
    save_steps=100,
    logging_steps=10,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,              # with peft_config, TRL disables the adapter to get reference log-probs; no second 70B copy
    peft_config=lora_config,
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./dpo-llama-70b-final")

Step 4: DPO with Axolotl (YAML Config)

Axolotl has native DPO support via rl: dpo. This is the recommended path for teams already using Axolotl for SFT since you get the same YAML-based configuration, the same dataset preprocessing, and the same multi-GPU launch commands. For the full Axolotl vs. TRL framework comparison, see axolotl-vs-unsloth-vs-torchtune.

dpo_70b.yaml:

yaml
base_model: meta-llama/Llama-3.1-70B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

# DPO configuration
rl: dpo
dpo_beta: 0.05

datasets:
  - path: your-preference-dataset
    type: chatml.intel

dataset_prepared_path: ./prepared_dpo_data
val_set_size: 0.1

# LoRA
adapter: lora
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Training
sequence_len: 2048
sample_packing: false   # must be false for DPO
bf16: true
gradient_checkpointing: true
flash_attention: true
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-6
warmup_ratio: 0.1

# Checkpointing
saves_per_epoch: 4
output_dir: ./dpo-llama-70b-axolotl
logging_steps: 10

Step 5: Launch with Accelerate + ZeRO-3

bash
accelerate launch --config_file zero3_dpo.yaml -m axolotl.cli.train dpo_70b.yaml

For LoRA runs on a single 8x H200 node, FSDP is simpler. The command below assumes an FSDP full_shard config already created via accelerate config:

bash
accelerate launch --num_processes=8 --mixed_precision=bf16 train_dpo.py

Step 6: Monitor Training

TRL logs these metrics natively. Watch all four:

  • train/loss: DPO loss should decrease steadily. Erratic swings usually mean the dataset has inconsistent preference signal or your beta is too low.
  • train/rewards/chosen: should increase over training. This is the average implicit reward on chosen completions.
  • train/rewards/rejected: should decrease or stay flat. If it increases, the model is learning to assign high reward to both, which means beta is too low.
  • train/rewards/margins: chosen minus rejected. This is the most direct metric for DPO progress. A widening margin is what you want. If it plateaus early, check data quality first before tuning hyperparameters.

Hyperparameter Cheat Sheet

| Model Size | beta | Learning Rate | Warmup Steps | Batch Size | Epochs |
|---|---|---|---|---|---|
| 7B full | 0.1 | 5e-7 | 100 | 64 (accum) | 1-3 |
| 7B LoRA r=64 | 0.1 | 1e-5 | 100 | 64 (accum) | 2-4 |
| 32B full | 0.1 | 2e-7 | 100 | 32 (accum) | 1-2 |
| 32B LoRA r=64 | 0.1 | 5e-6 | 100 | 32 (accum) | 2-3 |
| 70B LoRA r=64 | 0.05-0.1 | 2e-6 | 150 | 16 (accum) | 1-2 |

beta is the most important hyperparameter to understand. Higher beta (0.3-0.5) keeps the policy close to the reference, which is conservative: the model changes less per gradient step, and training is stable but slow. Lower beta (0.01-0.05) allows more deviation from the reference, which means faster learning but more risk of divergence if data quality is inconsistent. Start at 0.1 and adjust based on reward margin growth.

For 70B models, start at beta=0.05. The model is already strong and the reference is high quality; you want measured updates, not aggressive deviation.

Evaluation: Win-Rate, Reward Score, Safety Regression

A complete DPO evaluation requires three checks. Watching training loss alone is not sufficient.

Win-Rate vs SFT Base

Sample 200-500 prompts from a held-out set not in training data. Generate one completion each from the DPO-tuned model and the SFT base model. Use an LLM judge or reward model to score pairwise: which completion is better? A well-tuned DPO model should win 60-70% of pairwise comparisons over the SFT base on the same domain.
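
The harness itself is small. A sketch, assuming you already have generation functions for both models and a pairwise judge such as the Path 3 sketch above:

python
def win_rate(prompts, dpo_generate, sft_generate, judge):
    """dpo_generate / sft_generate map a prompt to a completion;
    judge(prompt, candidate, baseline) returns True if the candidate wins."""
    wins = sum(judge(p, dpo_generate(p), sft_generate(p)) for p in prompts)
    return wins / len(prompts)

# 0.60-0.70 on held-out prompts is the healthy range described above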

For the full LLM-as-judge deployment setup on GPU cloud, see LLM-as-judge evaluation pipelines on GPU Cloud.

Win-rate below 55% after two epochs typically means your preference data doesn't cover the evaluation distribution. Win-rate above 80% can mean overfitting to the preference dataset.

Reward Model Scoring

Score DPO model outputs on the same reward model used to generate your training labels. This is a sanity check for reward hacking. If reward scores are very high but win-rate is modest, the model has learned to optimize the RM's scoring patterns rather than underlying quality. Symptoms: outputs are formulaic, use hedging language that RMs tend to score highly, and subjectively feel "assistant-brained."

If you see this pattern, diversify your reward signal: use multiple RMs with different strengths, or switch to Path 3 (LLM-as-judge) for a subset of your data.

Safety Regression

Run a safety classifier (LlamaGuard 3 or similar) on outputs from the DPO-tuned model using adversarial prompts not in the training set. DPO on style and helpfulness preference pairs can accidentally degrade safety alignment if the preference data over-represents assertive, direct responses that border on unsafe.

This is not hypothetical. Teams have shipped DPO models that score well on helpfulness benchmarks and fail basic safety evaluations because their "chosen" completions were assertive in ways that correlated with boundary-pushing. Catch this before deployment.
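
A minimal screening sketch, assuming meta-llama/Llama-Guard-3-8B and its default moderation chat template; adapt to whichever safety classifier you run:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-8B")
guard = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-Guard-3-8B",
    torch_dtype=torch.bfloat16, device_map="auto")

def is_safe(prompt, response):
    # Llama Guard classifies the full exchange and generates "safe"/"unsafe"
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = guard_tok.apply_chat_template(
        chat, return_tensors="pt").to(guard.device)
    out = guard.generate(input_ids=input_ids, max_new_tokens=20,
                         do_sample=False)
    verdict = guard_tok.decode(out[0][input_ids.shape[-1]:],
                               skip_special_tokens=True)
    return verdict.strip().startswith("safe")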

Cost Benchmarks Across H100/H200/B200

For a 70B LoRA DPO run, the most common configuration in 2026:

| GPU | Config | On-Demand/hr | Spot/hr | 24hr Run (on-demand) | 24hr Run (spot) |
|---|---|---|---|---|---|
| H100 SXM5 | 2x (160 GB) | $5.80 | $1.60 | $139.20 | $38.40 |
| H200 SXM5 | 2x (282 GB) | $9.08 | N/A | $217.92 | N/A |
| B200 SXM6 | 1x (192 GB) | $6.73 | $1.71 | $161.52 | $41.04 |

H200 spot pricing is not currently available via the public API; check the pricing page for current spot rates.

Pricing fluctuates based on GPU availability. The prices above are as of 30 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

DPO is fully spot-eligible. Unlike GRPO (where the rollout trainer holds state and preemption is costly) or PPO (which requires coordinating policy and critic checkpoints), DPO is a standard forward-backward training loop. Preemption just means resuming from the last checkpoint. Set save_steps=100 and you lose at most 100 steps of training on preemption.
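
The resume path is one line, via the Hugging Face Trainer API that DPOTrainer inherits:

python
# Picks up model, optimizer, scheduler, and RNG state from the newest
# checkpoint in output_dir after a spot preemption
trainer.train(resume_from_checkpoint=True)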

Compared to GRPO for 32B on a single H200 at $4.54/hr, DPO is roughly 2x cheaper over the same wallclock time because you skip rollout generation entirely. The tradeoff is that DPO cannot reach the same reasoning benchmark improvements that GRPO achieves on math and code tasks.

For the full billing model breakdown on spot vs on-demand, see serverless vs on-demand vs reserved GPU billing.

After DPO, if inference cost is the next concern, model distillation on GPU cloud covers distilling a DPO-tuned 70B into an 8B student to cut serving costs without losing most of the alignment gains.

When to Graduate from DPO to GRPO or PPO

DPO has real limits. Three signals tell you it has hit its ceiling.

1. The task requires reasoning chains not in your training data. DPO can only reinforce strategies already present in chosen completions. If the model cannot solve the problem at SFT baseline, DPO cannot teach it to. You need online generation with a verifiable reward signal. This is exactly what GRPO does. See the GRPO fine-tuning guide for the full setup.

2. Win-rate plateaus with more data. If adding 10x more preference pairs improves win-rate by under 2 percentage points, the model has extracted most of the learnable signal from the preference distribution. GRPO's online generation can continue improving because it generates novel reasoning traces that weren't in the original dataset.

3. You need credit assignment across long sequences. DPO applies a single preference label to an entire completion. For tasks where specific reasoning steps are correct or incorrect, PPO's per-token advantage estimates are more informative. This is expensive and rare. Exhaust DPO and GRPO before going here, and confirm your task genuinely requires per-token signals rather than completion-level preference.


DPO runs are short-burst jobs, typically hours to a few days, that don't justify reserved GPU capacity. Spheron's per-minute on-demand and spot H200 nodes fit this well: pay only for the hours you train, skip the reserved commitment, and scale from single-node LoRA DPO to single-node 8x H200 full-parameter runs without changing providers.

Rent H200 on Spheron → | Rent B200 → | View current GPU pricing →

Start your DPO run →
