Engineering

Beyond LoRA: DoRA, GaLore, PiSSA, and VeRA Fine-Tuning on GPU Cloud (2026 PEFT Decision Guide)

Written by Mitrasish, Co-founder · May 11, 2026
Tags: DoRA Fine-Tuning GPU, GaLore vs LoRA, PiSSA Fine-Tuning, PEFT Methods 2026, Advanced LoRA Alternatives, VeRA Fine-Tuning, LoRA-FA, LLM Fine-Tuning, GPU Cloud, H100

LoRA became the default for good reason. It is fast, cheap, and backed by every major fine-tuning framework. But it has two failure modes that start to matter once you push accuracy. For a 7B model trained on 100K instruction pairs, DoRA closes roughly half the gap between LoRA and full fine-tuning with only 5-10% more VRAM overhead. That is not a reason to abandon LoRA everywhere, but it is a reason to know when each method is the right tool. If you need the LoRA and SFT baseline first, start with the LLM fine-tuning guide before continuing here.

This guide covers DoRA, GaLore, PiSSA, VeRA, MoRA, and LoRA-FA: what each method actually does, when to use it, VRAM tables for 7B through 70B models, and a cost comparison built on live Spheron pricing.

Why LoRA Falls Short

LoRA constrains every weight update to a low-rank matrix. If the true update for a layer has rank 50 and you set r=16, LoRA cannot represent it. The constraint is intentional: it reduces trainable parameters by 99%+ and keeps VRAM under control. But the constraint is also real. Increasing rank above 64 rarely helps in practice because the update subspace is already saturated at lower ranks and you are mostly adding noise.
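
A toy sketch makes the rank ceiling concrete (sizes are arbitrary, chosen only for illustration):

python
import torch

# The best rank-16 approximation of a rank-50 update leaves residual energy
# that LoRA at r=16 can never recover, no matter how long you train.
delta_w = torch.randn(1024, 50) @ torch.randn(50, 1024)  # a "true" rank-50 update
S = torch.linalg.svdvals(delta_w)
unrecovered = S[16:].norm() / S.norm()                    # update energy beyond rank 16
print(f"{unrecovered:.1%} of the update lies outside any rank-16 subspace")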

The second problem is magnitude-direction coupling. When LoRA updates a weight matrix, the direction and magnitude of that update are coupled through the same BA product. Full fine-tuning implicitly decouples these: the optimizer updates each parameter independently, so a weight's direction can shift while its magnitude stays stable, or vice versa. LoRA cannot replicate this because both properties are controlled by the same low-rank decomposition.

The DoRA paper quantified this on NLU benchmarks. On commonsense reasoning tasks (HellaSwag, WinoGrande, ARC-Challenge), LoRA's update pattern deviates significantly from full fine-tuning. On pure language model perplexity, the deviation is much smaller. This is why the accuracy gap shows up on downstream benchmarks before it shows up on your training loss curve.

None of this means LoRA is broken. For most production fine-tuning jobs, LoRA at rank 16-32 is still the correct starting point. The newer methods in this guide address specific failure modes. Start with LoRA, measure, and upgrade if the gap is real.

Method Overview: Quick Decision Matrix

| Method | Updates | VRAM vs LoRA | Accuracy vs Full FT | Convergence | Best For |
|---|---|---|---|---|---|
| LoRA | Adapter only | Baseline | -3-5% | Baseline | Most production fine-tuning |
| DoRA | Adapter + magnitude | +5-10% | -1-2% | Similar | Max quality with adapter economics |
| GaLore | All (projected gradient) | +30-40% vs LoRA | -1-3% | Slower (more steps) | Full-param quality, memory-limited nodes |
| PiSSA | Adapter (SVD init) | Same as LoRA | -2-4% | 30-50% faster | Speed-constrained, fixed GPU budget |
| VeRA | Shared frozen adapter | Minimal | -4-6% | Slower | Extremely VRAM-limited, large models |
| MoRA | High-rank square adapter | Similar | -2-4% | Similar | Factual memorization, sequential data |
| LoRA-FA | Adapter (frozen A) | -15-25% | -3-5% | Similar | Single-GPU 13B-70B under VRAM pressure |

The decision logic is straightforward. If your eval metric is downstream accuracy and you have room for 5-10% more VRAM than LoRA, use DoRA. If time-to-convergence is the constraint on a fixed compute budget, use PiSSA. If you need full-parameter quality on a memory-limited node, GaLore is the only option that delivers it. If VRAM is extremely tight, reach for VeRA or LoRA-FA.

DoRA: Separating Direction from Magnitude

DoRA decomposes the weight matrix into a magnitude component and a directional component, then adapts each independently:

W = m * (W_0 + BA) / ||W_0 + BA||_c

Here m is a learnable magnitude vector (one scalar per output dimension), BA is a standard LoRA rank decomposition, and the denominator normalizes the combined matrix column-wise. The base weights W_0 are frozen; m and BA are trainable.

This matters because direction and magnitude have different sensitivity profiles during training. When you train from a pretrained model, you often need large directional shifts in some layers (adapting attention patterns for a new domain) while keeping magnitudes stable. LoRA cannot do this: any directional change in BA also changes the magnitude of the combined weight, and vice versa. DoRA lets each component adapt at its own effective learning rate.
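
A tiny PyTorch sketch makes the reparameterization concrete (sizes are arbitrary; PEFT's use_dora=True handles all of this internally):

python
import torch

# Minimal sketch of the DoRA reparameterization for one linear layer.
d_out, d_in, r = 64, 64, 16
W0 = torch.randn(d_out, d_in)                    # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01                  # trainable LoRA down-projection
B = torch.zeros(d_out, r)                        # trainable LoRA up-projection (zero init)
m = W0.norm(dim=1, keepdim=True)                 # trainable magnitude, one scalar per output dimension

V = W0 + B @ A                                   # directional component
W_adapted = m * V / V.norm(dim=1, keepdim=True)  # rescale each normalized direction by its learned magnitude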

Typical accuracy gains over LoRA: 1-3% on commonsense NLU benchmarks (HellaSwag, WinoGrande, ARC-Challenge), 1-2% on MT-Bench instruction-following scores. Near-zero gain on pure perplexity tasks where the language modeling objective already guides the update efficiently.

When not to use DoRA: if your evaluation only tracks loss or perplexity, the accuracy gain is invisible and the extra 5-10% VRAM is unjustified. Save DoRA for runs where downstream task accuracy is the primary metric.

Code Recipe (PEFT)

python
from peft import LoraConfig, get_peft_model, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    use_dora=True,          # the only change from standard LoRA
    bias="none",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: 81,920,080 || all params: 6,738,451,456 || trainable%: 1.216

Requires PEFT >= 0.10.0.

Axolotl Config

yaml
adapter: lora
lora_r: 16
lora_alpha: 32
peft_use_dora: true
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

Unsloth 2026.04+ supports DoRA via use_dora=True in FastLanguageModel.get_peft_model(). The implementation is fused and runs at the same speed as Unsloth's LoRA kernels.

GaLore: Full Parameters at LoRA Cost

GaLore takes a different approach entirely. Instead of adding adapter matrices, it trains all parameters but projects the gradient into a low-rank subspace before computing optimizer state. The mechanism:

  1. Compute the full gradient G for a weight layer (shape: n x m).
  2. Compute the top-r left singular vectors of G, forming a projection matrix P (shape: n x r).
  3. Project the gradient: G_projected = P^T G (shape: r x m).
  4. Run Adam on G_projected, storing momentum buffers of size r x m instead of n x m.
  5. Reconstruct the full-size update: update = P G_projected_adam, where G_projected_adam is the Adam-adjusted projected gradient.
  6. Apply the full-size update to the actual weights.

The projection matrix P is recomputed every T steps (typically T=200). Between recomputations it is fixed, which is what keeps the optimizer state small. As the subspace is refreshed over successive recomputations, the optimizer effectively sees and updates all parameters.
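
A toy sketch of a single projected update, with the Adam statistics replaced by a plain gradient step for brevity (galore_torch does the real per-layer bookkeeping):

python
import torch

n, m_dim, r, lr = 1024, 1024, 128, 1e-4
W = torch.randn(n, m_dim)                      # actual weight, updated in place
G = torch.randn(n, m_dim)                      # full gradient for this layer

U, _, _ = torch.linalg.svd(G, full_matrices=False)
P = U[:, :r]                                   # projection subspace, recomputed every T steps
G_proj = P.T @ G                               # r x m: optimizer state lives at this size
step = -lr * G_proj                            # stand-in for the Adam step on the projected gradient
W += P @ step                                  # project back up and apply the full-size update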

Memory equation: standard Adam keeps two fp32 moment buffers plus an fp32 master copy of the weights, roughly 12 bytes per parameter. For a 7B model that is about 84 GB of optimizer-side memory before activations, which is why naive full fine-tuning with Adam does not fit on a single 80 GB GPU and pushes you to multi-GPU sharding. GaLore's projected Adam stores buffers of size r x m per layer instead of n x m. At r=128 on a 4096x4096 weight, that is a 32x reduction in optimizer state for that layer. Across the full 7B model, total training VRAM drops from ~80 GB (full fine-tuning) to ~18 GB (GaLore r=128).
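
A quick back-of-envelope check of the per-layer figure, assuming two fp32 Adam buffers at 8 bytes per stored element:

python
# Optimizer-state bytes for one 4096 x 4096 layer
full_state = 4096 * 4096 * 8       # Adam buffers over the full n x m weight
galore_state = 128 * 4096 * 8      # Adam buffers over the projected r x m gradient
print(full_state / galore_state)   # 32.0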

This comes with a tradeoff: GaLore needs 20-40% more training steps than LoRA to reach equivalent loss, because the projected gradient update is noisier than a direct adapter update. Plan for longer runs.

When GaLore makes sense: tasks requiring genuine full-parameter adaptation such as learning a new language domain, adapting across architectural changes, or fine-tuning when you cannot identify which layers to target with LoRA. Also useful when you have two H100s instead of eight and need to squeeze 70B full-parameter quality out of limited GPU count.

Code Recipe (galore-torch)

python
from galore_torch import GaLoreAdamW

# GaLore settings live in the parameter group, not the optimizer constructor.
# Project the 2D projection weights; everything else gets plain AdamW updates.
galore_params = [p for n, p in model.named_parameters()
                 if p.requires_grad and p.dim() == 2 and "embed" not in n]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters()
                  if p.requires_grad and id(p) not in galore_ids]

param_groups = [
    {"params": regular_params},
    {"params": galore_params,
     "rank": 128,              # GaLore projection rank
     "update_proj_gap": 200,   # how often to recompute the projection subspace
     "scale": 0.25,            # learning rate scale factor for the projected update
     "proj_type": "std"},      # 'std' or 'reverse_std'
]
optimizer = GaLoreAdamW(param_groups, lr=1e-4, weight_decay=0.01)

Install: pip install galore-torch. For TRL's SFTTrainer, pass the pre-built optimizer through the optimizers argument:

python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./galore-run",
        # ... other SFTConfig args ...
    ),
    train_dataset=dataset,            # your tokenized training split
    optimizers=(optimizer, None),     # pre-built GaLoreAdamW; the trainer builds the LR scheduler
)

One caution: combining GaLore with 4-bit quantization is experimental in 2026. The gradient projection depends on a clean gradient signal from the forward pass. With NF4 quantization, the gradients going into the projection are noisier, which can degrade the subspace quality. If VRAM is your constraint and you want GaLore's quality, use 8-bit quantization (bitsandbytes int8) rather than 4-bit NF4. For fitting models in tight VRAM without GaLore, QLoRA or QDoRA is the safer path.
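
A minimal sketch of the 8-bit load with bitsandbytes; the model name is only an example:

python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# int8 base weights keep the gradient signal cleaner than NF4 for GaLore's projection
bnb_int8 = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",      # example model
    quantization_config=bnb_int8,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)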

GaLore integration in Axolotl is experimental as of early 2026. Use the raw galore_torch.GaLoreAdamW with TRL's SFTTrainer rather than the Axolotl YAML optimizer path to avoid issues on Axolotl stable.

PiSSA: Initialization Is Half the Battle

Standard LoRA initializes A with random Gaussian values and B with zeros. The first several hundred training steps are spent learning where the meaningful update subspace is. PiSSA skips that warmup by initializing directly from the model's principal singular values.

The initialization:

  1. Compute truncated SVD of the pretrained weight W: W ≈ U_r S_r V_r^T.
  2. Set A = sqrt(S_r) V_r^T and B = U_r sqrt(S_r). The adapter BA now equals the top-r component of W.
  3. Freeze the residual W_res = W - BA. Train the adapter BA as in LoRA.

The adapter starts at the most informative region of the weight's parameter space, the principal subspace, rather than at a random perturbation orthogonal to useful directions. Training steps immediately refine meaningful structure rather than discovering it.
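
A toy sketch of the initialization, with sizes chosen arbitrarily; PEFT's init_lora_weights="pissa" performs the equivalent per adapted layer:

python
import torch

d_out, d_in, r = 256, 256, 16
W = torch.randn(d_out, d_in)                  # pretrained weight

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = torch.diag(S[:r].sqrt()) @ Vh[:r, :]      # r x d_in
B = U[:, :r] @ torch.diag(S[:r].sqrt())       # d_out x r
W_res = W - B @ A                             # frozen residual; the adapter BA is what gets trained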

Typical convergence speedup: 30-50% fewer training steps to reach the same validation loss as LoRA at equivalent rank on instruction-following benchmarks. At equivalent compute (same number of steps), PiSSA frequently matches DoRA accuracy. The choice between DoRA and PiSSA comes down to whether your bottleneck is steps or VRAM: PiSSA uses the same VRAM as LoRA, DoRA uses 5-10% more.

Use PiSSA when you have a fixed compute window, such as a 4-hour spot instance, and need to maximize the training signal per step.

Code Recipe (PEFT)

python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    init_lora_weights="pissa",   # PiSSA initialization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.0,            # PiSSA paper recommends no dropout
)
model = get_peft_model(model, config)

Requires PEFT >= 0.11.0. The SVD initialization runs once at startup and adds 30-60 seconds for a 7B model (one truncated SVD pass per adapted layer). For models over 30B, use init_lora_weights="pissa_niter_4" to switch to approximate fast SVD and keep init time under 2 minutes.

VeRA, MoRA, and LoRA-FA: When the Niche Methods Win

VeRA

VeRA takes parameter reduction further than any other method here. Instead of per-layer low-rank matrices, VeRA shares a single pair of random frozen matrices (A and B) across all layers, then learns only two tiny scaling vectors, b and d, per layer. Trainable parameters drop to roughly 1.8M regardless of model size.

This matters in two scenarios: fitting a 13B model onto a 12 GB GPU without quantization, or running proof-of-concept fine-tuning where you just need the model to respond to a new format and accuracy is secondary. The accuracy cost is real: 4-6% below LoRA on diverse benchmarks. Use VeRA to prototype, then switch to LoRA or DoRA for production. PEFT >= 0.12.0.
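
A minimal sketch using PEFT's VeraConfig; the rank and target modules below are illustrative choices, not recommendations:

python
from peft import VeraConfig, get_peft_model

vera_config = VeraConfig(
    r=256,                                 # VeRA ranks run much higher than LoRA; the shared A/B stay frozen
    target_modules=["q_proj", "o_proj"],   # example: same-shape projections
    vera_dropout=0.05,
)
model = get_peft_model(model, vera_config)
model.print_trainable_parameters()         # only the per-layer scaling vectors are trainable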

MoRA

MoRA replaces LoRA's pair of rectangular low-rank matrices with a single square matrix that has the same number of trainable parameters but a much higher rank, using non-parameterized compression and decompression operators (such as rotation) to map activations in and out of the square matrix. The key difference from LoRA: the square structure gives a higher effective rank for the same parameter count, which helps on tasks where the update needs to be dense within a subspace.

Empirically, MoRA outperforms LoRA on tasks requiring high factual memorization: question answering over new corpora, domain-specific classification with many categories, sequential prediction tasks. Same VRAM overhead as LoRA. PEFT support was added in 0.13.

LoRA-FA

LoRA-FA freezes the A (down-projection) matrix after random initialization and only trains B (up-projection). Because A is frozen, you do not need to store activations for A's backward pass, which cuts training VRAM by 15-25% versus standard LoRA at the same rank.

The accuracy drop is modest: 0.5-1.5% below LoRA on most benchmarks, since B alone still covers a meaningful update subspace. The right use case is fitting a 13B or 70B model onto a single GPU where the batch size or sequence length is the binding constraint. At 70B with LoRA r=16, LoRA-FA saves 6-12 GB of VRAM compared to full LoRA, which can be the difference between batch size 1 and batch size 4.
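
One way to approximate LoRA-FA on a standard PEFT LoRA model is to freeze the A matrices after setup; a sketch, assuming model was built with get_peft_model as in the LoRA recipe earlier:

python
# Approximate LoRA-FA: freeze every lora_A matrix so only the B (up-projection)
# matrices receive gradients. Assumes `model` is already a PEFT LoRA model.
for name, param in model.named_parameters():
    if "lora_A" in name:
        param.requires_grad = False

model.print_trainable_parameters()   # trainable parameter count drops roughly in half

How much of the activation-memory saving this realizes depends on the framework; dedicated LoRA-FA implementations avoid caching the layer inputs that are only needed for A's gradient.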

VRAM Comparison Tables

All figures assume rank 16 (r=128 for GaLore), batch size 2, sequence length 512, BF16 base model weights, and Adam optimizer. Add 15-20% headroom for actual training runs. QLoRA/QDoRA estimates use 4-bit NF4 quantization with BF16 adapters.

For the baseline model memory math behind these numbers, see the GPU memory requirements for LLMs guide.

7B Model

| Method | RTX 4090 (24 GB) | L40S (48 GB) | A100 80 GB | H100 80 GB |
|---|---|---|---|---|
| Full FT (BF16) | OOM | OOM | ~60 GB | ~60 GB |
| LoRA r=16 | ~10 GB | ~10 GB | ~10 GB | ~10 GB |
| DoRA r=16 | ~11 GB | ~11 GB | ~11 GB | ~11 GB |
| GaLore r=128 | ~18 GB | ~18 GB | ~18 GB | ~18 GB |
| PiSSA r=16 | ~10 GB | ~10 GB | ~10 GB | ~10 GB |
| VeRA | ~9 GB | ~9 GB | ~9 GB | ~9 GB |
| LoRA-FA r=16 | ~8.5 GB | ~8.5 GB | ~8.5 GB | ~8.5 GB |
| QLoRA r=16 (4-bit) | ~6 GB | ~6 GB | ~6 GB | ~6 GB |
| QDoRA r=16 (4-bit) | ~7 GB | ~7 GB | ~7 GB | ~7 GB |

13B Model

| Method | RTX 4090 (24 GB) | L40S (48 GB) | A100 80 GB | H100 80 GB |
|---|---|---|---|---|
| Full FT (BF16) | OOM | OOM | OOM | OOM |
| LoRA r=16 | OOM | ~18 GB | ~18 GB | ~18 GB |
| DoRA r=16 | OOM | ~20 GB | ~20 GB | ~20 GB |
| GaLore r=128 | OOM | ~32 GB | ~32 GB | ~32 GB |
| PiSSA r=16 | OOM | ~18 GB | ~18 GB | ~18 GB |
| VeRA | ~16 GB | ~16 GB | ~16 GB | ~16 GB |
| LoRA-FA r=16 | ~14 GB | ~14 GB | ~14 GB | ~14 GB |
| QLoRA r=16 (4-bit) | ~10 GB | ~10 GB | ~10 GB | ~10 GB |
| QDoRA r=16 (4-bit) | ~12 GB | ~12 GB | ~12 GB | ~12 GB |

70B Model

| Method | RTX 4090 (24 GB) | L40S (48 GB) | A100 80 GB | H100 80 GB |
|---|---|---|---|---|
| Full FT (BF16) | OOM | OOM | OOM (need 8x) | OOM (need 8x) |
| LoRA r=16 | OOM | OOM | OOM | ~72 GB (1x) |
| DoRA r=16 | OOM | OOM | OOM | ~76 GB (1x) |
| GaLore r=128 | OOM | OOM | OOM | OOM (need 2x) |
| PiSSA r=16 | OOM | OOM | OOM | ~72 GB (1x) |
| LoRA-FA r=16 | OOM | OOM | ~62 GB | ~62 GB |
| QLoRA r=16 (4-bit) | OOM | ~38 GB | ~38 GB | ~38 GB |
| QDoRA r=16 (4-bit) | OOM | ~42 GB | ~42 GB | ~42 GB |

For 70B at full BF16 precision, LoRA and DoRA require a single H100 80 GB (tight), LoRA-FA fits on A100 80 GB, and GaLore needs at least 2x H100. Full fine-tuning requires 8x H100 with ZeRO-3 sharding.

Cost Comparison: 7B Model, 100K Training Examples

Live Spheron GPU pricing fetched 11 May 2026 via the public API:

| GPU | On-Demand $/hr |
|---|---|
| H100 SXM5 | $4.21 |
| A100 80G SXM4 | $1.64 |
| A100 80G PCIe | $1.04 |
| L40S PCIe | $0.72 |

Note: RTX 4090 is not currently listed via the public pricing API. Check current GPU pricing for the latest availability. L40S at $0.72/hr is shown as the budget single-GPU option.

| Method | GPU | Est. Hours | On-Demand $/hr | Total Cost |
|---|---|---|---|---|
| Full FT | H100 SXM5 | 12 | $4.21 | $50.52 |
| LoRA r=16 | L40S | 3 | $0.72 | $2.16 |
| DoRA r=16 | L40S | 3.3 | $0.72 | $2.38 |
| GaLore r=128 | A100 80G SXM4 | 5 | $1.64 | $8.20 |
| PiSSA r=16 | L40S | 2 | $0.72 | $1.44 |
| QLoRA r=16 | L40S | 2.5 | $0.72 | $1.80 |
| QDoRA r=16 | L40S | 2.8 | $0.72 | $2.02 |

The DoRA premium over LoRA is $0.22 for this workload. That is the cost of the extra 5-10% VRAM and the slightly longer training time from the magnitude parameter updates. If DoRA gives you 1-2% higher accuracy on your downstream benchmark, it is almost always worth $0.22.

GaLore costs $8.20 for this run, roughly 4x more than LoRA, because it needs a larger GPU tier and more training steps. The payoff is full-parameter adaptation quality on tasks where LoRA's rank constraint genuinely limits accuracy. For most instruction-following tasks, LoRA wins on cost-adjusted accuracy.

For spot pricing strategy and GPU selection, see GPU cost optimization playbook and the spot GPU training case study.

Pricing fluctuates based on GPU availability. The prices above are based on 11 May 2026 and may have changed. Check current GPU pricing → for live rates.

Inference Implications: Merging, Deployment, and Multi-Adapter Serving

All adapter-based methods merge identically to LoRA via model.merge_and_unload(). The merged checkpoint size equals the base model size. No inference overhead after merge.

For DoRA specifically: the magnitude vector is folded back into the merged weight during merge_and_unload(). This requires PEFT >= 0.10 to handle correctly. On older PEFT versions, DoRA adapters will merge but the magnitude component will be applied incorrectly, degrading model quality silently. Pin your PEFT version.
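
A small runtime guard before merging, assuming the packaging package is installed:

python
from importlib.metadata import version
from packaging.version import Version

# Fail fast rather than silently mis-merging the DoRA magnitude component
assert Version(version("peft")) >= Version("0.10.0"), "DoRA merge requires peft>=0.10.0"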

GaLore produces a full-weight checkpoint at the end of training. No merge step needed. Deploy exactly like any Hugging Face model checkpoint.

Multi-adapter serving with DoRA and PiSSA works through vLLM 0.6+'s LoRA serving path. The magnitude scaling in DoRA adapters is handled in the adapter weights themselves, so vLLM loads and applies them the same way as standard LoRA. Launch with --enable-lora. For multi-adapter serving patterns across dozens of adapters on a single base model, see the guide on choosing your PEFT training method for multi-adapter deployments.
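
A sketch of a multi-adapter launch; the adapter names and paths below are placeholders:

bash
vllm serve meta-llama/Llama-3.1-8B \
  --enable-lora \
  --lora-modules support-dora=./adapters/support pissa-summarize=./adapters/summarize \
  --max-loras 4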

Adapter size on disk: DoRA adapters are 5-10% larger than equivalent LoRA adapters due to the magnitude vectors. At rank 16 for a 7B model, expect 130-150 MB for DoRA versus 100-120 MB for LoRA. At scale with hundreds of adapters, the storage difference is measurable but not a major concern.

Deploy on Spheron: DoRA Fine-Tuning Walkthrough

This section walks through a complete DoRA fine-tuning job for Llama-3.1-8B on Spheron. The same steps apply to PiSSA by swapping use_dora=True for init_lora_weights="pissa" in the PEFT config.

1. Provision an Instance

Rent H100 on Spheron for a production DoRA run, or A100 on Spheron for 7B models where the 80 GB gives you plenty of headroom for larger batch sizes. SSH in and verify the environment:

bash
ssh root@<instance-ip>
nvidia-smi
# Expected: H100 or A100 shown, CUDA 12.x

2. Install Dependencies

bash
pip install "transformers>=4.40" "peft>=0.10.0" "trl>=0.8" \
    bitsandbytes accelerate datasets

The PEFT version constraint matters. DoRA support landed in 0.9.0; 0.10.0 extended it. Pin to >= 0.10.0 for the most complete implementation.

3. Prepare Dataset and Training Script

python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer, SFTConfig
import torch

# Load model (QDoRA: 4-bit base + DoRA adapter)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# DoRA config
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    use_dora=True,
    bias="none",
)

dataset = load_dataset("your_dataset", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./dora-adapter",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        save_steps=100,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("./dora-adapter")

4. Launch Training

bash
accelerate launch --mixed_precision bf16 train_dora.py \
  --model_name_or_path meta-llama/Llama-3.1-8B \
  --dataset_name your_dataset \
  --output_dir ./dora-adapter \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --lora_r 16 \
  --use_dora

The CLI flags assume train_dora.py exposes them through argparse; with the hardcoded values from step 3, accelerate launch --mixed_precision bf16 train_dora.py on its own is enough.

5. Merge and Push

python
import torch
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "./dora-adapter",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = model.merge_and_unload()
merged.save_pretrained("./dora-merged")

Requires PEFT >= 0.10 for correct DoRA magnitude folding during merge.

6. Serve with vLLM

bash
vllm serve ./dora-merged --dtype bfloat16 --max-model-len 4096

The merged model runs at base model speed with no adapter overhead. If you want to serve the unmerged adapter for multi-adapter switching, use --enable-lora and point vLLM at the adapter directory instead.

For Axolotl users: set peft_use_dora: true alongside your existing LoRA adapter settings, as in the config shown earlier. No other changes are required. See docs.spheron.ai for GPU provisioning reference.


Making the Choice

The method matrix above covers the full decision space, but most people need one of three answers:

For 7B on a 24 GB GPU: LoRA or PiSSA at r=16. Use PiSSA if training time matters. Use DoRA if you have downstream accuracy benchmarks and want the extra 1-2%.

For 13B-70B on A100 or H100: LoRA-FA if VRAM is tight and you are willing to trade 0.5-1.5% accuracy for 15-25% less memory. Full LoRA or DoRA if you have headroom. QDoRA if 4-bit quantization is acceptable.

For full-parameter quality without 8x GPU: GaLore at r=128. Expect longer training runs and a larger GPU tier than LoRA, but you get full-weight checkpoints and genuine full-parameter adaptation.

Run a vanilla LoRA baseline first. If the downstream accuracy gap is real and your task is commonsense reasoning or instruction-following rather than pure language modeling, DoRA is the upgrade to try. If you are compute-bound and convergence speed matters more than VRAM, PiSSA is the lever to pull.

DoRA, GaLore, and PiSSA each cut a different corner of the full fine-tuning triangle: accuracy, memory, and speed. Spheron's per-hour H100 and A100 pricing makes it practical to run head-to-head ablations across methods for under $50 before committing to one for a production run.

Rent H100 → | Rent A100 → | View all GPU pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.