GPU memory is the constraint that determines whether your fine-tuning job runs at all. Before writing a single line of training code, you need to know whether your chosen model and method fit the GPU you're renting. The answer varies by a factor of 16x depending on whether you use full fine-tuning, LoRA, or QLoRA.
This guide covers the VRAM math for every common scenario: 7B through 70B models and MoE variants, across all three fine-tuning regimes. For a broader picture of what fine-tuning costs and which frameworks to use, start with the complete LLM fine-tuning guide for 2026. If you want to estimate total training cost for a specific token budget, use the training cost calculator.
Quick Answer: VRAM by Model and Method
If you just need the number fast, this table has it. The "total" column includes weights, gradients, optimizer states, and activations at batch size 1, sequence length 512.
| Model Size | Full Fine-Tuning | LoRA (r=64, BF16 base) | QLoRA (4-bit NF4) | QLoRA Min GPU |
|---|---|---|---|---|
| 7B/8B | ~88 GB | ~20 GB | ~8 GB | RTX 5090 (32 GB) |
| 14B | ~174 GB | ~35 GB | ~14 GB | RTX 5090 (32 GB) |
| 32B | ~394 GB | ~76 GB | ~28 GB | RTX 5090 (32 GB) |
| 70B/72B | ~860 GB | ~159 GB | ~52 GB | H100 80GB |
| MoE 30B A3B | ~105 GB | ~69 GB | ~21 GB | RTX 5090 (32 GB) |
The jump from QLoRA to full fine-tuning is not incremental. For 70B, you go from one H100 to eleven H100 SXM5s. That is an order-of-magnitude cost and complexity difference.
The Four Memory Components in Training
Training uses four distinct VRAM pools. Inference only needs one (model weights). Every GPU sizing mistake I have seen comes from accounting for only one or two of these components.
1. Model Weights
Weight memory depends on param count and your training dtype.
- BF16 (standard for LoRA and full FT): 2 bytes per parameter. A 7B model takes 14 GB; a 70B model takes 140 GB.
- INT4 / NF4 (quantized base in QLoRA): 0.5 bytes per parameter. The same 70B model takes 35 GB, which is why QLoRA fits in a single H100.
For QLoRA, the base model weights are frozen and quantized. The adapter weights (A and B matrices) are stored separately in BF16 and are tiny: typically 0.5-2% of total parameter count.
2. Gradients
Gradients are stored at the same precision as the weights being updated.
- Full fine-tuning: Gradient memory matches weight memory. A 70B model in BF16 needs another 140 GB for gradients.
- LoRA: Only adapter gradients are stored. With 1% of parameters trainable, gradient memory drops to 1-3 GB for a 70B model.
- QLoRA: Same as LoRA - adapter gradients only. The frozen quantized base does not accumulate gradients.
Gradient checkpointing (model.gradient_checkpointing_enable()) does not affect this component. It only reduces activation memory.
3. Optimizer States
This is the component most people underestimate. Adam and AdamW store two FP32 running statistics per trainable parameter: a first moment (mean of gradients) and a second moment (mean of squared gradients). Each is stored at FP32 regardless of your training dtype.
The formula: trainable_params × 4 bytes × 2 moments
| Model | Method | Trainable Params | Optimizer State |
|---|---|---|---|
| 7B | Full FT | 7B | ~56 GB |
| 7B | LoRA r=64 (~1% adapters) | ~70M | ~0.6 GB |
| 7B | QLoRA | ~70M | ~0.6 GB |
| 70B | Full FT | 70B | ~560 GB |
| 70B | LoRA r=64 | ~700M | ~5.6 GB |
| 70B | QLoRA | ~700M | ~5.6 GB |
This is why full fine-tuning of a 70B model needs 560 GB just for Adam's momentum buffers. It is also why switching from LoRA to QLoRA does not change optimizer memory: both have the same number of trainable adapter parameters.
Paged AdamW (optim="paged_adamw_32bit" in TrainingArguments) offloads optimizer states to CPU RAM in pages when GPU VRAM gets tight. It adds CPU-GPU transfer overhead but can save 2-6 GB on smaller models where the optimizer state is a meaningful fraction of total VRAM.
4. Activations
Activations are the intermediate outputs stored during the forward pass for use in backpropagation. They scale with batch size and sequence length.
At batch size 1, sequence length 512, expect roughly 2-5 GB for 7B models and 8-20 GB for 70B models. Longer sequences increase this quadratically (attention scores scale with sequence length squared before flash attention, or linearly with flash attention).
Gradient checkpointing cuts activation memory by 40-60% by discarding intermediate activations and recomputing them on the backward pass. The tradeoff is 25-30% more compute time. For most fine-tuning jobs where VRAM is the constraint, this tradeoff is worth it.
Full Fine-Tuning
Full fine-tuning updates every parameter in the base model. You store gradients and optimizer states for all parameters. This gives the best possible accuracy but is expensive in both VRAM and compute.
VRAM formula for full fine-tuning in BF16:
total_vram = (params × 2) # weights in BF16
+ (params × 2) # gradients in BF16
+ (params × 8) # Adam: 2 moments × FP32
+ activations # 2-20 GB depending on batch/seq lengthFor a 7B model: 14 + 14 + 56 + 4 = 88 GB. This already exceeds a single A100 80G's 80 GB capacity (weights + gradients + optimizer alone total 84 GB before activations). Run full FT on 2x A100 80G with tensor parallelism, or use an H200 SXM5 (141 GB). Note: standard mixed-precision training also stores FP32 master weights (~4 bytes/param, ~28 GB extra for 7B), pushing the true total toward ~116 GB; the formula above assumes these are shared with the BF16 weight buffer as in many framework implementations.
For a 70B model: 140 + 140 + 560 + 20 = 860 GB. No single GPU handles this. You need 11x H100 SXM5 with FSDP2 or DeepSpeed ZeRO-3 to shard gradients and optimizer states across cards. (8x H100 SXM5 provides only 640 GB total, a 220 GB shortfall; each card would need to hold ~107.5 GB on average, exceeding its 80 GB capacity.)
Use full fine-tuning when:
- You need maximum accuracy and are willing to pay for it
- You plan to distill or merge the fine-tuned weights into a new model
- Your dataset is large enough (100K+ examples) that adapter methods leave accuracy on the table
For FSDP2 cluster setup and multi-node distributed training, see the distributed LLM training guide.
LoRA
LoRA (Low-Rank Adaptation) adds trainable A and B matrices alongside frozen base model weights. You train only the adapters, which are a fraction of total parameter count. The base model stays in BF16 in memory.
VRAM formula for LoRA:
total_vram = (params × 2) # base model weights in BF16 (frozen but in VRAM)
+ adapter_weights # small: typically 0.5-5 GB depending on rank/model size
+ adapter_gradients # same size as adapter weights
+ adapter_optimizer # Adam states for adapters: ~2x adapter weights in FP32
+ activations # 2-20 GB depending on configKey insight: the base model must still fit in VRAM at BF16 for standard LoRA. That means a 70B LoRA run starts at 140 GB before you add anything else. An H100 SXM5 at 80 GB cannot hold this alone.
Practical minimums for LoRA at rank 64:
- 7B: ~20 GB total (RTX 5090 at 32 GB with headroom)
- 14B: ~35 GB total (A100 40G at 40 GB is tight; RTX 5090 is better)
- 32B: ~76 GB total (H100 80G is the minimum)
- 70B: ~159 GB total (minimum 2x H100 SXM5; the H200 SXM5's 141 GB is insufficient even with gradient checkpointing since the non-activation floor is ~149 GB)
For advanced adapter methods that improve accuracy over standard LoRA at similar VRAM cost, see the PEFT guide covering DoRA, PiSSA, and GaLore.
QLoRA
QLoRA combines 4-bit NF4 quantization of the base model with BF16 adapters. The frozen base is quantized; the trainable adapters stay at full precision. This is what makes 70B fine-tuning practical on a single H100.
VRAM formula for QLoRA:
total_vram = (params × 0.5) # base model weights in 4-bit NF4
+ adapter_weights # adapters in BF16: ~1-2 GB for 70B at r=64
+ adapter_gradients # ~1-2 GB
+ adapter_optimizer # ~5-6 GB for 70B (AdamW on 700M adapter params)
+ activations # 4-8 GB at batch 1, 512 tokensFor a 70B model: 35 + 1.5 + 1.5 + 5.6 + 8 = ~52 GB. An H100 80GB handles this with about 28 GB to spare for larger batches or longer sequences.
QLoRA accuracy is typically 1-3% below full fine-tuning and 0.5-1% below standard LoRA on the same base model. For most production use cases, this difference is not meaningful.
The quality you lose with quantization comes from rounding error in the 4-bit base weights. Unsloth's dynamic 4-bit implementation reduces this gap to 0.02 perplexity points compared to 8-bit, making it effectively lossless for most downstream tasks.
Master VRAM Table
Full breakdown for each model size and method. All estimates assume batch size 1, sequence length 512, AdamW optimizer, rank 64 for LoRA/QLoRA. Add 15-20% headroom for real training runs (peak allocations spike above average).
7B / 8B Models (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B)
| Method | Base Weights | Adapters | Gradients | Optimizer | Activations | Total | Min GPU | $/hr |
|---|---|---|---|---|---|---|---|---|
| Full FT | 14 GB | N/A | 14 GB | 56 GB | 4 GB | ~88 GB | 2x A100 80G PCIe | $2.96 |
| LoRA r=64 | 14 GB | 0.2 GB | 0.2 GB | 0.6 GB | 4 GB | ~19 GB | RTX 5090 (32 GB) | $0.92 |
| QLoRA r=64 | 3.5 GB | 0.2 GB | 0.2 GB | 0.6 GB | 4 GB | ~8 GB | RTX 5090 (32 GB) | $0.92 |
14B Models (Qwen 2.5 14B, Phi-4 14B)
| Method | Base Weights | Adapters | Gradients | Optimizer | Activations | Total | Min GPU | $/hr |
|---|---|---|---|---|---|---|---|---|
| Full FT | 28 GB | N/A | 28 GB | 112 GB | 6 GB | ~174 GB | 3x A100 80G | $4.44 ($1.48/GPU) |
| LoRA r=64 | 28 GB | 0.4 GB | 0.4 GB | 0.8 GB | 5 GB | ~35 GB | RTX 5090 (32 GB, tight) | $0.92 |
| QLoRA r=64 | 7 GB | 0.4 GB | 0.4 GB | 0.8 GB | 5 GB | ~14 GB | RTX 5090 (32 GB) | $0.92 |
Note on 14B LoRA: 35 GB is tight for a 32 GB GPU. Enable gradient checkpointing and reduce batch size to 1. For more headroom, use an A100 40G ($1.71/hr).
32B Models (Qwen 3 32B, Qwen 2.5 32B, Gemma 3 27B)
| Method | Base Weights | Adapters | Gradients | Optimizer | Activations | Total | Min GPU | $/hr |
|---|---|---|---|---|---|---|---|---|
| Full FT | 64 GB | N/A | 64 GB | 256 GB | 10 GB | ~394 GB | 5x A100 80G | $7.40 ($1.48/GPU) |
| LoRA r=64 | 64 GB | 1 GB | 1 GB | 2 GB | 8 GB | ~76 GB | H100 PCIe (80 GB) | $2.01 |
| QLoRA r=64 | 16 GB | 1 GB | 1 GB | 2 GB | 8 GB | ~28 GB | RTX 5090 (32 GB) | $0.92 |
70B / 72B Models (Llama 3.3 70B, Qwen 2.5 72B)
| Method | Base Weights | Adapters | Gradients | Optimizer | Activations | Total | Min GPU | $/hr |
|---|---|---|---|---|---|---|---|---|
| Full FT | 140 GB | N/A | 140 GB | 560 GB | 20 GB | ~860 GB | 11x H100 SXM5 + FSDP | $55.77 ($5.07/GPU) |
| LoRA r=64 | 140 GB | 1.5 GB | 1.5 GB | 5.6 GB | 10 GB | ~159 GB | 2x H100 SXM5 | $10.14 ($5.07/GPU) |
| QLoRA r=64 | 35 GB | 1.5 GB | 1.5 GB | 5.6 GB | 8 GB | ~52 GB | H100 PCIe 80GB | $2.01 |
GC = gradient checkpointing enabled. 70B LoRA has a non-activation memory floor of ~149 GB (weights 140 GB + adapters 1.5 GB + gradients 1.5 GB + optimizer 5.6 GB), which already exceeds the H200 SXM5's 141 GB even if activations are fully eliminated. Gradient checkpointing only reduces activation memory, not the non-activation components. The minimum configuration for 70B LoRA is 2x H100 SXM5 (160 GB combined).
MoE: Qwen3 30B A3B (30B Total / 3B Active Parameters)
MoE VRAM behaves differently from dense models. All expert weights must be loaded into VRAM (for weight memory calculation, use total params). But only active expert parameters receive gradients per step, so gradient and optimizer memory uses active params.
| Method | Base Weights | Adapters (active experts) | Gradients (active) | Optimizer (active) | Activations | Total | Min GPU | $/hr |
|---|---|---|---|---|---|---|---|---|
| Full FT | 60 GB | N/A | 6 GB | 24 GB | 15 GB | ~105 GB | 2x A100 80G | $2.96 ($1.48/GPU) |
| LoRA r=64 | 60 GB | 0.2 GB | 0.2 GB | 0.4 GB | 8 GB | ~69 GB | H100 PCIe (80 GB) | $2.01 |
| QLoRA r=64 | 15 GB | 0.1 GB | 0.1 GB | 0.5 GB | 5 GB | ~21 GB | RTX 5090 (32 GB) | $0.92 |
For MoE QLoRA, only active expert adapters need gradients and optimizer states per step, which is why the numbers look surprisingly small at 0.5 GB for optimizer states. Over training, all experts get covered because different tokens activate different experts. Unsloth benchmarks confirm Qwen3 30B-A3B QLoRA running at 17.5 GB, consistent with this estimate.
GPU Selection Guide
Mapping scenarios to the right GPU tier:
| Training Scenario | Min VRAM | GPU on Spheron | On-Demand $/hr |
|---|---|---|---|
| 7B QLoRA | 8 GB | RTX 5090 | $0.92 |
| 7B LoRA | 20 GB | RTX 5090 | $0.92 |
| 7B Full FT | 88 GB | 2x A100 80G PCIe | $2.96 |
| 14B QLoRA | 14 GB | RTX 5090 | $0.92 |
| 14B LoRA | 35 GB | RTX 5090 (tight) or A100 40G | $0.92-1.71 |
| 32B QLoRA | 28 GB | RTX 5090 | $0.92 |
| 32B LoRA | 76 GB | H100 PCIe (80 GB) | $2.01 |
| 70B QLoRA | 52 GB | H100 PCIe (80 GB) or H100 SXM5 (80 GB) | $2.01-5.07 |
| 70B LoRA | 159 GB | 2x H100 SXM5 | $10.14 ($5.07/GPU) |
| 70B Full FT | 860 GB | 11x H100 SXM5 + FSDP/ZeRO-3 | $55.77 ($5.07/GPU) |
| MoE 30B A3B QLoRA | 21 GB | RTX 5090 | $0.92 |
| MoE 30B A3B LoRA | 69 GB | H100 PCIe (80 GB) | $2.01 |
A few notes on the table:
The RTX 5090 covers a surprisingly wide range. Any QLoRA job up to 70B and LoRA for 7B-14B can run on it at $0.92/hr. For prototyping and experiments on smaller models, this is the obvious starting point.
For 70B QLoRA, H100 PCIe (80 GB) at $2.01/hr is the cost-efficient choice: 52 GB fits with 28 GB to spare. H100 SXM5 at $5.07/hr is worth it if you need higher memory bandwidth for larger batch sizes or faster iteration, but for a standard fine-tuning job the PCIe version is more than adequate.
For 70B LoRA, 2x H100 SXM5 is the minimum at $10.14/hr. The H200 SXM5 (141 GB) cannot fit this workload: the non-activation memory floor is ~149 GB (weights 140 GB + adapters 1.5 GB + gradients 1.5 GB + optimizer 5.6 GB), which already exceeds 141 GB. Gradient checkpointing only reduces activation memory, so even eliminating activations entirely leaves 149 GB that will not fit. Enable gradient checkpointing (model.gradient_checkpointing_enable()) when using 2x H100 SXM5 to reduce per-card activation pressure.
For RTX PRO 6000 (96 GB VRAM, $2.39/hr), it comfortably handles 70B QLoRA and 32B LoRA with plenty of headroom. Check RTX PRO 6000 availability if you need a single-GPU option between the RTX 5090 and H100 tiers.
Cost Worked Examples
Three concrete examples using current Spheron pricing.
Example 1: 7B QLoRA (Llama 3.1 8B) on RTX 5090
Fine-tuning on a custom instruction dataset (50K examples, 512 token average length):
- GPU: RTX 5090 PCIe (32 GB VRAM)
- Method: QLoRA, rank 64, batch size 4, gradient accumulation 4 steps
- Estimated training time: 3 hours (2 epochs)
- On-demand cost: 3 × $0.92 = $2.76
For anything under 7B-8B, this is the cheapest path that still produces useful adapters.
Example 2: 70B QLoRA (Llama 3.3 70B) on H100
Fine-tuning on a domain-specific dataset (25K examples, 512 tokens):
- GPU: 1x H100 PCIe (80 GB VRAM), cost-efficient option at $2.01/hr
- Method: QLoRA, rank 64, paged AdamW, gradient checkpointing
- Estimated training time: 10 hours
- On-demand cost: 10 × $2.01 = $20.10
If you need higher memory bandwidth for larger batches or faster iteration, swap to H100 SXM5 ($5.07/hr): 10 × $5.07 = $50.70 on-demand (or $29.10 spot at $2.91/hr, with checkpointing enabled). For most fine-tuning jobs, H100 PCIe delivers the same result at 60% lower cost.
This is the most common production setup for 70B fine-tuning. One H100, under $55, returns a 200-500 MB adapter file you merge with the base model for inference.
Example 3: 70B Full Fine-Tuning on 11x H100 SXM5
Full fine-tuning for a production model needing maximum accuracy:
- GPU cluster: 11x H100 SXM5 with FSDP2 / ZeRO-3
- Method: Full FT, BF16, gradient checkpointing, batch size 2 per GPU
- Estimated training time: 32 hours (100K examples, 3 epochs)
- On-demand cost per hour: 11 × $5.07 = $55.77/hr
- Total on-demand cost: 32 × $55.77 = $1,785
Full fine-tuning at 70B scale is a capital decision, not a casual experiment. Budget appropriately.
Pricing fluctuates based on GPU availability. The prices above are based on 01 Jul 2026 and may have changed. Check current GPU pricing for live rates.
VRAM Reduction Techniques
If your job is running close to VRAM limits or you want to fit more into each run, these techniques stack with each other.
Gradient checkpointing is the highest-leverage change. Enable it with:
model.gradient_checkpointing_enable()This reduces activation memory by 40-60% at a 25-30% compute cost. Enable it first before trying anything else.
Paged AdamW offloads optimizer states to CPU RAM in pages when VRAM gets full:
training_args = TrainingArguments(
optim="paged_adamw_32bit",
...
)Saves 2-6 GB on models where optimizer states are a meaningful fraction of total VRAM. Most useful for 7B-32B QLoRA runs where you want to push batch size higher.
Liger Kernel replaces RMSNorm, RoPE, SwiGLU, and cross-entropy with fused Triton kernels. One line of code, 40-60% VRAM savings on activation peaks:
from liger_kernel.transformers import apply_liger_kernel_to_llama
apply_liger_kernel_to_llama()See the Liger Kernel training guide for full setup and benchmarks across H100, H200, and B200.
Flash Attention reduces activation memory during self-attention by 30-40% for sequences above 512 tokens. Pass attn_implementation="flash_attention_2" when loading the model. The savings compound with gradient checkpointing for long-context fine-tuning.
See the FlashAttention 2 vs 3 guide for hardware-specific performance comparison.
Reduce batch size and increase gradient accumulation. Cutting batch size from 8 to 1 and setting gradient_accumulation_steps=8 keeps effective batch size identical while reducing peak activation memory by 8x. The training curve is identical; throughput is slightly lower.
LoRA rank reduction. Dropping from r=64 to r=16 cuts adapter memory and optimizer states by 4x with minimal accuracy impact for most tasks. Start at r=16 and only increase if your validation loss plateaus early.
Multi-GPU Setup
Two scenarios where you need to distribute across GPUs.
70B Full Fine-Tuning with FSDP2 or ZeRO-3
Axolotl and TorchTune both support FSDP2 natively. For 11x H100 SXM5 with FSDP2:
# axolotl config
fsdp:
- full_shard
- auto_wrap
fsdp_config:
fsdp_offload_params: false
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayerThis shards model weights, gradients, and optimizer states evenly across 11 GPUs. See the full distributed training setup guide for NCCL tuning, interconnect requirements, and multi-node extensions.
70B LoRA on 2x H100
LoRA with a frozen base model still requires FSDP for 70B on 2x H100. Standard data parallelism (DDP) is not an option: DDP replicates the full model on each GPU, so each H100 would need to hold all ~159 GB by itself, which far exceeds its 80 GB capacity. FSDP shards the model weights (frozen base included), gradients, and optimizer states across both cards. Each GPU holds roughly half the sharded state (~80 GB), and only the adapter gradients need synchronization at each backward pass. Use the same full_shard FSDP config as for full fine-tuning, just targeting 2 GPUs instead of 11.
For spot-eligible rollout workers and preemption-resilient training jobs, see the spot GPU training resilience guide.
Run the numbers before you spin up a GPU: the training cost calculator turns any model size and token count into a dollar estimate across H100, H200, B200, and A100, with spot vs on-demand split. When you are ready to start, check current GPU pricing on Spheron
Quick Setup Guide
Determine your model's parameter count and your training dtype. For full fine-tuning: multiply params by 2 bytes for BF16 weights, again for gradients, and by 8 bytes for Adam optimizer states (two FP32 momentum buffers). For QLoRA: multiply params by 0.5 bytes for the 4-bit NF4 base, then add adapter weights, gradients, and optimizer states (around 8-9 GB total for 70B: 1.5 GB adapters, 1.5 GB gradients, 5.6 GB optimizer states).
Under 24 GB: use QLoRA only. 24-80 GB: QLoRA comfortably covers all sizes up to 70B; LoRA works for 7B-14B. 80-160 GB: multi-GPU full fine-tuning for 7B, LoRA for 32B. Over 160 GB or multi-GPU: full fine-tuning for 14B+, LoRA for 70B+.
Match your VRAM estimate to the GPU, then add 15-20% headroom. RTX 5090 (32 GB) covers 7B-32B QLoRA and 7B-14B LoRA. H100 80GB covers 70B QLoRA and 32B LoRA. 2x H100 SXM5 (160 GB combined) is the minimum for 70B LoRA (H200 SXM5 at 141 GB is insufficient even with gradient checkpointing). Use 11x H100 with FSDP/ZeRO-3 for 70B full fine-tuning.
Multiply your GPU's on-demand rate by estimated training hours. Spot instances run 40-60% cheaper but can be interrupted. Enable FSDP or ZeRO-3 checkpointing before using spot - preemption midway through a 10-hour job without a checkpoint wastes everything. Check current rates at spheron.network/pricing/ before starting.
Frequently Asked Questions
It depends on your method. Full fine-tuning of a 7B model needs roughly 88 GB using 12 bytes/param (a single A100 80G has only 80 GB, so you need 2x A100 with tensor parallelism or an H200 SXM5). LoRA at rank 64 on a BF16 base needs about 20 GB (RTX 5090 or A100 40G). QLoRA with a 4-bit quantized base needs only 6-10 GB, which fits on consumer GPUs with 12 GB or more VRAM.
An H100 80GB is the minimum for 70B QLoRA. Peak VRAM is roughly 48-52 GB: 35 GB for the 4-bit quantized base weights, plus 1.5 GB for adapter weights, 1.5 GB for gradients, about 5-6 GB for optimizer states, and 4-8 GB for activations. An H100 PCIe at 80 GB fits this at $2.01/hr; H100 SXM5 provides more memory bandwidth at $5.07/hr for higher throughput.
For a 70B model, the difference is roughly 16x. Full fine-tuning needs around 860 GB across 11x H100 SXM5 minimum. QLoRA needs roughly 50-54 GB on a single H100. The gap comes from storing BF16 gradients and FP32 Adam optimizer states for all 70 billion parameters, versus only storing those for the small adapter layers in QLoRA.
On 2x H100 SXM5 ($10.14/hr combined), a 70B LoRA run typically takes 8-14 hours for $81-142 on-demand. The H200 SXM5 (141 GB) cannot fit 70B LoRA: the non-activation memory floor alone is ~149 GB (weights 140 GB + adapters 1.5 GB + gradients 1.5 GB + optimizer 5.6 GB), so gradient checkpointing cannot close the gap. QLoRA on a single H100 PCIe ($2.01/hr) is cheaper: 8-12 hours at $16-24.
Yes, easily. A 7B QLoRA run uses only 6-10 GB VRAM, well within the RTX 5090's 32 GB. Even LoRA on a BF16 base fits at 16-20 GB. Full fine-tuning does not fit on any single consumer GPU - it needs 88 GB minimum (more with FP32 master weights), which also exceeds a single A100 80G's 80 GB; minimum is 2x A100 with model parallelism.
