Training a 70B parameter model is no longer reserved for companies with eight-figure compute budgets. But the default path — spinning up 8x H100 Dedicated instances and running for two weeks straight — still costs more than most early-stage teams can justify.
This case study documents how a 12-person AI startup completed a full supervised fine-tuning run of a 70B parameter model using a hybrid Spot-Dedicated GPU strategy. The total cost came in at $11,200 — roughly 73% less than the $41,500 they budgeted for a purely Dedicated approach.
The techniques here aren't theoretical. They're the actual steps, configurations, and failure-recovery patterns the team used across 312 hours of cumulative runtime on 8x H100 bare metal servers (roughly 2,500 GPU-hours).
The Problem
The team was building a domain-specific legal reasoning model. They had a curated dataset of 2.1 million instruction-response pairs covering contract analysis, regulatory compliance, and case law interpretation. Their base model was Qwen 2.5 72B, chosen for its strong reasoning baseline and open license.
Their initial cost estimate was sobering. Full fine-tuning of a 72B model requires 8x H100 80GB GPUs to hold the model weights, gradients, and optimizer states in memory. At the standard Dedicated rate of $2.00/hr per GPU ($16.00/hr for the 8-GPU node), a 14-day training run would cost approximately $5,376. But that assumed zero restarts, zero failed runs, and zero hyperparameter experimentation — unrealistic for a first-time fine-tuning effort.
Accounting for 2-3 experimental runs and inevitable reruns, their realistic budget sat at $41,500.
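As a sanity check, the single-run figure falls straight out of the quoted rates; the gap between that and the $41,500 budget is the team's own contingency for experimental runs and reruns, not something derived here:

```python
# Reproducing the single-run estimate from the quoted per-GPU rates.
gpus = 8
hours = 24 * 14          # one uninterrupted 14-day run

dedicated_rate = 2.00    # $/GPU-hour, Dedicated
spot_rate = 1.49         # $/GPU-hour, Spot

print(f"Dedicated, single run: ${gpus * dedicated_rate * hours:,.0f}")  # $5,376
print(f"Spot, single run:      ${gpus * spot_rate * hours:,.0f}")       # $4,005
```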
The Strategy
The team designed a three-phase approach that used Dedicated instances only when absolutely necessary and ran the bulk of computation on Spot instances.
Phase 1: Hyperparameter Search on Spot (72 hours)
Before committing to the full training run, they needed to find the right learning rate, warmup schedule, and batch size. This experimentation phase is where most teams burn money — running 5-10 partial training runs that each consume significant GPU hours.
They provisioned 8x H100 Spot instances at $1.49/hr per GPU ($11.92/hr total). Over three days, they ran 8 experimental configurations, each training for 2,000 steps — enough to see loss curve trajectories and catch obvious problems like gradient explosions or learning rate mismatches.
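The exact configurations aren't listed, so the sweep script below is illustrative only: the learning rates, warmup values, and accumulation settings are assumptions, and train.py stands in for whatever fine-tuning entrypoint is actually used.

```python
from itertools import product

# Illustrative grid; these specific values are assumptions, not the team's actual sweep.
learning_rates = [1e-5, 1.5e-5, 2e-5]
warmup_options = [100, 200]
accum_options = [8, 16]

configs = list(product(learning_rates, warmup_options, accum_options))[:8]  # cap at 8 short runs

for i, (lr, warmup, accum) in enumerate(configs):
    # Each run trains for only 2,000 steps: enough to compare loss trajectories
    # and catch gradient explosions or mismatched learning rates early.
    print(
        f"torchrun --nproc_per_node=8 train.py "
        f"--run_name sweep-{i:02d} --learning_rate {lr} --warmup_steps {warmup} "
        f"--gradient_accumulation_steps {accum} --max_steps 2000"
    )
```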
Spot instances were interrupted twice during this phase. Each interruption cost them roughly 30 minutes of re-provisioning and checkpoint loading — negligible compared to the savings. The key was their checkpointing setup (detailed below), which saved state every 200 steps.
Phase 1 cost: 72 hours × $11.92/hr = $858
Phase 2: Full Training Run on Spot with Automated Failover (216 hours)
With hyperparameters locked, they launched the full fine-tuning run on Spot instances. The critical infrastructure decision was building an automated checkpoint-and-resume pipeline that could handle Spot interruptions without human intervention.
Their pipeline worked in three layers.
Layer 1 — Aggressive checkpointing. They configured the training script to save a full checkpoint every 500 steps and a lightweight "resume checkpoint" (optimizer state + step counter + RNG state, no full model weights) every 100 steps. Full checkpoints were written to network-attached storage that persisted across instance restarts.
Layer 2 — Health monitoring. A lightweight monitoring script polled nvidia-smi and the training throughput logs (tokens/sec) every 30 seconds. If throughput dropped below 60% of baseline for more than 2 minutes, it triggered an early checkpoint save before the instance was reclaimed.
Layer 3 — Automated re-provisioning. When a Spot instance was reclaimed, their provisioning script automatically requested a new 8x H100 Spot node. Once provisioned (typically 2-8 minutes), the training script resumed from the latest checkpoint with no manual intervention.
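The re-provisioning script itself isn't reproduced here, so the outline below is a minimal sketch of the pattern rather than the team's actual tooling. request_spot_node, node_is_healthy, and the launch_training.sh path are placeholders for your provider's provisioning API and your own launch scripts:

```python
import subprocess
import time

def request_spot_node():
    """Placeholder: call your provider's API/CLI to request an 8x H100 Spot node.

    Should return the node's SSH address once it is running, or None while pending."""
    raise NotImplementedError

def node_is_healthy(address):
    """Placeholder: SSH in and confirm all 8 GPUs are visible before relaunching."""
    raise NotImplementedError

def relaunch_training(address):
    """Restart training on the new node; the job resumes from the latest checkpoint
    on the persistent volume (e.g. via trainer.train(resume_from_checkpoint=True))."""
    subprocess.run(
        ["ssh", address, "bash", "/persistent-storage/scripts/launch_training.sh"],
        check=True,
    )

def recover_from_interruption(poll_seconds=30):
    """Loop until a replacement Spot node comes up, then resume training on it."""
    while True:
        address = request_spot_node()
        if address and node_is_healthy(address):
            relaunch_training(address)
            return address
        time.sleep(poll_seconds)  # provisioning typically took 2-8 minutes in this run
```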
Over the 216-hour training phase, they experienced 7 Spot interruptions. Average recovery time was 11 minutes. Total lost compute from interruptions: approximately 4.2 hours.
Phase 2 cost: 216 hours × $11.92/hr = $2,575
Phase 3: Final Validation on Dedicated (24 hours)
For the final evaluation and benchmark runs, they switched to Dedicated instances. This phase required stable, uninterrupted compute for consistent benchmark results — running the fine-tuned model through their evaluation suite of 15,000 test cases covering contract clause identification, regulatory risk scoring, and case law retrieval.
They also used this phase to run inference benchmarks (tokens/sec, time-to-first-token) that would inform their production serving configuration.
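A minimal version of that inference benchmark can be put together with plain transformers generation; the checkpoint path and prompt below are placeholders, and the numbers from a bare generate() call are only a rough baseline for serving decisions:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/persistent-storage/checkpoints/legal-72b"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"  # shard across the 8 GPUs
)

prompt = "Summarize the indemnification obligations in the following clause: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

model.generate(**inputs, max_new_tokens=1)  # warm-up

# Time-to-first-token: a single new token, which is dominated by prefill.
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)
ttft = time.perf_counter() - start

# Decode throughput: time a longer generation and divide by tokens produced.
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]

print(f"TTFT: {ttft * 1000:.0f} ms | throughput: {new_tokens / elapsed:.1f} tokens/sec")
```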
Phase 3 cost: 24 hours × $16.00/hr = $384
The Checkpointing Architecture
The checkpointing system was the most critical piece of infrastructure. Without it, every Spot interruption would have meant restarting training from the last manual save — potentially losing hours of compute.
Here's the core configuration they used with the Hugging Face Trainer:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/persistent-storage/checkpoints/legal-72b",
    # Aggressive checkpoint schedule
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,  # Keep only the 3 most recent full checkpoints
    # Training configuration
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1.5e-5,
    warmup_steps=200,
    bf16=True,
    # Logging for monitoring
    logging_steps=10,
    logging_dir="/persistent-storage/logs",
)

# Automatic resumption is requested at launch time rather than in TrainingArguments:
# trainer.train(resume_from_checkpoint=True) picks up the most recent checkpoint
# in output_dir if one exists.
```

The lightweight resume checkpoints (every 100 steps) used a custom callback:
```python
from transformers import TrainerCallback
import torch
import os

class SpotCheckpointCallback(TrainerCallback):
    """Save a lightweight resume state (no model weights) every `save_interval` steps."""

    def __init__(self, save_dir, save_interval=100):
        self.save_dir = save_dir
        self.save_interval = save_interval
        os.makedirs(save_dir, exist_ok=True)

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.save_interval == 0:
            # The Trainer passes the live optimizer and scheduler in kwargs.
            resume_state = {
                "global_step": state.global_step,
                "epoch": state.epoch,
                "optimizer_state": kwargs.get("optimizer").state_dict(),
                "scheduler_state": kwargs.get("lr_scheduler").state_dict(),
                "rng_state": torch.random.get_rng_state(),
                "cuda_rng_state": torch.cuda.get_rng_state_all(),
            }
            path = os.path.join(self.save_dir, "resume_latest.pt")
            torch.save(resume_state, path)
```
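The callback only writes this state; on restart, the launch script still has to load resume_latest.pt and push the optimizer, scheduler, and RNG state back into the trainer before training continues (model weights come from the most recent full checkpoint). A minimal restore sketch, assuming you have handles to the same optimizer and scheduler objects:

```python
import os
import torch

def restore_resume_state(save_dir, optimizer, lr_scheduler):
    """Reload the lightweight state written by SpotCheckpointCallback, if present."""
    path = os.path.join(save_dir, "resume_latest.pt")
    if not os.path.exists(path):
        return None  # fall back to the last full checkpoint only

    resume_state = torch.load(path, map_location="cpu")
    optimizer.load_state_dict(resume_state["optimizer_state"])
    lr_scheduler.load_state_dict(resume_state["scheduler_state"])
    torch.random.set_rng_state(resume_state["rng_state"])
    torch.cuda.set_rng_state_all(resume_state["cuda_rng_state"])
    return resume_state["global_step"]
```

And the health monitoring script that detected impending preemptions: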
```bash
#!/bin/bash
# spot-health-monitor.sh
# Polls training throughput every 30 seconds; if it stays below 60% of baseline
# for 2 minutes, signals the training process to write an emergency checkpoint.

BASELINE_THROUGHPUT=1850   # tokens/sec, established during the first 1,000 steps
THRESHOLD=0.60
MIN_THROUGHPUT=$(echo "$BASELINE_THROUGHPUT * $THRESHOLD" | bc)
ALERT_COUNT=0

while true; do
    # Parse current throughput from the training logs
    CURRENT=$(tail -1 /persistent-storage/logs/throughput.log | awk '{print $2}')
    if (( $(echo "$CURRENT < $MIN_THROUGHPUT" | bc -l) )); then
        ALERT_COUNT=$((ALERT_COUNT + 1))
        if [ "$ALERT_COUNT" -ge 4 ]; then   # 4 checks × 30s = 2 minutes
            echo "$(date): Throughput degraded. Triggering emergency checkpoint."
            kill -USR1 "$(cat /tmp/training.pid)"   # Signal the training process to save
            ALERT_COUNT=0
        fi
    else
        ALERT_COUNT=0
    fi
    sleep 30
done
```
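The kill -USR1 line assumes the training process actually handles that signal. One way to wire that up, sketched here as an assumption rather than taken from the team's pipeline, is a signal handler that sets a flag which a small Trainer callback converts into an immediate checkpoint at the next step boundary:

```python
import signal
from transformers import TrainerCallback

_EMERGENCY_SAVE = False

def _handle_sigusr1(signum, frame):
    # Set when the health monitor runs `kill -USR1 <training pid>`.
    global _EMERGENCY_SAVE
    _EMERGENCY_SAVE = True

signal.signal(signal.SIGUSR1, _handle_sigusr1)

class EmergencySaveCallback(TrainerCallback):
    """Convert the SIGUSR1 flag into a full checkpoint at the next step boundary."""

    def on_step_end(self, args, state, control, **kwargs):
        global _EMERGENCY_SAVE
        if _EMERGENCY_SAVE:
            control.should_save = True  # the Trainer writes a full checkpoint after this step
            _EMERGENCY_SAVE = False
        return control
```

With this in place, the monitor's signal produces a checkpoint within one training step instead of being ignored.

The Results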
| Metric | Dedicated-Only (Estimated) | Hybrid Spot-Dedicated (Actual) |
|---|---|---|
| Total GPU hours | ~2,600 hrs | ~2,496 hrs |
| Total cost | ~$41,500 | $11,200 |
| Cost per GPU hour | $16.00 | $4.49 |
| Spot interruptions | 0 | 7 |
| Lost compute from interruptions | 0 hrs | 4.2 hrs |
| Training quality impact | Baseline | No measurable difference |
| Time to completion | ~14 days | ~16 days |
The 73% cost reduction came with a 2-day timeline extension — the team spent those extra days on interruption recovery and the initial Spot provisioning queue. For a startup watching its cloud spend, the tradeoff was obvious.
Model quality was unaffected. Their fine-tuned Qwen 72B achieved 91.3% accuracy on contract clause identification (up from 74.8% base model), 88.7% on regulatory risk scoring, and outperformed GPT-4o on 8 of their 12 domain-specific benchmarks.
Cost Breakdown
| Phase | Instance Type | Duration | Rate | Cost |
|---|---|---|---|---|
| Hyperparameter search | 8x H100 Spot | 72 hrs | $11.92/hr | $858 |
| Full training run | 8x H100 Spot | 216 hrs | $11.92/hr | $2,575 |
| Evaluation & benchmarks | 8x H100 Dedicated | 24 hrs | $16.00/hr | $384 |
| Network storage (1.5TB) | Persistent volume | 16 days | $0.10/GB/mo | $78 |
| Total | | 312 hrs (8x H100) | | $11,200 |
The storage cost is often overlooked. Persistent network-attached storage is essential for checkpoint survival across Spot interruptions. Budget $0.08-0.12 per GB per month for NVMe-backed network storage.
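The $78 line item above reproduces cleanly if the volume is billed hourly against a roughly 730-hour month:

```python
volume_gb = 1500           # 1.5 TB persistent volume
rate_per_gb_month = 0.10   # $/GB/month
run_hours = 16 * 24        # volume kept alive for the 16-day run

monthly = volume_gb * rate_per_gb_month      # $150/month
print(f"~${monthly * run_hours / 730:.0f}")  # ~$79, in line with the $78 in the table
```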
What Would Change at Larger Scale
This approach scales, but the economics shift at different model sizes.
For 7B-13B models: Spot instances are almost always the right choice. These models fit comfortably on a single node (and on a single GPU with parameter-efficient methods), checkpoints are small (under 30GB), and recovery is fast. Expected savings: 50-70%.
For 70B models (this case study): The hybrid approach works well. Checkpoints are large (200-400GB) but manageable with network storage. Recovery adds meaningful time but costs are still dramatically lower. Expected savings: 60-75%.
For 200B+ models: Spot becomes riskier. Checkpoints exceed 1TB, recovery time grows, and the probability of interruption during a long training run approaches certainty. Consider using Spot for hyperparameter search only, then switching to Dedicated for the full run. Expected savings: 30-50%.
Key Takeaways
The largest cost savings came not from any single optimization but from the combination of Spot pricing with infrastructure that made Spot interruptions painless. Without automated checkpointing and re-provisioning, Spot instances are a liability. With them, Spot instances are the most cost-effective training infrastructure available.
Three specific decisions made the biggest difference. First, running all hyperparameter experimentation on Spot — this phase is inherently tolerant of interruptions and represents 20-40% of total training compute for most projects. Second, using aggressive checkpoint intervals (every 100 steps for lightweight state, every 500 for full model) — the storage cost is trivial compared to lost GPU hours. Third, building the health monitoring system that detected throughput drops before actual preemption — this saved an estimated 3-4 hours of compute that would otherwise have been lost.
For teams considering this approach, the infrastructure investment is roughly 2-3 days of engineering time to build the checkpointing and monitoring pipeline. That investment pays for itself on the first training run.
On Spheron AI, H100 Spot instances start at $1.49/hr per GPU — with bare metal servers that give you direct GPU access without virtualization overhead. The combination of Spot economics and bare metal performance is what made this cost structure possible.