Case Study: Running 10 Concurrent Fine-Tuning Jobs on Bare Metal H100s — Architecture and Cost Breakdown

Written by Spheron, Feb 24, 2026

Fine-tuning a single model is straightforward. Fine-tuning 10 models simultaneously — each with different datasets, hyperparameters, and quality requirements — is an infrastructure problem that most guides ignore entirely.

This case study documents how an AI API company built a multi-tenant fine-tuning pipeline that runs 10 concurrent QLoRA jobs on bare metal H100 servers. The architecture handles job scheduling, GPU allocation, checkpoint management, and quality validation — all without a single job interfering with another.

The result: 10 customer-specific model variants fine-tuned in parallel, completing in 18 hours instead of the 180 hours it would take running sequentially. Total cost per fine-tuning job: $73.60.

The Business Context

The company operates an AI API platform serving enterprise customers. Each customer needs a model variant fine-tuned on their proprietary data — customer support transcripts, internal documentation, domain-specific terminology, and compliance-sensitive language patterns.

Their customer pipeline looked like this: onboard a new enterprise client, collect their training data (typically 50K-500K instruction-response pairs), fine-tune a model variant, evaluate quality against the customer's benchmarks, then deploy to a dedicated inference endpoint. The entire process from data collection to deployment needed to happen within 48 hours to meet their SLA.

The bottleneck was fine-tuning. Running one job at a time on a single GPU server meant each customer waited in a queue. With 10 customers onboarding in the same week, the queue stretched to 12 days — far beyond their 48-hour SLA.

Hardware Configuration

They provisioned two bare metal servers, each with 8x H100 80GB GPUs connected via NVLink.

Why bare metal over VMs: Fine-tuning workloads are GPU-memory-bound. Virtualization overhead reduces available VRAM by 2-5% and adds latency to GPU memory operations. On an 80GB GPU running QLoRA with a 70B base model, that 2-5% overhead can be the difference between fitting the job and running out of memory. Bare metal eliminates this overhead entirely.

Why H100 over A100: The H100's 80GB of HBM3 delivers 3.35 TB/s of memory bandwidth, roughly 1.65x the A100 80GB's ~2 TB/s. For QLoRA fine-tuning, where the bottleneck is reading quantized base model weights and writing gradient updates to the adapter layers, this bandwidth advantage translates directly into faster training iterations.

Server specifications:

| Component | Per Server | Total (2 Servers) |
|---|---|---|
| GPUs | 8x H100 80GB SXM | 16x H100 |
| Total VRAM | 640 GB | 1,280 GB |
| GPU Interconnect | NVLink 4 (900 GB/s) | |
| System RAM | 2 TB DDR5 | 4 TB |
| Storage | 8 TB NVMe (local) + 20 TB NAS | |
| Network | 400 Gbps InfiniBand | |

The Multi-Tenant Architecture

The core challenge was running 10 independent fine-tuning jobs without interference — no GPU memory leaks between jobs, no shared state corruption, no one job starving another of compute.

GPU Allocation Strategy

Each QLoRA fine-tuning job for a Llama 3.3 70B base model requires approximately 48-55 GB of VRAM: the 4-bit quantized base model (~35 GB), adapter weights (~2 GB), optimizer states (~4 GB), activations and KV cache (~7-14 GB depending on sequence length). This means each job fits on a single H100 80GB GPU with 25-32 GB of headroom.
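
As a sanity check, that budget can be tallied directly. The per-component figures below are the approximations quoted above, not measured values:

python
# Rough per-job VRAM budget for QLoRA on a Llama 3.3 70B base (GB).
# Figures are the approximations from the text, not measurements.
base_model_4bit = 35       # 4-bit NF4 quantized base weights
adapter_weights = 2        # LoRA adapter weights
optimizer_states = 4       # optimizer state for adapter params only
activations_kv = (7, 14)   # activations + KV cache, sequence-length dependent

low = base_model_4bit + adapter_weights + optimizer_states + activations_kv[0]
high = base_model_4bit + adapter_weights + optimizer_states + activations_kv[1]

print(f"Estimated VRAM per job: {low}-{high} GB")               # 48-55 GB
print(f"Headroom on an 80 GB H100: {80 - high}-{80 - low} GB")  # 25-32 GB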

With 16 GPUs across 2 servers, they allocated one GPU per job — running 10 concurrent jobs with 6 GPUs held in reserve for job failures, reruns, and evaluation workloads.

The allocation was static, not dynamic. Each job was pinned to a specific GPU using CUDA_VISIBLE_DEVICES. Static allocation eliminated the risk of GPU memory fragmentation that occurs when jobs dynamically share GPUs.

bash
# Job launcher — each job gets exactly one GPU.
# Jobs 0-7 land on server 0, jobs 8-9 on server 1 (hostnames are placeholders).
SERVERS=("h100-node-0" "h100-node-1")

for i in $(seq 0 9); do
    GPU_ID=$((i % 8))
    SERVER_ID=$((i / 8))

    CMD="CUDA_VISIBLE_DEVICES=$GPU_ID python finetune.py \
        --config configs/customer_${i}.yaml \
        --output_dir /nas/checkpoints/customer_${i} \
        --gpu_id $GPU_ID"

    if [ "$SERVER_ID" -eq 0 ]; then
        eval "$CMD" &                          # local GPUs 0-7
    else
        ssh "${SERVERS[$SERVER_ID]}" "$CMD" &  # dispatch to the second server
    fi
done

QLoRA Configuration

Every job used the same base model (Llama 3.3 70B, 4-bit quantized with bitsandbytes NF4) but different LoRA configurations tuned to each customer's data characteristics.

The base QLoRA configuration:

python
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig
import torch

# Quantization config — same for all jobs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# LoRA config — per-customer tuning
lora_config = LoraConfig(
    r=64,                    # Rank — higher for complex domains
    lora_alpha=128,          # Scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

The LoRA rank (r) varied by customer. Customers with highly specialized domains (medical, legal) used r=128 for more expressive adapters. Customers with general-purpose customization (tone, formatting) used r=32 for faster training and smaller adapter files.
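
A minimal sketch of how that per-customer rank selection might be wired up. The domain-to-rank mapping and the alpha = 2 × r scaling (carried over from the 64/128 base config) are illustrative assumptions, not the company's actual code:

python
from peft import LoraConfig

# Illustrative mapping: specialized domains get higher-rank adapters.
RANK_BY_DOMAIN = {
    "legal": 128,
    "medical": 128,
    "customer_support": 32,
    "tone_formatting": 32,
}
DEFAULT_RANK = 64

def lora_config_for(domain: str) -> LoraConfig:
    r = RANK_BY_DOMAIN.get(domain, DEFAULT_RANK)
    return LoraConfig(
        r=r,
        lora_alpha=2 * r,  # assumption: keep the base config's alpha/r ratio
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )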

Job Scheduling and Monitoring

They built a lightweight job scheduler that tracked each fine-tuning run's progress and handled failures automatically.

python
# Simplified job monitor
import subprocess
import json
import time
from pathlib import Path

class FinetuneJobMonitor:
    def __init__(self, jobs_config_path):
        self.jobs = json.loads(Path(jobs_config_path).read_text())
        self.status = {j["customer_id"]: "pending" for j in self.jobs}

    def check_gpu_health(self, gpu_id, server):
        """Check GPU memory usage and temperature"""
        result = subprocess.run(
            ["ssh", server, f"nvidia-smi --id={gpu_id} --query-gpu=memory.used,temperature.gpu --format=csv,noheader"],
            capture_output=True, text=True
        )
        mem_used, temp = result.stdout.strip().split(", ")
        return {
            "memory_mb": int(mem_used.replace(" MiB", "")),
            "temp_c": int(temp)
        }

    def check_training_progress(self, customer_id):
        """Read latest training metrics from log file"""
        log_path = f"/nas/logs/{customer_id}/trainer_state.json"
        try:
            state = json.loads(Path(log_path).read_text())
            return {
                "step": state["global_step"],
                "loss": state["log_history"][-1].get("loss", None),
                "learning_rate": state["log_history"][-1].get("learning_rate", None),
            }
        except (FileNotFoundError, json.JSONDecodeError, KeyError, IndexError):
            return None

    def run(self, check_interval=60):
        while any(s != "completed" for s in self.status.values()):
            for job in self.jobs:
                cid = job["customer_id"]
                if self.status[cid] in ("completed", "failed"):
                    continue

                progress = self.check_training_progress(cid)
                if progress and progress["step"] >= job["total_steps"]:
                    self.status[cid] = "completed"
                    print(f"[{cid}] Completed at step {progress['step']}")
                elif progress:
                    gpu_health = self.check_gpu_health(
                        job["gpu_id"], job["server"]
                    )
                    print(f"[{cid}] Step {progress['step']}/{job['total_steps']} "
                          f"loss={progress['loss']:.4f} "
                          f"GPU mem={gpu_health['memory_mb']}MB "
                          f"temp={gpu_health['temp_c']}C")

            time.sleep(check_interval)
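
The monitor assumes a jobs config listing each customer's GPU assignment, host, and step budget. A hypothetical example of that file and of launching the monitor (field names come from the class above; the values are made up):

python
# jobs.json (hypothetical contents):
# [
#   {"customer_id": "customer_0", "server": "h100-node-0", "gpu_id": 0, "total_steps": 3000},
#   {"customer_id": "customer_1", "server": "h100-node-0", "gpu_id": 1, "total_steps": 2200}
# ]

monitor = FinetuneJobMonitor("jobs.json")
monitor.run(check_interval=60)  # poll trainer_state.json and GPU health every minute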

Checkpoint and Artifact Management

With 10 jobs running simultaneously, checkpoint storage adds up fast. Each QLoRA checkpoint is relatively small (the adapter weights are typically 200-800 MB depending on LoRA rank), but they saved checkpoints every 200 steps, and each job ran 2,000-5,000 steps.

Storage strategy:

  • Local NVMe for active checkpoints (fast write speed, no network bottleneck during training)
  • Network-attached storage (NAS) for completed checkpoints and final adapters
  • Retention policy: Keep only the 2 most recent checkpoints per job on local storage. Archive final adapters to NAS permanently.
bash
# Checkpoint cleanup cron — runs every 30 minutes.
# Keeps only the 2 most recent checkpoints per job on local NVMe.
for job_dir in /local-nvme/checkpoints/*/; do
    ls -d "${job_dir}"checkpoint-* 2>/dev/null |
        sort -V |       # version sort orders checkpoint-200 before checkpoint-1000
        head -n -2 |    # everything except the 2 newest
        xargs -r rm -rf
done
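
On the training side, the 200-step cadence and the local-NVMe destination would be set in the trainer configuration. A sketch assuming the Hugging Face Trainer (the path and the batch/learning-rate values are illustrative); save_total_limit gives Trainer-side retention that complements the cron cleanup:

python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/local-nvme/checkpoints/customer_0",  # fast local writes during training
    save_strategy="steps",
    save_steps=200,                  # checkpoint every 200 steps, as described above
    save_total_limit=2,              # keep only the 2 most recent checkpoints
    bf16=True,
    per_device_train_batch_size=4,   # illustrative
    gradient_accumulation_steps=4,   # illustrative
    learning_rate=2e-4,              # illustrative
    logging_steps=10,
)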

Performance Results

All 10 fine-tuning jobs completed within 18 hours. The longest job (a legal domain customer with 500K training examples and r=128) took 17.2 hours. The shortest (a customer service tone adaptation with 50K examples and r=32) completed in 4.1 hours.

Training Metrics

| Metric | Average Across 10 Jobs | Range |
|---|---|---|
| Training steps | 3,200 | 1,500 - 5,000 |
| Training time | 11.4 hours | 4.1 - 17.2 hours |
| Peak GPU utilization | 94% | 89% - 97% |
| Peak VRAM usage | 62 GB | 48 - 71 GB |
| Final training loss | 0.82 | 0.61 - 1.14 |
| Adapter size | 480 MB | 190 - 820 MB |

GPU utilization averaged 94% across the 10 active GPUs — indicating that QLoRA fine-tuning on bare metal H100s is almost entirely compute-bound, with minimal idle time from data loading or checkpoint writes.

Quality Validation

Each fine-tuned adapter was evaluated against the customer's held-out test set (10% of their training data, never seen during training) plus a general capability benchmark (MMLU subset) to verify the adapter didn't degrade base model performance.
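
A sketch of how a finished adapter can be attached to the quantized base model for evaluation on a reserve GPU, using transformers and peft. The model ID, adapter path, and scoring harness are placeholders; the customer-specific benchmark logic is not shown:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE = "meta-llama/Llama-3.3-70B-Instruct"     # assumed base model ID
ADAPTER = "/nas/checkpoints/customer_0/final"  # illustrative adapter path

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the 4-bit base model onto one reserve GPU, then attach the trained adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb_config, device_map={"": 0}
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Score the held-out set and the MMLU subset with whatever metric the
# customer benchmark defines (exact match, rubric, LLM judge, etc.).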

| Customer Domain | Base Model Score | Fine-Tuned Score | General Capability Δ |
|---|---|---|---|
| Legal contracts | 71.2% | 93.8% | -0.3% |
| Medical records | 68.5% | 91.2% | -0.5% |
| Customer support | 74.1% | 89.7% | -0.1% |
| Financial analysis | 69.8% | 90.4% | -0.4% |
| Technical docs | 76.3% | 92.1% | -0.2% |
| E-commerce | 72.0% | 87.3% | +0.1% |
| HR/recruiting | 70.5% | 88.9% | -0.2% |
| Insurance claims | 67.9% | 90.6% | -0.6% |
| Real estate | 73.4% | 89.1% | -0.1% |
| Logistics | 71.8% | 87.8% | -0.3% |

Average domain-specific improvement: +18.5 percentage points. Average general capability degradation: -0.26% — effectively zero. This confirms that QLoRA with appropriate rank selection preserves the base model's general capabilities while dramatically improving domain performance.

Cost Analysis

| Item | Cost |
|---|---|
| 2x bare metal 8x H100 servers (18 hrs × $16.00/hr each) | $576 |
| Network-attached storage (20 TB, 1 month) | $160 |
| Total for 10 fine-tuning jobs | $736 |
| Cost per fine-tuning job | $73.60 |
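
The per-job figure falls straight out of the table:

python
# Cost reconstruction from the table above
server_cost = 2 * 18 * 16.00        # 2 servers x 18 hours x $16.00/hr = $576
storage_cost = 160                  # 20 TB NAS for the month
total = server_cost + storage_cost  # $736
print(total / 10)                   # $73.60 per fine-tuning job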

Compare this to the alternatives:

| Approach | Cost per Job | Time for 10 Jobs | Total Cost |
|---|---|---|---|
| Sequential on 1x 8-GPU server | $115 | 7.5 days | $1,150 |
| Concurrent on 2x 8-GPU servers | $73.60 | 18 hours | $736 |
| API-based fine-tuning (typical) | $200-500 | 2-5 days | $2,000-5,000 |
| Managed fine-tuning platform | $300-800 | 1-3 days | $3,000-8,000 |

The concurrent approach is 36% cheaper than sequential (because the servers are fully utilized for 18 hours instead of sitting idle between jobs) and 75-91% cheaper than managed fine-tuning platforms.

Lessons Learned

Static GPU allocation beats dynamic scheduling for fine-tuning. Dynamic GPU schedulers (like Kubernetes with GPU sharing) add complexity and introduce memory fragmentation risks. For fine-tuning workloads where each job's VRAM requirement is predictable, pinning jobs to specific GPUs is simpler and more reliable.

LoRA rank is the most impactful hyperparameter. Across 10 customer deployments, the single variable that most affected final quality was LoRA rank. Domains with specialized vocabulary and reasoning patterns (legal, medical) needed r=128. General-purpose adaptations (tone, formatting) worked well with r=32. Over-provisioning rank wastes training time; under-provisioning caps quality.

Bare metal eliminates the VRAM margin problem. On virtualized GPU instances, VRAM overhead from the hypervisor consumed 2-5 GB per GPU. At r=128 with a 70B base model, some jobs peaked at 71 GB VRAM usage — leaving only 9 GB of headroom on an 80GB GPU. With virtualization overhead, these jobs would have OOM'd. Bare metal gave them the full 80 GB.

Checkpoint storage is cheap — losing a training run is expensive. They budgeted $160/month for 20 TB of network storage. A single lost training run (due to a job failure without recent checkpoints) would cost $57 in wasted GPU time plus hours of delay. Aggressive checkpointing with cheap network storage is always the right tradeoff.

For teams running regular fine-tuning workloads, bare metal GPU servers provide the most predictable and cost-effective infrastructure. On Spheron AI, you can provision multi-GPU bare metal servers with H100, H200, and B300 GPUs — available as both Spot and Dedicated instances through a single console.
