The Hugging Face Open LLM Leaderboard's top spots are consistently held by merged models, not fine-tuned ones. Merging a 70B model takes 45 minutes and costs under $2 on a single H100 instance. That is not a trade-off; it is a different category of operation entirely.
Model merging combines the weight tensors of two or more trained models without gradient descent, new data, or GPU clusters. The result is a single checkpoint you can deploy anywhere a parent model would run. If your goal is task-specific behavior learned from labeled examples, the fine-tuning guide covers the full workflow. If you need a permanently smaller model for cheaper inference, model distillation produces a student that runs at a fraction of the teacher's cost. Merging is the third path: combine capabilities from existing checkpoints in under an hour, no data required.
Three Ways to Customize an LLM Without Pretraining
Before picking a method, it helps to know which problem each one actually solves:
| Approach | What Changes | Needs New Data? | GPU Cost | Time | When to Use |
|---|---|---|---|---|---|
| Fine-tuning | Model weights via gradient updates | Yes | High (hours of training) | Hours to days | New behavior not in base model |
| Distillation | Student weights trained on teacher outputs | Yes (teacher outputs) | High (training) | Hours to days | Permanent inference cost reduction |
| Merging | Weights interpolated from existing checkpoints | No | Near zero | Minutes to 1 hour | Combine capabilities from existing fine-tunes |
For reinforcement learning from verifiable rewards, see GRPO fine-tuning for the RL-based path that teaches models to reason through problems. Merging is complementary: use it when you already have the checkpoints and want to combine what they each do well.
What Model Merging Actually Does
In plain terms: a model's weights are a point in a very high-dimensional parameter space. Two models fine-tuned from the same base end up in nearby regions of that space because they share the same starting point and loss landscape topology. Merging moves between those two points by interpolating the weight tensors directly.
Why does that produce a coherent model? Because models fine-tuned from the same base tend to occupy a connected loss basin. The mode connectivity research from 2020 onward showed that there are low-loss paths between the weight configurations of models with shared initialization. The lottery ticket hypothesis adds another angle: the base model's "winning subnetworks" are preserved through fine-tuning, so merged models still benefit from those subnetworks even when the fine-tuned weights diverge.
What merging cannot do: inject knowledge that is not in any parent, produce a model smaller than the parents, or work across architectures. If your two best models are a Llama 4 fine-tune and a Qwen 3 fine-tune, merging will not work. The weight tensor shapes must match exactly.
The Methods: Linear, SLERP, Task Arithmetic, TIES, DARE, and DARE-TIES
Linear Interpolation
The simplest approach: for each weight tensor W, compute W_merged = alpha * W_A + (1 - alpha) * W_B. At alpha=0.5, each model contributes equally.
Works well for models that are close in weight space (same base, similar fine-tuning). Falls apart at high mixing ratios when the fine-tuning directions conflict, producing what is called catastrophic interference: a model that is worse than either parent on both tasks.
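The per-tensor arithmetic is a one-liner. A minimal numpy sketch (toy 2x2 tensors for illustration, not a real checkpoint):

```python
import numpy as np

def linear_merge(w_a: np.ndarray, w_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend two weight tensors of identical shape: alpha * W_A + (1 - alpha) * W_B."""
    assert w_a.shape == w_b.shape, "merging requires identical tensor shapes"
    return alpha * w_a + (1 - alpha) * w_b

w_a = np.array([[1.0, 2.0], [3.0, 4.0]])
w_b = np.array([[3.0, 0.0], [1.0, 2.0]])
merged = linear_merge(w_a, w_b, alpha=0.5)  # each entry is the midpoint of the parents
```

A real merge applies this same operation to every tensor in the checkpoint, which is what the mergekit config below expresses declaratively.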
merge_method: linear
models:
- model: ./model-a
parameters:
weight: 0.5
- model: ./model-b
parameters:
weight: 0.5
dtype: bfloat16
SLERP
Spherical linear interpolation treats weight tensors as vectors and interpolates along the geodesic on the unit sphere rather than the straight Euclidean line. The formula is:
W_merged = sin((1-t)*theta) / sin(theta) * W_A + sin(t*theta) / sin(theta) * W_B
where theta is the angle between the normalized weight vectors.
SLERP preserves the norm of the merged weights, which means it avoids the shrinkage that linear interpolation causes at intermediate t values. In practice this translates to better coherence at t=0.5. SLERP is the best default for two-model blends from the same base family.
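The formula above can be sketched directly. This toy version flattens each tensor and computes a single angle per tensor (mergekit's actual implementation differs in details; the unit vectors here are illustrative):

```python
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation along the geodesic between two weight tensors."""
    a, b = w_a.ravel(), w_b.ravel()
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if theta < 1e-6:  # nearly parallel vectors: fall back to linear interpolation
        return (1 - t) * w_a + t * w_b
    coeff_a = np.sin((1 - t) * theta) / np.sin(theta)
    coeff_b = np.sin(t * theta) / np.sin(theta)
    return (coeff_a * a + coeff_b * b).reshape(w_a.shape)

w_a = np.array([1.0, 0.0])
w_b = np.array([0.0, 1.0])
mid = slerp(w_a, w_b, 0.5)  # norm stays 1.0: no midpoint shrinkage
```

Note that linear interpolation of the same two unit vectors would give a midpoint of norm ~0.71; the norm preservation is exactly the shrinkage-avoidance property described above.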
merge_method: slerp
models:
- model: ./model-a
- model: ./model-b
base_model: ./base-model
parameters:
t: 0.5
dtype: bfloat16
Task Arithmetic
Task arithmetic computes the "task vector" for each fine-tuned model: tau = W_finetuned - W_base. You then scale and add task vectors to the base:
W_merged = W_base + lambda_A * tau_A + lambda_B * tau_B
This is composable. You can add, subtract, and scale task vectors independently. Subtracting a task vector removes a capability (useful for removing toxic behavior that was fine-tuned in). The original Ilharco et al. 2023 paper showed that task vectors from models fine-tuned on different tasks are approximately orthogonal, which is why adding them does not erase the other task's gains.
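The composability is easy to see in a toy sketch. The two hypothetical task vectors below are exactly orthogonal, so adding one does not disturb the other, and subtracting one cleanly removes it:

```python
import numpy as np

def task_vector(w_finetuned: np.ndarray, w_base: np.ndarray) -> np.ndarray:
    """tau = W_finetuned - W_base."""
    return w_finetuned - w_base

def apply_task_vectors(w_base, taus, lambdas):
    """W_merged = W_base + sum_i lambda_i * tau_i."""
    merged = w_base.copy()
    for lam, tau in zip(lambdas, taus):
        merged = merged + lam * tau
    return merged

w_base = np.zeros(4)
w_math = np.array([0.2, 0.0, -0.1, 0.0])   # hypothetical math fine-tune
w_code = np.array([0.0, 0.3, 0.0, -0.2])   # hypothetical code fine-tune
tau_math = task_vector(w_math, w_base)
tau_code = task_vector(w_code, w_base)

merged = apply_task_vectors(w_base, [tau_math, tau_code], [1.0, 1.0])
# Subtracting a task vector removes its capability:
removed = apply_task_vectors(merged, [tau_code], [-1.0])
```

In real checkpoints the task vectors are only approximately orthogonal, which is why the lambda coefficients usually need tuning below 1.0.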
TIES
TIES (Trim, Elect Sign, Merge) was designed for the three-or-more-model case where linear averaging produces sign conflicts. When W_A has a positive delta for a parameter and W_B has a negative delta, averaging them produces a near-zero result, discarding both fine-tuned values.
TIES resolves this with three steps:
- Trim: zero out the bottom 1 - density fraction of delta weights by magnitude in each model (removes low-magnitude noise)
- Elect sign: for each parameter, take a majority vote on sign direction across all input models
- Merge: average only the values that agree with the elected sign
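The three steps can be sketched for a single tensor. This is a toy implementation (sign election by summed magnitude, as in the TIES paper); mergekit's production code differs in details:

```python
import numpy as np

def ties_merge(w_base: np.ndarray, finetuned: list, density: float = 0.5) -> np.ndarray:
    """Toy TIES for one tensor: trim, elect sign, merge agreeing deltas."""
    deltas = [w - w_base for w in finetuned]
    trimmed = []
    for d in deltas:
        # Trim: keep only the top `density` fraction of entries by magnitude
        k = int(np.ceil(density * d.size))
        threshold = np.sort(np.abs(d).ravel())[-k]
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))
    stacked = np.stack(trimmed)
    # Elect sign: per parameter, the direction with the larger summed magnitude wins
    elected = np.sign(stacked.sum(axis=0))
    # Merge: average only surviving values that agree with the elected sign
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = np.where(agree, stacked, 0.0).sum(axis=0) / counts
    return w_base + merged_delta

base = np.zeros(4)
fts = [base + np.array([1.0, -0.8, 0.1, 0.0]),
       base + np.array([0.9, 0.7, 0.0, 0.05]),
       base + np.array([-0.2, 0.6, 0.5, 0.0])]
result = ties_merge(base, fts, density=0.5)
```

In this example the second parameter has a sign conflict (-0.8 vs +0.7 and +0.6): plain averaging would nearly cancel it, while sign election keeps the majority direction intact.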
merge_method: ties
models:
- model: ./model-a
parameters:
weight: 0.4
- model: ./model-b
parameters:
weight: 0.3
- model: ./model-c
parameters:
weight: 0.3
base_model: ./base-model
parameters:
density: 0.5
normalize: true
dtype: bfloat16
Use TIES when merging three or more models, or any time you see the merged model performing worse than either parent on both tasks (a sign that linear averaging is canceling deltas).
DARE
DARE (Drop And REscale) addresses a different problem: when a fine-tuned model has been trained aggressively (many epochs, high learning rate, merged LoRA adapters), its delta from the base model is large and noisy. Averaging that noisy delta with another model's delta amplifies the noise.
DARE randomly masks out a fraction p of the delta weights and rescales the survivors by 1/(1-p) to maintain expected magnitude. This prunes redundant parameters while keeping the expected value of the delta unchanged. The math: tau_DARE = mask(tau, p) / (1-p).
Run DARE as a preprocessing step before TIES or linear merge when one of your inputs is an aggressively fine-tuned model.
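The drop-and-rescale math is short enough to verify directly. A sketch showing that the expected value of the delta survives even at 90% drop rate:

```python
import numpy as np

def dare(tau: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Drop a fraction p of delta entries at random, rescale survivors by 1/(1-p)."""
    mask = rng.random(tau.shape) >= p  # each entry survives with probability 1 - p
    return np.where(mask, tau, 0.0) / (1.0 - p)

rng = np.random.default_rng(0)
tau = np.ones(100_000)
tau_dare = dare(tau, p=0.9, rng=rng)
# ~90% of entries are now zero, but the mean delta is preserved (close to 1.0)
mean_delta = tau_dare.mean()
```

The surviving 10% of entries are scaled up 10x, which is why DARE only works well when the delta is redundant: the model must tolerate a few large entries standing in for many small ones.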
DARE-TIES
The current strongest general method for multi-model merges. Apply DARE's random pruning to each model's task vector first, then run TIES's sign-election step on the pruned deltas. DARE removes the low-signal noise; TIES resolves the remaining sign conflicts.
merge_method: dare_ties
models:
- model: ./model-a
parameters:
weight: 0.4
density: 0.7
- model: ./model-b
parameters:
weight: 0.6
density: 0.7
base_model: ./base-model
parameters:
normalize: true
dtype: bfloat16
Evolutionary Merging: When Hand-Tuning Is Not Enough
Sakana AI's 2024 evolutionary merging paper reframed merging as a search problem. Instead of hand-picking coefficients, you treat the merge configuration (weights, densities, per-layer coefficients) as a parameter vector and evolve it using a fitness function based on benchmark performance.
The search algorithm is CMA-ES (Covariance Matrix Adaptation Evolution Strategy), which efficiently explores the coefficient space by maintaining a multivariate Gaussian distribution over candidates and updating it toward high-fitness regions. Each generation samples a batch of merge configurations, evaluates each on your benchmark, and adapts the distribution toward the survivors.
When to use evolutionary merging: you have an automatable evaluation metric (test suite pass rate, GSM8K accuracy, domain benchmark score), and you are merging three or more models where the optimal coefficients are not obvious. The algorithm finds coefficient combinations that no human would arrive at by intuition.
Hardware for a sweep: 100 eval runs, each requiring a full benchmark pass on a freshly merged checkpoint, roughly 7 minutes per run on an H100 PCIe rental for a 13B model. That is about 12 GPU-hours total, or roughly $24 at on-demand rates. Parallelizing across 4 H100 nodes cuts wall-clock time to 3 hours.
Here is the core loop for an evolutionary sweep using optuna and the mergekit Python API:
import glob
import json
import os
import shutil
import subprocess
import optuna
def evaluate_merge(config_path: str, merge_dir: str, eval_result_path: str) -> float:
# Run mergekit merge
subprocess.run(
["mergekit-yaml", config_path, merge_dir, "--cuda", "--lazy-unpickle"],
check=True
)
# Run evaluation with lm-evaluation-harness
subprocess.run(
["lm_eval", "--model", "hf", "--model_args", f"pretrained={merge_dir}",
"--tasks", "gsm8k,mmlu", "--output_path", eval_result_path],
capture_output=True, text=True, check=True
)
# lm-eval v0.4+ treats --output_path as a directory regardless of extension,
# so traverse it to find the actual results file.
result_files = glob.glob(
os.path.join(eval_result_path, "**", "results_*.json"), recursive=True
)
if not result_files:
raise FileNotFoundError(f"No lm-eval results found in {eval_result_path}")
with open(result_files[0]) as f:
scores = json.load(f)
    # gsm8k in lm-eval v0.4 reports exact_match, not acc
    return scores["results"]["gsm8k"]["exact_match,strict-match"]
def objective(trial):
weight_a = trial.suggest_float("weight_a", 0.2, 0.8)
density = trial.suggest_float("density", 0.3, 0.9)
trial_config_path = f"./trial-{trial.number}-config.yaml"
merge_dir = f"./tmp-merged-{trial.number}"
eval_result_path = f"./eval-result-{trial.number}.json"
config = f"""
merge_method: dare_ties
models:
- model: ./model-a
parameters:
weight: {weight_a}
density: {density}
- model: ./model-b
parameters:
weight: {1 - weight_a}
density: {density}
base_model: ./base-model
parameters:
normalize: true
dtype: bfloat16
"""
with open(trial_config_path, "w") as f:
f.write(config)
try:
return evaluate_merge(trial_config_path, merge_dir, eval_result_path)
finally:
for path in [trial_config_path, eval_result_path]:
try:
if os.path.isfile(path):
os.remove(path)
elif os.path.isdir(path):
shutil.rmtree(path)
except OSError:
pass
try:
if os.path.exists(merge_dir):
shutil.rmtree(merge_dir)
except OSError:
pass
sampler = optuna.samplers.CmaEsSampler()
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
print(study.best_params)
Run this on H200 instances when sweeping 70B models, where each eval run takes longer and you want the VRAM headroom to avoid OOM on the benchmark pass.
Hardware Requirements: Why Merging a 70B Model Needs Less GPU Than You Think
Mergekit's --lazy-unpickle flag enables layer-by-layer streaming. It loads one transformer block from each source model at a time, merges that block, writes it to the output, then moves to the next. Peak VRAM is determined by the size of one block from each model, not the full model size.
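A rough back-of-envelope for why the streaming footprint is so small. This estimator counts only the raw weight tensors resident at once (one block per source model); the real peak is higher once CUDA context, staging buffers, and the output shard are counted, which is why the table below quotes ~16 GB rather than the raw floor. The layer count and model list are illustrative assumptions:

```python
def streaming_peak_vram_gb(total_params_b: float, n_layers: int, n_models: int,
                           bytes_per_param: int = 2) -> float:
    """Raw-tensor floor for a layer-streaming merge: one transformer block
    per source model resident at a time (bf16 = 2 bytes per parameter)."""
    params_per_block = total_params_b * 1e9 / n_layers
    return params_per_block * bytes_per_param * n_models / 1e9

# Hypothetical 70B model with 80 layers, merging two fine-tunes plus the base:
floor_gb = streaming_peak_vram_gb(70, 80, 3)
```

The floor works out to a few gigabytes, which is why the bottleneck shifts from VRAM to disk throughput.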
| Model Size | Method | Peak VRAM | Peak RAM | Recommended Instance | On-Demand Rate |
|---|---|---|---|---|---|
| 7B | Any | 14 GB | 28 GB | RTX 4090 | $0.79/hr |
| 13B | Any | 26 GB | 52 GB | RTX 4090 or A100 PCIe | $0.79/hr or $1.07/hr |
| 70B | SLERP/TIES | ~16 GB streaming | 140 GB | H100 PCIe with CPU offload | $2.01/hr |
| 70B | Evolutionary | 80 GB (eval pass) | 140 GB | H100 PCIe | $2.01/hr |
| 120B+ | Any | CPU-only feasible | 256 GB+ | High-RAM CPU node | — |
Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.
For 70B TIES merges with CPU offload, a single H100 PCIe handles the job in about 45 minutes. The merge itself barely stresses the GPU. The bottleneck is loading checkpoints from disk, so fast NVMe storage matters more than GPU count.
Deploying Mergekit on Spheron: Merging Llama 4 Models in One Workflow
Provision an H100 PCIe instance from the Spheron console (see Spheron docs for instance setup). Then:
pip install mergekit transformers accelerate
huggingface-cli login
Write a config.yaml to merge a Llama 4 Scout base with a domain fine-tune using SLERP:
merge_method: slerp
models:
- model: meta-llama/Llama-4-Scout-17B-16E-Instruct
- model: ./llama4-scout-finetuned-domain
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
parameters:
t: 0.4
dtype: bfloat16
Run the merge with CPU offload to keep VRAM under 20 GB:
mergekit-yaml config.yaml ./output-model \
--cuda \
--lazy-unpickle \
--copy-tokenizer
The --copy-tokenizer flag copies the base model's tokenizer into the output directory so the merged checkpoint is a self-contained HuggingFace repo. Total runtime on an H100: 30-45 minutes for Llama 4 Scout (17B active / 109B total).
One licensing note: verify the fine-tuned model's license before publishing the merge output. Llama 4 and most Qwen 3 variants permit derivative model creation, but some fine-tunes add restrictions. Always check the fine-tune's LICENSE file before distributing.
Evaluating Merged Models: MMLU, GSM8K, and LLM-as-Judge
MMLU and GSM8K are the right regression guardrails because they are broad, fast to run, and well-understood. A merged model that drops more than 2-3 points on either has likely experienced catastrophic interference.
lm_eval \
--model hf \
--model_args pretrained=./output-model \
--tasks mmlu,gsm8k \
--device cuda \
--output_path ./eval-results/
For generation quality, use an LLM-as-judge approach: run both the parent model and the merged model on a sample of your production queries, then have a judge model score each response. The LLM-as-judge evaluation pipeline guide covers the full setup including judge prompt design and scoring stability.
A typical evaluation comparison looks like this:
| Model | MMLU | GSM8K | Domain Task Acc | Production Quality (judge) |
|---|---|---|---|---|
| Parent A (base fine-tune) | 73.2 | 68.4 | 71.0 | 7.1/10 |
| Parent B (domain fine-tune) | 70.1 | 65.2 | 83.5 | 6.8/10 |
| SLERP t=0.5 | 72.8 | 67.1 | 80.2 | 7.4/10 |
| TIES density=0.5 | 74.1 | 69.3 | 84.1 | 7.8/10 |
| SFT fine-tune baseline | 72.5 | 67.8 | 82.0 | 7.3/10 |
The TIES merge often sits above either parent on the combined metric. That is the combination effect working correctly: each model's strengths survive the sign-election step.
Common Failure Modes
Catastrophic interference. The merged model underperforms both parents on every task. This happens when the fine-tuning directions directly conflict, and linear averaging cancels both. Fix: switch from linear to TIES, reduce density to trim more noisy deltas, or lower the merge coefficient closer to 0.3 to weight one model more heavily.
Merge-induced hallucinations. The merged model's output distribution is slightly off from either parent, producing confident text that neither parent would generate. Monitor perplexity on a held-out set of 500-1000 examples from your domain. A perplexity increase of more than 5% over the better parent signals distribution drift.
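The drift check itself is simple arithmetic. In practice the per-token log-probs come from scoring your held-out examples with each model; the numbers below are hypothetical stand-ins for that step:

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity = exp of the negative mean log-probability over a token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def drift_alert(parent_ppl: float, merged_ppl: float, threshold: float = 0.05) -> bool:
    """Flag the merge when perplexity rises more than `threshold` over the better parent."""
    return (merged_ppl - parent_ppl) / parent_ppl > threshold

# Hypothetical log-probs; in practice, score 500-1000 domain examples per model
parent_ppl = perplexity([-1.8, -2.1, -1.5, -2.0])
merged_ppl = perplexity([-2.0, -2.3, -1.6, -2.1])
alert = drift_alert(parent_ppl, merged_ppl)  # ~16% increase, well over the 5% threshold
```

Averaging over a fixed held-out set rather than live traffic keeps the comparison stable between runs.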
License conflicts. Some fine-tune licenses prohibit redistribution of merge derivatives. This catches teams off guard when they merge a popular fine-tune from HuggingFace Hub and then publish the output. Llama 4, Qwen 3, and Mistral variants are generally merge-friendly. Always check before publishing.
Tokenizer mismatch. If the two models have different tokenizers, mergekit will error on shape mismatch before producing any output. This is actually the correct behavior: a silent tokenizer mismatch would produce a working-looking model that generates garbage. Confirm vocab_size and tokenizer_class in both config.json files before starting.
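A preflight check is cheap insurance before a long merge. A sketch of the comparison, using hypothetical config dicts (in practice, load each model's config.json and tokenizer_config.json and pull the fields from there):

```python
def tokenizers_compatible(config_a: dict, config_b: dict) -> bool:
    """Compare the fields that must match before merging two checkpoints."""
    keys = ("vocab_size", "tokenizer_class")
    return all(config_a.get(k) == config_b.get(k) for k in keys)

# Hypothetical configs: two Llama-family fine-tunes vs a Qwen-family model
llama_a = {"vocab_size": 128256, "tokenizer_class": "PreTrainedTokenizerFast"}
llama_b = {"vocab_size": 128256, "tokenizer_class": "PreTrainedTokenizerFast"}
qwen_c = {"vocab_size": 151936, "tokenizer_class": "Qwen2Tokenizer"}

ok = tokenizers_compatible(llama_a, llama_b)       # same family: mergeable
mismatch = tokenizers_compatible(llama_a, qwen_c)  # different vocab: will not merge
```

Running this before provisioning the GPU avoids paying for an instance only to hit the shape-mismatch error on the first tensor.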
Production Deployment: vLLM and Regression Monitoring
The merged checkpoint is a standard HuggingFace format model. No special vLLM flags or configuration:
vllm serve ./output-model \
--dtype bfloat16 \
--tensor-parallel-size 2 \
--max-model-len 8192
See the vLLM production deployment guide for tensor parallelism configuration, load balancing, and monitoring setup.
For regression monitoring, use an A/B shadow pattern: route 10% of production traffic to the merged model and 90% to the parent. Compare LLM-as-judge scores between the two streams over 24-48 hours. If the merged model's quality score stays within 3% of the parent, it is safe to promote to full traffic. A score drop outside that window means the merge introduced a regression that your offline eval did not catch.
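One detail that matters for the shadow pattern: route by request identity rather than a fresh coin flip, so a retried request always hits the same variant and judge scores stay comparable. A minimal sketch (the 10% split and seed are the assumptions here, not a fixed API):

```python
import random

def shadow_route(request_id: int, merged_fraction: float = 0.10, seed: int = 42) -> str:
    """Deterministic A/B assignment: the same request_id always gets the same variant."""
    rng = random.Random(f"{seed}:{request_id}")
    return "merged" if rng.random() < merged_fraction else "parent"

routes = [shadow_route(i) for i in range(10_000)]
merged_share = routes.count("merged") / len(routes)  # close to 0.10
```

Changing the seed reshuffles the assignment, which is useful when you want a fresh 10% cohort for the next merge candidate.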
Cost Comparison: Spot GPU Evolutionary Merging vs Fine-Tuning
Using live Spheron on-demand rates as of 25 Apr 2026:
| Approach | Hardware | Duration | Cost |
|---|---|---|---|
| TIES merge (single pass) | 1x H100 PCIe | 45 min | ~$1.51 |
| Evolutionary merge (100-run sweep, parallel) | 4x H100 PCIe | 3 hrs | ~$24.16 |
| SFT fine-tuning (7B, QLoRA) | 1x RTX 4090 | 4 hrs | ~$3.17 |
| SFT fine-tuning (70B, QLoRA) | 1x H100 PCIe | 10 hrs | ~$20.14 |
| GRPO training (32B) | 4x H200 | 48+ hrs | ~$760+ |
Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.
A 100-run evolutionary sweep costs roughly the same as a single SFT fine-tuning run on a 70B model. If you already have the checkpoints, the sweep is almost always worth running before committing to a new fine-tuning job.
When Merging Wins vs Fine-Tuning vs Distillation
| Situation | Best Approach | Why |
|---|---|---|
| Two existing domain fine-tunes, want combined | Merge (TIES) | No new data needed, under 1 hour |
| No existing fine-tunes, have labeled data | Fine-tune | Merging needs something to merge |
| Need 10x cheaper inference at same quality | Distillation | Produces a smaller model |
| Unknown optimal merge coefficients, have eval metric | Evolutionary merge | Finds combinations humans miss |
| Need reasoning not in training data | GRPO | Merging cannot inject new capabilities |
| Production inference cost matters most | Distillation + quantization | Smallest runtime footprint |
| Adjacent domain fine-tunes exist, tight deadline | Merge first, then fine-tune delta | Fastest path to combined capability |
Merging is not a replacement for fine-tuning. It is what you do when you already have the fine-tunes and want to see if you can skip another training run. Most teams that use merging seriously end up doing both: merge to create a strong starting checkpoint, then fine-tune on the specific delta where the merged model falls short.
Model merging is burst, parallel, and checkpoint-heavy, which is exactly the workload pattern spot GPUs serve well. Run an evolutionary merge sweep on Spheron for the cost of a single fine-tuning epoch on a hyperscaler, with per-minute billing and no minimum commitment.
