Engineering

Model Merging on GPU Cloud: TIES, DARE, SLERP, and Evolutionary Merging for Custom LLMs (2026 Guide)

Written by Mitrasish, Co-founder · Apr 25, 2026
Model Merging LLM · Mergekit GPU Cloud · TIES DARE SLERP Merging · Evolutionary Model Merging · Merge LLMs Without Training · Task Arithmetic LLM · GPU Cloud · H100 · Open Weight Models

The Hugging Face Open LLM Leaderboard's top spots are consistently held by merged models, not fine-tuned ones. Merging a 70B model takes 45 minutes and costs under $2 on a single H100 instance. That is not a trade-off; it is a different category of operation entirely.

Model merging combines the weight tensors of two or more trained models without gradient descent, new data, or GPU clusters. The result is a single checkpoint you can deploy anywhere a parent model would run. If your goal is task-specific behavior learned from labeled examples, the fine-tuning guide covers the full workflow. If you need a permanently smaller model for cheaper inference, model distillation produces a student that runs at a fraction of the teacher's cost. Merging is the third path: combine capabilities from existing checkpoints in under an hour, no data required.

Three Ways to Customize an LLM Without Pretraining

Before picking a method, it helps to know which problem each one actually solves:

| Approach | What Changes | Needs New Data? | GPU Cost | Time | When to Use |
| --- | --- | --- | --- | --- | --- |
| Fine-tuning | Model weights via gradient updates | Yes | High (hours of training) | Hours to days | New behavior not in base model |
| Distillation | Student weights trained on teacher outputs | Yes (teacher outputs) | High (training) | Hours to days | Permanent inference cost reduction |
| Merging | Weights interpolated from existing checkpoints | No | Near zero | Minutes to 1 hour | Combine capabilities from existing fine-tunes |

For reinforcement learning from verifiable rewards, see GRPO fine-tuning for the RL-based path that teaches models to reason through problems. Merging is complementary: use it when you already have the checkpoints and want to combine what they each do well.

What Model Merging Actually Does

In plain terms: a model's weights are a point in a very high-dimensional parameter space. Two models fine-tuned from the same base end up in nearby regions of that space because they share the same starting point and loss landscape topology. Merging moves between those two points by interpolating the weight tensors directly.

Why does that produce a coherent model? Because models fine-tuned from the same base tend to occupy a connected loss basin. The mode connectivity research from 2020 onward showed that there are low-loss paths between the weight configurations of models with shared initialization. The lottery ticket hypothesis adds another angle: the base model's "winning subnetworks" are preserved through fine-tuning, so merged models still benefit from those subnetworks even when the fine-tuned weights diverge.

What merging cannot do: inject knowledge that is not in any parent, produce a model smaller than the parents, or work across architectures. If your two best models are a Llama 4 fine-tune and a Qwen 3 fine-tune, merging will not work. The weight tensor shapes must match exactly.

The Methods: Linear, SLERP, Task Arithmetic, TIES, DARE, and DARE-TIES

Linear Interpolation

The simplest approach: for each weight tensor W, compute W_merged = alpha * W_A + (1 - alpha) * W_B. At alpha=0.5, each model contributes equally.

Works well for models that are close in weight space (same base, similar fine-tuning). Falls apart at high mixing ratios when the fine-tuning directions conflict, producing what is called catastrophic interference: a model that is worse than either parent on both tasks.

yaml
merge_method: linear
models:
  - model: ./model-a
    parameters:
      weight: 0.5
  - model: ./model-b
    parameters:
      weight: 0.5
dtype: bfloat16

SLERP

Spherical linear interpolation treats weight tensors as vectors and interpolates along the geodesic on the unit sphere rather than the straight Euclidean line. The formula is:

W_merged = sin((1-t)*theta) / sin(theta) * W_A + sin(t*theta) / sin(theta) * W_B

where theta is the angle between the normalized weight vectors.

SLERP preserves the norm of the merged weights, which means it avoids the shrinkage that linear interpolation causes at intermediate t values. In practice this translates to better coherence at t=0.5. SLERP is the best default for two-model blends from the same base family.

yaml
merge_method: slerp
models:
  - model: ./model-a
  - model: ./model-b
base_model: ./base-model
parameters:
  t: 0.5
dtype: bfloat16
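
For intuition, here is a minimal NumPy sketch of the formula above, treating each weight tensor as a flat vector. This is illustrative only; mergekit applies the interpolation per tensor and handles edge cases internally.

python
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical interpolation between two weight tensors treated as flat vectors."""
    a, b = w_a.ravel(), w_b.ravel()
    # theta is the angle between the normalized weight vectors
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if theta < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation
        return (1 - t) * w_a + t * w_b
    scale_a = np.sin((1 - t) * theta) / np.sin(theta)
    scale_b = np.sin(t * theta) / np.sin(theta)
    return (scale_a * a + scale_b * b).reshape(w_a.shape)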

Task Arithmetic

Task arithmetic computes the "task vector" for each fine-tuned model: tau = W_finetuned - W_base. You then scale and add task vectors to the base:

W_merged = W_base + lambda_A * tau_A + lambda_B * tau_B

This is composable. You can add, subtract, and scale task vectors independently. Subtracting a task vector removes a capability (useful for removing toxic behavior that was fine-tuned in). The original Ilharco et al. 2023 paper showed that task vectors from models fine-tuned on different tasks are approximately orthogonal, which is why adding them does not erase the other task's gains.
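
A sketch of that arithmetic on raw state dicts (the helper and tensor names are illustrative; mergekit exposes this as the task_arithmetic merge method):

python
import torch

def task_arithmetic_merge(base_sd: dict, finetuned_sds: list[dict], lambdas: list[float]) -> dict:
    """W_merged = W_base + sum_i lambda_i * tau_i, where tau_i = W_ft_i - W_base."""
    merged = {}
    for name, w_base in base_sd.items():
        delta = torch.zeros_like(w_base)
        for ft_sd, lam in zip(finetuned_sds, lambdas):
            delta += lam * (ft_sd[name] - w_base)  # scaled task vector for this model
        merged[name] = w_base + delta
    return merged

# A negative lambda subtracts a task vector, which removes the corresponding capability.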

TIES

TIES (Trim, Elect Sign, Merge) was designed for the three-or-more-model case where linear averaging produces sign conflicts. When W_A has a positive delta for a parameter and W_B has a negative delta, averaging them produces a near-zero result, discarding both fine-tuned values.

TIES resolves this with three steps:

  1. Trim: zero out the bottom (1 - density) fraction of delta weights by magnitude in each model (removes low-magnitude noise)
  2. Elect sign: for each parameter, take a majority vote on sign direction across all input models
  3. Merge: average only the values that agree with the elected sign

yaml
merge_method: ties
models:
  - model: ./model-a
    parameters:
      weight: 0.4
  - model: ./model-b
    parameters:
      weight: 0.3
  - model: ./model-c
    parameters:
      weight: 0.3
base_model: ./base-model
parameters:
  density: 0.5
  normalize: true
dtype: bfloat16

Use TIES when merging three or more models, or any time you see the merged model performing worse than either parent on both tasks (a sign that linear averaging is canceling deltas).
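
A per-tensor sketch of the three steps, assuming all inputs share the same base model. The helper is illustrative; mergekit's ties method additionally applies the per-model weight and normalize options from the config.

python
import torch

def ties_merge_tensor(base: torch.Tensor, finetuned: list[torch.Tensor], density: float = 0.5) -> torch.Tensor:
    """Trim, elect sign, and merge the deltas for one weight tensor."""
    deltas = [ft - base for ft in finetuned]
    # 1. Trim: keep only the top `density` fraction of each delta by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
    # 2. Elect sign: the majority direction per parameter, weighted by magnitude.
    elected = torch.sign(sum(trimmed))
    # 3. Merge: average only the trimmed values that agree with the elected sign.
    agree = [torch.where(torch.sign(t) == elected, t, torch.zeros_like(t)) for t in trimmed]
    counts = sum((a != 0).to(base.dtype) for a in agree).clamp(min=1.0)
    return base + sum(agree) / counts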

DARE

DARE (Drop And REscale) addresses a different problem: when a fine-tuned model has been trained aggressively (many epochs, high learning rate, merged LoRA adapters), its delta from the base model is large and noisy. Averaging that noisy delta with another model's delta amplifies the noise.

DARE randomly masks out a fraction p of the delta weights and rescales the survivors by 1/(1-p) to maintain expected magnitude. This prunes redundant parameters while keeping the expected value of the delta unchanged. The math: tau_DARE = mask(tau, p) / (1-p).

Run DARE as a preprocessing step before TIES or linear merge when one of your inputs is an aggressively fine-tuned model.
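
The drop-and-rescale step itself is only a few lines. A sketch on a single task vector; note that density in the mergekit configs below is the fraction kept, i.e. 1 - p:

python
import torch

def dare(tau: torch.Tensor, p: float) -> torch.Tensor:
    """Randomly drop a fraction p of the delta weights and rescale the survivors."""
    keep = (torch.rand_like(tau.float()) >= p).to(tau.dtype)  # each element kept with prob 1 - p
    return tau * keep / (1.0 - p)  # rescaling keeps E[tau_DARE] equal to tau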

DARE-TIES

The current strongest general method for multi-model merges. Apply DARE's random pruning to each model's task vector first, then run TIES's sign-election step on the pruned deltas. DARE removes the low-signal noise; TIES resolves the remaining sign conflicts.

yaml
merge_method: dare_ties
models:
  - model: ./model-a
    parameters:
      weight: 0.4
      density: 0.7
  - model: ./model-b
    parameters:
      weight: 0.6
      density: 0.7
base_model: ./base-model
parameters:
  normalize: true
dtype: bfloat16

Evolutionary Merging: When Hand-Tuning Is Not Enough

Sakana AI's 2024 evolutionary merging paper reframed merging as a search problem. Instead of hand-picking coefficients, you treat the merge configuration (weights, densities, per-layer coefficients) as a parameter vector and evolve it using a fitness function based on benchmark performance.

The search algorithm is CMA-ES (Covariance Matrix Adaptation Evolution Strategy), which efficiently explores the coefficient space by maintaining a multivariate Gaussian distribution over candidates and updating it toward high-fitness regions. Each generation samples a batch of merge configurations, evaluates each on your benchmark, and adapts the distribution toward the survivors.

When to use evolutionary merging: you have an automatable evaluation metric (test suite pass rate, GSM8K accuracy, domain benchmark score), and you are merging three or more models where the optimal coefficients are not obvious. The algorithm finds coefficient combinations that no human would arrive at by intuition.

Hardware for a sweep: 100 eval runs, each requiring a full benchmark pass on a freshly merged checkpoint, roughly 7 minutes per run on an H100 PCIe rental for a 13B model. That is about 12 GPU-hours total, or roughly $24 at on-demand rates. Parallelizing across 4 H100 nodes cuts wall-clock time to 3 hours.

Here is the core loop for an evolutionary sweep using Optuna's CMA-ES sampler and the mergekit CLI:

python
import glob
import json
import os
import shutil
import subprocess

import optuna

def evaluate_merge(config_path: str, merge_dir: str, eval_result_path: str) -> float:
    # Run mergekit merge
    subprocess.run(
        ["mergekit-yaml", config_path, merge_dir, "--cuda", "--lazy-unpickle"],
        check=True
    )
    # Run evaluation with lm-evaluation-harness
    subprocess.run(
        ["lm_eval", "--model", "hf", "--model_args", f"pretrained={merge_dir}",
         "--tasks", "gsm8k,mmlu", "--output_path", eval_result_path],
        capture_output=True, text=True, check=True
    )
    # lm-eval v0.4+ treats --output_path as a directory regardless of extension,
    # so traverse it to find the actual results file.
    result_files = glob.glob(
        os.path.join(eval_result_path, "**", "results_*.json"), recursive=True
    )
    if not result_files:
        raise FileNotFoundError(f"No lm-eval results found in {eval_result_path}")
    with open(result_files[0]) as f:
        scores = json.load(f)
    # gsm8k reports exact_match metrics in lm-eval v0.4+; adjust the key if your
    # harness version exposes a different metric name.
    return scores["results"]["gsm8k"]["exact_match,strict-match"]

def objective(trial):
    weight_a = trial.suggest_float("weight_a", 0.2, 0.8)
    density = trial.suggest_float("density", 0.3, 0.9)
    trial_config_path = f"./trial-{trial.number}-config.yaml"
    merge_dir = f"./tmp-merged-{trial.number}"
    eval_result_path = f"./eval-result-{trial.number}.json"

    config = f"""
merge_method: dare_ties
models:
  - model: ./model-a
    parameters:
      weight: {weight_a}
      density: {density}
  - model: ./model-b
    parameters:
      weight: {1 - weight_a}
      density: {density}
base_model: ./base-model
parameters:
  normalize: true
dtype: bfloat16
"""
    with open(trial_config_path, "w") as f:
        f.write(config)
    try:
        return evaluate_merge(trial_config_path, merge_dir, eval_result_path)
    finally:
        for path in [trial_config_path, eval_result_path]:
            try:
                if os.path.isfile(path):
                    os.remove(path)
                elif os.path.isdir(path):
                    shutil.rmtree(path)
            except OSError:
                pass
        try:
            if os.path.exists(merge_dir):
                shutil.rmtree(merge_dir)
        except OSError:
            pass

sampler = optuna.samplers.CmaEsSampler()
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
print(study.best_params)

Run this on H200 instances when sweeping 70B models, where each eval run takes longer and you want the VRAM headroom to avoid OOM on the benchmark pass.

Hardware Requirements: Why Merging a 70B Model Needs Less GPU Than You Think

Mergekit's --lazy-unpickle flag enables layer-by-layer streaming. It loads one transformer block from each source model at a time, merges that block, writes it to the output, then moves to the next. Peak VRAM is determined by the size of one block from each model, not the full model size.

| Model Size | Method | Peak VRAM | Peak RAM | Recommended Instance | On-Demand Rate |
| --- | --- | --- | --- | --- | --- |
| 7B | Any | 14 GB | 28 GB | RTX 4090 | $0.79/hr |
| 13B | Any | 26 GB | 52 GB | RTX 4090 or A100 PCIe | $0.79/hr or $1.07/hr |
| 70B | SLERP/TIES | ~16 GB streaming | 140 GB | H100 PCIe with CPU offload | $2.01/hr |
| 70B | Evolutionary | 80 GB (eval pass) | 140 GB | H100 PCIe | $2.01/hr |
| 120B+ | Any | CPU-only feasible | 256 GB+ | High-RAM CPU node | |

Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For 70B TIES merges with CPU offload, a single H100 PCIe handles the job in about 45 minutes. The merge itself barely stresses the GPU. The bottleneck is loading checkpoints from disk, so fast NVMe storage matters more than GPU count.

Deploying Mergekit on Spheron: Merging Llama 4 Models in One Workflow

Provision an H100 PCIe instance from the Spheron console (see Spheron docs for instance setup). Then:

bash
pip install mergekit transformers accelerate
huggingface-cli login

Write a config.yaml to merge a Llama 4 Scout base with a domain fine-tune using SLERP:

yaml
merge_method: slerp
models:
  - model: meta-llama/Llama-4-Scout-17B-16E-Instruct
  - model: ./llama4-scout-finetuned-domain
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
parameters:
  t: 0.4
dtype: bfloat16

Run the merge with lazy streaming so peak VRAM stays under 20 GB:

bash
mergekit-yaml config.yaml ./output-model \
  --cuda \
  --lazy-unpickle \
  --copy-tokenizer

The --copy-tokenizer flag copies the base model's tokenizer into the output directory so the merged checkpoint is a self-contained HuggingFace repo. Total runtime on an H100: 30-45 minutes for Llama 4 Scout (17B active / 109B total).
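
Before kicking off full benchmarks, a quick generation smoke test catches obvious failures such as garbled output or a broken tokenizer. A minimal sketch against the output directory above; the prompt is arbitrary:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./output-model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Summarize the difference between SLERP and linear interpolation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))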

One licensing note: verify the fine-tuned model's license before publishing the merge output. Llama 4 and most Qwen 3 variants permit derivative model creation, but some fine-tunes add restrictions. Always check the fine-tune's LICENSE file before distributing.

Evaluating Merged Models: MMLU, GSM8K, and LLM-as-Judge

MMLU and GSM8K are the right regression guardrails because they are broad, fast to run, and well-understood. A merged model that drops more than 2-3 points on either has likely experienced catastrophic interference.

bash
lm_eval \
  --model hf \
  --model_args pretrained=./output-model \
  --tasks mmlu,gsm8k \
  --device cuda \
  --output_path ./eval-results/

For generation quality, use an LLM-as-judge approach: run both the parent model and the merged model on a sample of your production queries, then have a judge model score each response. The LLM-as-judge evaluation pipeline guide covers the full setup including judge prompt design and scoring stability.
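
A minimal sketch of the scoring loop, assuming the judge is served behind an OpenAI-compatible endpoint; the base URL, judge model name, and rubric below are placeholders rather than part of the guide above:

python
from openai import OpenAI

# Placeholder endpoint: any OpenAI-compatible server (e.g. a vLLM deployment) works here.
judge = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def judge_score(query: str, response: str) -> float:
    """Ask the judge model for a 1-10 quality score and parse the number."""
    prompt = (
        "Rate the following response to the user query from 1 to 10 for accuracy, "
        "helpfulness, and coherence. Reply with only the number.\n\n"
        f"Query: {query}\n\nResponse: {response}"
    )
    completion = judge.chat.completions.create(
        model="judge-model",  # placeholder: whichever judge model you deploy
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return float(completion.choices[0].message.content.strip())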

A typical evaluation comparison looks like this:

| Model | MMLU | GSM8K | Domain Task Acc | Production Quality (judge) |
| --- | --- | --- | --- | --- |
| Parent A (base fine-tune) | 73.2 | 68.4 | 71.0 | 7.1/10 |
| Parent B (domain fine-tune) | 70.1 | 65.2 | 83.5 | 6.8/10 |
| SLERP t=0.5 | 72.8 | 67.1 | 80.2 | 7.4/10 |
| TIES density=0.5 | 74.1 | 69.3 | 84.1 | 7.8/10 |
| SFT fine-tune baseline | 72.5 | 67.8 | 82.0 | 7.3/10 |

The TIES merge often sits above either parent on the combined metric. That is the combination effect working correctly: each model's strengths survive the sign-election step.

Common Failure Modes

Catastrophic interference. The merged model underperforms both parents on every task. This happens when the fine-tuning directions directly conflict, and linear averaging cancels both. Fix: switch from linear to TIES, reduce density to trim more noisy deltas, or lower the merge coefficient closer to 0.3 to weight one model more heavily.

Merge-induced hallucinations. The merged model's output distribution is slightly off from either parent, producing confident text that neither parent would generate. Monitor perplexity on a held-out set of 500-1000 examples from your domain. A perplexity increase of more than 5% over the better parent signals distribution drift.
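
A sketch of that perplexity check with transformers, assuming held_out is a list of domain strings; the model path is the output directory used throughout this guide:

python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_perplexity(model_dir: str, texts: list[str]) -> float:
    """Average per-example perplexity of a causal LM over a held-out text list."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.bfloat16, device_map="auto"
    )
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # cross-entropy over the sequence
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Flag drift if the merged model's perplexity exceeds the better parent's by more than ~5%.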

License conflicts. Some fine-tune licenses prohibit redistribution of merge derivatives. This catches teams off guard when they merge a popular fine-tune from HuggingFace Hub and then publish the output. Llama 4, Qwen 3, and Mistral variants are generally merge-friendly. Always check before publishing.

Tokenizer mismatch. If the two models have different tokenizers, mergekit will error on shape mismatch before producing any output. This is actually the correct behavior: a silent tokenizer mismatch would produce a working-looking model that generates garbage. Confirm that vocab_size matches in both models' config.json and that tokenizer_class matches in both tokenizer_config.json files before starting.
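
A quick pre-flight check that reads those fields from both model directories (the paths are the placeholders used throughout this guide):

python
import json
from pathlib import Path

def check_merge_compat(dir_a: str, dir_b: str) -> None:
    """Compare vocab_size and tokenizer_class between two model directories."""
    checks = [("config.json", "vocab_size"), ("tokenizer_config.json", "tokenizer_class")]
    for filename, key in checks:
        val_a = json.loads((Path(dir_a) / filename).read_text()).get(key)
        val_b = json.loads((Path(dir_b) / filename).read_text()).get(key)
        status = "OK" if val_a == val_b else "MISMATCH"
        print(f"{key}: {val_a} vs {val_b} -> {status}")

check_merge_compat("./model-a", "./model-b")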

Production Deployment: vLLM and Regression Monitoring

The merged checkpoint is a standard HuggingFace format model. No special vLLM flags or configuration:

bash
vllm serve ./output-model \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --max-model-len 8192

See the vLLM production deployment guide for tensor parallelism configuration, load balancing, and monitoring setup.

For regression monitoring, use an A/B shadow pattern: route 10% of production traffic to the merged model and 90% to the parent. Compare LLM-as-judge scores between the two streams over 24-48 hours. If the merged model's quality score stays within 3% of the parent, it is safe to promote to full traffic. A score drop outside that window means the merge introduced a regression that your offline eval did not catch.

Cost Comparison: Spot GPU Evolutionary Merging vs Fine-Tuning

Using live Spheron on-demand rates as of 25 Apr 2026:

| Approach | Hardware | Duration | Cost |
| --- | --- | --- | --- |
| TIES merge (single pass) | 1x H100 PCIe | 45 min | ~$1.51 |
| Evolutionary merge (100-run sweep, parallel) | 4x H100 PCIe | 3 hrs | ~$24.16 |
| SFT fine-tuning (7B, QLoRA) | 1x RTX 4090 | 4 hrs | ~$3.17 |
| SFT fine-tuning (70B, QLoRA) | 1x H100 PCIe | 10 hrs | ~$20.14 |
| GRPO training (32B) | 4x H200 | 48+ hrs | ~$760+ |

Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.

A 100-run evolutionary sweep costs roughly the same as a single SFT fine-tuning run on a 70B model. If you already have the checkpoints, the sweep is almost always worth running before committing to a new fine-tuning job.

When Merging Wins vs Fine-Tuning vs Distillation

| Situation | Best Approach | Why |
| --- | --- | --- |
| Two existing domain fine-tunes, want combined | Merge (TIES) | No new data needed, under 1 hour |
| No existing fine-tunes, have labeled data | Fine-tune | Merging needs something to merge |
| Need 10x cheaper inference at same quality | Distillation | Produces a smaller model |
| Unknown optimal merge coefficients, have eval metric | Evolutionary merge | Finds combinations humans miss |
| Need reasoning not in training data | GRPO | Merging cannot inject new capabilities |
| Production inference cost matters most | Distillation + quantization | Smallest runtime footprint |
| Adjacent domain fine-tunes exist, tight deadline | Merge first, then fine-tune delta | Fastest path to combined capability |

Merging is not a replacement for fine-tuning. It is what you do when you already have the fine-tunes and want to see if you can skip another training run. Most teams that use merging seriously end up doing both: merge to create a strong starting checkpoint, then fine-tune on the specific delta where the merged model falls short.

Model merging is bursty, parallel, and checkpoint-heavy, which is exactly the workload pattern spot GPUs serve well. Run an evolutionary merge sweep on Spheron for the cost of a single fine-tuning epoch on a hyperscaler, with per-minute billing and no minimum commitment.

Rent H100 for merging → | Rent H200 → | View GPU pricing →
