
Model Merging on GPU Cloud: TIES, DARE, SLERP, and Evolutionary Merging for Custom LLMs (2026 Guide)


The top spots on the Hugging Face Open LLM Leaderboard have frequently been held by merged models rather than plain fine-tunes. Merging a 70B model takes about 45 minutes and costs under $2 on a single H100 instance. That is not a trade-off; it is a different category of operation entirely.

Model merging combines the weight tensors of two or more trained models without gradient descent, new data, or GPU clusters. The result is a single checkpoint you can deploy anywhere a parent model would run. If your goal is task-specific behavior learned from labeled examples, the fine-tuning guide covers the full workflow. If you need a permanently smaller model for cheaper inference, model distillation produces a student that runs at a fraction of the teacher's cost. Merging is the third path: combine capabilities from existing checkpoints in under an hour, no data required.

Three Ways to Customize an LLM Without Pretraining

Before picking a method, it helps to know which problem each one actually solves:

| Approach | What Changes | Needs New Data? | GPU Cost | Time | When to Use |
| --- | --- | --- | --- | --- | --- |
| Fine-tuning | Model weights via gradient updates | Yes | High (hours of training) | Hours to days | New behavior not in base model |
| Distillation | Student weights trained on teacher outputs | Yes (teacher outputs) | High (training) | Hours to days | Permanent inference cost reduction |
| Merging | Weights interpolated from existing checkpoints | No | Near zero | Minutes to 1 hour | Combine capabilities from existing fine-tunes |

For reinforcement learning from verifiable rewards, see GRPO fine-tuning for the RL-based path that teaches models to reason through problems. Merging is complementary: use it when you already have the checkpoints and want to combine what they each do well.

What Model Merging Actually Does

In plain terms: a model's weights are a point in a very high-dimensional parameter space. Two models fine-tuned from the same base end up in nearby regions of that space because they share the same starting point and loss landscape topology. Merging moves between those two points by interpolating the weight tensors directly.

Why does that produce a coherent model? Because models fine-tuned from the same base tend to occupy a connected loss basin. The mode connectivity research from 2020 onward showed that there are low-loss paths between the weight configurations of models with shared initialization. The lottery ticket hypothesis adds another angle: the base model's "winning subnetworks" are preserved through fine-tuning, so merged models still benefit from those subnetworks even when the fine-tuned weights diverge.

What merging cannot do: inject knowledge that is not in any parent, produce a model smaller than the parents, or work across architectures. If your two best models are a Llama 4 fine-tune and a Qwen 3 fine-tune, merging will not work. The weight tensor shapes must match exactly.

The Methods: Linear, SLERP, Task Arithmetic, TIES, DARE, and DARE-TIES

Linear Interpolation

The simplest approach: for each weight tensor W, compute W_merged = alpha * W_A + (1 - alpha) * W_B. At alpha=0.5, each model contributes equally.

Works well for models that are close in weight space (same base, similar fine-tuning). Falls apart at high mixing ratios when the fine-tuning directions conflict, producing what is called catastrophic interference: a model that is worse than either parent on both tasks.

yaml
merge_method: linear
models:
  - model: ./model-a
    parameters:
      weight: 0.5
  - model: ./model-b
    parameters:
      weight: 0.5
dtype: bfloat16
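
As a rough illustration of what the config above asks for, the sketch below applies the same formula to toy checkpoints stored as plain Python lists. The names `linear_merge`, `model_a`, and `model_b` are invented for this example and are not mergekit internals; a real merge operates on full tensors loaded from disk.

```python
from typing import Dict, List

Tensor = List[float]  # a flattened weight tensor, for illustration only


def linear_merge(state_a: Dict[str, Tensor], state_b: Dict[str, Tensor],
                 alpha: float = 0.5) -> Dict[str, Tensor]:
    """Return alpha * W_A + (1 - alpha) * W_B for every tensor name."""
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must contain the same tensor names")
    merged = {}
    for name, w_a in state_a.items():
        w_b = state_b[name]
        if len(w_a) != len(w_b):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * a + (1 - alpha) * b for a, b in zip(w_a, w_b)]
    return merged


# Toy checkpoints (hypothetical values).
model_a = {"layer0.weight": [1.0, -2.0, 4.0]}
model_b = {"layer0.weight": [3.0, 2.0, 0.0]}
merged = linear_merge(model_a, model_b, alpha=0.5)
print(merged)  # {'layer0.weight': [2.0, 0.0, 2.0]}
```

The middle parameter shows the cancellation problem in miniature: values of opposite sign average to 0.0, which is the same effect that causes interference when fine-tuning directions conflict.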

SLERP

Spherical linear interpolation treats weight tensors as vectors and interpolates along the geodesic on the unit sphere rather than the straight Euclidean line. The formula is:

W_merged = sin((1-t)*theta) / sin(theta) * W_A + sin(t*theta) / sin(theta) * W_B

where theta is the angle between the normalized weight vectors.

SLERP preserves the norm of the merged weights, which means it avoids the shrinkage that linear interpolation causes at intermediate t values. In practice this translates to better coherence at t=0.5. SLERP is the best default for two-model blends from the same base family.

yaml
merge_method: slerp
models:
  - model: ./model-a
  - model: ./model-b
# mergekit's SLERP takes exactly two models, and base_model must be one of them
base_model: ./model-a
parameters:
  t: 0.5
dtype: bfloat16
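
To make the SLERP formula concrete, here is a minimal sketch on flattened weight vectors using only the standard library. The function name `slerp` and the fallback to linear interpolation for near-parallel or zero vectors are choices made for this example, not a description of mergekit's implementation.

```python
import math
from typing import List


def slerp(w_a: List[float], w_b: List[float], t: float, eps: float = 1e-8) -> List[float]:
    """Spherical interpolation; t=0 returns w_a and t=1 returns w_b."""
    linear = [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    norm_a = math.sqrt(sum(a * a for a in w_a))
    norm_b = math.sqrt(sum(b * b for b in w_b))
    if norm_a < eps or norm_b < eps:
        return linear  # the angle is undefined for a zero vector
    cos_theta = sum(a * b for a, b in zip(w_a, w_b)) / (norm_a * norm_b)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding error
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        return linear  # (anti)parallel vectors: the formula would divide by ~0
    coef_a = math.sin((1 - t) * theta) / sin_theta
    coef_b = math.sin(t * theta) / sin_theta
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]


def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


# Two orthogonal unit vectors (toy values).
w_a, w_b = [1.0, 0.0], [0.0, 1.0]
mid_slerp = slerp(w_a, w_b, t=0.5)
mid_linear = [0.5 * a + 0.5 * b for a, b in zip(w_a, w_b)]
print(norm(mid_slerp))   # close to 1.0: the norm is preserved
print(norm(mid_linear))  # close to 0.707: the linear midpoint shrinks
```

The gap between the two printed norms is the shrinkage that the text attributes to linear interpolation at intermediate t values.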

Task Arithmetic

Task arithmetic computes the "task vector" for each fine-tuned model: tau = W_finetuned - W_base. You then scale and add task vectors to the base:

W_merged = W_base + lambda_A * tau_A + lambda_B * tau_B

This is composable. You can add, subtract, and scale task vectors independently. Subtracting a task vector removes a capability (useful for removing toxic behavior that was fine-tuned in). The original Ilharco et al. 2023 paper showed that task vectors from models fine-tuned on different tasks are approximately orthogonal, which is why adding them does not erase the other task's gains.
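
The add, subtract, and scale operations can be sketched in a few lines. This is a toy version on flat Python lists; `task_vector` and `apply_task_vectors` are hypothetical helper names, and the weight values are invented for illustration.

```python
from typing import List, Sequence


def task_vector(finetuned: List[float], base: List[float]) -> List[float]:
    """tau = W_finetuned - W_base."""
    return [f - b for f, b in zip(finetuned, base)]


def apply_task_vectors(base: List[float], vectors: Sequence[List[float]],
                       lambdas: Sequence[float]) -> List[float]:
    """W_merged = W_base + sum_i lambda_i * tau_i."""
    merged = list(base)
    for tau, lam in zip(vectors, lambdas):
        merged = [m + lam * t for m, t in zip(merged, tau)]
    return merged


base = [1.0, 1.0, 1.0]
tuned_a = [1.5, 1.0, 1.0]  # fine-tune A only moved the first weight
tuned_b = [1.0, 1.0, 0.5]  # fine-tune B only moved the last weight

tau_a = task_vector(tuned_a, base)
tau_b = task_vector(tuned_b, base)

# Adding both task vectors keeps both changes.
print(apply_task_vectors(base, [tau_a, tau_b], [1.0, 1.0]))  # [1.5, 1.0, 0.5]
# A negative lambda moves away from task A's direction.
print(apply_task_vectors(base, [tau_a], [-1.0]))             # [0.5, 1.0, 1.0]
```

Because the two toy task vectors touch different coordinates, adding them loses nothing, which is the idealized version of the near-orthogonality the paper reports.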

TIES

TIES (Trim, Elect Sign, Merge) was designed for the three-or-more-model case where linear averaging produces sign conflicts. When W_A has a positive delta for a parameter and W_B has a negative delta, averaging them produces a near-zero result, discarding both fine-tuned values.

TIES resolves this with three steps:

  1. Trim: zero out the bottom 1 - density fraction of delta weights by magnitude in each model (removes low-magnitude noise)
  2. Elect sign: for each parameter, sum the trimmed deltas across all input models and take the sign of that sum, so the direction with more total magnitude wins
  3. Merge: average only the values that agree with the elected sign

yaml
merge_method: ties
models:
  - model: ./model-a
    parameters:
      weight: 0.4
  - model: ./model-b
    parameters:
      weight: 0.3
  - model: ./model-c
    parameters:
      weight: 0.3
base_model: ./base-model
parameters:
  density: 0.5
  normalize: true
dtype: bfloat16

Use TIES when merging three or more models, or any time you see the merged model performing worse than either parent on both tasks (a sign that linear averaging is canceling deltas).
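
The trim, elect, and merge steps can be sketched on toy deltas as follows. This is a simplified, unweighted version: it ignores the per-model `weight` and the `normalize` option from the config above, and the helper names `trim` and `ties_merge` are invented for this example. Sign election here uses the sign of the summed trimmed deltas.

```python
from typing import List


def trim(delta: List[float], density: float) -> List[float]:
    """Keep the top `density` fraction of entries by magnitude, zero the rest."""
    k = int(round(len(delta) * density))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def ties_merge(base: List[float], finetuned: List[List[float]],
               density: float = 0.5) -> List[float]:
    deltas = [trim([f - b for f, b in zip(ft, base)], density) for ft in finetuned]
    merged = []
    for i, w_base in enumerate(base):
        column = [d[i] for d in deltas]
        total = sum(column)
        elected = 1.0 if total > 0 else -1.0 if total < 0 else 0.0
        # Disjoint mean: average only the entries that share the elected sign.
        agreeing = [v for v in column if v * elected > 0]
        mean = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(w_base + mean)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [0.9, 0.1, -0.6, 0.05]
model_b = [-0.2, 0.8, 0.7, 0.02]
model_c = [0.5, -0.03, 0.6, 0.01]

merged = ties_merge(base, [model_a, model_b, model_c], density=0.5)
print(merged)  # approximately [0.7, 0.8, 0.65, 0.0]
```

On the third parameter a plain average of the three deltas would give roughly 0.23, because model A's -0.6 drags it down. After sign election, only the two agreeing values (0.7 and 0.6) are averaged.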

DARE

DARE (Drop And REscale) addresses a different problem: when a fine-tuned model has been trained aggressively (many epochs, high learning rate, merged LoRA adapters), its delta from the base model is large and noisy. Averaging that noisy delta with another model's delta amplifies the noise.

DARE randomly masks out a fraction p of the delta weights and rescales the survivors by 1/(1-p) to maintain expected magnitude. This prunes redundant parameters while keeping the expected value of the delta unchanged. The math: tau_DARE = mask(tau, p) / (1-p).

Run DARE as a preprocessing step before TIES or linear merge when one of your inputs is an aggressively fine-tuned model.
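
A minimal sketch of the drop-and-rescale step, assuming a flat list of delta weights and a seeded random generator so the result is reproducible. The function name `dare` and the toy delta are invented for this example. Note that mergekit's `density` parameter expresses the kept fraction, which corresponds to 1 - p here.

```python
import random
from typing import List


def dare(delta: List[float], drop_rate: float, seed: int = 0) -> List[float]:
    """Randomly zero a fraction `drop_rate` of entries, rescale the rest by 1/(1-p)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


# A toy delta where every entry is 0.01.
delta = [0.01] * 100_000
pruned = dare(delta, drop_rate=0.9)

kept = sum(1 for d in pruned if d != 0.0)
print(kept / len(pruned))         # roughly 0.1 of the entries survive
print(sum(pruned) / len(pruned))  # roughly 0.01, the original mean
```

About 90% of the entries are gone, yet the average delta is nearly unchanged, which is what "keeping the expected value" means in practice.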

DARE-TIES

DARE-TIES is a strong general-purpose default for multi-model merges. Apply DARE's random pruning to each model's task vector first, then run TIES's sign-election step on the pruned deltas. DARE removes the low-signal noise; TIES resolves the remaining sign conflicts.

yaml
merge_method: dare_ties
models:
  - model: ./model-a
    parameters:
      weight: 0.4
      density: 0.7
  - model: ./model-b
    parameters:
      weight: 0.6
      density: 0.7
base_model: ./base-model
parameters:
  normalize: true
dtype: bfloat16

Evolutionary Merging: When Hand-Tuning Is Not Enough

Sakana AI's 2024 evolutionary merging paper reframed merging as a search problem. Instead of hand-picking coefficients, you treat the merge configuration (weights, densities, per-layer coefficients) as a parameter vector and evolve it using a fitness function based on benchmark performance.

The search algorithm is CMA-ES (Covariance Matrix Adaptation Evolution Strategy), which efficiently explores the coefficient space by maintaining a multivariate Gaussian distribution over candidates and updating it toward high-fitness regions. Each generation samples a batch of merge configurations, evaluates each on your benchmark, and adapts the distribution toward the survivors.

When to use evolutionary merging: you have an automatable evaluation metric (test suite pass rate, GSM8K accuracy, domain benchmark score), and you are merging three or more models where the optimal coefficients are not obvious. The algorithm finds coefficient combinations that no human would arrive at by intuition.

Hardware for a sweep: 100 eval runs, each requiring a full benchmark pass on a freshly merged checkpoint, roughly 7 minutes per run on an H100 PCIe rental for a 13B model. That is about 12 GPU-hours total, or roughly $24 at on-demand rates. Parallelizing across 4 H100 nodes cuts wall-clock time to 3 hours.
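
The arithmetic behind those figures is easy to re-run for other run counts, run times, or rates. The helper below is a back-of-the-envelope sketch; `sweep_cost` is a made-up name, and the $2.01/hr rate is the H100 PCIe on-demand figure quoted in this article, which may have changed.

```python
from typing import Dict


def sweep_cost(n_runs: int, minutes_per_run: float,
               rate_per_gpu_hour: float, n_gpus: int = 1) -> Dict[str, float]:
    """Estimate GPU-hours, wall-clock hours, and cost, assuming perfect parallelism."""
    gpu_hours = n_runs * minutes_per_run / 60
    return {
        "gpu_hours": gpu_hours,
        "wall_clock_hours": gpu_hours / n_gpus,
        "cost_usd": gpu_hours * rate_per_gpu_hour,
    }


estimate = sweep_cost(n_runs=100, minutes_per_run=7, rate_per_gpu_hour=2.01, n_gpus=4)
for key, value in estimate.items():
    print(f"{key}: {value:.2f}")
# About 11.67 GPU-hours, 2.92 wall-clock hours, and $23.45 before rounding up.
```

The text rounds these to 12 GPU-hours, about $24, and 3 hours. Real sweeps will run somewhat longer because merge time and scheduling overhead are not perfectly parallel.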

Here is the core loop for an evolutionary sweep using optuna's CMA-ES sampler and the mergekit CLI:

python
import glob
import json
import os
import shutil
import subprocess

import optuna

def evaluate_merge(config_path: str, merge_dir: str, eval_result_path: str) -> float:
    # Run mergekit merge
    subprocess.run(
        ["mergekit-yaml", config_path, merge_dir, "--cuda", "--lazy-unpickle"],
        check=True
    )
    # Run evaluation with lm-evaluation-harness
    subprocess.run(
        ["lm_eval", "--model", "hf", "--model_args", f"pretrained={merge_dir}",
         "--tasks", "gsm8k,mmlu", "--output_path", eval_result_path],
        capture_output=True, text=True, check=True
    )
    # lm-eval v0.4+ treats --output_path as a directory regardless of extension,
    # so traverse it to find the actual results file.
    result_files = glob.glob(
        os.path.join(eval_result_path, "**", "results_*.json"), recursive=True
    )
    if not result_files:
        raise FileNotFoundError(f"No lm-eval results found in {eval_result_path}")
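    # Use the newest results file in case the directory holds more than one.
    with open(max(result_files, key=os.path.getmtime)) as f:
        scores = json.load(f)
    # lm-eval v0.4+ reports GSM8K as exact match, not "acc,none".
    return scores["results"]["gsm8k"]["exact_match,strict-match"]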

def objective(trial):
    weight_a = trial.suggest_float("weight_a", 0.2, 0.8)
    density = trial.suggest_float("density", 0.3, 0.9)
    trial_config_path = f"./trial-{trial.number}-config.yaml"
    merge_dir = f"./tmp-merged-{trial.number}"
    eval_result_path = f"./eval-result-{trial.number}.json"

    config = f"""
merge_method: dare_ties
models:
  - model: ./model-a
    parameters:
      weight: {weight_a}
      density: {density}
  - model: ./model-b
    parameters:
      weight: {1 - weight_a}
      density: {density}
base_model: ./base-model
parameters:
  normalize: true
dtype: bfloat16
"""
    with open(trial_config_path, "w") as f:
        f.write(config)
    try:
        return evaluate_merge(trial_config_path, merge_dir, eval_result_path)
    finally:
        for path in [trial_config_path, eval_result_path]:
            try:
                if os.path.isfile(path):
                    os.remove(path)
                elif os.path.isdir(path):
                    shutil.rmtree(path)
            except OSError:
                pass
        try:
            if os.path.exists(merge_dir):
                shutil.rmtree(merge_dir)
        except OSError:
            pass

# CmaEsSampler relies on the optional `cmaes` package (pip install cmaes).
sampler = optuna.samplers.CmaEsSampler()
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
print(study.best_params)

Run this on H200 instances when sweeping 70B models, where each eval run takes longer and you want the VRAM headroom to avoid OOM on the benchmark pass.

Hardware Requirements: Why Merging a 70B Model Needs Less GPU Than You Think

Mergekit's --lazy-unpickle flag enables layer-by-layer streaming. It loads one transformer block from each source model at a time, merges that block, writes it to the output, then moves to the next. Peak VRAM is determined by the size of one block from each model, not the full model size.

| Model Size | Method | Peak VRAM | Peak RAM | Recommended Instance | On-Demand Rate |
| --- | --- | --- | --- | --- | --- |
| 7B | Any | 14 GB | 28 GB | RTX 4090 | $0.79/hr |
| 13B | Any | 26 GB | 52 GB | RTX 4090 or A100 PCIe | $0.79/hr or $1.07/hr |
| 70B | SLERP/TIES | ~16 GB streaming | 140 GB | H100 PCIe with CPU offload | $2.01/hr |
| 70B | Evolutionary | 80 GB (eval pass) | 140 GB | H100 PCIe | $2.01/hr |
| 120B+ | Any | CPU-only feasible | 256 GB+ | High-RAM CPU node | |

Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For 70B TIES merges with CPU offload, a single H100 PCIe handles the job in about 45 minutes. The merge itself barely stresses the GPU. The bottleneck is loading checkpoints from disk, so fast NVMe storage matters more than GPU count.
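
The memory figures for a full checkpoint appear to follow from simple size arithmetic: parameter count times bytes per parameter. The sketch below is an estimate only; `checkpoint_gb` is a hypothetical helper, it uses decimal gigabytes, and it ignores optimizer-free overheads such as temporary buffers.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def checkpoint_gb(n_params: float, dtype: str = "bf16") -> float:
    """Approximate size of one checkpoint in decimal GB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for n_params in (7e9, 13e9, 70e9):
    size = checkpoint_gb(n_params)
    print(f"{n_params / 1e9:.0f}B: {size:.0f} GB per BF16 copy, {2 * size:.0f} GB for two")
```

One BF16 copy of a 70B model is about 140 GB, which is why a non-streaming merge of two such models does not fit on a single GPU and why streaming tensors from disk matters.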

Deploying Mergekit on Spheron: Merging Llama 4 Models in One Workflow

Provision an H100 PCIe instance from the Spheron console (see Spheron docs for instance setup). Then:

bash
pip install mergekit transformers accelerate
huggingface-cli login

Write a config.yaml to merge a Llama 4 Scout base with a domain fine-tune using SLERP:

yaml
merge_method: slerp
models:
  - model: meta-llama/Llama-4-Scout-17B-16E-Instruct
  - model: ./llama4-scout-finetuned-domain
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
parameters:
  t: 0.4
dtype: bfloat16

Run the merge. The --lazy-unpickle flag streams tensors from disk, which should keep VRAM under 20 GB:

bash
mergekit-yaml config.yaml ./output-model \
  --cuda \
  --lazy-unpickle \
  --copy-tokenizer

The --copy-tokenizer flag copies the base model's tokenizer into the output directory so the merged checkpoint is a self-contained HuggingFace repo. Total runtime on an H100: 30-45 minutes for Llama 4 Scout (17B active / 109B total).

One licensing note: verify the fine-tuned model's license before publishing the merge output. Llama 4 and most Qwen 3 variants permit derivative model creation, but some fine-tunes add restrictions. Always check the fine-tune's LICENSE file before distributing.

Evaluating Merged Models: MMLU, GSM8K, and LLM-as-Judge

MMLU and GSM8K are the right regression guardrails because they are broad, fast to run, and well-understood. A merged model that drops more than 2-3 points on either has likely experienced catastrophic interference.

bash
lm_eval \
  --model hf \
  --model_args pretrained=./output-model \
  --tasks mmlu,gsm8k \
  --device cuda \
  --output_path ./eval-results/
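
The 2-3 point guardrail can be turned into an automated check once the scores are extracted from the results files. The sketch below assumes the drop is measured against the better parent on each benchmark, which is one reasonable reading of the rule; `regressions` is a made-up helper and the scores are illustrative placeholders.

```python
from typing import Dict, List


def regressions(merged: Dict[str, float], parents: List[Dict[str, float]],
                max_drop: float = 2.0) -> Dict[str, float]:
    """Return {benchmark: drop} where merged is > max_drop points below the best parent."""
    flagged = {}
    for task, score in merged.items():
        best_parent = max(parent[task] for parent in parents)
        drop = best_parent - score
        if drop > max_drop:
            flagged[task] = round(drop, 2)
    return flagged


parent_a = {"mmlu": 73.2, "gsm8k": 68.4}
parent_b = {"mmlu": 70.1, "gsm8k": 65.2}

healthy_merge = {"mmlu": 72.8, "gsm8k": 67.1}
broken_merge = {"mmlu": 69.0, "gsm8k": 67.1}

print(regressions(healthy_merge, [parent_a, parent_b]))  # {}
print(regressions(broken_merge, [parent_a, parent_b]))   # {'mmlu': 4.2}
```

An empty result means the merge passed the guardrail. Anything else names the benchmark to investigate before moving on to domain evals.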

For generation quality, use an LLM-as-judge approach: run both the parent model and the merged model on a sample of your production queries, then have a judge model score each response. The LLM-as-judge evaluation pipeline guide covers the full setup including judge prompt design and scoring stability.

A typical evaluation comparison looks like this:

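| Model | MMLU | GSM8K | Domain Task Acc | Production Quality (judge) |
| --- | --- | --- | --- | --- |
| Parent A (base fine-tune) | 73.2 | 68.4 | 71.0 | 7.1/10 |
| Parent B (domain fine-tune) | 70.1 | 65.2 | 83.5 | 6.8/10 |
| SLERP t=0.5 | 72.8 | 67.1 | 80.2 | 7.4/10 |
| TIES density=0.5 | 74.1 | 69.3 | 84.1 | 7.8/10 |
| SFT fine-tune baseline | 72.5 | 67.8 | 82.0 | 7.3/10 |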

The TIES merge often sits above either parent on the combined metric. That is the combination effect working correctly: each model's strengths survive the sign-election step.

Common Failure Modes

Catastrophic interference. The merged model underperforms both parents on every task. This happens when the fine-tuning directions directly conflict, and linear averaging cancels both. Fix: switch from linear to TIES, reduce density to trim more noisy deltas, or lower the merge coefficient closer to 0.3 to weight one model more heavily.

Merge-induced hallucinations. The merged model's output distribution is slightly off from either parent, producing confident text that neither parent would generate. Monitor perplexity on a held-out set of 500-1000 examples from your domain. A perplexity increase of more than 5% over the better parent signals distribution drift.
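
The perplexity check can be sketched as follows, assuming you already have per-token natural-log probabilities from each model on the held-out set. `perplexity` and `has_drifted` are hypothetical helper names, and "better parent" is taken to mean the parent with the lower perplexity.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over all tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def has_drifted(ppl_merged: float, ppl_best_parent: float,
                max_increase: float = 0.05) -> bool:
    """True when merged perplexity exceeds the better parent's by more than 5%."""
    return ppl_merged > ppl_best_parent * (1 + max_increase)


# Sanity check: a model that assigns probability 0.25 to every token.
uniform_over_four = [math.log(0.25)] * 8
print(perplexity(uniform_over_four))  # about 4.0

print(has_drifted(10.6, 10.0))  # True: a 6% increase
print(has_drifted(10.4, 10.0))  # False: a 4% increase
```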

License conflicts. Some fine-tune licenses prohibit redistribution of merge derivatives. This catches teams off guard when they merge a popular fine-tune from HuggingFace Hub and then publish the output. Llama 4, Qwen 3, and Mistral variants are generally merge-friendly. Always check before publishing.

Tokenizer mismatch. If the two models have different tokenizers, mergekit will error on shape mismatch before producing any output. This is actually the correct behavior: a silent tokenizer mismatch would produce a working-looking model that generates garbage. Before starting, confirm that vocab_size in config.json and tokenizer_class in tokenizer_config.json match across both models.
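
A pre-flight check along these lines can catch mismatches before any GPU time is spent. This is a sketch: the list of fields is an assumption about what matters for shape compatibility, `load_config` and `config_mismatches` are invented names, and the config values are placeholders rather than those of any particular model.

```python
import json
from pathlib import Path
from typing import Dict, List, Sequence

# Fields assumed to determine whether weight tensor shapes line up.
FIELDS = ("model_type", "hidden_size", "num_attention_heads", "vocab_size")


def load_config(model_dir: str) -> Dict:
    """Read config.json from a local Hugging Face style model directory."""
    return json.loads((Path(model_dir) / "config.json").read_text())


def config_mismatches(cfg_a: Dict, cfg_b: Dict,
                      fields: Sequence[str] = FIELDS) -> List[str]:
    """Describe every field whose value differs between the two configs."""
    return [
        f"{key}: {cfg_a.get(key)!r} != {cfg_b.get(key)!r}"
        for key in fields
        if cfg_a.get(key) != cfg_b.get(key)
    ]


# Placeholder configs; in practice use load_config("./model-a") and so on.
cfg_a = {"model_type": "llama", "hidden_size": 4096,
         "num_attention_heads": 32, "vocab_size": 128256}
cfg_b = dict(cfg_a, vocab_size=128258)  # e.g. a fine-tune that added two tokens

print(config_mismatches(cfg_a, cfg_a))  # []
print(config_mismatches(cfg_a, cfg_b))  # ['vocab_size: 128256 != 128258']
```

An empty list is necessary but not sufficient: two tokenizers can share a vocabulary size and still map tokens differently, so comparing the tokenizer files themselves remains worthwhile.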

Production Deployment: vLLM and Regression Monitoring

The merged checkpoint is a standard Hugging Face format model, so it needs no special vLLM flags or configuration:

bash
vllm serve ./output-model \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --max-model-len 8192

See the vLLM production deployment guide for tensor parallelism configuration, load balancing, and monitoring setup.

For regression monitoring, use an A/B shadow pattern: route 10% of production traffic to the merged model and 90% to the parent. Compare LLM-as-judge scores between the two streams over 24-48 hours. If the merged model's quality score stays within 3% of the parent, it is safe to promote to full traffic. A score drop outside that window means the merge introduced a regression that your offline eval did not catch.
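
One way to sketch the routing and promotion rule is shown below. It assumes each request carries a stable string ID and that judge scores have already been averaged per stream; `route_to_merged` and `safe_to_promote` are hypothetical names, and the 3% tolerance is read as relative to the parent's score.

```python
import hashlib


def route_to_merged(request_id: str, fraction: float = 0.10) -> bool:
    """Deterministically send about `fraction` of requests to the merged model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def safe_to_promote(merged_score: float, parent_score: float,
                    tolerance: float = 0.03) -> bool:
    """True when the merged model's mean judge score is within 3% of the parent's."""
    return merged_score >= parent_score * (1 - tolerance)


share = sum(route_to_merged(f"req-{i}") for i in range(10_000)) / 10_000
print(round(share, 2))  # roughly 0.1

print(safe_to_promote(7.2, 7.3))  # True
print(safe_to_promote(6.9, 7.3))  # False
```

Hashing the request ID rather than sampling at random keeps a given request on the same model across retries, which makes the two streams easier to compare.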

Cost Comparison: Spot GPU Evolutionary Merging vs Fine-Tuning

Using live Spheron on-demand rates as of 25 Apr 2026:

| Approach | Hardware | Duration | Cost |
| --- | --- | --- | --- |
| TIES merge (single pass) | 1x H100 PCIe | 45 min | ~$1.51 |
| Evolutionary merge (100-run sweep, parallel) | 4x H100 PCIe | 3 hrs | ~$24.16 |
| SFT fine-tuning (7B, QLoRA) | 1x RTX 4090 | 4 hrs | ~$3.17 |
| SFT fine-tuning (70B, QLoRA) | 1x H100 PCIe | 10 hrs | ~$20.14 |
| GRPO training (32B) | 4x H200 | 48+ hrs | ~$760+ |

Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.

A 100-run evolutionary sweep costs roughly the same as a single SFT fine-tuning run on a 70B model. If you already have the checkpoints, the sweep is almost always worth running before committing to a new fine-tuning job.

When Merging Wins vs Fine-Tuning vs Distillation

| Situation | Best Approach | Why |
| --- | --- | --- |
| Two existing domain fine-tunes, want combined | Merge (TIES) | No new data needed, under 1 hour |
| No existing fine-tunes, have labeled data | Fine-tune | Merging needs something to merge |
| Need 10x cheaper inference at same quality | Distillation | Produces a smaller model |
| Unknown optimal merge coefficients, have eval metric | Evolutionary merge | Finds combinations humans miss |
| Need reasoning not in training data | GRPO | Merging cannot inject new capabilities |
| Production inference cost matters most | Distillation + quantization | Smallest runtime footprint |
| Adjacent domain fine-tunes exist, tight deadline | Merge first, then fine-tune delta | Fastest path to combined capability |

Merging is not a replacement for fine-tuning. It is what you do when you already have the fine-tunes and want to see if you can skip another training run. Most teams that use merging seriously end up doing both: merge to create a strong starting checkpoint, then fine-tune on the specific delta where the merged model falls short.

Model merging is burst, parallel, and checkpoint-heavy, which is exactly the workload pattern spot GPUs serve well. Run an evolutionary merge sweep on Spheron for the cost of a single fine-tuning epoch on a hyperscaler, with per-minute billing and no minimum commitment.



Quick Setup Guide

  1. Audit your candidate models for merge compatibility

    Confirm all candidate models share the same base architecture and tokenizer. Inspect each model's config.json and compare model_type, hidden_size, num_attention_heads, and vocab_size. Any mismatch means the merge will either fail with a shape error or produce garbled outputs. Also check licenses: some fine-tune licenses prohibit redistribution of merge derivatives.

  2. Choose your merge method based on model count and goal

    Two models, same base family, blending styles or strengths: use SLERP with t=0.5 as the starting coefficient. Three or more models: use TIES with density=0.5 and normalize=true. Any model with aggressive fine-tuning (LoRA merged into base): apply DARE first as a pre-processing step to prune redundant deltas, then run TIES. If you have an evaluation metric and a few hours of compute: run evolutionary merging with CMA-ES to search the coefficient space automatically.

  3. Provision a Spheron GPU or CPU node for merging

    For models up to 13B: a single RTX 4090 (24 GB VRAM) handles the merge entirely on-device in under 10 minutes. For 70B models: provision an H100 PCIe (80 GB) with CPU offload enabled in mergekit, or a high-RAM CPU node with 256 GB system RAM. For evolutionary merging sweeps: provision 4-8 H100 nodes to run parallel evaluation jobs; use Spheron spot instances to cut the sweep cost by 30-40%.

  4. Write the mergekit config YAML and run the merge

    Install mergekit with `pip install mergekit`. Write a config.yaml specifying the merge method, model paths, and coefficients. Run `mergekit-yaml config.yaml ./output-model --cuda --lazy-unpickle`. The --lazy-unpickle flag enables layer-by-layer streaming, which keeps peak RAM below 2x the size of a single model. Expect runtimes of 5-15 minutes for 13B merges and 30-60 minutes for 70B merges on a single H100.

  5. Evaluate the merged model on benchmark and domain tasks

    Run MMLU and GSM8K using lm-evaluation-harness as a regression check. Then run domain-specific evals that matter for your use case. Use LLM-as-judge for open-ended tasks where reference answers don't exist. Compare the merged model's scores against both parent models and, if available, a fine-tuned baseline. A merged model that underperforms both parents on any metric is a sign of catastrophic interference - try reducing the merge coefficient or switching from linear interpolation to TIES.

  6. Deploy the merged model with vLLM and monitor for regressions

    The merged model is a standard Hugging Face checkpoint and deploys identically to any other model. Launch with vLLM: `vllm serve ./output-model --dtype bfloat16 --tensor-parallel-size 2`. Monitor output perplexity and task accuracy for the first 24 hours of production traffic. Set up an A/B shadow comparison against the parent model using the LLM-as-judge approach from the evaluation step. A perplexity spike or accuracy drop below the parent baseline signals a merge-induced regression.


Frequently Asked Questions

What is model merging, and how does it differ from fine-tuning?

Model merging combines the weight tensors of two or more trained models into a single model without any gradient updates or new data. Fine-tuning adapts a model by running gradient descent on a dataset. Merging costs near-zero compute (a few minutes on CPU or a single GPU), while fine-tuning costs hours of GPU time. Merging trades the precision of task-specific adaptation for the ability to stack capabilities across multiple specialized models in minutes.

How much GPU memory do I need to merge large models?

Less than you'd expect. Mergekit loads models one at a time and streams weight tensors layer by layer, so peak VRAM is just large enough to hold one layer from each model simultaneously. You can merge two 70B models on a CPU-only node with 256 GB of system RAM, or on a single H100 (80 GB) using CPU offload for the rest. Full-precision (FP32) merge of two 70B models needs roughly 280 GB of combined RAM+VRAM; BF16 halves that.

Which merge method should I use: SLERP, TIES, DARE, or evolutionary merging?

SLERP is the best default for merging two models that share a base - it interpolates smoothly in weight space and rarely causes catastrophic interference. TIES is better when merging three or more models: it resolves sign conflicts before averaging. DARE prunes redundant delta weights before merging and works well when one of your models has been fine-tuned aggressively. Evolutionary merging is for when you have an evaluation metric and want to search the merge configuration space automatically - it finds coefficient combinations that a human would not think to try, but it requires 50-200 eval runs and several hours of compute.

Can I merge models with different architectures?

No. Model merging only works on models that share the same architecture and tokenizer. You can merge Llama 4 Scout with a Llama 4 Scout fine-tune. You cannot merge Llama 4 with Qwen 3 or DeepSeek V3, because the weight tensors have different shapes. The only exception is task arithmetic on the delta weights, which can sometimes be applied cross-architecture by projecting into a shared embedding space - but this is research-grade, not production-ready.

When does merging beat fine-tuning?

Merging wins when you already have two fine-tuned models targeting adjacent domains and want to combine their strengths without collecting new joint data. For example, merging a code-specialized Llama 4 with a math-specialized Llama 4 fine-tune often produces a model that outperforms either in combined coding+math evaluations. Fine-tuning wins when you have a specific dataset and need behavior the base model has never seen. The fastest path is often merge first, then fine-tune on the delta.
