The Hugging Face Open LLM Leaderboard's top spots are consistently held by merged models, not fine-tuned ones. Merging a 70B model takes 45 minutes and costs under $2 on a single H100 instance. That is not a trade-off; it is a different category of operation entirely.
Model merging combines the weight tensors of two or more trained models without gradient descent, new data, or GPU clusters. The result is a single checkpoint you can deploy anywhere a parent model would run. If your goal is task-specific behavior learned from labeled examples, the fine-tuning guide covers the full workflow. If you need a permanently smaller model for cheaper inference, model distillation produces a student that runs at a fraction of the teacher's cost. Merging is the third path: combine capabilities from existing checkpoints in under an hour, no data required.
Three Ways to Customize an LLM Without Pretraining
Before picking a method, it helps to know which problem each one actually solves:
| Approach | What Changes | Needs New Data? | GPU Cost | Time | When to Use |
|---|---|---|---|---|---|
| Fine-tuning | Model weights via gradient updates | Yes | High (hours of training) | Hours to days | New behavior not in base model |
| Distillation | Student weights trained on teacher outputs | Yes (teacher outputs) | High (training) | Hours to days | Permanent inference cost reduction |
| Merging | Weights interpolated from existing checkpoints | No | Near zero | Minutes to 1 hour | Combine capabilities from existing fine-tunes |
For reinforcement learning from verifiable rewards, see GRPO fine-tuning for the RL-based path that teaches models to reason through problems. Merging is complementary: use it when you already have the checkpoints and want to combine what they each do well.
What Model Merging Actually Does
In plain terms: a model's weights are a point in a very high-dimensional parameter space. Two models fine-tuned from the same base end up in nearby regions of that space because they share the same starting point and loss landscape topology. Merging moves between those two points by interpolating the weight tensors directly.
Why does that produce a coherent model? Because models fine-tuned from the same base tend to occupy a connected loss basin. The mode connectivity research from 2020 onward showed that there are low-loss paths between the weight configurations of models with shared initialization. The lottery ticket hypothesis adds another angle: the base model's "winning subnetworks" are preserved through fine-tuning, so merged models still benefit from those subnetworks even when the fine-tuned weights diverge.
What merging cannot do: inject knowledge that is not in any parent, produce a model smaller than the parents, or work across architectures. If your two best models are a Llama 4 fine-tune and a Qwen 3 fine-tune, merging will not work. The weight tensor shapes must match exactly.
The Methods: Linear, SLERP, Task Arithmetic, TIES, DARE, and DARE-TIES
Linear Interpolation
The simplest approach: for each weight tensor W, compute W_merged = alpha * W_A + (1 - alpha) * W_B. At alpha=0.5, each model contributes equally.
Works well for models that are close in weight space (same base, similar fine-tuning). Falls apart at high mixing ratios when the fine-tuning directions conflict, producing what is called catastrophic interference: a model that is worse than either parent on both tasks.
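The per-tensor arithmetic is a one-liner. A minimal numpy sketch (toy 2x2 tensors for illustration, not a real checkpoint):

```python
import numpy as np

def linear_merge(w_a: np.ndarray, w_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend two weight tensors of identical shape: alpha * W_A + (1 - alpha) * W_B."""
    assert w_a.shape == w_b.shape, "merging requires identical tensor shapes"
    return alpha * w_a + (1 - alpha) * w_b

w_a = np.array([[1.0, 2.0], [3.0, 4.0]])
w_b = np.array([[3.0, 0.0], [1.0, 2.0]])
merged = linear_merge(w_a, w_b, alpha=0.5)  # each entry is the midpoint of the parents
```

A real merge applies this same operation to every tensor in the checkpoint, which is what the mergekit config below expresses declaratively.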
merge_method: linear
models:
- model: ./model-a
parameters:
weight: 0.5
- model: ./model-b
parameters:
weight: 0.5
dtype: bfloat16
SLERP
Spherical linear interpolation treats weight tensors as vectors and interpolates along the geodesic on the unit sphere rather than the straight Euclidean line. The formula is:
W_merged = sin((1-t)*theta) / sin(theta) * W_A + sin(t*theta) / sin(theta) * W_B
where theta is the angle between the normalized weight vectors.
SLERP preserves the norm of the merged weights, which means it avoids the shrinkage that linear interpolation causes at intermediate t values. In practice this translates to better coherence at t=0.5. SLERP is the best default for two-model blends from the same base family.
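The formula above can be sketched directly. This toy version flattens each tensor and computes a single angle per tensor (mergekit's actual implementation differs in details; the unit vectors here are illustrative):

```python
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation along the geodesic between two weight tensors."""
    a, b = w_a.ravel(), w_b.ravel()
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if theta < 1e-6:  # nearly parallel vectors: fall back to linear interpolation
        return (1 - t) * w_a + t * w_b
    coeff_a = np.sin((1 - t) * theta) / np.sin(theta)
    coeff_b = np.sin(t * theta) / np.sin(theta)
    return (coeff_a * a + coeff_b * b).reshape(w_a.shape)

w_a = np.array([1.0, 0.0])
w_b = np.array([0.0, 1.0])
mid = slerp(w_a, w_b, 0.5)  # norm stays 1.0: no midpoint shrinkage
```

Note that linear interpolation of the same two unit vectors would give a midpoint of norm ~0.71; the norm preservation is exactly the shrinkage-avoidance property described above.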
merge_method: slerp
models:
- model: ./model-a
- model: ./model-b
base_model: ./base-model
parameters:
t: 0.5
dtype: bfloat16
Task Arithmetic
Task arithmetic computes the "task vector" for each fine-tuned model: tau = W_finetuned - W_base. You then scale and add task vectors to the base:
W_merged = W_base + lambda_A * tau_A + lambda_B * tau_B
This is composable. You can add, subtract, and scale task vectors independently. Subtracting a task vector removes a capability (useful for removing toxic behavior that was fine-tuned in). The original Ilharco et al. 2023 paper showed that task vectors from models fine-tuned on different tasks are approximately orthogonal, which is why adding them does not erase the other task's gains.
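The composability is easy to see in a toy sketch. The two hypothetical task vectors below are exactly orthogonal, so adding one does not disturb the other, and subtracting one cleanly removes it:

```python
import numpy as np

def task_vector(w_finetuned: np.ndarray, w_base: np.ndarray) -> np.ndarray:
    """tau = W_finetuned - W_base."""
    return w_finetuned - w_base

def apply_task_vectors(w_base, taus, lambdas):
    """W_merged = W_base + sum_i lambda_i * tau_i."""
    merged = w_base.copy()
    for lam, tau in zip(lambdas, taus):
        merged = merged + lam * tau
    return merged

w_base = np.zeros(4)
w_math = np.array([0.2, 0.0, -0.1, 0.0])   # hypothetical math fine-tune
w_code = np.array([0.0, 0.3, 0.0, -0.2])   # hypothetical code fine-tune
tau_math = task_vector(w_math, w_base)
tau_code = task_vector(w_code, w_base)

merged = apply_task_vectors(w_base, [tau_math, tau_code], [1.0, 1.0])
# Subtracting a task vector removes its capability:
removed = apply_task_vectors(merged, [tau_code], [-1.0])
```

In real checkpoints the task vectors are only approximately orthogonal, which is why the lambda coefficients usually need tuning below 1.0.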
TIES
TIES (Trim, Elect Sign, Merge) was designed for the three-or-more-model case where linear averaging produces sign conflicts. When W_A has a positive delta for a parameter and W_B has a negative delta, averaging them produces a near-zero result, discarding both fine-tuned values.
TIES resolves this with three steps:
- Trim: zero out the bottom 1 - density fraction of delta weights by magnitude in each model (removes low-magnitude noise)
- Elect sign: for each parameter, take a majority vote on sign direction across all input models
- Merge: average only the values that agree with the elected sign
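The three steps can be sketched for a single tensor. This is a toy implementation (sign election by summed magnitude, as in the TIES paper); mergekit's production code differs in details:

```python
import numpy as np

def ties_merge(w_base: np.ndarray, finetuned: list, density: float = 0.5) -> np.ndarray:
    """Toy TIES for one tensor: trim, elect sign, merge agreeing deltas."""
    deltas = [w - w_base for w in finetuned]
    trimmed = []
    for d in deltas:
        # Trim: keep only the top `density` fraction of entries by magnitude
        k = int(np.ceil(density * d.size))
        threshold = np.sort(np.abs(d).ravel())[-k]
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))
    stacked = np.stack(trimmed)
    # Elect sign: per parameter, the direction with the larger summed magnitude wins
    elected = np.sign(stacked.sum(axis=0))
    # Merge: average only surviving values that agree with the elected sign
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = np.where(agree, stacked, 0.0).sum(axis=0) / counts
    return w_base + merged_delta

base = np.zeros(4)
fts = [base + np.array([1.0, -0.8, 0.1, 0.0]),
       base + np.array([0.9, 0.7, 0.0, 0.05]),
       base + np.array([-0.2, 0.6, 0.5, 0.0])]
result = ties_merge(base, fts, density=0.5)
```

In this example the second parameter has a sign conflict (-0.8 vs +0.7 and +0.6): plain averaging would nearly cancel it, while sign election keeps the majority direction intact.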
merge_method: ties
models:
- model: ./model-a
parameters:
weight: 0.4
- model: ./model-b
parameters:
weight: 0.3
- model: ./model-c
parameters:
weight: 0.3
base_model: ./base-model
parameters:
density: 0.5
normalize: true
dtype: bfloat16
Use TIES when merging three or more models, or any time you see the merged model performing worse than either parent on both tasks (a sign that linear averaging is canceling deltas).
DARE
DARE (Drop And REscale) addresses a different problem: when a fine-tuned model has been trained aggressively (many epochs, high learning rate, merged LoRA adapters), its delta from the base model is large and noisy. Averaging that noisy delta with another model's delta amplifies the noise.
DARE randomly masks out a fraction p of the delta weights and rescales the survivors by 1/(1-p) to maintain expected magnitude. This prunes redundant parameters while keeping the expected value of the delta unchanged. The math: tau_DARE = mask(tau, p) / (1-p).
Run DARE as a preprocessing step before TIES or linear merge when one of your inputs is an aggressively fine-tuned model.
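The drop-and-rescale math is short enough to verify directly. A sketch showing that the expected value of the delta survives even at 90% drop rate:

```python
import numpy as np

def dare(tau: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Drop a fraction p of delta entries at random, rescale survivors by 1/(1-p)."""
    mask = rng.random(tau.shape) >= p  # each entry survives with probability 1 - p
    return np.where(mask, tau, 0.0) / (1.0 - p)

rng = np.random.default_rng(0)
tau = np.ones(100_000)
tau_dare = dare(tau, p=0.9, rng=rng)
# ~90% of entries are now zero, but the mean delta is preserved (close to 1.0)
mean_delta = tau_dare.mean()
```

The surviving 10% of entries are scaled up 10x, which is why DARE only works well when the delta is redundant: the model must tolerate a few large entries standing in for many small ones.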
DARE-TIES
The current strongest general method for multi-model merges. Apply DARE's random pruning to each model's task vector first, then run TIES's sign-election step on the pruned deltas. DARE removes the low-signal noise; TIES resolves the remaining sign conflicts.
merge_method: dare_ties
models:
- model: ./model-a
parameters:
weight: 0.4
density: 0.7
- model: ./model-b
parameters:
weight: 0.6
density: 0.7
base_model: ./base-model
parameters:
normalize: true
dtype: bfloat16
Evolutionary Merging: When Hand-Tuning Is Not Enough
Sakana AI's 2024 evolutionary merging paper reframed merging as a search problem. Instead of hand-picking coefficients, you treat the merge configuration (weights, densities, per-layer coefficients) as a parameter vector and evolve it using a fitness function based on benchmark performance.
The search algorithm is CMA-ES (Covariance Matrix Adaptation Evolution Strategy), which efficiently explores the coefficient space by maintaining a multivariate Gaussian distribution over candidates and updating it toward high-fitness regions. Each generation samples a batch of merge configurations, evaluates each on your benchmark, and adapts the distribution toward the survivors.
When to use evolutionary merging: you have an automatable evaluation metric (test suite pass rate, GSM8K accuracy, domain benchmark score), and you are merging three or more models where the optimal coefficients are not obvious. The algorithm finds coefficient combinations that no human would arrive at by intuition.
Hardware for a sweep: 100 eval runs, each requiring a full benchmark pass on a freshly merged checkpoint, roughly 7 minutes per run on an H100 PCIe rental for a 13B model. That is about 12 GPU-hours total, or roughly $24 at on-demand rates. Parallelizing across 4 H100 nodes cuts wall-clock time to 3 hours.
Here is the core loop for an evolutionary sweep using optuna and the mergekit Python API:
import glob
import json
import os
import shutil
import subprocess
import optuna
def evaluate_merge(config_path: str, merge_dir: str, eval_result_path: str) -> float:
# Run mergekit merge
subprocess.run(
["mergekit-yaml", config_path, merge_dir, "--cuda", "--lazy-unpickle"],
check=True
)
# Run evaluation with lm-evaluation-harness
subprocess.run(
["lm_eval", "--model", "hf", "--model_args", f"pretrained={merge_dir}",
"--tasks", "gsm8k,mmlu", "--output_path", eval_result_path],
capture_output=True, text=True, check=True
)
# lm-eval v0.4+ treats --output_path as a directory regardless of extension,
# so traverse it to find the actual results file.
result_files = glob.glob(
os.path.join(eval_result_path, "**", "results_*.json"), recursive=True
)
if not result_files:
raise FileNotFoundError(f"No lm-eval results found in {eval_result_path}")
with open(result_files[0]) as f:
scores = json.load(f)
    # gsm8k in lm-eval v0.4 reports exact_match, not acc
    return scores["results"]["gsm8k"]["exact_match,strict-match"]
def objective(trial):
weight_a = trial.suggest_float("weight_a", 0.2, 0.8)
density = trial.suggest_float("density", 0.3, 0.9)
trial_config_path = f"./trial-{trial.number}-config.yaml"
merge_dir = f"./tmp-merged-{trial.number}"
eval_result_path = f"./eval-result-{trial.number}.json"
config = f"""
merge_method: dare_ties
models:
- model: ./model-a
parameters:
weight: {weight_a}
density: {density}
- model: ./model-b
parameters:
weight: {1 - weight_a}
density: {density}
base_model: ./base-model
parameters:
normalize: true
dtype: bfloat16
"""
with open(trial_config_path, "w") as f:
f.write(config)
try:
return evaluate_merge(trial_config_path, merge_dir, eval_result_path)
finally:
for path in [trial_config_path, eval_result_path]:
try:
if os.path.isfile(path):
os.remove(path)
elif os.path.isdir(path):
shutil.rmtree(path)
except OSError:
pass
try:
if os.path.exists(merge_dir):
shutil.rmtree(merge_dir)
except OSError:
pass
sampler = optuna.samplers.CmaEsSampler()
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
print(study.best_params)
Run this on H200 instances when sweeping 70B models, where each eval run takes longer and you want the VRAM headroom to avoid OOM on the benchmark pass.
Hardware Requirements: Why Merging a 70B Model Needs Less GPU Than You Think
Mergekit's --lazy-unpickle flag enables layer-by-layer streaming. It loads one transformer block from each source model at a time, merges that block, writes it to the output, then moves to the next. Peak VRAM is determined by the size of one block from each model, not the full model size.
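A rough back-of-envelope for why the streaming footprint is so small. This estimator counts only the raw weight tensors resident at once (one block per source model); the real peak is higher once CUDA context, staging buffers, and the output shard are counted, which is why the table below quotes ~16 GB rather than the raw floor. The layer count and model list are illustrative assumptions:

```python
def streaming_peak_vram_gb(total_params_b: float, n_layers: int, n_models: int,
                           bytes_per_param: int = 2) -> float:
    """Raw-tensor floor for a layer-streaming merge: one transformer block
    per source model resident at a time (bf16 = 2 bytes per parameter)."""
    params_per_block = total_params_b * 1e9 / n_layers
    return params_per_block * bytes_per_param * n_models / 1e9

# Hypothetical 70B model with 80 layers, merging two fine-tunes plus the base:
floor_gb = streaming_peak_vram_gb(70, 80, 3)
```

The floor works out to a few gigabytes, which is why the bottleneck shifts from VRAM to disk throughput.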
| Model Size | Method | Peak VRAM | Peak RAM | Recommended Instance | On-Demand Rate |
|---|---|---|---|---|---|
| 7B | Any | 14 GB | 28 GB | RTX 4090 | $0.79/hr |
| 13B | Any | 26 GB | 52 GB | RTX 4090 or A100 PCIe | $0.79/hr or $1.07/hr |
| 70B | SLERP/TIES | ~16 GB streaming | 140 GB | H100 PCIe with CPU offload | $2.01/hr |
| 70B | Evolutionary | 80 GB (eval pass) | 140 GB | H100 PCIe | $2.01/hr |
| 120B+ | Any | CPU-only feasible | 256 GB+ | High-RAM CPU node | — |
Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.
For 70B TIES merges with CPU offload, a single H100 PCIe handles the job in about 45 minutes. The merge itself barely stresses the GPU. The bottleneck is loading checkpoints from disk, so fast NVMe storage matters more than GPU count.
Deploying Mergekit on Spheron: Merging Llama 4 Models in One Workflow
Provision an H100 PCIe instance from the Spheron console (see Spheron docs for instance setup). Then:
pip install mergekit transformers accelerate
huggingface-cli login
Write a config.yaml to merge a Llama 4 Scout base with a domain fine-tune using SLERP:
merge_method: slerp
models:
- model: meta-llama/Llama-4-Scout-17B-16E-Instruct
- model: ./llama4-scout-finetuned-domain
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
parameters:
t: 0.4
dtype: bfloat16
Run the merge with CPU offload to keep VRAM under 20 GB:
mergekit-yaml config.yaml ./output-model \
--cuda \
--lazy-unpickle \
--copy-tokenizer
The --copy-tokenizer flag copies the base model's tokenizer into the output directory so the merged checkpoint is a self-contained HuggingFace repo. Total runtime on an H100: 30-45 minutes for Llama 4 Scout (17B active / 109B total).
One licensing note: verify the fine-tuned model's license before publishing the merge output. Llama 4 and most Qwen 3 variants permit derivative model creation, but some fine-tunes add restrictions. Always check the fine-tune's LICENSE file before distributing.
Evaluating Merged Models: MMLU, GSM8K, and LLM-as-Judge
MMLU and GSM8K are the right regression guardrails because they are broad, fast to run, and well-understood. A merged model that drops more than 2-3 points on either has likely experienced catastrophic interference.
lm_eval \
--model hf \
--model_args pretrained=./output-model \
--tasks mmlu,gsm8k \
--device cuda \
--output_path ./eval-results/
For generation quality, use an LLM-as-judge approach: run both the parent model and the merged model on a sample of your production queries, then have a judge model score each response. The LLM-as-judge evaluation pipeline guide covers the full setup including judge prompt design and scoring stability.
A typical evaluation comparison looks like this:
| Model | MMLU | GSM8K | Domain Task Acc | Production Quality (judge) |
|---|---|---|---|---|
| Parent A (base fine-tune) | 73.2 | 68.4 | 71.0 | 7.1/10 |
| Parent B (domain fine-tune) | 70.1 | 65.2 | 83.5 | 6.8/10 |
| SLERP t=0.5 | 72.8 | 67.1 | 80.2 | 7.4/10 |
| TIES density=0.5 | 74.1 | 69.3 | 84.1 | 7.8/10 |
| SFT fine-tune baseline | 72.5 | 67.8 | 82.0 | 7.3/10 |
The TIES merge often sits above either parent on the combined metric. That is the combination effect working correctly: each model's strengths survive the sign-election step.
Common Failure Modes
Catastrophic interference. The merged model underperforms both parents on every task. This happens when the fine-tuning directions directly conflict, and linear averaging cancels both. Fix: switch from linear to TIES, reduce density to trim more noisy deltas, or lower the merge coefficient closer to 0.3 to weight one model more heavily.
Merge-induced hallucinations. The merged model's output distribution is slightly off from either parent, producing confident text that neither parent would generate. Monitor perplexity on a held-out set of 500-1000 examples from your domain. A perplexity increase of more than 5% over the better parent signals distribution drift.
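The drift check itself is simple arithmetic. In practice the per-token log-probs come from scoring your held-out examples with each model; the numbers below are hypothetical stand-ins for that step:

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity = exp of the negative mean log-probability over a token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def drift_alert(parent_ppl: float, merged_ppl: float, threshold: float = 0.05) -> bool:
    """Flag the merge when perplexity rises more than `threshold` over the better parent."""
    return (merged_ppl - parent_ppl) / parent_ppl > threshold

# Hypothetical log-probs; in practice, score 500-1000 domain examples per model
parent_ppl = perplexity([-1.8, -2.1, -1.5, -2.0])
merged_ppl = perplexity([-2.0, -2.3, -1.6, -2.1])
alert = drift_alert(parent_ppl, merged_ppl)  # ~16% increase, well over the 5% threshold
```

Averaging over a fixed held-out set rather than live traffic keeps the comparison stable between runs.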
License conflicts. Some fine-tune licenses prohibit redistribution of merge derivatives. This catches teams off guard when they merge a popular fine-tune from HuggingFace Hub and then publish the output. Llama 4, Qwen 3, and Mistral variants are generally merge-friendly. Always check before publishing.
Tokenizer mismatch. If the two models have different tokenizers, mergekit will error on shape mismatch before producing any output. This is actually the correct behavior: a silent tokenizer mismatch would produce a working-looking model that generates garbage. Confirm vocab_size and tokenizer_class in both config.json files before starting.
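A preflight check is cheap insurance before a long merge. A sketch of the comparison, using hypothetical config dicts (in practice, load each model's config.json and tokenizer_config.json and pull the fields from there):

```python
def tokenizers_compatible(config_a: dict, config_b: dict) -> bool:
    """Compare the fields that must match before merging two checkpoints."""
    keys = ("vocab_size", "tokenizer_class")
    return all(config_a.get(k) == config_b.get(k) for k in keys)

# Hypothetical configs: two Llama-family fine-tunes vs a Qwen-family model
llama_a = {"vocab_size": 128256, "tokenizer_class": "PreTrainedTokenizerFast"}
llama_b = {"vocab_size": 128256, "tokenizer_class": "PreTrainedTokenizerFast"}
qwen_c = {"vocab_size": 151936, "tokenizer_class": "Qwen2Tokenizer"}

ok = tokenizers_compatible(llama_a, llama_b)       # same family: mergeable
mismatch = tokenizers_compatible(llama_a, qwen_c)  # different vocab: will not merge
```

Running this before provisioning the GPU avoids paying for an instance only to hit the shape-mismatch error on the first tensor.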
Production Deployment: vLLM and Regression Monitoring
The merged checkpoint is a standard HuggingFace format model. No special vLLM flags or configuration:
vllm serve ./output-model \
--dtype bfloat16 \
--tensor-parallel-size 2 \
--max-model-len 8192
See the vLLM production deployment guide for tensor parallelism configuration, load balancing, and monitoring setup.
For regression monitoring, use an A/B shadow pattern: route 10% of production traffic to the merged model and 90% to the parent. Compare LLM-as-judge scores between the two streams over 24-48 hours. If the merged model's quality score stays within 3% of the parent, it is safe to promote to full traffic. A score drop outside that window means the merge introduced a regression that your offline eval did not catch.
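One detail that matters for the shadow pattern: route by request identity rather than a fresh coin flip, so a retried request always hits the same variant and judge scores stay comparable. A minimal sketch (the 10% split and seed are the assumptions here, not a fixed API):

```python
import random

def shadow_route(request_id: int, merged_fraction: float = 0.10, seed: int = 42) -> str:
    """Deterministic A/B assignment: the same request_id always gets the same variant."""
    rng = random.Random(f"{seed}:{request_id}")
    return "merged" if rng.random() < merged_fraction else "parent"

routes = [shadow_route(i) for i in range(10_000)]
merged_share = routes.count("merged") / len(routes)  # close to 0.10
```

Changing the seed reshuffles the assignment, which is useful when you want a fresh 10% cohort for the next merge candidate.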
Cost Comparison: Spot GPU Evolutionary Merging vs Fine-Tuning
Using live Spheron on-demand rates as of 25 Apr 2026:
| Approach | Hardware | Duration | Cost |
|---|---|---|---|
| TIES merge (single pass) | 1x H100 PCIe | 45 min | ~$1.51 |
| Evolutionary merge (100-run sweep, parallel) | 4x H100 PCIe | 3 hrs | ~$24.16 |
| SFT fine-tuning (7B, QLoRA) | 1x RTX 4090 | 4 hrs | ~$3.17 |
| SFT fine-tuning (70B, QLoRA) | 1x H100 PCIe | 10 hrs | ~$20.14 |
| GRPO training (32B) | 4x H200 | 48+ hrs | ~$760+ |
Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.
A 100-run evolutionary sweep costs roughly the same as a single SFT fine-tuning run on a 70B model. If you already have the checkpoints, the sweep is almost always worth running before committing to a new fine-tuning job.
When Merging Wins vs Fine-Tuning vs Distillation
| Situation | Best Approach | Why |
|---|---|---|
| Two existing domain fine-tunes, want combined | Merge (TIES) | No new data needed, under 1 hour |
| No existing fine-tunes, have labeled data | Fine-tune | Merging needs something to merge |
| Need 10x cheaper inference at same quality | Distillation | Produces a smaller model |
| Unknown optimal merge coefficients, have eval metric | Evolutionary merge | Finds combinations humans miss |
| Need reasoning not in training data | GRPO | Merging cannot inject new capabilities |
| Production inference cost matters most | Distillation + quantization | Smallest runtime footprint |
| Adjacent domain fine-tunes exist, tight deadline | Merge first, then fine-tune delta | Fastest path to combined capability |
Merging is not a replacement for fine-tuning. It is what you do when you already have the fine-tunes and want to see if you can skip another training run. Most teams that use merging seriously end up doing both: merge to create a strong starting checkpoint, then fine-tune on the specific delta where the merged model falls short.
Model merging is burst, parallel, and checkpoint-heavy, which is exactly the workload pattern spot GPUs serve well. Run an evolutionary merge sweep on Spheron for the cost of a single fine-tuning epoch on a hyperscaler, with per-minute billing and no minimum commitment.
