Liger Kernel from LinkedIn AI cuts LLM training VRAM by 40-60% and improves throughput 15-20% with a one-line patch to Hugging Face Trainer. No special environment, no custom training loop. It works because the bottleneck in standard training is not compute but intermediate tensor allocation during forward and backward passes. Liger eliminates most of that through kernel fusion. If you are already following standard LLM fine-tuning practices, dropping Liger in is the single most impactful change you can make to your training cost.
## What Liger Kernel Actually Does
Standard PyTorch model layers allocate intermediate tensors at each computation step. RMSNorm, for example, allocates 2-3 tensors the full size of the activation (input, normalized, scaled) before producing output. A fused Triton kernel computes all three in a single GPU pass with zero intermediate allocation. Multiply that across 80 layers and you recover gigabytes of VRAM without touching your model architecture.
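For intuition, here is a deliberately naive RMSNorm in plain PyTorch. This is an illustrative sketch, not Liger's Triton kernel: each named intermediate below is a tensor on the order of the activation size that a fused kernel never has to write out to GPU memory.

```python
import torch

def naive_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Each step below materializes its own tensor in GPU memory.
    # A fused kernel produces the same output in a single pass over `x`
    # without storing these intermediates.
    variance = x.pow(2).mean(-1, keepdim=True)       # reduction buffer
    normalized = x * torch.rsqrt(variance + eps)     # full-size intermediate
    scaled = normalized * weight                     # full-size scaled output
    return scaled

x = torch.randn(4, 2048, 4096, dtype=torch.bfloat16)  # [batch, seq_len, hidden]
w = torch.ones(4096, dtype=torch.bfloat16)
out = naive_rmsnorm(x, w)
```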
Here is what each kernel does:
- RMSNorm: Replaces `torch.nn.RMSNorm` and `LlamaRMSNorm`. Eliminates 2 intermediate tensors per layer. Every transformer layer has 2 RMSNorm calls, so on a 70B model with 80 layers, this removes 160 large allocation events per forward pass.
- RoPE (Rotary Position Embedding): Fuses position encoding into the QKV projection, eliminating the intermediate `cos`/`sin` position cache allocation on the forward pass.
- SwiGLU / GeGLU: Fuses the gating activation with the feed-forward linear projection, eliminating the intermediate activation tensor (equal to `4 * hidden_dim * seq_len * batch`).
- FusedLinearCrossEntropy: The biggest win by far. Standard training computes `lm_head(hidden_states)`, which produces a `[batch * seq_len, vocab_size]` logit tensor (often 2-8 GB on large-vocab models), then computes softmax and cross-entropy on top. The fused version never materializes the full logit tensor. It computes cross-entropy in chunks directly, delivering 40-55% peak VRAM savings on models with vocab sizes above 100K tokens (Qwen 3, Llama 4).
| Component | Standard PyTorch | Liger Kernel | Peak Intermediate VRAM Eliminated |
|---|---|---|---|
| RMSNorm | 2-3 allocs/layer | 0 allocs/layer | ~1.5 GB on 70B (BF16) |
| RoPE | cos/sin buffer allocated | Fused in-place | ~200 MB per step |
| SwiGLU gate | Full activation tensor | Fused in-place | 2-4 GB depending on seq_len |
| Linear + CrossEntropy | Full logit tensor materialized | Chunked, never materialized | 4-8 GB on 128K vocab models |
The savings compound. FusedLinearCrossEntropy alone accounts for roughly half the VRAM savings on Llama 4 and Qwen 3 models with 128K+ token vocabularies. The norm and activation kernels cover the rest.
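The chunking idea behind FusedLinearCrossEntropy can be sketched in a few lines of plain PyTorch. The function name and `chunk_size` are illustrative, not Liger's implementation; the point is that only one `[chunk, vocab_size]` slice of logits exists at a time.

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    """Mean cross-entropy without building the full [num_tokens, vocab] logit tensor.

    hidden:         [num_tokens, hidden_dim]
    lm_head_weight: [vocab_size, hidden_dim]
    labels:         [num_tokens]
    """
    total_loss = hidden.new_zeros((), dtype=torch.float32)
    num_tokens = hidden.shape[0]
    for start in range(0, num_tokens, chunk_size):
        h = hidden[start:start + chunk_size]          # small slice of tokens
        logits = h @ lm_head_weight.T                 # [chunk, vocab] only
        total_loss += F.cross_entropy(
            logits.float(), labels[start:start + chunk_size], reduction="sum"
        )
    return total_loss / num_tokens
```

The real saving also requires the backward pass to avoid retaining each chunk's logits, which is where Liger's fused Triton kernel does the heavy lifting; this sketch only illustrates the forward-pass idea.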
One version note: verify your installed version includes FusedLinearCrossEntropy before relying on the large-vocab savings. Running `from liger_kernel.transformers.functional import liger_cross_entropy` should not raise an `ImportError`. Use `liger-kernel >= 0.3.0` to be safe.
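A quick check you can run in the training environment before kicking off a long job:

```python
# Verify the installed liger-kernel exposes the fused cross-entropy path
try:
    from liger_kernel.transformers.functional import liger_cross_entropy  # noqa: F401
    print("FusedLinearCrossEntropy path available")
except ImportError:
    print("liger-kernel too old - upgrade with: pip install -U 'liger-kernel>=0.3.0'")
```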
## Model and Framework Coverage in 2026
| Model Family | Architecture Match | Patch Function | Axolotl Config | TRL Compatible |
|---|---|---|---|---|
| Llama 3/4 | RMSNorm, RoPE, SwiGLU | apply_liger_kernel_to_llama() | use_liger: true | Yes |
| Qwen 2.5 / 3 | RMSNorm, RoPE, SwiGLU | apply_liger_kernel_to_qwen2() | use_liger: true | Yes |
| Gemma 2/3 | RMSNorm, RoPE, GeGLU | apply_liger_kernel_to_gemma() | use_liger: true | Yes |
| Mistral 3 | RMSNorm, RoPE, SwiGLU | apply_liger_kernel_to_mistral() | use_liger: true | Yes |
| Phi-3/4 | RMSNorm, RoPE, SwiGLU | apply_liger_kernel_to_phi3() | use_liger: true | Yes |
| Multimodal (LLaVA, Llama 4 Scout visual) | Vision encoder not patched | Base LLM layers patched only | Partial | Check manually |
For multimodal models, the vision encoder and cross-attention projectors are not auto-patched. Only the LLM decoder layers get Liger coverage. VRAM savings will be smaller than the pure-LLM figures above.
## One-Line Integration
### Hugging Face Trainer
```python
import torch
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()  # Call before model.from_pretrained()

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16)
# Liger patches are already active - train as normal
```

Call the patch function before `from_pretrained`. The patches modify the module classes in the model registry, so any model loaded after the call picks them up automatically.
### TRL (SFT, DPO, GRPO)
```python
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()

from trl import SFTTrainer, GRPOTrainer, DPOTrainer
# Pass model normally - patches persist through TRL's internal model wrapping
```

The patches persist through TRL's internal model wrapping. See the GRPO fine-tuning guide and DPO fine-tuning guide for full TRL-specific setup context.
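A minimal SFT sketch with the patch applied before the model is instantiated. The dataset and the `SFTConfig` fields here are placeholders, and the exact `SFTConfig` arguments vary by TRL version.

```python
import torch
from datasets import load_dataset
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

apply_liger_kernel_to_llama()  # patch first, before the model classes are built

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
)
dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(output_dir="./sft-liger", per_device_train_batch_size=4),
)
trainer.train()
```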
### Axolotl

```yaml
# In your axolotl config.yaml
use_liger: true
```

One line. Axolotl (version 0.4+) calls the appropriate patch function automatically before loading the model, based on the `model_type` field in your config. See Axolotl vs Unsloth vs TorchTune for the broader framework comparison.
### Unsloth
Do not apply Liger patches to models loaded or wrapped by Unsloth. Unsloth applies its own custom CUDA kernels that cover RMSNorm, RoPE, and SwiGLU fusion natively. Applying Liger on top causes double-patching and undefined behavior in the backward pass. Pick one or the other, not both.
## Benchmarks on Spheron GPU Cloud
The numbers below reflect benchmark data from the Liger Kernel project and community reports. Actual results depend on sequence length, batch size, and checkpoint strategy. Use them to set expectations, not as absolute guarantees.
VRAM comparison (BF16 full fine-tune, seq_len=2048, batch=4):
| Model | GPU | Baseline VRAM | Liger VRAM | Reduction | Notes |
|---|---|---|---|---|---|
| Llama 3.1 8B | H100 SXM | 72 GB (AdamW FP32 optimizer states, no gradient checkpointing) | 29 GB | 60% | FusedLinearCrossEntropy dominates savings on 128K vocab |
| Qwen 3 14B | H100 SXM | ~79 GB (OOM at batch=4) | 51 GB | Avoids OOM | SwiGLU + large vocab |
| Llama 3.1 70B | 8x H100 SXM | 74 GB/GPU | 46 GB/GPU | 38% | Enables larger batch on same hardware |
| Gemma 3 27B | H200 141GB | 68 GB | 41 GB | 40% | GeGLU variant fully supported |
| Llama 4 Scout 17B | B200 192GB | 62 GB | 38 GB | 39% | Headroom for longer context (expected, based on architecture similarity) |
The B200 figures follow the same kernel fusion pattern as H100 and H200. Community benchmarks for B200 with Liger are still accumulating, so treat the 39% figure as directionally consistent with the architecture rather than a measured result.
Throughput comparison (tokens/sec, H100 SXM, Llama 3.1 8B, seq_len=2048):
| Config | Tokens/sec | vs Baseline |
|---|---|---|
| Baseline HF Trainer | ~3,100 | 1x |
| + Liger Kernel | ~3,800 | +22% |
| + Liger + FlashAttention-4 | ~5,200 | +68% |
| + Liger + FA4 + ZeRO-3 (8 GPU) | ~38,000 aggregate | ~12x (8-GPU aggregate) |
The throughput gain from Liger alone (22%) comes from reduced pressure on the GPU's memory subsystem: fewer allocations mean fewer cache misses, so the compute units spend less time waiting on memory.
## Combining Liger with FlashAttention-4, FSDP, and DeepSpeed ZeRO-3
FlashAttention-4 + Liger: Complementary by design. FA4 optimizes attention computation. Liger covers everything else: norms, activations, cross-entropy. No conflict. Apply both independently. Call apply_liger_kernel_to_llama() first, then set attn_implementation="flash_attention_2" (or flash_attention_4 when HF Transformers integrates it natively) in from_pretrained.
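The ordering only matters in that the patch has to land before the model classes are instantiated. A minimal sketch of the combination described above, using the `flash_attention_2` implementation that current Transformers ships:

```python
import torch
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

apply_liger_kernel_to_llama()  # 1. patch norms, MLP activation, and loss first

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2. FlashAttention covers attention only
)
```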
FSDP2 + Liger: Fully compatible. FSDP2 shards parameters. Liger replaces computation in those parameters' forward passes. Apply Liger patches before calling fully_shard(). The patch function replaces module classes in the model registry, so FSDP2 wrapping picks up the replacements automatically. See the distributed LLM training guide for the full FSDP2 setup.
DeepSpeed ZeRO-3 + Liger: Compatible with ZeRO-1, ZeRO-2, and ZeRO-3. Apply patches before deepspeed.initialize(). ZeRO-3's parameter scatter/gather happens at a lower level than Liger's layer implementations.
## Known Gotchas
1. Gradient checkpointing + RMSNorm. Liger's RMSNorm stores fewer activations for the backward pass by design. If you enable model.gradient_checkpointing_enable() alongside Liger patches, the recomputation path for checkpointed layers may conflict with the fused backward kernel. Test both together and verify loss values match a non-checkpointed run for 50-100 steps before committing to long training runs.
2. Multimodal projection heads. LLaVA-style models and Llama 4 Scout with the visual encoder have additional LayerNorm and projection layers that Liger does NOT auto-patch. The LLM decoder layers get patched. The vision encoder and cross-attention projectors do not. VRAM savings will be smaller than the pure-LLM figures above.
3. FusedLinearCrossEntropy numerical drift. The fused chunked cross-entropy computes identical results mathematically but uses a different floating-point accumulation order. Loss values may differ in the 4th-6th decimal place from standard cross-entropy. This is expected and not a training bug, but it can trigger false positives if you are comparing loss curves between patched and unpatched runs to check for regressions.
4. Custom tokenizers with small vocab. The savings from FusedLinearCrossEntropy scale with vocab size. On models with a vocab under 32K tokens, the gains from this kernel are smaller and total VRAM savings will be more modest (15-25% total rather than 40-60%).
## Pricing Math: What Liger Saves per Fine-Tune Run on Spheron
| GPU | On-demand $/hr/GPU | Spot $/hr/GPU |
|---|---|---|
| H100 SXM on Spheron | $4.21 | $0.80 |
| H200 on Spheron | $4.54 | $1.19 |
| B200 on Spheron | $7.00 | $1.71 |
Example 1: Llama 3.1 8B full fine-tune, 1B tokens
Without Liger, this job requires 2x H100 SXM for VRAM headroom at batch=4, seq_len=2048 (72GB baseline VRAM is tight on a single 80GB card). With FSDP2 coordination overhead, the job runs for roughly 10 hours across two GPUs. Cost: 2 x $4.21 x 10 = $84.20.
With Liger, the same job fits comfortably on 1x H100 (29GB VRAM). On a single GPU, it runs for about 16 hours. Cost: 1 x $4.21 x 16 = $67.36. That is roughly 20% cheaper and avoids multi-GPU coordination setup entirely.
Use Spheron's H100 spot rate ($0.80/hr) for non-time-sensitive fine-tuning jobs that can checkpoint and resume. The same 16-hour single-GPU run drops to $12.80.
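The arithmetic is simple enough to keep in a small helper. The rates and runtimes below are the ones quoted in this example and will drift as pricing changes.

```python
# Cost comparison for Example 1, using the rates and runtimes quoted above
H100_ON_DEMAND = 4.21  # $/hr/GPU
H100_SPOT = 0.80       # $/hr/GPU

baseline = 2 * H100_ON_DEMAND * 10    # 2x H100, ~10 h  -> $84.20
with_liger = 1 * H100_ON_DEMAND * 16  # 1x H100, ~16 h  -> $67.36
with_liger_spot = 1 * H100_SPOT * 16  #                 -> $12.80

print(f"baseline:        ${baseline:.2f}")
print(f"liger on-demand: ${with_liger:.2f} ({1 - with_liger / baseline:.0%} cheaper)")
print(f"liger spot:      ${with_liger_spot:.2f}")
```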
Example 2: Qwen 3 14B QLoRA fine-tune
Without Liger: OOM at seq_len=2048 on a single H100 at batch=4. You have to reduce batch size or switch to gradient accumulation hacks.
With Liger: 51GB VRAM usage. Fits with room to spare. Batch=4 runs without any workaround.
Pricing fluctuates based on GPU availability. The prices above are based on 09 May 2026 and may have changed. Check current GPU pricing → for live rates.
## When NOT to Use Liger Kernel
- You are already using Unsloth. Unsloth applies overlapping custom kernels. Double-patching causes undefined backward pass behavior. Choose one or the other. See Axolotl vs Unsloth vs TorchTune for the framework tradeoff comparison.
- RL training with reward model in the same process. In PPO-style RL, if your actor and reward model share a single process and you patch both, the patched FusedLinearCrossEntropy can interact unexpectedly with the reward model's value head outputs. GRPO (which drops the critic entirely) is generally fine with Liger patches.
- Custom architectures with non-standard layer implementations. If your model uses a custom `MyRMSNorm` class rather than the standard Hugging Face implementation, Liger's patch function will not find it. You need to manually replace the class or write a custom patch.
- Debugging a loss spike or NaN. Fused kernels make it harder to insert intermediate inspection hooks. When you hit a training instability, temporarily disable Liger patches to rule them out before chasing other causes.
## Getting Started on Spheron
- Go to app.spheron.ai and select an H100 SXM or H200 instance. For the 8B-class fine-tunes in this post, H100 SXM is the most cost-effective starting point.
- SSH in and install PyTorch 2.5+ and liger-kernel:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "liger-kernel>=0.3.0" transformers accelerate
```

- Add one line before your model initialization:

```python
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()

# Then load your model as usual
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16)
```

- Run your training script and watch `nvidia-smi` during the first few steps. VRAM usage should drop visibly within the first forward-backward pass (a small logging callback for this follows this list).
- For Axolotl users (version 0.4+), add `use_liger: true` to your config YAML and nothing else changes. Axolotl handles the rest based on your `model_type` field.
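If you prefer a number in the training logs over eyeballing `nvidia-smi`, a small callback can report peak allocated VRAM. This is a generic Transformers `TrainerCallback` sketch, not something Liger provides.

```python
import torch
from transformers import TrainerCallback

class PeakVRAMCallback(TrainerCallback):
    """Print peak allocated VRAM (GB) for the first few optimizer steps."""

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available() and state.global_step <= 5:
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
            print(f"step {state.global_step}: peak VRAM {peak_gb:.1f} GB")

# Usage: trainer = Trainer(..., callbacks=[PeakVRAMCallback()])
```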
For SSH setup and instance configuration, see docs.spheron.ai.
Liger Kernel reduces VRAM by 40-60%, which translates directly to fewer GPUs needed per run. On Spheron's flat-rate H100 and B200 hardware, that means lower spend without giving up bare-metal performance or switching to a serverless API.
Rent H100 on Spheron → | Rent H200 → | Rent B200 → | View all pricing →
