Liger Kernel from LinkedIn AI cuts LLM training VRAM by 40-60% and improves throughput 15-20% with a one-line patch to Hugging Face Trainer. No special environment, no custom training loop. It works because the bottleneck in standard training is not compute but intermediate tensor allocation during forward and backward passes. Liger eliminates most of that through kernel fusion. If you are already following standard LLM fine-tuning practices, dropping Liger in is the single most impactful change you can make to your training cost.
## What Liger Kernel Actually Does
Standard PyTorch model layers allocate intermediate tensors at each computation step. RMSNorm, for example, allocates 2-3 tensors the full size of the activation (input, normalized, scaled) before producing output. A fused Triton kernel computes all three in a single GPU pass with zero intermediate allocation. Multiply that across 80 layers and you recover gigabytes of VRAM without touching your model architecture.
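For intuition, here is a deliberately naive RMSNorm in plain PyTorch. This is an illustrative sketch, not Liger's Triton kernel: each named intermediate below is a tensor on the order of the activation size that a fused kernel never has to write out to GPU memory.

```python
import torch

def naive_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Each step below materializes its own tensor in GPU memory.
    # A fused kernel produces the same output in a single pass over `x`
    # without storing these intermediates.
    variance = x.pow(2).mean(-1, keepdim=True)       # reduction buffer
    normalized = x * torch.rsqrt(variance + eps)     # full-size intermediate
    scaled = normalized * weight                     # full-size scaled output
    return scaled

x = torch.randn(4, 2048, 4096, dtype=torch.bfloat16)  # [batch, seq_len, hidden]
w = torch.ones(4096, dtype=torch.bfloat16)
out = naive_rmsnorm(x, w)
```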
Here is what each kernel does:
- RMSNorm: Replaces `torch.nn.RMSNorm` and `LlamaRMSNorm`. Eliminates 2 intermediate tensors per layer. Every transformer layer has 2 RMSNorm calls, so on a 70B model with 80 layers, this removes 160 large allocation events per forward pass.
- RoPE (Rotary Position Embedding): Fuses position encoding into the QKV projection, eliminating the intermediate `cos`/`sin` position cache allocation on the forward pass.
- SwiGLU / GeGLU: Fuses the gating activation with the feed-forward linear projection, eliminating the intermediate activation tensor (equal to `4 * hidden_dim * seq_len * batch`).
- FusedLinearCrossEntropy: The biggest win by far. Standard training computes `lm_head(hidden_states)`, which produces a `[batch * seq_len, vocab_size]` logit tensor (often 2-8 GB on large-vocab models), then computes softmax and cross-entropy on top. The fused version never materializes the full logit tensor. It computes cross-entropy in chunks directly, delivering 40-55% peak VRAM savings on models with vocab sizes above 100K tokens (Qwen 3, Llama 4).
| Component | Standard PyTorch | Liger Kernel | Peak Intermediate VRAM Eliminated |
|---|---|---|---|
| RMSNorm | 2-3 allocs/layer | 0 allocs/layer | ~1.5 GB on 70B (BF16) |
| RoPE | cos/sin buffer allocated | Fused in-place | ~200 MB per step |
| SwiGLU gate | Full activation tensor | Fused in-place | 2-4 GB depending on seq_len |
| Linear + CrossEntropy | Full logit tensor materialized | Chunked, never materialized | 4-8 GB on 128K vocab models |
The savings compound. FusedLinearCrossEntropy alone accounts for roughly half the VRAM savings on Llama 4 and Qwen 3 models with 128K+ token vocabularies. The norm and activation kernels cover the rest.
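The chunking idea behind FusedLinearCrossEntropy can be sketched in a few lines of plain PyTorch. The function name and `chunk_size` are illustrative, not Liger's implementation; the point is that only one `[chunk, vocab_size]` slice of logits exists at a time.

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    """Mean cross-entropy without building the full [num_tokens, vocab] logit tensor.

    hidden:         [num_tokens, hidden_dim]
    lm_head_weight: [vocab_size, hidden_dim]
    labels:         [num_tokens]
    """
    total_loss = hidden.new_zeros((), dtype=torch.float32)
    num_tokens = hidden.shape[0]
    for start in range(0, num_tokens, chunk_size):
        h = hidden[start:start + chunk_size]          # small slice of tokens
        logits = h @ lm_head_weight.T                 # [chunk, vocab] only
        total_loss += F.cross_entropy(
            logits.float(), labels[start:start + chunk_size], reduction="sum"
        )
    return total_loss / num_tokens
```

The real saving also requires the backward pass to avoid retaining each chunk's logits, which is where Liger's fused Triton kernel does the heavy lifting; this sketch only illustrates the forward-pass idea.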
One version note: verify your installed version includes FusedLinearCrossEntropy before relying on the large-vocab savings. Running `from liger_kernel.transformers.functional import liger_cross_entropy` should not raise an `ImportError`. Use `liger-kernel >= 0.3.0` to be safe.
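A quick check you can run in the training environment before kicking off a long job:

```python
# Verify the installed liger-kernel exposes the fused cross-entropy path
try:
    from liger_kernel.transformers.functional import liger_cross_entropy  # noqa: F401
    print("FusedLinearCrossEntropy path available")
except ImportError:
    print("liger-kernel too old - upgrade with: pip install -U 'liger-kernel>=0.3.0'")
```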
## Model and Framework Coverage in 2026
| Model Family | Architecture Match | Patch Function | Axolotl Config | TRL Compatible |
|---|---|---|---|---|
| Llama 3/4 | RMSNorm, RoPE, SwiGLU | apply_liger_kernel_to_llama() | use_liger: true | Yes |
| Qwen 2.5 / 3 | RMSNorm, RoPE, SwiGLU | apply_liger_kernel_to_qwen2() | use_liger: true | Yes |
| Gemma 2/3 | RMSNorm, RoPE, GeGLU | apply_liger_kernel_to_gemma() | use_liger: true | Yes |
| Mistral 3 | RMSNorm, RoPE, SwiGLU | apply_liger_kernel_to_mistral() | use_liger: true | Yes |
| Phi-3/4 | RMSNorm, RoPE, SwiGLU | apply_liger_kernel_to_phi3() | use_liger: true | Yes |
| Multimodal (LLaVA, Llama 4 Scout visual) | Vision encoder not patched | Base LLM layers patched only | Partial | Check manually |
For multimodal models, the vision encoder and cross-attention projectors are not auto-patched. Only the LLM decoder layers get Liger coverage. VRAM savings will be smaller than the pure-LLM figures above.
## One-Line Integration
### Hugging Face Trainer
```python
import torch
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()  # Call before model.from_pretrained()

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16)
# Liger patches are already active - train as normal
```

Call the patch function before `from_pretrained`. The patches modify the module classes in the model registry, so any model loaded after the call picks them up automatically.
### TRL (SFT, DPO, GRPO)
```python
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()

from trl import SFTTrainer, GRPOTrainer, DPOTrainer
# Pass model normally - patches persist through TRL's internal model wrapping
```

The patches persist through TRL's internal model wrapping. See the GRPO fine-tuning guide and DPO fine-tuning guide for full TRL-specific setup context.
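A minimal SFT sketch with the patch applied before the model is instantiated. The dataset and the `SFTConfig` fields here are placeholders, and the exact `SFTConfig` arguments vary by TRL version.

```python
import torch
from datasets import load_dataset
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

apply_liger_kernel_to_llama()  # patch first, before the model classes are built

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
)
dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(output_dir="./sft-liger", per_device_train_batch_size=4),
)
trainer.train()
```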
### Axolotl

```yaml
# In your axolotl config.yaml
use_liger: true
```

One line. Axolotl (version 0.4+) calls the appropriate patch function automatically before loading the model, based on the `model_type` field in your config. See Axolotl vs Unsloth vs TorchTune for the broader framework comparison.
### Unsloth
Do not apply Liger patches to models loaded or wrapped by Unsloth. Unsloth applies its own custom CUDA kernels that cover RMSNorm, RoPE, and SwiGLU fusion natively. Applying Liger on top causes double-patching and undefined behavior in the backward pass. Pick one or the other, not both.
## Benchmarks on Spheron GPU Cloud
The numbers below reflect benchmark data from the Liger Kernel project and community reports. Actual results depend on sequence length, batch size, and checkpoint strategy. Use them to set expectations, not as absolute guarantees.
VRAM comparison (BF16 full fine-tune, seq_len=2048, batch=4):
| Model | GPU | Baseline VRAM | Liger VRAM | Reduction | Notes |
|---|---|---|---|---|---|
| Llama 3.1 8B | H100 SXM | 72 GB (AdamW FP32 optimizer states, no gradient checkpointing) | 29 GB | 60% | FusedLinearCrossEntropy dominates savings on 128K vocab |
| Qwen 3 14B | H100 SXM | ~79 GB (OOM at batch=4) | 51 GB | Avoids OOM | SwiGLU + large vocab |
| Llama 3.1 70B | 8x H100 SXM | 74 GB/GPU | 46 GB/GPU | 38% | Enables larger batch on same hardware |
| Gemma 3 27B | H200 141GB | 68 GB | 41 GB | 40% | GeGLU variant fully supported |
| Llama 4 Scout 17B | B200 192GB | 62 GB | 38 GB | 39% | Headroom for longer context (expected, based on architecture similarity) |
The B200 figures follow the same kernel fusion pattern as H100 and H200. Community benchmarks for B200 with Liger are still accumulating, so treat the 39% figure as directionally consistent with the architecture rather than a measured result.
Throughput comparison (tokens/sec, H100 SXM, Llama 3.1 8B, seq_len=2048):
| Config | Tokens/sec | vs Baseline |
|---|---|---|
| Baseline HF Trainer | ~3,100 | 1x |
| + Liger Kernel | ~3,800 | +22% |
| + Liger + FlashAttention-4 | ~5,200 | +68% |
| + Liger + FA4 + ZeRO-3 (8 GPU) | ~38,000 aggregate | ~12x (8-GPU aggregate) |
The throughput gain from Liger alone (22%) comes from reduced pressure on the GPU's memory subsystem: fewer allocations mean fewer cache misses, so the compute units spend less time waiting on memory.
## Combining Liger with FlashAttention-4, FSDP, and DeepSpeed ZeRO-3
FlashAttention-4 + Liger: Complementary by design. FA4 optimizes attention computation. Liger covers everything else: norms, activations, cross-entropy. No conflict. Apply both independently. Call apply_liger_kernel_to_llama() first, then set attn_implementation="flash_attention_2" (or flash_attention_4 when HF Transformers integrates it natively) in from_pretrained.
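The ordering only matters in that the patch has to land before the model classes are instantiated. A minimal sketch of the combination described above, using the `flash_attention_2` implementation that current Transformers ships:

```python
import torch
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

apply_liger_kernel_to_llama()  # 1. patch norms, MLP activation, and loss first

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2. FlashAttention covers attention only
)
```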
FSDP2 + Liger: Fully compatible. FSDP2 shards parameters. Liger replaces computation in those parameters' forward passes. Apply Liger patches before calling fully_shard(). The patch function replaces module classes in the model registry, so FSDP2 wrapping picks up the replacements automatically. See the distributed LLM training guide for the full FSDP2 setup.
DeepSpeed ZeRO-3 + Liger: Compatible with ZeRO-1, ZeRO-2, and ZeRO-3. Apply patches before deepspeed.initialize(). ZeRO-3's parameter scatter/gather happens at a lower level than Liger's layer implementations.
## Known Gotchas
1. Gradient checkpointing + RMSNorm. Liger's RMSNorm stores fewer activations for the backward pass by design. If you enable model.gradient_checkpointing_enable() alongside Liger patches, the recomputation path for checkpointed layers may conflict with the fused backward kernel. Test both together and verify loss values match a non-checkpointed run for 50-100 steps before committing to long training runs.
2. Multimodal projection heads. LLaVA-style models and Llama 4 Scout with the visual encoder have additional LayerNorm and projection layers that Liger does NOT auto-patch. The LLM decoder layers get patched. The vision encoder and cross-attention projectors do not. VRAM savings will be smaller than the pure-LLM figures above.
3. FusedLinearCrossEntropy numerical drift. The fused chunked cross-entropy computes identical results mathematically but uses a different floating-point accumulation order. Loss values may differ in the 4th-6th decimal place from standard cross-entropy. This is expected and not a training bug, but it can trigger false positives if you are comparing loss curves between patched and unpatched runs to check for regressions.
4. Custom tokenizers with small vocab. The savings from FusedLinearCrossEntropy scale with vocab size. On models with a vocab under 32K tokens, the gains from this kernel are smaller and total VRAM savings will be more modest (15-25% total rather than 40-60%).
## Pricing Math: What Liger Saves per Fine-Tune Run on Spheron
| GPU | On-demand $/hr/GPU | Spot $/hr/GPU |
|---|---|---|
| H100 SXM on Spheron | $4.21 | $0.80 |
| H200 on Spheron | $4.54 | $1.19 |
| B200 on Spheron | $7.00 | $1.71 |
Example 1: Llama 3.1 8B full fine-tune, 1B tokens
Without Liger, this job requires 2x H100 SXM for VRAM headroom at batch=4, seq_len=2048 (72GB baseline VRAM is tight on a single 80GB card). With FSDP2 coordination overhead, the job runs for roughly 10 hours across two GPUs. Cost: 2 x $4.21 x 10 = $84.20.
With Liger, the same job fits comfortably on 1x H100 (29GB VRAM). On a single GPU, it runs for about 16 hours. Cost: 1 x $4.21 x 16 = $67.36. That is roughly 20% cheaper and avoids multi-GPU coordination setup entirely.
Use Spheron's H100 spot rate ($0.80/hr) for non-time-sensitive fine-tuning jobs that can checkpoint and resume. The same 16-hour single-GPU run drops to $12.80.
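The arithmetic is simple enough to keep in a small helper. The rates and runtimes below are the ones quoted in this example and will drift as pricing changes.

```python
# Cost comparison for Example 1, using the rates and runtimes quoted above
H100_ON_DEMAND = 4.21  # $/hr/GPU
H100_SPOT = 0.80       # $/hr/GPU

baseline = 2 * H100_ON_DEMAND * 10    # 2x H100, ~10 h  -> $84.20
with_liger = 1 * H100_ON_DEMAND * 16  # 1x H100, ~16 h  -> $67.36
with_liger_spot = 1 * H100_SPOT * 16  #                 -> $12.80

print(f"baseline:        ${baseline:.2f}")
print(f"liger on-demand: ${with_liger:.2f} ({1 - with_liger / baseline:.0%} cheaper)")
print(f"liger spot:      ${with_liger_spot:.2f}")
```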
Example 2: Qwen 3 14B QLoRA fine-tune
Without Liger: OOM at seq_len=2048 on a single H100 at batch=4. You have to reduce batch size or switch to gradient accumulation hacks.
With Liger: 51GB VRAM usage. Fits with room to spare. Batch=4 runs without any workaround.
Pricing fluctuates based on GPU availability. The prices above are based on 09 May 2026 and may have changed. Check current GPU pricing → for live rates.
## When NOT to Use Liger Kernel
- You are already using Unsloth. Unsloth applies overlapping custom kernels. Double-patching causes undefined backward pass behavior. Choose one or the other. See Axolotl vs Unsloth vs TorchTune for the framework tradeoff comparison.
- RL training with reward model in the same process. In PPO-style RL, if your actor and reward model share a single process and you patch both, the patched FusedLinearCrossEntropy can interact unexpectedly with the reward model's value head outputs. GRPO (which drops the critic entirely) is generally fine with Liger patches.
- Custom architectures with non-standard layer implementations. If your model uses a custom `MyRMSNorm` class rather than the standard Hugging Face implementation, Liger's patch function will not find it. You need to manually replace the class or write a custom patch.
- Debugging a loss spike or NaN. Fused kernels make it harder to insert intermediate inspection hooks. When you hit a training instability, temporarily disable Liger patches to rule them out before chasing other causes.
## Getting Started on Spheron
- Go to app.spheron.ai and select an H100 SXM or H200 instance. For the 8B-class fine-tunes in this post, H100 SXM is the most cost-effective starting point.
- SSH in and install PyTorch 2.5+ and liger-kernel:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "liger-kernel>=0.3.0" transformers accelerate
```

- Add one line before your model initialization:

```python
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()

# Then load your model as usual
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16)
```

- Run your training script and watch `nvidia-smi` during the first few steps. VRAM usage should drop visibly within the first forward-backward pass (a small logging callback for this follows this list).
- For Axolotl users (version 0.4+), add `use_liger: true` to your config YAML and nothing else changes. Axolotl handles the rest based on your `model_type` field.
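If you prefer a number in the training logs over eyeballing `nvidia-smi`, a small callback can report peak allocated VRAM. This is a generic Transformers `TrainerCallback` sketch, not something Liger provides.

```python
import torch
from transformers import TrainerCallback

class PeakVRAMCallback(TrainerCallback):
    """Print peak allocated VRAM (GB) for the first few optimizer steps."""

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available() and state.global_step <= 5:
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
            print(f"step {state.global_step}: peak VRAM {peak_gb:.1f} GB")

# Usage: trainer = Trainer(..., callbacks=[PeakVRAMCallback()])
```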
For SSH setup and instance configuration, see docs.spheron.ai.
Liger Kernel reduces VRAM by 40-60%, which translates directly to fewer GPUs needed per run. On Spheron's flat-rate H100 and B200 hardware, that means lower spend without giving up bare-metal performance or switching to a serverless API.
Rent H100 on Spheron → | Rent H200 → | Rent B200 → | View all pricing →
