LFM2's hybrid architecture reduces KV cache at long context compared to a same-sized pure transformer. Of the 24 layers in LFM2-8B-A1B, 18 use LIV (Linear Input Variant) convolution with a fixed recurrent state and no KV cache, and 6 use Grouped Query Attention (GQA) with a smaller cache than standard multi-head attention. The practical result: at 32K context, VRAM stays well under an L40S even at moderate batch sizes, while a comparable dense transformer at the same context and batch starts pressing against the GPU's limit. This guide covers the full deployment path for LFM2-8B-A1B and LFM2-2.6B on Spheron GPU instances: architecture background, VRAM sizing, native vLLM setup, and cost math.
For background on KV cache memory behavior in transformer models, see the AI memory wall inference guide. For other non-transformer architectures in production use, see the Mamba-3 deployment guide and the xLSTM and RWKV-7 deployment guide.
LFM2 Hybrid Architecture: LIV Convolutions and GQA Blocks
LFM2's architecture combines two distinct layer types.
LIV (Linear Input Variant) convolution layers process tokens through a structured recurrent update. Each layer maintains a fixed-size state that gets updated on each new token via a structured matrix multiplication. There is no attention over past tokens and no KV cache: the state memory at token 10,000 is the same as at token 10. LFM2-8B-A1B has 18 of these layers.
GQA (Grouped Query Attention) layers work like standard transformer attention with a KV cache, but with fewer KV heads per layer than query heads. GQA reduces cache size compared to full multi-head attention (MHA) by sharing key-value pairs across grouped query heads. LFM2-8B-A1B has 6 of these layers.
The hybrid design means:
- 75% of the network (18/24 layers) generates no KV cache
- 25% of the network (6/24 GQA layers) generates a reduced KV cache that does grow with context length
- Total KV cache footprint is substantially smaller than a 24-layer pure transformer of similar size
This is different from earlier descriptions of LFM that characterized it as having "no KV cache". LFM2 does have a KV cache, from its GQA layers. The accurate description: the KV cache is significantly reduced compared to a pure transformer of the same size.
Architecture specifics for LFM2-8B-A1B are documented at docs.liquid.ai. The LFM2 technical report is on arxiv (arxiv.org/abs/2511.23404).
LFM2 Model Family: What Is Actually Available
All LFM2 models are published under the LiquidAI HuggingFace organization. Before downloading, verify current model availability there. The LFM2 family includes:
Dense and small models:
- LFM2-350M, LFM2-700M - very small, edge deployment targets
- LFM2-1.2B with -Instruct, -Thinking, and -Base variants
- LFM2-2.6B - compact dense model, good throughput/quality balance for batch workloads
Sparse MoE models:
- LFM2-8B-A1B - 8.3B total parameters, 1.5B active per token
- LFM2-24B-A2B - 24B total parameters, 2.3B active per token
Vision-language models:
- LFM2-VL-450M, LFM2-VL-3B
- LFM2.5-VL-450M, LFM2.5-VL-1.6B
Audio:
- LFM2.5-Audio-1.5B
This guide focuses on LFM2-8B-A1B (the primary inference target for most teams) and LFM2-2.6B (the compact option for cost-sensitive workloads). Real HuggingFace model IDs follow the pattern LiquidAI/LFM2-8B-A1B, LiquidAI/LFM2-2.6B, LiquidAI/LFM2-1.2B-Instruct.
KV Cache Reduction: How the Hybrid Design Affects GPU Economics
The memory pressure from KV cache growth is a concrete hardware problem for long-context transformer workloads. For a 7-8B transformer with 32 layers, 8 KV heads (GQA), 128 head dim, running at 32K context with batch size 2, BF16:
kv_cache_gb = 2 × 32 layers × 8 kv_heads × 128 head_dim × 32768 seq_len × 2 batch × 2 bytes / 1e9
= ~8.6 GBCombined with ~14-16 GB model weights, total VRAM reaches ~23-25 GB. That fits on an L40S, but raising batch to 4 pushes KV cache alone to ~17 GB and the total toward the 48 GB limit. For teams needing 32-64 concurrent sessions at 32K context, a single L40S is tight and you're likely moving to H100.
For LFM2-8B-A1B with only 6 GQA layers contributing KV cache, the cache at the same context and batch is roughly one-fifth:
- Model weights at BF16: ~17 GB (8.3B total params × 2 bytes)
- KV cache from 6 GQA layers at 32K context, batch 2: ~1.7 GB
- Total: ~18-19 GB
At batch 8, LFM2-8B-A1B's KV cache reaches ~6-7 GB, keeping total VRAM around 24 GB on an L40S. A same-size pure transformer at batch 8 needs ~34 GB for KV cache alone, making a single L40S infeasible. The L40S handles what the transformer requires an H100 for.
For LFM2-2.6B, the smaller weight footprint (~5.3 GB at BF16) and reduced KV cache leave substantial headroom even on an RTX 5090 (32 GB GDDR7) for high-concurrency batch workloads.
For the KV cache management techniques that do apply to LFM2's GQA layers, PagedAttention and continuous batching both work and improve throughput. See the LLM serving optimization guide for configuration details.
The KV cache optimization guide covers the transformer-side mitigations if you are evaluating LFM2 against optimized transformer baselines.
Hardware Requirements and VRAM Footprint
| Model | Precision | Weight VRAM | KV Cache at 32K ctx (batch 8) | Recommended GPU |
|---|---|---|---|---|
| LFM2-1.2B-Instruct | BF16 | ~2.5 GB | Minimal | L40S / RTX 5090 |
| LFM2-2.6B | BF16 | ~5.3 GB | Small | L40S / RTX 5090 |
| LFM2-8B-A1B | BF16 | ~17 GB | ~1-2 GB (6 GQA layers) | L40S |
| LFM2-8B-A1B | INT8 | ~9 GB | ~1-2 GB (6 GQA layers) | RTX 5090 / L40S |
| LFM2-24B-A2B | BF16 | ~48 GB | Larger | 2x H100 SXM5 |
KV cache column notes: These are estimates based on the hybrid architecture (6 GQA layers contributing KV cache vs 18 LIV layers with none). The KV cache does grow with context length and batch size from the GQA layers. Actual values depend on the GQA head configuration in each model. Run your own benchmarks using the vLLM memory-profiling flags to measure precisely.
LFM2-8B-A1B fits on a single L40S with room for moderate batching at the full 32,768-token context window. Its maximum documented context length is 32,768 tokens per the Liquid AI documentation.
For LFM2-24B-A2B at ~48 GB weights plus KV cache, a single H100 SXM5 (80 GB) can hold the model but with limited batching headroom. The standard path is 2x H100 SXM5 with tensor parallelism at --tensor-parallel-size 2, giving 160 GB combined VRAM. An H200 (141 GB) is an alternative for single-GPU serving with larger batch headroom.
Installation: Native vLLM Support
LFM2 has native vLLM support. No adapter package or separate runtime is required.
uv pip install vllm==0.14Tokenizer dependency: LFM2's tokenizer requires transformers>=5.0.0. vLLM 0.14 installs this, but if you're working in an environment with older transformers, update it explicitly:
uv pip install "transformers>=5.0.0"For vision models (LFM2-VL or LFM2.5-VL series):
uv pip install vllm==0.19.0 transformers==5.5.0 pillowRequirements: Python 3.10+, CUDA 12.1+.
Serving LFM2 with vLLM
Download model weights from the LiquidAI HuggingFace organization:
# LFM2-8B-A1B
huggingface-cli download LiquidAI/LFM2-8B-A1B --local-dir ./lfm2-8b-a1b
# LFM2-2.6B
huggingface-cli download LiquidAI/LFM2-2.6B --local-dir ./lfm2-2.6b
# LFM2-1.2B-Instruct (smaller variant)
huggingface-cli download LiquidAI/LFM2-1.2B-Instruct --local-dir ./lfm2-1.2b-instructLaunch the server using a HuggingFace model ID directly or a local path:
# Serve LFM2-8B-A1B from HuggingFace (downloads automatically)
vllm serve LiquidAI/LFM2-8B-A1B --host 0.0.0.0 --port 8000 --dtype bfloat16 --max-model-len 32768
# From local directory
vllm serve ./lfm2-8b-a1b --host 0.0.0.0 --port 8000 --dtype bfloat16 --max-model-len 32768
# LFM2.5-1.2B-Instruct variant
vllm serve LiquidAI/LFM2.5-1.2B-Instruct --host 0.0.0.0 --port 8000Verify the server:
curl http://localhost:8000/v1/modelsFor LFM2-24B-A2B on 2x H100 SXM5:
vllm serve LiquidAI/LFM2-24B-A2B \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--tensor-parallel-size 2Production Serving Parameters
LFM2's GQA layers do maintain a KV cache, so standard vLLM KV cache management applies. PagedAttention allocates KV blocks dynamically and is beneficial. Do not disable it.
Key parameters for LFM2-8B-A1B on an L40S:
vllm serve LiquidAI/LFM2-8B-A1B \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128--max-model-len 32768 sets the context window to LFM2-8B-A1B's documented maximum. Set this explicitly if vLLM infers a different value from the model config.
--gpu-memory-utilization 0.90 allocates 90% of available VRAM after weights load to the KV cache pool. Given LFM2-8B-A1B's smaller KV cache footprint, this leaves room for high concurrency.
--max-num-seqs 128 controls maximum concurrent sequences. You can raise this beyond typical transformer values given the smaller per-sequence KV cache, but benchmark first.
Do not pass --enable-prefix-caching false. Prefix caching benefits LFM2's GQA layers for repeated system prompts and should remain enabled.
For high-concurrency mixed workloads with both short and long context requests, also add --enable-chunked-prefill. The LLM serving optimization guide covers how chunked prefill reduces head-of-line blocking from long prefills.
Throughput Characteristics
LFM2-8B-A1B's MoE design routes each token through only 1.5B active parameters. This means per-token compute is closer to a 1.5B model than an 8B model, which translates to higher tokens-per-second than a dense 7-8B model on the same hardware.
At longer context, the reduced KV cache from the hybrid architecture means throughput degrades less than a comparable pure transformer. At 32K context with batch 8, the pure transformer needs ~23-25 GB total VRAM and starts limiting batch growth. LFM2-8B-A1B at the same settings needs ~18-19 GB, leaving 29 GB free on the L40S for additional concurrent sequences.
For verified throughput benchmarks on real LFM2 models, see the LFM2 technical report (arxiv.org/abs/2511.23404) and the benchmark data on the model cards at huggingface.co/LiquidAI. Run your own benchmarks before making infrastructure decisions:
python benchmarks/benchmark_throughput.py \
--model LiquidAI/LFM2-8B-A1B \
--input-len 4096 \
--output-len 512 \
--num-prompts 50Repeat at --input-len 16384 and --input-len 32768 to measure throughput vs context length and compare against a Llama 3.1 8B baseline on the same hardware.
Fine-Tuning LFM2 on GPU Cloud
Liquid AI provides fine-tuning support for the LFM2 family. Because LFM2 is a hybrid architecture, the LoRA target module setup differs from a pure transformer.
LFM2's 6 GQA layers expose standard attention projection layers (q_proj, k_proj, v_proj, o_proj), which standard LoRA tooling can target. The 18 LIV convolution layers have their own projection structure with different target modules.
Before configuring fine-tuning, check docs.liquid.ai for the current fine-tuning documentation and the recommended LoRA target modules for LFM2. Do not copy target module settings from a transformer fine-tuning guide directly, as the hybrid layer types require different configurations.
For LFM2-8B-A1B LoRA on a single H100 SXM5 at BF16 with rank 16, rough VRAM estimates:
- Model weights: ~17 GB
- LoRA adapters: ~0.3-1 GB
- Optimizer states (AdamW): ~30-35 GB
- Activations: ~5-10 GB
- Total: ~52-63 GB, fits within H100 SXM5 (80 GB) with moderate headroom
For larger datasets or higher LoRA ranks, 2x H100 or H200 adds optimizer state headroom.
HuggingFace datasets in standard instruction-tuning formats (Alpaca, ShareGPT) are the expected input format. No conversion is needed if your data is already in one of these formats.
Cost Analysis on Spheron
These rates are based on current Spheron GPU pricing. All rows are labeled with the billing mode.
| Deployment | GPU | Billing | Hourly Cost |
|---|---|---|---|
| LFM2-2.6B | L40S | On-demand | $0.72/hr |
| LFM2-8B-A1B | L40S | On-demand | $0.72/hr |
| LFM2-24B-A2B | 2x H100 SXM5 | On-demand | ~$7.80/hr |
| LFM2-24B-A2B | 2x H100 SXM5 | Spot | ~$3.46/hr |
| Llama 3.1 8B baseline | L40S | On-demand | $0.72/hr |
| Llama 3.1 8B (long ctx) | H100 SXM5 | On-demand | $3.90/hr |
| Llama 3.1 8B (long ctx) | H100 SXM5 | Spot | $1.73/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 29 May 2026 and may have changed. Check current GPU pricing → for live rates.
The cost case for LFM2-8B-A1B vs a comparable transformer is strongest when context windows are consistently 8K-32K and concurrency is moderate-to-high. At that range, a transformer deployment handling many concurrent long-context sessions may need to move from L40S to H100 to avoid KV cache pressure, raising per-GPU cost from $0.72/hr to $3.90/hr on-demand. LFM2-8B-A1B running the same workload stays on the L40S, with the MoE design's 1.5B active params giving additional throughput headroom.
At short context (under 4K tokens), there is no meaningful KV cache difference, and transformer tooling with FlashAttention is mature and highly optimized. The cost argument for LFM2 at short context is weaker.
When LFM2 Wins and When It Doesn't
LFM2-8B-A1B is the better choice when:
- Context lengths are consistently 8K-32K tokens, where the hybrid architecture's reduced KV cache provides a meaningful batching advantage on an L40S
- You need higher tokens-per-second than a dense 7-8B model and the MoE's 1.5B active params per token gives a compute advantage
- GPU budget is constrained and L40S is the target tier
- You need to maximize concurrent sessions on a single GPU at long context
LFM2 may not be the better choice when:
- Context is consistently short (under 4K tokens), where transformer FlashAttention optimizations are highly tuned and the KV cache advantage disappears
- You need context beyond 32K tokens. LFM2-8B-A1B's documented maximum is 32,768 tokens
- Fine-tuning ecosystem depth matters. Transformer LoRA tooling (PEFT, Axolotl, Unsloth) has broader support, documentation, and community resources in 2026
- Tasks require very precise recall from the middle of long sequences. LFM architectures compress sequence history differently from attention, which can affect retrieval of specific facts embedded deep in a long context
- Your workload requires vision or multimodal processing (use LFM2-VL-3B or LFM2.5-VL-1.6B, which have different hardware requirements from the text-only models)
For pure inference at 8K-32K context on an L40S budget, LFM2-8B-A1B is a practical option. The deployment path is standard vLLM with no adapter packages and no special runtime. The trade-off is that transformer ecosystem tooling remains considerably more mature and battle-tested in production.
LFM2-8B-A1B's hybrid architecture lets you run 8K-32K context workloads on an L40S that a comparable transformer would need an H100 for.
L40S on Spheron → | H100 SXM5 on Spheron → | View current GPU pricing →
Quick Setup Guide
Log in to app.spheron.ai. For LFM2-8B-A1B inference, select an L40S (48 GB) instance. The model weights take about 17 GB at BF16 and the reduced KV cache leaves significant batching headroom. For LFM2-2.6B, an L40S or RTX 5090 (32 GB) works. For LFM2-24B-A2B, select 2x H100 SXM5 or an H200. Deploy with Ubuntu 22.04 and the NVIDIA Docker runtime template, then confirm with nvidia-smi.
Run: uv pip install vllm==0.14. LFM2 has native vLLM support and does not require any adapter or runtime package. The tokenizer requires transformers>=5.0.0, which vLLM 0.14 installs as a dependency. For vision models (LFM2-VL or LFM2.5-VL): uv pip install vllm==0.19.0 transformers==5.5.0 pillow. Requires Python 3.10+, CUDA 12.1+.
LFM2 models are published under the LiquidAI HuggingFace organization. For LFM2-8B-A1B: huggingface-cli download LiquidAI/LFM2-8B-A1B --local-dir ./lfm2-8b-a1b. For LFM2-2.6B: huggingface-cli download LiquidAI/LFM2-2.6B --local-dir ./lfm2-2.6b. For LFM2-1.2B-Instruct: huggingface-cli download LiquidAI/LFM2-1.2B-Instruct --local-dir ./lfm2-1.2b-instruct. Verify current model IDs on the hub before downloading.
Run: vllm serve LiquidAI/LFM2-8B-A1B --host 0.0.0.0 --port 8000 --dtype bfloat16 --max-model-len 32768. For LFM2.5-1.2B-Instruct: vllm serve LiquidAI/LFM2.5-1.2B-Instruct --host 0.0.0.0 --port 8000. The server exposes OpenAI-compatible endpoints. Verify with: curl http://localhost:8000/v1/models. For multi-GPU with LFM2-24B-A2B on 2x H100 SXM5: add --tensor-parallel-size 2.
Set --gpu-memory-utilization 0.90 to dedicate 90% of remaining VRAM to the KV cache pool. Set --max-num-seqs to your expected peak concurrency (128 is a reasonable starting point for LFM2-8B-A1B on an L40S). Because LFM2's hybrid architecture generates less KV cache than a pure transformer of similar size, you can run more concurrent sequences on the same GPU. Benchmark with vLLM's benchmark_serving.py at your target concurrency before committing to a configuration.
Frequently Asked Questions
LFM2-8B-A1B has 8.3B total parameters (1.5B active per token via MoE routing). At BF16, model weights take approximately 17 GB. The hybrid architecture uses 18 LIV convolution layers with no KV cache and 6 GQA layers with a reduced KV cache. At 32K context and moderate batch sizes, the KV cache from the GQA layers adds roughly 1-2 GB, bringing total VRAM to around 18-19 GB. An L40S (48 GB) fits this comfortably and has significant headroom for batching.
No. LFM2 has native vLLM support from day one. No adapter or runtime package is required. The install is: uv pip install vllm==0.14. LFM2's tokenizer requires transformers>=5.0.0, which vLLM 0.14 installs. For vision models (LFM2-VL or LFM2.5-VL), use: uv pip install vllm==0.19.0 transformers==5.5.0 pillow.
LFM2-8B-A1B's documented context window is 32,768 tokens (32K), per the official Liquid AI documentation at docs.liquid.ai. Set --max-model-len 32768 in your vLLM launch command to make this explicit.
LFM2-8B-A1B uses 18 LIV (Linear Input Variant) convolution layers and 6 GQA (Grouped Query Attention) layers. The LIV layers maintain a fixed-size recurrent state with no KV cache growth. The 6 GQA layers do maintain a KV cache, but because only 25% of the layers use attention (vs all layers in a pure transformer), and GQA already uses fewer KV heads than MHA, the total KV cache is substantially smaller at long context. LFM2 does grow a KV cache, just a much smaller one than a comparable pure transformer.
LFM2-8B-A1B is a sparse MoE model with 8.3B total parameters and 1.5B active per token. LFM2-2.6B is a smaller dense model with 2.6B parameters. LFM2-8B-A1B delivers better quality for complex reasoning and instruction-following. LFM2-2.6B is more cost-effective for batch summarization, classification, and simpler generation workloads where the quality tradeoff is acceptable.
