KV cache memory is the main GPU VRAM bottleneck when deploying large language models at long context. Multi-Head Latent Attention (MLA) cuts it by ~98% versus standard MHA by compressing keys and values into a shared low-rank latent vector before storing them. That compression changes the GPU economics: the same H200 that can hold 1-2 concurrent users at 128K context with standard GQA attention can hold 8-9 with MLA, at the same hourly rate.
This post covers the architecture, the memory math, which 2026 open models use MLA, how to configure vLLM and SGLang for optimal MLA serving, and a step-by-step deployment on Spheron H200 and B200 instances.
What MLA Is and How It Differs from MHA, GQA, and MQA
Standard Multi-Head Attention (MHA) caches one key vector and one value vector per attention head per token per layer. For a model with 128 heads and head_dim=128, that's 128 × 128 × 2 (K and V) × 2 bytes (BF16) = 65,536 bytes per token per layer. At 128K context, the KV cache for one user on a 60-layer model is 128K × 65,536 × 60 bytes = ~480 GB. That is more than three times the 141 GB of a single H200, before counting model weights.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce this by sharing key and value heads across query heads. GQA with 8 groups cuts it by roughly 16x (to ~30 GB at 128K context). MLA takes a different approach: instead of sharing heads, it projects all keys and values down into a single low-dimensional latent vector before caching. The full K and V tensors are reconstructed during each decode step via learned up-projection matrices.
| Mechanism | KV Cache Size (relative) | Per-Layer KV | Notes |
|---|---|---|---|
| MHA | 100% baseline | num_heads × head_dim × 2 per token | Standard in GPT-3, Llama 2, Falcon |
| MQA | 1/num_heads of MHA (~6-12% at 8-16 heads) | 1 × head_dim × 2 per token | Shares one K/V head across all Q heads |
| GQA | ~12-50% of MHA | num_groups × head_dim × 2 per token | Llama 3, Qwen 2.5, Mistral 3 |
| MLA | ~1-2% of MHA | d_c per token (one latent vector, plus a small RoPE key r) | DeepSeek V2, V3, and Kimi K2; single compressed vector |
The MLA projection mechanism: MLA adds a down-projection matrix W_DKV that maps each token's hidden state (from which K and V would otherwise be computed) to a low-dimensional latent vector c_kv of dimension d_c (512 for DeepSeek models). Only c_kv is written to the KV cache. On each decode step, the serving framework applies learned up-projection matrices W_UK and W_UV to reconstruct the full K and V tensors before computing attention. The stored footprint is d_c × 2 bytes per token per layer. For DeepSeek models with d_c = 512, num_heads = 128, and head_dim = 128, MLA stores 64x fewer KV bytes per token per layer than MHA and 4x fewer than GQA with 8 groups.
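The cache-then-reconstruct flow can be sketched in a few lines of plain Python. This is a toy illustration, not the DeepSeek implementation: the sizes are deliberately tiny, the weights are random, the RoPE component is ignored, and the helper names (`append_token`, `reconstruct_kv`) are hypothetical. It assumes the latent is computed from each token's hidden state.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (hypothetical). DeepSeek-scale values would be d_c=512, 128 heads, head_dim=128.
D_MODEL, D_C, NUM_HEADS, HEAD_DIM = 16, 4, 2, 8

rng = random.Random(0)
W_DKV = random_matrix(D_C, D_MODEL, rng)              # down-projection to the latent
W_UK = random_matrix(NUM_HEADS * HEAD_DIM, D_C, rng)  # up-projection for keys
W_UV = random_matrix(NUM_HEADS * HEAD_DIM, D_C, rng)  # up-projection for values

kv_cache = []  # one d_c-sized latent per token; nothing else is stored

def append_token(hidden_state):
    """Compress the token's hidden state and cache only the latent."""
    kv_cache.append(matvec(W_DKV, hidden_state))

def reconstruct_kv():
    """Rebuild full K and V for every cached token from the latents."""
    keys = [matvec(W_UK, c_kv) for c_kv in kv_cache]
    values = [matvec(W_UV, c_kv) for c_kv in kv_cache]
    return keys, values

for _ in range(3):
    append_token([rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)])

keys, values = reconstruct_kv()
cached_elements = len(kv_cache) * D_C
full_elements = len(keys) * NUM_HEADS * HEAD_DIM * 2  # K and V that MHA would store
print(cached_elements, full_elements)
```

With these toy sizes the cache holds 8x fewer elements than full K and V would; at DeepSeek's dimensions the same ratio works out to the 64x quoted above.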
The Memory Math: What ~98% KV Compression Actually Means
The formula for KV cache memory per attention layer:
- MHA: 2 × num_heads × head_dim × 2 bytes × num_tokens
- GQA (8 groups): 2 × num_groups × head_dim × 2 bytes × num_tokens
- MLA: d_c × 2 bytes × num_tokens
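As a quick check, the three formulas can be written as small Python functions. This is a sketch under the article's assumptions (BF16 storage at 2 bytes per element, 128 heads, head_dim=128, d_c=512, 60 layers); the function names are illustrative. The round figures used in this post come out exactly when 128K is taken as 131,072 tokens and a GB as 2^30 bytes.

```python
BYTES_BF16 = 2  # bytes per stored element at BF16 precision

def mha_bytes(num_heads, head_dim, bytes_per_elem=BYTES_BF16):
    """KV bytes per token per layer for MHA: K and V for every head."""
    return 2 * num_heads * head_dim * bytes_per_elem

def gqa_bytes(num_groups, head_dim, bytes_per_elem=BYTES_BF16):
    """KV bytes per token per layer for GQA: K and V for every KV group."""
    return 2 * num_groups * head_dim * bytes_per_elem

def mla_bytes(d_c, bytes_per_elem=BYTES_BF16):
    """KV bytes per token per layer for MLA: one latent vector."""
    return d_c * bytes_per_elem

def cache_gb(bytes_per_token_layer, num_tokens, num_layers):
    """Total cache for one user, in units of 2**30 bytes."""
    return bytes_per_token_layer * num_tokens * num_layers / 2**30

per_token = {
    "MHA": mha_bytes(128, 128),
    "GQA-8": gqa_bytes(8, 128),
    "MLA": mla_bytes(512),
}
for name, size in per_token.items():
    print(name, size, cache_gb(size, 128 * 1024, 60))
```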
For DeepSeek V3 with 128 heads, head_dim=128, d_c=512, and 60 transformer layers (rounded; the published config lists 61):
| Config | MHA (baseline) | GQA (8 groups, Llama 3 style) | MLA (DeepSeek V3) |
|---|---|---|---|
| KV bytes per token per layer | 65,536 | 4,096 | 1,024 |
| KV cache at 128K ctx, 1 user, 60 layers | ~480 GB | ~30 GB | ~7.5 GB |
| KV cache at 32K ctx, 1 user, 60 layers | ~120 GB | ~7.5 GB | ~1.9 GB |
| Max concurrent users on H200 141 GB (after ~75 GB weights) | 0 | ~2 | ~8-9 |
The punchline: a model that cannot fit a single concurrent user at 128K context with MHA can serve 8-9 concurrent users with MLA on the same hardware. At H200 spot pricing on Spheron, that translates to roughly 8-9x lower cost per token than a one-user-per-GPU deployment for the KV-cache-bound portion of serving.
For H200 SXM5 at $1.77/hr spot, serving 9 concurrent 128K-context DeepSeek V3 requests means the effective GPU cost per concurrent session is around $0.20/hr, versus a hypothetical MHA equivalent needing separate GPU capacity per user. Note that spot instances can be reclaimed with short notice; for production SLA-bound workloads, plan for that interruption risk.
Pricing fluctuates based on GPU availability. The prices above are based on 25 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
Adding FP8 KV cache quantization (available on H200 via --kv-cache-dtype fp8) halves the already-compressed MLA footprint again, from ~7.5 GB to ~3.75 GB per user at 128K context. That pushes concurrent user capacity to 16-18 per node. On B200 with --kv-cache-dtype nvfp4, another 50% reduction is possible on top of FP8.
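The concurrency figures follow from dividing free VRAM by the per-user cache. The sketch below reproduces that arithmetic using the numbers quoted in this section (141 GB H200, ~75 GB of weights, 7.5 GB per user at BF16, $1.77/hr spot). It assumes FP4 is exactly half of FP8 and ignores any scale-factor or allocator overhead, so treat the results as upper bounds.

```python
import math

GPU_GB = 141        # H200 capacity, from the text
WEIGHTS_GB = 75     # weight footprint assumed in the table above
MLA_BF16_GB = 7.5   # per-user MLA cache at 128K context, 60 layers, BF16
HOURLY_RATE = 1.77  # H200 spot price quoted in the text

# Bytes per stored element for each KV cache precision.
BYTES_PER_ELEMENT = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def users_per_gpu(kv_gb_per_user, gpu_gb=GPU_GB, weights_gb=WEIGHTS_GB):
    """Whole users that fit in the VRAM left after model weights."""
    return max(0, math.floor((gpu_gb - weights_gb) / kv_gb_per_user))

capacity = {}
for dtype, width in BYTES_PER_ELEMENT.items():
    kv_gb = MLA_BF16_GB * width / 2.0  # scale the BF16 footprint by element width
    users = users_per_gpu(kv_gb)
    capacity[dtype] = users
    print(f"{dtype}: {kv_gb:.3f} GB/user, {users} users, "
          f"${HOURLY_RATE / users:.2f}/hr per session")
```

Rounding down to whole users gives 8 at BF16 and 17 at FP8, in line with the 8-9 and 16-18 ranges quoted above.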
Which 2026 Open Models Use MLA
DeepSeek V2 and V3 both use MLA. DeepSeek V4 switched to a different compression approach (token-wise KV compression plus DeepSeek Sparse Attention) and does not use MLA. The Kimi K2 series from Moonshot AI uses a related MLA-based architecture. Together these cover a significant portion of the frontier open models available for self-hosting in 2026.
| Model | MLA Variant | d_c (latent dim) | Context | Minimum GPU Config |
|---|---|---|---|---|
| DeepSeek V3 | Standard MLA | 512 | 128K | 8x H200 (FP8) |
| Kimi K2.5, K2.6, K2.7 | K2 MLA architecture | ~512 | 256K | 8x H200 or 8x B200 |
For Kimi K2.6, see the Kimi K2.6 deployment guide which covers the Moonshot MLA implementation and multi-GPU configuration.
Serving MLA Models: vLLM and SGLang Configuration
vLLM
vLLM handles MLA natively for DeepSeek-family and Kimi K2 models from the model config. No special flag is required to activate MLA; vLLM reads the attention architecture from the model's config.json and routes accordingly.
Key flags for production MLA serving:
On H200 (Hopper):
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.92 \
--port 8000
--kv-cache-dtype fp8 halves the MLA KV footprint. --dtype takes auto or a 16/32-bit float type; the FP8 weight format is read from the checkpoint itself. Do not use --kv-cache-dtype nvfp4 on H200 for KV cache: Hopper lacks the hardware FP4 tensor core path for KV storage, so it falls back to software emulation that degrades throughput.
On B200 (Blackwell):
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--dtype auto \
--kv-cache-dtype nvfp4 \
--max-model-len 131072 \
--gpu-memory-utilization 0.92 \
--port 8000
--kv-cache-dtype nvfp4 gives an additional 50% reduction versus FP8. B200's Blackwell architecture accelerates FP4 KV cache operations in hardware, making this the highest-density configuration for MLA serving.
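Once either server is up, it can be exercised through vLLM's OpenAI-compatible HTTP API. The sketch below builds a completion request with only the standard library. It assumes the server listens on localhost:8000 and was started with the model name shown; the helper names and the prompt are illustrative. The actual network call is left commented out so the snippet can be run without a live server.

```python
import json
import urllib.request

def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(request, timeout=120):
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]

request = build_completion_request(
    base_url="http://localhost:8000",   # assumed host and port
    model="deepseek-ai/DeepSeek-V3",    # must match the name passed to `vllm serve`
    prompt="Summarize Multi-Head Latent Attention in one sentence.",
)
# print(complete(request))  # uncomment once the server is running
```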
For a detailed comparison of vLLM and SGLang for prefix-overlap workloads (which benefit MLA deployments with shared system prompts), see the vLLM vs SGLang 2026 comparison.
SGLang
SGLang offers a dedicated FlashInfer MLA kernel path that handles the compressed latent format directly, avoiding the naive expand-then-attend path. Pass --attention-backend flashinfer to activate it:
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 8 \
--attention-backend flashinfer \
--dtype auto \
--context-length 131072 \
--mem-fraction-static 0.90 \
--port 30000
On Hopper and Blackwell hardware, SGLang may auto-select FlashInfer as the default MLA backend. Pass --attention-backend flashinfer explicitly to ensure the optimized FlashInfer MLA kernel path regardless of hardware auto-detection. This reduces attention compute time 20-35% for MLA models compared to the decomposed expansion path.
For Kimi K2 models in SGLang, use the specific activation flag:
python -m sglang.launch_server \
--model-path moonshotai/Kimi-K2.6 \
--tp 8 \
--attention-backend flashinfer \
--enable-flashinfer-mla \
--dtype auto \
--context-length 65536 \
--mem-fraction-static 0.90 \
--port 30000
For deeper kernel configuration, see the FlashInfer GPU cloud guide which covers the FlashInfer MLA kernel path, how it interacts with block-sparse KV storage, and benchmarks across H100, H200, and B200.
Throughput and Cost-Per-Token Benchmarks on H200 and B200
The benchmarks below are directional. Real performance depends on request distribution, prefix reuse, batch composition, and hardware configuration. They show the relative effect of MLA's concurrency multiplier on cost-per-token.
| Model | GPU | Context | Concurrency | tok/sec | $/hr (spot) | $/M tokens |
|---|---|---|---|---|---|---|
| Llama 3.3 70B (GQA) | 1x H200 | 4K | 8 | ~18,000 | $1.77 | ~$0.027 |
| DeepSeek V3 (MLA, FP8 KV) | 8x H200 | 32K | 16 | ~9,500 | $14.16 | ~$0.41 |
| DeepSeek V3 (MLA, FP8 KV) | 8x H200 | 128K | 8 | ~4,200 | $14.16 | ~$0.94 |
| DeepSeek V3 (MLA, FP4 KV) | 8x B200 | 128K | 16 | ~7,800 | $42.72 | ~$1.52 |
Cost-per-token formula: ($/hr) / (tok/sec × 3600) × 1,000,000
For 8x H200 spot at $14.16/hr with 9,500 tok/sec at 32K context: 14.16 / (9500 × 3600) × 1,000,000 ≈ $0.41/M tokens.
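The same formula as a small Python helper, applied to each row of the benchmark table. The function name is illustrative; the inputs are the hourly rates and throughputs listed above.

```python
def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    """Dollars per 1M generated tokens at a fixed hourly rate and throughput."""
    return dollars_per_hour / (tokens_per_second * 3600) * 1_000_000

# (label, $/hr, tok/sec) taken from the benchmark table
rows = [
    ("Llama 3.3 70B, 1x H200, 4K", 1.77, 18_000),
    ("DeepSeek V3, 8x H200, 32K", 14.16, 9_500),
    ("DeepSeek V3, 8x H200, 128K", 14.16, 4_200),
    ("DeepSeek V3, 8x B200, 128K", 42.72, 7_800),
]
for label, rate, throughput in rows:
    print(f"{label}: ${cost_per_million_tokens(rate, throughput):.3f}/M tokens")
```

Because cost scales inversely with throughput, any measured tok/sec can be substituted to re-derive the figure for a different workload.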
At 128K context, H200 comes in at $0.94/M versus B200 at $1.52/M. B200 costs more per token but supports double the concurrent users (16 vs 8) at 128K, which matters for high-concurrency workloads. Both figures assume spot instances, which can be reclaimed at short notice.
MLA's throughput benefit compounds at longer context. At 4K context, MLA versus GQA is a modest improvement. At 128K, MLA enables roughly 4x more concurrent users per GPU versus GQA (the memory table earlier shows ~8-9 MLA users versus ~2 GQA users on H200 141 GB), while standard MHA cannot fit even one such user on the same GPU. More concurrent users directly multiplies output tokens per hour at a fixed hourly cost.
Tuning MLA Serving: Absorbed vs Unabsorbed Compute, FP8 KV, and Long-Context Tradeoffs
Absorbed vs unabsorbed MLA
MLA's up-projections from latent vector to full K and V can be handled two ways:
- Absorbed MLA: The up-projection matrices W_UK and W_UV are fused ("absorbed") into the attention weight matrices at model load time. Decode requires fewer FLOPs per step because the projection is pre-folded. VRAM for model weights increases slightly (10-20%) versus the unabsorbed variant.
- Unabsorbed MLA: Up-projections run live on each decode step. VRAM for weights is lower, but each decode requires additional matrix multiplications for the expansion.
vLLM defaults to absorbed MLA for DeepSeek models. SGLang with FlashInfer also defaults to absorbed for DeepSeek. For memory-bandwidth-bound deployments (long context, large batch), absorbed is the right choice: fewer memory reads per decode step beats the slightly higher static weight footprint. For capacity-bound deployments where VRAM is the hard constraint (trying to fit a larger model), unabsorbed can recover 10-20% of VRAM at the cost of more FLOPs per token.
When sizing GPU instances, check whether the benchmark figures for a model assumed absorbed or unabsorbed weights. Absorbed weights for DeepSeek V3 at FP8 run approximately 10-15% larger than the raw parameter count would suggest.
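The reason absorption is possible at all is a simple algebraic identity: the attention score q · (W_UK c) equals (W_UKᵀ q) · c, so the key up-projection can be folded into the query side and the score computed directly against the cached latent. The toy check below uses small integer matrices so the comparison is exact; the values and helper names are made up for illustration. Real implementations fold the matrices into the projection weights at load time rather than per query, and the value side is handled analogously with W_UV.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy shapes: head_dim=3, d_c=2. Integers keep the arithmetic exact.
W_UK = [[1, 2], [0, 1], [3, -1]]      # key up-projection (head_dim x d_c)
q = [2, -1, 1]                        # one query vector (head_dim)
latents = [[1, 0], [2, 3], [-1, 4]]   # cached c_kv for three tokens

# Unabsorbed: expand every latent to a full key, then score against q.
unabsorbed_scores = [dot(q, matvec(W_UK, c_kv)) for c_kv in latents]

# Absorbed: project the query into latent space once, then score against latents.
q_absorbed = matvec(transpose(W_UK), q)
absorbed_scores = [dot(q_absorbed, c_kv) for c_kv in latents]

print(unabsorbed_scores, absorbed_scores)
```

The unabsorbed path does one matrix-vector product per cached token, while the absorbed path does one per query, which is where the decode-time FLOP saving comes from.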
FP8 KV cache on H200
--kv-cache-dtype fp8 halves the MLA KV footprint by storing latent vectors at 8-bit precision rather than BF16. At 128K context, per-user MLA KV cache drops from ~7.5 GB to ~3.75 GB. This is a proven production configuration on H200 with negligible quality degradation on standard benchmarks. Enable it by default for any MLA deployment on Hopper.
FP4 KV cache on B200
On Blackwell (B200/B300), --kv-cache-dtype nvfp4 cuts the FP8 KV footprint in half again, to ~1.9 GB per user at 128K context. This is hardware-accelerated on B200 using the SM100 tensor core FP4 path. The quality tradeoff is larger than FP8 and should be validated on your specific task before production deployment, particularly for reasoning-heavy or factual retrieval workloads.
Long-context tradeoffs
MLA's memory savings scale linearly with context length. At 4K context, MLA versus GQA-8 saves under 1 GB per user (~0.9 GB vs ~0.2 GB). At 128K, it saves ~22.5 GB per user (GQA-8 at ~30 GB vs MLA at ~7.5 GB). At 1M context, GQA-8 would require ~240 GB per user versus ~60 GB for MLA, so GQA exhausts multi-GPU VRAM after only a few users while MLA still leaves room for batching without NVMe offloading.
For very long contexts combined with multi-user workloads, MLA is not a nice-to-have: it is an enabling technology.
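The linear scaling is easy to tabulate. This sketch reuses the per-token figures from earlier in the post (4,096 bytes for GQA-8, 1,024 for MLA, 60 layers, BF16) and reports the per-user cache and the saving at three context lengths; the names are illustrative. Under these assumptions the saving is under 1 GB at 4K and grows to 180 GB at 1M tokens.

```python
GQA8_BYTES = 4096   # bytes per token per layer, GQA with 8 groups
MLA_BYTES = 1024    # bytes per token per layer, MLA with d_c=512
NUM_LAYERS = 60

def kv_gb(bytes_per_token_layer, num_tokens, num_layers=NUM_LAYERS):
    """Per-user KV cache in units of 2**30 bytes."""
    return bytes_per_token_layer * num_tokens * num_layers / 2**30

savings = {}
for label, tokens in [("4K", 4 * 1024), ("128K", 128 * 1024), ("1M", 1024 * 1024)]:
    gqa = kv_gb(GQA8_BYTES, tokens)
    mla = kv_gb(MLA_BYTES, tokens)
    savings[label] = gqa - mla
    print(f"{label}: GQA-8 {gqa:.2f} GB, MLA {mla:.2f} GB, saved {gqa - mla:.2f} GB")
```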
Step-by-Step Deployment on Spheron GPU Cloud
Step 1: Provision the instance
Log in to app.spheron.ai and select an H200 SXM5 (141 GB HBM3e) or B200 SXM6 (192 GB HBM3e) node. DeepSeek V3 has 671B parameters, so its FP8 weights take roughly 670-690 GB; that exceeds the 564 GB of a 4x H200 node, so use 8x H200 (1,128 GB), which also leaves headroom for KV cache at longer context. For Kimi K2 models, 8x H200 or 8x B200 is the minimum. Choose Ubuntu 22.04 with CUDA 12.4+ or an NGC PyTorch container. The instance is ready in under 2 minutes.
# Verify GPU and CUDA after SSH in
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv,noheader
nvcc --version
Step 2: Install vLLM
pip install vllm --upgrade
# For B200, ensure FlashInfer is available
pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.6/
# Verify
python -c "import vllm; print(vllm.__version__)"
Step 3: Download model weights
# DeepSeek V3 (FP8 checkpoint, roughly 690 GB)
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /models/deepseek-v3
# Or Kimi K2.6
huggingface-cli download moonshotai/Kimi-K2.6 --local-dir /models/kimi-k2-6
Step 4: Launch vLLM with MLA
DeepSeek V3 on 8x H200 SXM5:
vllm serve /models/deepseek-v3 \
--tensor-parallel-size 8 \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--port 8000
MLA is activated automatically from the model config. Check startup logs for kv_cache_dtype: fp8 to confirm FP8 KV is active.
DeepSeek V3 on 8x B200 SXM6:
vllm serve /models/deepseek-v3 \
--tensor-parallel-size 8 \
--dtype auto \
--kv-cache-dtype nvfp4 \
--max-model-len 131072 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--port 8000
Step 5: Enable FlashInfer MLA kernel in SGLang
# DeepSeek V3 with FlashInfer MLA
python -m sglang.launch_server \
--model-path /models/deepseek-v3 \
--tp 8 \
--attention-backend flashinfer \
--dtype auto \
--context-length 131072 \
--mem-fraction-static 0.90 \
--port 30000
# Kimi K2.6 with explicit MLA flag
python -m sglang.launch_server \
--model-path /models/kimi-k2-6 \
--tp 8 \
--attention-backend flashinfer \
--enable-flashinfer-mla \
--dtype auto \
--context-length 65536 \
--mem-fraction-static 0.90 \
--port 30000
Step 6: Benchmark and calculate cost-per-token
# vLLM throughput benchmark (run against the server started in Step 4)
vllm bench serve \
--model /models/deepseek-v3 \
--dataset-name random \
--random-input-len 4096 \
--random-output-len 512 \
--num-prompts 100 \
--max-concurrency 16
# SGLang benchmark
python -m sglang.bench_serving \
--backend sglang \
--model /models/deepseek-v3 \
--num-prompts 100 \
--request-rate 8
Cost calculation example (spot pricing, 8x H200):
- 8x H200 SXM5 spot: $1.77/hr × 8 = $14.16/hr
- Measured throughput at 32K context with 16 concurrent users: ~9,500 tok/sec
- Cost per 1M tokens: 14.16 / (9500 × 3600) × 1,000,000 ≈ $0.41
Run the same benchmark at 4K, 32K, and 128K context to see how MLA's concurrency advantage grows with context length. At 128K with 8 concurrent users, expect roughly 4,000-5,000 tok/sec on 8x H200 FP8, giving approximately $0.79-0.98/M tokens.
MLA cuts the KV cache to a fraction of its standard footprint, so each GPU handles more concurrent users. On Spheron's H200 SXM5 nodes, the same instance that serves 1-2 concurrent 128K-context users with GQA can serve 8-9 with MLA at the same hourly rate.
Quick Setup Guide
Log in to app.spheron.ai and select an H200 SXM5 (141 GB HBM3e) or B200 SXM6 (192 GB HBM3e) instance. For DeepSeek V3 at FP8 (roughly 670-690 GB of weights) you need 8x H200; for Kimi K2 (larger MoE models), you need 8x H200 or 8x B200 minimum. Choose Ubuntu 22.04 with CUDA 12.4+ or an NGC container image. The instance is ready in under 2 minutes with full root SSH access.
Install the latest vLLM release: pip install vllm --upgrade. vLLM 0.6+ includes native MLA support for DeepSeek-family models. No additional packages are needed for MLA itself. For B200, also install FlashInfer separately: pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.6/ to ensure the MLA kernel is available. Verify the install: python -c 'import vllm; print(vllm.__version__)'.
Download model weights using huggingface-cli: huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /models/deepseek-v3. For Kimi K2.6: huggingface-cli download moonshotai/Kimi-K2.6 --local-dir /models/kimi-k2-6. DeepSeek V3 at FP8 requires roughly 690 GB of disk space; Kimi K2 models may require more. Use NVMe-backed storage and verify the download hash.
For DeepSeek V3 on 8x H200 SXM5 with FP8 KV cache: vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --dtype auto --kv-cache-dtype fp8 --max-model-len 131072 --gpu-memory-utilization 0.92 --port 8000. The --kv-cache-dtype fp8 flag halves the already-compressed MLA KV footprint. MLA is handled automatically from the model config. For B200 with FP4 KV: substitute --kv-cache-dtype nvfp4 for an additional 50% KV reduction versus FP8.
For SGLang with FlashInfer's dedicated MLA kernel: python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --attention-backend flashinfer --dtype auto --context-length 131072 --mem-fraction-static 0.90 --port 30000. The --attention-backend flashinfer flag routes MLA through FlashInfer's optimized path. For Kimi K2 models in SGLang, add --enable-flashinfer-mla explicitly: python -m sglang.launch_server --model-path moonshotai/Kimi-K2.6 --tp 8 --attention-backend flashinfer --enable-flashinfer-mla --dtype auto --port 30000.
Run the vLLM throughput benchmark against the running server: vllm bench serve --model deepseek-ai/DeepSeek-V3 --dataset-name random --random-input-len 4096 --random-output-len 512 --num-prompts 100 --max-concurrency 16. Record tokens/sec. Cost per 1M tokens = (hourly_rate / (tokens_per_sec × 3600)) × 1,000,000. For 8x H200 at current spot pricing, compare results at 4K, 32K, and 128K context lengths; MLA's advantage grows with context length.
Frequently Asked Questions
Multi-Head Latent Attention (MLA) is an attention mechanism introduced by DeepSeek that replaces the standard per-head key-value pairs with a single compressed latent vector. Instead of caching full K and V tensors per attention head, MLA projects them through a learned down-projection into a low-dimensional latent vector (d_c, typically 512 for DeepSeek models). On each decode step, the full K and V tensors are reconstructed from this latent vector via learned up-projections. Only the latent vector is stored in the KV cache, reducing memory per token per layer from num_heads x head_dim x 2 x 2 bytes (MHA) to just d_c x 2 bytes, a ~98% reduction for typical DeepSeek configurations.
For DeepSeek-family models with 128 attention heads and head_dim=128, MLA with d_c=512 stores roughly 1,024 bytes per token per layer versus 65,536 bytes for standard MHA, a 64x reduction. Against GQA (8 groups, used by Llama 3 and Qwen), MLA stores roughly 4x fewer bytes. At 128K context with 60 layers for one user, MHA needs ~480 GB for KV cache alone, GQA needs ~30 GB, and MLA needs ~7.5 GB. That translates directly into ~4x more concurrent users per GPU versus GQA at the same context length, while standard MHA at ~480 GB cannot fit a single 128K-context user on a 141 GB H200.
The main 2026 open models using MLA are DeepSeek V2 and V3, plus the Kimi K2 series from Moonshot AI (K2.5, K2.6, K2.7, all using the K2 MLA architecture with 256K context). DeepSeek V4 switched to a different compression approach (token-wise KV compression plus DeepSeek Sparse Attention) and does not use MLA. vLLM and SGLang both support MLA-using models with native handling.
In vLLM, MLA is handled automatically for supported model families (DeepSeek, Kimi K2). No special flag is needed. Add --kv-cache-dtype fp8 on H200 for a 2x additional reduction, or --kv-cache-dtype nvfp4 on B200 for a further 50% on top of FP8. In SGLang, pass --attention-backend flashinfer to route MLA computation through FlashInfer's dedicated MLA kernel. On Hopper and Blackwell hardware, FlashInfer may already be auto-selected; passing the flag explicitly ensures it. For Kimi K2 models in SGLang, --enable-flashinfer-mla is the specific activation flag. The FlashInfer kernel handles the compressed latent format directly and avoids the expand-then-attend path.
MLA improves both memory use and throughput. The primary gain is memory: smaller KV cache means more concurrent users per GPU, which directly multiplies throughput for multi-user workloads. A secondary gain comes from reduced HBM bandwidth pressure during decode: loading a 1,024-byte latent vector per layer is 64x less HBM traffic than loading full MHA KV pairs. For long-context workloads (32K+), memory bandwidth is often the bottleneck in decode, so MLA's smaller working set reduces both memory pressure and attention compute time. The FlashInfer MLA kernel in SGLang reduces attention compute time a further 20-35% on top by handling the compressed format natively.
