The KV cache quietly becomes your VRAM bottleneck as context length grows. At 128K tokens, a 70B model's KV cache alone consumes around 40 GB, nearly double the headroom available on two H100 SXM5s after loading the model weights. TurboQuant, a new method from Google Research (arXiv 2504.19874, ICLR 2026), compresses that KV cache 6x and speeds up attention computation 8x. If you're running long-context inference or serving high-concurrency workloads, this is worth understanding now. For broader GPU memory context, see our GPU memory requirements for LLMs guide.
What Is TurboQuant
TurboQuant is a KV cache compression method, not a model weight quantization technique. This distinction matters. The model weights for a 70B model still require 140 GB of VRAM at FP16 regardless of TurboQuant. What changes is the memory footprint of the attention key-value cache generated during inference.
Key claims from the paper:
- 6x memory reduction in KV cache versus storing in BF16
- 8x speedup in attention computation versus 32-bit (FP32) unquantized keys (not overall inference throughput)
- Minimal accuracy loss on long-context benchmarks (LongBench, Needle In A Haystack, RULER, L-Eval)
- Data-oblivious: no calibration dataset required
The critical difference from weight quantization methods like AWQ or GPTQ: TurboQuant and weight quantization address different memory pools. They are complementary, not alternatives. You can apply AWQ to compress model weights, then run TurboQuant on top to compress the KV cache at inference time.
How TurboQuant Works
Standard KV cache quantization (INT8 or INT4) often introduces noticeable accuracy loss because key and value vectors have irregular distributions. Outlier activations cause large quantization errors when you force these values into low-bit integers.
TurboQuant uses a two-stage approach that sidesteps this problem without requiring calibration data.
Stage 1: PolarQuant
PolarQuant applies a random orthonormal rotation to each key and value vector before quantization. Because the rotation is orthonormal, it preserves norms and inner products (rotating queries with the same matrix leaves query-key attention scores unchanged), but it redistributes variance uniformly across coordinates. After rotation, each coordinate's distribution looks much more like a standard normal, which a simple scalar quantizer handles accurately. The optimal scalar quantizer for each coordinate is then computed analytically. No learned parameters, no calibration.
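A toy NumPy sketch of the rotate-then-quantize idea. The rotation here is a QR-orthonormalized Gaussian matrix and the scalar quantizer is plain uniform rounding; both are simplifications of the paper's construction, shown only to illustrate why rotating before quantizing tames outlier coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head dimension (illustrative)

# Random orthonormal rotation: QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def polar_quant(v, bits=4):
    # Rotate first: spreads an outlier coordinate's energy across all dims.
    r = Q @ v
    # Uniform scalar quantizer (stand-in for the analytically optimal one).
    scale = np.abs(r).max() / (2 ** (bits - 1) - 1)
    return np.round(r / scale).astype(np.int8), scale

def dequant(q, scale):
    return Q.T @ (q * scale)  # Q is orthonormal, so Q.T inverts the rotation

# Vector with one large outlier coordinate, as seen in real KV activations.
v = rng.standard_normal(d)
v[0] = 50.0
q, s = polar_quant(v)
rel_err = np.linalg.norm(dequant(q, s) - v) / np.linalg.norm(v)
```

Without the rotation, the outlier coordinate would force a large quantization scale and crush every other coordinate toward zero; with it, the 4-bit relative error stays in the low double digits of a percent.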
Stage 2: QJL Error Correction
After PolarQuant quantization, a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform is applied as an error corrector. The QJL transform compresses the residual quantization error into a 1-bit representation using a random matrix derived from the Johnson-Lindenstrauss lemma. This corrector is also data-oblivious.
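One way to picture a 1-bit JL sign sketch, in a hypothetical simplified form (the paper's estimator differs in detail; here a residual direction is recovered from sign bits of a Gaussian projection plus one stored norm):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 1024  # residual dimension, number of 1-bit measurements

S = rng.standard_normal((m, d))  # random JL projection matrix

def qjl_encode(residual):
    # Keep only the sign of each projection: m bits plus one float (the norm).
    return np.sign(S @ residual), np.linalg.norm(residual)

def qjl_decode(bits, norm):
    # For Gaussian rows s, E[sign(s . r) * s] is proportional to r / ||r||,
    # so back-projecting the signs recovers the residual's direction.
    direction = S.T @ bits
    return norm * direction / np.linalg.norm(direction)

residual = rng.standard_normal(d)  # quantization error left by stage 1
bits, norm = qjl_encode(residual)
approx = qjl_decode(bits, norm)
cosine = approx @ residual / (np.linalg.norm(approx) * np.linalg.norm(residual))
```

With enough sign measurements the reconstructed direction aligns closely with the true residual, which is why a cheap 1-bit corrector can claw back most of the stage-1 quantization error.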
Together, PolarQuant and QJL achieve the 6x KV cache compression ratio. Because the entire approach uses random (not learned) matrices, you can apply it to any model at any time without a calibration run.
For context on why memory bandwidth is often the bottleneck in the first place, see Why Your LLM Inference Is Slow.
KV Cache Memory: The Numbers
These calculations use Llama 3.1 architecture parameters. "Available VRAM" assumes 2x H100 SXM5 (160 GB total) with a 70B FP16 model using 140 GB for weights, leaving 20 GB for KV cache.
| Model | Context Length | KV Cache (BF16) | KV Cache (TurboQuant 6x) |
|---|---|---|---|
| Llama 3.1 8B | 32K | 4 GB | 0.7 GB |
| Llama 3.1 8B | 128K | 16 GB | 2.7 GB |
| Llama 3.1 70B | 32K | 10 GB | 1.7 GB |
| Llama 3.1 70B | 128K | 40 GB | 6.7 GB |
| Llama 3.1 70B | 1M | ~320 GB | ~53 GB |
The 128K row for 70B is where TurboQuant has the biggest practical impact. Without compression, a 40 GB KV cache on hardware with only 20 GB of headroom means a single 128K-context request cannot be served alongside the model weights. With TurboQuant, the 6.7 GB footprint fits comfortably, with room to serve multiple concurrent requests.
Add 10-20% headroom in production for framework buffers and activation memory beyond the KV cache figures shown above.
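The BF16 column follows the standard KV cache formula: 2 (keys and values) x layers x KV heads x head dim x bytes per element x tokens. A quick sanity check using Llama 3.1's grouped-query attention shapes from the public model configs (8 KV heads, head dim 128; 32 layers for 8B, 80 for 70B), reported in binary gigabytes to match the table's rounding:

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; BF16 is 2 bytes per element.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

size_8b_128k = kv_cache_gib(128 * 1024, layers=32, kv_heads=8, head_dim=128)
size_70b_128k = kv_cache_gib(128 * 1024, layers=80, kv_heads=8, head_dim=128)
size_70b_32k = kv_cache_gib(32 * 1024, layers=80, kv_heads=8, head_dim=128)

print(size_8b_128k)       # 8B @ 128K  -> 16 GB
print(size_70b_128k)      # 70B @ 128K -> 40 GB
print(size_70b_128k / 6)  # 70B @ 128K with 6x TurboQuant -> ~6.7 GB
```

Grouped-query attention is what keeps these numbers manageable at all: with full multi-head attention (64 KV heads for 70B), every figure would be 8x larger.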
Cost Impact: Concurrency Changes Everything
TurboQuant doesn't reduce the number of GPUs needed to load a model. But it dramatically changes how many users you can serve per GPU-hour.
Setup: 70B model at FP16 on 2x H100 SXM5 on-demand at $2.90/hr per GPU ($5.80/hr total). Available VRAM headroom for KV cache: 20 GB.
| Context per User | Without TurboQuant | With TurboQuant | Concurrency Gain |
|---|---|---|---|
| 32K tokens | ~2 concurrent users | ~11 concurrent users | 5.5x |
| 128K tokens | 0 users (doesn't fit) | 2-3 concurrent users | Now feasible |
| Cost per user-hr (32K) | ~$2.90/hr | ~$0.53/hr | 5.5x cheaper |
At 32K context, 24/7 serving on this setup: $5.80/hr x 720 hr = $4,176/month. Without TurboQuant, that serves 2 users ($2,088/user/month). With TurboQuant, it serves 11 users ($380/user/month).
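The concurrency figures above are just headroom divided by per-user cache size, floored. A sketch of that arithmetic (real schedulers reserve extra memory for activations and paged-attention block granularity, so treat these as upper bounds):

```python
headroom_gb = 20.0        # 2x H100 (160 GB) minus 140 GB of FP16 weights
cost_per_hr = 5.80        # 2x H100 SXM5 on-demand
per_user_bf16 = 10.0      # 70B KV cache at 32K context, BF16 (table above)
per_user_tq = 1.7         # same cache with 6x TurboQuant, rounded as in table

users_bf16 = int(headroom_gb // per_user_bf16)  # 2 concurrent users
users_tq = int(headroom_gb // per_user_tq)      # 11 concurrent users

cost_bf16 = cost_per_hr / users_bf16            # $2.90 per user-hour
cost_tq = round(cost_per_hr / users_tq, 2)      # $0.53 per user-hour
```

The per-user-hour cost drops 5.5x purely because the same fixed GPU bill is amortized over more concurrent requests.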
The 128K case is even more dramatic: a workload that previously required scaling to additional GPUs just to have headroom for the KV cache now fits on the base setup.
Pricing fluctuates with GPU availability. The prices above were captured on 08 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Implementation Status
As of April 2026, Google Research has not released an official Python implementation of TurboQuant. The method is described in detail in arXiv 2504.19874 (ICLR 2026). There is no public GitHub repository with an installable library. Watch the Google Research GitHub organization and the paper's arXiv page for the official release.
In the meantime, if you need KV cache compression today, vLLM supports FP8 KV cache quantization natively with the --kv-cache-dtype fp8 flag. It achieves roughly 2x compression versus BF16, not TurboQuant's 6x, but it's production-ready now:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --host 0.0.0.0 \
  --port 8000
```

For weight quantization to reduce the 140 GB model footprint itself, see our guide on FP4 quantization on Blackwell GPUs.
TurboQuant vs Weight Quantization Methods
These methods target different memory pools and are not alternatives. You can stack them.
| Method | What It Compresses | Compression | Calibration Required | Status |
|---|---|---|---|---|
| TurboQuant | KV cache | 6x | None | Research (ICLR 2026) |
| FP8 KV (vLLM) | KV cache | ~2x | None | Production-ready |
| AWQ | Model weights | 4x | ~512 samples | Production-ready |
| GPTQ | Model weights | 4x | ~512 samples | Production-ready |
| GGUF/Q4 | Model weights | 4x | None (post-convert) | Production-ready |
| FP4 (Blackwell) | Model weights | 4x | Hardware-native | B200/B300 only |
The combination that matters for long-context inference: AWQ or GPTQ for weights, TurboQuant for KV cache. A 70B model at AWQ 4-bit needs about 35 GB for weights, fitting on a single H100 SXM5. Add TurboQuant and the 128K KV cache drops from 40 GB to 6.7 GB. Total VRAM: ~42 GB on a single 80 GB H100, with room left for batching. That's a workload that previously needed multiple GPUs running comfortably on one.
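The single-GPU budget in that paragraph is worth checking explicitly (a sketch; the ~35 GB AWQ figure is the usual rule of thumb of roughly 0.5 GB per billion parameters at 4-bit):

```python
h100_gb = 80
weights_awq_gb = 35.0                  # 70B at AWQ 4-bit (rule of thumb)
kv_128k_bf16_gb = 40.0                 # from the KV cache table above
kv_128k_tq_gb = kv_128k_bf16_gb / 6    # ~6.7 GB with TurboQuant

total_gb = weights_awq_gb + kv_128k_tq_gb
spare_gb = h100_gb - total_gb          # headroom left for batching/buffers
```

Roughly 42 GB used, roughly 38 GB spare: a full 128K-context 70B deployment on one card, versus needing two H100s just for the uncompressed BF16 cache.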
One clarification: GGUF is a file format, not a compression algorithm. It packages models quantized via various INT4/INT8 schemes for use with llama.cpp and Ollama. See our local LLM guide with Ollama for how GGUF works in practice.
When to Use TurboQuant
TurboQuant makes sense when:
- You're serving long context windows (32K+ tokens) and KV cache is eating into VRAM headroom
- High concurrency is the goal and each request's KV cache is the limiting factor
- You want to avoid calibration data requirements (data-oblivious is a real operational advantage)
- You're planning ahead as the library matures toward a production release
TurboQuant won't help when:
- You can't load the model weights in the first place (that's a job for weight quantization like AWQ or GPTQ)
- Your context lengths are short (4K-8K) and KV cache isn't the bottleneck
- You need a production-ready library today (wait for the official release)
- You're running fine-tuning pipelines (TurboQuant is inference-only)
For workloads where weight memory is the primary constraint, see our DeepSeek R2 deployment guide for how quantization tradeoffs play out with large reasoning models.
Spheron GPU Recommendations
GPU selection depends on model weight requirements, which TurboQuant doesn't change. TurboQuant increases the effective context length and concurrency you can sustain on the hardware you already need.
| Model Size | FP16 Weight VRAM | Recommended GPU | On-Demand Price | Link |
|---|---|---|---|---|
| 7B-8B | ~16 GB | 1x H100 SXM5 | $2.90/hr | Rent H100 |
| 13B-14B | ~28 GB | 1x H100 SXM5 | $2.90/hr | Rent H100 |
| 70B-72B | ~140-144 GB | 2x H100 SXM5 | $5.80/hr | Rent H100 |
| 70B (AWQ 4-bit) | ~35 GB | 1x H100 SXM5 | $2.90/hr | Rent H100 |
| 405B | ~810 GB | 11x H100 SXM5 | $31.90/hr | View pricing |
For maximum throughput on large batch sizes, B200 SXM6 (spot from $2.06/hr) and B300 SXM6 (spot from $2.97/hr) outperform H100 per token despite the higher headline rate. H200 SXM5 at $4.50/hr on-demand (141 GB VRAM) is worth considering for 70B models where you want a single-GPU setup without weight quantization. See our NVIDIA GPU comparison for LLMs for the throughput-per-dollar breakdown by workload type.
KV cache is what limits how many users you can serve at long context lengths. TurboQuant compresses it 6x, enabling 5x more concurrent users or context windows that previously wouldn't fit at all on the same GPU setup.
