The KV cache quietly becomes your VRAM bottleneck as context length grows. At 128K tokens, a 70B model's KV cache alone consumes around 40 GB, nearly double the headroom available on two H100 SXM5s after loading the model weights. TurboQuant, a new method from Google Research (arXiv 2504.19874, ICLR 2026), compresses that KV cache 6x and speeds up attention computation 8x. If you're running long-context inference or serving high-concurrency workloads, this is worth understanding now. For broader GPU memory context, see our GPU memory requirements for LLMs guide.
What Is TurboQuant
TurboQuant is a KV cache compression method, not a model weight quantization technique. This distinction matters. The model weights for a 70B model still require 140 GB of VRAM at FP16 regardless of TurboQuant. What changes is the memory footprint of the attention key-value cache generated during inference.
Key claims from the paper:
- 6x memory reduction in KV cache versus storing in BF16
- 8x speedup in attention computation versus 32-bit (FP32) unquantized keys (not overall inference throughput)
- Minimal accuracy loss on long-context benchmarks (LongBench, Needle In A Haystack, RULER, L-Eval)
- Data-oblivious: no calibration dataset required
The critical difference from weight quantization methods like AWQ or GPTQ: TurboQuant and weight quantization address different memory pools. They are complementary, not alternatives. You can apply AWQ to compress model weights, then run TurboQuant on top to compress the KV cache at inference time.
How TurboQuant Works
Standard KV cache quantization (INT8 or INT4) often introduces noticeable accuracy loss because key and value vectors have irregular distributions. Outlier activations cause large quantization errors when you force these values into low-bit integers.
TurboQuant uses a two-stage approach that sidesteps this problem without requiring calibration data.
Stage 1: PolarQuant
PolarQuant applies a random orthonormal rotation to each key and value vector before quantization. Because the rotation is orthonormal, it preserves norms and inner products (rotating queries with the same matrix leaves query-key attention scores unchanged), but it redistributes variance uniformly across coordinates. After rotation, each coordinate's distribution looks much more like a standard normal, which a simple scalar quantizer handles accurately. The optimal scalar quantizer for each coordinate is then computed analytically. No learned parameters, no calibration.
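A toy NumPy sketch of the rotate-then-quantize idea. The rotation here is a QR-orthonormalized Gaussian matrix and the scalar quantizer is plain uniform rounding; both are simplifications of the paper's construction, shown only to illustrate why rotating before quantizing tames outlier coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head dimension (illustrative)

# Random orthonormal rotation: QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def polar_quant(v, bits=4):
    # Rotate first: spreads an outlier coordinate's energy across all dims.
    r = Q @ v
    # Uniform scalar quantizer (stand-in for the analytically optimal one).
    scale = np.abs(r).max() / (2 ** (bits - 1) - 1)
    return np.round(r / scale).astype(np.int8), scale

def dequant(q, scale):
    return Q.T @ (q * scale)  # Q is orthonormal, so Q.T inverts the rotation

# Vector with one large outlier coordinate, as seen in real KV activations.
v = rng.standard_normal(d)
v[0] = 50.0
q, s = polar_quant(v)
rel_err = np.linalg.norm(dequant(q, s) - v) / np.linalg.norm(v)
```

Without the rotation, the outlier coordinate would force a large quantization scale and crush every other coordinate toward zero; with it, the 4-bit relative error stays in the low double digits of a percent.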
Stage 2: QJL Error Correction
After PolarQuant quantization, a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform is applied as an error corrector. The QJL transform compresses the residual quantization error into a 1-bit representation using a random matrix derived from the Johnson-Lindenstrauss lemma. This corrector is also data-oblivious.
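One way to picture a 1-bit JL sign sketch, in a hypothetical simplified form (the paper's estimator differs in detail; here a residual direction is recovered from sign bits of a Gaussian projection plus one stored norm):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 1024  # residual dimension, number of 1-bit measurements

S = rng.standard_normal((m, d))  # random JL projection matrix

def qjl_encode(residual):
    # Keep only the sign of each projection: m bits plus one float (the norm).
    return np.sign(S @ residual), np.linalg.norm(residual)

def qjl_decode(bits, norm):
    # For Gaussian rows s, E[sign(s . r) * s] is proportional to r / ||r||,
    # so back-projecting the signs recovers the residual's direction.
    direction = S.T @ bits
    return norm * direction / np.linalg.norm(direction)

residual = rng.standard_normal(d)  # quantization error left by stage 1
bits, norm = qjl_encode(residual)
approx = qjl_decode(bits, norm)
cosine = approx @ residual / (np.linalg.norm(approx) * np.linalg.norm(residual))
```

With enough sign measurements the reconstructed direction aligns closely with the true residual, which is why a cheap 1-bit corrector can claw back most of the stage-1 quantization error.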
Together, PolarQuant and QJL achieve the 6x KV cache compression ratio. Because the entire approach uses random (not learned) matrices, you can apply it to any model at any time without a calibration run.
For context on why memory bandwidth is often the bottleneck in the first place, see Why Your LLM Inference Is Slow.
KV Cache Memory: The Numbers
These calculations use Llama 3.1 architecture parameters. "Available VRAM" assumes 2x H100 SXM5 (160 GB total) with a 70B FP16 model using 140 GB for weights, leaving 20 GB for KV cache.
| Model | Context Length | KV Cache (BF16) | KV Cache (TurboQuant 6x) |
|---|---|---|---|
| Llama 3.1 8B | 32K | 4 GB | 0.7 GB |
| Llama 3.1 8B | 128K | 16 GB | 2.7 GB |
| Llama 3.1 70B | 32K | 10 GB | 1.7 GB |
| Llama 3.1 70B | 128K | 40 GB | 6.7 GB |
| Llama 3.1 70B | 1M | ~320 GB | ~53 GB |
The 128K row for 70B is where TurboQuant has the biggest practical impact. Without compression, a 40 GB KV cache on hardware with only 20 GB of headroom means a single 128K-context request cannot be served alongside the model weights. With TurboQuant, the 6.7 GB footprint fits comfortably, with room to serve multiple concurrent requests.
Add 10-20% headroom in production for framework buffers and activation memory beyond the KV cache figures shown above.
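The BF16 column follows the standard KV cache formula: 2 (keys and values) x layers x KV heads x head dim x bytes per element x tokens. A quick sanity check using Llama 3.1's grouped-query attention shapes from the public model configs (8 KV heads, head dim 128; 32 layers for 8B, 80 for 70B), reported in binary gigabytes to match the table's rounding:

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; BF16 is 2 bytes per element.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

size_8b_128k = kv_cache_gib(128 * 1024, layers=32, kv_heads=8, head_dim=128)
size_70b_128k = kv_cache_gib(128 * 1024, layers=80, kv_heads=8, head_dim=128)
size_70b_32k = kv_cache_gib(32 * 1024, layers=80, kv_heads=8, head_dim=128)

print(size_8b_128k)       # 8B @ 128K  -> 16 GB
print(size_70b_128k)      # 70B @ 128K -> 40 GB
print(size_70b_128k / 6)  # 70B @ 128K with 6x TurboQuant -> ~6.7 GB
```

Grouped-query attention is what keeps these numbers manageable at all: with full multi-head attention (64 KV heads for 70B), every figure would be 8x larger.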
Cost Impact: Concurrency Changes Everything
TurboQuant doesn't reduce the number of GPUs needed to load a model. But it dramatically changes how many users you can serve per GPU-hour.
Setup: 70B model at FP16 on 2x H100 SXM5 on-demand at $2.90/hr per GPU ($5.80/hr total). Available VRAM headroom for KV cache: 20 GB.
| Context per User | Without TurboQuant | With TurboQuant | Concurrency Gain |
|---|---|---|---|
| 32K tokens | ~2 concurrent users | ~11 concurrent users | 5.5x |
| 128K tokens | 0 users (doesn't fit) | 2-3 concurrent users | Now feasible |
| Cost per user-hr (32K) | ~$2.90/hr | ~$0.53/hr | 5.5x cheaper |
At 32K context, 24/7 serving on this setup: $5.80/hr x 720 hr = $4,176/month. Without TurboQuant, that serves 2 users ($2,088/user/month). With TurboQuant, it serves 11 users ($380/user/month).
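The concurrency figures above are just headroom divided by per-user cache size, floored. A sketch of that arithmetic (real schedulers reserve extra memory for activations and paged-attention block granularity, so treat these as upper bounds):

```python
headroom_gb = 20.0        # 2x H100 (160 GB) minus 140 GB of FP16 weights
cost_per_hr = 5.80        # 2x H100 SXM5 on-demand
per_user_bf16 = 10.0      # 70B KV cache at 32K context, BF16 (table above)
per_user_tq = 1.7         # same cache with 6x TurboQuant, rounded as in table

users_bf16 = int(headroom_gb // per_user_bf16)  # 2 concurrent users
users_tq = int(headroom_gb // per_user_tq)      # 11 concurrent users

cost_bf16 = cost_per_hr / users_bf16            # $2.90 per user-hour
cost_tq = round(cost_per_hr / users_tq, 2)      # $0.53 per user-hour
```

The per-user-hour cost drops 5.5x purely because the same fixed GPU bill is amortized over more concurrent requests.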
The 128K case is even more dramatic: a workload that previously required scaling to additional GPUs just to have headroom for the KV cache now fits on the base setup.
Pricing fluctuates with GPU availability. The prices above were captured on 08 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Implementation Status
As of April 2026, Google Research has not released an official Python implementation of TurboQuant. The method is described in detail in arXiv 2504.19874 (ICLR 2026). There is no public GitHub repository with an installable library. Watch the Google Research GitHub organization and the paper's arXiv page for the official release.
In the meantime, if you need KV cache compression today, vLLM supports FP8 KV cache quantization natively with the --kv-cache-dtype fp8 flag. It achieves roughly 2x compression versus BF16, not TurboQuant's 6x, but it's production-ready now:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --host 0.0.0.0 \
  --port 8000
```

For weight quantization to reduce the 140 GB model footprint itself, see our guide on FP4 quantization on Blackwell GPUs.
TurboQuant vs Weight Quantization Methods
These methods target different memory pools and are not alternatives. You can stack them.
| Method | What It Compresses | Compression | Calibration Required | Status |
|---|---|---|---|---|
| TurboQuant | KV cache | 6x | None | Research (ICLR 2026) |
| FP8 KV (vLLM) | KV cache | ~2x | None | Production-ready |
| AWQ | Model weights | 4x | ~512 samples | Production-ready |
| GPTQ | Model weights | 4x | ~512 samples | Production-ready |
| GGUF/Q4 | Model weights | 4x | None (post-convert) | Production-ready |
| FP4 (Blackwell) | Model weights | 4x | Hardware-native | B200/B300 only |
The combination that matters for long-context inference: AWQ or GPTQ for weights, TurboQuant for KV cache. A 70B model at AWQ 4-bit needs about 35 GB for weights, fitting on a single H100 SXM5. Add TurboQuant and the 128K KV cache drops from 40 GB to 6.7 GB. Total VRAM: ~42 GB on a single 80 GB H100, with room left for batching. That's a workload that previously needed multiple GPUs running comfortably on one.
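The single-GPU budget in that paragraph is worth checking explicitly (a sketch; the ~35 GB AWQ figure is the usual rule of thumb of roughly 0.5 GB per billion parameters at 4-bit):

```python
h100_gb = 80
weights_awq_gb = 35.0                  # 70B at AWQ 4-bit (rule of thumb)
kv_128k_bf16_gb = 40.0                 # from the KV cache table above
kv_128k_tq_gb = kv_128k_bf16_gb / 6    # ~6.7 GB with TurboQuant

total_gb = weights_awq_gb + kv_128k_tq_gb
spare_gb = h100_gb - total_gb          # headroom left for batching/buffers
```

Roughly 42 GB used, roughly 38 GB spare: a full 128K-context 70B deployment on one card, versus needing two H100s just for the uncompressed BF16 cache.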
One clarification: GGUF is a file format, not a compression algorithm. It packages models quantized via various INT4/INT8 schemes for use with llama.cpp and Ollama. See our local LLM guide with Ollama for how GGUF works in practice.
When to Use TurboQuant
TurboQuant makes sense when:
- You're serving long context windows (32K+ tokens) and KV cache is eating into VRAM headroom
- High concurrency is the goal and each request's KV cache is the limiting factor
- You want to avoid calibration data requirements (data-oblivious is a real operational advantage)
- You're planning ahead as the library matures toward a production release
TurboQuant won't help when:
- You can't load the model weights in the first place (that's a job for weight quantization like AWQ or GPTQ)
- Your context lengths are short (4K-8K) and KV cache isn't the bottleneck
- You need a production-ready library today (wait for the official release)
- You're running fine-tuning pipelines (TurboQuant is inference-only)
For workloads where weight memory is the primary constraint, see our DeepSeek R2 deployment guide for how quantization tradeoffs play out with large reasoning models.
Spheron GPU Recommendations
GPU selection depends on model weight requirements, which TurboQuant doesn't change. TurboQuant increases the effective context length and concurrency you can sustain on the hardware you already need.
| Model Size | FP16 Weight VRAM | Recommended GPU | On-Demand Price | Link |
|---|---|---|---|---|
| 7B-8B | ~16 GB | 1x H100 SXM5 | $2.90/hr | Rent H100 |
| 13B-14B | ~28 GB | 1x H100 SXM5 | $2.90/hr | Rent H100 |
| 70B-72B | ~140-144 GB | 2x H100 SXM5 | $5.80/hr | Rent H100 |
| 70B (AWQ 4-bit) | ~35 GB | 1x H100 SXM5 | $2.90/hr | Rent H100 |
| 405B | ~810 GB | 11x H100 SXM5 | $31.90/hr | View pricing |
For maximum throughput on large batch sizes, B200 SXM6 (spot from $2.06/hr) and B300 SXM6 (spot from $2.97/hr) outperform H100 per token despite the higher headline rate. H200 SXM5 at $4.50/hr on-demand (141 GB VRAM) is worth considering for 70B models where you want a single-GPU setup without weight quantization. See our NVIDIA GPU comparison for LLMs for the throughput-per-dollar breakdown by workload type.
KV cache is what limits how many users you can serve at long context lengths. TurboQuant compresses it 6x, enabling 5x more concurrent users or context windows that previously wouldn't fit at all on the same GPU setup.
