If you're paying for a full-precision 70B model on an H100 when an A100 running AWQ INT4 will do, you're overpaying by roughly 2x. AWQ (Activation-Aware Weight Quantization) cuts model VRAM requirements by about 50% with 1-3% quality degradation on most tasks. That's often the difference between needing two GPUs and needing one. For a full breakdown of how VRAM requirements scale with model size and precision, see GPU memory requirements for LLMs.
As of 2026, AWQ is the default INT4 format for production inference. Major model families ship pre-quantized AWQ checkpoints on Hugging Face. vLLM, SGLang, and TensorRT-LLM all include optimized AWQ kernels. If you're serving on H100, A100, L40S, or RTX 4090 hardware and haven't evaluated AWQ, this guide walks through the full process: how it works, how to quantize a 70B model, expected memory and throughput gains, and production deployment on cloud GPUs.
Why AWQ Became the Default INT4 Method
INT4 quantization has existed for years. The early methods (naive INT4, round-to-nearest) worked on small models but produced significant quality loss on 7B+ parameter models. GPTQ (2022) improved this by using layer-wise second-order weight optimization, which yielded better INT4 quality than naive quantization. But it still left quality on the table.
AWQ (2023, from MIT Han Lab) solved the remaining quality problem with a different insight: not all weights are equally important. A small fraction of weights, maybe 1%, have outsized influence on model outputs. If you can identify and protect those weights during quantization, the rest can be aggressively compressed with minimal quality impact.
The result: AWQ INT4 models consistently outperform GPTQ at the same bit-width on quality benchmarks. By 2024, major model releases started shipping AWQ checkpoints alongside their standard weights. By 2026, it's the expected format.
The practical outcome for you: AWQ weights are already available on Hugging Face for most major models. You may never need to run quantization yourself. But understanding the process matters when pre-quantized weights don't exist for your model, or when you need to optimize for a specific domain. For framework selection context, see Ollama vs vLLM: Which Should You Use?.
How AWQ Works: Activation-Aware Weight Quantization
The core insight behind AWQ: activation magnitudes tell you which weights matter. During a forward pass, certain weight channels consistently produce high-magnitude activations. These are the salient weights. Perturbing them introduces large errors into downstream computations. The non-salient weights have small activation response; you can quantize them aggressively without much impact.
The quantization process has two steps:
Step 1: Calibration pass. Run a small dataset (128-512 samples) through the model and record activation magnitudes for each weight channel. This identifies which channels are salient.
Step 2: Per-channel scaling before quantization. Scale the salient weight channels up (by a learned factor) and scale the corresponding activations down by the same factor. The net effect on model outputs is neutral, but now when you quantize to INT4, the salient weights get more precision (because they're spread over a wider range), while non-salient weights are compressed normally.
This is different from GPTQ, which uses a layer-wise second-order (Hessian) approximation to minimize quantization error across all weights. GPTQ is more computationally expensive and doesn't specifically target the most important weights. AWQ's calibration-based approach is both faster and produces better quality.
It's also different from bitsandbytes NF4 (used in QLoRA fine-tuning), which applies a different 4-bit representation without a calibration step. NF4 is good for fine-tuning but AWQ is better for inference quality.
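The scale-then-compensate trick behind AWQ can be sketched in a few lines of NumPy. This is a toy illustration, not the real algorithm: real AWQ searches for the per-channel scale on calibration data and quantizes in groups of 128 columns, whereas here the scale is hand-picked and groups are 64 columns.

```python
import numpy as np

def quant_group_int4(w, group=64):
    # Round-to-nearest asymmetric INT4, quantized per group of columns
    # (a toy stand-in for AWQ's q_group_size=128 scheme). Returns the
    # dequantized values so error against the original is easy to measure.
    out = np.empty_like(w)
    for g in range(0, w.shape[1], group):
        blk = w[:, g:g + group]
        lo = blk.min(axis=1, keepdims=True)
        step = (blk.max(axis=1, keepdims=True) - lo) / 15.0
        out[:, g:g + group] = np.round((blk - lo) / step) * step + lo
    return out

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 256))  # 64 output channels
x = rng.normal(0.0, 1.0, size=256)
x[:8] *= 50.0        # 8 "salient" input channels: huge activations...
w[:, :8] *= 0.1      # ...carried by small weights, so rounding noise hurts

ref = w @ x
err_plain = np.abs(quant_group_int4(w) @ x - ref).mean()

# AWQ-style: scale salient weight columns up before quantizing and fold the
# inverse scale into the activations, which leaves the exact math unchanged.
s = np.ones(256)
s[:8] = 4.0
assert np.allclose((w * s) @ (x / s), ref)  # scaling alone is output-neutral
err_awq = np.abs((quant_group_int4(w * s) / s) @ x - ref).mean()
print(err_plain, err_awq)  # the scaled version loses noticeably less
```

The per-channel scales in real AWQ are chosen by a small search that minimizes output error on calibration activations; here they are hand-set purely to show the mechanism.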
Here's how the formats compare on memory and quality:
| Format | Bits | Bytes/param | 70B model size | Quality vs BF16 |
|---|---|---|---|---|
| BF16 | 16 | 2 bytes | ~140 GB | Reference |
| FP8 | 8 | 1 byte | ~70 GB | Negligible difference |
| GPTQ INT4 | 4 | 0.5 bytes | ~35 GB | 2-5% degradation |
| AWQ INT4 | 4 | 0.5 bytes | ~35 GB | 1-3% degradation |
| FP4 (NVFP4) | 4 | 0.5 bytes | ~35 GB | Blackwell only; see FP4 guide |
Both GPTQ and AWQ get you to ~35 GB for a 70B model. The difference is ~1-2% better benchmark scores with AWQ, and better throughput because AWQ's weight layout is more amenable to fast INT4 matrix multiplication kernels in vLLM and SGLang.
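The size column in the table above is just parameters times bytes per parameter. A quick sketch (it ignores the few percent of overhead that group scales and zero points add on top of INT4 weights):

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    # Billions of params x bytes/param = GB of weight memory.
    # Ignores group scales/zero points (a few % extra for INT4 formats).
    return params_billions * bytes_per_param

for fmt, bpp in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"70B @ {fmt}: ~{weight_gb(70, bpp):.0f} GB")  # 140 / 70 / 35
```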
AWQ vs GPTQ vs GGUF vs FP4: When to Use Each
| Method | Format | GPU support | Framework | Best for |
|---|---|---|---|---|
| AWQ | INT4 | All CUDA GPUs | vLLM, SGLang, TRT-LLM | Production GPU server inference |
| GPTQ | INT4 | All CUDA GPUs | vLLM, AutoGPTQ | When AWQ weights unavailable |
| GGUF | Variable | CPU + NVIDIA + Apple | llama.cpp, Ollama | Local/edge/CPU inference |
| FP8 | 8-bit float | H100, H200, Blackwell | vLLM, SGLang | Server inference, near-lossless |
| FP4 | 4-bit float | Blackwell only | TRT-LLM, vLLM | Maximum throughput on B200/B300 |
The choice is mostly determined by where you're running:
- Running on H100, A100, L40S, or RTX 4090: AWQ is your INT4 option. GPTQ works too but AWQ is better when available.
- Running on consumer hardware or CPU: GGUF via llama.cpp handles mixed CPU/GPU inference and is the right format for Ollama.
- Running on Blackwell (B200, RTX 5090): FP8 first (near-lossless, good framework support). FP4 if you need maximum throughput and can validate quality. See the FP4 quantization guide for Blackwell-specific setup.
- Framework comparison: For a throughput breakdown across vLLM, TensorRT-LLM, and SGLang, see inference framework benchmarks.
Step-by-Step: Quantizing a 70B Model with AWQ
Most of the time you won't need to run quantization yourself. TheBloke and the official model authors publish AWQ checkpoints for nearly every major model on Hugging Face. Search for -AWQ suffix variants. If you find a pre-quantized checkpoint, skip to the deployment section.
If you need to quantize your own model, here's the full process using AutoAWQ.
Hardware requirements:
- 7B-13B models: RTX 4090 (24 GB VRAM) is sufficient
- 70B models: A100 80G or H100 required (model needs to fit in VRAM during quantization)
- Estimated time: 30-90 minutes for a 70B model on a single A100 80G
Installation:
```shell
pip install autoawq autoawq-kernels
```

Note: AutoAWQ (`casper-hansen/AutoAWQ`) was archived in May 2025 and is no longer maintained; `vllm-project/llm-compressor` is the recommended replacement for ongoing support. The code below still works with the last released version.
Quantization script (tested with AutoAWQ 0.2.x; check the AutoAWQ GitHub for the last released API):
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.3-70B-Instruct"
quant_path = "Llama-3.3-70B-Instruct-AWQ"

quant_config = {
    "zero_point": True,   # asymmetric quantization with zero points
    "q_group_size": 128,  # standard group size; must match at load time
    "w_bit": 4,           # INT4 weights
    "version": "GEMM"     # GEMM kernels, suited to batched server inference
}

# Load the full-precision model (it must fit in VRAM), run the calibration
# pass, and write the quantized checkpoint alongside the tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Llama 3.x models do not require custom tokenizer code, so `trust_remote_code=True` is omitted here. Only pass `trust_remote_code=True` when a model's Hugging Face repository explicitly requires it (stated in the model card): the flag allows the model repository to execute arbitrary Python code on your machine during loading, which carries real risk with third-party or unvetted model repos.
`q_group_size=128` is the standard setting and what most published AWQ weights use. Mismatching this value when loading a pre-quantized checkpoint will produce garbage output, so always check the model card for the config used during quantization.
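A cheap way to guard against that mismatch is to check the checkpoint's saved quantization config before serving. The key names below follow AutoAWQ-style `config.json` contents; treat them as an assumption and verify against your actual checkpoint.

```python
import json

def check_group_size(config: dict, expected: int = 128) -> None:
    # Fail fast on a group-size mismatch instead of debugging garbage output.
    # Key names are assumed from AutoAWQ-style configs; verify for your repo.
    qc = config.get("quantization_config", {})
    found = qc.get("q_group_size", qc.get("group_size"))
    if found != expected:
        raise ValueError(
            f"group size mismatch: checkpoint has {found}, expected {expected}"
        )

# Illustrative config.json fragment in the shape AutoAWQ writes:
cfg = json.loads('{"quantization_config": {"w_bit": 4, "q_group_size": 128}}')
check_group_size(cfg)  # silent for the standard 128 setting
```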
After quantization completes, you'll have a directory with AWQ weight files that can be loaded directly by vLLM, SGLang, or TensorRT-LLM.
GPU Memory Savings: Before and After AWQ
These figures include KV cache overhead at moderate batch sizes (batch 8-16). Actual VRAM usage varies with sequence length and concurrency.
| Model | BF16 VRAM | AWQ INT4 VRAM | Savings | Fits single GPU |
|---|---|---|---|---|
| 7B | ~14 GB | ~4.5 GB | 68% | RTX 4090 (24 GB) |
| 13B | ~26 GB | ~8.5 GB | 67% | RTX 4090 (24 GB) |
| 34B | ~68 GB | ~21 GB | 69% | L40S (48 GB) |
| 70B | ~140 GB | ~40 GB | 71% | A100 80G (80 GB) |
| 405B | ~810 GB | ~243 GB | 70% | 4x A100 80G |
The 70B row is where AWQ has the most practical impact: you go from needing two A100s (or one H100, tightly packed) at BF16, to fitting comfortably on a single A100 80G with room for KV cache. For full VRAM budget methodology including KV cache sizing at different sequence lengths, see the GPU memory requirements for LLMs guide.
Inference Speed: Throughput and Latency with AWQ
AWQ improves throughput by reducing memory bandwidth pressure. When inference is memory-bandwidth-bound (small batch sizes, large models), moving from BF16 to INT4 means the GPU can load model weights 4x faster from HBM. This translates directly to more tokens per second.
The gains are largest when you're memory-bound. If you're already compute-saturated (high batch sizes, small models), the throughput improvement is smaller.
| GPU | Model | Precision | Tokens/sec | Relative |
|---|---|---|---|---|
| A100 80G | Llama 3.1 70B | BF16 (2x A100) | ~1,200 | 1x |
| A100 80G | Llama 3.1 70B | AWQ INT4 | ~1,800 | ~1.5x |
| RTX 4090 | Llama 3.1 8B | BF16 | ~1,500 | 1x |
| RTX 4090 | Llama 3.1 8B | AWQ INT4 | ~2,400 | ~1.6x |
Estimated values: Throughput figures for A100 80G and RTX 4090 are community-sourced estimates based on vLLM benchmark reports and Hugging Face model card benchmarks. No official MLPerf single-GPU AWQ result exists for these configurations as of April 2026. Actual throughput depends on batch size, sequence length, vLLM version, and driver configuration. Run your own benchmarks before production capacity planning.
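One way to bound expectations before benchmarking: in the memory-bound regime, a single sequence cannot decode tokens faster than the GPU can stream all weight bytes once per token. A back-of-envelope ceiling, assuming the A100 80G SXM's ~2,039 GB/s HBM bandwidth (batching raises aggregate throughput well past this per-sequence cap):

```python
def decode_ceiling_tok_s(weight_gb: float, hbm_gb_per_s: float) -> float:
    # Per-sequence decode ceiling: every generated token must read
    # every weight byte from HBM once.
    return hbm_gb_per_s / weight_gb

A100_BW = 2039  # GB/s, A100 80G SXM HBM2e bandwidth
print(decode_ceiling_tok_s(140, A100_BW))  # 70B BF16: ~14.6 tok/s
print(decode_ceiling_tok_s(35, A100_BW))   # 70B AWQ INT4: ~58 tok/s
```

The 4x gap between those ceilings is why the measured aggregate gains (~1.5-1.6x) still leave headroom: batching, KV-cache reads, and dequantization overhead all eat into it.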
The throughput gain from AWQ is on top of the GPU consolidation benefit. Going from 2x A100 BF16 to 1x A100 AWQ reduces your hourly cost by 50% while also improving tokens/sec. That's the compound effect that makes AWQ valuable.
Quality Retention: Perplexity and Benchmark Accuracy
AWQ's calibration step preserves more quality than GPTQ at the same bit-width. The gaps are small but measurable on standard benchmarks:
| Model | Benchmark | BF16 | AWQ INT4 | Delta |
|---|---|---|---|---|
| Llama 3.3 70B | MMLU | ~85% | ~83% | -2% |
| Llama 3.3 70B | HumanEval | ~72% | ~70% | -3% |
| Mistral 7B | Perplexity (WikiText-2) | ~5.25 | ~5.40 | +2.9% |
A few caveats on reading these numbers: perplexity is a proxy metric. It correlates with quality but doesn't capture task-specific behavior well. For production decisions, always benchmark on your actual use case. Instruction following, summarization, and RAG tasks are much less sensitive to AWQ quality loss than multi-step reasoning or math. For reasoning model-specific quantization tradeoffs, see the reasoning model inference cost guide.
The practical rule: if your task involves complex reasoning chains, validate AWQ quality explicitly. For everything else, AWQ is likely fine.
Deploying AWQ Models with vLLM on GPU Cloud
vLLM handles AWQ natively. The `--quantization awq` flag tells vLLM to load AWQ weights and use the optimized INT4 kernel path. No other changes are needed.
Using a pre-quantized Hugging Face model:
```shell
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model TheBloke/Llama-3.3-70B-Instruct-AWQ \
  --quantization awq \
  --dtype auto \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192
```

Verify the endpoint is working:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Llama-3.3-70B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'
```

Check the model card for your chosen checkpoint before using it in production. TheBloke (AWQ) and bartowski (GGUF) are community-maintained namespaces on Hugging Face, and specific model slugs may change. If the model name has changed, search `{model-name}-AWQ` on Hugging Face for the current checkpoint.
For the full production setup including multi-GPU deployment, FP8, load balancing, and monitoring, see the vLLM production deployment guide.
Deploying AWQ with SGLang and TensorRT-LLM
SGLang supports AWQ directly:
```shell
python -m sglang.launch_server \
  --model-path TheBloke/Llama-3.3-70B-Instruct-AWQ \
  --port 30000
```

Note: SGLang auto-detects AWQ quantization from the model's Hugging Face config, so `--quantization awq` is not needed for pre-quantized checkpoints; the flag is optional.
TensorRT-LLM supports AWQ via the ModelOpt toolkit. The build steps are more complex and version-dependent, so follow the NVIDIA TensorRT-LLM AWQ documentation directly rather than a specific command sequence here. TRT-LLM typically gives higher peak throughput than vLLM or SGLang for production deployment on NVIDIA hardware, at the cost of a more involved build pipeline.
For the full SGLang deployment walkthrough, see the SGLang production deployment guide. For throughput and latency comparisons across all three frameworks, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.
Cost Analysis: AWQ vs Full Precision on Spheron
The cost formula:
Cost per 1M tokens = ($/hr) ÷ (tokens/sec × 3,600) × 1,000,000

Using live Spheron pricing and estimated throughput figures:
| Config | GPU | $/hr | Tokens/sec | Cost/1M tokens |
|---|---|---|---|---|
| 70B BF16, 2x GPU | 2x A100 80G SXM4 | $3.28 | ~1,200 | ~$0.757 |
| 70B AWQ INT4, 1x GPU | 1x A100 80G SXM4 | $1.64 | ~1,800 | ~$0.253 |
| 70B BF16 | H100 SXM5 | $2.98 | ~1,600 | ~$0.517 |
| 70B AWQ INT4 | H100 SXM5 | $2.98 | ~2,400 | ~$0.345 |
| 7B AWQ INT4 | A100 80G PCIe | $1.04 | ~3,800 | ~$0.076 |
The 70B AWQ on a single A100 row is the clearest case: ~$0.253/M tokens vs ~$0.757/M for 2x A100 BF16, a 3x cost reduction. The throughput improvement compounds the GPU-count reduction. For broader GPU cost strategies beyond quantization, see the GPU cost optimization playbook.
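The formula applied to the A100 rows of the table (the throughput inputs are the same estimates as above, so treat the outputs as estimates too):

```python
def cost_per_m_tokens(usd_per_hr: float, tokens_per_sec: float) -> float:
    # Cost per 1M tokens = ($/hr) / (tokens/sec x 3,600) x 1,000,000
    return usd_per_hr / (tokens_per_sec * 3600) * 1_000_000

print(round(cost_per_m_tokens(3.28, 1200), 3))  # 2x A100 BF16: ~0.759
print(round(cost_per_m_tokens(1.64, 1800), 3))  # 1x A100 AWQ:  ~0.253
```

The BF16 figure lands at ~$0.759, matching the table's ~$0.757 to within rounding of the inputs.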
Pricing fluctuates based on GPU availability. The prices above are based on 07 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Rent the GPU you need: A100 GPUs → | L40S → | RTX 4090 →
Choosing the Right GPU for Quantized Inference
| Use case | Recommended GPU | Reason |
|---|---|---|
| 70B model production serving | A100 80G | Fits full quantized 70B in single GPU; high memory bandwidth; cost-effective |
| 34B model serving | L40S 48G | 48 GB fits quantized 34B; good token throughput; lower cost than A100 |
| 7B-13B models, high throughput | RTX 4090 | 24 GB fits quantized 13B; highest tokens/$ for small models |
| 7B fine-tune then serve | RTX 4090 | QLoRA fine-tuning + AWQ inference on same GPU; lowest total cost |
| 405B model serving | 4x A100 80G | Tensor parallelism across 4 GPUs; AWQ keeps per-GPU load manageable |
For the broader GPU selection guide covering FP8, FP4, and full-precision workloads, see Best GPU for AI Inference 2026. For a quick-reference cheat sheet, see the GPU requirements cheat sheet.
Common Pitfalls and How to Avoid Them
- Using dynamic quantization instead of AWQ. Loading a model and running round-to-nearest INT4 without calibration produces worse quality than AWQ, with no memory benefit over loading a pre-quantized checkpoint. Use AutoAWQ or get pre-calibrated weights from Hugging Face.
- Forgetting `--quantization awq` in vLLM. If you omit this flag, vLLM loads the model as BF16. Your memory savings disappear and throughput drops back to baseline. The flag is required every time you launch the server.
- Skipping KV cache in your VRAM budget. AWQ dramatically reduces weight memory, but the KV cache is separate and scales with batch size and sequence length. At production batch sizes, KV cache can add 10-30 GB on top of model weights. See the KV cache optimization guide for sizing formulas and FP8 KV cache compression options.
- Using AWQ for reasoning tasks without quality validation. Chain-of-thought reasoning and multi-step math are more sensitive to INT4 quantization noise than instruction following or summarization. If your use case is reasoning-heavy, benchmark your specific task before deploying AWQ.
- Mismatching `q_group_size` settings. The standard is 128, and pre-quantized weights from Hugging Face are almost always quantized with `q_group_size=128`. If you quantize yourself with a different value and then load with the wrong setting, the output is garbage. Always match the quantization config to what the weights expect.
- Using a general-purpose calibration dataset for a specialized domain. AWQ calibrates salient weights using activation patterns from the calibration data. If you're deploying for code generation, calibrate with code samples. If you're deploying for medical text, use medical text. Domain mismatch in calibration data leads to worse quality on the target distribution.
- Over-estimating throughput gains on compute-bound workloads. AWQ's throughput benefits are largest when inference is memory-bandwidth-bound: large models at small batch sizes. If you're running 7B models at batch 256, you're likely compute-bound and the throughput improvement from AWQ will be smaller than the tables above suggest.
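The KV-cache budgeting point above can be made concrete. A sketch assuming Llama 3.x 70B shapes (80 layers, 8 KV heads via grouped-query attention, head dim 128; check your model's config) and an FP16 cache:

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int = 80,
                n_kv_heads: int = 8, head_dim: int = 128,
                bytes_per_elem: int = 2) -> float:
    # K and V each store n_layers x n_kv_heads x head_dim values per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token / 1e9

print(kv_cache_gb(8, 4096))   # ~10.7 GB on top of the ~35 GB of weights
print(kv_cache_gb(16, 4096))  # ~21.5 GB
```

An FP8 KV cache halves these numbers; see the KV cache optimization guide for the compression options.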
AWQ INT4 quantization is the most practical way to fit larger models on mid-tier GPUs without a meaningful quality tradeoff. A100, L40S, and RTX 4090 instances are available on Spheron. Run the cost math against your current setup.
