Running a 70B model at FP16 on GPU cloud requires 140 GB of VRAM - typically 2 A100 80GB instances costing $2.08/hr or more on-demand. Quantize the same model to Q4_K_M with GGUF and it fits on a single A100 at $1.04/hr on-demand. That single decision cuts your inference bill roughly in half without rewriting your serving code. For a broader view of GPU memory planning, see the LLM VRAM requirements guide and the AI inference cost economics breakdown.
GGUF in 2026: The Universal LLM Distribution Format
GGUF (GPT-Generated Unified Format) replaced GGML in August 2023 and has since become the standard for portable LLM distribution. The key design choice: everything the model needs is bundled in a single file. Weights, tokenizer vocabulary, tokenizer config, and model hyperparameters are all packed into one .gguf file. No separate tokenizer JSON, no config files to track alongside the weights.
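As a quick illustration of the single-file layout, the fixed-size GGUF header (magic bytes, format version, tensor count, metadata key-value count) can be read with a few lines of Python. This is a sketch against the published GGUF spec; the synthetic byte string stands in for a real model file:

```python
import struct

def read_gguf_header(buf: bytes) -> dict:
    """Parse the fixed GGUF header: magic, version, tensor count,
    and metadata key-value count (all little-endian, per the GGUF spec)."""
    if buf[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, = struct.unpack_from("<I", buf, 4)
    n_tensors, = struct.unpack_from("<Q", buf, 8)
    n_kv, = struct.unpack_from("<Q", buf, 16)
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header standing in for the first 20 bytes of a real .gguf file
# (version 3, 723 tensors, 46 metadata entries are made-up demo values)
demo = b"GGUF" + struct.pack("<IQQ", 3, 723, 46)
header = read_gguf_header(demo)
```

Everything after this header lives in the same file: the metadata key-value pairs that carry the tokenizer and hyperparameters, then the tensor data. That is why no sidecar config files are needed.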
This matters for cloud deployment in a practical way. You download one file, pass it to llama.cpp server, and get an OpenAI-compatible HTTP API back. No Python serving stack. No separate tokenizer loading code. No framework version dependencies to manage. The binary runs on CPU, Apple Silicon M-series chips, and CUDA GPUs.
Every major local LLM tool uses GGUF: Ollama, LM Studio, Jan, Open WebUI, GPT4All, and llama.cpp itself. Pre-quantized GGUF files for most popular models are available on Hugging Face from community quantizers like Bartowski and the official Unsloth namespace. For cloud use, the combination of small file sizes (a 70B Q4_K_M is roughly 42 GB vs 140 GB for FP16) and zero-dependency serving makes GGUF the lowest-friction path to single-GPU inference at scale.
Quantization Methods Compared: GGUF K-Quants, AWQ, GPTQ, and ExLlamaV2
Not all quantization formats are interchangeable. The right choice depends on your serving framework and hardware.
| Method | Format | GPU Required | Framework | Typical Use |
|---|---|---|---|---|
| GGUF K-Quants | .gguf | Optional (CPU/GPU) | llama.cpp, Ollama | Local and cloud inference |
| AWQ | .safetensors | CUDA GPU | vLLM, TGI | GPU-optimized cloud serving |
| GPTQ | .safetensors | CUDA GPU | vLLM, AutoGPTQ | Legacy GPU quantization |
| ExLlamaV2 | .safetensors | CUDA GPU | TabbyAPI, text-gen-webui | High-throughput alternative |
GGUF K-quants use a mixed-precision block quantization scheme. In Q4_K_M for example, most blocks use 4-bit weights but some attention and embedding layers retain higher precision. This is different from naive INT4 where every weight gets the same treatment. The result is better perplexity at the same average bit-width.
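A toy absmax block quantizer shows the basic mechanics: a shared bit-width within each block plus one scale per block. This is illustrative only; the real K-quant kernels use a more elaborate nested-scale layout:

```python
import numpy as np

def quantize_blocks(weights, block_size=32, bits=4):
    """Quantize a flat float array in blocks: one absmax-derived scale
    per block, signed integers for the weights."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit signed
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                       # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, s = quantize_blocks(w)
w_hat = dequantize_blocks(q, s)
mean_err = float(np.abs(w - w_hat).mean())
```

Storage drops from 32 bits per weight to 4 bits plus one shared scale per 32-weight block. The "M" variants of the K-quants additionally keep select attention and embedding tensors at higher precision, which is the mixed-precision behavior described above.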
AWQ (Activation-aware Weight Quantization) is the better choice when running vLLM at high throughput - it's GPU-native and integrates cleanly with vLLM's continuous batching. See the vLLM production deployment guide for FP8 and multi-GPU vLLM configurations. For format decision context and a direct comparison of Ollama (GGUF) vs vLLM (safetensors), see Ollama vs vLLM.
Unsloth Dynamic 2.0: Per-Layer Intelligent Quantization
Standard quantization applies the same bit-width to every layer in the model. Unsloth Dynamic 2.0 doesn't.
The core insight: transformer layers are not equally sensitive to precision loss. Embedding layers and the first and last few attention blocks hold structural information that the rest of the model depends on. Quantize these aggressively and quality drops sharply. The middle feed-forward network layers are more redundant and can survive heavier compression without noticeable degradation.
Dynamic 2.0 measures per-layer sensitivity during quantization and assigns bit-widths accordingly. Sensitive layers stay at Q6 or Q8. Less sensitive FFN layers drop to Q2 or Q3. The average bit-width across the model ends up similar to uniform Q4, but the output quality is measurably better. In practice, Dynamic 2.0 at "Q4" produces perplexity closer to uniform Q5, at the same file size.
This is offline quantization, not dynamic quantization at inference time. The bit-width decisions are baked into the .gguf file during export. The inference runtime (llama.cpp) then loads the pre-quantized weights and runs as normal.
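A greedy sketch of the per-layer idea, using a made-up sensitivity score. Unsloth's actual metric and selection logic are more involved; this only demonstrates the budgeted allocation pattern:

```python
import numpy as np

def assign_bits(sensitivity, budget=4.0, high=6, low=2):
    """Start every layer at the low bit-width, then promote layers in
    descending sensitivity order while the model-wide average bit-width
    stays within the budget. Scores are an arbitrary illustrative proxy."""
    bits = np.full(len(sensitivity), float(low))
    for i in np.argsort(sensitivity)[::-1]:   # most sensitive first
        trial = bits.copy()
        trial[i] = high
        if trial.mean() > budget:
            break
        bits = trial
    return bits

# Pretend the first and last layers are most sensitive, as the text describes
scores = np.array([9, 8, 3, 2, 1, 1, 2, 3, 8, 9], dtype=float)
layer_bits = assign_bits(scores)
```

The outer layers end up at 6-bit while the middle layers stay at 2-bit, with the average pinned near the Q4 budget: the same shape of allocation described above.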
The Unsloth team documented their benchmarks and methodology in their Dynamic 2.0 blog post. For a comparison of Unsloth's fine-tuning capabilities against other frameworks, see the Axolotl vs Unsloth vs TorchTune comparison.
GPU VRAM Requirements by Quant Level
Add 10-20% headroom to these numbers for KV cache and framework overhead. Actual VRAM usage varies by context length and batch size.
| Model Size | Q2_K | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | FP16 |
|---|---|---|---|---|---|---|
| 7B | ~3.8 GB | ~4.5 GB | ~5.2 GB | ~5.9 GB | ~7.7 GB | ~14 GB |
| 13B | ~6.9 GB | ~8.4 GB | ~9.7 GB | ~10.8 GB | ~14.3 GB | ~26 GB |
| 32B | ~17 GB | ~20 GB | ~23 GB | ~26 GB | ~34 GB | ~64 GB |
| 70B | ~38 GB | ~42 GB | ~48 GB | ~56 GB | ~75 GB | ~140 GB |
For a complete GPU-to-model matching reference including MoE architectures and multi-GPU configurations, see the GPU memory requirements for LLMs guide and the GPU requirements cheat sheet.
Key practical thresholds:
- 7B Q4_K_M at 4.5 GB: fits on virtually any modern GPU, including consumer cards with 8 GB VRAM.
- 70B Q4_K_M at 42 GB: fits on a single A100 80GB or H100 80GB with room for context.
- 70B Q8_0 at 75 GB: tight on an A100 80GB, comfortable on an H100 80GB.
- 70B FP16 at 140 GB: requires 2 A100 80GB GPUs minimum.
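The table rows follow a simple back-of-envelope formula: parameter count times the quant's effective bits per weight, divided by 8, plus headroom. A sketch, where the effective bit-widths are approximate community figures rather than exact values:

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.15) -> float:
    """Weights-only footprint at the quant's effective bit-width, with a
    15% headroom factor (an assumption) for KV cache and runtime buffers."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

# Approximate effective bits/weight. K-quants average above their nominal
# bit-width because some tensors stay at higher precision.
EFFECTIVE_BITS = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

q4_weights_only = vram_estimate_gb(70, EFFECTIVE_BITS["Q4_K_M"], overhead=1.0)
```

For a 70B model this gives roughly 42 GB of weights at Q4_K_M and exactly 140 GB at FP16, matching the table before headroom.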
Step-by-Step: Quantize Any Model to GGUF with Unsloth Dynamic 2.0
This workflow starts from a Hugging Face model checkpoint and produces a .gguf file ready for llama.cpp serving.
1. Install dependencies
```shell
pip install unsloth
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

2. Load the model with Unsloth
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.3-70B-Instruct",
    max_seq_length=8192,
    load_in_4bit=True,  # Load in 4-bit for quantization
)
```

3. Export to GGUF
```python
# Standard Q4_K_M (uniform 4-bit K-quant)
model.save_pretrained_gguf("llama-70b-q4km", tokenizer, quantization_method="q4_k_m")

# Unsloth Dynamic 2.0 (per-layer selection, better quality at same size)
model.save_pretrained_gguf("llama-70b-dynamic", tokenizer, quantization_method="q4_k_xl")

# Q5_K_M for near-lossless quality
model.save_pretrained_gguf("llama-70b-q5km", tokenizer, quantization_method="q5_k_m")
```

The q4_k_xl method triggers Unsloth's Dynamic 2.0 per-layer sensitivity analysis. It takes longer than standard Q4_K_M but produces a better model at the same file size. Dynamic 2.0 output files follow the UD-Q4_K_XL naming convention. Verify the current method name against the Unsloth GitHub README, as the API surface changes with releases.
4. Verify output size
```shell
ls -lh llama-70b-q4km/
# Expect: ~42 GB for 70B Q4_K_M
# Standard Q4_K_M vs Dynamic 2.0 should be similar in file size
```

5. Optional: push to Hugging Face
```python
# Push quantized model to your HF repo
model.push_to_hub_gguf(
    "your-username/llama-70b-q4km-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_...",
)
```

Deploy GGUF Models on GPU Cloud with llama.cpp Server
1. Provision a GPU instance
Log into Spheron, navigate to the GPU catalog, and select an A100 80GB PCIe ($1.14/hr spot, $1.04/hr on-demand) for 70B Q4_K_M models. SSH in and verify the GPU is available:
```shell
nvidia-smi
# Should show A100 80GB, CUDA version, driver version
```

2. Build llama.cpp with CUDA support
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Verify build
./build/bin/llama-cli --version
```

3. Download the GGUF model
```shell
pip install huggingface_hub
huggingface-cli download \
  bartowski/Llama-3.3-70B-Instruct-GGUF \
  Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
```

4. Launch the server with full GPU offloading
```shell
./build/bin/llama-server \
  -m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --port 8080 \
  --ctx-size 8192 \
  --n-predict -1 \
  --threads 8 \
  --host 0.0.0.0
```

Security note: --host 0.0.0.0 binds the server to all network interfaces, including the public IP of your cloud instance. Anyone who can reach that IP on port 8080 can send inference requests. Before exposing this endpoint, restrict access with a firewall rule (allow only trusted IPs on port 8080) or place an authenticated reverse proxy (nginx with auth_basic, Caddy with JWT, or similar) in front of it. On Spheron instances, you can also bind to 127.0.0.1 and use SSH port forwarding for local testing without opening a public port.
The -ngl 99 flag offloads all layers to the GPU. For a 70B model with 80 transformer layers, -ngl 80 would work equally well; passing 99 is safe because llama.cpp caps the value at the model's actual layer count.
5. Test the OpenAI-compatible endpoint
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "What is GGUF quantization?"}],
    "max_tokens": 200
  }'
```

The response format is identical to the OpenAI API. Point any OpenAI SDK client at http://your-instance-ip:8080/v1 and it works without code changes. See self-hosted OpenAI-compatible API with vLLM for a comparison with the vLLM serving path.
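For clients that shouldn't take on the OpenAI SDK dependency, the same endpoint can be called with nothing but the Python standard library. A minimal sketch, where the base URL and model name are placeholders for your instance:

```python
import json
import urllib.request

def chat(base_url: str, prompt: str, max_tokens: int = 200) -> str:
    """POST an OpenAI-style chat completion to a llama.cpp server and
    return the assistant's reply. base_url is e.g. http://localhost:8080."""
    body = json.dumps({
        "model": "llama",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Swap the base URL for http://your-instance-ip:8080 once the firewall rules described above are in place.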
Docker alternative
```shell
docker run --gpus all \
  -v /path/to/models:/models \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --port 8080 \
  --host 0.0.0.0
```

Same security note applies: -p 8080:8080 maps the container port to the host's public interface. Use -p 127.0.0.1:8080:8080 to bind locally only, or restrict port 8080 with a firewall rule before running this on a cloud instance.
vLLM GGUF Support (Alternative to llama.cpp Server)
vLLM added GGUF loading support in mid-2024 (around v0.5.5+). Teams already running vLLM who want to experiment with GGUF files don't need to switch to llama.cpp.
```shell
vllm serve ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --tokenizer meta-llama/Llama-3.3-70B-Instruct \
  --served-model-name llama-70b
```

The --tokenizer flag is required because vLLM needs the tokenizer separately when loading GGUF files (unlike llama.cpp, which reads it from the GGUF bundle).
One practical caveat: vLLM's GGUF support is less mature than llama.cpp. Mixed-precision K-quants work, but the testing coverage and performance optimization are more limited. For workflows centered on GGUF, llama.cpp is still the recommended path. For production deployments needing high-concurrency batching and vLLM's continuous batching engine, AWQ or FP8 via native safetensors is a better fit. See the vLLM production deployment guide for those configurations.
Benchmarks: Quality vs Speed vs VRAM on A100 and H100
These figures are directional estimates based on community benchmarks for Llama 3.3 70B (see the llama.cpp quantization perplexity benchmarks and Unsloth's Dynamic 2.0 evaluation results). Treat them as approximate ranges, not measured values for your specific setup. Measure your model and workload before making infrastructure decisions.
Quality: perplexity increase vs BF16 (lower is better)
| Quant | Perplexity delta | Quality rating |
|---|---|---|
| Q2_K | +0.80 | Acceptable for low-stakes use |
| Q4_K_M | +0.18 | Excellent for production |
| Q5_K_M | +0.08 | Near-lossless |
| Q8_0 | +0.01 | Indistinguishable |
| Dynamic 2.0 (Q4) | +0.12 | Better than uniform Q4 |
Dynamic 2.0 closes roughly 60% of the perplexity gap between uniform Q4_K_M (+0.18) and Q5_K_M (+0.08), at the same file size and memory footprint as Q4_K_M.
Throughput: tokens/sec on A100 80GB, Llama 3.3 70B, batch=1 (estimated — results vary significantly by driver version, context length, and thermal state; benchmark your specific setup)
| Quant | Tokens/sec | VRAM used |
|---|---|---|
| Q4_K_M | ~20-30 | ~42 GB |
| Q5_K_M | ~28-36 | ~48 GB |
| Q8_0 | ~20-25 | ~75 GB |
| FP16 (2x A100) | ~30-40 | 140 GB (2 GPUs) |
FP16 on 2x A100 and Q4_K_M on 1x A100 have similar throughput at batch=1, but Q4_K_M costs half as much per hour. For higher batch sizes, 2x A100 FP16 scales better. Use llama.cpp's built-in benchmark tool to measure your specific configuration:
```shell
./build/bin/llama-bench -m ./models/model.gguf -ngl 99
```

Cost Analysis: Single-GPU Quantized vs Multi-GPU FP16
Pricing as of 09 Apr 2026 from the Spheron GPU catalog:
| Config | GPU | On-Demand $/hr | Spot $/hr | Notes |
|---|---|---|---|---|
| 70B FP16 | 2x A100 PCIe | ~$2.08 | ~$2.28 | Minimum 2 GPUs needed |
| 70B Q4_K_M | 1x A100 PCIe | $1.04 | $1.14 | Single GPU, full GPU offload |
| 70B Q4_K_M | 1x A100 SXM4 | $1.64 | $0.45 | Higher memory bandwidth |
| 70B Q4_K_M | 1x H100 PCIe | $2.01 | N/A | More VRAM headroom |
Cost per million tokens (formula: $/hr / (tokens/sec × 3,600) × 1,000,000):
| Config | $/hr | Est. Tokens/sec | Est. $/M tokens |
|---|---|---|---|
| 70B FP16, 2x A100 PCIe (on-demand) | $2.08 | ~35* | ~$16.51* |
| 70B Q4_K_M, 1x A100 PCIe (on-demand) | $1.04 | ~25* | ~$11.56* |
| 70B Q4_K_M, 1x A100 PCIe (spot) | $1.14 | ~25* | ~$12.67* |
*Figures are estimates based on directional benchmarks at batch=1. Actual throughput depends on context length, batch size, and prompt characteristics. Measure your workload with llama-bench before committing to a configuration.
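The formula above is simple enough to keep as a helper when comparing configurations; the numbers below reproduce the table's estimates:

```python
def cost_per_million_tokens(dollars_per_hour: float,
                            tokens_per_sec: float) -> float:
    """$/M tokens = hourly rate / tokens generated per hour, scaled to 1M."""
    return dollars_per_hour / (tokens_per_sec * 3600) * 1_000_000

fp16_2xa100 = cost_per_million_tokens(2.08, 35)   # 70B FP16, 2x A100
q4_1xa100 = cost_per_million_tokens(1.04, 25)     # 70B Q4_K_M, 1x A100
```

Plugging in the spot rate of $1.14/hr at the same ~25 tokens/sec gives the ~$12.67/M figure in the last row.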
Pricing fluctuates based on GPU availability. The prices above were captured on 09 Apr 2026 and may have changed; check the current GPU pricing page for live rates.
For a broader cost optimization framework, see AI inference cost economics and the GPU cost optimization playbook.
When to Use GGUF vs AWQ vs Native FP16: Decision Framework
| Scenario | Recommended format | Reason |
|---|---|---|
| Serving with llama.cpp or Ollama | GGUF | Native format, simplest setup |
| Serving with vLLM at high throughput | AWQ or FP8 | Better batching, GPU-native ops |
| Single developer, limited GPU budget | GGUF Q4_K_M | Fits on smaller GPU, cheaper instance |
| Production API, latency SLA | FP16 or FP8 (vLLM) | Maximum throughput, predictable latency |
| Hybrid CPU and GPU inference | GGUF | Only format that handles both cleanly |
| Moving from Ollama to cloud | GGUF | Zero conversion needed, same files |
For teams already running Ollama locally, the cloud transition is straightforward: the same .gguf files work on a Spheron GPU instance with llama.cpp server. No re-quantization, no format conversion.
Production Setup: Serving Quantized Models at Scale on Spheron
Process management with systemd
Keep the llama.cpp server running after SSH disconnect:
```ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
Type=simple
ExecStart=/home/user/llama.cpp/build/bin/llama-server \
    -m /models/llama-70b-q4km.gguf \
    -ngl 99 \
    --port 8080 \
    --host 0.0.0.0 \
    --ctx-size 8192 \
    --metrics
Restart=always
User=user

[Install]
WantedBy=multi-user.target
```

Important: This service binds to 0.0.0.0 and will restart automatically after reboots, keeping the port open at all times. Restrict access before enabling this unit in production. Add a firewall rule to limit port 8080 to trusted source IPs, or place an authenticated reverse proxy in front of the service. Without this, any machine that can reach your instance will have free access to your inference endpoint.
```shell
sudo systemctl enable llama-server
sudo systemctl start llama-server
```

Health check
llama.cpp server exposes a /health endpoint:
```shell
curl http://localhost:8080/health
# Returns: {"status": "ok"}
```

Load balancing across multiple instances
For higher throughput, run multiple llama.cpp instances on separate ports (or separate GPU instances) and use nginx as a load balancer:
```nginx
upstream llama_pool {
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://llama_pool;
    }
}
```

Throughput monitoring
The --metrics flag enables a Prometheus-compatible metrics endpoint at /metrics. Without this flag, the /metrics endpoint is not available. Key metrics to track: llama_prompt_tokens_total, llama_tokens_predicted_total, and llama_kv_cache_usage_ratio.
For broader production architecture patterns, see GPU monitoring for ML workloads and production GPU cloud architecture.
Quantization is the most direct way to cut GPU cloud spend without sacrificing production quality. A 70B model at Q4_K_M fits on a single A100 on Spheron at $1.04/hr on-demand - roughly half the cost of a two-GPU FP16 setup.
