GGUF Dynamic Quantization on GPU Cloud: Deploy LLMs 50% Cheaper with Unsloth Dynamic 2.0

Written by Mitrasish, Co-founder | Apr 9, 2026

Running a 70B model at FP16 on GPU cloud requires 140 GB of VRAM - typically 2 A100 80GB instances costing $2.08/hr or more on-demand. Quantize the same model to Q4_K_M with GGUF and it fits on a single A100 at $1.04/hr on-demand. That single decision cuts your inference bill roughly in half without rewriting your serving code. For a broader view of GPU memory planning, see the LLM VRAM requirements guide and the AI inference cost economics breakdown.

GGUF in 2026: The Universal LLM Distribution Format

GGUF (GPT-Generated Unified Format) replaced GGML in August 2023 and has since become the standard for portable LLM distribution. The key design choice: everything the model needs is bundled in a single file. Weights, tokenizer vocabulary, tokenizer config, and model hyperparameters are all packed into one .gguf file. No separate tokenizer JSON, no config files to track alongside the weights.
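
Because everything lives in one file, a quick sanity check on a download is to read the fixed-size header at the start of the file. The sketch below assumes GGUF version 2 or later, where the header is the 4-byte magic `GGUF` followed by a little-endian uint32 version and two uint64 counts; the function name `read_gguf_header` is hypothetical and this is not a full metadata parser.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(24)  # 4-byte magic + uint32 version + two uint64 counts
    if len(raw) < 24 or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return version, tensor_count, kv_count
```

A truncated or mislabeled download fails the magic check immediately, which is cheaper than waiting for the server to reject a 42 GB file at load time.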

This matters for cloud deployment in a practical way. You download one file, pass it to llama.cpp server, and get an OpenAI-compatible HTTP API back. No Python serving stack. No separate tokenizer loading code. No framework version dependencies to manage. The binary runs on CPU, Apple Silicon M-series chips, and CUDA GPUs.

Every major local LLM tool uses GGUF: Ollama, LM Studio, Jan, Open WebUI, GPT4All, and llama.cpp itself. Pre-quantized GGUF files for most popular models are available on Hugging Face from community quantizers like Bartowski and the official Unsloth namespace. For cloud use, the combination of small file sizes (a 70B Q4_K_M is roughly 42 GB vs 140 GB for FP16) and zero-dependency serving makes GGUF the lowest-friction path to single-GPU inference at scale.

Quantization Methods Compared: GGUF K-Quants, AWQ, GPTQ, and ExLlamaV2

Not all quantization formats are interchangeable. The right choice depends on your serving framework and hardware.

| Method | Format | GPU Required | Framework | Typical Use |
| --- | --- | --- | --- | --- |
| GGUF K-Quants | .gguf | Optional (CPU/GPU) | llama.cpp, Ollama | Local and cloud inference |
| AWQ | .safetensors | CUDA GPU | vLLM, TGI | GPU-optimized cloud serving |
| GPTQ | .safetensors | CUDA GPU | vLLM, AutoGPTQ | Legacy GPU quantization |
| ExLlamaV2 | .safetensors | CUDA GPU | TabbyAPI, text-gen-webui | High-throughput alternative |

GGUF K-quants use a mixed-precision block quantization scheme. In Q4_K_M for example, most blocks use 4-bit weights but some attention and embedding layers retain higher precision. This is different from naive INT4 where every weight gets the same treatment. The result is better perplexity at the same average bit-width.
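
To make the block-quantization idea concrete, here is a minimal sketch in plain Python. It is a simplified illustration and not llama.cpp's actual K-quant layout: the 32-value block, the float scale and minimum, and the names `quantize_block` and `dequantize_block` are assumptions chosen for readability.

```python
def quantize_block(values, bits=4):
    """Quantize one block of floats to unsigned ints sharing a scale and minimum."""
    levels = (1 << bits) - 1          # 15 distinct steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]

# One block of 32 example weights
block = [0.02 * i - 0.3 for i in range(32)]
codes, scale, lo = quantize_block(block, bits=4)
restored = dequantize_block(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

Each value is off by at most half a quantization step, so a block with a narrow value range loses less than a block with outliers. Giving sensitive tensors more bits shrinks that step, which is the intuition behind mixed-precision schemes.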

AWQ (Activation-aware Weight Quantization) is the better choice when running vLLM at high throughput - it's GPU-native and integrates cleanly with vLLM's continuous batching. See the vLLM production deployment guide for FP8 and multi-GPU vLLM configurations. For format decision context and a direct comparison of Ollama (GGUF) vs vLLM (safetensors), see Ollama vs vLLM.

Unsloth Dynamic 2.0: Per-Layer Intelligent Quantization

Standard K-quant recipes apply the same fixed bit-width pattern to every model, regardless of which layers that particular model is sensitive to. Unsloth Dynamic 2.0 doesn't.

The core insight: transformer layers are not equally sensitive to precision loss. Embedding layers and the first and last few attention blocks hold structural information that the rest of the model depends on. Quantize these aggressively and quality drops sharply. The middle feed-forward network layers are more redundant and can survive heavier compression without noticeable degradation.

Dynamic 2.0 measures per-layer sensitivity during quantization and assigns bit-widths accordingly. Sensitive layers stay at Q6 or Q8. Less sensitive FFN layers drop to Q2 or Q3. The average bit-width across the model ends up similar to uniform Q4, but the output quality is measurably better. In practice, Dynamic 2.0 at "Q4" produces perplexity closer to uniform Q5, at the same file size.
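
The allocation idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Unsloth's algorithm: the sensitivity scores, thresholds, layer names, and parameter counts below are all hypothetical.

```python
def assign_bits(sensitivity):
    """Map a sensitivity score in [0, 1] to a bit-width (illustrative thresholds)."""
    if sensitivity >= 0.8:
        return 8
    if sensitivity >= 0.5:
        return 6
    if sensitivity >= 0.2:
        return 4
    return 3

def average_bits(layers):
    """Parameter-weighted average bit-width over (name, params, sensitivity) tuples."""
    total_params = sum(params for _, params, _ in layers)
    total_bits = sum(params * assign_bits(s) for _, params, s in layers)
    return total_bits / total_params

# Hypothetical layers: (name, parameter count, measured sensitivity)
layers = [
    ("token_embd", 1.0e9, 0.90),
    ("blk.0.attn", 0.5e9, 0.85),
    ("blk.40.ffn", 6.0e9, 0.10),
    ("blk.41.ffn", 6.0e9, 0.30),
    ("output", 1.0e9, 0.60),
]
plan = {name: assign_bits(s) for name, _, s in layers}
avg = average_bits(layers)
```

Because the large FFN tensors dominate the parameter count, keeping a few small sensitive tensors at 6 or 8 bits barely moves the average, which is why the file size stays close to a uniform 4-bit quant.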

This is offline quantization, not dynamic quantization at inference time. The bit-width decisions are baked into the .gguf file during export. The inference runtime (llama.cpp) then loads the pre-quantized weights and runs as normal.

The Unsloth team documented their benchmarks and methodology in their Dynamic 2.0 blog post. For a comparison of Unsloth's fine-tuning capabilities against other frameworks, see the Axolotl vs Unsloth vs TorchTune comparison.

GPU VRAM Requirements by Quant Level

Add 10-20% headroom to these numbers for KV cache and framework overhead. Actual VRAM usage varies by context length and batch size.

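| Model Size | Q2_K | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | FP16 |
| --- | --- | --- | --- | --- | --- | --- |
| 7B | ~3.8 GB | ~4.5 GB | ~5.2 GB | ~5.9 GB | ~7.7 GB | ~14 GB |
| 13B | ~6.9 GB | ~8.4 GB | ~9.7 GB | ~10.8 GB | ~14.3 GB | ~26 GB |
| 32B | ~17 GB | ~20 GB | ~23 GB | ~26 GB | ~34 GB | ~64 GB |
| 70B | ~38 GB | ~42 GB | ~48 GB | ~56 GB | ~75 GB | ~140 GB |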

For a complete GPU-to-model matching reference including MoE architectures and multi-GPU configurations, see the GPU memory requirements for LLMs guide and the GPU requirements cheat sheet.

Key practical thresholds:

  • 7B Q4_K_M at 4.5 GB: fits on virtually any modern GPU, including consumer cards with 8 GB VRAM.
  • 70B Q4_K_M at 42 GB: fits on a single A100 80GB or H100 80GB with room for context.
  • 70B Q8_0 at 75 GB: tight on any single 80 GB card (A100 or H100), with little room left for KV cache.
  • 70B FP16 at 140 GB: requires 2 A100 80GB GPUs minimum.
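
The thresholds above can be checked with a small helper. This is a rough sketch: the weight sizes come from the 70B row of the table, the default 10% headroom is the low end of the suggested 10-20% range, and the function names are hypothetical.

```python
import math

# Approximate 70B weight sizes in GB, taken from the table above
WEIGHTS_GB_70B = {"Q2_K": 38, "Q4_K_M": 42, "Q5_K_M": 48,
                  "Q6_K": 56, "Q8_0": 75, "FP16": 140}

def fits(weights_gb, gpu_vram_gb, headroom=0.10, gpus=1):
    """True if the weights plus fractional headroom fit in the total VRAM."""
    needed = weights_gb * (1 + headroom)
    return needed <= gpu_vram_gb * gpus

def gpus_needed(weights_gb, gpu_vram_gb, headroom=0.10):
    """Smallest GPU count whose combined VRAM covers weights plus headroom."""
    return math.ceil(weights_gb * (1 + headroom) / gpu_vram_gb)

q4_on_one_a100 = fits(WEIGHTS_GB_70B["Q4_K_M"], 80)
q8_on_one_a100 = fits(WEIGHTS_GB_70B["Q8_0"], 80)
fp16_gpu_count = gpus_needed(WEIGHTS_GB_70B["FP16"], 80)
```

Raising `headroom` to match a long context window or a larger batch is the quickest way to see whether a configuration that looks fine on paper will run out of memory in practice.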

Step-by-Step: Quantize Any Model to GGUF with Unsloth Dynamic 2.0

This workflow starts from a Hugging Face model checkpoint and produces a .gguf file ready for llama.cpp serving.

1. Install dependencies

bash
pip install unsloth
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

2. Load the model with Unsloth

python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.3-70B-Instruct",
    max_seq_length=8192,
    load_in_4bit=True,  # Loads 4-bit weights to cut VRAM use; set False to export from full-precision weights
)

3. Export to GGUF

python
# Standard Q4_K_M (uniform 4-bit K-quant)
model.save_pretrained_gguf("llama-70b-q4km", tokenizer, quantization_method="q4_k_m")

# Unsloth Dynamic 2.0 (per-layer selection, better quality at same size)
model.save_pretrained_gguf("llama-70b-dynamic", tokenizer, quantization_method="q4_k_xl")

# Q5_K_M for near-lossless quality
model.save_pretrained_gguf("llama-70b-q5km", tokenizer, quantization_method="q5_k_m")

The q4_k_xl method triggers Unsloth's Dynamic 2.0 per-layer sensitivity analysis. It takes longer than standard Q4_K_M but produces a better model at the same file size. Dynamic 2.0 output files follow the UD-Q4_K_XL naming convention. Verify the current method name against the Unsloth GitHub README, as the API surface changes with releases.

4. Verify output size

bash
ls -lh llama-70b-q4km/
# Expect: ~42 GB for 70B Q4_K_M
# Standard Q4_K_M vs Dynamic 2.0 should be similar in file size

5. Optional: push to Hugging Face

python
# Push quantized model to your HF repo
model.push_to_hub_gguf(
    "your-username/llama-70b-q4km-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_..."
)

Deploy GGUF Models on GPU Cloud with llama.cpp Server

1. Provision a GPU instance

Log into Spheron, navigate to the GPU catalog, and select an A100 80GB PCIe ($1.14/hr spot, $1.04/hr on-demand) for 70B Q4_K_M models. SSH in and verify the GPU is available:

bash
nvidia-smi
# Should show A100 80GB, CUDA version, driver version

2. Build llama.cpp with CUDA support

bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Verify build
./build/bin/llama-cli --version

3. Download the GGUF model

bash
pip install huggingface_hub
huggingface-cli download \
  bartowski/Llama-3.3-70B-Instruct-GGUF \
  Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

4. Launch the server with full GPU offloading

bash
./build/bin/llama-server \
  -m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --port 8080 \
  --ctx-size 8192 \
  --n-predict -1 \
  --threads 8 \
  --host 0.0.0.0

Security note: --host 0.0.0.0 binds the server to all network interfaces, including the public IP of your cloud instance. Anyone who can reach that IP on port 8080 can send inference requests. Before exposing this endpoint, restrict access with a firewall rule (allow only trusted IPs on port 8080) or place an authenticated reverse proxy (nginx with auth_basic, Caddy with JWT, or similar) in front of it. On Spheron instances, you can also bind to 127.0.0.1 and use SSH port forwarding for local testing without opening a public port.

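The -ngl 99 flag offloads all layers to GPU. A 70B Llama model has 80 transformer layers, so -ngl 80 covers every transformer block; llama.cpp also counts the output layer, so 81 is the smallest value that offloads everything. Using 99 is safe because llama.cpp caps the value at the actual layer count.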

5. Test the OpenAI-compatible endpoint

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "What is GGUF quantization?"}],
    "max_tokens": 200
  }'

The response format is identical to the OpenAI API. Point any OpenAI SDK client at http://your-instance-ip:8080/v1 and it works without code changes. See self-hosted OpenAI-compatible API with vLLM for a comparison with the vLLM serving path.
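
For a dependency-free client, the same request can be sent from Python's standard library. This is a minimal sketch: the helper names `build_chat_request` and `chat` are hypothetical, and the response path `choices[0].message.content` assumes the server follows the OpenAI chat completion schema as described above.

```python
import json
import urllib.request

def build_chat_request(base_url, prompt, max_tokens=200):
    """Build a POST request for the /v1/chat/completions route."""
    payload = {
        "model": "llama",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(base_url, prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("http://localhost:8080", "What is GGUF quantization?")` against a running server should return the generated text as a string.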

Docker alternative

bash
docker run --gpus all \
  -v /path/to/models:/models \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --port 8080 \
  --host 0.0.0.0

Same security note applies: -p 8080:8080 maps the container port to the host's public interface. Use -p 127.0.0.1:8080:8080 to bind locally only, or restrict port 8080 with a firewall rule before running this on a cloud instance.

vLLM GGUF Support (Alternative to llama.cpp Server)

vLLM added GGUF loading support in mid-2024 (around v0.5.5+). Teams already running vLLM who want to experiment with GGUF files don't need to switch to llama.cpp.

bash
vllm serve ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --tokenizer meta-llama/Llama-3.3-70B-Instruct \
  --served-model-name llama-70b

The --tokenizer flag points vLLM at the base model's tokenizer. Unlike llama.cpp, which reads the tokenizer from the GGUF bundle, vLLM would otherwise have to convert it from the GGUF file, which is slower and less reliable.

One practical caveat: vLLM's GGUF support is less mature than llama.cpp. Mixed-precision K-quants work, but the testing coverage and performance optimization are more limited. For workflows centered on GGUF, llama.cpp is still the recommended path. For production deployments needing high-concurrency batching and vLLM's continuous batching engine, AWQ or FP8 via native safetensors is a better fit. See the vLLM production deployment guide for those configurations.

Benchmarks: Quality vs Speed vs VRAM on A100 and H100

These figures are directional estimates based on community benchmarks for Llama 3.3 70B (see the llama.cpp quantization perplexity benchmarks and Unsloth's Dynamic 2.0 evaluation results). Treat them as approximate ranges, not measured values for your specific setup. Measure your model and workload before making infrastructure decisions.

Quality: perplexity increase vs BF16 (lower is better)

| Quant | Perplexity delta | Quality rating |
| --- | --- | --- |
| Q2_K | +0.80 | Acceptable for low-stakes use |
| Q4_K_M | +0.18 | Excellent for production |
| Q5_K_M | +0.08 | Near-lossless |
| Q8_0 | +0.01 | Indistinguishable |
| Dynamic 2.0 (Q4) | +0.12 | Better than uniform Q4 |

Dynamic 2.0 closes roughly 60% of the gap between uniform Q4_K_M and Q5_K_M (0.06 of the 0.10-point difference), at the same file size and memory footprint as Q4_K_M.

Throughput: tokens/sec on A100 80GB, Llama 3.3 70B, batch=1 (estimated; results vary significantly by driver version, context length, and thermal state, so benchmark your specific setup)

| Quant | Tokens/sec | VRAM used |
| --- | --- | --- |
| Q4_K_M | ~20-30 | ~42 GB |
| Q5_K_M | ~28-36 | ~48 GB |
| Q8_0 | ~20-25 | ~75 GB |
| FP16 (2x A100) | ~30-40 | 140 GB (2 GPUs) |

At batch=1, FP16 on 2x A100 is estimated to be somewhat faster than Q4_K_M on 1x A100 (~30-40 vs ~20-30 tokens/sec), but Q4_K_M costs half as much per hour, which works out to a lower cost per token. For higher batch sizes, 2x A100 FP16 scales better. Use llama.cpp's built-in benchmark tool to measure your specific configuration:

bash
./build/bin/llama-bench -m ./models/model.gguf -ngl 99

Cost Analysis: Single-GPU Quantized vs Multi-GPU FP16

Pricing as of 09 Apr 2026 from the Spheron GPU catalog:

| Config | GPU | On-Demand $/hr | Spot $/hr | Notes |
| --- | --- | --- | --- | --- |
| 70B FP16 | 2x A100 PCIe | ~$2.08 | ~$2.28 | Minimum 2 GPUs needed |
| 70B Q4_K_M | 1x A100 PCIe | $1.04 | $1.14 | Single GPU, full GPU offload |
| 70B Q4_K_M | 1x A100 SXM4 | $1.64 | $0.45 | Higher memory bandwidth |
| 70B Q4_K_M | 1x H100 PCIe | $2.01 | N/A | More VRAM headroom |

Cost per million tokens (formula: $/hr / (tokens/sec × 3,600) × 1,000,000):

| Config | $/hr | Est. Tokens/sec | Est. $/M tokens |
| --- | --- | --- | --- |
| 70B FP16, 2x A100 PCIe (on-demand) | $2.08 | ~35* | ~$16.51* |
| 70B Q4_K_M, 1x A100 PCIe (on-demand) | $1.04 | ~25* | ~$11.56* |
| 70B Q4_K_M, 1x A100 PCIe (spot) | $1.14 | ~25* | ~$12.67* |

*Figures are estimates based on directional benchmarks at batch=1. Actual throughput depends on context length, batch size, and prompt characteristics. Measure your workload with llama-bench before committing to a configuration.
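
The cost formula is easy to re-run with your own measurements. The sketch below reproduces the table's estimates from its hourly prices and assumed throughputs; the function name `cost_per_million_tokens` is hypothetical.

```python
def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    """Dollars per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

# Inputs from the table above (estimated throughput at batch=1)
fp16_2x_a100 = cost_per_million_tokens(2.08, 35)
q4_on_demand = cost_per_million_tokens(1.04, 25)
q4_spot = cost_per_million_tokens(1.14, 25)
```

Swap in the tokens/sec that llama-bench reports for your model and context length; because cost scales inversely with throughput, a small change in measured speed moves the per-token figure noticeably.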

Pricing fluctuates based on GPU availability. The prices above are as of 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For a broader cost optimization framework, see AI inference cost economics and the GPU cost optimization playbook.

When to Use GGUF vs AWQ vs Native FP16: Decision Framework

| Scenario | Recommended format | Reason |
| --- | --- | --- |
| Serving with llama.cpp or Ollama | GGUF | Native format, simplest setup |
| Serving with vLLM at high throughput | AWQ or FP8 | Better batching, GPU-native ops |
| Single developer, limited GPU budget | GGUF Q4_K_M | Fits on smaller GPU, cheaper instance |
| Production API, latency SLA | FP16 or FP8 (vLLM) | Maximum throughput, predictable latency |
| Hybrid CPU and GPU inference | GGUF | Only format that handles both cleanly |
| Moving from Ollama to cloud | GGUF | Zero conversion needed, same files |

If serving on GPU cloud rather than CPU, 2:4 structured pruning via SparseGPT or Wanda is a GPU-native alternative that activates hardware sparse tensor cores on Ampere and Hopper, delivering throughput gains that GGUF on llama.cpp cannot match.

For teams already running Ollama locally, the cloud transition is straightforward: the same .gguf files work on a Spheron GPU instance with llama.cpp server. No re-quantization, no format conversion.

Production Setup: Serving Quantized Models at Scale on Spheron

Process management with systemd

Keep the llama.cpp server running after SSH disconnect:

ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
Type=simple
ExecStart=/home/user/llama.cpp/build/bin/llama-server \
  -m /models/llama-70b-q4km.gguf \
  -ngl 99 \
  --port 8080 \
  --host 0.0.0.0 \
  --ctx-size 8192 \
  --metrics
Restart=always
User=user

[Install]
WantedBy=multi-user.target

Important: This service binds to 0.0.0.0 and will restart automatically after reboots, keeping the port open at all times. Restrict access before enabling this unit in production. Add a firewall rule to limit port 8080 to trusted source IPs, or place an authenticated reverse proxy in front of the service. Without this, any machine that can reach your instance will have free access to your inference endpoint.

bash
sudo systemctl enable llama-server
sudo systemctl start llama-server

Health check

llama.cpp server exposes a /health endpoint:

bash
curl http://localhost:8080/health
# Returns: {"status": "ok"}
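
A 70B model can take a while to load, so deployment scripts usually poll the health endpoint before sending traffic. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it simply treats any non-200 status (for example, an error status returned while the model is still loading) as "not ready yet".

```python
import time
import urllib.error
import urllib.request

def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # connection refused, timeout, DNS failure, etc.

def wait_until_healthy(url, attempts=60, delay=5.0,
                       probe=http_status, sleep=time.sleep):
    """Poll url until it returns 200; give up after `attempts` tries."""
    for _ in range(attempts):
        if probe(url) == 200:
            return True
        sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:8080/health")` blocks for up to five minutes with the defaults. The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.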

Load balancing across multiple instances

For higher throughput, run multiple llama.cpp instances on separate ports (or separate GPU instances) and use nginx as a load balancer:

nginx
upstream llama_pool {
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://llama_pool;
    }
}

Throughput monitoring

The --metrics flag enables a Prometheus-compatible metrics endpoint at /metrics. Without this flag, the /metrics endpoint is not available. Key metrics to track: llamacpp:prompt_tokens_total, llamacpp:tokens_predicted_total, and llamacpp:kv_cache_usage_ratio. Metric names can change between releases, so confirm them against the output of your own /metrics endpoint.
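
If you are not running a full Prometheus stack yet, the text format is simple enough to parse directly. The sketch below handles unlabeled metric lines only; the function name `parse_metrics` and the sample metric names are assumptions, so substitute whatever your server's /metrics endpoint actually prints.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {metric_name: float}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics

# Illustrative sample; real output comes from GET /metrics
sample = """# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1200
llamacpp:tokens_predicted_total 3400
"""
parsed = parse_metrics(sample)
```

Sampling the counters twice and dividing the difference by the elapsed seconds gives a rough tokens/sec figure for live traffic.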

For broader production architecture patterns, see GPU monitoring for ML workloads and production GPU cloud architecture.


Quantization is the most direct way to cut GPU cloud spend without sacrificing production quality. A 70B model at Q4_K_M fits on a single A100 on Spheron at $1.04/hr on-demand - roughly half the cost of a two-GPU FP16 setup.

Quick Setup Guide

  1. Choose your quantization level and calculate VRAM requirements

    Decide on the quantization target based on acceptable quality loss and available GPU VRAM. For 70B models: Q4_K_M needs ~38-42 GB, Q5_K_M needs ~47-50 GB, Q8_0 needs ~72-75 GB. Use Unsloth Dynamic 2.0 for any model you want to quantize from scratch, or download pre-quantized GGUF files from Hugging Face (search for Bartowski or unsloth namespace).

  2. Provision a GPU instance on Spheron

    Log into Spheron, pick an A100 80GB PCIe ($1.14/hr spot, $1.04/hr on-demand) or H100 PCIe ($2.01/hr on-demand) based on your model size and throughput requirements. SSH into the instance and verify with nvidia-smi. For 70B Q4_K_M, a single A100 80GB is sufficient.

  3. Quantize your model with Unsloth Dynamic 2.0

    Install Unsloth: pip install unsloth. Load your model and call model.save_pretrained_gguf('output_dir', tokenizer, quantization_method='q4_k_m') for standard Q4_K_M or quantization_method='q4_k_xl' for Unsloth Dynamic 2.0 per-layer quantization. Unsloth will generate the .gguf file directly.

  4. Build llama.cpp with CUDA support

    Clone the llama.cpp repo, then build with: cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc). This compiles the CUDA backend for full GPU offloading. Verify with ./build/bin/llama-cli --version.

  5. Launch llama.cpp server with GPU offloading

    Start the OpenAI-compatible server: ./build/bin/llama-server -m /path/to/model.gguf -ngl 99 --port 8080 --ctx-size 8192 --n-predict -1 --threads 8. The -ngl 99 flag offloads all layers to GPU. For a 70B Q4_K_M model, -ngl 80 offloads all 80 layers.

  6. Test the endpoint

    Send a chat completion request: curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello"}]}'. The response format is identical to the OpenAI API.

Frequently Asked Questions

What is GGUF, and how does it differ from AWQ and GPTQ?

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama. It bundles model weights, tokenizer, and metadata in a single file and supports mixed-precision K-quant methods (Q4_K_M, Q5_K_M, Q8_0). AWQ and GPTQ are GPU-native formats that require CUDA and work better with vLLM and TensorRT-LLM. GGUF runs on CPU, Apple Silicon, and GPU with llama.cpp. For cloud GPU deployment, GGUF with llama.cpp server provides a simple OpenAI-compatible API without a Python serving stack. For cloud GPU deployment targeting Blackwell hardware, MXFP4 (hardware-native FP4) is another option covered in the MXFP4 quantization guide at spheron.network/blog/mxfp4-microscaling-quantization-gpu-cloud/.

What is Unsloth Dynamic 2.0?

Unsloth Dynamic 2.0 is a per-layer intelligent quantization method that assigns different quantization bit-widths to different layers based on their sensitivity. Layers that are more sensitive to precision loss are kept at higher bit-widths (e.g., Q6 or Q8), while less sensitive layers are quantized more aggressively (Q2 or Q3). The result is better output quality than uniform quantization at the same average file size.

How much VRAM does a 70B model need at Q4_K_M?

A 70B model quantized to Q4_K_M requires approximately 38-42 GB of VRAM for the weights, plus KV cache overhead. This fits on a single A100 80GB or H100 80GB with room for context. At FP16, the same model requires approximately 140 GB, needing 2-4 GPUs. The single-GPU Q4_K_M configuration costs roughly half as much per hour on Spheron as a two-GPU FP16 setup.

Can I run GGUF models on cloud GPUs with llama.cpp?

Yes. llama.cpp supports full GPU offloading via CUDA. On Spheron, spin up an A100 or H100 instance, install llama.cpp with CUDA support, download the GGUF model from Hugging Face, and launch the server with the -ngl flag set to the number of layers to offload to GPU. All layers on GPU is typical for cloud inference. llama.cpp exposes an OpenAI-compatible API on port 8080.

Which GGUF quantization level should I use?

Q4_K_M and Q5_K_M are the practical sweet spots. Q4_K_M has a perplexity increase of roughly 0.15-0.20 points versus BF16 on Llama 3.3 70B, which is imperceptible in production use. Q5_K_M brings perplexity within 0.08 points of BF16 at a roughly 14% larger file size (~48 GB vs ~42 GB for a 70B model). Q8_0 is near-lossless but gives up most of the file size savings. For most cloud GPU deployments targeting cost reduction, Q4_K_M is the default choice.