The 2026 llama.cpp engine rewrite significantly improved inference speed on Ada Lovelace and Blackwell-class hardware through CUDA kernel rewrites. For teams running models larger than a single GPU's VRAM, multi-GPU tensor split means you're not limited by what fits on one card.
This isn't a dev-only tool anymore. If you're serving fewer than 15-20 concurrent users, a single well-configured llama-server instance on a cloud GPU is cheaper and simpler than running a full vLLM stack. No Docker, no Python serving process, no Hugging Face model downloads in safetensors format. Just a compiled binary, a GGUF file, and an OpenAI-compatible API on port 8080.
For background on GGUF quantization formats and how to pick between Q4_K_M, Q5_K_M, and Q8_0, see the GGUF dynamic quantization guide. If you're choosing between llama.cpp and vLLM for your use case, the Ollama vs vLLM comparison covers the concurrency and throughput tradeoffs in detail.
TL;DR
| Feature | llama.cpp | vLLM | Ollama |
|---|---|---|---|
| Best for | Small teams, GGUF flexibility | High concurrency production | Local dev, prototyping |
| Concurrency ceiling | ~16 parallel slots | 180+ concurrent requests | ~5-10 users |
| Multi-GPU | Layer split (default) + row tensor-parallel | True tensor parallelism | No |
| GGUF support | Native | Limited (safetensors preferred) | Native |
| Setup time | 5-10 min | 10-15 min | 2 min |
| OpenAI API | Yes (/v1/chat/completions) | Yes | Yes |
| Continuous batching | No | Yes | No |
| Binary size | Single binary, ~50 MB | Docker image, ~8 GB | Installer, ~500 MB |
When llama.cpp Beats vLLM and SGLang
Small teams and internal tools. If your service handles sub-16 concurrent users, llama.cpp's --parallel N flag gives you parallel decoding slots without the operational overhead of vLLM's continuous batching engine. Fewer moving parts means less to debug at 2am.
GGUF flexibility. llama.cpp loads any GGUF file directly: Q2_K, Q4_K_M, Q5_K_M, Q8_0, IQ4_XS. No conversion, no quantization step on your side. You download the file, point the server at it, and you're serving. vLLM prefers safetensors and its GGUF support is more limited.
CPU fallback. If a model's layers don't all fit on your GPU VRAM, llama.cpp falls back automatically to CPU for the remaining layers. You get slower inference rather than an OOM crash. vLLM does not have this behavior.
Mixed hardware. Tensor split across GPUs of different VRAM sizes works out of the box. If you have an 80GB and a 40GB GPU, --tensor-split 2,1 allocates layers proportionally. vLLM's tensor parallelism requires homogeneous GPU setups for full efficiency.
Zero Python overhead. A single compiled binary handles the HTTP server, the inference loop, and the API. No FastAPI, no asyncio workers, no CUDA version mismatches between PyTorch and your driver.
For high-concurrency production (50+ users), see the vLLM vs SGLang benchmark guide for the throughput numbers you'd need to justify the added complexity.
2026 Engine Improvements: Multi-GPU and Faster Inference
Two things make llama.cpp worth reconsidering for production in 2026:
--tensor-split for multi-GPU setups. The flag distributes model layers across all detected GPUs based on the ratios you specify. By default this uses --split-mode layer (pipeline parallel): the model loads across GPUs at startup, layers route to specific GPUs, and at any moment only one GPU is computing. This unlocks models too large for any single GPU you have, but compute does not scale with GPU count in this mode.
CUDA kernel rewrites on Ada Lovelace and Blackwell. The April 2026 engine rewrite improved throughput significantly on RTX 5090 and H100 over the previous kernels for typical Q4_K_M inference workloads. The RTX 5090's Blackwell architecture benefits the most: the improved kernels take better advantage of Blackwell's memory bandwidth and architecture, boosting throughput across quantization levels, not only Q4_K_M. Note that CUDA 12.8 is needed on Blackwell to avoid a regression. For specific version numbers and changelog entries, see the llama.cpp GitHub releases page.
Multi-GPU modes in llama.cpp. There are two approaches, listed here alongside vLLM's tensor parallelism for comparison:
- --split-mode layer (default): Layers are assigned to specific GPUs serially. GPU 0 handles layers 0-N, GPU 1 handles layers N+1-M. At any moment, only one GPU is computing. VRAM utilization scales, compute does not.
- --split-mode row (NCCL required): Weight matrices within each layer are sharded across all GPUs. All GPUs compute in parallel on every forward pass. Both VRAM and compute scale with GPU count. Build with -DGGML_CUDA_NCCL=ON to enable. Gains are most visible on fast GPU interconnect (NVLink preferred); PCIe setups see smaller improvements due to interconnect bandwidth.
- Tensor parallelism (vLLM): Similar weight-sharding approach, combined with continuous batching for high-concurrency production loads.
Layer split is the right default for fitting large models across GPUs. For actual per-token throughput gains from multiple GPUs, --split-mode row is the option within llama.cpp. For very high concurrency (50+ users), vLLM's continuous batching engine scales further.
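To make the layer-split arithmetic concrete, here is a minimal Python sketch of proportional layer assignment. It is an illustration only, not llama.cpp's actual allocator: the function name split_layers, the cumulative-rounding scheme, and the 80-layer example are all assumptions chosen for clarity.

```python
def split_layers(n_layers: int, ratios: list[float]) -> list[range]:
    """Assign contiguous layer ranges to GPUs in proportion to `ratios`.

    Hypothetical helper for illustration; llama.cpp may round differently.
    """
    if n_layers <= 0 or not ratios or any(r < 0 for r in ratios) or sum(ratios) <= 0:
        raise ValueError("need a positive layer count and non-negative ratios with a positive sum")
    total = sum(ratios)
    ranges = []
    start = 0
    cumulative = 0.0
    for ratio in ratios:
        cumulative += ratio
        # Round the cumulative boundary so the ranges cover every layer exactly once.
        end = round(n_layers * cumulative / total)
        ranges.append(range(start, end))
        start = end
    return ranges


# Example: an assumed 80-layer model split 2:1 across two GPUs.
for gpu, layers in enumerate(split_layers(80, [2, 1])):
    print(f"GPU {gpu}: {len(layers)} layers ({layers.start}..{layers.stop - 1})")
```

In layer mode each GPU owns one of these contiguous ranges and a token passes through them in order, which is why VRAM scales with GPU count while compute does not.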
Choosing the Right Spheron GPU
| GPU | VRAM | Best for (GGUF) | On-demand $/hr | Spot $/hr |
|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | 7B-32B Q4_K_M | ~$0.92 | N/A |
| L40S | 48 GB GDDR6 | 7B-34B Q4_K_M, 13B Q8_0 | ~$1.16 | ~$0.61 |
| H100 PCIe 80GB | 80 GB HBM2e | 34B-70B Q4_K_M | $2.01 | N/A |
| H100 SXM5 80GB | 80 GB HBM3 | 34B-70B Q4_K_M, multi-GPU 70B+ | $5.01 | $2.91 |
RTX 5090 on Spheron is the cheapest competent option for 7B-32B models at Q4_K_M. The 32GB GDDR7 bandwidth handles fast decode on smaller quantized models, and the Blackwell architecture gets the full benefit of the 2026 kernel improvements.
For 34B-70B models, you want 48-80GB of VRAM. L40S instances fit 34B Q4_K_M comfortably with headroom for a 32K context. The on-demand H100 cloud instance is the right choice when you need 70B at Q4_K_M on a single GPU, or when you're setting up a two-GPU tensor split for larger models.
Pricing fluctuates based on GPU availability. The prices above are as of 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.
GGUF Quantization Choices and VRAM-Per-Quant Tradeoffs
| Model Size | Q4_K_M | Q5_K_M | Q8_0 | Fits on |
|---|---|---|---|---|
| 7B | ~4.5 GB | ~5.2 GB | ~7.7 GB | RTX 5090, L40S |
| 13B | ~8.4 GB | ~9.7 GB | ~14.3 GB | RTX 5090, L40S |
| 32B | ~20 GB | ~23 GB | ~34 GB | L40S, H100 |
| 70B | ~42 GB | ~48 GB | ~75 GB | H100 80GB |
Q4_K_M is the production default. Perplexity loss versus BF16 is about 0.15-0.20 points, which is imperceptible in practice. It fits 70B models on a single H100 80GB with enough VRAM headroom for an 8K context window.
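The context-window headroom can be estimated with the standard KV cache formula for grouped-query attention. The sketch below assumes a Llama-3-70B-style shape (80 layers, 8 KV heads, head dimension 128) and a 16-bit KV cache; the function name kv_cache_gib is hypothetical, and the estimate ignores compute buffers and any per-slot overhead.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for one full context window."""
    # Factor 2 = one key vector plus one value vector per layer per token.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value
    return total_bytes / 2**30


# Assumed Llama-3-70B-style shape: 80 layers, 8 KV heads, head dimension 128.
print(f"8K context:  {kv_cache_gib(80, 8, 128, 8192):.2f} GiB")
print(f"32K context: {kv_cache_gib(80, 8, 128, 32768):.2f} GiB")
```

Under these assumptions an 8K context adds about 2.5 GiB on top of the roughly 42 GB of Q4_K_M weights, which is consistent with the headroom claim for a single 80GB card.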
Q5_K_M closes to within 0.08 perplexity points of BF16. Worth the extra VRAM if your GPU has it. For 7B-13B models on an RTX 5090 or L40S, Q5_K_M is a straightforward upgrade from Q4_K_M.
Q8_0 is near-lossless but largely eliminates the file size benefit. A 70B Q8_0 model is 75 GB, which leaves too little room for the KV cache and compute buffers on a single H100 80GB, so in practice it needs tensor split across two GPUs. Use it only when you have a specific quality requirement that Q5_K_M doesn't meet.
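The quant choice can be sketched as a simple lookup against the size table above. This is a rough heuristic, not a sizing tool: the 8 GB reserve for KV cache and buffers is an assumed default, and pick_quant is a hypothetical helper name.

```python
from typing import Optional

# Approximate GGUF file sizes in GB, copied from the table above.
GGUF_SIZES_GB = {
    "7B":  {"Q4_K_M": 4.5,  "Q5_K_M": 5.2,  "Q8_0": 7.7},
    "13B": {"Q4_K_M": 8.4,  "Q5_K_M": 9.7,  "Q8_0": 14.3},
    "32B": {"Q4_K_M": 20.0, "Q5_K_M": 23.0, "Q8_0": 34.0},
    "70B": {"Q4_K_M": 42.0, "Q5_K_M": 48.0, "Q8_0": 75.0},
}

# Highest quality first, so the first quant that fits wins.
QUALITY_ORDER = ["Q8_0", "Q5_K_M", "Q4_K_M"]


def pick_quant(model_size: str, vram_gb: float, reserve_gb: float = 8.0) -> Optional[str]:
    """Return the highest-quality quant whose weights plus a reserve fit in VRAM.

    `reserve_gb` is an assumed allowance for KV cache and buffers; tune it for
    your context length and --parallel setting.
    """
    for quant in QUALITY_ORDER:
        if GGUF_SIZES_GB[model_size][quant] + reserve_gb <= vram_gb:
            return quant
    return None  # nothing fits on one GPU; consider a multi-GPU split


print(pick_quant("70B", 80))
print(pick_quant("32B", 32))
print(pick_quant("70B", 48))
```

A larger reserve pushes the result toward Q4_K_M, which matches the advice to treat it as the default when context length or slot count is high.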
Pre-quantized models in all these formats are available from the Bartowski and Unsloth namespaces on Hugging Face. For custom quantization with Unsloth Dynamic 2.0's per-layer intelligent quantization, see the GGUF dynamic quantization guide.
Build llama.cpp with CUDA on a Cloud GPU
After provisioning your Spheron instance, SSH in and verify the GPU:
nvidia-smi
# Should show your GPU model, VRAM, CUDA version, and driver
Then build llama.cpp with CUDA support:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with CUDA backend
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Verify the server binary
./build/bin/llama-server --version
The build takes 3-8 minutes depending on GPU instance CPU count. The server binary is at ./build/bin/llama-server.
For ROCm (AMD GPU) builds, replace -DGGML_CUDA=ON with -DGGML_HIP=ON (older releases used -DGGML_HIPBLAS=ON). This guide focuses on CUDA; the server flag names and API surface are otherwise identical.
Core llama-server Flags
| Flag | What it does | Recommended value |
|---|---|---|
| -m | Model path | Path to your .gguf file |
| -ngl | GPU layers to offload | 99 (all layers) |
| --ctx-size | Context length | 8192 default; 32768 for long context |
| --parallel | Concurrent request slots | 8-16 for production |
| --host | Bind address | 0.0.0.0 for external access |
| --port | Server port | 8080 |
| --n-predict | Max tokens per response | -1 (unlimited) |
| --threads | CPU threads for CPU layers | $(nproc) |
| --tensor-split | Multi-GPU split ratios | 1,1 for two equal GPUs |
| --split-mode | Multi-GPU compute mode | layer (default, pipeline) or row (NCCL, true tensor-parallel) |
| --flash-attn | Flash attention | Add for longer contexts |
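If you launch the server from a script, it can help to assemble the command line from named settings instead of editing a long string. The sketch below uses only flags from the table above; the function llama_server_argv and its defaults are hypothetical, picked to mirror the recommended values.

```python
import shlex


def llama_server_argv(model_path, *, n_gpu_layers=99, host="0.0.0.0", port=8080,
                      ctx_size=8192, parallel=8, tensor_split=None,
                      flash_attn=False, binary="./build/bin/llama-server"):
    """Build an argv list for llama-server from the flags in the table above."""
    argv = [
        binary,
        "-m", model_path,
        "-ngl", str(n_gpu_layers),
        "--host", host,
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        "--parallel", str(parallel),
    ]
    if tensor_split:
        # Ratios are passed as one comma-separated value, e.g. "1,1".
        argv += ["--tensor-split", ",".join(str(r) for r in tensor_split)]
    if flash_attn:
        argv.append("--flash-attn")
    return argv


argv = llama_server_argv("./models/your-model.gguf", tensor_split=[1, 1])
print(shlex.join(argv))
```

The resulting list can be passed straight to subprocess.Popen(argv) without shell quoting concerns; shlex.join is used here only to print a copy-pasteable command.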
Launching the Server
Download a model first:
pip install huggingface_hub
huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
--include "Llama-3.3-70B-Instruct-Q4_K_M.gguf" \
--local-dir ./models
For 7B-13B models:
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
--local-dir ./models
Launch the server:
./build/bin/llama-server \
-m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
-ngl 99 \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 8192 \
--parallel 8 \
--flash-attn
You'll see layer loading progress, then HTTP server listening on 0.0.0.0:8080. From that point, the server is ready.
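Large models can take a while to load, so a deploy script may want to wait for readiness instead of watching the log. The sketch below polls the server's /health endpoint with the Python standard library; it assumes your build exposes /health and returns HTTP 200 once the model is loaded, and the function name and timeout values are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeouts, or non-2xx replies while the model is still loading.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A typical call would be wait_until_ready("http://YOUR_IP:8080"), followed by the first real request only if it returns True.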
Multi-GPU Tensor Split Configuration
For models that exceed a single GPU's VRAM, add --tensor-split:
./build/bin/llama-server \
-m ./models/your-model.gguf \
-ngl 99 \
--tensor-split 1,1 \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 8192 \
--parallel 8
Finding the right ratios. Run nvidia-smi before starting the server:
GPU 0: NVIDIA H100 80GB SXM5 - 81559 MiB total
GPU 1: NVIDIA H100 80GB SXM5 - 81559 MiB total
Two equal 80GB GPUs: --tensor-split 1,1.
Unequal GPUs (80GB + 48GB):
GPU 0: 80GB
GPU 1: 48GB
The ratio is 80:48, which simplifies to 5:3. Use --tensor-split 5,3.
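The ratio simplification above is just a greatest-common-divisor reduction. A minimal sketch follows, assuming you pass whole-number VRAM sizes in the same order nvidia-smi lists the GPUs; tensor_split_ratio is a hypothetical helper name.

```python
from functools import reduce
from math import gcd


def tensor_split_ratio(vram_gb: list[int]) -> str:
    """Reduce per-GPU VRAM sizes to the smallest whole-number ratio string."""
    if not vram_gb or any(v <= 0 for v in vram_gb):
        raise ValueError("provide at least one positive VRAM size")
    divisor = reduce(gcd, vram_gb)
    return ",".join(str(v // divisor) for v in vram_gb)


print(tensor_split_ratio([80, 48]))  # 5,3
print(tensor_split_ratio([80, 80]))  # 1,1
print(tensor_split_ratio([80, 40]))  # 2,1
```

Reducing the ratio is mainly for readability, since the values are proportions. If one GPU also drives a display or another process, lower its share to leave room.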
When tensor split is worth it vs. upgrading GPU. Running a 70B Q8_0 model (75 GB) across two H100 80GB GPUs with --tensor-split 1,1 works, but throughput is roughly the same as a single H100 running the same model at Q4_K_M (42 GB). In most cases, a smaller quant on a single GPU is simpler and similarly performant. Tensor split is most useful when you want the quality of Q5_K_M or Q8_0 on a 70B model and you don't have a single GPU with enough VRAM.
Exposing an OpenAI-Compatible Endpoint
Test the endpoint immediately after the server starts:
curl http://YOUR_IP:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"max_tokens": 100
}'
llama-server supports these OpenAI-compatible endpoints:
- POST /v1/chat/completions
- POST /v1/completions
- POST /v1/embeddings
- GET /v1/models
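The request that the curl command sends can also be built with only the Python standard library, which is handy on hosts where you would rather not install an SDK. This is a sketch under assumptions: build_chat_request and chat are hypothetical helpers, the "llama" model name mirrors the curl example, and the Authorization header only matters if you have configured key checking.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, max_tokens=100, model="llama", api_key="any-string"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url, prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("http://YOUR_IP:8080", "Hello, how are you?") should return the same text that appears in the curl response.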
Python SDK integration:
from openai import OpenAI
client = OpenAI(
base_url="http://YOUR_IP:8080/v1",
api_key="any-string", # llama-server does not validate the key by default
)
response = client.chat.completions.create(
model="llama",
messages=[{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
max_tokens=200,
)
print(response.choices[0].message.content)
TypeScript/Node.js:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://YOUR_IP:8080/v1",
apiKey: "any-string",
});
const response = await client.chat.completions.create({
model: "llama",
messages: [{ role: "user", content: "Hello" }],
});
For production hardening patterns (auth headers, nginx reverse proxy, health check endpoints), see the self-hosted OpenAI-compatible API guide. It covers these in detail for vLLM, but the nginx and systemd patterns are identical for llama-server.
Running as a Persistent Service with systemd
For production, run llama-server as a managed systemd service so it restarts automatically on crash or reboot:
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/llama.cpp
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server \
-m /home/ubuntu/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
-ngl 99 \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 8192 \
--parallel 8 \
--flash-attn
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server
Logs are accessible via journalctl -u llama-server -f. The service restarts within 10 seconds after a crash.
Concurrency Limits: When to Graduate to vLLM
llama.cpp is the right tool until it isn't. Here's a concrete checklist for when to migrate:
Graduate to vLLM when:
- Queue depth is consistently above 8 requests
- p95 TTFT (time to first token) exceeds 3 seconds under normal load
- You need multi-GPU throughput at very high concurrency beyond what --split-mode row provides (vLLM's continuous batching scales further for 50+ users)
- You need Prometheus /metrics for auto-scaling or alerting
- You're serving more than 20 concurrent users at steady state
Stay on llama.cpp when:
- You need GGUF format flexibility with any quantization level
- You want a single binary with no Python runtime dependency
- Your team has fewer than 15 concurrent users
- You need CPU fallback for models that barely don't fit on GPU
- You're running on mixed GPU hardware where sizes differ
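The numeric thresholds in the checklist can be turned into a small check that you run against your own monitoring data. The sketch below encodes only the three measurable criteria (queue depth above 8, p95 TTFT above 3 seconds, more than 20 steady-state users); the LoadSnapshot type and function name are hypothetical, and how you collect the numbers is up to your stack.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    queue_depth: float       # typical number of queued requests
    p95_ttft_s: float        # 95th-percentile time to first token, in seconds
    concurrent_users: int    # steady-state concurrent users


def reasons_to_migrate(snapshot: LoadSnapshot) -> list[str]:
    """Return the checklist items that the snapshot trips; empty means stay put."""
    reasons = []
    if snapshot.queue_depth > 8:
        reasons.append("queue depth consistently above 8 requests")
    if snapshot.p95_ttft_s > 3.0:
        reasons.append("p95 TTFT exceeds 3 seconds")
    if snapshot.concurrent_users > 20:
        reasons.append("more than 20 concurrent users at steady state")
    return reasons


print(reasons_to_migrate(LoadSnapshot(queue_depth=3, p95_ttft_s=1.2, concurrent_users=12)))
print(reasons_to_migrate(LoadSnapshot(queue_depth=11, p95_ttft_s=4.5, concurrent_users=25)))
```

Treat a single tripped item as a signal to investigate and several at once as a signal to plan the migration.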
Once you hit the graduation criteria above, see the vLLM production deployment guide or the Ollama vs vLLM comparison for a migration path.
Cost: Right-Sizing a Single GPU for llama.cpp
Monthly cost at 720 hours/month:
| Setup | Model | On-demand | Spot | Monthly (on-demand) | Monthly (spot) |
|---|---|---|---|---|---|
| H100 SXM5 80GB | Llama 3.3 70B Q4_K_M | $5.01/hr | $2.91/hr | ~$3,607/mo | ~$2,095/mo |
| H100 PCIe 80GB | Llama 3.3 70B Q4_K_M | $2.01/hr | N/A | ~$1,447/mo | N/A |
| L40S 48GB | Mistral 7B Q4_K_M | ~$1.16/hr | ~$0.61/hr | ~$835/mo | ~$439/mo |
| RTX 5090 32GB | Llama 3.1 8B Q4_K_M | ~$0.92/hr | N/A | ~$662/mo | N/A |
Spot pricing on H100 SXM5 cuts the monthly bill from $3,607 to $2,095 for the same 70B model. The trade-off is that spot instances can be reclaimed with short notice. For development, internal tools, or workloads that can tolerate interruption, spot is a reasonable default. For customer-facing APIs with uptime SLAs, stick to on-demand.
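The monthly figures in the table are simply the hourly rate multiplied by 720 hours. A short sketch of that arithmetic, using the H100 SXM5 rates quoted above (the helper names are illustrative, and the rates will drift as pricing changes):

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, as used in the table above


def monthly_cost(hourly_usd: float, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost in USD for an instance that runs continuously."""
    return hourly_usd * hours


def spot_savings(on_demand_hourly: float, spot_hourly: float) -> tuple[float, float]:
    """Return (USD saved per month, fraction saved) when moving to spot."""
    on_demand = monthly_cost(on_demand_hourly)
    spot = monthly_cost(spot_hourly)
    return on_demand - spot, (on_demand - spot) / on_demand


saved, fraction = spot_savings(5.01, 2.91)
print(f"on-demand: ${monthly_cost(5.01):,.0f}/mo, spot: ${monthly_cost(2.91):,.0f}/mo")
print(f"saved: ${saved:,.0f}/mo ({fraction:.0%})")
```

If the instance only runs during business hours, pass a smaller hours value to see how quickly per-minute billing changes the comparison.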
Spheron aggregates compute from 5+ providers, which is why on-demand H100 pricing here is meaningfully lower than single-hyperscaler rates.
llama.cpp server on a right-sized cloud GPU is the lowest-overhead path to a production OpenAI-compatible API for sub-20-user workloads. Spheron provisions bare-metal GPU instances in under 2 minutes with per-minute billing.
Quick Setup Guide
Choose a GPU based on model size: RTX 5090 (32GB) for 7B-32B Q4_K_M models at lowest on-demand cost, L40S (48GB) for 7B-34B models with more VRAM headroom, H100 80GB for 70B models. Log in at app.spheron.ai and provision a bare-metal instance. Verify the GPU with nvidia-smi before proceeding.
Clone the llama.cpp repo and build the CUDA backend: git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc). Verify with ./build/bin/llama-server --version.
Use huggingface-cli to download a pre-quantized GGUF file. Example for Llama 3.3 70B Q4_K_M: huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF --include 'Llama-3.3-70B-Instruct-Q4_K_M.gguf' --local-dir ./models. For 7B-14B models, use a Q4_K_M or Q5_K_M quant from the Bartowski or Unsloth namespaces.
Start the server: ./build/bin/llama-server -m ./models/your-model.gguf -ngl 99 --host 0.0.0.0 --port 8080 --ctx-size 8192 --parallel 8. The -ngl 99 flag offloads all layers to GPU. --parallel 8 allows 8 concurrent request slots. The OpenAI-compatible API is available on port 8080.
For models that exceed a single GPU's VRAM, add the --tensor-split flag: ./build/bin/llama-server -m ./models/model.gguf -ngl 99 --tensor-split 1,1 --host 0.0.0.0 --port 8080. Ratios in --tensor-split correspond to the relative VRAM allocation across GPUs listed by nvidia-smi. For two 80GB GPUs, 1,1 splits evenly.
Send a test request: curl http://YOUR_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello"}]}'. In Python: set client = openai.OpenAI(base_url='http://YOUR_IP:8080/v1', api_key='any-string') and call client.chat.completions.create() with your normal parameters.
Frequently Asked Questions
Provision a CUDA-enabled GPU instance (L40S or H100 recommended), clone the llama.cpp repo, build with cmake -B build -DGGML_CUDA=ON, download a GGUF model from Hugging Face, then run llama-server with -ngl 99 to offload all layers to GPU. The server exposes an OpenAI-compatible API on port 8080 by default.
Tensor split (--tensor-split flag) distributes model layers across multiple GPUs proportionally. Use it when a GGUF model is too large to fit on a single GPU. For two equal GPUs, --tensor-split 1,1 splits evenly. For unequal GPUs, match the ratios to each GPU's VRAM (e.g. --tensor-split 2,1 for an 80GB/40GB pair). The default behavior uses --split-mode layer (pipeline parallel): layers are assigned serially to each GPU, so compute does not scale with GPU count, but you unlock bigger models. For true tensor parallelism within llama.cpp, use --split-mode row (build with -DGGML_CUDA_NCCL=ON), which shards weight matrices across GPUs with NCCL reductions and improves throughput on fast interconnects like NVLink.
llama.cpp server supports parallel slots (--parallel N flag) for concurrent decoding. In practice, 8-16 parallel slots is a reasonable limit on a single H100 before VRAM pressure and KV cache contention degrade latency. Beyond 16-20 concurrent users with sustained load, vLLM's continuous batching engine will outperform llama.cpp in total throughput. Use llama.cpp for teams of up to about 15 concurrent users; migrate to vLLM beyond that threshold.
Yes. llama-server exposes /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models endpoints that are compatible with the OpenAI Python SDK and any client that supports OpenAI's REST API. Set base_url to http://YOUR_IP:8080/v1 and api_key to any non-empty string. No other code changes are needed.
Q4_K_M is the standard production choice: it fits 70B models on a single A100 80GB or H100 80GB, and perplexity loss versus BF16 is roughly 0.15-0.20 points (imperceptible in practice). Q5_K_M is worth the extra VRAM if you have headroom - it closes to within 0.08 perplexity points of BF16. Q8_0 is near-lossless but largely eliminates the size benefit. For 7B-14B models on an L40S or RTX 5090, Q4_K_M fits comfortably and leaves room for large contexts.
