Deploy llama.cpp Server on GPU Cloud: Multi-GPU + GGUF (2026)

The 2026 llama.cpp engine rewrite significantly improved inference speed on Ada Lovelace and Blackwell-class hardware through CUDA kernel rewrites. For teams running models larger than a single GPU's VRAM, multi-GPU tensor split means you're not limited by what fits on one card.

This isn't a dev-only tool anymore. If you're serving fewer than 15-20 concurrent users, a single well-configured llama-server instance on a cloud GPU is cheaper and simpler than running a full vLLM stack. No Docker, no Python serving process, no Hugging Face model downloads in safetensors format. Just a compiled binary, a GGUF file, and an OpenAI-compatible API on port 8080.

For background on GGUF quantization formats and how to pick between Q4_K_M, Q5_K_M, and Q8_0, see the GGUF dynamic quantization guide. If you're choosing between llama.cpp and vLLM for your use case, the Ollama vs vLLM comparison covers the concurrency and throughput tradeoffs in detail.

TL;DR

| Feature | llama.cpp | vLLM | Ollama |
| --- | --- | --- | --- |
| Best for | Small teams, GGUF flexibility | High concurrency production | Local dev, prototyping |
| Concurrency ceiling | ~16 parallel slots | 180+ concurrent requests | ~5-10 users |
| Multi-GPU | Layer split (default) + row tensor-parallel | True tensor parallelism | No |
| GGUF support | Native | Limited (safetensors preferred) | Native |
| Setup time | 5-10 min | 10-15 min | 2 min |
| OpenAI API | Yes (/v1/chat/completions) | Yes | Yes |
| Continuous batching | No | Yes | No |
| Binary size | Single binary, ~50 MB | Docker image, ~8 GB | Installer, ~500 MB |

When llama.cpp Beats vLLM and SGLang

Small teams and internal tools. If your service handles sub-16 concurrent users, llama.cpp's --parallel N flag gives you parallel decoding slots without the operational overhead of vLLM's continuous batching engine. Fewer moving parts means less to debug at 2am.

GGUF flexibility. llama.cpp loads any GGUF file directly: Q2_K, Q4_K_M, Q5_K_M, Q8_0, IQ4_XS. No conversion, no quantization step on your side. You download the file, point the server at it, and you're serving. vLLM prefers safetensors and its GGUF support is more limited.

CPU fallback. If a model's layers don't all fit on your GPU VRAM, llama.cpp falls back automatically to CPU for the remaining layers. You get slower inference rather than an OOM crash. vLLM does not have this behavior.

Mixed hardware. Tensor split across GPUs of different VRAM sizes works out of the box. If you have an 80GB and a 40GB GPU, --tensor-split 2,1 allocates layers proportionally. vLLM's tensor parallelism requires homogeneous GPU setups for full efficiency.

Zero Python overhead. A single compiled binary handles the HTTP server, the inference loop, and the API. No FastAPI, no asyncio workers, no CUDA version mismatches between PyTorch and your driver.

For high-concurrency production (50+ users), see the vLLM vs SGLang benchmark guide for the throughput numbers you'd need to justify the added complexity.

2026 Engine Improvements: Multi-GPU and Faster Inference

Two things make llama.cpp worth reconsidering for production in 2026:

--tensor-split for multi-GPU setups. The flag distributes model layers across all detected GPUs based on the ratios you specify. By default this uses --split-mode layer (pipeline parallel): the model loads across GPUs at startup, layers route to specific GPUs, and at any moment only one GPU is computing. This unlocks models too large for any single GPU you have, but compute does not scale with GPU count in this mode.

CUDA kernel rewrites on Ada Lovelace and Blackwell. The April 2026 engine rewrite improved throughput significantly on RTX 5090 and H100 over the previous kernels for typical Q4_K_M inference workloads. The RTX 5090 benefits the most: the improved kernels make better use of Blackwell's memory bandwidth, boosting throughput across quantization levels. Note that CUDA 12.8 is needed on Blackwell to avoid a regression. For specific version numbers and changelog entries, see the llama.cpp GitHub releases page.

Multi-GPU modes in llama.cpp. llama.cpp offers two split modes; vLLM's tensor parallelism is listed last for comparison:

  • --split-mode layer (default): Layers are assigned to specific GPUs serially. GPU 0 handles layers 0-N, GPU 1 handles layers N+1-M. At any moment, only one GPU is computing. VRAM utilization scales, compute does not.
  • --split-mode row (NCCL required): Weight matrices within each layer are sharded across all GPUs. All GPUs compute in parallel on every forward pass. Both VRAM and compute scale with GPU count. Build with -DGGML_CUDA_NCCL=ON to enable. Gains are most visible on fast GPU interconnect (NVLink preferred); PCIe setups see smaller improvements due to interconnect bandwidth.
  • Tensor parallelism (vLLM): Similar weight-sharding approach, combined with continuous batching for high-concurrency production loads.

Layer split is the right default for fitting large models across GPUs. For actual per-token throughput gains from multiple GPUs, --split-mode row is the option within llama.cpp. For very high concurrency (50+ users), vLLM's continuous batching engine scales further.

Choosing the Right Spheron GPU

| GPU | VRAM | Best for (GGUF) | On-demand $/hr | Spot $/hr |
| --- | --- | --- | --- | --- |
| RTX 5090 | 32 GB GDDR7 | 7B-32B Q4_K_M | ~$0.92 | N/A |
| L40S | 48 GB GDDR6 | 7B-34B Q4_K_M, 13B Q8_0 | ~$1.16 | ~$0.61 |
| H100 PCIe 80GB | 80 GB HBM2e | 34B-70B Q4_K_M | $2.01 | N/A |
| H100 SXM5 80GB | 80 GB HBM3 | 34B-70B Q4_K_M, multi-GPU 70B+ | $5.01 | $2.91 |

RTX 5090 on Spheron is the cheapest competent option for 7B-32B models at Q4_K_M. The 32GB GDDR7 bandwidth handles fast decode on smaller quantized models, and the Blackwell architecture gets the full benefit of the 2026 kernel improvements.

For 34B-70B models, you want 48-80GB of VRAM. L40S instances fit 34B Q4_K_M comfortably with headroom for a 32K context. The on-demand H100 cloud instance is the right choice when you need 70B at Q4_K_M on a single GPU, or when you're setting up a two-GPU tensor split for larger models.

Pricing fluctuates based on GPU availability. The prices above are as of 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.

GGUF Quantization Choices and VRAM-Per-Quant Tradeoffs

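| Model Size | Q4_K_M | Q5_K_M | Q8_0 | Fits on |
| --- | --- | --- | --- | --- |
| 7B | ~4.5 GB | ~5.2 GB | ~7.7 GB | RTX 5090, L40S |
| 13B | ~8.4 GB | ~9.7 GB | ~14.3 GB | RTX 5090, L40S |
| 32B | ~20 GB | ~23 GB | ~34 GB | L40S, H100 |
| 70B | ~42 GB | ~48 GB | ~75 GB | H100 80GB |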

Q4_K_M is the production default. Perplexity loss versus BF16 is about 0.15-0.20 points, which is imperceptible in practice. It fits 70B models on a single H100 80GB with enough VRAM headroom for an 8K context window.

Q5_K_M closes to within 0.08 perplexity points of BF16. Worth the extra VRAM if your GPU has it. For 7B-13B models on an RTX 5090 or L40S, Q5_K_M is a straightforward upgrade from Q4_K_M.

Q8_0 is near-lossless but largely eliminates the file size benefit. A 70B Q8_0 model is about 75 GB, which leaves almost no room for the KV cache on a single H100 80GB, so in practice it needs tensor split across two GPUs. Use it only when you have a specific quality requirement that Q5_K_M doesn't meet.
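The sizing guidance above can be sketched as a quick fit check. This is a minimal illustration, not a sizing tool: the sizes are copied from the VRAM table in this section, while `fits_on_gpu`, `largest_quant_that_fits`, and the 8 GB `headroom_gb` default are hypothetical names and values chosen for the example. Real headroom depends on context length, `--parallel`, and the model architecture.

```python
# Approximate GGUF file sizes in GB, copied from the table in this section.
GGUF_SIZE_GB = {
    "7B":  {"Q4_K_M": 4.5,  "Q5_K_M": 5.2,  "Q8_0": 7.7},
    "13B": {"Q4_K_M": 8.4,  "Q5_K_M": 9.7,  "Q8_0": 14.3},
    "32B": {"Q4_K_M": 20.0, "Q5_K_M": 23.0, "Q8_0": 34.0},
    "70B": {"Q4_K_M": 42.0, "Q5_K_M": 48.0, "Q8_0": 75.0},
}


def fits_on_gpu(model, quant, vram_gb, headroom_gb=8.0):
    """Return True if the weights plus a headroom budget fit in vram_gb.

    headroom_gb is an assumed allowance for KV cache and CUDA buffers,
    not a measured figure.
    """
    return GGUF_SIZE_GB[model][quant] + headroom_gb <= vram_gb


def largest_quant_that_fits(model, vram_gb, headroom_gb=8.0):
    """Pick the highest-quality quant (Q8_0 > Q5_K_M > Q4_K_M) that fits, or None."""
    for quant in ("Q8_0", "Q5_K_M", "Q4_K_M"):
        if fits_on_gpu(model, quant, vram_gb, headroom_gb):
            return quant
    return None


print(fits_on_gpu("70B", "Q4_K_M", 80))   # True
print(fits_on_gpu("70B", "Q8_0", 80))     # False
print(largest_quant_that_fits("70B", 80)) # Q5_K_M
```

With these assumptions, a 70B model on one 80 GB card lands on Q5_K_M, and Q8_0 is pushed to a multi-GPU setup, which matches the recommendation in the text.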

Pre-quantized models in all these formats are available from the Bartowski and Unsloth namespaces on Hugging Face. For custom quantization with Unsloth Dynamic 2.0's per-layer intelligent quantization, see the GGUF dynamic quantization guide.

Build llama.cpp with CUDA on a Cloud GPU

After provisioning your Spheron instance, SSH in and verify the GPU:

bash
nvidia-smi
# Should show your GPU model, VRAM, CUDA version, and driver

Then build llama.cpp with CUDA support:

bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with CUDA backend
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Verify the server binary
./build/bin/llama-server --version

The build takes 3-8 minutes depending on GPU instance CPU count. The server binary is at ./build/bin/llama-server.

For ROCm (AMD GPU) builds, replace -DGGML_CUDA=ON with -DGGML_HIP=ON (older checkouts used -DGGML_HIPBLAS=ON). This guide focuses on CUDA; the flag names and API surface are otherwise identical.

Core llama-server Flags

| Flag | What it does | Recommended value |
| --- | --- | --- |
| -m | Model path | Path to your .gguf file |
| -ngl | GPU layers to offload | 99 (all layers) |
| --ctx-size | Context length | 8192 default; 32768 for long context |
| --parallel | Concurrent request slots | 8-16 for production |
| --host | Bind address | 0.0.0.0 for external access |
| --port | Server port | 8080 |
| --n-predict | Max tokens per response | -1 (unlimited) |
| --threads | CPU threads for CPU layers | $(nproc) |
| --tensor-split | Multi-GPU split ratios | 1,1 for two equal GPUs |
| --split-mode | Multi-GPU compute mode | layer (default, pipeline) or row (NCCL, true tensor-parallel) |
| --flash-attn | Flash attention | Add for longer contexts |
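If you launch the server from a deployment script rather than by hand, the flags in the table above can be assembled programmatically. The sketch below is one way to do it, using only flags that appear in this guide; `build_llama_server_cmd` is a hypothetical helper name, and the defaults mirror the recommended values rather than anything llama-server enforces.

```python
import shlex


def build_llama_server_cmd(model_path, n_gpu_layers=99, ctx_size=8192, parallel=8,
                           host="0.0.0.0", port=8080, tensor_split=None,
                           binary="./build/bin/llama-server"):
    """Assemble an argv list for llama-server from the flags covered in this guide."""
    cmd = [
        binary,
        "-m", str(model_path),
        "-ngl", str(n_gpu_layers),
        "--host", host,
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        "--parallel", str(parallel),
    ]
    if tensor_split:
        # A list such as [1, 1] becomes "--tensor-split 1,1"
        cmd += ["--tensor-split", ",".join(str(r) for r in tensor_split)]
    return cmd


cmd = build_llama_server_cmd("./models/your-model.gguf", tensor_split=[1, 1])
print(shlex.join(cmd))  # paste-able shell command
```

Returning a list rather than a string means the result can be passed straight to `subprocess.Popen(cmd)` without shell quoting concerns.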

Launching the Server

Download a model first:

bash
pip install huggingface_hub
huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
  --include "Llama-3.3-70B-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

For 7B-13B models:

bash
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

Launch the server:

bash
./build/bin/llama-server \
  -m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 8192 \
  --parallel 8 \
  --flash-attn

You'll see layer loading progress, then HTTP server listening on 0.0.0.0:8080. From that point, the server is ready.
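Loading a 70B model can take a while, so scripts that start the server usually need to wait before sending traffic. The sketch below polls until the server answers, assuming your build exposes GET /health (current llama-server versions do, returning a non-200 status while the model is still loading). `http_probe` and `wait_until_ready` are hypothetical helper names, and the timeout values are arbitrary.

```python
import time
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if GET url answers with HTTP 200, False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False


def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses; return the final result."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage (YOUR_IP is the same placeholder used elsewhere in this guide):
# ready = wait_until_ready(lambda: http_probe("http://YOUR_IP:8080/health"))
```

Passing the probe in as a callable keeps the retry loop independent of the HTTP call, so the same loop can be exercised in tests with a stub.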

Multi-GPU Tensor Split Configuration

For models that exceed a single GPU's VRAM, add --tensor-split:

bash
./build/bin/llama-server \
  -m ./models/your-model.gguf \
  -ngl 99 \
  --tensor-split 1,1 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 8192 \
  --parallel 8

Finding the right ratios. Run nvidia-smi before starting the server:

GPU 0: NVIDIA H100 80GB SXM5 - 81559 MiB total
GPU 1: NVIDIA H100 80GB SXM5 - 81559 MiB total

Two equal 80GB GPUs: --tensor-split 1,1.

Unequal GPUs (80GB + 48GB):

GPU 0: 80GB
GPU 1: 48GB

The ratio is 80:48, which simplifies to 5:3. Use --tensor-split 5,3.
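The ratio arithmetic above is a greatest-common-divisor reduction, which is easy to script when you manage several GPU mixes. This is a minimal sketch; `tensor_split_ratio` is a hypothetical helper, and it assumes VRAM sizes given as whole gigabytes.

```python
from functools import reduce
from math import gcd


def tensor_split_ratio(vram_gb):
    """Reduce per-GPU VRAM sizes (whole GB) to the smallest integer ratio string."""
    sizes = [int(v) for v in vram_gb]
    divisor = reduce(gcd, sizes)
    return ",".join(str(s // divisor) for s in sizes)


print(tensor_split_ratio([80, 80]))  # 1,1
print(tensor_split_ratio([80, 48]))  # 5,3
print(tensor_split_ratio([80, 40]))  # 2,1
```

The returned string can be passed directly as the value of --tensor-split, in the same GPU order that nvidia-smi reports.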

When tensor split is worth it vs. upgrading GPU. Running a 70B Q8_0 model (75 GB) across two H100 80GB GPUs with --tensor-split 1,1 works, but throughput is roughly the same as a single H100 running the same model at Q4_K_M (42 GB). In most cases, a smaller quant on a single GPU is simpler and similarly performant. Tensor split is most useful when you want the quality of Q5_K_M or Q8_0 on a 70B model and you don't have a single GPU with enough VRAM.

Exposing an OpenAI-Compatible Endpoint

Test the endpoint immediately after the server starts:

bash
curl http://YOUR_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'

llama-server supports these OpenAI-compatible endpoints:

  • POST /v1/chat/completions
  • POST /v1/completions
  • POST /v1/embeddings
  • GET /v1/models

Python SDK integration:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_IP:8080/v1",
    api_key="any-string",  # llama-server does not validate the key by default
)

response = client.chat.completions.create(
    model="llama",
    messages=[{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

TypeScript/Node.js:

typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://YOUR_IP:8080/v1",
  apiKey: "any-string",
});

const response = await client.chat.completions.create({
  model: "llama",
  messages: [{ role: "user", content: "Hello" }],
});

For production hardening patterns (auth headers, nginx reverse proxy, health check endpoints), see the self-hosted OpenAI-compatible API guide. It covers these in detail for vLLM, but the nginx and systemd patterns apply unchanged to llama-server.

Running as a Persistent Service with systemd

For production, run llama-server as a managed systemd service so it restarts automatically on crash or reboot:

ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/llama.cpp
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server \
  -m /home/ubuntu/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 8192 \
  --parallel 8 \
  --flash-attn
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

bash
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server

Logs are accessible via journalctl -u llama-server -f. The service restarts within 10 seconds after a crash.

Concurrency Limits: When to Graduate to vLLM

llama.cpp is the right tool until it isn't. Here's a concrete checklist for when to migrate:

Graduate to vLLM when:

  • Queue depth is consistently above 8 requests
  • p95 TTFT (time to first token) exceeds 3 seconds under normal load
  • You need multi-GPU throughput at very high concurrency beyond what --split-mode row provides (vLLM's continuous batching scales further for 50+ users)
  • You need Prometheus /metrics for auto-scaling or alerting
  • You're serving more than 20 concurrent users at steady state

Stay on llama.cpp when:

  • You need GGUF format flexibility with any quantization level
  • You want a single binary with no Python runtime dependency
  • Your team has fewer than 15 concurrent users
  • You need CPU fallback for models that barely don't fit on GPU
  • You're running on mixed GPU hardware where sizes differ

Once you hit the graduation criteria above, see the vLLM production deployment guide or the Ollama vs vLLM comparison for a migration path.
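The measurable items in the graduation checklist can be turned into a small monitoring check. The sketch below uses the thresholds from this section (queue depth above 8, p95 TTFT above 3 seconds, more than 20 concurrent users). Two things are assumptions: "consistently above 8" is interpreted as the median queue depth exceeding 8, and the function names are hypothetical. How you collect the samples is up to your own logging.

```python
from statistics import median


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def graduation_signals(queue_depths, ttft_seconds, concurrent_users):
    """Return the names of the checklist thresholds that are currently exceeded."""
    signals = []
    if median(queue_depths) > 8:   # assumed reading of "consistently above 8"
        signals.append("queue_depth")
    if p95(ttft_seconds) > 3.0:    # p95 time to first token, in seconds
        signals.append("p95_ttft")
    if concurrent_users > 20:      # steady-state concurrent users
        signals.append("concurrent_users")
    return signals


print(graduation_signals([2, 3, 4], [0.5, 0.8, 1.0], 10))  # []
```

An empty list means none of the thresholds is exceeded; one or more entries suggests it is time to benchmark vLLM against your workload.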

Cost: Right-Sizing a Single GPU for llama.cpp

Monthly cost at 720 hours/month:

| Setup | Model | On-demand | Spot | Monthly (on-demand) | Monthly (spot) |
| --- | --- | --- | --- | --- | --- |
| H100 SXM5 80GB | Llama 3.3 70B Q4_K_M | $5.01/hr | $2.91/hr | ~$3,607/mo | ~$2,095/mo |
| H100 PCIe 80GB | Llama 3.3 70B Q4_K_M | $2.01/hr | N/A | ~$1,447/mo | N/A |
| L40S 48GB | Mistral 7B Q4_K_M | ~$1.16/hr | ~$0.61/hr | ~$835/mo | ~$439/mo |
| RTX 5090 32GB | Llama 3.1 8B Q4_K_M | ~$0.92/hr | N/A | ~$662/mo | N/A |

Spot pricing on H100 SXM5 cuts the monthly bill from $3,607 to $2,095 for the same 70B model. The trade-off is that spot instances can be reclaimed with short notice. For development, internal tools, or workloads that can tolerate interruption, spot is a reasonable default. For customer-facing APIs with uptime SLAs, stick to on-demand.
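The monthly figures are the hourly rate multiplied by 720 hours. A short sketch of that arithmetic, useful for re-running the numbers when rates change; the function names are hypothetical and the rates are the ones quoted in this article.

```python
HOURS_PER_MONTH = 720  # same assumption as the cost table


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Monthly bill in dollars for an instance that runs continuously."""
    return hourly_rate * hours


def spot_savings(on_demand_rate, spot_rate, hours=HOURS_PER_MONTH):
    """Dollars saved per month by running on spot instead of on-demand."""
    return monthly_cost(on_demand_rate, hours) - monthly_cost(spot_rate, hours)


print(round(monthly_cost(5.01)))        # 3607
print(round(monthly_cost(2.91)))        # 2095
print(round(spot_savings(5.01, 2.91)))  # 1512
```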

Spheron aggregates compute from 5+ providers, which is why on-demand H100 pricing here is meaningfully lower than single-hyperscaler rates.

llama.cpp server on a right-sized cloud GPU is the lowest-overhead path to a production OpenAI-compatible API for sub-20-user workloads. Spheron provisions bare-metal GPU instances in under 2 minutes with per-minute billing.


Quick Setup Guide

  1. Pick your GPU and provision on Spheron

    Choose a GPU based on model size: RTX 5090 (32GB) for 7B-32B Q4_K_M models at lowest on-demand cost, L40S (48GB) for 7B-34B models with more VRAM headroom, H100 80GB for 70B models. Log in at app.spheron.ai and provision a bare-metal instance. Verify the GPU with nvidia-smi before proceeding.

  2. Build llama.cpp with CUDA support

    Clone the llama.cpp repo and build the CUDA backend: git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc). Verify with ./build/bin/llama-server --version.

  3. Download a GGUF model from Hugging Face

    Use huggingface-cli to download a pre-quantized GGUF file. Example for Llama 3.3 70B Q4_K_M: huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF --include 'Llama-3.3-70B-Instruct-Q4_K_M.gguf' --local-dir ./models. For 7B-14B models, use a Q4_K_M or Q5_K_M quant from the Bartowski or Unsloth namespaces.

  4. Launch llama-server with GPU offloading

    Start the server: ./build/bin/llama-server -m ./models/your-model.gguf -ngl 99 --host 0.0.0.0 --port 8080 --ctx-size 8192 --parallel 8. The -ngl 99 flag offloads all layers to GPU. --parallel 8 allows 8 concurrent request slots. The OpenAI-compatible API is available on port 8080.

  5. Configure multi-GPU tensor split (optional)

    For models that exceed a single GPU's VRAM, add the --tensor-split flag: ./build/bin/llama-server -m ./models/model.gguf -ngl 99 --tensor-split 1,1 --host 0.0.0.0 --port 8080. Ratios in --tensor-split correspond to the relative VRAM allocation across GPUs listed by nvidia-smi. For two 80GB GPUs, 1,1 splits evenly.

  6. Test and integrate with the OpenAI SDK

    Send a test request: curl http://YOUR_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello"}]}'. In Python: set client = openai.OpenAI(base_url='http://YOUR_IP:8080/v1', api_key='any-string') and call client.chat.completions.create() with your normal parameters.

Frequently Asked Questions

How do I deploy llama.cpp server on a cloud GPU?

Provision a CUDA-enabled GPU instance (L40S or H100 recommended), clone the llama.cpp repo, build with cmake -DGGML_CUDA=ON, download a GGUF model from Hugging Face, then run llama-server with -ngl 99 to offload all layers to GPU. The server exposes an OpenAI-compatible API on port 8080 by default.

What is tensor split and when should I use it?

Tensor split (--tensor-split flag) distributes model layers across multiple GPUs proportionally. Use it when a GGUF model is too large to fit on a single GPU. For two equal GPUs, --tensor-split 1,1 splits evenly. For unequal GPUs, match the ratios to each GPU's VRAM (e.g. --tensor-split 2,1 for an 80GB/40GB pair). The default behavior uses --split-mode layer (pipeline parallel): layers are assigned serially to each GPU, so compute does not scale with GPU count, but you unlock bigger models. For true tensor parallelism within llama.cpp, use --split-mode row (build with -DGGML_CUDA_NCCL=ON), which shards weight matrices across GPUs with NCCL reductions and improves throughput on fast interconnects like NVLink.

How many concurrent users can llama-server handle?

llama.cpp server supports parallel slots (--parallel N flag) for concurrent decoding. In practice, 8-16 parallel slots is a reasonable limit on a single H100 before VRAM pressure and KV cache contention degrade latency. Beyond 16-20 concurrent users with sustained load, vLLM's continuous batching engine will outperform llama.cpp in total throughput. Use llama.cpp for teams of up to about 15 concurrent users; migrate to vLLM beyond that threshold.

Is llama-server compatible with the OpenAI API?

Yes. llama-server exposes /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models endpoints that are compatible with the OpenAI Python SDK and any client that supports OpenAI's REST API. Set base_url to http://YOUR_IP:8080/v1 and api_key to any non-empty string. No other code changes are needed.

Which GGUF quantization should I use?

Q4_K_M is the standard production choice: it fits 70B models on a single A100 80GB or H100 80GB, and perplexity loss versus BF16 is roughly 0.15-0.20 points (imperceptible in practice). Q5_K_M is worth the extra VRAM if you have headroom: it closes to within 0.08 perplexity points of BF16. Q8_0 is near-lossless but largely eliminates the size benefit. For 7B-14B models on an L40S or RTX 5090, Q4_K_M fits comfortably and leaves room for large contexts.
