Comparison

Ollama vs vLLM: Which Should You Use to Self-Host LLMs?

Written by Mitrasish, Co-founder · Mar 29, 2026

Tags: Ollama, vLLM, LLM Inference, Self-Hosted LLM, GPU Cloud, AI Deployment, LLM Serving
Ollama and vLLM both let you run LLMs on your own hardware. That's where the similarity ends. Ollama is built for developers who want to get a model running in two minutes, no Docker, no config files. vLLM is built for engineers who need to serve thousands of requests per day with predictable latency.

Picking the wrong tool costs you either time (overengineering a prototype) or reliability (shipping Ollama to production and watching it fall over at 10 concurrent users). This post lays out exactly when each tool makes sense, with benchmark numbers on the same hardware and cloud-init scripts for deploying both on Spheron.

TL;DR

| Comparison | Ollama | vLLM |
| --- | --- | --- |
| Best for | Local dev, prototyping | Production APIs, high concurrency |
| Setup time | ~2 minutes | ~5 minutes |
| Throughput (H100, 32 concurrent) | ~320 tok/s | ~1,450 tok/s |
| OpenAI-compatible API | Yes | Yes |
| Continuous batching | No | Yes |
| GPU memory efficiency | Good (static KV cache) | Excellent (PagedAttention) |
| GGUF/quantization support | Yes (native) | FP8, AWQ, GPTQ recommended; GGUF and BitsAndBytes supported |
| Multi-model serving | Yes | No (one model per server) |

Feature Comparison

| Feature | Ollama | vLLM |
| --- | --- | --- |
| OpenAI API compatibility | Yes (/v1/chat/completions) | Yes (full drop-in) |
| Multi-model serving | Yes (load on demand) | No (one model per server process) |
| Quantization formats | GGUF (Q2–Q8, FP16) | FP8, AWQ, GPTQ, FP16 (recommended); GGUF, BitsAndBytes also supported |
| GPU memory management | Static KV cache per request | PagedAttention (non-contiguous pages) |
| Concurrent requests | Sequential queue | Continuous batching |
| Setup complexity | Install script, no Docker | Docker + HF token |
| Supported backends | llama.cpp, Metal, CUDA | CUDA, ROCm (H100/MI300X optimized) |
| Metrics endpoint | None | Prometheus /metrics |
| Streaming support | Yes | Yes |
| Model format | GGUF | Hugging Face safetensors |
| Apple Silicon support | Yes (Metal) | No |
| Multi-GPU tensor parallelism | No | Yes |

The most important differences are the last three rows: model format, Apple Silicon, and multi-GPU. If you're on a Mac, Ollama is your only option. If you need to spread a 70B model across multiple H100s, vLLM is your only option. For everything in between, the choice comes down to how many concurrent users you're serving.

When to Use Ollama

Ollama fits any situation where you're running one model, talking to it yourself (or with a handful of people), and speed of setup matters more than throughput.

Prototyping. You want to test whether a model is good enough for your use case before building anything. ollama run llama3.1:8b gets you a working chat interface in 90 seconds. No Hugging Face account, no Docker, no token setup.

Local development. You're building an app that calls an LLM API and want to iterate fast without paying for tokens. Ollama's OpenAI-compatible endpoint at localhost:11434/v1 works as a drop-in for the OpenAI SDK. Change one line in your app and your API calls hit your local GPU.
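To make the drop-in claim concrete, here is a minimal sketch that hits Ollama's OpenAI-compatible endpoint using only the standard library, so you can see exactly what the request body looks like. It assumes Ollama is running locally with llama3.1:8b already pulled; the helper names are mine, not part of either tool.

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def build_chat_request(model, messages, temperature=0.7):
    """Build the JSON body for a /v1/chat/completions call."""
    return {"model": model, "messages": messages, "temperature": temperature}

def chat(base_url, payload):
    """POST the payload to the chat completions endpoint, return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    payload = build_chat_request("llama3.1:8b", [{"role": "user", "content": "Say hi"}])
    print(chat(OLLAMA_BASE, payload))
```

The same payload works unchanged against vLLM or the OpenAI API; only the base URL and model name differ.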

Testing multiple models. You want to compare Llama 3.1, Mistral, and Qwen in the same session. Ollama loads models on demand and keeps recently used models in memory. ollama run mistral while llama3.1 is still loaded will work if you have the VRAM headroom.

Internal tools. You're building a tool for a handful of people inside your organization. Five simultaneous users is fine with Ollama. It queues requests and processes them one at a time, which works well when load is predictable and low.

Apple Silicon. Ollama uses Metal on M1/M2/M3/M4 Macs. vLLM does not support Metal. If you're developing on a Mac and don't have an NVIDIA GPU, Ollama is your only self-hosted option.

See our full Ollama setup guide for installation details, quantization options, and GPU acceleration configuration.

When to Use vLLM

vLLM fits any situation where you're serving a model to real users under real load, and you need consistent performance.

Production inference APIs. You're building a product that calls an LLM backend on every user request. At 20 concurrent users, Ollama queues 19 of them. vLLM's continuous batching processes all 20 in the same forward pass. The difference in p99 latency is substantial.
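A back-of-envelope model shows why the queue hurts. Assume every request takes the same generation time and, under continuous batching, each extra in-flight request costs a small fixed slowdown. These numbers are illustrative assumptions, not benchmark results:

```python
def sequential_worst_case(n, gen_time_s):
    """Worst-case latency with a sequential queue: the last request
    waits for every request ahead of it to finish."""
    return n * gen_time_s

def batched_worst_case(n, gen_time_s, slowdown_per_peer=0.05):
    """Worst-case latency with continuous batching, under a simple linear
    model: each extra in-flight request costs 5% of one generation."""
    return gen_time_s * (1 + slowdown_per_peer * (n - 1))

n, gen = 20, 0.5  # 20 concurrent users, 0.5 s of generation each
print(f"sequential worst case: {sequential_worst_case(n, gen):.2f} s")  # 10.00 s
print(f"batched worst case:    {batched_worst_case(n, gen):.2f} s")     # ~0.97 s
```

Even with a generous per-peer slowdown, the batched worst case stays near a single request's latency, while the sequential worst case grows linearly with the queue depth.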

High concurrency. vLLM's PagedAttention manages KV cache in non-contiguous memory pages, similar to how an OS manages virtual memory. This means it can handle far more concurrent sequences than Ollama before running out of VRAM. On an H100 80GB serving Llama 3.1 8B, vLLM can sustain 180+ concurrent FP16 requests before OOM. Ollama hits OOM around 40.
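The concurrency ceiling falls out of simple KV-cache arithmetic. The sketch below computes the per-token cache cost and a best-case concurrency bound, which paged allocation approaches and static per-request allocation wastes. The model dimensions are assumptions for a Llama-3.1-8B-like config (32 layers, 8 KV heads with GQA, head dim 128), not values read from vLLM:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """K and V caches together store 2 * kv_heads * head_dim values
    per token per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent(free_vram_gib, seq_len,
                   layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Upper bound on concurrent sequences if the KV cache is packed
    perfectly -- what PagedAttention approaches. Static per-request
    allocation reserves full-length buffers and lands far below this."""
    per_seq = kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes) * seq_len
    return int(free_vram_gib * 1024**3 // per_seq)

print(kv_bytes_per_token(32, 8, 128, 2))  # 131072 bytes = 128 KiB per token (FP16)
print(max_concurrent(60, 2048))           # ~240 sequences in 60 GiB of cache
```

With ~60 GiB of an H100's 80 GiB left for cache after weights, the bound lands in the same ballpark as the measured 180+ concurrent FP16 requests; halving dtype_bytes for FP8 roughly doubles it.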

SLA-bound latency. If you have a p95 latency target, vLLM's continuous batching and chunked prefill give you far more control over latency distribution. Ollama's sequential queue means a single slow request delays all subsequent ones.

Cost-efficient scale. Continuous batching keeps each GPU doing useful work on more tokens at any given moment. vLLM's throughput per dollar is meaningfully better than Ollama's at scale, especially on H100 with FP8 enabled. For broader strategies on reducing GPU spend, see the GPU cost optimization playbook.

Multi-GPU and large models. Running Llama 3.3 70B? You need either FP8 on a single H100 (tight) or tensor parallelism across two H100s. vLLM handles both. Ollama does not support multi-GPU tensor parallelism.

For multi-GPU vLLM setups, see our vLLM production deployment guide.

Same Hardware Benchmarks: H100 80GB

These numbers were measured on Spheron H100 SXM5 80GB instances at $2.40/hr on-demand as of March 25, 2026. Ollama uses Q4_K_M GGUF; vLLM uses FP16 unless noted. Prompt: 100 tokens in, 200 tokens out.

| Metric | Ollama (llama3.1:8b Q4_K_M) | vLLM (Llama-3.1-8B-Instruct FP16) |
| --- | --- | --- |
| Single-user throughput | ~420 tok/s | ~510 tok/s |
| 8 concurrent requests | ~310 tok/s total | ~1,100 tok/s total |
| 32 concurrent requests | ~320 tok/s total | ~1,450 tok/s total |
| Time to first token (single) | ~35ms | ~28ms |
| Time to first token (32 concurrent) | ~290ms | ~95ms |
| VRAM usage (8B model + KV cache) | ~9.2 GB | ~17.5 GB (FP16) / ~9.8 GB (FP8) |
| Max concurrent before OOM (80GB) | ~40 | ~180 (FP16) / ~350+ (FP8) |

The single-user numbers are close. At one concurrent request, both tools are fast enough for interactive use. The gap opens at 8+ concurrent requests, where Ollama's sequential queue means total throughput barely moves while vLLM scales proportionally.

Note that Ollama uses Q4_K_M GGUF (4-bit quantization), which is why its VRAM usage is lower at single-user load. vLLM in FP16 uses more VRAM but handles concurrent load far better. With vLLM FP8 enabled, VRAM drops to roughly the same level as Ollama's GGUF, while throughput under concurrent load stays 3-4x higher.

For a full GPU-by-GPU inference breakdown, see Best GPU for AI Inference 2026.

Pricing fluctuates with GPU availability. The prices above were captured on 25 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

One-Click Setup on Spheron

Both scripts work as cloud-init user-data on Spheron GPU instances. Paste either script into the user-data field when launching a new instance, and the tool will be ready by the time you SSH in.

Ollama Cloud-Init

```bash
#!/bin/bash
# Ollama cloud-init for Spheron GPU instances
curl -fsSL https://ollama.com/install.sh | sh

# Bind to all interfaces so you can reach it remotely
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF

systemctl daemon-reload
systemctl enable ollama
systemctl restart ollama

# Pull a model (change as needed)
sleep 5
ollama pull llama3.1:8b
```

The OLLAMA_HOST=0.0.0.0 override is essential if you're accessing Ollama from outside the instance. Without it, Ollama binds to localhost only and remote API calls will fail.

vLLM Cloud-Init

```bash
#!/bin/bash
# vLLM cloud-init for Spheron GPU instances
apt-get update -y
apt-get install -y docker.io

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update -y
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Launch vLLM
# For gated models (like Llama 3.1), add: -e HUGGING_FACE_HUB_TOKEN=hf_your_token_here
# To avoid auth issues, you can use an openly licensed model like:
# mistralai/Mistral-7B-Instruct-v0.3
docker run -d \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256
```

Two things to know about this script. First, meta-llama/Llama-3.1-8B-Instruct is a gated model. You need to accept Meta's license on Hugging Face and pass your token via -e HUGGING_FACE_HUB_TOKEN=hf_.... If you want to skip the auth step, replace the model with mistralai/Mistral-7B-Instruct-v0.3, which is openly licensed. Second, --ipc=host is required. Removing it will cause cryptic CUDA errors under load.

For detailed Spheron instance setup, see the Ollama quick guide and vLLM server guide in our docs.

Migrating from Ollama to vLLM

Both tools expose an OpenAI-compatible API, so the migration is mostly a configuration change, not a code rewrite. For a full walkthrough of running a self-hosted OpenAI-compatible endpoint on Spheron, see Self-hosted OpenAI-compatible API with vLLM. Here's what actually changes.

1. Model Format

This is the only non-trivial part. Ollama uses GGUF, a format specific to llama.cpp with quantization baked in. vLLM uses Hugging Face safetensors format. You cannot point vLLM at a .gguf file.

Find the equivalent Hugging Face model for whatever you're running in Ollama:

| Ollama model name | Hugging Face equivalent |
| --- | --- |
| llama3.1:8b | meta-llama/Llama-3.1-8B-Instruct |
| llama3.1:70b | meta-llama/Llama-3.1-70B-Instruct |
| mistral | mistralai/Mistral-7B-Instruct-v0.3 |
| qwen2.5:7b | Qwen/Qwen2.5-7B-Instruct |
| gemma2:9b | google/gemma-2-9b-it |

vLLM does support FP8, AWQ, and GPTQ quantization via Hugging Face, so you can still run quantized models. You just can't use GGUF files. For most 7B-13B models on an H100, FP16 fits comfortably and gives you full quality with no quantization artifacts.
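If you're automating the cutover, the mapping above is small enough to encode directly. This is just a lookup-table sketch of the table in this section; the function name is mine:

```python
# Ollama tag -> Hugging Face repo, taken from the mapping table above
OLLAMA_TO_HF = {
    "llama3.1:8b": "meta-llama/Llama-3.1-8B-Instruct",
    "llama3.1:70b": "meta-llama/Llama-3.1-70B-Instruct",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.3",
    "qwen2.5:7b": "Qwen/Qwen2.5-7B-Instruct",
    "gemma2:9b": "google/gemma-2-9b-it",
}

def to_hf_model(ollama_name: str) -> str:
    """Translate an Ollama model tag to its Hugging Face equivalent,
    failing loudly for tags we haven't mapped."""
    try:
        return OLLAMA_TO_HF[ollama_name]
    except KeyError:
        raise ValueError(f"No known HF equivalent for {ollama_name!r}") from None

print(to_hf_model("llama3.1:8b"))  # meta-llama/Llama-3.1-8B-Instruct
```

Failing loudly on unknown tags matters here: silently passing an Ollama-style name through to vLLM produces a confusing Hugging Face download error at server startup.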

2. API Base URL

Change one line in your application:

```python
from openai import OpenAI

# Before (Ollama)
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# After (vLLM)
client = OpenAI(base_url="http://your-instance-ip:8000/v1", api_key="token")
```

Everything else stays the same: system prompts, user messages, temperature, max_tokens. The /v1/chat/completions endpoint behaves identically.

3. Model Name in Requests

Ollama uses short names like llama3.1:8b. vLLM uses the full Hugging Face path. Update the model field in your API calls:

```python
# Before
response = client.chat.completions.create(model="llama3.1:8b", ...)

# After
response = client.chat.completions.create(model="meta-llama/Llama-3.1-8B-Instruct", ...)
```

4. Concurrency Settings

In Ollama, you don't configure concurrency. In vLLM, --max-num-seqs controls how many requests can be in-flight at once. Start at 256 for a single H100 with an 8B model and adjust based on your vllm:kv_cache_usage_perc metric. If you're consistently above 90% KV cache utilization, either lower --max-num-seqs or reduce --max-model-len.
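Watching that metric is easy to script, since Prometheus exposition is plain text. Here's a minimal sketch that scrapes the server's /metrics endpoint and pulls out the KV cache gauge named in this section; it assumes the metric name matches what your vLLM version exports (names have varied across releases, so verify against your own /metrics output):

```python
import urllib.request

def parse_metric(metrics_text, name):
    """Pull a single gauge value out of Prometheus text-format output."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    return None  # metric not present

def kv_cache_usage(base_url="http://localhost:8000"):
    """Fetch vLLM's /metrics endpoint and return KV cache utilization (0-1)."""
    with urllib.request.urlopen(f"{base_url}/metrics") as resp:
        return parse_metric(resp.read().decode(), "vllm:kv_cache_usage_perc")

# Abridged example of the Prometheus text format being parsed:
sample = 'vllm:kv_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.87'
print(parse_metric(sample, "vllm:kv_cache_usage_perc"))  # 0.87
```

Polling this in a loop while ramping load is a quick way to find the --max-num-seqs value where you cross the 90% threshold.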

Which One Should You Use?

| Situation | Use |
| --- | --- |
| Building a prototype or internal tool | Ollama |
| Running local dev without Docker | Ollama |
| Testing multiple models in one session | Ollama |
| Serving 5+ concurrent users | vLLM |
| Need p99 latency SLA | vLLM |
| Cost-optimizing production inference | vLLM |
| Running on Apple Silicon | Ollama |
| Multi-GPU tensor parallelism | vLLM |

The practical rule: start with Ollama. If you're hitting concurrent request limits or need Prometheus metrics for your SLA dashboards, switch to vLLM. The migration is straightforward, and the throughput improvement at scale justifies the added complexity.


Both tools run well on Spheron GPU cloud. Spin up an H100 for vLLM production serving or any 8GB+ GPU for Ollama prototyping, all on-demand with per-minute billing.

Rent H100 for vLLM → | View all GPU pricing →

Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.