Running LLMs locally means zero API costs, zero latency to a cloud endpoint, and complete data privacy. No tokens leave your machine. No rate limits. No vendor lock-in.
Ollama makes this practical. It wraps llama.cpp, a highly optimized CPU/GPU inference engine for quantized models, in a simple CLI and REST API. You download a model with one command, run it with another, and get interactive chat speeds on consumer hardware.
This guide covers everything you need to run LLMs locally with Ollama: hardware requirements, installation, model selection, GPU acceleration, quantization, performance tuning, API integration, and when to scale to cloud GPUs.
Hardware Requirements
Local LLM performance depends primarily on VRAM (for GPU inference) or RAM (for CPU inference). The model must fit entirely in memory for acceptable speeds.
Minimum Requirements by Model Size
| Model Size | Min RAM (CPU) | Min VRAM (GPU) | Example Models |
|---|---|---|---|
| 1B–3B | 4 GB | 2 GB | Phi-3 Mini, Gemma 2B, TinyLlama |
| 7B–8B | 8 GB | 6 GB | Llama 3.1 8B, Mistral 7B, Gemma 7B |
| 13B | 16 GB | 10 GB | Llama 2 13B, CodeLlama 13B |
| 20B–34B | 32 GB | 16 GB | CodeLlama 34B, Yi-34B |
| 70B | 64 GB | 40 GB+ | Llama 2 70B, Llama 3.1 70B |
These are approximate requirements for Q4_K_M quantization (4-bit), which is the default Ollama format. FP16 models require roughly 4x the VRAM.
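As a sanity check, the table's figures can be approximated from first principles: memory ≈ parameters × bits per weight ÷ 8, plus runtime overhead. A minimal sketch, where the ~4.5 effective bits for Q4_K_M and the 20% overhead factor are rough assumptions, not Ollama internals:

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Back-of-envelope memory need: weight bytes plus ~20% for
    KV cache, activations, and runtime buffers (assumed factor)."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

print(round(estimate_memory_gb(8, 4.5), 1))   # 8B model at ~4.5 bpw (Q4_K_M)
print(round(estimate_memory_gb(8, 16), 1))    # same model at FP16
```

This reproduces the ~4x gap between Q4 and FP16 noted above, and lands near the 6 GB VRAM figure for 7B–8B models.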
GPU vs CPU Inference Speed
| Configuration | Llama 3.1 8B (Q4) | Llama 2 13B (Q4) | Llama 2 70B (Q4) |
|---|---|---|---|
| RTX 4090 (24 GB) | 80–120 tok/s | 40–60 tok/s | CPU offload (~5 tok/s) |
| RTX 3090 (24 GB) | 50–70 tok/s | 30–45 tok/s | CPU offload (~3 tok/s) |
| RTX 4060 Ti (16 GB) | 40–60 tok/s | 20–30 tok/s | Does not fit |
| Apple M3 Max (48 GB unified) | 30–45 tok/s | 20–30 tok/s | 8–12 tok/s |
| CPU only (Ryzen 9 7950X) | 8–15 tok/s | 5–10 tok/s | 1–3 tok/s |
GPU inference is 5–10x faster than CPU. If you have an NVIDIA GPU with 8+ GB VRAM, GPU acceleration makes the difference between unusable and interactive.
Installation
macOS
Download the installer from ollama.com/download or install via Homebrew:
brew install ollama
Ollama automatically uses the Apple Silicon GPU (Metal) on M1/M2/M3/M4 Macs.
Linux
curl -fsSL https://ollama.com/install.sh | sh
For NVIDIA GPU support, ensure CUDA drivers are installed. Ollama detects NVIDIA GPUs automatically.
Windows
Download the installer from ollama.com/download. Ollama supports NVIDIA GPUs on Windows via CUDA.
Verify Installation
ollama --version
Running Your First Model
Download and run a model with a single command:
ollama run llama3.1
This downloads the Llama 3.1 8B model (Q4_K_M quantization, ~4.7 GB) and starts an interactive chat session. The first run takes a few minutes for the download; subsequent runs start in seconds.
To pull a model without starting chat:
ollama pull llama3.1
Essential Commands
# List installed models
ollama list
# Show model details (size, quantization, parameters)
ollama show llama3.1
# Remove a model
ollama rm llama3.1
# Run a specific quantization variant
ollama run llama3.1:70b-instruct-q4_K_M
# Set a system prompt inside an interactive session
ollama run llama3.1
>>> /set system "You are a Python expert. Respond with code only."
Choosing the Right Model
Ollama's model library contains hundreds of models. Here are the best options by use case:
Recommended Models
| Model | Size | Best For | Speed (RTX 4090) |
|---|---|---|---|
| llama3.1:8b | 4.7 GB | General chat, writing, reasoning | 80–120 tok/s |
| mistral | 4.1 GB | Fast general-purpose assistant | 85–130 tok/s |
| codellama:13b | 7.4 GB | Code generation and review | 40–60 tok/s |
| llama3.1:70b | 40 GB | Complex reasoning, analysis | 8–12 tok/s |
| phi3:mini | 2.2 GB | Lightweight, fast responses | 100–150 tok/s |
| mixtral:8x7b | 26 GB | Multi-task, strong reasoning | 20–35 tok/s |
| gemma2:9b | 5.4 GB | Google's efficient model | 60–90 tok/s |
| deepseek-coder-v2:16b | 8.9 GB | Advanced code generation | 35–50 tok/s |
| qwen2.5:7b | 4.4 GB | Multilingual, strong reasoning | 70–110 tok/s |
For most users, llama3.1:8b or mistral provides the best balance of quality and speed. If you have 24+ GB VRAM, mixtral:8x7b offers significantly better reasoning at interactive speeds.
Understanding Quantization
Ollama models use GGUF quantization, a format that compresses model weights to reduce memory usage while preserving quality. The quantization level determines the tradeoff between size, speed, and quality.
| Quantization | Bits per Weight | Size (7B model) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.8 GB | Noticeably degraded | Fastest |
| Q4_K_M | 4-bit | ~4.1 GB | Near-original quality | Fast (default) |
| Q5_K_M | 5-bit | ~4.8 GB | Very close to original | Moderate |
| Q6_K | 6-bit | ~5.5 GB | Minimal quality loss | Slower |
| Q8_0 | 8-bit | ~7.2 GB | Near-lossless | Slowest quantized |
| FP16 | 16-bit | ~14 GB | Full precision | Requires most VRAM |
Q4_K_M is the sweet spot for most users: it preserves 95%+ of model quality while cutting VRAM usage by roughly 4x compared to FP16. For code generation or tasks requiring high precision, Q5_K_M or Q6_K is worth the extra memory.
To run a specific quantization:
ollama run llama3.1:8b-instruct-q5_K_M
GPU Acceleration and Performance Tuning
Verify GPU Detection
ollama ps
This shows running models and whether they're using the GPU. If your NVIDIA GPU isn't detected:
# Check CUDA installation
nvidia-smi
# Verify Ollama sees the GPU
OLLAMA_DEBUG=1 ollama run llama3.1
GPU Layer Offloading
For models that don't fully fit in VRAM, Ollama automatically splits layers between GPU and CPU; the more layers on the GPU, the faster the inference. You can control this in a Modelfile:
FROM llama3.1
PARAMETER num_gpu 35
Context Length Configuration
Longer context windows use more memory. The default is typically 2048–4096 tokens. To increase:
ollama run llama3.1
>>> /set parameter num_ctx 8192
Each doubling of context length roughly doubles KV cache memory usage. For a 7B model at Q4:
| Context Length | KV Cache Memory | Total VRAM (approx) |
|---|---|---|
| 2,048 | ~0.5 GB | ~5 GB |
| 4,096 | ~1 GB | ~5.5 GB |
| 8,192 | ~2 GB | ~6.5 GB |
| 16,384 | ~4 GB | ~8.5 GB |
| 32,768 | ~8 GB | ~12.5 GB |
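These figures are ballpark; the exact KV cache size follows from the model architecture: two tensors (K and V) × layers × KV heads × head dimension × context length × bytes per element. A sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dim 128) with an fp16 cache; note that GQA models cache far less per token than older multi-head models, so real numbers vary around the table above:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: K and V tensors for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

# Llama 3.1 8B (GQA: only 8 KV heads) at an 8,192-token context
print(round(kv_cache_gb(32, 8, 128, 8192), 2))
```

Doubling `context_len` doubles the result, which is the linear growth the table shows.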
Memory Management
If you run out of VRAM, Ollama will fall back to CPU for some layers, significantly slowing inference. To optimize:
- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Reduce context length if you don't need long conversations
- Close other GPU-consuming applications
- Consider a smaller model variant
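The first suggestion can be automated: given a VRAM budget, pick the highest-quality quantization whose weights still fit. In this sketch, the effective bits-per-weight values are rough numbers inferred from typical GGUF file sizes (K-quants store scale metadata, so effective bpw exceeds the nominal bit count), and the 1.5 GB reserve for KV cache and runtime buffers is an assumption:

```python
# Approximate effective bits per weight for common GGUF quantizations
QUANT_BITS = {"q4_K_M": 4.8, "q5_K_M": 5.7, "q6_K": 6.6, "q8_0": 8.5}

def best_quant(params_billion, vram_gb, reserve_gb=1.5):
    """Highest-quality quantization whose weights fit in VRAM,
    keeping a reserve for KV cache and buffers. Returns None if
    even the smallest listed quantization does not fit."""
    budget_gigabits = (vram_gb - reserve_gb) * 8
    fitting = {q: b for q, b in QUANT_BITS.items()
               if params_billion * b <= budget_gigabits}
    return max(fitting, key=fitting.get) if fitting else None

print(best_quant(8, 8))    # 8B model on an 8 GB card
print(best_quant(70, 24))  # 70B model on a 24 GB card
```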
API Integration
Ollama runs a local REST API on port 11434. This makes it easy to integrate into applications.
REST API
# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quicksort in one paragraph",
  "stream": false
}'
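With "stream": true (the default when the field is omitted), the API returns one JSON object per line rather than a single response; each chunk carries a "response" fragment and a "done" flag. A minimal sketch of reassembling the text, using simulated chunks so it runs without a server:

```python
import json

def collect_stream(lines):
    """Join the 'response' fragments from newline-delimited JSON
    chunks in the shape /api/generate emits when streaming."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Simulated stream (real final chunks also carry timing stats)
chunks = [
    '{"model":"llama3.1","response":"Quick","done":false}',
    '{"model":"llama3.1","response":"sort is...","done":false}',
    '{"model":"llama3.1","response":"","done":true}',
]
print(collect_stream(chunks))  # Quicksort is...
```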
# Chat with message history
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "stream": false
}'
Python Integration
pip install ollama
import ollama
# Simple generation
response = ollama.generate(model="llama3.1", prompt="Write a haiku about coding")
print(response["response"])
# Chat with history
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to check if a number is prime."},
]
response = ollama.chat(model="llama3.1", messages=messages)
print(response["message"]["content"])
LangChain Integration
pip install langchain-ollama
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
model = OllamaLLM(model="llama3.1")
prompt = ChatPromptTemplate.from_template("Explain {topic} in simple terms.")
chain = prompt | model
result = chain.invoke({"topic": "quantum computing"})
print(result)
Building a Simple Chatbot
import ollama
def chat():
    messages = []
    print("Chat with Llama 3.1 (type 'exit' to quit)")
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "exit":
            break
        messages.append({"role": "user", "content": user_input})
        response = ollama.chat(model="llama3.1", messages=messages)
        assistant_message = response["message"]["content"]
        messages.append({"role": "assistant", "content": assistant_message})
        print(f"\nAI: {assistant_message}")

chat()
Custom Models with Modelfiles
Ollama supports custom model configurations via Modelfiles, similar to Dockerfiles for LLMs:
FROM llama3.1
# Set system prompt
SYSTEM You are a senior Python developer. Always include type hints and docstrings.
# Configure parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
Build and run your custom model:
ollama create python-expert -f Modelfile
ollama run python-expert
This is useful for creating task-specific assistants with fixed system prompts and tuned parameters.
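If you maintain several such assistants, the Modelfile can be generated rather than hand-written. A sketch that renders one from a template and optionally registers it by shelling out to the ollama CLI; render_modelfile and create_assistant are illustrative helpers, not part of any Ollama API:

```python
import subprocess
import tempfile
from pathlib import Path

def render_modelfile(base, system, temperature=0.3, num_ctx=4096):
    """Build Modelfile text for a task-specific assistant."""
    return (f"FROM {base}\n"
            f"SYSTEM {system}\n"
            f"PARAMETER temperature {temperature}\n"
            f"PARAMETER num_ctx {num_ctx}\n")

def create_assistant(name, base, system, **params):
    """Write the Modelfile to a temp dir and run `ollama create`."""
    path = Path(tempfile.mkdtemp()) / "Modelfile"
    path.write_text(render_modelfile(base, system, **params))
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)

print(render_modelfile("llama3.1", "You are a senior SQL reviewer."))
```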
When to Scale to Cloud GPUs
Ollama on local hardware works well for development, prototyping, and personal use, but local GPUs have limitations:
| Limitation | Local GPU | Cloud GPU (Spheron) |
|---|---|---|
| VRAM | 24 GB (RTX 4090) | Up to 141 GB (H200) |
| Largest model | 13B (comfortable) | 70B+ (single GPU) |
| Multi-GPU | PCIe bottleneck | NVLink at 600–900 GB/s |
| Uptime | Personal machine | 24/7 dedicated server |
| Scaling | Single GPU | 1–8 GPU clusters |
When your models outgrow 24 GB, require 24/7 uptime, or need multi-GPU parallelism, Spheron provides cloud GPU instances starting at $0.55/hr with pre-configured CUDA environments and full root access.
Explore GPU options on Spheron →
Frequently Asked Questions
How much VRAM do I need to run Llama 3.1 8B?
The Q4_K_M quantized version (Ollama's default) requires approximately 5–6 GB of VRAM including KV cache. Any GPU with 8 GB VRAM (RTX 3060, RTX 4060, etc.) can run it comfortably. On CPU, you need at least 8 GB of RAM, but inference will be 5–10x slower.
Can I run Ollama on Apple Silicon Macs?
Yes. Ollama automatically uses Metal GPU acceleration on M1/M2/M3/M4 Macs. Apple Silicon's unified memory architecture means the GPU can access all system RAM, so a Mac with 32 GB unified memory can run models that wouldn't fit on a 24 GB discrete GPU. Performance is roughly 60–70% of an equivalent NVIDIA GPU.
What's the difference between Ollama and llama.cpp?
Ollama is a user-friendly wrapper around llama.cpp. It handles model downloading, GGUF format management, and GPU detection, and it provides a REST API; with raw llama.cpp you would configure all of this manually. If you want maximum control and custom builds, use llama.cpp directly. For ease of use, Ollama is the better choice.
Can I run multiple models simultaneously?
Yes. Ollama loads models on demand and keeps them in memory. You can run multiple models by making API calls to different model names. However, each loaded model consumes VRAM, so running two 7B models simultaneously requires roughly 10–12 GB of VRAM.
How does quantization affect output quality?
Q4_K_M (4-bit) preserves approximately 95% of the original model's quality for most tasks. You may notice slight degradation in complex reasoning, math, or code generation compared to FP16. Q5_K_M and Q6_K offer better quality at the cost of more VRAM. For most conversational and writing tasks, Q4_K_M is indistinguishable from the full-precision model.
Is Ollama suitable for production use?
Ollama is excellent for development, testing, and personal use. For production serving with multiple concurrent users, SLA requirements, and load balancing, consider dedicated inference servers using vLLM, TensorRT-LLM, or Triton Inference Server on cloud GPUs. Ollama's REST API can serve light production loads but lacks features like batching, auto-scaling, and health monitoring.