Running LLMs locally means zero API costs, zero latency to a cloud endpoint, and complete data privacy. No tokens leave your machine. No rate limits. No vendor lock-in.
Ollama makes this practical. It wraps llama.cpp, a highly optimized CPU/GPU inference engine for quantized models, in a simple CLI and REST API. You download a model with one command, run it with another, and get interactive chat speeds on consumer hardware.
This guide covers everything you need to run LLMs locally with Ollama: hardware requirements, installation, model selection, GPU acceleration, quantization, performance tuning, API integration, and when to scale to cloud GPUs.
Hardware Requirements
Local LLM performance depends primarily on VRAM (for GPU inference) or RAM (for CPU inference). The model must fit entirely in memory for acceptable speeds.
Minimum Requirements by Model Size
| Model Size | Min RAM (CPU) | Min VRAM (GPU) | Example Models |
|---|---|---|---|
| 1B–3B | 4 GB | 2 GB | Phi-3 Mini, Gemma 2B, TinyLlama |
| 7B–8B | 8 GB | 6 GB | Llama 3.1 8B, Mistral 7B, Gemma 7B |
| 13B | 16 GB | 10 GB | Llama 2 13B, CodeLlama 13B |
| 20B–34B | 32 GB | 16 GB | CodeLlama 34B, Yi-34B |
| 70B | 64 GB | 40 GB+ | Llama 2 70B, Llama 3.1 70B |
These are approximate requirements for Q4_K_M quantization (4-bit), which is the default Ollama format. FP16 models require roughly 4x the VRAM.
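As a sanity check, the table's figures can be approximated from first principles: memory ≈ parameters × bits per weight ÷ 8, plus runtime overhead. A minimal sketch, where the ~4.5 effective bits for Q4_K_M and the 20% overhead factor are rough assumptions, not Ollama internals:

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Back-of-envelope memory need: weight bytes plus ~20% for
    KV cache, activations, and runtime buffers (assumed factor)."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

print(round(estimate_memory_gb(8, 4.5), 1))   # 8B model at ~4.5 bpw (Q4_K_M)
print(round(estimate_memory_gb(8, 16), 1))    # same model at FP16
```

This reproduces the ~4x gap between Q4 and FP16 noted above, and lands near the 6 GB VRAM figure for 7B–8B models.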
GPU vs CPU Inference Speed
| Configuration | Llama 3.1 8B (Q4) | Llama 2 13B (Q4) | Llama 2 70B (Q4) |
|---|---|---|---|
| RTX 4090 (24 GB) | 80–120 tok/s | 40–60 tok/s | CPU offload (~5 tok/s) |
| RTX 3090 (24 GB) | 50–70 tok/s | 30–45 tok/s | CPU offload (~3 tok/s) |
| RTX 4060 Ti (16 GB) | 40–60 tok/s | 20–30 tok/s | Does not fit |
| Apple M3 Max (48 GB unified) | 30–45 tok/s | 20–30 tok/s | 8–12 tok/s |
| CPU only (Ryzen 9 7950X) | 8–15 tok/s | 5–10 tok/s | 1–3 tok/s |
GPU inference is 5–10x faster than CPU. If you have an NVIDIA GPU with 8+ GB VRAM, GPU acceleration makes the difference between unusable and interactive.
Installation
macOS
Download the installer from ollama.com/download or install via Homebrew:
brew install ollama
Ollama automatically uses the Apple Silicon GPU (Metal) on M1/M2/M3/M4 Macs.
Linux
curl -fsSL https://ollama.com/install.sh | sh
For NVIDIA GPU support, ensure CUDA drivers are installed. Ollama detects NVIDIA GPUs automatically.
Windows
Download the installer from ollama.com/download. Ollama supports NVIDIA GPUs on Windows via CUDA.
Verify Installation
ollama --version
Running Your First Model
Download and run a model with a single command:
ollama run llama3.1
This downloads the Llama 3.1 8B model (Q4_K_M quantization, ~4.7 GB) and starts an interactive chat session. The first run takes a few minutes for the download; subsequent runs start in seconds.
To pull a model without starting chat:
ollama pull llama3.1
Essential Commands
# List installed models
ollama list
# Show model details (size, quantization, parameters)
ollama show llama3.1
# Remove a model
ollama rm llama3.1
# Run a specific quantization variant
ollama run llama3.1:70b-instruct-q4_K_M
# Set a system prompt inside an interactive session
ollama run llama3.1
>>> /set system "You are a Python expert. Respond with code only."
Choosing the Right Model
Ollama's model library contains hundreds of models. Here are the best options by use case:
Recommended Models
| Model | Size | Best For | Speed (RTX 4090) |
|---|---|---|---|
| llama3.1:8b | 4.7 GB | General chat, writing, reasoning | 80–120 tok/s |
| mistral | 4.1 GB | Fast general-purpose assistant | 85–130 tok/s |
| codellama:13b | 7.4 GB | Code generation and review | 40–60 tok/s |
| llama3.1:70b | 40 GB | Complex reasoning, analysis | 8–12 tok/s |
| phi3:mini | 2.2 GB | Lightweight, fast responses | 100–150 tok/s |
| mixtral:8x7b | 26 GB | Multi-task, strong reasoning | 20–35 tok/s |
| gemma2:9b | 5.4 GB | Google's efficient model | 60–90 tok/s |
| deepseek-coder-v2:16b | 8.9 GB | Advanced code generation | 35–50 tok/s |
| qwen2.5:7b | 4.4 GB | Multilingual, strong reasoning | 70–110 tok/s |
For most users, llama3.1:8b or mistral provides the best balance of quality and speed. If you have 24+ GB VRAM, mixtral:8x7b offers significantly better reasoning at interactive speeds.
Understanding Quantization
Ollama models use GGUF quantization, a format that compresses model weights to reduce memory usage while preserving quality. The quantization level determines the tradeoff between size, speed, and quality.
| Quantization | Bits per Weight | Size (7B model) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.8 GB | Noticeably degraded | Fastest |
| Q4_K_M | 4-bit | ~4.1 GB | Near-original quality | Fast (default) |
| Q5_K_M | 5-bit | ~4.8 GB | Very close to original | Moderate |
| Q6_K | 6-bit | ~5.5 GB | Minimal quality loss | Slower |
| Q8_0 | 8-bit | ~7.2 GB | Near-lossless | Slowest quantized |
| FP16 | 16-bit | ~14 GB | Full precision | Requires most VRAM |
Q4_K_M is the sweet spot for most users: it preserves 95%+ of model quality while cutting VRAM usage by roughly 4x compared to FP16. For code generation or tasks requiring high precision, Q5_K_M or Q6_K is worth the extra memory.
To run a specific quantization:
ollama run llama3.1:8b-instruct-q5_K_M
GPU Acceleration and Performance Tuning
Verify GPU Detection
ollama ps
This shows running models and whether they're using the GPU. If your NVIDIA GPU isn't detected:
# Check CUDA installation
nvidia-smi
# Verify Ollama sees the GPU
OLLAMA_DEBUG=1 ollama run llama3.1
GPU Layer Offloading
For models that don't fully fit in VRAM, Ollama automatically splits layers between GPU and CPU; the more layers on the GPU, the faster the inference. You can control this in a Modelfile:
FROM llama3.1
PARAMETER num_gpu 35
Context Length Configuration
Longer context windows use more memory. The default is typically 2048–4096 tokens. To increase:
ollama run llama3.1
>>> /set parameter num_ctx 8192
Each doubling of context length roughly doubles KV cache memory usage. For a 7B model at Q4:
| Context Length | KV Cache Memory | Total VRAM (approx) |
|---|---|---|
| 2,048 | ~0.5 GB | ~5 GB |
| 4,096 | ~1 GB | ~5.5 GB |
| 8,192 | ~2 GB | ~6.5 GB |
| 16,384 | ~4 GB | ~8.5 GB |
| 32,768 | ~8 GB | ~12.5 GB |
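These figures are ballpark; the exact KV cache size follows from the model architecture: two tensors (K and V) × layers × KV heads × head dimension × context length × bytes per element. A sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dim 128) with an fp16 cache; note that GQA models cache far less per token than older multi-head models, so real numbers vary around the table above:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: K and V tensors for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

# Llama 3.1 8B (GQA: only 8 KV heads) at an 8,192-token context
print(round(kv_cache_gb(32, 8, 128, 8192), 2))
```

Doubling `context_len` doubles the result, which is the linear growth the table shows.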
Memory Management
If you run out of VRAM, Ollama will fall back to CPU for some layers, significantly slowing inference. To optimize:
- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Reduce context length if you don't need long conversations
- Close other GPU-consuming applications
- Consider a smaller model variant
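The first suggestion can be automated: given a VRAM budget, pick the highest-quality quantization whose weights still fit. In this sketch, the effective bits-per-weight values are rough numbers inferred from typical GGUF file sizes (K-quants store scale metadata, so effective bpw exceeds the nominal bit count), and the 1.5 GB reserve for KV cache and runtime buffers is an assumption:

```python
# Approximate effective bits per weight for common GGUF quantizations
QUANT_BITS = {"q4_K_M": 4.8, "q5_K_M": 5.7, "q6_K": 6.6, "q8_0": 8.5}

def best_quant(params_billion, vram_gb, reserve_gb=1.5):
    """Highest-quality quantization whose weights fit in VRAM,
    keeping a reserve for KV cache and buffers. Returns None if
    even the smallest listed quantization does not fit."""
    budget_gigabits = (vram_gb - reserve_gb) * 8
    fitting = {q: b for q, b in QUANT_BITS.items()
               if params_billion * b <= budget_gigabits}
    return max(fitting, key=fitting.get) if fitting else None

print(best_quant(8, 8))    # 8B model on an 8 GB card
print(best_quant(70, 24))  # 70B model on a 24 GB card
```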
API Integration
Ollama runs a local REST API on port 11434. This makes it easy to integrate into applications.
REST API
# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quicksort in one paragraph",
  "stream": false
}'
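With "stream": true (the default when the field is omitted), the API returns one JSON object per line rather than a single response; each chunk carries a "response" fragment and a "done" flag. A minimal sketch of reassembling the text, using simulated chunks so it runs without a server:

```python
import json

def collect_stream(lines):
    """Join the 'response' fragments from newline-delimited JSON
    chunks in the shape /api/generate emits when streaming."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Simulated stream (real final chunks also carry timing stats)
chunks = [
    '{"model":"llama3.1","response":"Quick","done":false}',
    '{"model":"llama3.1","response":"sort is...","done":false}',
    '{"model":"llama3.1","response":"","done":true}',
]
print(collect_stream(chunks))  # Quicksort is...
```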
# Chat with message history
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "stream": false
}'
Python Integration
pip install ollama
import ollama
# Simple generation
response = ollama.generate(model="llama3.1", prompt="Write a haiku about coding")
print(response["response"])
# Chat with history
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to check if a number is prime."},
]
response = ollama.chat(model="llama3.1", messages=messages)
print(response["message"]["content"])
LangChain Integration
pip install langchain-ollama
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
model = OllamaLLM(model="llama3.1")
prompt = ChatPromptTemplate.from_template("Explain {topic} in simple terms.")
chain = prompt | model
result = chain.invoke({"topic": "quantum computing"})
print(result)
Building a Simple Chatbot
import ollama
def chat():
    messages = []
    print("Chat with Llama 3.1 (type 'exit' to quit)")
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "exit":
            break
        messages.append({"role": "user", "content": user_input})
        response = ollama.chat(model="llama3.1", messages=messages)
        assistant_message = response["message"]["content"]
        messages.append({"role": "assistant", "content": assistant_message})
        print(f"\nAI: {assistant_message}")

chat()
Custom Models with Modelfiles
Ollama supports custom model configurations via Modelfiles, similar to Dockerfiles for LLMs:
FROM llama3.1
# Set system prompt
SYSTEM You are a senior Python developer. Always include type hints and docstrings.
# Configure parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
Build and run your custom model:
ollama create python-expert -f Modelfile
ollama run python-expert
This is useful for creating task-specific assistants with fixed system prompts and tuned parameters.
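If you maintain several such assistants, the Modelfile can be generated rather than hand-written. A sketch that renders one from a template and optionally registers it by shelling out to the ollama CLI; render_modelfile and create_assistant are illustrative helpers, not part of any Ollama API:

```python
import subprocess
import tempfile
from pathlib import Path

def render_modelfile(base, system, temperature=0.3, num_ctx=4096):
    """Build Modelfile text for a task-specific assistant."""
    return (f"FROM {base}\n"
            f"SYSTEM {system}\n"
            f"PARAMETER temperature {temperature}\n"
            f"PARAMETER num_ctx {num_ctx}\n")

def create_assistant(name, base, system, **params):
    """Write the Modelfile to a temp dir and run `ollama create`."""
    path = Path(tempfile.mkdtemp()) / "Modelfile"
    path.write_text(render_modelfile(base, system, **params))
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)

print(render_modelfile("llama3.1", "You are a senior SQL reviewer."))
```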
When to Scale to Cloud GPUs
Ollama on local hardware works well for development, prototyping, and personal use, but local GPUs have limitations:
| Limitation | Local GPU | Cloud GPU (Spheron) |
|---|---|---|
| VRAM | 24 GB (RTX 4090) | Up to 141 GB (H200) |
| Largest model | 13B (comfortable) | 70B+ (single GPU) |
| Multi-GPU | PCIe bottleneck | NVLink at 600–900 GB/s |
| Uptime | Personal machine | 24/7 dedicated server |
| Scaling | Single GPU | 1–8 GPU clusters |
When your models outgrow 24 GB, require 24/7 uptime, or need multi-GPU parallelism, Spheron provides cloud GPU instances starting at $0.55/hr with pre-configured CUDA environments and full root access.
Explore GPU options on Spheron →
Frequently Asked Questions
How much VRAM do I need to run Llama 3.1 8B?
The Q4_K_M quantized version (Ollama's default) requires approximately 5–6 GB of VRAM including KV cache. Any GPU with 8 GB VRAM (RTX 3060, RTX 4060, etc.) can run it comfortably. On CPU, you need at least 8 GB of RAM, but inference will be 5–10x slower.
Can I run Ollama on Apple Silicon Macs?
Yes. Ollama automatically uses Metal GPU acceleration on M1/M2/M3/M4 Macs. Apple Silicon's unified memory architecture means the GPU can access all system RAM, so a Mac with 32 GB unified memory can run models that wouldn't fit on a 24 GB discrete GPU. Performance is roughly 60–70% of an equivalent NVIDIA GPU.
What's the difference between Ollama and llama.cpp?
Ollama is a user-friendly wrapper around llama.cpp. It handles model downloading, GGUF format management, and GPU detection, and it provides a REST API; with raw llama.cpp you would configure all of this manually. If you want maximum control and custom builds, use llama.cpp directly. For ease of use, Ollama is the better choice.
Can I run multiple models simultaneously?
Yes. Ollama loads models on demand and keeps them in memory. You can run multiple models by making API calls to different model names. However, each loaded model consumes VRAM, so running two 7B models simultaneously requires roughly 10–12 GB of VRAM.
How does quantization affect output quality?
Q4_K_M (4-bit) preserves approximately 95% of the original model's quality for most tasks. You may notice slight degradation in complex reasoning, math, or code generation compared to FP16. Q5_K_M and Q6_K offer better quality at the cost of more VRAM. For most conversational and writing tasks, Q4_K_M is indistinguishable from the full-precision model.
Is Ollama suitable for production use?
Ollama is excellent for development, testing, and personal use. For production serving with multiple concurrent users, SLA requirements, and load balancing, consider dedicated inference servers using vLLM, TensorRT-LLM, or Triton Inference Server on cloud GPUs. Ollama's REST API can serve light production loads but lacks features like batching, auto-scaling, and health monitoring.