Tutorial

How to Run LLMs Locally with Ollama: GPU-Accelerated Setup Guide

Written by Spheron · Nov 3, 2025
Tags: LLM, Ollama, GPU Cloud, Local Inference, Quantization, AI Deployment

Running LLMs locally means zero API costs, no network latency to a cloud endpoint, and complete data privacy. No tokens leave your machine. No rate limits. No vendor lock-in.

Ollama makes this practical. It wraps llama.cpp, a highly optimized CPU/GPU inference engine for quantized models, in a simple CLI and REST API. You download a model with one command, run it with another, and get interactive chat speeds on consumer hardware.

This guide covers everything you need to run LLMs locally with Ollama: hardware requirements, installation, model selection, GPU acceleration, quantization, performance tuning, API integration, and when to scale to cloud GPUs.

Hardware Requirements

Local LLM performance depends primarily on VRAM (for GPU inference) or RAM (for CPU inference). The model must fit entirely in memory for acceptable speeds.

Minimum Requirements by Model Size

| Model Size | Min RAM (CPU) | Min VRAM (GPU) | Example Models |
|---|---|---|---|
| 1B–3B | 4 GB | 2 GB | Phi-3 Mini, Gemma 2B, TinyLlama |
| 7B–8B | 8 GB | 6 GB | Llama 3.1 8B, Mistral 7B, Gemma 7B |
| 13B | 16 GB | 10 GB | Llama 2 13B, CodeLlama 13B |
| 20B–34B | 32 GB | 16 GB | CodeLlama 34B, Yi-34B |
| 70B | 64 GB | 40 GB+ | Llama 2 70B, Llama 3.1 70B |

These are approximate requirements for Q4_K_M quantization (4-bit), which is the default Ollama format. FP16 models require roughly 4x the VRAM.
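As a rule of thumb, quantized weight size is parameters × bits per weight / 8 bytes, plus some runtime overhead for the KV cache and buffers. A minimal back-of-envelope sketch (the 1 GB overhead figure is an assumption for illustration, not an Ollama internals number):

```python
# Rough VRAM estimate for a quantized model: weight bytes plus a fixed
# overhead for KV cache and buffers. Back-of-envelope only.

def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Approximate VRAM needed: weight bytes + runtime overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return round(weight_gb + overhead_gb, 1)

# Llama 3.1 8B at ~4.5 effective bits/weight lands near the table's 7B–8B row
print(estimate_vram_gb(8, 4.5))   # 5.5
print(estimate_vram_gb(70, 4.5))  # 40.4
```

The estimate tracks the table above: an 8B model at Q4 fits a 6 GB GPU with little to spare, while 70B needs 40 GB+ even quantized.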

GPU vs CPU Inference Speed

| Configuration | Llama 3.1 8B (Q4) | Llama 2 13B (Q4) | Llama 2 70B (Q4) |
|---|---|---|---|
| RTX 4090 (24 GB) | 80–120 tok/s | 40–60 tok/s | CPU offload (~5 tok/s) |
| RTX 3090 (24 GB) | 50–70 tok/s | 30–45 tok/s | CPU offload (~3 tok/s) |
| RTX 4060 Ti (16 GB) | 40–60 tok/s | 20–30 tok/s | Does not fit |
| Apple M3 Max (48 GB unified) | 30–45 tok/s | 20–30 tok/s | 8–12 tok/s |
| CPU only (Ryzen 9 7950X) | 8–15 tok/s | 5–10 tok/s | 1–3 tok/s |

GPU inference is 5–10x faster than CPU. If you have an NVIDIA GPU with 8+ GB VRAM, GPU acceleration makes the difference between unusable and interactive.

Installation

macOS

Download the installer from ollama.com/download or install via Homebrew:

```bash
brew install ollama
```

Ollama automatically uses Apple Silicon GPU (Metal) on M1/M2/M3/M4 Macs.

Linux

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

For NVIDIA GPU support, ensure CUDA drivers are installed. Ollama detects NVIDIA GPUs automatically.

Windows

Download the installer from ollama.com/download. Ollama supports NVIDIA GPUs on Windows via CUDA.

Verify Installation

```bash
ollama --version
```

Running Your First Model

Download and run a model with a single command:

```bash
ollama run llama3.1
```

This downloads the Llama 3.1 8B model (Q4_K_M quantization, ~4.7 GB) and starts an interactive chat session. First run takes a few minutes for the download; subsequent runs start in seconds.

To pull a model without starting chat:

```bash
ollama pull llama3.1
```

Essential Commands

```bash
# List installed models
ollama list

# Show model details (size, quantization, parameters)
ollama show llama3.1

# Remove a model
ollama rm llama3.1

# Run a specific quantization variant
ollama run llama3.1:70b-instruct-q4_K_M

# Set a system prompt inside an interactive session
ollama run llama3.1
# then, at the >>> prompt:
# /set system "You are a Python expert. Respond with code only."
```

Choosing the Right Model

Ollama's model library contains hundreds of models. Here are the best options by use case:

Recommended Models

| Model | Size | Best For | Speed (RTX 4090) |
|---|---|---|---|
| llama3.1:8b | 4.7 GB | General chat, writing, reasoning | 80–120 tok/s |
| mistral | 4.1 GB | Fast general-purpose assistant | 85–130 tok/s |
| codellama:13b | 7.4 GB | Code generation and review | 40–60 tok/s |
| llama3.1:70b | 40 GB | Complex reasoning, analysis | 8–12 tok/s |
| phi3:mini | 2.2 GB | Lightweight, fast responses | 100–150 tok/s |
| mixtral:8x7b | 26 GB | Multi-task, strong reasoning | 20–35 tok/s |
| gemma2:9b | 5.4 GB | Google's efficient model | 60–90 tok/s |
| deepseek-coder-v2:16b | 8.9 GB | Advanced code generation | 35–50 tok/s |
| qwen2.5:7b | 4.4 GB | Multilingual, strong reasoning | 70–110 tok/s |

For most users, llama3.1:8b or mistral provides the best balance of quality and speed. If you have 24+ GB VRAM, mixtral:8x7b offers significantly better reasoning at interactive speeds.

Understanding Quantization

Ollama models use GGUF quantization, a format that compresses model weights to reduce memory usage while preserving quality. The quantization level determines the tradeoff between size, speed, and quality.

| Quantization | Bits per Weight | Size (7B model) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.8 GB | Noticeably degraded | Fastest |
| Q4_K_M | 4-bit | ~4.1 GB | Near-original quality | Fast (default) |
| Q5_K_M | 5-bit | ~4.8 GB | Very close to original | Moderate |
| Q6_K | 6-bit | ~5.5 GB | Minimal quality loss | Slower |
| Q8_0 | 8-bit | ~7.2 GB | Near-lossless | Slowest quantized |
| FP16 | 16-bit | ~14 GB | Full precision | Requires most VRAM |

Q4_K_M is the sweet spot for most users: it preserves 95%+ of model quality while cutting VRAM usage by roughly 4x compared to FP16. For code generation or tasks requiring high precision, Q5_K_M or Q6_K is worth the extra memory.

To run a specific quantization:

```bash
ollama run llama3.1:8b-instruct-q5_K_M
```

GPU Acceleration and Performance Tuning

Verify GPU Detection

```bash
ollama ps
```

This shows running models and whether they're using GPU. If your NVIDIA GPU isn't detected:

```bash
# Check CUDA installation
nvidia-smi

# Verify Ollama sees the GPU
OLLAMA_DEBUG=1 ollama run llama3.1
```

GPU Layer Offloading

For models that don't fully fit in VRAM, Ollama automatically splits layers between GPU and CPU. The more layers that fit on the GPU, the faster the inference. You can control this in a Modelfile:

```
FROM llama3.1
PARAMETER num_gpu 35
```

Context Length Configuration

Longer context windows use more memory. The default is typically 2048–4096 tokens. To increase:

```bash
ollama run llama3.1
# then, at the >>> prompt:
# /set parameter num_ctx 8192
```

Each doubling of context length roughly doubles KV cache memory usage. For a 7B model at Q4:

| Context Length | KV Cache Memory | Total VRAM (approx) |
|---|---|---|
| 2,048 | ~0.5 GB | ~5 GB |
| 4,096 | ~1 GB | ~5.5 GB |
| 8,192 | ~2 GB | ~6.5 GB |
| 16,384 | ~4 GB | ~8.5 GB |
| 32,768 | ~8 GB | ~12.5 GB |
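KV cache size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. A sketch with illustrative full-attention 7B-style numbers; real models using grouped-query attention (like Llama 3.1) or quantized caches have considerably smaller footprints, which is why the figures in the table are lower:

```python
# Back-of-envelope KV cache size. The architecture numbers used below
# (32 layers, 32 KV heads, head_dim 128, fp16 cache) are illustrative;
# grouped-query attention and quantized caches shrink the real footprint.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return round(per_token * context_len / 1024**3, 2)

print(kv_cache_gb(32, 32, 128, 4096))  # 2.0
print(kv_cache_gb(32, 32, 128, 8192))  # 4.0 -- doubles with context length
```

Whatever the constants, the scaling is linear: doubling the context length doubles the KV cache, exactly as the table shows.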

Memory Management

If you run out of VRAM, Ollama will fall back to CPU for some layers, significantly slowing inference. To optimize:

  1. Use a smaller quantization (Q4_K_M instead of Q8_0)
  2. Reduce context length if you don't need long conversations
  3. Close other GPU-consuming applications
  4. Consider a smaller model variant

API Integration

Ollama runs a local REST API on port 11434. This makes it easy to integrate into applications.

REST API

```bash
# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quicksort in one paragraph",
  "stream": false
}'

# Chat with message history
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "stream": false
}'
```
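Omit `"stream": false` and the API streams its reply as newline-delimited JSON, one object per text fragment, with `"done": true` on the last. A minimal parsing sketch; the sample chunks below are hypothetical stand-ins for what the server emits:

```python
# Parse Ollama's streaming output: each NDJSON line carries a "response"
# fragment; the final object has "done": true.
import json

def collect_stream(lines):
    """Join the "response" fragments from an NDJSON stream until "done"."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Hypothetical chunks standing in for the server's stream
sample = [
    '{"response": "Quicksort ", "done": false}',
    '{"response": "partitions...", "done": true}',
]
print(collect_stream(sample))  # Quicksort partitions...
```

In a real client you would iterate over the HTTP response body line by line and print each fragment as it arrives, which is what gives streaming UIs their typewriter effect.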

Python Integration

```bash
pip install ollama
```

```python
import ollama

# Simple generation
response = ollama.generate(model="llama3.1", prompt="Write a haiku about coding")
print(response["response"])

# Chat with history
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to check if a number is prime."},
]
response = ollama.chat(model="llama3.1", messages=messages)
print(response["message"]["content"])
```

LangChain Integration

```bash
pip install langchain-ollama
```

```python
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

model = OllamaLLM(model="llama3.1")
prompt = ChatPromptTemplate.from_template("Explain {topic} in simple terms.")
chain = prompt | model

result = chain.invoke({"topic": "quantum computing"})
print(result)
```

Building a Simple Chatbot

```python
import ollama

def chat():
    messages = []
    print("Chat with Llama 3.1 (type 'exit' to quit)")

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "exit":
            break

        messages.append({"role": "user", "content": user_input})
        response = ollama.chat(model="llama3.1", messages=messages)
        assistant_message = response["message"]["content"]
        messages.append({"role": "assistant", "content": assistant_message})
        print(f"\nAI: {assistant_message}")

chat()
```
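One caveat with a loop like this: the message history grows without bound, and once it exceeds the model's context window the oldest turns stop fitting. A minimal guard (the 20-message budget is an arbitrary assumption) keeps the system prompt plus the most recent turns:

```python
# Cap chat history so it stays within the model's context window.
# The max_messages budget here is an arbitrary illustration.

def trim_history(messages, max_messages=20):
    """Keep the first system message (if any) plus the last max_messages turns."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

history = [{"role": "system", "content": "Be terse."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(30)]
print(len(trim_history(history)))  # 21 (system + last 20)
```

Call `trim_history(messages)` before each `ollama.chat` call; a token-based budget tied to `num_ctx` would be more precise, but a message count is a serviceable first cut.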

Custom Models with Modelfiles

Ollama supports custom model configurations via Modelfiles, similar to Dockerfiles for LLMs:

```
FROM llama3.1

# Set system prompt
SYSTEM You are a senior Python developer. Always include type hints and docstrings.

# Configure parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
```

Build and run your custom model:

```bash
ollama create python-expert -f Modelfile
ollama run python-expert
```

This is useful for creating task-specific assistants with fixed system prompts and tuned parameters.

When to Scale to Cloud GPUs

Ollama on local hardware works well for development, prototyping, and personal use, but local GPUs have limitations:

| Limitation | Local GPU | Cloud GPU (Spheron) |
|---|---|---|
| VRAM | 24 GB (RTX 4090) | Up to 141 GB (H200) |
| Largest model | 13B (comfortable) | 70B+ (single GPU) |
| Multi-GPU | PCIe bottleneck | NVLink at 600–900 GB/s |
| Uptime | Personal machine | 24/7 dedicated server |
| Scaling | Single GPU | 1–8 GPU clusters |

When your models outgrow 24 GB, require 24/7 uptime, or need multi-GPU parallelism, Spheron provides cloud GPU instances starting at $0.55/hr with pre-configured CUDA environments and full root access.

Explore GPU options on Spheron →

Frequently Asked Questions

How much VRAM do I need to run Llama 3.1 8B?

The Q4_K_M quantized version (Ollama's default) requires approximately 5–6 GB of VRAM including KV cache. Any GPU with 8 GB VRAM (RTX 3060, RTX 4060, etc.) can run it comfortably. On CPU, you need at least 8 GB of RAM, but inference will be 5–10x slower.

Can I run Ollama on Apple Silicon Macs?

Yes. Ollama automatically uses Metal GPU acceleration on M1/M2/M3/M4 Macs. Apple Silicon's unified memory architecture means the GPU can access all system RAM, so a Mac with 32 GB unified memory can run models that wouldn't fit on a 24 GB discrete GPU. Performance is roughly 60–70% of an equivalent NVIDIA GPU.

What's the difference between Ollama and llama.cpp?

Ollama is a user-friendly wrapper around llama.cpp. It handles model downloading, GGUF format management, and GPU detection, and provides a REST API: all things you'd configure manually with raw llama.cpp. If you want maximum control and custom builds, use llama.cpp directly. For ease of use, Ollama is the better choice.

Can I run multiple models simultaneously?

Yes. Ollama loads models on demand and keeps them in memory. You can run multiple models by making API calls to different model names. However, each loaded model consumes VRAM, so running two 7B models simultaneously requires roughly 10–12 GB of VRAM.

How does quantization affect output quality?

Q4_K_M (4-bit) preserves approximately 95% of the original model's quality for most tasks. You may notice slight degradation in complex reasoning, math, or code generation compared to FP16. Q5_K_M and Q6_K offer better quality at the cost of more VRAM. For most conversational and writing tasks, Q4_K_M is indistinguishable from the full-precision model.

Is Ollama suitable for production use?

Ollama is excellent for development, testing, and personal use. For production serving with multiple concurrent users, SLA requirements, and load balancing, consider dedicated inference servers using vLLM, TensorRT-LLM, or Triton Inference Server on cloud GPUs. Ollama's REST API can serve light production loads but lacks features like batching, auto-scaling, and health monitoring.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.