Tutorial

ColPali and Multimodal Document RAG on GPU Cloud: Visual PDF Retrieval Without OCR (2026)

Written by Mitrasish, Co-founder · Apr 30, 2026
Tags: ColPali Deployment · Multimodal Document RAG · Visual Document Retrieval GPU · PDF RAG Without OCR · ColQwen Self-Host · Vector Database GPU · VLM Inference · GPU Cloud · ColBERT Late Interaction · Multi-Vector Search

OCR-based PDF RAG breaks on real enterprise documents because OCR discards the spatial relationships between tables, charts, and text. A multi-column financial report or a slide deck full of diagrams loses most of its information when you flatten it to a string. ColPali takes a different path: it treats every PDF page as an image and builds patch-level embeddings directly from the visual signal, skipping OCR entirely. For context on the broader RAG stack before diving into the visual layer, see the agentic RAG infrastructure guide.

The catch is that visual retrieval requires more GPU memory and careful system design. This guide covers the ColPali and ColQwen2.5 model families, GPU sizing for indexing workloads, vector database configuration for multi-vector late-interaction search, and a complete end-to-end deployment on Spheron GPU cloud.

Why OCR-Based RAG Fails on Real Documents

Three failure modes compound when document structure is complex:

OCR errors accumulate on scanned and complex-layout pages. Character recognition accuracy drops from 95-99% on clean typewritten text to 60-80% on photocopied documents, handwritten annotations, or pages with mixed fonts and watermarks. At 80% accuracy, a 500-word page produces roughly 100 garbled tokens. Embed that text and you get an embedding that does not represent what the page actually says.

Tables and charts lose their structure when flattened to text. A revenue table with five product lines and twelve monthly columns becomes a linear sequence of numbers with no positional context. A bar chart becomes its axis labels and nothing else. The embedding model has no way to recover the structural relationships that give those numbers meaning.

Multi-column layouts produce garbled reading order. pdfplumber, pymupdf, and Unstructured.io all use bounding-box heuristics to reconstruct reading order. On dense two-column academic papers or three-column news layouts, these heuristics misorder sentences at a rate that makes the resulting text nearly unusable for precise retrieval.

According to document AI research, roughly 80% of enterprise PDFs contain at least one table, chart, or complex layout element. For those documents, OCR-based RAG starts with corrupted or structurally degraded input. ColPali sidesteps this by treating the page as a visual object.

How ColPali Works: Late-Interaction Over Patch Embeddings

ColPali and its successors (ColQwen2.5, ColSmolVLM, ColInternVL) use a vision-language model to encode each document page as a grid of patch tokens, then project each patch to a 128-dimensional embedding vector.

ColPali-3 (used in this post as shorthand for the PaliGemma-3B-based variant; the official release series is vidore/colpali-v1.2 and related versions on HuggingFace) uses PaliGemma as the base VLM. At 448x448 pixel resolution, the ViT encoder produces 1030 patch tokens per page. Each token is projected to 128 dims via a learned linear layer. That gives you 1030 vectors per page, each capturing a local visual region.

Retrieval uses ColBERT-style late interaction, not a single aggregated vector. For a query, you embed the query text (or query image) into its own set of patch-like vectors. Then you compute MaxSim: for each query vector, find the maximum cosine similarity with any document patch vector. Sum the per-query-vector MaxSim scores across all query vectors to get the final page score.

This has a concrete benefit over standard dense retrieval. Standard dense RAG compresses all patch information into one vector via mean-pooling, losing positional and local visual information. MaxSim late interaction can match a query about "the Q3 revenue figure" to the specific table cell patch that contains it, even if the rest of the page is irrelevant.
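
To make the scoring concrete, here is a minimal MaxSim sketch in PyTorch. In production the vector database computes this server-side, so this is for illustration only; the toy tensors are random placeholders.

python
import torch
import torch.nn.functional as F

def maxsim_score(query_vecs: torch.Tensor, page_vecs: torch.Tensor) -> float:
    """ColBERT-style late-interaction score between one query and one page.

    query_vecs: (n_query_tokens, 128), L2-normalized
    page_vecs:  (n_patches, 128), L2-normalized
    """
    # Cosine similarity of every query token against every page patch
    sim = query_vecs @ page_vecs.T              # (n_query_tokens, n_patches)
    # Each query token keeps only its best-matching patch; sum over query tokens
    return sim.max(dim=1).values.sum().item()

# Toy example: 20 query tokens vs 1030 page patches, 128-dim each
q = F.normalize(torch.randn(20, 128), dim=-1)
p = F.normalize(torch.randn(1030, 128), dim=-1)
print(maxsim_score(q, p))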

| Approach | Representation | Scoring | Layout Preserved |
|---|---|---|---|
| Text RAG (dense) | 1 vector per chunk | Cosine similarity | No |
| ColPali | N patch vectors per page | MaxSim late interaction | Yes |
| Hybrid (text + image) | chunk vector + caption vector | Weighted sum | Partial |

The multi-vector representation is what makes Qdrant or Milvus (with multi-vector collection support) a requirement. Standard vector databases that expect one vector per document cannot store or score ColPali indexes without custom extensions.

Picking a Model: ColPali, ColQwen2.5, ColSmolVLM, ColInternVL

All of these models are served through the same colpali-engine library. The ColPali name is often used loosely for any ColBERT-style visual retrieval model, but strictly it refers to the PaliGemma-based variant.

| Model | Base VLM | VRAM (FP16) | Patches per page | ViDoRe nDCG@5 | Best for |
|---|---|---|---|---|---|
| ColPali-3 | PaliGemma-3B | ~8GB | 1030 (448px) | ~78% | General retrieval, good baseline |
| ColQwen2.5-7B | Qwen2.5-VL-7B | ~16GB | ~196-1030 (varies) | ~84% | Production English/multilingual |
| ColSmolVLM-256M | SmolVLM-256M | ~2GB | ~196 | ~65% | Edge deployment, real-time batch |
| ColSmolVLM-500M | SmolVLM-500M | ~3GB | ~196 | ~70% | Resource-constrained environments |
| ColInternVL2-4B | InternVL2-4B | ~10GB | varies | ~80% | Chinese/multilingual documents |

For most production workloads in English or multilingual settings, ColQwen2.5-7B is the right starting point. It has the highest retrieval accuracy on the ViDoRe benchmark and handles mixed-language document corpora well. Use ColSmolVLM if VRAM budget is tight or you need real-time indexing on smaller hardware.

Index Size and Storage Math

The storage cost is substantial and worth planning before you provision.

ColPali-3 at 448x448px (1030 patches per page):

  • Each patch: 128-dim float32 = 512 bytes
  • Per page: 1030 patches × 512 bytes = 527KB
  • 100K pages: ~53GB raw vector storage
  • 1M pages: ~527GB raw vector storage
  • With int8 quantization: ~132GB

ColQwen2.5 at default resolution (~196 patches per page):

  • Per page: 196 patches × 512 bytes = 100KB
  • 100K pages: ~10GB
  • 1M pages: ~100GB raw
  • With int8 quantization: ~25GB

| Corpus Size | ColPali-3 (FP32) | ColPali-3 (int8) | ColQwen2.5 (FP32) | ColQwen2.5 (int8) |
|---|---|---|---|---|
| 10K pages | 5.3GB | 1.3GB | 1GB | 0.25GB |
| 100K pages | 53GB | 13GB | 10GB | 2.5GB |
| 1M pages | 527GB | 132GB | 100GB | 25GB |
| 10M pages | 5.3TB | 1.3TB | 1TB | 250GB |

Qdrant supports both in-memory and on-disk indexes via memory-mapped files. For corpora over 100K pages, use on-disk storage with on_disk: true in the vector config. For large-scale deployments, int8 scalar quantization is nearly lossless for retrieval quality (typically less than 1% nDCG@5 degradation) while cutting storage and bandwidth requirements by 4x.
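
As a concrete starting point, the snippet below creates a large-corpus collection with both settings enabled via the Python client. The collection name is illustrative; on_disk and scalar quantization are standard qdrant-client configuration, but treat the exact values as a sketch to tune for your corpus.

python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, MultiVectorConfig, MultiVectorComparator,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="colpali_docs_large",
    vectors_config={
        "colpali": VectorParams(
            size=128,
            distance=Distance.COSINE,
            on_disk=True,  # memory-map original vectors instead of holding them in RAM
            multivector_config=MultiVectorConfig(
                comparator=MultiVectorComparator.MAX_SIM
            ),
        )
    },
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True)
    ),
)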

GPU Requirements for Indexing at Scale

Indexing is memory-bandwidth-bound, not compute-bound. The forward pass over patch tokens moves large activation tensors through HBM on every batch. GPUs with higher HBM bandwidth process more pages per second at the same batch size.

HBM bandwidth comparison:

  • A100 80GB: 2.0 TB/s
  • H100 SXM5: 3.35 TB/s
  • H200 SXM5: 4.8 TB/s (HBM3e)
  • B200 SXM6: 8.0 TB/s (HBM3e). For architecture details and workload benchmarks, see the complete B200 guide.

Throughput scales roughly linearly with bandwidth for pure embedding workloads (batch size 8, FP16):

| GPU | HBM Bandwidth | ColQwen2.5-7B throughput | ColPali-3 throughput |
|---|---|---|---|
| A100 80GB | 2.0 TB/s | ~12 pages/sec | ~18 pages/sec |
| H100 SXM5 | 3.35 TB/s | ~20 pages/sec | ~30 pages/sec |
| H200 SXM5 | 4.8 TB/s | ~28 pages/sec | ~42 pages/sec |
| B200 SXM6 | 8.0 TB/s | ~48 pages/sec | ~70 pages/sec |

These are approximate figures based on memory bandwidth scaling. Always benchmark your specific document resolution and batch size.

Time and cost to index 1M pages:

| GPU | Time (ColQwen2.5) | On-demand cost |
|---|---|---|
| A100 80GB | ~23 hours | ~$24 |
| H100 SXM5 | ~14 hours | ~$41 |
| H200 SXM5 | ~10 hours | ~$45 |
| B200 SXM6 | ~6 hours | ~$40 |

H200 and B200 deliver 2-4x higher indexing throughput than A100, which matters when you need to process millions of pages quickly. For a large one-time indexing run, renting an H200 or B200 on Spheron finishes the job in hours rather than days, at higher on-demand rates. For cost-sensitive batch workloads where total spend matters more than turnaround time, the A100 on-demand at $1.04/hr is the most economical option.
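
For planning your own run, a back-of-the-envelope estimator using the approximate throughput and pricing figures above (pure arithmetic, not a benchmark):

python
def indexing_estimate(pages: int, pages_per_sec: float, usd_per_hour: float) -> tuple[float, float]:
    """Return (hours, usd) to embed `pages` pages at a given throughput and GPU rate."""
    hours = pages / pages_per_sec / 3600
    return hours, hours * usd_per_hour

# 1M pages with ColQwen2.5 on an H200 at ~28 pages/sec and $4.54/hr
hours, cost = indexing_estimate(1_000_000, 28, 4.54)
print(f"~{hours:.0f} hours, ~${cost:.0f}")  # ~10 hours, ~$45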

The Serving Stack: Vector DB, Late-Interaction Reranking, and Query Latency

ColPali retrieval has three layers: query encoding, vector search, and (optionally) VLM answer generation.

Query latency breakdown (single query, collocated GPU node):

  1. Query embedding: 20-50ms (ColQwen2.5-7B on GPU)
  2. MaxSim vector search over Qdrant: 5-50ms (depends on corpus size and index type)
  3. VLM generation for answer extraction: 500ms-2s (7B model, one page image)

Vector database comparison:

| DB | Multi-vector (ColBERT) | MaxSim native | GPU indexing | Cloud/self-host |
|---|---|---|---|---|
| Qdrant | Yes | Yes | CPU + GPU offload | Both |
| Milvus | Yes (v2.4+) | Yes | Yes | Both |
| pgvector | No | No (custom required) | No | Self-host only |
| Weaviate | No native | No (custom required) | No | Both |

Qdrant is the recommended choice for getting started: the multi-vector collection API is straightforward, and MaxSim scoring is a first-class feature. Use Milvus if you need GPU-accelerated index building at very large scale (10M+ pages) and are comfortable with the operational complexity.

For the query embedding server specifically, running a self-hosted embedding server on the same node as Qdrant eliminates the 100-400ms network hop that managed embedding APIs add. Colocation cuts p99 query latency by roughly 2-4x compared to routing queries through separate cloud services.

End-to-End Pipeline: Ingestion, Indexing, Retrieval, and VLM Generation

Phase 1: PDF Ingestion

Convert each PDF page to an image before passing it to the model. pdf2image wraps poppler and handles most PDF variants:

python
from pdf2image import convert_from_path
import PIL.Image

def render_pages(pdf_path: str, dpi: int = 150) -> list[PIL.Image.Image]:
    pages = convert_from_path(pdf_path, dpi=dpi)
    # ColPali-3 requires exactly 448x448 input for its fixed ViT encoder
    return [p.resize((448, 448)) for p in pages]

For ColQwen2.5, you can pass higher-resolution images and let the model's dynamic resolution tiling handle it. For ColPali-3, resize to 448x448 before batch processing.
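
For ColQwen2.5 specifically, a variant of the render helper that keeps the original aspect ratio and only caps the longest side is usually a better fit. The 1536px cap below is an illustrative choice, not a model requirement:

python
def render_pages_colqwen(pdf_path: str, dpi: int = 150, max_side: int = 1536) -> list[PIL.Image.Image]:
    pages = convert_from_path(pdf_path, dpi=dpi)
    resized = []
    for p in pages:
        scale = max_side / max(p.size)
        if scale < 1:  # only downscale; ColQwen's dynamic resolution tiling handles the rest
            p = p.resize((int(p.width * scale), int(p.height * scale)))
        resized.append(p)
    return resized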

Phase 2: Batch Indexing

python
import uuid
import torch
from colpali_engine.models import ColQwen2_5, ColQwen2_5Processor
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, MultiVectorConfig, MultiVectorComparator
)

# Load model
model = ColQwen2_5.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
).eval()
processor = ColQwen2_5Processor.from_pretrained("vidore/colqwen2.5-v0.2")

# Create Qdrant collection with multi-vector support (requires Qdrant and qdrant-client >= 1.10)
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="colpali_docs",
    vectors_config={
        "colpali": VectorParams(
            size=128,
            distance=Distance.COSINE,
            multivector_config=MultiVectorConfig(
                comparator=MultiVectorComparator.MAX_SIM
            ),
        )
    }
)

def index_pages(pages: list[PIL.Image.Image], doc_id: str, start_page: int = 0):
    batch_size = 8
    all_embeddings = []
    for i in range(0, len(pages), batch_size):
        batch = pages[i:i+batch_size]
        inputs = processor.process_images(batch).to(model.device)
        with torch.no_grad():
            embeddings = model(**inputs)  # shape: (batch, n_patches, 128)
        all_embeddings.extend(embeddings.cpu().float().numpy().tolist())

    points = [
        PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{doc_id}::page::{start_page + idx}")),
            vector={"colpali": emb},
            payload={"doc_id": doc_id, "page": start_page + idx}
        )
        for idx, emb in enumerate(all_embeddings)
    ]
    client.upsert(collection_name="colpali_docs", points=points)
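
To tie the two phases together, a minimal driver that walks a directory of PDFs, renders each one with the ColQwen helper above, and indexes every page. The directory layout and doc_id convention are illustrative:

python
from pathlib import Path

def index_corpus(pdf_dir: str):
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        pages = render_pages_colqwen(str(pdf_path))   # Phase 1: render pages to images
        index_pages(pages, doc_id=pdf_path.stem)      # Phase 2: embed and upsert into Qdrant
        print(f"indexed {pdf_path.name}: {len(pages)} pages")

index_corpus("./documents")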

Phase 3: Query and Retrieval

python
def retrieve(query_text: str, top_k: int = 5) -> list[dict]:
    inputs = processor.process_queries([query_text]).to(model.device)
    with torch.no_grad():
        query_embedding = model(**inputs)  # shape: (1, n_query_tokens, 128)

    results = client.query_points(
        collection_name="colpali_docs",
        query=query_embedding[0].cpu().float().numpy().tolist(),
        using="colpali",
        limit=top_k,
    )
    return [r.payload for r in results.points]
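
A quick way to sanity-check the latency breakdown from the serving-stack section against your own deployment is to time the call directly (a simple sketch; the query string is illustrative):

python
import time

t0 = time.perf_counter()
hits = retrieve("What was Q3 revenue by product line?", top_k=5)
elapsed_ms = (time.perf_counter() - t0) * 1000

print(f"query embedding + MaxSim search: {elapsed_ms:.0f} ms")
for hit in hits:
    print(hit["doc_id"], "page", hit["page"])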

Phase 4: VLM Answer Generation

Pass the retrieved page images to a VLM for answer extraction. Using vLLM's OpenAI-compatible endpoint (see the VLM deployment guide for setup details):

python
import base64, httpx

def answer_from_pages(query: str, page_images: list[PIL.Image.Image]) -> str:
    import io

    def encode_image(img: PIL.Image.Image) -> str:
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        return base64.b64encode(buf.getvalue()).decode()

    content = []
    for img in page_images[:3]:  # limit to top-3 pages to control token cost
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(img)}"}
        })
    content.append({"type": "text", "text": query})

    response = httpx.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "messages": [{"role": "user", "content": content}],
            "max_tokens": 512
        },
        timeout=30
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
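
Putting retrieval and generation together: the sketch below re-renders the retrieved pages from the source PDFs before passing them to the VLM. It assumes each doc_id maps to pdf_dir/<doc_id>.pdf, matching the indexing driver above; that storage layout is an assumption of this example, not part of ColPali.

python
def ask(query: str, pdf_dir: str = "./documents", top_k: int = 3) -> str:
    hits = retrieve(query, top_k=top_k)
    page_images = []
    for hit in hits:
        # Assumed layout: one PDF per doc_id under pdf_dir (illustrative, not required)
        pages = render_pages_colqwen(f"{pdf_dir}/{hit['doc_id']}.pdf")
        page_images.append(pages[hit["page"]])
    return answer_from_pages(query, page_images)

print(ask("What was Q3 revenue by product line?"))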

Production Deployment on Spheron

Single-Node Setup (Up to ~500K Pages)

One H200 141GB node handles the full stack: ColQwen2.5-7B retriever, Qdrant in-memory, and a 7B VLM for answer generation.

Docker Compose configuration:

yaml
version: "3.9"
services:
  qdrant:
    image: qdrant/qdrant:v1.11.0
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  colpali-api:
    build: ./colpali-service
    ports:
      - "8080:8080"
    environment:
      - QDRANT_URL=http://qdrant:6333
      - MODEL_ID=vidore/colqwen2.5-v0.2
    depends_on:
      - qdrant
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  vllm-serve:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    command: >
      --model Qwen/Qwen2.5-VL-7B-Instruct
      --tensor-parallel-size 1
      --max-model-len 16384
      --gpu-memory-utilization 0.35
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

volumes:
  qdrant_data:

The GPU memory split on an H200 141GB: ColQwen2.5-7B takes ~16GB, Qdrant in-memory for 500K pages at ColQwen resolution takes ~50GB, and Qwen2.5-VL-7B via vLLM with --gpu-memory-utilization 0.35 takes ~49GB (weights plus KV cache). Total: ~115GB, leaving ~26GB of headroom.
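
The compose file builds the colpali-api image from ./colpali-service, which is not shown above. A minimal sketch of what that service could look like, wrapping the indexing and retrieval functions from the pipeline section in a FastAPI app; the endpoint names, request schema, and the pipeline module import are illustrative assumptions:

python
# colpali-service/app.py -- illustrative skeleton, not a hardened service
import os

from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel

# index_pages, retrieve, render_pages_colqwen: the functions from the pipeline
# section, assumed here to live in a local module named `pipeline`
from pipeline import index_pages, render_pages_colqwen, retrieve

app = FastAPI()

class QueryRequest(BaseModel):
    query: str
    top_k: int = 5

@app.post("/index")
async def index_pdf(file: UploadFile = File(...)):
    # Persist the upload, render its pages, then embed and upsert into Qdrant
    tmp_path = f"/tmp/{file.filename}"
    with open(tmp_path, "wb") as f:
        f.write(await file.read())
    pages = render_pages_colqwen(tmp_path)
    doc_id = os.path.splitext(file.filename)[0]
    index_pages(pages, doc_id=doc_id)
    return {"doc_id": doc_id, "pages_indexed": len(pages)}

@app.post("/search")
def search(req: QueryRequest):
    return {"results": retrieve(req.query, top_k=req.top_k)}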

Multi-Node Setup (1M+ Pages)

For corpora over 1M pages, split across two roles:

  • Indexing node: B200 for batch embedding jobs. Runs only the colpali-engine indexing pipeline. Writes to a shared Qdrant cluster.
  • Inference node: H200 for VLM answer generation. Runs Qdrant reader + colpali-api + vLLM. Reads from the shared index.

Spheron supports InfiniBand interconnect on multi-node configurations, which keeps cross-node embedding and search latency low. Provision both nodes from the Spheron dashboard and connect them via the private network.

Benchmarks: ColPali vs Text RAG vs Hybrid on Enterprise PDF Corpora

ViDoRe (Visual Document Retrieval Benchmark) is the standard benchmark for this class of models. Results below are from the illuin-technology/vidore-benchmark leaderboard as of April 2026, rounded to the nearest percent:

| Approach | Financial PDFs nDCG@5 | Slides nDCG@5 | Scanned docs nDCG@5 | Avg |
|---|---|---|---|---|
| Text RAG (BM25) | 48% | 35% | 28% | 37% |
| Text RAG (dense, BGE-M3) | 62% | 44% | 31% | 46% |
| ColPali-3 | 78% | 82% | 74% | 78% |
| ColQwen2.5-7B | 84% | 87% | 79% | 83% |
| Hybrid (text + ColPali) | 86% | 86% | 76% | 83% |

The hybrid approach (text RAG combined with ColPali reranking) is not always better than ColPali alone. On scanned documents, OCR errors in the text pipeline degrade the hybrid score below ColPali-only. If your corpus includes a significant share of scanned or photographed documents, skip the OCR pipeline and use ColQwen2.5 exclusively.

Financial PDFs show the largest gap between text-only and visual retrieval: 62% nDCG@5 for dense text retrieval vs 84% for ColQwen2.5. The reason is that financial documents are packed with multi-line tables, footnotes, and mixed numeric formats that OCR parsers routinely misread.

Cost Comparison: Managed RAG APIs vs Self-Host on GPU Cloud

Managed document AI APIs charge per page indexed and per query processed. Self-hosting ColQwen2.5 on Spheron charges for GPU time only.

Indexing cost per 1,000 pages:

| Option | Cost per 1K pages | Notes |
|---|---|---|
| Azure AI Document Intelligence | $1.50 | Standard tier, text extraction + layout |
| Amazon Textract (forms + tables) | $1.50-6.00 | Per-page, varies by feature tier |
| Self-host ColQwen2.5 (B200 on-demand, $6.73/hr) | ~$0.04 | 1K pages in ~21 seconds at 48 pages/sec |
| Self-host ColQwen2.5 (A100 spot, $1.15/hr) | ~$0.03 | 1K pages in ~83 seconds at 12 pages/sec |

Query cost per 1,000 queries (retrieval + VLM generation):

| Option | Cost per 1K queries | Notes |
|---|---|---|
| OpenAI GPT-4o with Vision | $5-25 | Depends on image resolution and output length |
| Anthropic Claude 3.5 Sonnet | $4-15 | Input/output token pricing |
| Self-host (H200 on-demand, $4.54/hr) | ~$5-15 | At 5-15 queries/minute with 7B VLM |

The breakeven depends on which workload dominates. For indexing, self-hosting ColQwen2.5 runs 37x cheaper per page than Azure Document Intelligence regardless of GPU tier. For query serving, self-hosting on H200 on-demand at $4.54/hr is cost-competitive with managed API pricing when sustained throughput stays above roughly 10 queries/minute. Below that threshold, managed APIs carry simpler pricing and comparable cost.
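
The query-side threshold follows directly from the hourly rate: a GPU billed per hour gets cheaper per query the harder you drive it. A small sketch of that arithmetic, using the H200 on-demand rate quoted above:

python
def selfhost_cost_per_1k_queries(usd_per_hour: float, queries_per_minute: float) -> float:
    """Amortized GPU cost per 1,000 queries at a sustained query rate."""
    return usd_per_hour / (queries_per_minute * 60) * 1000

# H200 on-demand at $4.54/hr, from the table above
for qpm in (5, 10, 15):
    cost = selfhost_cost_per_1k_queries(4.54, qpm)
    print(f"{qpm:>2} queries/min -> ~${cost:.2f} per 1K queries")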

Pricing fluctuates based on GPU availability. The prices above are as of 30 Apr 2026 and may have changed. Check current GPU pricing → for live rates.


ColPali's indexing phase is memory-bandwidth-bound and runs fastest on high-bandwidth GPUs like the H200 and B200, and colocating the retriever with the VLM answerer on one node cuts query latency by eliminating cross-network hops.

Rent H200 → | Rent B200 → | View all GPU pricing →
