Tutorial

ColPali and Multimodal Document RAG on GPU Cloud: Visual PDF Retrieval Without OCR (2026)

Written by Mitrasish, Co-founder · Apr 30, 2026
Tags: ColPali Deployment · Multimodal Document RAG · Visual Document Retrieval GPU · PDF RAG Without OCR · ColQwen Self-Host · Vector Database GPU · VLM Inference · GPU Cloud · ColBERT Late Interaction · Multi-Vector Search

OCR-based PDF RAG breaks on real enterprise documents because OCR discards the spatial relationships between tables, charts, and text. A multi-column financial report or a slide deck full of diagrams loses most of its information when you flatten it to a string. ColPali takes a different path: it treats every PDF page as an image and builds patch-level embeddings directly from the visual signal, skipping OCR entirely. For context on the broader RAG stack before diving into the visual layer, see the agentic RAG infrastructure guide.

The catch is that visual retrieval requires more GPU memory and careful system design. This guide covers the ColPali and ColQwen2.5 model families, GPU sizing for indexing workloads, vector database configuration for multi-vector late-interaction search, and a complete end-to-end deployment on Spheron GPU cloud.

Why OCR-Based RAG Fails on Real Documents

Three failure modes compound when document structure is complex:

OCR errors accumulate on scanned and complex-layout pages. Character recognition accuracy drops from 95-99% on clean typewritten text to 60-80% on photocopied documents, handwritten annotations, or pages with mixed fonts and watermarks. At 80% accuracy, a 500-word page produces roughly 100 garbled tokens. Embed that text and you get an embedding that does not represent what the page actually says.

Tables and charts lose their structure when flattened to text. A revenue table with five product lines and twelve monthly columns becomes a linear sequence of numbers with no positional context. A bar chart becomes its axis labels and nothing else. The embedding model has no way to recover the structural relationships that give those numbers meaning.

Multi-column layouts produce garbled reading order. pdfplumber, pymupdf, and Unstructured.io all use bounding-box heuristics to reconstruct reading order. On dense two-column academic papers or three-column news layouts, these heuristics misorder sentences at a rate that makes the resulting text nearly unusable for precise retrieval.

According to document AI research, roughly 80% of enterprise PDFs contain at least one table, chart, or complex layout element. For those documents, OCR-based RAG starts with corrupted or structurally degraded input. ColPali sidesteps this by treating the page as a visual object.

How ColPali Works: Late-Interaction Over Patch Embeddings

ColPali and its successors (ColQwen2.5, ColSmolVLM, ColInternVL) use a vision-language model to encode each document page as a grid of patch tokens, then project each patch to a 128-dimensional embedding vector.

ColPali-3 (used in this post as shorthand for the PaliGemma-3B-based variant; the official release series is vidore/colpali-v1.2 and related versions on HuggingFace) uses PaliGemma as the base VLM. At 448x448 pixel resolution, the ViT encoder produces 1030 patch tokens per page. Each token is projected to 128 dims via a learned linear layer. That gives you 1030 vectors per page, each capturing a local visual region.

Retrieval uses ColBERT-style late interaction, not a single aggregated vector. For a query, you embed the query text (or query image) into its own set of patch-like vectors. Then you compute MaxSim: for each query vector, find the maximum cosine similarity with any document patch vector. Sum the per-query-vector MaxSim scores across all query vectors to get the final page score.

This has a concrete benefit over standard dense retrieval. Standard dense RAG compresses all patch information into one vector via mean-pooling, losing positional and local visual information. MaxSim late interaction can match a query about "the Q3 revenue figure" to the specific table cell patch that contains it, even if the rest of the page is irrelevant.
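
To make the scoring concrete, here is a minimal MaxSim sketch in PyTorch. In production the vector database computes this server-side, so this is for illustration only; the toy tensors are random placeholders.

python
import torch
import torch.nn.functional as F

def maxsim_score(query_vecs: torch.Tensor, page_vecs: torch.Tensor) -> float:
    """ColBERT-style late-interaction score between one query and one page.

    query_vecs: (n_query_tokens, 128), L2-normalized
    page_vecs:  (n_patches, 128), L2-normalized
    """
    # Cosine similarity of every query token against every page patch
    sim = query_vecs @ page_vecs.T              # (n_query_tokens, n_patches)
    # Each query token keeps only its best-matching patch; sum over query tokens
    return sim.max(dim=1).values.sum().item()

# Toy example: 20 query tokens vs 1030 page patches, 128-dim each
q = F.normalize(torch.randn(20, 128), dim=-1)
p = F.normalize(torch.randn(1030, 128), dim=-1)
print(maxsim_score(q, p))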

| Approach | Representation | Scoring | Layout Preserved |
|---|---|---|---|
| Text RAG (dense) | 1 vector per chunk | Cosine similarity | No |
| ColPali | N patch vectors per page | MaxSim late interaction | Yes |
| Hybrid (text + image) | chunk vector + caption vector | Weighted sum | Partial |

The multi-vector representation is what makes Qdrant or Milvus (with multi-vector collection support) a requirement. Standard vector databases that expect one vector per document cannot store or score ColPali indexes without custom extensions.

Picking a Model: ColPali, ColQwen2.5, ColSmolVLM, ColInternVL

All of these models are served through the same colpali-engine library. The ColPali name is often used loosely for any ColBERT-style visual retrieval model, but strictly it refers to the PaliGemma-based variant.

| Model | Base VLM | VRAM (FP16) | Patches per page | ViDoRe nDCG@5 | Best for |
|---|---|---|---|---|---|
| ColPali-3 | PaliGemma-3B | ~8GB | 1030 (448px) | ~78% | General retrieval, good baseline |
| ColQwen2.5-7B | Qwen2.5-VL-7B | ~16GB | ~196-1030 (varies) | ~84% | Production English/multilingual |
| ColSmolVLM-256M | SmolVLM-256M | ~2GB | ~196 | ~65% | Edge deployment, real-time batch |
| ColSmolVLM-500M | SmolVLM-500M | ~3GB | ~196 | ~70% | Resource-constrained environments |
| ColInternVL2-4B | InternVL2-4B | ~10GB | varies | ~80% | Chinese/multilingual documents |

For most production workloads in English or multilingual settings, ColQwen2.5-7B is the right starting point. It has the highest retrieval accuracy on the ViDoRe benchmark and handles mixed-language document corpora well. Use ColSmolVLM if VRAM budget is tight or you need real-time indexing on smaller hardware.

Index Size and Storage Math

The storage cost is substantial and worth planning before you provision.

ColPali-3 at 448x448px (1030 patches per page):

  • Each patch: 128-dim float32 = 512 bytes
  • Per page: 1030 patches × 512 bytes = 527KB
  • 100K pages: ~53GB raw vector storage
  • 1M pages: ~527GB raw vector storage
  • With int8 quantization: ~132GB

ColQwen2.5 at default resolution (~196 patches per page):

  • Per page: 196 patches × 512 bytes = 100KB
  • 100K pages: ~10GB
  • 1M pages: ~100GB raw
  • With int8 quantization: ~25GB

| Corpus Size | ColPali-3 (FP32) | ColPali-3 (int8) | ColQwen2.5 (FP32) | ColQwen2.5 (int8) |
|---|---|---|---|---|
| 10K pages | 5.3GB | 1.3GB | 1GB | 0.25GB |
| 100K pages | 53GB | 13GB | 10GB | 2.5GB |
| 1M pages | 527GB | 132GB | 100GB | 25GB |
| 10M pages | 5.3TB | 1.3TB | 1TB | 250GB |

Qdrant supports both in-memory and on-disk indexes via memory-mapped files. For corpora over 100K pages, use on-disk storage with on_disk: true in the vector config. For large-scale deployments, int8 scalar quantization is nearly lossless for retrieval quality (typically less than 1% nDCG@5 degradation) while cutting storage and bandwidth requirements by 4x.
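
As a concrete starting point, the snippet below creates a large-corpus collection with both settings enabled via the Python client. The collection name is illustrative; on_disk and scalar quantization are standard qdrant-client configuration, but treat the exact values as a sketch to tune for your corpus.

python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, MultiVectorConfig, MultiVectorComparator,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="colpali_docs_large",
    vectors_config={
        "colpali": VectorParams(
            size=128,
            distance=Distance.COSINE,
            on_disk=True,  # memory-map original vectors instead of holding them in RAM
            multivector_config=MultiVectorConfig(
                comparator=MultiVectorComparator.MAX_SIM
            ),
        )
    },
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True)
    ),
)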

GPU Requirements for Indexing at Scale

Indexing is memory-bandwidth-bound, not compute-bound. The forward pass over patch tokens moves large activation tensors through HBM on every batch. GPUs with higher HBM bandwidth process more pages per second at the same batch size.

HBM bandwidth comparison:

  • A100 80GB: 2.0 TB/s
  • H100 SXM5: 3.35 TB/s
  • H200 SXM5: 4.8 TB/s (HBM3e)
  • B200 SXM6: 8.0 TB/s (HBM3e). For architecture details and workload benchmarks, see the complete B200 guide.

Throughput scales roughly linearly with bandwidth for pure embedding workloads (batch size 8, FP16):

| GPU | HBM Bandwidth | ColQwen2.5-7B throughput | ColPali-3 throughput |
|---|---|---|---|
| A100 80GB | 2.0 TB/s | ~12 pages/sec | ~18 pages/sec |
| H100 SXM5 | 3.35 TB/s | ~20 pages/sec | ~30 pages/sec |
| H200 SXM5 | 4.8 TB/s | ~28 pages/sec | ~42 pages/sec |
| B200 SXM6 | 8.0 TB/s | ~48 pages/sec | ~70 pages/sec |

These are approximate figures based on memory bandwidth scaling. Always benchmark your specific document resolution and batch size.

Time and cost to index 1M pages:

| GPU | Time (ColQwen2.5) | On-demand cost |
|---|---|---|
| A100 80GB | ~23 hours | ~$24 |
| H100 SXM5 | ~14 hours | ~$41 |
| H200 SXM5 | ~10 hours | ~$45 |
| B200 SXM6 | ~6 hours | ~$40 |

H200 and B200 deliver 2-4x higher indexing throughput than A100, which matters when you need to process millions of pages quickly. For a large one-time indexing run, renting an H200 or B200 on Spheron finishes the job in hours rather than days, at higher on-demand rates. For cost-sensitive batch workloads where total spend matters more than turnaround time, the A100 on-demand at $1.04/hr is the most economical option.
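
For planning your own run, a back-of-the-envelope estimator using the approximate throughput and pricing figures above (pure arithmetic, not a benchmark):

python
def indexing_estimate(pages: int, pages_per_sec: float, usd_per_hour: float) -> tuple[float, float]:
    """Return (hours, usd) to embed `pages` pages at a given throughput and GPU rate."""
    hours = pages / pages_per_sec / 3600
    return hours, hours * usd_per_hour

# 1M pages with ColQwen2.5 on an H200 at ~28 pages/sec and $4.54/hr
hours, cost = indexing_estimate(1_000_000, 28, 4.54)
print(f"~{hours:.0f} hours, ~${cost:.0f}")  # ~10 hours, ~$45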

The Serving Stack: Vector DB, Late-Interaction Reranking, and Query Latency

ColPali retrieval has three layers: query encoding, vector search, and (optionally) VLM answer generation.

Query latency breakdown (single query, collocated GPU node):

  1. Query embedding: 20-50ms (ColQwen2.5-7B on GPU)
  2. MaxSim vector search over Qdrant: 5-50ms (depends on corpus size and index type)
  3. VLM generation for answer extraction: 500ms-2s (7B model, one page image)

Vector database comparison:

| DB | Multi-vector (ColBERT) | MaxSim native | GPU indexing | Cloud/self-host |
|---|---|---|---|---|
| Qdrant | Yes | Yes | CPU + GPU offload | Both |
| Milvus | Yes (v2.4+) | Yes | Yes | Both |
| pgvector | No | No (custom required) | No | Self-host only |
| Weaviate | No native | No (custom required) | No | Both |

Qdrant is the recommended choice for getting started: the multi-vector collection API is straightforward, and MaxSim scoring is a first-class feature. Use Milvus if you need GPU-accelerated index building at very large scale (10M+ pages) and are comfortable with the operational complexity.

For the query embedding server specifically, running a self-hosted embedding server on the same node as Qdrant eliminates the 100-400ms network hop that managed embedding APIs add. Colocation cuts p99 query latency by roughly 2-4x compared to routing queries through separate cloud services.

End-to-End Pipeline: Ingestion, Indexing, Retrieval, and VLM Generation

Phase 1: PDF Ingestion

Convert each PDF page to an image before passing it to the model. pdf2image wraps poppler and handles most PDF variants:

python
from pdf2image import convert_from_path
import PIL.Image

def render_pages(pdf_path: str, dpi: int = 150) -> list[PIL.Image.Image]:
    pages = convert_from_path(pdf_path, dpi=dpi)
    # ColPali-3 requires exactly 448x448 input for its fixed ViT encoder
    return [p.resize((448, 448)) for p in pages]

For ColQwen2.5, you can pass higher-resolution images and let the model's dynamic resolution tiling handle it. For ColPali-3, resize to 448x448 before batch processing.
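
For ColQwen2.5 specifically, a variant of the render helper that keeps the original aspect ratio and only caps the longest side is usually a better fit. The 1536px cap below is an illustrative choice, not a model requirement:

python
def render_pages_colqwen(pdf_path: str, dpi: int = 150, max_side: int = 1536) -> list[PIL.Image.Image]:
    pages = convert_from_path(pdf_path, dpi=dpi)
    resized = []
    for p in pages:
        scale = max_side / max(p.size)
        if scale < 1:  # only downscale; ColQwen's dynamic resolution tiling handles the rest
            p = p.resize((int(p.width * scale), int(p.height * scale)))
        resized.append(p)
    return resized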

Phase 2: Batch Indexing

python
import uuid
import torch
from colpali_engine.models import ColQwen2_5, ColQwen2_5Processor
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, MultiVectorConfig, MultiVectorComparator
)

# Load model
model = ColQwen2_5.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
).eval()
processor = ColQwen2_5Processor.from_pretrained("vidore/colqwen2.5-v0.2")

# Create Qdrant collection with multi-vector support (requires Qdrant and qdrant-client >= 1.10)
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="colpali_docs",
    vectors_config={
        "colpali": VectorParams(
            size=128,
            distance=Distance.COSINE,
            multivector_config=MultiVectorConfig(
                comparator=MultiVectorComparator.MAX_SIM
            ),
        )
    }
)

def index_pages(pages: list[PIL.Image.Image], doc_id: str, start_page: int = 0):
    batch_size = 8
    all_embeddings = []
    for i in range(0, len(pages), batch_size):
        batch = pages[i:i+batch_size]
        inputs = processor.process_images(batch).to(model.device)
        with torch.no_grad():
            embeddings = model(**inputs)  # shape: (batch, n_patches, 128)
        all_embeddings.extend(embeddings.cpu().float().numpy().tolist())

    points = [
        PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{doc_id}::page::{start_page + idx}")),
            vector={"colpali": emb},
            payload={"doc_id": doc_id, "page": start_page + idx}
        )
        for idx, emb in enumerate(all_embeddings)
    ]
    client.upsert(collection_name="colpali_docs", points=points)
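
To tie the two phases together, a minimal driver that walks a directory of PDFs, renders each one with the ColQwen helper above, and indexes every page. The directory layout and doc_id convention are illustrative:

python
from pathlib import Path

def index_corpus(pdf_dir: str):
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        pages = render_pages_colqwen(str(pdf_path))   # Phase 1: render pages to images
        index_pages(pages, doc_id=pdf_path.stem)      # Phase 2: embed and upsert into Qdrant
        print(f"indexed {pdf_path.name}: {len(pages)} pages")

index_corpus("./documents")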

Phase 3: Query and Retrieval

python
def retrieve(query_text: str, top_k: int = 5) -> list[dict]:
    inputs = processor.process_queries([query_text]).to(model.device)
    with torch.no_grad():
        query_embedding = model(**inputs)  # shape: (1, n_query_tokens, 128)

    results = client.query_points(
        collection_name="colpali_docs",
        query=query_embedding[0].cpu().float().numpy().tolist(),
        using="colpali",
        limit=top_k,
    )
    return [r.payload for r in results.points]
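
A quick way to sanity-check the latency breakdown from the serving-stack section against your own deployment is to time the call directly (a simple sketch; the query string is illustrative):

python
import time

t0 = time.perf_counter()
hits = retrieve("What was Q3 revenue by product line?", top_k=5)
elapsed_ms = (time.perf_counter() - t0) * 1000

print(f"query embedding + MaxSim search: {elapsed_ms:.0f} ms")
for hit in hits:
    print(hit["doc_id"], "page", hit["page"])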

Phase 4: VLM Answer Generation

Pass the retrieved page images to a VLM for answer extraction. Using vLLM's OpenAI-compatible endpoint (see the VLM deployment guide for setup details):

python
import base64, httpx

def answer_from_pages(query: str, page_images: list[PIL.Image.Image]) -> str:
    import io

    def encode_image(img: PIL.Image.Image) -> str:
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        return base64.b64encode(buf.getvalue()).decode()

    content = []
    for img in page_images[:3]:  # limit to top-3 pages to control token cost
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(img)}"}
        })
    content.append({"type": "text", "text": query})

    response = httpx.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "messages": [{"role": "user", "content": content}],
            "max_tokens": 512
        },
        timeout=30
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
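
Putting retrieval and generation together: the sketch below re-renders the retrieved pages from the source PDFs before passing them to the VLM. It assumes each doc_id maps to pdf_dir/<doc_id>.pdf, matching the indexing driver above; that storage layout is an assumption of this example, not part of ColPali.

python
def ask(query: str, pdf_dir: str = "./documents", top_k: int = 3) -> str:
    hits = retrieve(query, top_k=top_k)
    page_images = []
    for hit in hits:
        # Assumed layout: one PDF per doc_id under pdf_dir (illustrative, not required)
        pages = render_pages_colqwen(f"{pdf_dir}/{hit['doc_id']}.pdf")
        page_images.append(pages[hit["page"]])
    return answer_from_pages(query, page_images)

print(ask("What was Q3 revenue by product line?"))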

Production Deployment on Spheron

Single-Node Setup (Up to ~500K Pages)

One H200 141GB node handles the full stack: ColQwen2.5-7B retriever, Qdrant in-memory, and a 7B VLM for answer generation.

Docker Compose configuration:

yaml
version: "3.9"
services:
  qdrant:
    image: qdrant/qdrant:v1.11.0
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  colpali-api:
    build: ./colpali-service
    ports:
      - "8080:8080"
    environment:
      - QDRANT_URL=http://qdrant:6333
      - MODEL_ID=vidore/colqwen2.5-v0.2
    depends_on:
      - qdrant
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  vllm-serve:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    command: >
      --model Qwen/Qwen2.5-VL-7B-Instruct
      --tensor-parallel-size 1
      --max-model-len 16384
      --gpu-memory-utilization 0.35
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

volumes:
  qdrant_data:

The GPU memory split on an H200 141GB: ColQwen2.5-7B takes ~16GB, Qdrant in-memory for 500K pages at ColQwen resolution takes ~50GB, and Qwen2.5-VL-7B via vLLM with --gpu-memory-utilization 0.35 takes ~49GB (weights plus KV cache). Total: ~115GB, leaving ~26GB of headroom.
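
The compose file builds the colpali-api image from ./colpali-service, which is not shown above. A minimal sketch of what that service could look like, wrapping the indexing and retrieval functions from the pipeline section in a FastAPI app; the endpoint names, request schema, and the pipeline module import are illustrative assumptions:

python
# colpali-service/app.py -- illustrative skeleton, not a hardened service
import os

from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel

# index_pages, retrieve, render_pages_colqwen: the functions from the pipeline
# section, assumed here to live in a local module named `pipeline`
from pipeline import index_pages, render_pages_colqwen, retrieve

app = FastAPI()

class QueryRequest(BaseModel):
    query: str
    top_k: int = 5

@app.post("/index")
async def index_pdf(file: UploadFile = File(...)):
    # Persist the upload, render its pages, then embed and upsert into Qdrant
    tmp_path = f"/tmp/{file.filename}"
    with open(tmp_path, "wb") as f:
        f.write(await file.read())
    pages = render_pages_colqwen(tmp_path)
    doc_id = os.path.splitext(file.filename)[0]
    index_pages(pages, doc_id=doc_id)
    return {"doc_id": doc_id, "pages_indexed": len(pages)}

@app.post("/search")
def search(req: QueryRequest):
    return {"results": retrieve(req.query, top_k=req.top_k)}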

Multi-Node Setup (1M+ Pages)

For corpora over 1M pages, split across two roles:

  • Indexing node: B200 for batch embedding jobs. Runs only the colpali-engine indexing pipeline. Writes to a shared Qdrant cluster.
  • Inference node: H200 for VLM answer generation. Runs Qdrant reader + colpali-api + vLLM. Reads from the shared index.

Spheron supports InfiniBand interconnect on multi-node configurations, which keeps cross-node embedding and search latency low. Provision both nodes from the Spheron dashboard and connect them via the private network.

Benchmarks: ColPali vs Text RAG vs Hybrid on Enterprise PDF Corpora

ViDoRe (Visual Document Retrieval Benchmark) is the standard benchmark for this class of models. Results below are from the illuin-technology/vidore-benchmark leaderboard as of April 2026, rounded to the nearest percent:

| Approach | Financial PDFs nDCG@5 | Slides nDCG@5 | Scanned docs nDCG@5 | Avg |
|---|---|---|---|---|
| Text RAG (BM25) | 48% | 35% | 28% | 37% |
| Text RAG (dense, BGE-M3) | 62% | 44% | 31% | 46% |
| ColPali-3 | 78% | 82% | 74% | 78% |
| ColQwen2.5-7B | 84% | 87% | 79% | 83% |
| Hybrid (text + ColPali) | 86% | 86% | 76% | 83% |

The hybrid approach (text RAG combined with ColPali reranking) is not always better than ColPali alone. On scanned documents, OCR errors in the text pipeline degrade the hybrid score below ColPali-only. If your corpus includes a significant share of scanned or photographed documents, skip the OCR pipeline and use ColQwen2.5 exclusively.

Financial PDFs show the largest gap between text-only and visual retrieval: 62% nDCG@5 for dense text retrieval vs 84% for ColQwen2.5. The reason is that financial documents are packed with multi-line tables, footnotes, and mixed numeric formats that OCR parsers routinely misread.

Cost Comparison: Managed RAG APIs vs Self-Host on GPU Cloud

Managed document AI APIs charge per page indexed and per query processed. Self-hosting ColQwen2.5 on Spheron charges for GPU time only.

Indexing cost per 1,000 pages:

| Option | Cost per 1K pages | Notes |
|---|---|---|
| Azure AI Document Intelligence | $1.50 | Standard tier, text extraction + layout |
| Amazon Textract (forms + tables) | $1.50-6.00 | Per-page, varies by feature tier |
| Self-host ColQwen2.5 (B200 on-demand, $6.73/hr) | ~$0.04 | 1K pages in ~21 seconds at 48 pages/sec |
| Self-host ColQwen2.5 (A100 spot, $1.15/hr) | ~$0.03 | 1K pages in ~83 seconds at 12 pages/sec |

Query cost per 1,000 queries (retrieval + VLM generation):

| Option | Cost per 1K queries | Notes |
|---|---|---|
| OpenAI GPT-4o with Vision | $5-25 | Depends on image resolution and output length |
| Anthropic Claude 3.5 Sonnet | $4-15 | Input/output token pricing |
| Self-host (H200 on-demand, $4.54/hr) | ~$5-15 | At 5-15 queries/minute with 7B VLM |

The breakeven depends on which workload dominates. For indexing, self-hosting ColQwen2.5 runs 37x cheaper per page than Azure Document Intelligence regardless of GPU tier. For query serving, self-hosting on H200 on-demand at $4.54/hr is cost-competitive with managed API pricing when sustained throughput stays above roughly 10 queries/minute. Below that threshold, managed APIs carry simpler pricing and comparable cost.
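
The query-side threshold follows directly from the hourly rate: a GPU billed per hour gets cheaper per query the harder you drive it. A small sketch of that arithmetic, using the H200 on-demand rate quoted above:

python
def selfhost_cost_per_1k_queries(usd_per_hour: float, queries_per_minute: float) -> float:
    """Amortized GPU cost per 1,000 queries at a sustained query rate."""
    return usd_per_hour / (queries_per_minute * 60) * 1000

# H200 on-demand at $4.54/hr, from the table above
for qpm in (5, 10, 15):
    cost = selfhost_cost_per_1k_queries(4.54, qpm)
    print(f"{qpm:>2} queries/min -> ~${cost:.2f} per 1K queries")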

Pricing fluctuates based on GPU availability. The prices above are as of 30 Apr 2026 and may have changed. Check current GPU pricing → for live rates.


ColPali's indexing phase is memory-bandwidth-bound and runs fastest on high-bandwidth GPUs like the H200 and B200, and colocating the retriever with the VLM answerer on one node cuts query latency by eliminating cross-network hops.

Rent H200 → | Rent B200 → | View all GPU pricing →
