Deploy DeepSeek-OCR on GPU Cloud: Self-Host Production Document and Visual OCR Inference (2026 Setup Guide)

DeepSeek-OCR turns any GPU server into a document extraction API that matches or beats cloud OCR services on accuracy, at a fraction of the cost. It processes images as vision-language token sequences, which means it natively handles tables, mixed scripts, handwriting, and structured forms that classical OCR pipelines garble. This guide walks through the full production setup: hardware sizing, vLLM and SGLang container configuration, a FastAPI service for PDF and image pipelines, cost-per-million-pages benchmarks against AWS Textract and GCP Document AI, and RAG integration with Qdrant and ColPali.

What DeepSeek-OCR Is and Why It Outperforms Tesseract, PaddleOCR, and Donut

Classical OCR tools like Tesseract and PaddleOCR work through a pipeline of steps: image preprocessing, text region detection, character segmentation, and character classification. Each step introduces error. On clean, high-contrast printed text that pipeline works acceptably. Put a photocopied table, a mixed-script invoice, or handwritten annotations in front of it and accuracy falls fast.

DeepSeek-OCR approaches the problem differently. It treats the page as a visual input to a transformer encoder, processes the image as a sequence of patch tokens, and generates the extracted text autoregressively through a language model decoder. This means layout understanding, table structure, and text semantics are all handled in one pass, and the output format is controllable through the prompt. You can ask for markdown tables, JSON key-value pairs, or plain text without any post-processing logic.

Metric	Tesseract	PaddleOCR	Donut	DeepSeek-OCR
Architecture	Classical pipeline	CNN+CTC	Transformer seq2seq	Vision-language model
Handwriting	Poor	Fair	Fair	Good
Table extraction	Manual post-process	Limited	No	Native via prompt
Mixed scripts	Limited	Fair	Limited	Strong
API-compatible	No	No	No	Yes (OpenAI vision spec)
VRAM at batch=1	None	None	~4 GB	~8 GB

The trade-off is significant: DeepSeek-OCR needs a GPU (about 8 GB of VRAM at batch=1, with >=16 GB recommended and more for larger batch sizes), while Tesseract runs on CPU alone, and it is slower at very high DPI. For teams processing simple, clean printed documents at extremely high volume, a classical pipeline is still cheaper at the margin. For everyone dealing with enterprise documents, scanned PDFs, or structured data extraction, the accuracy difference makes DeepSeek-OCR worth the GPU cost.

For teams that want to skip text extraction entirely and retrieve documents by visual similarity, the ColPali multimodal document RAG guide covers patch-level embedding over raw page images without any OCR step.

If you want a model comparison before committing to DeepSeek-OCR, the open-source OCR VLM comparison for 2026 benchmarks DeepSeek-OCR against PaddleOCR-VL-1.6, GOT-OCR 2.0, and Granite-Docling across VRAM, throughput, and cost per 10,000 pages.

Hardware Requirements: VRAM, Throughput, and Token-Compression Math

Understanding DeepSeek-OCR's VRAM profile starts with its architecture. Rather than a single ViT encoder, it uses a DeepEncoder that combines SAM (window attention) and CLIP (global attention), feeding into a 16x convolutional compressor. This "Contexts Optical Compression" achieves 7-20x reduction in visual tokens: at 10x compression the model retains 97% of full accuracy, and in Base mode (256 vision tokens per page) it handles most printed documents without quality loss. Tiny mode produces 64 tokens, Small 100, and Large 400 for high-fidelity work. At 256 visual tokens plus up to 4,096 output tokens, a single page sits under 5,000 context tokens in Base mode.
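The token arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of the model's API: the mode names and token counts are the ones quoted in the paragraph, while the helper names and the 4,096-token output cap are assumptions taken from this guide's own settings.

```python
# Vision-token counts per resolution mode, as quoted in the paragraph above.
VISION_TOKENS = {"tiny": 64, "small": 100, "base": 256, "large": 400}


def page_context_budget(mode: str, max_output_tokens: int = 4096,
                        prompt_tokens: int = 0) -> int:
    """Worst-case context tokens one page occupies: vision + prompt + output."""
    return VISION_TOKENS[mode] + prompt_tokens + max_output_tokens


def compression_ratio(text_tokens: int, mode: str) -> float:
    """Text tokens on the page divided by the vision tokens used to encode it."""
    return text_tokens / VISION_TOKENS[mode]


if __name__ == "__main__":
    for mode in VISION_TOKENS:
        print(f"{mode:>5}: {page_context_budget(mode)} context tokens per page")
```

A page holding 2,560 text tokens encoded in Base mode corresponds to the 10x compression point mentioned above.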

At BF16, the model weights are roughly 6.7 GB: a DeepEncoder at ~380M params and a 3B-class MoE decoder with 570M active parameters (6 of 64 experts active per token). The Hugging Face model card recommends >=16 GB VRAM. An L40S at 48 GB comfortably handles batch=16 with significant headroom; the RTX Pro 6000 at 96 GB supports batch=32 and above. KV cache overhead is substantially lower than a 7B model given the smaller active parameter count.

The table below shows throughput estimates for each GPU. Figures are estimates based on the model's actual architecture (570M active params, MoE decoder); actual results vary by DPI, compression mode, and output length.

GPU	VRAM	Batch size at BF16	Throughput (pages/min, est.)	On-Demand $/hr	Spot $/hr
NVIDIA L40S	48 GB	16	~35	$0.72	N/A
NVIDIA RTX Pro 6000	96 GB	32	~85	$1.70	$0.59
NVIDIA RTX 5090	32 GB	12	~40	$0.924	N/A
NVIDIA H100 SXM5	80 GB	24	~88	$3.10	$0.80

Pricing fluctuates based on GPU availability. The prices above are as of 3 May 2026 and may have changed. Check current GPU pricing for live rates.

For batch document processing where throughput per dollar is the priority, the L40S GPU rental is the lowest-cost on-demand entry point. The L40S delivers 35 pages/min at $0.72/hr on-demand, which works out to roughly 2,100 pages per GPU-hour, or about 2,900 pages per dollar. The RTX Pro 6000 is close on-demand (85 pages/min at $1.70/hr, about 3,000 pages per dollar) and well ahead at its $0.59/hr spot rate (about 8,600 pages per dollar). For latency-sensitive serving or long PDFs where you want headroom for larger batches, RTX Pro 6000 on Spheron gives 96 GB of GDDR7 and handles batch=32 without swapping. If you need H100-class memory bandwidth for mixed LLM and OCR workloads on the same node, H100 SXM5 instances are the right choice.
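To rerun the throughput-per-dollar comparison as prices move, a short sketch along these lines may help. The throughput and price figures are copied from the hardware table above; the function names are illustrative.

```python
# (pages/min estimate, on-demand $/hr) copied from the hardware table above.
GPUS = {
    "L40S": (35, 0.72),
    "RTX Pro 6000": (85, 1.70),
    "RTX 5090": (40, 0.924),
    "H100 SXM5": (88, 3.10),
}


def pages_per_hour(pages_per_min: float) -> float:
    """Raw pages per GPU-hour at the estimated throughput."""
    return pages_per_min * 60


def pages_per_dollar(pages_per_min: float, hourly_rate: float) -> float:
    """Pages processed per dollar of GPU time, before any utilization haircut."""
    return pages_per_hour(pages_per_min) / hourly_rate


if __name__ == "__main__":
    for name, (ppm, rate) in GPUS.items():
        print(f"{name:<14} {pages_per_hour(ppm):>6.0f} pages/hr  "
              f"{pages_per_dollar(ppm, rate):>6.0f} pages/$")
```

Swap in the spot rates from the same table to see how the ranking shifts when interruptible capacity is acceptable.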

Container Setup with vLLM and SGLang Backends

vLLM Setup

vLLM's OpenAI-compatible server handles vision-language models natively through its multimodal API. DeepSeek-OCR follows the same setup path as any other VLM.

bash

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-OCR \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --limit-mm-per-prompt image=4 \
  --served-model-name deepseek-ocr

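The --limit-mm-per-prompt image=4 flag allows up to four images per request, which is useful when batching several pages into a single API call. Recent vLLM releases document this flag as a JSON string (--limit-mm-per-prompt '{"image": 4}'), so use that form if your image tag rejects the key=value syntax. Set --max-model-len based on your VRAM: 16,384 tokens works on L40S at batch=16. For H100, you can push to 32,768 tokens.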
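As a sketch of how a client might use the four-image limit, the helper below groups page images and builds one OpenAI-style chat payload per group. The function names are hypothetical, and whether DeepSeek-OCR transcribes several pages in one prompt as well as one page per prompt is an assumption to verify on your own documents.

```python
import base64


def chunk(items: list, size: int) -> list[list]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_payload(page_jpegs: list[bytes], prompt: str,
                  model: str = "deepseek-ocr", max_images: int = 4) -> dict:
    """Build one chat-completions request carrying several JPEG pages."""
    if len(page_jpegs) > max_images:
        raise ValueError(f"at most {max_images} images per request")
    content = [
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,"
                              + base64.b64encode(page).decode()}}
        for page in page_jpegs
    ]
    content.append({"type": "text", "text": prompt})
    return {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 4096,
        "temperature": 0,
    }
```

Keep max_images in step with the server flag; a request that exceeds the server's limit is rejected rather than truncated.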

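For FP8 quantization on H100 (roughly 1.5-2x throughput improvement), add --quantization fp8. The --dtype flag only selects the floating-point compute type (for example bfloat16 or float16) and does not accept fp8. For detailed multi-GPU tensor-parallel setup and FP8 flags, the vLLM production deployment guide covers the same configuration flow for vision-language models.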

SGLang Setup

bash

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-OCR \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

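SGLang's RadixAttention prefix cache is the key advantage here for document workloads. If you are processing thousands of pages from the same document set, many requests share the same system prompt and layout instructions. RadixAttention caches the KV state for those shared prefixes and avoids recomputing attention for them on each request. The cache only reuses tokens that are identical from the start of the prompt, so put the shared instructions before the page image in each request; pages that merely look alike (repeated headers or footers) still produce different image tokens. The 40-60% prefix hit rates cited for template-heavy corpora depend on how long the shared preamble is relative to the per-page tokens, so measure the hit rate on your own workload before counting on the throughput gain.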
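A toy model makes the prefix-reuse idea concrete. The sketch below is not SGLang's implementation (RadixAttention keeps a radix tree over all cached requests); it only compares each request with the one before it, and the token IDs are made up for illustration.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(requests: list[list[int]]) -> float:
    """Fraction of all prompt tokens that repeat the previous request's prefix."""
    reused = total = 0
    previous = None
    for tokens in requests:
        total += len(tokens)
        if previous is not None:
            reused += shared_prefix_len(previous, tokens)
        previous = tokens
    return reused / total if total else 0.0
```

Two four-token prompts that differ only in their last token give a hit rate of 3/8 here, which shows why a long shared instruction block placed first matters more than similarity later in the prompt.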

Criterion	vLLM	SGLang
Startup time	~90s	~120s
Prefix caching	Basic (block-level)	RadixAttention (tree-based)
OpenAI API compat	Full	Full
Ecosystem maturity	Higher	Growing
Best for	General serving, one-off requests	Repeated templates, batch OCR

Spheron Deployment

Provision your instance from app.spheron.ai: select the GPU type, choose your region, and pull the vLLM container directly. For the CLI, refer to the Spheron docs for the instance launch flow. Once SSHed in, verify the GPU with nvidia-smi before starting the container.

For the RTX Pro 6000 or H100, a spheron.yml snippet for the vLLM service looks like:

yaml

version: "1.0"
services:
  ocr-server:
    image: vllm/vllm-openai:latest
    expose:
      - port: 8000
        as: 8000
        to:
          - global: true
    command:
      - "--model deepseek-ai/DeepSeek-OCR"
      - "--tensor-parallel-size 1"
      - "--max-model-len 16384"
      - "--limit-mm-per-prompt image=4"
      - "--served-model-name deepseek-ocr"
    resources:
      gpu:
        units: 1
        attributes:
          vendor:
            nvidia:
              - model: rtx-pro-6000

Building a FastAPI OCR Endpoint: PDF, Image, and Multi-Page Document Pipelines

The following service accepts PDF uploads or raw images, runs each page through DeepSeek-OCR in parallel, and returns structured JSON.

python

import asyncio
import base64
import io

import httpx
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from pdf2image import convert_from_bytes

# Requires poppler-utils installed at OS level:
# RUN apt-get update && apt-get install -y poppler-utils
# in your Dockerfile, or the PDF conversion will raise PDFInfoNotInstalledError.
# FastAPI file uploads also need the python-multipart package.

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "deepseek-ocr"  # must match --served-model-name on the vLLM server
PROMPT = "Extract all text from this image. Preserve tables and structure as Markdown."


def pdf_to_jpeg_pages(raw: bytes, dpi: int = 150) -> list[bytes]:
    """Rasterize each PDF page and encode it as JPEG bytes (blocking, CPU-bound)."""
    pages = []
    for img in convert_from_bytes(raw, dpi=dpi):
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG")
        pages.append(buf.getvalue())
    return pages


async def ocr_page(client: httpx.AsyncClient, img_bytes: bytes,
                   mime: str = "image/jpeg") -> str:
    """Send one page image to the vLLM vision endpoint and return its text."""
    img_b64 = base64.b64encode(img_bytes).decode()
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{img_b64}"}},
                {"type": "text", "text": PROMPT},
            ],
        }],
        "max_tokens": 4096,
        "temperature": 0,
    }
    r = await client.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    choices = r.json().get("choices", [])
    return choices[0].get("message", {}).get("content", "") if choices else ""


@app.post("/ocr")
async def ocr_document(file: UploadFile = File(...)):
    raw = await file.read()
    if file.filename and file.filename.lower().endswith(".pdf"):
        # Rasterization is blocking, so run it off the event loop.
        page_bytes = await asyncio.to_thread(pdf_to_jpeg_pages, raw)
        mime = "image/jpeg"
    else:
        page_bytes = [raw]
        # Pass through the uploaded image type instead of assuming JPEG.
        is_image = bool(file.content_type) and file.content_type.startswith("image/")
        mime = file.content_type if is_image else "image/jpeg"

    async with httpx.AsyncClient() as client:
        tasks = [ocr_page(client, b, mime) for b in page_bytes]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    pages = []
    errors = []
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            errors.append({"page": i + 1, "error": str(result)})
        else:
            pages.append({"page": i + 1, "text": result})

    response: dict = {"pages": pages, "page_count": len(results)}
    if errors:
        response["partial_failures"] = errors
    return JSONResponse(response)

A few notes on this implementation:

dpi=150 is the default for printed documents. Increase to 200-300 for handwriting, dense tables, or small fonts at the cost of roughly 1.75-3x more visual tokens per page (see the DPI table in the production tuning section).
temperature=0 gives deterministic output. For mixed handwriting or degraded scans where greedy decoding stalls on ambiguous characters, try temperature=0.1.
For large PDFs (more than 20 pages), break the conversion into groups of 10 pages and use a background task queue (e.g., arq or Celery with Redis). Sending 50 concurrent 6,000-token requests can exhaust the KV-cache memory of a single GPU.
Add slowapi rate limiting on the /ocr endpoint if you are exposing this service publicly. Without it, a burst of large PDFs can fill the request queue and starve every other caller of GPU time.
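The chunking advice above can be sketched without a full task queue. The helper below processes pages in groups and caps the number of in-flight requests with a semaphore; the function name, its parameters, and the ocr_one callable are illustrative stand-ins for the ocr_page call in the service, not part of any library.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def ocr_in_chunks(
    pages: Sequence[bytes],
    ocr_one: Callable[[bytes], Awaitable[str]],
    chunk_size: int = 10,
    max_in_flight: int = 8,
) -> list:
    """OCR pages group by group, never exceeding max_in_flight concurrent calls."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(page: bytes) -> str:
        async with semaphore:
            return await ocr_one(page)

    results: list = []
    for start in range(0, len(pages), chunk_size):
        group = pages[start:start + chunk_size]
        # Failed pages come back as exception objects, in page order.
        results.extend(
            await asyncio.gather(*(guarded(p) for p in group),
                                 return_exceptions=True)
        )
    return results
```

Inside the handler, ocr_one could be a small closure that binds the shared httpx client. A durable queue is still the better fit when jobs must survive a restart.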

Production Tuning: Batching, Page-Level Parallelism, and DPI Trade-Offs

vLLM concurrency: Set --max-num-seqs 16 to allow up to 16 concurrent requests at the vLLM level. For offline batch jobs, this is the primary knob for throughput. For interactive serving where latency matters, drop it to 4-8 to reduce queue depth.

asyncio.gather parallelism: The FastAPI handler above submits all pages simultaneously via asyncio.gather. The bottleneck is GPU compute, not Python thread scheduling. On a loaded system, this means all pages arrive at vLLM at once and compete for the same VRAM-constrained batch slots. Pair this with --max-num-seqs tuning so the queue doesn't overflow.

DPI and token count trade-offs:

DPI	Visual tokens/page (approx.)	Relative accuracy	Relative compute
72	~600	Low	0.3x
150	~2,000	Good for printed text	1x (baseline)
200	~3,500	Better for small fonts	1.75x
300	~6,000	Best, needed for handwriting	3x

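Test at 150 DPI first on your actual corpus. Many enterprise printed documents produce excellent results at 150 DPI with no need to triple compute costs. The token counts in this table are estimates for inputs whose token count scales with image size; in the fixed resolution modes described in the hardware section (for example, Base at 256 vision tokens), the visual token count is set by the mode rather than the DPI.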

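Token compression: If your DeepSeek-OCR build exposes spatial token pooling through its multimodal processor, pass the option through vLLM's --mm-processor-kwargs flag, for example --mm-processor-kwargs '{"image_compression_factor":0.5}'. Treat that key name as a placeholder and confirm it against the model's processor configuration before relying on it. This halves visual tokens at a 3-5% accuracy penalty on clean documents. On an L40S, token compression roughly doubles batch size capacity, and the penalty is usually acceptable for standard printed text.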

Memory management for large documents: For PDFs over 20 pages, chunk processing into groups and use a persistent background queue. A 50-page PDF at 200 DPI generates 50 x 3,500 = 175,000 visual tokens of concurrent context. At batch=8, that saturates L40S VRAM. Process in chunks of 10 pages per batch submission instead.
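The arithmetic behind the chunking recommendation can be checked with a short helper. The per-page token figures are the approximations from the DPI table above; the function names are illustrative.

```python
# Approximate visual tokens per page at each DPI, from the table above.
TOKENS_PER_PAGE = {72: 600, 150: 2000, 200: 3500, 300: 6000}


def concurrent_visual_tokens(pages: int, dpi: int) -> int:
    """Visual tokens in flight if `pages` pages are submitted at once."""
    return pages * TOKENS_PER_PAGE[dpi]


def chunk_ranges(total_pages: int, chunk_size: int = 10) -> list[tuple[int, int]]:
    """Half-open (start, end) page ranges covering the document in order."""
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]


if __name__ == "__main__":
    print(concurrent_visual_tokens(50, 200))  # whole document at once
    print(concurrent_visual_tokens(10, 200))  # one 10-page chunk
    print(chunk_ranges(50))
```

Submitting 10 pages at a time cuts the concurrent visual context from 175,000 tokens to 35,000 at 200 DPI.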

Cost Per 1M Pages: Spheron vs Hyperscaler GPU Cloud Pricing Comparison

The numbers below use the throughput estimates from the hardware table above with the following assumptions:

DPI = 150, avg 2,000 tokens of image input per page
GPU utilization = 85%
No hyperscaler support charges, egress fees, or minimum commitments included
Per-page cost = (hourly rate) / (pages/min x 60 x 0.85 utilization)

Platform	GPU	Pricing model	Per-page cost (est.)	Cost per 1M pages
Spheron (on-demand)	L40S	$0.72/hr	$0.00040	~$403
Spheron (on-demand)	RTX Pro 6000	$1.70/hr	$0.00039	~$392
Spheron (spot)	RTX Pro 6000	$0.59/hr	$0.00014	~$136
Spheron (on-demand)	H100 SXM5	$3.10/hr	$0.00069	~$691
Spheron (spot)	H100 SXM5	$0.80/hr	$0.00018	~$178
AWS Textract	-	per page	$0.0015	$1,500
GCP Document AI	-	per page	$0.0015	$1,500
Azure Form Recognizer	-	per page	$0.0015	$1,500

Pricing fluctuates based on GPU availability. The prices above are as of 3 May 2026 and may have changed. Check current GPU pricing for live rates.

The L40S on-demand at ~$403 per million pages is about 3.7x cheaper than Textract, and the RTX Pro 6000 on-demand lands in the same range at ~$392. RTX Pro 6000 spot at ~$136 is roughly 11x cheaper. Beyond cost, DeepSeek-OCR returns structured Markdown output with table preservation. AWS Textract's DetectDocumentText returns flat text with positional bounding boxes but requires separate table analysis calls at additional per-page fees. If your pipeline needs tables and structured extraction, the managed services cost comparison gets worse once you add the table analysis tier.

The caveat: at volumes below 50,000 pages per month, the fixed cost of maintaining a GPU instance likely outweighs Textract's pay-per-page simplicity. Self-hosting makes economic sense starting around 100,000 pages/month, or immediately if you need output quality that classical OCR cannot deliver.
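The per-page cost formula from the assumptions list is easy to rerun with your own measured throughput. The sketch below implements that formula directly; the function names are illustrative, and utilization is exposed as a parameter because the result is sensitive to it.

```python
def cost_per_page(hourly_rate: float, pages_per_min: float,
                  utilization: float = 0.85) -> float:
    """Per-page cost = hourly rate / (pages/min x 60 x utilization)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / (pages_per_min * 60 * utilization)


def cost_per_million(hourly_rate: float, pages_per_min: float,
                     utilization: float = 0.85) -> float:
    """Cost of processing one million pages under the same assumptions."""
    return cost_per_page(hourly_rate, pages_per_min, utilization) * 1_000_000


if __name__ == "__main__":
    for utilization in (1.0, 0.85):
        total = cost_per_million(0.72, 35, utilization)
        print(f"L40S at {utilization:.0%} utilization: ${total:,.0f} per 1M pages")
```

With the L40S inputs (35 pages/min, $0.72/hr) this evaluates to about $343 per million pages at full utilization and about $403 at 85%, so state the utilization you assume whenever you quote a figure.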

Integrating with RAG Stacks (ColPali, Qdrant, vLLM) for Document Q&A

Once DeepSeek-OCR extracts text from your documents, you have a standard RAG indexing problem. The architecture looks like this:

PDF upload -> DeepSeek-OCR FastAPI -> text chunks
                                            |
                               TEI embedding model (GPU)
                                            |
                                 Qdrant vector store
                                            |
                          vLLM LLM (Qwen / DeepSeek-V3) <- query

For text-heavy documents (contracts, technical manuals, financial filings), DeepSeek-OCR plus a standard embedding model gives better chunk-level retrieval precision than ColPali's page-level embeddings. The OCR step produces dense text chunks that embed accurately, and you get sub-page retrieval granularity.
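The "text chunks" step in the diagram is left open above. One simple option, shown here as a sketch under stated assumptions, is to pack paragraphs of the OCR Markdown into chunks up to a character budget, so a Markdown table (which contains no blank lines) stays in one piece. The function name, the 1,200-character budget, and the payload field names are hypothetical choices, not something the pipeline requires.

```python
def chunk_page_text(text: str, page: int, max_chars: int = 1200) -> list[dict]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Each dict can be stored as a vector-store payload next to its embedding.
    return [{"page": page, "chunk": i, "text": c} for i, c in enumerate(chunks)]
```

Keeping the page number in each payload lets the answer step cite the source page for every retrieved chunk.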

For visual-heavy documents (slides, scanned catalogs, annotated engineering drawings), swap the OCR step for ColPali visual embeddings and skip the text extraction entirely. ColPali preserves spatial layout information that even accurate OCR loses when converting to linear text.

Set up self-hosted Qdrant or Milvus on Spheron on the same cluster as your OCR and inference nodes to eliminate network round-trips during retrieval. For the TEI embedding server setup and reranker configuration, see the self-host embedding and reranker models guide.

Once documents are indexed, a basic query loop looks like this:

python

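# After indexing OCR output into Qdrant
from openai import OpenAI
from qdrant_client import QdrantClient

# Endpoints are assumptions; point them at your own Qdrant and LLM servers.
qdrant_client = QdrantClient(url="http://localhost:6333")
vllm_client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

def answer_question(query: str, embed) -> str:
    """embed is your embedding call (e.g. a TEI client) returning a list of floats."""
    # query_points is the current search API; older clients use search(query_vector=...).
    hits = qdrant_client.query_points(
        collection_name="docs",
        query=embed(query),
        limit=5,
    ).points
    context = "\n---\n".join(h.payload["text"] for h in hits)
    answer = vllm_client.chat.completions.create(
        model="deepseek-v3",  # served model name of your LLM endpoint
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return answer.choices[0].message.content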

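Run the embedding model and vLLM inference server on the same GPU node to avoid inter-service latency. An L40S handles BGE-M3 embeddings (about 1 GB VRAM) plus a 7B distilled model such as DeepSeek-R1-Distill-Qwen-7B (about 14 GB of weights at BF16) concurrently with room to spare. Full DeepSeek-V3 is far larger and needs a multi-GPU node of its own.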

Common Failure Modes: Skewed Layouts, Handwriting, Table Extraction

Skewed or rotated pages

A 5-degree page skew drops OCR accuracy by 15-20% on standard printed text. Pre-process with OpenCV deskew before sending to DeepSeek-OCR:

python

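import cv2
import numpy as np

def deskew(image: np.ndarray) -> np.ndarray:
    """Rotate an RGB page image so its text block is axis-aligned."""
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    # (x, y) coordinates of dark pixels, assumed to be ink on a light page
    coords = np.argwhere(gray < 128)[:, ::-1].astype(np.float32)
    if len(coords) == 0:
        return image  # nothing to deskew
    angle = cv2.minAreaRect(coords)[-1]
    # minAreaRect reports [-90, 0) on OpenCV < 4.5.1 and (0, 90] on newer
    # builds; fold both into (-45, 45] so an upright page maps to 0.
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    if abs(angle) < 0.1:
        return image  # already straight; skip the resampling pass
    h, w = image.shape[:2]
    # With (x, y) points the folded angle is the clockwise skew, and a positive
    # angle in getRotationMatrix2D rotates counter-clockwise, undoing it.
    # Check the direction once on a page with known skew.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)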

Apply this after convert_from_bytes, converting each PIL Image with np.array(img) before passing to deskew and reconstructing the PIL Image with Image.fromarray(deskewed) before encoding to JPEG. It adds ~10-20ms per page on CPU and removes most of the accuracy loss from scan misalignment.

Handwriting

DeepSeek-OCR handles hand-printed (block-letter) text better than Tesseract but degrades on cursive and informal script. Two adjustments help:

Set temperature=0.1 instead of temperature=0 to allow slightly more variation in transcription. Pure greedy decoding can get stuck in repetitive output when character boundaries are ambiguous.
Accept a 10-15% character error rate on mixed handwriting/print documents and design your downstream pipeline to tolerate noise. For high-stakes workflows, add a human review queue for low-confidence pages.
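One way to feed a review queue is to request token log-probabilities and score each page. The sketch below assumes the request was sent with "logprobs": true and that the server returns the OpenAI-style choices[0].logprobs.content list. The 0.80 threshold and function names are hypothetical, and mean token probability is only a rough proxy for transcription quality, not a calibrated confidence.

```python
import math
from typing import Optional


def mean_token_probability(response: dict) -> Optional[float]:
    """Geometric-mean token probability of the first choice, or None if absent."""
    choices = response.get("choices") or []
    if not choices:
        return None
    tokens = (choices[0].get("logprobs") or {}).get("content") or []
    if not tokens:
        return None
    avg_logprob = sum(t["logprob"] for t in tokens) / len(tokens)
    return math.exp(avg_logprob)


def needs_review(response: dict, threshold: float = 0.80) -> bool:
    """Flag pages with no score or a score below the (hypothetical) threshold."""
    score = mean_token_probability(response)
    return score is None or score < threshold
```

Tune the threshold against a hand-checked sample of your own pages before using it to decide which pages skip review.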

Table extraction

A bare prompt such as "Extract all text from this image." treats tables as flowing text. For structured tables, use:

Extract all text from this image. Represent any tables as GitHub Flavored Markdown tables with pipe delimiters.

For complex tables with merged cells, run a second OCR pass on a bounding-box crop of the table region at higher DPI. Cropping to 110% of the table area and running at 200-300 DPI significantly improves cell boundary detection.
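The crop geometry can be sketched as follows. This assumes you already have a table bounding box in pixel coordinates from some layout step (the source of the box is not specified here), and it reads "110% of the table area" as growing each side length by 10% around the centre; both the interpretation and the function names are assumptions.

```python
Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def expand_box(box: Box, page_size: tuple[int, int], scale: float = 1.10) -> Box:
    """Grow a box around its centre by `scale`, clamped to the page bounds."""
    left, top, right, bottom = box
    page_w, page_h = page_size
    pad_x = (right - left) * (scale - 1) / 2
    pad_y = (bottom - top) * (scale - 1) / 2
    return (
        max(0, round(left - pad_x)),
        max(0, round(top - pad_y)),
        min(page_w, round(right + pad_x)),
        min(page_h, round(bottom + pad_y)),
    )


def rescale_box(box: Box, from_dpi: int, to_dpi: int) -> Box:
    """Map a box found on a low-DPI render onto a higher-DPI render of the page."""
    factor = to_dpi / from_dpi
    return tuple(round(v * factor) for v in box)
```

The resulting tuple uses the same (left, upper, right, lower) order that PIL's Image.crop expects, so it can be applied directly to the page re-rendered at the higher DPI.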

Multi-column layouts

Newspapers, academic papers, and some report formats use two or three columns. Without guidance, DeepSeek-OCR sometimes interleaves column text. Add this to the prompt:

Process columns left-to-right, top-to-bottom. Complete each column fully before moving to the next.

This instruction reduces column interleaving from a common failure to an occasional edge case. It is not a guarantee on all layouts, but it handles the majority of standard two-column academic PDFs.
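Since the table and column instructions are plain prompt text, they can be composed per document type. A minimal sketch follows; the instruction strings are the ones given in this section, while the function and flag names are illustrative.

```python
BASE = "Extract all text from this image."
TABLES = ("Represent any tables as GitHub Flavored Markdown tables "
          "with pipe delimiters.")
COLUMNS = ("Process columns left-to-right, top-to-bottom. "
           "Complete each column fully before moving to the next.")


def build_prompt(tables: bool = False, multi_column: bool = False) -> str:
    """Assemble the OCR prompt from the instructions a document type needs."""
    parts = [BASE]
    if tables:
        parts.append(TABLES)
    if multi_column:
        parts.append(COLUMNS)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt(tables=True, multi_column=True))
```

Keeping the instruction order fixed also keeps the prompt text identical across requests of the same document type, which is what prefix caches key on.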

DeepSeek-OCR plus a fast GPU is the most cost-effective way to run production document extraction today. Spheron gives you on-demand and spot access to the exact GPUs this pipeline runs on, with no minimum commitments.

Quick Setup Guide

Pull the vLLM Docker image and serve DeepSeek-OCR
Run the vLLM OpenAI-compatible server with the DeepSeek-OCR model weights, setting --tensor-parallel-size and --max-model-len appropriate for your GPU VRAM.
Stand up the FastAPI OCR endpoint
Create a /ocr POST endpoint that accepts an uploaded image or PDF file, slices multi-page PDFs into per-page images, base64-encodes each page, submits it to the vLLM vision endpoint, and returns structured JSON.
Tune batching and page-level parallelism
Configure --max-num-seqs to control concurrency, use asyncio gather for parallel page submission, and tune DPI (150-200 DPI balances token count vs accuracy for printed documents).
Index OCR output into Qdrant for document Q&A
Pipe OCR JSON output to a text-embedding model served with Hugging Face TEI, then upsert chunks into a Qdrant collection. Use ColPali for pages where visual layout matters more than text.

Frequently Asked Questions

What GPU does DeepSeek-OCR need for production throughput?
DeepSeek-OCR runs well on 48 GB VRAM cards like the L40S at high batch sizes. For long PDFs and larger batches, the RTX Pro 6000 with 96 GB VRAM leaves the most headroom. On Spheron, L40S starts from $0.72/hr on-demand and the RTX Pro 6000 from $1.70/hr.

Can I run DeepSeek-OCR with vLLM?
Yes. vLLM supports DeepSeek-OCR through its vision language model backend. See the vLLM production deployment guide at /blog/vllm-production-deployment-2026/ for FP8 quantization flags and multi-GPU tensor-parallel configuration.

How does DeepSeek-OCR compare to Tesseract and PaddleOCR?
Tesseract is a classical pipeline limited to clean printed text. PaddleOCR handles more layouts but struggles with handwriting and tables. DeepSeek-OCR processes the image as a vision-language token sequence, enabling it to recover structure, formulas, and mixed-script content in one pass.

How do I integrate DeepSeek-OCR with a RAG stack?
Pass DeepSeek-OCR output as document chunks to an embedding model, then index into Qdrant or Milvus. For visual retrieval without OCR, combine with ColPali multimodal RAG at /blog/colpali-multimodal-document-rag-gpu-cloud/. See the self-hosted vector database guide at /blog/self-host-vector-database-gpu-cloud-qdrant-milvus-weaviate/ for Qdrant setup on Spheron.

What is the cost per million pages with DeepSeek-OCR on Spheron vs AWS?
On Spheron on-demand with an L40S at $0.72/hr, processing ~35 pages/min gives a per-page cost of roughly $0.00034 at full utilization, or about $0.00040 at 85% utilization. At 1M pages that comes to around $340-$400, compared to $1,500 with AWS Textract. Spot pricing brings costs down further. DeepSeek-OCR also returns tables as structured Markdown, which Textract's basic text detection returns as flat text.