Deploy DeepSeek-OCR on GPU Cloud: Self-Host Production Document and Visual OCR Inference (2026 Setup Guide)

Written by Mitrasish, Co-founder | May 3, 2026
Tags: DeepSeek OCR Deployment, Self-Host OCR LLM, PDF OCR Pipeline GPU, Document AI GPU Cloud, AWS Textract Alternative, DeepSeek OCR vLLM, GPU Cloud, vLLM, SGLang

DeepSeek-OCR turns any GPU server into a document extraction API that matches or beats cloud OCR services on accuracy, at a fraction of the cost. It processes images as vision-language token sequences, which means it natively handles tables, mixed scripts, handwriting, and structured forms that classical OCR pipelines garble. This guide walks through the full production setup: hardware sizing, vLLM and SGLang container configuration, a FastAPI service for PDF and image pipelines, cost-per-million-pages benchmarks against AWS Textract and GCP Document AI, and RAG integration with Qdrant and ColPali.

What DeepSeek-OCR Is and Why It Outperforms Tesseract, PaddleOCR, and Donut

Classical OCR tools like Tesseract and PaddleOCR work through a pipeline of steps: image preprocessing, text region detection, character segmentation, and character classification. Each step introduces error. On clean, high-contrast printed text that pipeline works acceptably. Put a photocopied table, a mixed-script invoice, or handwritten annotations in front of it and accuracy falls fast.

DeepSeek-OCR approaches the problem differently. It treats the page as a visual input to a transformer encoder, processes the image as a sequence of patch tokens, and generates the extracted text autoregressively through a language model decoder. This means layout understanding, table structure, and text semantics are all handled in one pass, and the output format is controllable through the prompt. You can ask for markdown tables, JSON key-value pairs, or plain text without any post-processing logic.

| Metric | Tesseract | PaddleOCR | Donut | DeepSeek-OCR |
| --- | --- | --- | --- | --- |
| Architecture | Classical pipeline | CNN+CTC | Transformer seq2seq | Vision-language model |
| Handwriting | Poor | Fair | Fair | Good |
| Table extraction | Manual post-process | Limited | No | Native via prompt |
| Mixed scripts | Limited | Fair | Limited | Strong |
| API-compatible | No | No | No | Yes (OpenAI vision spec) |
| VRAM at batch=1 | None | None | ~4 GB | ~8 GB |

The trade-off is significant: DeepSeek-OCR needs a GPU, and its model card recommends >=16 GB of VRAM (more for larger batch sizes), while Tesseract runs on CPU with no VRAM at all. It is also slower at very high DPI. For teams processing simple, clean printed documents at extremely high volume, a classical pipeline is still cheaper at the margin. For everyone dealing with enterprise documents, scanned PDFs, or structured data extraction, the accuracy difference makes DeepSeek-OCR worth the GPU cost.

For teams that want to skip text extraction entirely and retrieve documents by visual similarity, the ColPali multimodal document RAG guide covers patch-level embedding over raw page images without any OCR step.

Hardware Requirements: VRAM, Throughput, and Token-Compression Math

Understanding DeepSeek-OCR's VRAM profile starts with its architecture. Rather than a single ViT encoder, it uses a DeepEncoder that combines SAM (window attention) and CLIP (global attention), feeding into a 16x convolutional compressor. This "Contexts Optical Compression" achieves 7-20x reduction in visual tokens: at 10x compression the model retains 97% of full accuracy, and in Base mode (256 vision tokens per page) it handles most printed documents without quality loss. Tiny mode produces 64 tokens, Small 100, and Large 400 for high-fidelity work. At 256 visual tokens plus up to 4,096 output tokens, a single page sits under 5,000 context tokens in Base mode.
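The per-page token arithmetic above is easy to check in a few lines. This is a minimal sketch: the mode sizes and the 4,096-token output cap come from the paragraph above, while the helper name `context_budget` and its `prompt_tokens` parameter are illustrative assumptions rather than anything the model exposes.

```python
# Vision-token counts per resolution mode, as quoted in the text above.
VISION_TOKENS = {"tiny": 64, "small": 100, "base": 256, "large": 400}
MAX_OUTPUT_TOKENS = 4096  # output cap used throughout this guide


def context_budget(mode: str, prompt_tokens: int = 0) -> int:
    """Worst-case context tokens for one page: vision + prompt + output."""
    return VISION_TOKENS[mode] + prompt_tokens + MAX_OUTPUT_TOKENS


for mode in VISION_TOKENS:
    print(f"{mode:>5}: {context_budget(mode):,} tokens per page")
```

Even Large mode stays under 5,000 tokens per page when the instruction prompt is short, which is why a 16,384-token context window leaves room for several pages per request.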

At BF16, the model weights are roughly 6.7 GB: a DeepEncoder at ~380M params and a 3B-class MoE decoder with 570M active parameters (6 of 64 experts active per token). The Hugging Face model card recommends >=16 GB VRAM. An L40S at 48 GB comfortably handles batch=16 with significant headroom; the RTX Pro 6000 at 96 GB supports batch=32 and above. KV cache overhead is substantially lower than a 7B model given the smaller active parameter count.
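The weight footprint follows from the parameter counts. The sketch below assumes 2 bytes per parameter for BF16 and takes "3B-class" to mean exactly 3.0B total decoder parameters, which is an approximation for illustration; the headroom figures are upper bounds because they ignore the CUDA context and activation memory.

```python
BYTES_PER_PARAM_BF16 = 2


def weight_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Weight memory in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


ENCODER_PARAMS = 380e6  # DeepEncoder, from the text above
DECODER_PARAMS = 3.0e9  # assumption: "3B-class" taken as 3.0B total parameters

total_gb = weight_gb(ENCODER_PARAMS + DECODER_PARAMS)
print(f"weights: ~{total_gb:.2f} GB")

for name, vram_gb in {"L40S": 48, "RTX Pro 6000": 96}.items():
    # Upper bound on what is left for KV cache and activations.
    print(f"{name}: up to ~{vram_gb - total_gb:.1f} GB beyond the weights")
```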

The table below shows throughput estimates for each GPU. Figures are estimates based on the model's actual architecture (570M active params, MoE decoder); actual results vary by DPI, compression mode, and output length.

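| GPU | VRAM | Batch size at FP16 | Throughput (pages/min, est.) | On-Demand $/hr | Spot $/hr |
| --- | --- | --- | --- | --- | --- |
| NVIDIA L40S | 48 GB | 16 | ~35 | $0.72 | N/A |
| NVIDIA RTX Pro 6000 | 96 GB | 32 | ~85 | $1.70 | $0.59 |
| NVIDIA RTX 5090 | 32 GB | 12 | ~40 | $0.924 | N/A |
| NVIDIA H100 SXM5 | 80 GB | 24 | ~88 | $3.10 | $0.80 |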

Pricing fluctuates with GPU availability. The prices above are as of 03 May 2026 and may have changed; check current GPU pricing for live rates.

For batch document processing where throughput per dollar is the priority, the L40S GPU rental is the lowest-cost on-demand entry point: it has the lowest hourly rate in the table. The L40S delivers an estimated 35 pages/min at $0.72/hr on-demand, which works out to roughly 2,100 pages per GPU-hour at full utilization. For latency-sensitive serving or long PDFs where you want the full model in VRAM with no paging, RTX Pro 6000 on Spheron gives 96 GB of GDDR7 and handles batch=32 without swapping. If you need H100-class memory bandwidth for mixed LLM and OCR workloads on the same node, H100 SXM5 instances are the right choice.
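To size a batch job, it helps to convert the pages/min estimates into GPU-hours. A minimal sketch, using only the throughput estimates from the hardware table; the function names are illustrative and the results assume the GPU is busy the whole time.

```python
# Estimated pages/min per GPU, from the hardware table above.
PAGES_PER_MIN = {
    "L40S": 35,
    "RTX Pro 6000": 85,
    "RTX 5090": 40,
    "H100 SXM5": 88,
}


def pages_per_gpu_hour(pages_per_min: float) -> float:
    return pages_per_min * 60


def gpu_hours_for(pages: int, pages_per_min: float) -> float:
    """GPU-hours needed for a job, assuming 100% utilization."""
    return pages / pages_per_gpu_hour(pages_per_min)


for name, ppm in PAGES_PER_MIN.items():
    print(f"{name}: {pages_per_gpu_hour(ppm):,.0f} pages/GPU-hour, "
          f"{gpu_hours_for(1_000_000, ppm):,.0f} GPU-hours per 1M pages")
```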

Container Setup with vLLM and SGLang Backends

vLLM Setup

vLLM's OpenAI-compatible server handles vision-language models natively through its multimodal API. DeepSeek-OCR follows the same setup path as any other VLM.

bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-OCR \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --limit-mm-per-prompt image=4 \
  --served-model-name deepseek-ocr

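The --limit-mm-per-prompt image=4 flag allows up to four images per request, which is useful when batching several pages into a single API call. Recent vLLM releases expect this flag in JSON form (--limit-mm-per-prompt '{"image": 4}'), so check vllm serve --help for the image tag you pull. Set --max-model-len based on your VRAM: 16,384 tokens works on an L40S at batch=16. On an H100, you can push to 32,768 tokens.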
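A request that carries several pages simply repeats the `image_url` content part. The sketch below only builds the payload, so it runs without a server; `build_ocr_request` and the plural prompt wording are illustrative assumptions, and whether multi-image prompts match single-page quality on your documents is something to verify on your own corpus.

```python
import base64
import json


def build_ocr_request(pages: list, model: str = "deepseek-ocr",
                      max_images: int = 4) -> dict:
    """Build a chat-completions payload carrying several JPEG page images."""
    if not 1 <= len(pages) <= max_images:
        raise ValueError(f"expected 1..{max_images} images, got {len(pages)}")
    content = [
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64," + base64.b64encode(p).decode()}}
        for p in pages
    ]
    # Hypothetical prompt wording; adjust to your extraction needs.
    content.append({"type": "text",
                    "text": "Extract all text from these pages as Markdown."})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 4096,
        "temperature": 0,
    }


# Placeholder bytes stand in for real JPEG data.
payload = build_ocr_request([b"\xff\xd8page-1", b"\xff\xd8page-2"])
print(json.dumps(payload)[:100])
```

Keeping `max_images` equal to the server's per-prompt image limit lets the client reject oversized requests before they reach vLLM.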

For FP8 quantization on H100 (roughly 1.5-2x throughput improvement), add --quantization fp8; the --dtype flag only accepts floating-point types such as bfloat16 or float16. For detailed multi-GPU tensor-parallel setup and FP8 flags, the vLLM production deployment guide covers the same configuration flow for vision-language models.

SGLang Setup

bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-OCR \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

SGLang's RadixAttention prefix cache is the key advantage here for document workloads. If you are processing thousands of pages from the same document set, many pages share the same system prompt and layout instructions. RadixAttention caches those shared prefixes and avoids recomputing attention for them on each request. On a corpus where documents share headers, footers, or common preamble structure, prefix hit rates of 40-60% are typical, which translates to meaningful throughput gains.
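To build intuition for what a prefix hit rate measures, here is a toy estimate over tokenized requests. It is not SGLang's implementation (which uses a radix tree and evicts entries under memory pressure); it only counts, for each request, the longest prefix shared with any earlier request. The token IDs are made up for illustration.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(requests: list) -> float:
    """Fraction of all tokens that match a prefix of some earlier request."""
    seen = []
    hit = total = 0
    for tokens in requests:
        hit += max((shared_prefix_len(tokens, prev) for prev in seen), default=0)
        total += len(tokens)
        seen.append(tokens)
    return hit / total if total else 0.0


# Toy data: a 6-token shared instruction followed by 4 page-specific tokens.
shared = [1, 2, 3, 4, 5, 6]
reqs = [shared + [100 + i, 200 + i, 300 + i, 400 + i] for i in range(5)]
print(f"hit rate: {prefix_hit_rate(reqs):.0%}")
```

The practical takeaway is ordering: anything identical across requests only counts as a shared prefix if it comes before the page-specific content in the token sequence.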

| Criterion | vLLM | SGLang |
| --- | --- | --- |
| Startup time | ~90s | ~120s |
| Prefix caching | Basic (block-level) | RadixAttention (tree-based) |
| OpenAI API compat | Full | Full |
| Ecosystem maturity | Higher | Growing |
| Best for | General serving, one-off requests | Repeated templates, batch OCR |

Spheron Deployment

Provision your instance from app.spheron.ai: select the GPU type, choose your region, and pull the vLLM container directly. For the CLI, refer to the Spheron docs for the instance launch flow. Once SSHed in, verify the GPU with nvidia-smi before starting the container.

For the RTX Pro 6000 or H100, a spheron.yml snippet for the vLLM service looks like:

yaml
version: "1.0"
services:
  ocr-server:
    image: vllm/vllm-openai:latest
    expose:
      - port: 8000
        as: 8000
        to:
          - global: true
    command:
      - "--model deepseek-ai/DeepSeek-OCR"
      - "--tensor-parallel-size 1"
      - "--max-model-len 16384"
      - "--limit-mm-per-prompt image=4"
    resources:
      gpu:
        units: 1
        attributes:
          vendor:
            nvidia:
              - model: rtx-pro-6000

Building a FastAPI OCR Endpoint: PDF, Image, and Multi-Page Document Pipelines

The following service accepts PDF uploads or raw images, runs each page through DeepSeek-OCR in parallel, and returns structured JSON.

python
import asyncio
import base64
import io

import httpx
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from pdf2image import convert_from_bytes

# Requires poppler-utils installed at OS level:
# RUN apt-get update && apt-get install -y poppler-utils
# in your Dockerfile, or the PDF conversion will raise PDFInfoNotInstalledError.

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "deepseek-ocr"

async def ocr_page(client: httpx.AsyncClient, img_b64: str) -> str:
    """Send one base64-encoded page image to vLLM and return the extracted text."""
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
                {"type": "text", "text": "Extract all text from this image. Preserve tables and structure as Markdown."}
            ]
        }],
        "max_tokens": 4096,
        "temperature": 0
    }
    r = await client.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    choices = r.json().get("choices", [])
    return choices[0].get("message", {}).get("content", "") if choices else ""

@app.post("/ocr")
async def ocr_document(file: UploadFile = File(...)):
    raw = await file.read()
    if file.filename and file.filename.lower().endswith(".pdf"):
        # Rasterizing is CPU-bound, so run it in a worker thread
        # to keep the event loop free for other requests.
        images = await asyncio.to_thread(convert_from_bytes, raw, dpi=150, fmt="jpeg")
        page_bytes = []
        for img in images:
            buf = io.BytesIO()
            img.convert("RGB").save(buf, format="JPEG")  # JPEG has no alpha channel
            page_bytes.append(buf.getvalue())
    else:
        # Non-PDF uploads are forwarded unchanged; ocr_page labels them image/jpeg.
        page_bytes = [raw]

    async with httpx.AsyncClient() as client:
        tasks = [
            ocr_page(client, base64.b64encode(b).decode())
            for b in page_bytes
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    pages = []
    errors = []
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            errors.append({"page": i + 1, "error": str(result)})
        else:
            pages.append({"page": i + 1, "text": result})

    response: dict = {"pages": pages, "page_count": len(results)}
    if errors:
        response["partial_failures"] = errors
    return JSONResponse(response)

A few notes on this implementation:

  • dpi=150 is a good starting point for printed documents (pdf2image itself defaults to 200). Increase to 200-300 for handwriting, dense tables, or small fonts at the cost of roughly 1.75-3x more visual tokens per page.
  • temperature=0 selects greedy decoding, which keeps output reproducible. For mixed handwriting or degraded scans where greedy decoding stalls on ambiguous characters, try temperature=0.1.
  • For large PDFs (more than 20 pages), break the conversion into groups of 10 pages and use a background task queue (e.g., arq or Celery with Redis). Sending 50 concurrent 6,000-token requests at once can saturate the KV cache on a single GPU and leave later requests waiting in the queue.
  • Add slowapi rate limiting on the /ocr endpoint if you are exposing this service publicly. Without it, a burst of large PDFs from one client can monopolize GPU capacity and starve every other request.
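On the client side, the JSON returned by the /ocr handler can be flattened into one document while keeping track of pages that failed. A minimal sketch; `summarize_ocr_response` is a hypothetical helper and the sample dictionary mimics the response shape produced by the handler above.

```python
def summarize_ocr_response(resp: dict) -> tuple:
    """Join page texts in page order and list the page numbers that failed."""
    pages = sorted(resp.get("pages", []), key=lambda p: p["page"])
    text = "\n\n".join(p["text"] for p in pages)
    failed = [e["page"] for e in resp.get("partial_failures", [])]
    return text, failed


# Sample shaped like the handler's output: page 2 timed out.
sample = {
    "pages": [{"page": 1, "text": "# Invoice"}, {"page": 3, "text": "Total: 42"}],
    "page_count": 3,
    "partial_failures": [{"page": 2, "error": "timeout"}],
}
text, failed = summarize_ocr_response(sample)
print("failed pages:", failed)
```

Retrying only the pages listed in `failed` avoids re-running the whole PDF after a transient timeout.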

Production Tuning: Batching, Page-Level Parallelism, and DPI Trade-Offs

vLLM concurrency: Set --max-num-seqs 16 to allow up to 16 concurrent requests at the vLLM level. For offline batch jobs, this is the primary knob for throughput. For interactive serving where latency matters, drop it to 4-8 to reduce queue depth.

asyncio.gather parallelism: The FastAPI handler above submits all pages simultaneously via asyncio.gather. The bottleneck is GPU compute, not Python thread scheduling. On a loaded system, this means all pages arrive at vLLM at once and compete for the same VRAM-constrained batch slots. Pair this with --max-num-seqs tuning so the queue doesn't overflow.
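One way to keep the handler from flooding vLLM is to cap in-flight requests with an asyncio.Semaphore. The sketch below is self-contained: `gather_bounded` is a hypothetical helper, and `fake_page` stands in for the real `ocr_page` call so the concurrency limit can be observed without a server.

```python
import asyncio


async def gather_bounded(factories, limit: int) -> list:
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(*(run(f) for f in factories), return_exceptions=True)


async def _demo() -> tuple:
    in_flight = peak = 0

    async def fake_page(i: int) -> str:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for the HTTP call to vLLM
        in_flight -= 1
        return f"page {i}"

    jobs = [lambda i=i: fake_page(i) for i in range(10)]
    results = await gather_bounded(jobs, limit=4)
    return results, peak


results, peak = asyncio.run(_demo())
print(len(results), "pages, peak concurrency", peak)
```

In the FastAPI handler, each factory would wrap one `ocr_page(client, ...)` call, with `limit` set at or below the server's --max-num-seqs value.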

DPI and token count trade-offs:

| DPI | Visual tokens/page (approx.) | Relative accuracy | Relative compute |
| --- | --- | --- | --- |
| 72 | ~600 | Low | 0.3x |
| 150 | ~2,000 | Good for printed text | 1x (baseline) |
| 200 | ~3,500 | Better for small fonts | 1.75x |
| 300 | ~6,000 | Best, needed for handwriting | 3x |

Test at 150 DPI first on your actual corpus. Many enterprise printed documents produce excellent results at 150 DPI with no need to triple compute costs.

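Token compression: vLLM forwards model-specific processor options through --mm-processor-kwargs. If your DeepSeek-OCR build exposes a spatial token pooling option, you could set it there, for example --mm-processor-kwargs '{"image_compression_factor":0.5}' (confirm the actual key name in the model's processor before relying on it). Halving visual tokens this way costs roughly 3-5% accuracy on clean documents, which is usually acceptable for standard printed text, and on an L40S it roughly doubles batch size capacity.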

Memory management for large documents: For PDFs over 20 pages, chunk processing into groups and use a persistent background queue. A 50-page PDF at 200 DPI generates 50 x 3,500 = 175,000 visual tokens of concurrent context. At batch=8, that saturates L40S VRAM. Process in chunks of 10 pages per batch submission instead.
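The chunking arithmetic can be sketched as follows. The tokens-per-page figures are the approximations from the DPI table above; `chunked` and `concurrent_visual_tokens` are illustrative helper names, and the list of integers stands in for rendered page images.

```python
# Approximate visual tokens per page by DPI, from the DPI table above.
TOKENS_PER_PAGE = {72: 600, 150: 2_000, 200: 3_500, 300: 6_000}


def chunked(pages, size: int = 10):
    """Yield consecutive groups of at most `size` pages."""
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def concurrent_visual_tokens(num_pages: int, dpi: int) -> int:
    """Visual tokens in flight if `num_pages` pages are submitted at once."""
    return num_pages * TOKENS_PER_PAGE[dpi]


pages = list(range(1, 51))  # stand-in for 50 rendered page images
print("all at once:", concurrent_visual_tokens(len(pages), 200))
for chunk in chunked(pages):
    print(f"pages {chunk[0]}-{chunk[-1]}:", concurrent_visual_tokens(len(chunk), 200))
```

Submitting one chunk at a time caps the in-flight visual context at 35,000 tokens instead of 175,000 for the 200 DPI example.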

Cost Per 1M Pages: Spheron vs Hyperscaler GPU Cloud Pricing Comparison

The numbers below use the throughput estimates from the hardware table above with the following assumptions:

  • DPI = 150, avg 2,000 tokens of image input per page
  • GPU utilization = 85%
  • No hyperscaler support charges, egress fees, or minimum commitments included
  • Per-page cost = (hourly rate) / (pages/min x 60 x 0.85 utilization)

| Platform | GPU | Pricing model | Per-page cost (est.) | Cost per 1M pages |
| --- | --- | --- | --- | --- |
| Spheron (on-demand) | L40S | $0.72/hr | $0.00040 | ~$403 |
| Spheron (on-demand) | RTX Pro 6000 | $1.70/hr | $0.00039 | ~$392 |
| Spheron (spot) | RTX Pro 6000 | $0.59/hr | $0.00014 | ~$136 |
| Spheron (on-demand) | H100 SXM5 | $3.10/hr | $0.00069 | ~$691 |
| Spheron (spot) | H100 SXM5 | $0.80/hr | $0.00018 | ~$178 |
| AWS Textract | - | per page | $0.0015 | $1,500 |
| GCP Document AI | - | per page | $0.0015 | $1,500 |
| Azure Form Recognizer (now Azure AI Document Intelligence) | - | per page | $0.0015 | $1,500 |

Pricing fluctuates with GPU availability. The prices above are as of 03 May 2026 and may have changed; check current GPU pricing for live rates.

The L40S on-demand at ~$403 per million pages is about 3.7x cheaper than Textract. RTX Pro 6000 spot at ~$136 is roughly 11x cheaper. Beyond cost, DeepSeek-OCR returns structured Markdown output with table preservation. AWS Textract's DetectDocumentText returns flat text with positional bounding boxes but requires separate table analysis calls at additional per-page fees. If your pipeline needs tables and structured extraction, the comparison gets worse for the managed services once you add the table analysis tier.
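The per-page formula from the assumptions list can be applied directly to the hourly rates and throughput estimates quoted in this guide. A minimal sketch; the function name and the `configs` dictionary are illustrative, and the outputs are only as good as the throughput estimates that feed them.

```python
MANAGED_PER_PAGE = 0.0015  # per-page price quoted for the managed OCR services


def cost_per_page(hourly_rate: float, pages_per_min: float,
                  utilization: float = 0.85) -> float:
    """Per-page cost = hourly rate / (pages/min x 60 x utilization)."""
    return hourly_rate / (pages_per_min * 60 * utilization)


# (hourly rate in $, estimated pages/min) as quoted in this guide.
configs = {
    "L40S on-demand": (0.72, 35),
    "RTX Pro 6000 on-demand": (1.70, 85),
    "RTX Pro 6000 spot": (0.59, 85),
    "H100 SXM5 on-demand": (3.10, 88),
    "H100 SXM5 spot": (0.80, 88),
}

for name, (rate, ppm) in configs.items():
    per_page = cost_per_page(rate, ppm)
    print(f"{name}: ${per_page * 1e6:,.0f} per 1M pages, "
          f"{MANAGED_PER_PAGE / per_page:.1f}x cheaper than $0.0015/page")
```

Changing `utilization` shows how sensitive the comparison is to idle time: a GPU that sits half idle doubles the effective per-page cost.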

The caveat: at volumes below 50,000 pages per month, the fixed cost of maintaining a GPU instance likely outweighs Textract's pay-per-page simplicity. Self-hosting makes economic sense starting around 100,000 pages/month, or immediately if you need output quality that classical OCR cannot deliver.

Integrating with RAG Stacks (ColPali, Qdrant, vLLM) for Document Q&A

Once DeepSeek-OCR extracts text from your documents, you have a standard RAG indexing problem. The architecture looks like this:

PDF upload -> DeepSeek-OCR FastAPI -> text chunks
                                            |
                               TEI embedding model (GPU)
                                            |
                                 Qdrant vector store
                                            |
                          vLLM LLM (Qwen / DeepSeek-V3) <- query

For text-heavy documents (contracts, technical manuals, financial filings), DeepSeek-OCR plus a standard embedding model gives better chunk-level retrieval precision than ColPali's page-level embeddings. The OCR step produces dense text chunks that embed accurately, and you get sub-page retrieval granularity.
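Sub-page granularity comes from splitting each page's OCR output into chunks before embedding. The sketch below uses a simple fixed-size character window with overlap; the sizes are placeholder assumptions, and a production pipeline might instead split on Markdown headings or table boundaries.

```python
def chunk_page_text(page: int, text: str, max_chars: int = 1200,
                    overlap: int = 150) -> list:
    """Split one page of OCR output into overlapping chunks tagged with the page."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append({"page": page, "offset": start, "text": text[start:end]})
        if end == len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks


sample = "x" * 3000  # stand-in for one page of extracted text
chunks = chunk_page_text(page=7, text=sample)
print([(c["offset"], len(c["text"])) for c in chunks])
```

Each dictionary can be stored as the vector-store payload, so the `text` field is available at query time and `page` points back to the source page.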

For visual-heavy documents (slides, scanned catalogs, annotated engineering drawings), swap the OCR step for ColPali visual embeddings and skip the text extraction entirely. ColPali preserves spatial layout information that even accurate OCR loses when converting to linear text.

Set up self-hosted Qdrant or Milvus on Spheron on the same cluster as your OCR and inference nodes to eliminate network round-trips during retrieval. For the TEI embedding server setup and reranker configuration, see the self-host embedding and reranker models guide.

Once documents are indexed, a basic query loop looks like this:

python
# After indexing OCR output into Qdrant
hits = qdrant_client.search(
    collection_name="docs",
    query_vector=embed(query),
    limit=5
)
context = "\n---\n".join(h.payload["text"] for h in hits)
answer = vllm_client.chat.completions.create(
    model="deepseek-v3",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    ]
)

Run the embedding model and vLLM inference server on the same GPU node to avoid inter-service latency. An L40S handles BGE-M3 embeddings (about 1 GB VRAM) plus a 7B-class distilled DeepSeek model (about 14 GB VRAM at BF16) concurrently with room to spare.

Common Failure Modes: Skewed Layouts, Handwriting, Table Extraction

Skewed or rotated pages

A 5-degree page skew drops OCR accuracy by 15-20% on standard printed text. Pre-process with OpenCV deskew before sending to DeepSeek-OCR:

python
import cv2
import numpy as np

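def deskew(image: np.ndarray) -> np.ndarray:
    """Rotate an RGB page image so that its text lines are horizontal."""
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    # (x, y) coordinates of dark "ink" pixels
    coords = np.argwhere(gray < 128)[:, ::-1].astype(np.float32)
    if len(coords) == 0:
        return image  # nothing to deskew
    angle = cv2.minAreaRect(coords)[-1]
    # Older OpenCV releases report the angle in [-90, 0), newer ones in
    # (0, 90], so fold it into (-45, 45] degrees to cover both.
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    h, w = image.shape[:2]
    # minAreaRect angles are clockwise and getRotationMatrix2D rotates
    # counter-clockwise, so passing the angle unchanged undoes the skew.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)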

Apply this after convert_from_bytes: convert each PIL Image with np.array(img), pass it to deskew, and rebuild the PIL Image with Image.fromarray(deskewed) before encoding to JPEG. It adds ~10-20ms per page on CPU and removes most of the accuracy loss from scan misalignment.

Handwriting

DeepSeek-OCR handles hand-printed (block-letter) text better than Tesseract but degrades on cursive and informal script. Two adjustments help:

  • Set temperature=0.1 instead of temperature=0 to allow slightly more variation in transcription. Pure greedy decoding sometimes gets stuck when character boundaries are ambiguous.
  • Accept a 10-15% character error rate on mixed handwriting/print documents and design your downstream pipeline to tolerate noise. For high-stakes workflows, add a human review queue for low-confidence pages.
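The text above does not define how a "low-confidence" page is detected. One possible proxy, sketched here, is the geometric-mean token probability computed from the `logprobs` field of a chat-completions response, assuming the request was sent with `"logprobs": true`. The 0.80 threshold and both function names are hypothetical and should be tuned against pages you have reviewed by hand.

```python
import math
from typing import Optional


def mean_token_prob(choice: dict) -> Optional[float]:
    """Geometric-mean token probability for one chat-completions choice."""
    entries = (choice.get("logprobs") or {}).get("content") or []
    if not entries:
        return None
    return math.exp(sum(e["logprob"] for e in entries) / len(entries))


def needs_review(choice: dict, threshold: float = 0.80) -> bool:
    """Send a page to human review when confidence is missing or too low."""
    prob = mean_token_prob(choice)
    return prob is None or prob < threshold


# Hand-written sample choices shaped like the OpenAI logprobs schema.
confident = {"logprobs": {"content": [{"token": "Total", "logprob": -0.01},
                                      {"token": ":", "logprob": -0.02}]}}
shaky = {"logprobs": {"content": [{"token": "T0tal", "logprob": -1.2},
                                  {"token": ":", "logprob": -0.4}]}}
print(needs_review(confident), needs_review(shaky))
```

Token probability is only a rough signal: a model can be confidently wrong, so spot checks on pages that pass the threshold are still worthwhile.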

Table extraction

The default prompt "Extract all text from this image." treats tables as flowing text. For structured tables, use:

Extract all text from this image. Represent any tables as GitHub Flavored Markdown tables with pipe delimiters.

For complex tables with merged cells, run a second OCR pass on a bounding-box crop of the table region at higher DPI. Cropping to 110% of the table area and running at 200-300 DPI significantly improves cell boundary detection.
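The crop margin can be computed with a small helper. This sketch reads "110%" as scaling the box width and height by 1.1 around its centre, which is an interpretation rather than something the text specifies; the table bounding box itself is assumed to come from a layout detector or a manual annotation.

```python
def expand_box(box: tuple, page_size: tuple, scale: float = 1.10) -> tuple:
    """Grow (left, top, right, bottom) around its centre, clipped to the page."""
    left, top, right, bottom = box
    page_w, page_h = page_size
    pad_x = (right - left) * (scale - 1) / 2
    pad_y = (bottom - top) * (scale - 1) / 2
    return (
        max(0, round(left - pad_x)),
        max(0, round(top - pad_y)),
        min(page_w, round(right + pad_x)),
        min(page_h, round(bottom + pad_y)),
    )


# A 400x200 px table on a US Letter page rendered at 150 DPI (1275x1650 px).
print(expand_box((100, 200, 500, 400), page_size=(1275, 1650)))
```

The returned tuple uses the same (left, upper, right, lower) order as PIL's Image.crop, so it can be passed straight to the crop call; remember to scale the coordinates if you re-render the page at a higher DPI.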

Multi-column layouts

Newspapers, academic papers, and some report formats use two or three columns. Without guidance, DeepSeek-OCR sometimes interleaves column text. Add this to the prompt:

Process columns left-to-right, top-to-bottom. Complete each column fully before moving to the next.

This instruction reduces column interleaving from a common failure to an occasional edge case. It is not a guarantee on all layouts, but it handles the majority of standard two-column academic PDFs.
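The prompt variants from this section can be assembled per document type instead of hard-coding one string. A minimal sketch using the exact prompt sentences quoted above; `build_prompt` and its flags are illustrative names.

```python
BASE_PROMPT = "Extract all text from this image."
TABLE_HINT = ("Represent any tables as GitHub Flavored Markdown tables "
              "with pipe delimiters.")
COLUMN_HINT = ("Process columns left-to-right, top-to-bottom. "
               "Complete each column fully before moving to the next.")


def build_prompt(tables: bool = False, multi_column: bool = False) -> str:
    """Compose the OCR instruction from the hints that apply to this document."""
    parts = [BASE_PROMPT]
    if tables:
        parts.append(TABLE_HINT)
    if multi_column:
        parts.append(COLUMN_HINT)
    return " ".join(parts)


print(build_prompt(tables=True, multi_column=True))
```

Keeping the set of prompt variants small also helps prefix caching, since requests that share an identical instruction can reuse more cached tokens.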


DeepSeek-OCR plus a fast GPU is the most cost-effective way to run production document extraction today. Spheron gives you on-demand and spot access to the exact GPUs this pipeline runs on, with no minimum commitments.
