DeepSeek-OCR turns any GPU server into a document extraction API that matches or beats cloud OCR services on accuracy, at a fraction of the cost. It processes images as vision-language token sequences, which means it natively handles tables, mixed scripts, handwriting, and structured forms that classical OCR pipelines garble. This guide walks through the full production setup: hardware sizing, vLLM and SGLang container configuration, a FastAPI service for PDF and image pipelines, cost-per-million-pages benchmarks against AWS Textract and GCP Document AI, and RAG integration with Qdrant and ColPali.
What DeepSeek-OCR Is and Why It Outperforms Tesseract, PaddleOCR, and Donut
Classical OCR tools like Tesseract and PaddleOCR work through a pipeline of steps: image preprocessing, text region detection, character segmentation, and character classification. Each step introduces error. On clean, high-contrast printed text that pipeline works acceptably. Put a photocopied table, a mixed-script invoice, or handwritten annotations in front of it and accuracy falls fast.
DeepSeek-OCR approaches the problem differently. It treats the page as a visual input to a transformer encoder, processes the image as a sequence of patch tokens, and generates the extracted text autoregressively through a language model decoder. This means layout understanding, table structure, and text semantics are all handled in one pass, and the output format is controllable through the prompt. You can ask for markdown tables, JSON key-value pairs, or plain text without any post-processing logic.
| Metric | Tesseract | PaddleOCR | Donut | DeepSeek-OCR |
|---|---|---|---|---|
| Architecture | Classical pipeline | CNN+CTC | Transformer seq2seq | Vision-language model |
| Handwriting | Poor | Fair | Fair | Good |
| Table extraction | Manual post-process | Limited | No | Native via prompt |
| Mixed scripts | Limited | Fair | Limited | Strong |
| API-compatible | No | No | No | Yes (OpenAI vision spec) |
| VRAM at batch=1 | None | None | ~4 GB | ~8 GB |
The trade-off is significant: DeepSeek-OCR needs a GPU, with >=16 GB of VRAM recommended (more for larger batch sizes), where Tesseract needs none, and it slows down at very high DPI. For teams processing simple, clean printed documents at extremely high volume, a classical pipeline is still cheaper at the margin. For everyone dealing with enterprise documents, scanned PDFs, or structured data extraction, the accuracy difference makes DeepSeek-OCR worth the GPU cost.
For teams that want to skip text extraction entirely and retrieve documents by visual similarity, the ColPali multimodal document RAG guide covers patch-level embedding over raw page images without any OCR step.
Hardware Requirements: VRAM, Throughput, and Token-Compression Math
Understanding DeepSeek-OCR's VRAM profile starts with its architecture. Rather than a single ViT encoder, it uses a DeepEncoder that chains SAM (window attention) and CLIP (global attention) through a 16x convolutional compressor. This "Contexts Optical Compression" achieves a 7-20x reduction in visual tokens: at about 10x compression the model still decodes text at roughly 97% precision, and in Base mode (256 vision tokens per page) it handles most printed documents without quality loss. Tiny mode produces 64 tokens, Small 100, and Large 400 for high-fidelity work. At 256 visual tokens plus up to 4,096 output tokens, a single page sits under 5,000 context tokens in Base mode.
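The per-page token arithmetic can be checked with a few lines of Python. This is a minimal sketch: the mode names and vision-token counts are the ones quoted above, the 4,096-token output cap is the value used later in this guide, and the helper name is illustrative. Prompt text tokens are not counted, so treat the result as a rough upper bound rather than an exact figure.

```python
# Vision-token counts per mode, as quoted in the paragraph above.
VISION_TOKENS = {"tiny": 64, "small": 100, "base": 256, "large": 400}
MAX_OUTPUT_TOKENS = 4096  # generation cap assumed throughout this guide


def page_context_budget(mode: str, max_output: int = MAX_OUTPUT_TOKENS) -> int:
    """Vision tokens plus the output cap for one page (prompt text excluded)."""
    return VISION_TOKENS[mode] + max_output


for mode in VISION_TOKENS:
    print(f"{mode:>5}: {page_context_budget(mode)} context tokens per page")
```

Multiplying the per-page budget by the number of concurrent pages gives a first estimate of how much context the server has to hold at once.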
At BF16, the model weights are roughly 6.7 GB: a DeepEncoder at ~380M params and a 3B-class MoE decoder with 570M active parameters (6 of 64 experts active per token). The Hugging Face model card recommends >=16 GB VRAM. An L40S at 48 GB comfortably handles batch=16 with significant headroom; the RTX Pro 6000 at 96 GB supports batch=32 and above. KV cache overhead is substantially lower than a 7B model given the smaller active parameter count.
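The weight-memory figure follows from parameter count times bytes per parameter. The sketch below assumes a nominal 3.0B total parameters for the "3B-class" decoder, which is an approximation rather than an exact count, and it ignores KV cache and activation memory.

```python
def weight_memory_gb(params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in decimal gigabytes."""
    return params * bytes_per_param / 1e9


ENCODER_PARAMS = 380e6  # DeepEncoder, as quoted above
DECODER_PARAMS = 3.0e9  # assumption: nominal total for the "3B-class" MoE decoder

bf16_gb = weight_memory_gb(ENCODER_PARAMS + DECODER_PARAMS, bytes_per_param=2)
fp8_gb = weight_memory_gb(ENCODER_PARAMS + DECODER_PARAMS, bytes_per_param=1)
print(f"BF16 weights: ~{bf16_gb:.2f} GB, FP8 weights: ~{fp8_gb:.2f} GB")
```

All experts must be resident in memory even though only a few are active per token, which is why the footprint tracks total parameters rather than the 570M active ones.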
The table below shows throughput estimates for each GPU. Figures are estimates based on the model's actual architecture (570M active params, MoE decoder); actual results vary by DPI, compression mode, and output length.
| GPU | VRAM | Batch size at BF16 | Throughput (pages/min, est.) | On-Demand $/hr | Spot $/hr |
|---|---|---|---|---|---|
| NVIDIA L40S | 48 GB | 16 | ~35 | $0.72 | N/A |
| NVIDIA RTX Pro 6000 | 96 GB | 32 | ~85 | $1.70 | $0.59 |
| NVIDIA RTX 5090 | 32 GB | 12 | ~40 | $0.924 | N/A |
| NVIDIA H100 SXM5 | 80 GB | 24 | ~88 | $3.10 | $0.80 |
Pricing fluctuates based on GPU availability. The prices above are as of 03 May 2026 and may have changed. Check current GPU pricing for live rates.
For batch document processing where throughput per dollar is the priority, the L40S GPU rental is a cost-effective default with the lowest on-demand hourly rate in the table above. It delivers 35 pages/min at $0.72/hr on-demand, which works out to roughly 2,100 pages per GPU-hour. For latency-sensitive serving or long PDFs where you want the full model in VRAM with no paging, RTX Pro 6000 on Spheron gives 96 GB of GDDR7 and handles batch=32 without swapping. If you need H100-class memory bandwidth for mixed LLM and OCR workloads on the same node, H100 SXM5 instances are the right choice.
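To compare the options in the hardware table on a pages-per-dollar basis, a short script is enough. The sketch below uses the estimated throughput and hourly rates from the table as inputs; the class and field names are illustrative, and the results are only as reliable as those estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    name: str
    pages_per_min: float  # estimated throughput from the hardware table
    usd_per_hour: float   # hourly rate from the hardware table

    @property
    def pages_per_hour(self) -> float:
        return self.pages_per_min * 60

    @property
    def pages_per_dollar(self) -> float:
        return self.pages_per_hour / self.usd_per_hour


OPTIONS = [
    GpuOption("L40S on-demand", 35, 0.72),
    GpuOption("RTX Pro 6000 on-demand", 85, 1.70),
    GpuOption("RTX Pro 6000 spot", 85, 0.59),
    GpuOption("RTX 5090 on-demand", 40, 0.924),
    GpuOption("H100 SXM5 on-demand", 88, 3.10),
    GpuOption("H100 SXM5 spot", 88, 0.80),
]

# Rank by raw pages per dollar (100% utilization, no idle time).
for opt in sorted(OPTIONS, key=lambda o: o.pages_per_dollar, reverse=True):
    print(f"{opt.name:<24} {opt.pages_per_hour:>6.0f} pages/hr  "
          f"{opt.pages_per_dollar:>7.0f} pages/$")
```

Spot rows only make sense for jobs that tolerate interruption, so the ranking is a starting point rather than a recommendation.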
Container Setup with vLLM and SGLang Backends
vLLM Setup
vLLM's OpenAI-compatible server handles vision-language models natively through its multimodal API. DeepSeek-OCR follows the same setup path as any other VLM.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-OCR \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --limit-mm-per-prompt image=4 \
  --served-model-name deepseek-ocr

The --limit-mm-per-prompt image=4 flag allows up to four images per request, which is useful when batching several pages into a single API call. Set --max-model-len based on your VRAM: 16,384 tokens works on L40S at batch=16. For H100, you can push to 32,768 tokens.
For FP8 quantization on H100 (roughly 1.5-2x throughput improvement), add --quantization fp8. For detailed multi-GPU tensor-parallel setup and FP8 flags, the vLLM production deployment guide covers the same configuration flow for vision-language models.
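When several pages go into one API call, the client has to respect the per-request image limit set on the server. The sketch below groups base64-encoded pages into OpenAI-style chat payloads of at most four images each. The function name and constant are illustrative, and whether multi-image prompts give good OCR quality on this model is something to verify on your own documents.

```python
from typing import Iterator

MAX_IMAGES_PER_REQUEST = 4  # keep in sync with --limit-mm-per-prompt image=4


def build_payloads(
    pages_b64: list[str],
    prompt: str,
    model: str = "deepseek-ocr",
) -> Iterator[dict]:
    """Yield one chat-completions payload per group of up to four page images."""
    for start in range(0, len(pages_b64), MAX_IMAGES_PER_REQUEST):
        group = pages_b64[start:start + MAX_IMAGES_PER_REQUEST]
        content = [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
            for b64 in group
        ]
        content.append({"type": "text", "text": prompt})
        yield {
            "model": model,
            "messages": [{"role": "user", "content": content}],
            "max_tokens": 4096,  # shared by all pages in the group
            "temperature": 0,
        }
```

Because the output cap is shared across the group, dense pages may need a higher max_tokens or fewer images per request.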
SGLang Setup
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-OCR \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

SGLang's RadixAttention prefix cache is the key advantage here for document workloads. If you are processing thousands of pages from the same document set, many pages share the same system prompt and layout instructions. RadixAttention caches those shared prefixes and avoids recomputing attention for them on each request. On a corpus where documents share headers, footers, or common preamble structure, prefix hit rates of 40-60% are typical, which translates to meaningful throughput gains.
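To get a feel for how much a prefix cache could reuse on a given workload, you can estimate the shared-prefix fraction offline. The sketch below is a toy model, not SGLang's implementation: it takes requests as lists of token IDs (hypothetical integers here), assumes nothing is ever evicted, and counts how many leading tokens of each request match the longest prefix of any earlier request.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(requests: list[list[int]]) -> float:
    """Fraction of all prompt tokens covered by a prefix of an earlier request."""
    seen: list[list[int]] = []
    reused = 0
    total = 0
    for tokens in requests:
        total += len(tokens)
        reused += max((shared_prefix_len(tokens, s) for s in seen), default=0)
        seen.append(tokens)
    return reused / total if total else 0.0


# Three prompts sharing a 4-token instruction prefix, then diverging.
demo = [[1, 2, 3, 4, 9], [1, 2, 3, 4, 8], [1, 2, 3, 4, 7]]
print(f"estimated prefix hit rate: {prefix_hit_rate(demo):.0%}")
```

Only identical leading tokens count as a prefix, so shared instructions help most when they come before the per-page content in the prompt.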
| Criterion | vLLM | SGLang |
|---|---|---|
| Startup time | ~90s | ~120s |
| Prefix caching | Basic (block-level) | RadixAttention (tree-based) |
| OpenAI API compat | Full | Full |
| Ecosystem maturity | Higher | Growing |
| Best for | General serving, one-off requests | Repeated templates, batch OCR |
Spheron Deployment
Provision your instance from app.spheron.ai: select the GPU type, choose your region, and pull the vLLM container directly. For the CLI, refer to the Spheron docs for the instance launch flow. Once SSHed in, verify the GPU with nvidia-smi before starting the container.
For the RTX Pro 6000 or H100, a spheron.yml snippet for the vLLM service looks like:
version: "1.0"
services:
  ocr-server:
    image: vllm/vllm-openai:latest
    expose:
      - port: 8000
        as: 8000
        to:
          - global: true
    command:
      - "--model deepseek-ai/DeepSeek-OCR"
      - "--tensor-parallel-size 1"
      - "--max-model-len 16384"
      - "--limit-mm-per-prompt image=4"
    resources:
      gpu:
        units: 1
        attributes:
          vendor:
            nvidia:
              - model: rtx-pro-6000

Building a FastAPI OCR Endpoint: PDF, Image, and Multi-Page Document Pipelines
The following service accepts PDF uploads or raw images, runs each page through DeepSeek-OCR in parallel, and returns structured JSON.
import asyncio
import base64
import io

import httpx
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from pdf2image import convert_from_bytes

# Requires poppler-utils installed at OS level:
#   RUN apt-get install -y poppler-utils
# in your Dockerfile, or the PDF conversion will raise PDFInfoNotInstalledError.

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "deepseek-ocr"


async def ocr_page(client: httpx.AsyncClient, img_b64: str) -> str:
    # One page per request, sent as a base64 data URI in the OpenAI vision format.
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
                {"type": "text", "text": "Extract all text from this image. Preserve tables and structure as Markdown."}
            ]
        }],
        "max_tokens": 4096,
        "temperature": 0
    }
    r = await client.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    choices = r.json().get("choices", [])
    return choices[0].get("message", {}).get("content", "") if choices else ""


@app.post("/ocr")
async def ocr_document(file: UploadFile = File(...)):
    raw = await file.read()
    if file.filename and file.filename.lower().endswith(".pdf"):
        # Rasterize every PDF page to JPEG bytes.
        images = convert_from_bytes(raw, dpi=150, fmt="jpeg")
        page_bytes = []
        for img in images:
            buf = io.BytesIO()
            img.save(buf, format="JPEG")
            page_bytes.append(buf.getvalue())
    else:
        # Treat any non-PDF upload as a single page image.
        page_bytes = [raw]

    async with httpx.AsyncClient() as client:
        tasks = [
            ocr_page(client, base64.b64encode(b).decode())
            for b in page_bytes
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    # Keep successful pages and report failed ones separately.
    pages = []
    errors = []
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            errors.append({"page": i + 1, "error": str(result)})
        else:
            pages.append({"page": i + 1, "text": result})

    response: dict = {"pages": pages, "page_count": len(results)}
    if errors:
        response["partial_failures"] = errors
    return JSONResponse(response)

A few notes on this implementation:
- dpi=150 is the default for printed documents. Increase to 200-300 for handwriting, dense tables, or small fonts, at the cost of roughly 1.75-3x more visual tokens per page.
- temperature=0 gives deterministic output. For mixed handwriting or degraded scans where greedy decoding stalls on ambiguous characters, try temperature=0.1.
- For large PDFs (more than 20 pages), break the conversion into groups of 10 pages and use a background task queue (e.g., arq or Celery with Redis). Sending 50 concurrent 6,000-token requests can exhaust VRAM on a single GPU.
- Add slowapi rate limiting on the /ocr endpoint if you are exposing this service publicly. Without it, a burst of large PDFs can monopolize the GPU queue.
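The "groups of 10 pages" advice from the notes above comes down to a small chunking helper. This is a minimal sketch with an illustrative function name; the queue integration (arq, Celery) is left out because it depends on your deployment.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int = 10) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example: a 25-page document becomes three batches of 10, 10, and 5 pages.
page_numbers = list(range(1, 26))
for batch in chunked(page_numbers):
    print(f"submit pages {batch[0]}-{batch[-1]}")
```

pdf2image's convert_from_bytes also accepts first_page and last_page, so each batch can be rasterized on demand instead of converting the whole PDF up front.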
Production Tuning: Batching, Page-Level Parallelism, and DPI Trade-Offs
vLLM concurrency: Set --max-num-seqs 16 to allow up to 16 concurrent requests at the vLLM level. For offline batch jobs, this is the primary knob for throughput. For interactive serving where latency matters, drop it to 4-8 to reduce queue depth.
asyncio.gather parallelism: The FastAPI handler above submits all pages simultaneously via asyncio.gather. The bottleneck is GPU compute, not Python thread scheduling. On a loaded system, this means all pages arrive at vLLM at once and compete for the same VRAM-constrained batch slots. Pair this with --max-num-seqs tuning so the queue doesn't overflow.
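One way to keep the client from flooding the server is to cap in-flight requests with an asyncio.Semaphore. The sketch below is a generic wrapper with an illustrative name (gather_bounded). The worker would be a coroutine such as the per-page OCR call, and the limit would typically match the server's --max-num-seqs value.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def gather_bounded(
    items: Sequence[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int = 16,
) -> list:
    """Run worker(item) for every item with at most `limit` calls in flight.

    Results keep input order; exceptions are returned in place, not raised.
    """
    sem = asyncio.Semaphore(limit)

    async def run(item: Any) -> Any:
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(run(i) for i in items), return_exceptions=True)


async def _demo() -> None:
    async def fake_ocr(page: int) -> str:
        await asyncio.sleep(0.01)  # stands in for the HTTP round-trip
        return f"text of page {page}"

    results = await gather_bounded(range(1, 6), fake_ocr, limit=2)
    print(results)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Because exceptions come back in place, the per-page error handling from the FastAPI handler works unchanged on the returned list.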
DPI and token count trade-offs:
| DPI | Visual tokens/page (approx.) | Relative accuracy | Relative compute |
|---|---|---|---|
| 72 | ~600 | Low | 0.3x |
| 150 | ~2,000 | Good for printed text | 1x (baseline) |
| 200 | ~3,500 | Better for small fonts | 1.75x |
| 300 | ~6,000 | Best, needed for handwriting | 3x |
Test at 150 DPI first on your actual corpus. Many enterprise printed documents produce excellent results at 150 DPI with no need to triple compute costs.
Token compression: If your DeepSeek-OCR build exposes a spatial token pooling option, pass it to vLLM through --mm-processor-kwargs, for example '{"image_compression_factor":0.5}'. The key name depends on the model's processor, so confirm it before relying on it. This halves visual tokens at a 3-5% accuracy penalty on clean documents. On an L40S, token compression roughly doubles batch size capacity, which is usually an acceptable trade for standard printed text.
Memory management for large documents: For PDFs over 20 pages, chunk processing into groups and use a persistent background queue. A 50-page PDF at 200 DPI generates 50 x 3,500 = 175,000 visual tokens of concurrent context. At batch=8, that saturates L40S VRAM. Process in chunks of 10 pages per batch submission instead.
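The effect of chunking on peak context can be worked out directly from the DPI table. The sketch below uses the approximate tokens-per-page figures from that table; the function name is illustrative, and real token counts depend on page content and compression mode.

```python
from typing import Optional

# Approximate visual tokens per page by DPI, from the table above.
TOKENS_PER_PAGE = {72: 600, 150: 2000, 200: 3500, 300: 6000}


def concurrent_visual_tokens(
    pages: int, dpi: int, chunk_size: Optional[int] = None
) -> int:
    """Visual tokens in flight at once, with or without page chunking."""
    in_flight = pages if chunk_size is None else min(pages, chunk_size)
    return in_flight * TOKENS_PER_PAGE[dpi]


all_at_once = concurrent_visual_tokens(50, 200)
in_chunks = concurrent_visual_tokens(50, 200, chunk_size=10)
print(f"all pages at once: {all_at_once:,} visual tokens")
print(f"chunks of 10:      {in_chunks:,} visual tokens")
```

Chunking a 50-page document into groups of 10 cuts the peak by a factor of five, at the cost of running five sequential batches.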
Cost Per 1M Pages: Spheron vs Hyperscaler GPU Cloud Pricing Comparison
The numbers below use the throughput estimates from the hardware table above with the following assumptions:
- DPI = 150, avg 2,000 tokens of image input per page
- GPU utilization = 85%
- No hyperscaler support charges, egress fees, or minimum commitments included
- Per-page cost = (hourly rate) / (pages/min x 60 x 0.85 utilization)
| Platform | GPU | Pricing model | Per-page cost (est.) | Cost per 1M pages |
|---|---|---|---|---|
| Spheron (on-demand) | L40S | $0.72/hr | $0.00040 | ~$403 |
| Spheron (on-demand) | RTX Pro 6000 | $1.70/hr | $0.00039 | ~$392 |
| Spheron (spot) | RTX Pro 6000 | $0.59/hr | $0.00014 | ~$136 |
| Spheron (on-demand) | H100 SXM5 | $3.10/hr | $0.00069 | ~$691 |
| Spheron (spot) | H100 SXM5 | $0.80/hr | $0.00018 | ~$178 |
| AWS Textract | - | per page | $0.0015 | $1,500 |
| GCP Document AI | - | per page | $0.0015 | $1,500 |
| Azure Form Recognizer | - | per page | $0.0015 | $1,500 |
The L40S on-demand at ~$403 per million pages is about 3.7x cheaper than Textract. RTX Pro 6000 spot at ~$136 is over 10x cheaper. Beyond cost, DeepSeek-OCR returns structured Markdown output with table preservation. AWS Textract's DetectDocumentText returns flat text with positional bounding boxes but requires separate table analysis calls at additional per-page fees. If your pipeline needs tables and structured extraction, the comparison gets worse for the managed services once you add the table analysis tier.
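The per-page formula from the assumptions list can be sketched in a few lines. The function names are illustrative, and the inputs are the throughput estimates and hourly rates quoted in this guide, so the outputs are estimates as well.

```python
def per_page_cost(
    usd_per_hour: float, pages_per_min: float, utilization: float = 0.85
) -> float:
    """Per-page cost = hourly rate / (pages/min x 60 x utilization)."""
    return usd_per_hour / (pages_per_min * 60 * utilization)


def cost_per_million(
    usd_per_hour: float, pages_per_min: float, utilization: float = 0.85
) -> float:
    return per_page_cost(usd_per_hour, pages_per_min, utilization) * 1_000_000


MANAGED_PER_PAGE = 0.0015  # managed-service per-page price used in the table

for name, rate, ppm in [
    ("L40S on-demand", 0.72, 35),
    ("RTX Pro 6000 spot", 0.59, 85),
    ("H100 SXM5 on-demand", 3.10, 88),
]:
    per_page = per_page_cost(rate, ppm)
    print(f"{name:<20} ${cost_per_million(rate, ppm):,.0f} per 1M pages, "
          f"{MANAGED_PER_PAGE / per_page:.1f}x cheaper than managed OCR")
```

Lowering the utilization argument shows how quickly idle GPU time erodes the advantage, which is the main risk at low volumes.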
The caveat: at volumes below 50,000 pages per month, the fixed cost of maintaining a GPU instance likely outweighs Textract's pay-per-page simplicity. Self-hosting makes economic sense starting around 100,000 pages/month, or immediately if you need output quality that classical OCR cannot deliver.
Integrating with RAG Stacks (ColPali, Qdrant, vLLM) for Document Q&A
Once DeepSeek-OCR extracts text from your documents, you have a standard RAG indexing problem. The architecture looks like this:
PDF upload -> DeepSeek-OCR FastAPI -> text chunks
                                          |
                              TEI embedding model (GPU)
                                          |
                                 Qdrant vector store
                                          |
                    vLLM LLM (Qwen / DeepSeek-V3) <- query

For text-heavy documents (contracts, technical manuals, financial filings), DeepSeek-OCR plus a standard embedding model gives better chunk-level retrieval precision than ColPali's page-level embeddings. The OCR step produces dense text chunks that embed accurately, and you get sub-page retrieval granularity.
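The "text chunks" step in the pipeline can be as simple as a sliding character window. The sketch below is one possible chunker, not the only reasonable one: the function name and the 1,200-character window with 200-character overlap are illustrative defaults, and a production pipeline might split on Markdown headings or table boundaries instead.

```python
def chunk_text(text: str, max_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Split text into windows of max_chars that overlap by `overlap` characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so neighbours share context
    return chunks


sample = "Clause 1. " * 300  # stands in for OCR output from one page
pieces = chunk_text(sample)
print(len(pieces), "chunks; first chunk length:", len(pieces[0]))
```

Each chunk is then embedded and stored with its page number in the payload, so retrieval can point back to the source page.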
For visual-heavy documents (slides, scanned catalogs, annotated engineering drawings), swap the OCR step for ColPali visual embeddings and skip the text extraction entirely. ColPali preserves spatial layout information that even accurate OCR loses when converting to linear text.
Set up self-hosted Qdrant or Milvus on Spheron on the same cluster as your OCR and inference nodes to eliminate network round-trips during retrieval. For the TEI embedding server setup and reranker configuration, see the self-host embedding and reranker models guide.
Once documents are indexed, a basic query loop looks like this:
# After indexing OCR output into Qdrant.
# Assumes qdrant_client (a QdrantClient), vllm_client (an OpenAI client pointed
# at the vLLM server), and embed() (query string -> vector) are already defined.
hits = qdrant_client.search(
    collection_name="docs",
    query_vector=embed(query),
    limit=5
)
# Join the retrieved chunks into a single context block.
context = "\n---\n".join(h.payload["text"] for h in hits)
answer = vllm_client.chat.completions.create(
    model="deepseek-v3",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    ]
)

Run the embedding model and vLLM inference server on the same GPU node to avoid inter-service latency. An L40S handles BGE-M3 embeddings (1 GB VRAM) plus inference on a 7B-class distilled model (about 14 GB VRAM at BF16) concurrently with room to spare.
Common Failure Modes: Skewed Layouts, Handwriting, Table Extraction
Skewed or rotated pages
A 5-degree page skew drops OCR accuracy by 15-20% on standard printed text. Pre-process with OpenCV deskew before sending to DeepSeek-OCR:
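import cv2
import numpy as np


def deskew(image: np.ndarray) -> np.ndarray:
    """Rotate an RGB page image so its text lines are horizontal."""
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    # (x, y) coordinates of dark pixels; assumes dark text on a light background
    coords = np.argwhere(gray < 128)[:, ::-1].astype(np.float32)
    if len(coords) == 0:
        return image  # nothing to deskew
    angle = cv2.minAreaRect(coords)[-1]
    # minAreaRect reports [-90, 0) on older OpenCV releases and (0, 90] on newer
    # ones; fold either range into (-45, 45] so a small skew stays a small angle
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    h, w = image.shape[:2]
    # With (x, y) input the folded angle is the clockwise skew; a positive
    # getRotationMatrix2D angle rotates counter-clockwise, which undoes it
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

Apply this after convert_from_bytes: convert each PIL Image with np.array(img), pass it to deskew, and rebuild the PIL Image with Image.fromarray(deskewed) before encoding to JPEG. It adds ~10-20ms per page on CPU and removes most of the accuracy loss from scan misalignment.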
Handwriting
DeepSeek-OCR handles block-letter handwriting better than Tesseract but degrades on cursive and informal script. Two adjustments help:
- Set temperature=0.1 instead of temperature=0 to allow slightly more variation in transcription. Pure greedy decoding sometimes gets stuck when character boundaries are ambiguous.
- Accept a 10-15% character error rate on mixed handwriting/print documents and design your downstream pipeline to tolerate noise. For high-stakes workflows, add a human review queue for low-confidence pages.
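A review queue needs some per-page confidence signal. One option, sketched below, is to request per-token log-probabilities from the OpenAI-compatible endpoint (the logprobs option on chat completions) and route pages by their geometric-mean token probability. The function names and the 0.85 threshold are hypothetical, and the threshold should be calibrated on pages you have checked by hand.

```python
import math


def mean_token_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability in (0, 1]; 0.0 for an empty page."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def needs_review(token_logprobs: list[float], threshold: float = 0.85) -> bool:
    """Flag a page for human review when its confidence falls below threshold."""
    return mean_token_confidence(token_logprobs) < threshold


confident_page = [math.log(0.99)] * 200  # stand-in for real per-token logprobs
shaky_page = [math.log(0.99)] * 150 + [math.log(0.30)] * 50
print(needs_review(confident_page), needs_review(shaky_page))
```

Token probability measures how sure the decoder was, not whether the transcription is right, so it works as a triage signal rather than an accuracy guarantee.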
Table extraction
The default prompt "Extract all text from this image." treats tables as flowing text. For structured tables, use:
Extract all text from this image. Represent any tables as GitHub Flavored Markdown tables with pipe delimiters.

For complex tables with merged cells, run a second OCR pass on a bounding-box crop of the table region at higher DPI. Cropping to 110% of the table area and running at 200-300 DPI significantly improves cell boundary detection.
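The 110% crop can be computed from a table bounding box with a small helper. The sketch below reads "110%" as scaling the box's width and height by 1.10 around its center, which is an assumption, and clamps the result to the page. How you obtain the bounding box (a layout detector, or manual annotation) is outside its scope.

```python
def expand_bbox(
    box: tuple[int, int, int, int],
    image_size: tuple[int, int],
    scale: float = 1.10,
) -> tuple[int, int, int, int]:
    """Grow (left, top, right, bottom) around its center, clamped to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    pad_x = (x1 - x0) * (scale - 1) / 2
    pad_y = (y1 - y0) * (scale - 1) / 2
    return (
        max(0, round(x0 - pad_x)),
        max(0, round(y0 - pad_y)),
        min(width, round(x1 + pad_x)),
        min(height, round(y1 + pad_y)),
    )


# A 200x100 table region on a 1000x1000 page grows by 10 px and 5 px per side.
print(expand_bbox((100, 100, 300, 200), (1000, 1000)))
```

The returned tuple uses the same (left, upper, right, lower) order as Pillow's Image.crop, so it can be passed straight to the page image before the second OCR pass.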
Multi-column layouts
Newspapers, academic papers, and some report formats use two or three columns. Without guidance, DeepSeek-OCR sometimes interleaves column text. Add this to the prompt:
Process columns left-to-right, top-to-bottom. Complete each column fully before moving to the next.

This instruction reduces column interleaving from a common failure to an occasional edge case. It is not a guarantee on all layouts, but it handles the majority of standard two-column academic PDFs.
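Since the table and column instructions are plain prompt additions, they can be composed per document type. The sketch below assembles the prompt strings quoted in this section; the function and constant names are illustrative.

```python
BASE_PROMPT = "Extract all text from this image."
TABLE_HINT = (
    "Represent any tables as GitHub Flavored Markdown tables with pipe delimiters."
)
COLUMN_HINT = (
    "Process columns left-to-right, top-to-bottom. "
    "Complete each column fully before moving to the next."
)


def build_prompt(tables: bool = False, multi_column: bool = False) -> str:
    """Compose the OCR prompt from the base instruction plus optional hints."""
    parts = [BASE_PROMPT]
    if tables:
        parts.append(TABLE_HINT)
    if multi_column:
        parts.append(COLUMN_HINT)
    return " ".join(parts)


# Example: a two-column academic paper that also contains tables.
print(build_prompt(tables=True, multi_column=True))
```

Keeping the hints in one place makes it easier to A/B test prompt wording against a labelled sample of your own documents.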
DeepSeek-OCR plus a fast GPU is the most cost-effective way to run production document extraction today. Spheron gives you on-demand and spot access to the exact GPUs this pipeline runs on, with no minimum commitments.
