VLM-based OCR replaced classical pipelines for enterprise document work in 2026. This post compares five production-ready open-source options (PaddleOCR-VL-1.6, DeepSeek-OCR, dots.ocr, GOT-OCR 2.0, and Granite-Docling) across accuracy, VRAM, throughput, and cost per 10,000 pages on Spheron GPUs. For a deep dive on deploying DeepSeek-OCR specifically (vLLM/SGLang containers, FastAPI endpoint, and cost-per-million-pages math), see the DeepSeek-OCR deployment guide. This post focuses on model selection: how to pick the right model for your document type, GPU budget, and throughput target.
The motivation is straightforward. AWS Textract charges $1.50 per 1,000 pages for basic text extraction. GCP Document AI is similar. Azure Document Intelligence runs higher on custom models. At 100,000 pages per month, that's roughly $150 to $1,000 in API costs, every month, before table extraction fees. Self-hosting on an L40S or A100 on Spheron cuts that to $50-100 per month at comparable accuracy, with documents never leaving your GPU boundary.
Why OCR Went VLM-Native in 2026
Classical OCR worked through a sequence of steps: binarization, layout detection, text region segmentation, character classification, and language-model correction. Each step introduced error that compounded through the pipeline. On clean printed documents the approach was adequate. On photocopied tables, mixed-script invoices, handwritten annotations, or scanned legal contracts, accuracy dropped fast.
VLM-based OCR collapses that pipeline to a single forward pass. The model receives the page as a grid of image patches, processes them through a visual encoder, and generates the extracted text (or structured JSON) autoregressively through a language model decoder. Layout structure, table cell relationships, and text content are all handled in one call. The prompt controls output format, so you can ask for markdown tables, JSON key-value pairs, or plain text without post-processing logic.
The other shift is the API surface. Served through vLLM or SGLang, each of these models is exposed via the OpenAI-compatible vision API: you submit an image and a text prompt to the /v1/chat/completions endpoint. That makes VLM-based OCR a near drop-in replacement for any integration already using GPT-4o Vision or a similar hosted vision model for document reading. For teams that need layout intelligence layered on top of OCR, the Docling, Marker, and MinerU guide covers the hybrid routing pattern where native PDFs go through layout parsers and scanned pages route to a VLM OCR model.
The cost case closed in 2025-2026 when models in the 1-3B range reached production accuracy on printed documents. Good OCR no longer requires H100-class hardware.
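To make the prompt-controlled output format concrete, here is a minimal sketch of building the chat-completions payload for a single page. The prompt wording and the `PROMPTS` / `build_messages` names are illustrative assumptions, not text taken from any model card; only the message shape follows the OpenAI vision format described above.

```python
# Hypothetical prompt presets: the wording is an assumption for illustration.
PROMPTS = {
    "markdown": "Extract this page as Markdown. Render tables as Markdown tables.",
    "json": "Extract this page as a JSON object of key-value pairs. Return only JSON.",
    "text": "Extract the plain text of this page in reading order.",
}


def build_messages(image_b64: str, output_format: str = "markdown") -> list[dict]:
    """Build an OpenAI-style chat payload for one base64-encoded JPEG page."""
    if output_format not in PROMPTS:
        raise ValueError(f"unknown output format: {output_format!r}")
    return [
        {
            "role": "user",
            "content": [
                # The page image travels inline as a data URL.
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
                # Only the text prompt changes between output formats.
                {"type": "text", "text": PROMPTS[output_format]},
            ],
        }
    ]
```

Switching from Markdown to JSON output is then a one-argument change on the client side; the serving stack stays the same.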
The Contenders: Sizes, Licenses, and Language Support
| Model | Size | License | Languages | OmniDocBench v1.6 (overall) | Hugging Face ID |
|---|---|---|---|---|---|
| PaddleOCR-VL-1.6 | 0.9B | Apache-2.0 | 100+ | 96.33% | PaddlePaddle/PaddleOCR-VL-1.6 |
| DeepSeek-OCR | ~3B MoE | MIT | ~100 | N/A | deepseek-ai/DeepSeek-OCR |
| dots.ocr | ~1.7B | MIT | ~100 | N/A | rednote-hilab/dots.ocr |
| GOT-OCR 2.0 | ~580M | Apache-2.0 | 20+ | N/A | ucaslcl/GOT-OCR2_0 |
| Granite-Docling | 258M | Apache-2.0 | Primarily EN | N/A | ibm-granite/granite-docling-258M |
PaddleOCR-VL-1.6's 96.33% OmniDocBench v1.6 overall score is its published headline result. For the remaining models, OmniDocBench v1.6 scores covering these exact checkpoints were not available at time of writing. Use the section below for a characteristic comparison by document dimension rather than a synthetic headline score.
PaddleOCR-VL-1.6 is the widest-coverage option. It pairs a native-resolution visual encoder with the ERNIE-4.5-0.3B LLM for a total of ~0.9B parameters, handling complex multilingual layouts, mixed scripts, and structured tables in a single pass. Its 96.33% OmniDocBench v1.6 score is the only published result on that benchmark in this group, which makes it the best-documented accuracy pick. It is the default choice when document language is unknown or when the corpus spans multiple languages and scripts.
DeepSeek-OCR uses a Mixture-of-Experts decoder with about 570 million active parameters per token out of ~3B total, so each decoding step costs far less compute than it would in a dense model of the same total size. It is MIT-licensed, making it fully open-source in the OSI sense with no restrictions on redistribution or fine-tuning. For bulk batch OCR where throughput per dollar matters most, its MoE architecture is the key advantage.
dots.ocr is published by RedNote (Xiaohongshu) at rednote-hilab/dots.ocr under the MIT license. It pairs a visual encoder with a ~1.7B LLM decoder for general multilingual document parsing across ~100 languages, with particular strength on structured documents and forms. Being MIT-licensed with no commercial-use restrictions, it is a genuine open-weight option for teams that need guaranteed redistribution rights alongside broad language coverage.
GOT-OCR 2.0 (General OCR Theory, from UCAS) targets formatted-text recognition: equations, tables, music sheets, geometric diagrams, and multi-column academic papers. At under 3 GB VRAM, it runs on consumer cards and is the right pick for memory-constrained or edge deployments. It does not handle handwriting or degraded scans as well as the larger models.
Granite-Docling is IBM's vision model built for the Docling document parsing pipeline. Its training includes TableFormer-informed table datasets, making it strong on financial tables and legal document layouts. The primary language coverage is English, and it integrates natively with Docling's DoclingDocument JSON output format.
OmniDocBench v1.6 Accuracy: Tables, Formulas, Multi-Column, Handwriting
PaddleOCR-VL-1.6's headline OmniDocBench v1.6 result is 96.33% overall. For the other models, published OmniDocBench v1.6 scores covering these exact checkpoints were not available at time of writing. The table below characterizes each model's strengths by document dimension:
| Dimension | PaddleOCR-VL-1.6 | DeepSeek-OCR | dots.ocr | GOT-OCR 2.0 | Granite-Docling |
|---|---|---|---|---|---|
| Overall (printed) | Strong (96.33%) | Strong | Good | Strong | Good |
| Table TEDS | Strong | Good | Good | Good | Strong |
| Formula recognition | Good | Good | Limited | Strong | Limited |
| Multi-column | Strong | Good | Good | Strong | Good |
| Handwriting CER | Good | Good | Limited | Limited | Limited |
| Multilingual | Strong (100+) | Good (~100) | Good (~100) | Fair (20+) | Limited (EN) |
These characterizations are based on published model documentation and benchmark references, not on OmniDocBench v1.6 test runs of every model. Differences in practical accuracy are often small on clean printed documents. The gaps open up on degraded scans, handwriting, and complex tables.
For tables: PaddleOCR-VL-1.6 and Granite-Docling are the strongest options. Granite-Docling's training on FinTabNet and PubTabNet gives it better table cell boundary detection on financial and scientific layouts. PaddleOCR-VL-1.6 is more flexible across document types.
For multilingual: PaddleOCR-VL-1.6 has the broadest coverage at 100+ languages. If your corpus includes CJK scripts, Arabic, or other non-Latin scripts, it is the right starting point. DeepSeek-OCR supports ~100 languages as well, making it viable for broad multilingual batch work.
For handwriting: Both PaddleOCR-VL-1.6 and DeepSeek-OCR handle handwritten annotations acceptably on clean scans. GOT-OCR 2.0 and Granite-Docling are not designed for handwriting.
For formulas: GOT-OCR 2.0 has specific training on mathematical notation and LaTeX generation. For scientific paper OCR with inline equations, it is the strongest candidate in this group.
VRAM Footprint and Throughput per GPU
Throughput figures below are peak decode estimates based on model architecture and VRAM sizing. They reflect decode-only rate and do not include PDF rasterization, I/O, or pre/post-processing overhead. Sustained end-to-end throughput is roughly half these figures and drives the cost-per-page numbers in the table further below. Actual results vary by DPI, batch size, output length, and serving configuration.
| Model | VRAM FP16 (GB) | Batch on L40S | Throughput L40S (pg/min, est.) | VRAM INT8 (GB) | Throughput A100 80G (pg/min, est.) |
|---|---|---|---|---|---|
| PaddleOCR-VL-1.6 | ~2 | 32+ | ~45 | ~1 | ~60 |
| DeepSeek-OCR | ~8 | 16 | ~35 | ~5 | ~45 |
| dots.ocr | ~3.5 | 24+ | ~35 | ~2 | ~48 |
| GOT-OCR 2.0 | ~3 | 32+ | ~65 | ~2 | ~80 |
| Granite-Docling | ~0.5 | 64+ | ~80 | ~0.3 | ~100 |
DeepSeek-OCR's MoE architecture activates roughly 570M parameters per decoding step from a ~3B total. PaddleOCR-VL-1.6 is itself compact at ~0.9B, so the raw decoding parameter gap is narrower than older MoE-vs-dense comparisons suggested. The throughput advantage for DeepSeek-OCR comes primarily from its routing efficiency and larger activation batch capacity at the same VRAM budget. When your pipeline needs to exceed ~45 pages/min peak decode on PaddleOCR-VL-1.6, moving to an A100 80G raises that ceiling to ~60 pages/min at about 1.5x the per-GPU cost.
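The relationship between peak decode rate, sustained throughput, and GPU count can be sketched as a short calculation. It assumes the "roughly half of peak" rule of thumb stated above (`sustained_factor=0.5`); the function names and the 8-hour processing window in the example are hypothetical, and the result is a planning estimate, not a measured figure.

```python
import math


def sustained_pages_per_hour(peak_pages_per_min: float,
                             sustained_factor: float = 0.5) -> float:
    """Convert a peak decode rate (pages/min) into sustained pages/hour.

    sustained_factor=0.5 encodes the assumption that rasterization, I/O,
    and pre/post-processing roughly halve end-to-end throughput.
    """
    return peak_pages_per_min * 60 * sustained_factor


def gpus_needed(pages_per_day: int, peak_pages_per_min: float,
                hours_per_day: float = 24.0,
                sustained_factor: float = 0.5) -> int:
    """Estimate how many GPUs cover a daily page volume in the given window."""
    per_gpu_per_day = (
        sustained_pages_per_hour(peak_pages_per_min, sustained_factor)
        * hours_per_day
    )
    return math.ceil(pages_per_day / per_gpu_per_day)


# PaddleOCR-VL-1.6 peak estimates from the table above.
print(sustained_pages_per_hour(45))               # L40S     -> 1350.0
print(sustained_pages_per_hour(60))               # A100 80G -> 1800.0
print(gpus_needed(100_000, 45, hours_per_day=8))  # -> 10
```

The cost table further below uses ~1,320 pages/hour for the L40S, slightly under this rule-of-thumb value, so treat the factor as approximate.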
For serving framework choice: use vLLM for general OCR workloads with varied document types. SGLang is worth switching to when processing large batches of identical-template documents (invoices, forms, contracts) because RadixAttention prefix caching reuses shared system-prompt tokens across requests, improving throughput by 20-40% on repeated-template workloads. The DeepSeek-OCR deployment guide covers the full vLLM and SGLang container configuration with Docker run commands and --tensor-parallel-size settings.
Deployment: vLLM and SGLang, Batching Scanned PDFs, Structured JSON Output
The vLLM launch command for PaddleOCR-VL-1.6 follows the same pattern as any vision-language model:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model PaddlePaddle/PaddleOCR-VL-1.6 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=1 \
  --served-model-name paddleocr-vl
```

For structured JSON output, instruct the model via system prompt. The model will follow the schema reliably on printed documents:
```python
system_prompt = """You are a document OCR engine. Extract the document content and return ONLY valid JSON with this structure:
{
  "text_blocks": [...],
  "tables": [{"headers": [...], "rows": [[...]]}],
  "page_language": "..."
}
Do not include any text outside the JSON object."""
```

For batching scanned PDFs, submit pages in parallel rather than sequentially:
```python
import asyncio
import base64
import io

from openai import AsyncOpenAI
from pdf2image import convert_from_path


async def ocr_page(client, image, page_num, system_prompt):
    # Encode the rasterized page as a base64 JPEG for the data URL.
    buf = io.BytesIO()
    image.save(buf, format='JPEG')
    img_b64 = base64.b64encode(buf.getvalue()).decode()
    response = await client.chat.completions.create(
        model="paddleocr-vl",  # must match --served-model-name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
                {"type": "text", "text": "Extract all content from this document page."}
            ]}
        ]
    )
    return page_num, response.choices[0].message.content


async def batch_ocr(pdf_path: str, system_prompt: str) -> list:
    # Local vLLM server; the API key is a placeholder.
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")
    images = convert_from_path(pdf_path, dpi=150)
    # One request per page, all submitted concurrently.
    tasks = [ocr_page(client, img, i, system_prompt) for i, img in enumerate(images)]
    # gather returns (page_num, text) tuples in page order.
    return await asyncio.gather(*tasks)
```

Convert PDFs at 150-200 DPI. Higher DPI increases token count without proportional accuracy gain on most printed documents. See docs.spheron.ai for instance provisioning steps.
Plugging OCR into a RAG Ingestion Pipeline
OCR output feeds a RAG pipeline in three stages: extraction, chunking, and embedding.
```text
PDF Pages
    |
    v
VLM OCR (vLLM/SGLang)
    |
    v
JSON blocks (text, tables, metadata)
    |
    v
Chunker (section-aware, table-preserving)
    |
    v
Embedding Model (BGE-M3, Qwen3-Embedding)
    |
    v
Qdrant / Milvus
```

The key decision at the chunking stage is whether to treat tables as single chunks or split them by row. For financial tables that need to be retrieved as whole objects (e.g., P&L statements), keep the table as one chunk. For long tables that span multiple topics (e.g., supplier comparison tables), row-level chunking with header metadata produces better retrieval precision.
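The two table-chunking strategies can be sketched as follows. This assumes tables arrive in the `{"headers": [...], "rows": [[...]]}` shape requested by the system prompt earlier. The `chunk_table` function, its `mode` argument, and the chunk dictionary layout are hypothetical choices for illustration, not a format required by any vector store.

```python
def chunk_table(table: dict, mode: str = "whole") -> list[dict]:
    """Turn one OCR-extracted table into embedding-ready chunks.

    mode="whole": one chunk holding the full table (e.g. a P&L statement).
    mode="row":   one chunk per row, each carrying the headers as metadata.
    """
    headers = [str(h) for h in table["headers"]]
    rows = table["rows"]

    if mode == "whole":
        lines = [" | ".join(headers)]
        lines += [" | ".join(str(cell) for cell in row) for row in rows]
        return [{
            "text": "\n".join(lines),
            "metadata": {"type": "table", "headers": headers},
        }]

    if mode == "row":
        return [
            {
                # Pair each cell with its header so a row is readable alone.
                "text": "; ".join(f"{h}: {cell}" for h, cell in zip(headers, row)),
                "metadata": {"type": "table_row", "headers": headers, "row_index": i},
            }
            for i, row in enumerate(rows)
        ]

    raise ValueError(f"unknown mode: {mode!r}")


suppliers = {
    "headers": ["Supplier", "Lead time"],
    "rows": [["Acme", "4 weeks"], ["Globex", "2 weeks"]],
}
print(chunk_table(suppliers, mode="row")[0]["text"])
# Supplier: Acme; Lead time: 4 weeks
```

Repeating the headers inside each row chunk costs a few tokens per chunk but keeps every row meaningful when it is retrieved without its neighbors.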
For the full RAG stack design including embedding model sizing, FAISS-GPU indexing, and LLM inference colocated on one GPU node, see the agentic RAG infrastructure guide. When your document corpus mixes native PDFs and scanned pages, the hybrid pipeline pattern for routing by page type is covered in our document intelligence guide, where native PDFs go through layout parsers and scanned pages route to a VLM OCR model.
For document corpora where cross-document reasoning matters (e.g., contract repositories or research literature), the GraphRAG deployment guide covers knowledge graph indexing on top of OCR-extracted text.
Cost per 10,000 Pages on Spheron GPUs vs Paid OCR APIs
Throughput and cost figures use PaddleOCR-VL-1.6 as the reference model since it is the most broadly applicable. DeepSeek-OCR's MoE architecture yields higher throughput at the same batch size, reducing cost per page by roughly 30-40% at equivalent GPU pricing. Throughput figures here are sustained end-to-end estimates including PDF rasterization, I/O, and processing overhead, which is roughly half of the peak decode rates shown in the VRAM table above.
| Platform | GPU | Pricing model | Throughput (pg/hr) | Per-10K-page cost |
|---|---|---|---|---|
| Spheron on-demand | L40S | $0.96/hr | ~1,320 | ~$7.27 |
| Spheron spot | L40S | $0.67/hr | ~1,320 | ~$5.08 |
| Spheron on-demand | A100 80G | $1.43/hr | ~1,800 | ~$7.94 |
| Spheron spot | A100 80G | $1.19/hr | ~1,800 | ~$6.61 |
| Spheron on-demand | H100 SXM5 | $4.06/hr | ~3,000 | ~$13.53 |
| AWS Textract | - | per-page | - | ~$15.00 |
| GCP Document AI | - | per-page | - | ~$15.00 |
| Azure Document Intelligence | - | per-page | - | ~$100.00 |
Pricing fluctuates based on GPU availability. The prices above are based on 23 Jun 2026 and may have changed. Check current GPU pricing for live rates.
Self-hosting breaks even against Textract at roughly 50,000-100,000 pages per month. Below that threshold the fixed GPU cost exceeds the per-page savings. Above it, the advantage compounds: at 500,000 pages per month, Spheron on a single L40S GPU rental costs about $364 in GPU time versus $750 for Textract (and more once you add table extraction fees).
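The per-10K-page figures in the table follow from simple arithmetic: GPU hours needed times hourly price. The sketch below reproduces it using the table's L40S on-demand price and sustained throughput; the function names are hypothetical, and it assumes you pay only for the hours spent processing.

```python
def self_hosted_cost(pages: int, gpu_price_per_hr: float,
                     pages_per_hr: float) -> float:
    """GPU rental cost in dollars to process `pages` at a sustained rate."""
    return pages / pages_per_hr * gpu_price_per_hr


def api_cost(pages: int, price_per_1000_pages: float) -> float:
    """Managed OCR API cost in dollars at a flat per-1,000-page price."""
    return pages / 1000 * price_per_1000_pages


# L40S on-demand: $0.96/hr at ~1,320 sustained pages/hr (from the table).
print(round(self_hosted_cost(10_000, 0.96, 1320), 2))   # 7.27
print(round(self_hosted_cost(500_000, 0.96, 1320), 2))  # 363.64
# Textract basic text extraction at $1.50 per 1,000 pages.
print(round(api_cost(500_000, 1.50), 2))                # 750.0
```

If the instance stays up between jobs, add the idle hours to the self-hosted side before comparing.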
Recommendation Matrix by Use Case
| Use case | Recommended model | GPU | Framework | Why |
|---|---|---|---|---|
| High-accuracy IDP (financial tables, legal docs) | PaddleOCR-VL-1.6 or Granite-Docling | L40S or A100 80G | vLLM | Strong table fidelity, structured JSON output |
| Multilingual document processing | PaddleOCR-VL-1.6 | L40S | vLLM | 100+ language coverage, handles mixed scripts |
| High-volume batch processing | DeepSeek-OCR | L40S spot or A100 spot | SGLang | MoE reduces compute per page at scale |
| Structured document and form OCR | dots.ocr | L40S or any 8 GB+ GPU | vLLM | MIT license, ~1.7B scale, strong on structured forms |
| Edge or memory-constrained deployment | GOT-OCR 2.0 | RTX 4090 or any 8 GB+ GPU | vLLM or Ollama | Under 3 GB VRAM, fast on printed documents |
| Docling-native pipeline | Granite-Docling | L40S | vLLM | Native IBM Docling JSON format, TableFormer-trained |
The core decision point is accuracy vs. cost vs. multilingual coverage. PaddleOCR-VL-1.6 has the only published OmniDocBench v1.6 score in this comparison (96.33%) and the broadest language coverage. DeepSeek-OCR is the cost-per-page pick for bulk batch work. GOT-OCR 2.0 is the right pick when VRAM budget is the binding constraint.
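If model selection needs to live in pipeline configuration, the matrix above can be transcribed into a lookup table. The sketch below does only that: the dictionary keys and the `recommend` helper are hypothetical names, and the values are copied from the table rather than derived from any benchmark.

```python
# Recommendation matrix transcribed from the table above.
# Keys are hypothetical identifiers chosen for this example.
RECOMMENDATIONS = {
    "high_accuracy_idp": {
        "models": ["PaddleOCR-VL-1.6", "Granite-Docling"],
        "gpu": "L40S or A100 80G",
        "framework": "vLLM",
    },
    "multilingual": {
        "models": ["PaddleOCR-VL-1.6"],
        "gpu": "L40S",
        "framework": "vLLM",
    },
    "high_volume_batch": {
        "models": ["DeepSeek-OCR"],
        "gpu": "L40S spot or A100 spot",
        "framework": "SGLang",
    },
    "structured_forms": {
        "models": ["dots.ocr"],
        "gpu": "L40S or any 8 GB+ GPU",
        "framework": "vLLM",
    },
    "edge": {
        "models": ["GOT-OCR 2.0"],
        "gpu": "RTX 4090 or any 8 GB+ GPU",
        "framework": "vLLM or Ollama",
    },
    "docling_native": {
        "models": ["Granite-Docling"],
        "gpu": "L40S",
        "framework": "vLLM",
    },
}


def recommend(use_case: str) -> dict:
    """Return the matrix row for a use case, or raise with the valid keys."""
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        valid = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {valid}") from None


print(recommend("high_volume_batch")["framework"])  # SGLang
```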
VLM-based OCR on mid-tier cloud GPUs replaces per-page API costs at scale. For most OCR workloads, an L40S or A100 on Spheron keeps cost-per-page well below managed API pricing with no minimum commitments.
Quick Setup Guide
Select PaddleOCR-VL-1.6 for multilingual or mixed-layout documents with its 100+ language coverage. Use DeepSeek-OCR for high-volume batch pipelines where its MoE architecture reduces cost per page. Use dots.ocr for general multilingual document parsing at compact ~1.7B scale with MIT licensing, particularly for structured documents and forms. Pick Granite-Docling when you need tight integration with the IBM Docling pipeline and structured JSON output. Use GOT-OCR 2.0 for lightweight deployments where VRAM is constrained.
For PaddleOCR-VL-1.6 (~0.9B), an L40S (48 GB) handles large batches at FP16. For DeepSeek-OCR, L40S supports batch=16. If you need peak decode throughput above ~45 pages/min on PaddleOCR-VL, move to an A100 80G (which reaches ~60 pages/min peak) or an H100 SXM5. Note that sustained end-to-end throughput including rasterization and I/O is roughly half these peak figures. GOT-OCR 2.0 runs on cards with as little as 8 GB VRAM.
Launch the model with vLLM's OpenAI-compatible vision endpoint. For PaddleOCR-VL-1.6, set --max-model-len 8192 and --limit-mm-per-prompt image=1. For repeated-template batch workloads like invoice processing, switch to SGLang for RadixAttention prefix caching.
Convert each PDF page to an image at 150-200 DPI using pdf2image or PyMuPDF. Submit pages asynchronously using asyncio.gather for parallelism. Instruct the model via system prompt to emit JSON instead of Markdown to avoid post-processing.
Pipe OCR JSON output to a text embedding model (BGE-M3, Qwen3-Embedding) and index into Qdrant or Milvus. For document corpora with mixed native and scanned pages, route by page type before OCR to maximize throughput on native PDFs.
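Before piping OCR JSON into the embedding step, a small validation pass guards against responses that wrap the JSON in a Markdown code fence or omit a field. The sketch below assumes the three-key schema from the system prompt shown earlier. `parse_ocr_json` and `REQUIRED_KEYS` are hypothetical names, and the fence handling covers only the simple case of a single wrapping fence.

```python
import json

# Keys requested by the system prompt shown earlier in this post.
REQUIRED_KEYS = {"text_blocks", "tables", "page_language"}
FENCE = "`" * 3  # Markdown code-fence marker


def parse_ocr_json(raw: str) -> dict:
    """Parse a model response into a dict, tolerating one wrapping code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with optional language tag)...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ...and everything from the closing fence onward.
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


sample = FENCE + 'json\n{"text_blocks": ["Invoice 42"], "tables": [], "page_language": "en"}\n' + FENCE
print(parse_ocr_json(sample)["page_language"])  # en
```

Pages that fail validation can be retried or routed to a review queue rather than indexed with partial content.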
Frequently Asked Questions
Which open-source OCR model is best for complex tables and financial documents?
For complex tables and structured financial documents, PaddleOCR-VL-1.6 and Granite-Docling (with TableFormer heritage) are the strongest options. PaddleOCR-VL-1.6 covers 100+ languages and handles mixed layouts well. Granite-Docling integrates tightly with the IBM Docling pipeline for structured JSON output. Both run on an L40S with 48 GB VRAM.
How much VRAM does PaddleOCR-VL-1.6 need?
PaddleOCR-VL-1.6 is a 0.9B VLM (native-resolution visual encoder paired with ERNIE-4.5-0.3B LLM) that uses approximately 2 GB at FP16 for model weights. On an L40S (48 GB), you can run large batch sizes comfortably. INT8 quantization brings this down to around 1 GB, which fits on almost any modern GPU.
Can an L40S run these OCR models?
Yes. An L40S (48 GB VRAM) handles PaddleOCR-VL-1.6 and DeepSeek-OCR at large batch sizes. GOT-OCR 2.0 at under 3 GB fits easily on much smaller cards. L40S is the most cost-effective GPU for OCR workloads on Spheron, making it a practical choice before stepping up to an A100 or H100 for higher throughput.
How does self-hosted OCR cost compare with AWS Textract?
AWS Textract charges roughly $1.50 per 1,000 pages for general document text, putting 10,000 pages at $15. Running PaddleOCR-VL-1.6 on an L40S on Spheron at $0.96/hr costs around $7.27 for 10,000 pages. At scale, that difference compounds fast, and self-hosted OCR also returns structured table data that Textract charges extra for.
Should I use vLLM or SGLang for OCR serving?
Use vLLM for general document OCR workloads, especially mixed batches of different document types. SGLang is worth the switch when your pipeline processes large batches of similar-template documents, where RadixAttention prefix caching reuses shared prompt tokens across requests and improves throughput by 20-40% on repeated-template workloads.
