OCR gives you text, but it destroys structure. Flat OCR output fails on tables, formulas, and multi-column layouts: the columns merge, table cells lose their row/column relationships, and equations turn into gibberish strings. RAG pipelines built on that output return wrong answers on structured documents, regardless of how good your retrieval model is. Document intelligence frameworks fix this by running layout detection and structure recognition before extraction. This guide covers three production-grade open-source options: IBM Docling, Marker, and MinerU. For the broader RAG stack these feed into, see the agentic RAG infrastructure guide. For vector database setup, the self-hosted vector database guide covers Qdrant, Milvus, and Weaviate deployment on Spheron.
Why Document Intelligence Is the Missing Layer Between OCR and RAG
Classical OCR pipelines read pixels left-to-right and top-to-bottom. A multi-column paper becomes a single garbled column. A financial table turns into rows of numbers with no column headers linking them. Equations become character sequences that embed as noise. The text is technically correct at the character level, but the structural information is gone.
Document intelligence adds a layer before extraction: layout detection that identifies text blocks, tables, figures, equations, and headers; reading-order recovery that determines the correct sequence across columns and pages; table structure recognition that maps cells to their row/column positions; and formula detection that routes math to a specialized model instead of character-level OCR.
The output is not a string but a structured document object: sections with hierarchy, tables with full cell coordinates, figures with captions, and formulas in a parseable format. That structure is what makes downstream chunking meaningful. A table chunk contains a complete table. A section chunk starts and ends at semantic boundaries. You embed structure, not fragments.
For documents that are already machine-readable PDFs (text layer present), these frameworks extract structure directly from the PDF without running any OCR. OCR only fires on image-only pages or scanned documents. This distinction matters for speed: native PDF parsing runs at thousands of pages per minute, while OCR-dependent paths run at tens to hundreds of pages per minute depending on GPU throughput.
Docling vs. Marker vs. MinerU: Architecture and Model Backbones
All three frameworks run GPU-accelerated models, produce structured output, and integrate with RAG pipelines. Their differences show up at the document type level.
IBM Docling (DS4SD/docling on GitHub) uses a layout detection model (RT-DETR or the newer Heron model, depending on the release) to identify text blocks, tables, figures, code, and headers. Table structure recognition runs through TableFormer, a transformer model trained on table datasets including PubTabNet, FinTabNet, and TableBank. Scanned pages fall back to EasyOCR or Tesseract. Output is a DoclingDocument JSON object with full provenance: every chunk knows its page, bounding box, and element type. Native input support extends to PDF, DOCX, PPTX, XLSX, and HTML.
Marker (VikParuchuri/marker) builds on the Surya model family: a layout detection model for block identification and a line-level OCR model for text. Its design prioritizes clean Markdown output with LaTeX equation blocks. Input is PDF or images. Marker is faster per page than Docling in pure throughput terms because its models are lighter, and the Markdown output is easier to chunk without post-processing. The trade-off is table fidelity on complex nested tables, where Docling's TableFormer has a structural advantage.
MinerU (opendatalab/MinerU) comes from OpenDataLab (Shanghai AI Laboratory) and uses the PDF-Extract-Kit pipeline internally. Its distinguishing capability is formula recognition via UniMERNet, which handles complex LaTeX equations in scientific papers better than either Docling or Marker. Multi-column scientific paper layout is also a strength. MinerU requires a one-time download agreement for the PDF-Extract-Kit model weights (free code license), which you click through on HuggingFace before pulling the weights.
| Criterion | Docling | Marker | MinerU |
|---|---|---|---|
| Layout model | RT-DETR / Heron | Surya | PDF-Extract-Kit |
| Table extraction | TableFormer (ACCURATE/FAST) | Basic | Good |
| Formula recognition | Limited | LaTeX via Surya | UniMERNet (strong) |
| Reading order | Yes | Yes | Yes |
| Output format | DoclingDocument JSON | Markdown + LaTeX | Markdown + JSON |
| Supported input | PDF, DOCX, PPTX, XLSX, HTML | PDF, images | PDF, images |
| Multi-language | Yes (EasyOCR) | Yes (Surya) | Yes |
| License | MIT | GPL-3.0 | Apache-2.0 with conditions |
Choose Docling for financial and legal documents with complex tables, DOCX/PPTX inputs, and pipelines that need the structured DoclingDocument object for downstream processing. Choose Marker for bulk PDF ingestion where speed and Markdown output are the priority. Note that Marker's model weights ship under a modified AI Pubs Open Rail-M license that requires a paid commercial license for organizations above $2M in funding or revenue. Factor this in if you are deploying at commercial scale. Choose MinerU for scientific literature with equations and multi-column layouts.
GPU Requirements: VRAM, Throughput, and Concurrency Targets
All three frameworks load multiple models simultaneously. Layout detection, OCR fallback, and (for Docling) the TableFormer model all reside in VRAM during processing. Plan accordingly.
Docling: The layout detection model occupies roughly 1 GB VRAM. TableFormer adds approximately 600 MB. An EasyOCR instance adds another 500 MB to 1.5 GB depending on the language models loaded. Model weights therefore account for about 2-3 GB; the rest of the budget goes to runtime overhead such as the CUDA context and activation memory, so plan for a total of 4-6 GB VRAM per Docling worker. That leaves room on an L40S for 6-8 parallel workers before memory pressure starts. On an H100 SXM5 you can run 10-14 workers comfortably alongside a co-located embedding model.
Marker: Surya layout plus Surya OCR runs 3-5 GB VRAM per worker. Throughput per worker is higher than Docling's because the models are faster, so 5-7 Marker workers on an L40S deliver aggregate throughput comparable to Docling's 6-8 workers.
MinerU: The formula recognition model (UniMERNet) adds roughly 2 GB on top of the base pipeline. Plan for 6-8 GB per worker. On an L40S you can run 4-5 MinerU workers concurrently without swapping.
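The worker counts above come from dividing usable VRAM by the per-worker footprint. A minimal sketch of that arithmetic follows; the helper name `max_workers` and the 6 GB reserve are illustrative assumptions, not values from any framework's documentation.

```python
def max_workers(total_vram_gb: float, per_worker_gb: float, reserve_gb: float = 6.0) -> int:
    """Estimate how many parser workers fit on one GPU.

    reserve_gb is an assumed safety margin for fragmentation and
    co-located processes; tune it from your own measurements.
    """
    usable = total_vram_gb - reserve_gb
    if usable <= 0 or per_worker_gb <= 0:
        return 0
    return int(usable // per_worker_gb)


# Worst-case per-worker footprints quoted above, on a 48 GB L40S.
for name, per_worker in [("Docling", 6.0), ("Marker", 5.0), ("MinerU", 8.0)]:
    print(name, max_workers(48, per_worker))
```

Treat the result as an upper bound on memory grounds only. The practical worker count may be lower if GPU compute, rather than VRAM, becomes the bottleneck first.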
| GPU | VRAM | Docling pages/hr | Marker pages/hr | MinerU pages/hr | On-Demand $/hr | Cost/10K pages (Docling) |
|---|---|---|---|---|---|---|
| L40S | 48 GB | ~2,400 | ~3,000 | ~1,800 | $1.91 | ~$7.96 |
| H100 SXM5 | 80 GB | ~4,800 | ~5,500 | ~3,600 | $3.12 | ~$6.50 |
Throughput estimates assume native PDF inputs with 20% scanned pages routed to OCR fallback, at a mix of 30% simple text, 40% table-heavy, and 30% multi-column. Pure native PDF parsing runs significantly faster; heavily scanned document sets run slower.
For the cost per 10K pages: at $1.91/hr and 2,400 pages/hr with Docling on the L40S, that is ($1.91 / 2,400) * 10,000 = $7.96 per 10K pages. Azure Document Intelligence prebuilt layout charges $10 per 1,000 pages, so $100 per 10K pages. Self-hosting on Spheron is roughly 12x cheaper before volume discounts.
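The same arithmetic can be scripted so that it is easy to rerun when prices or throughput change. This is a small sketch using the figures quoted above; the function name `cost_per_10k_pages` is just an illustrative choice.

```python
def cost_per_10k_pages(price_per_hour: float, pages_per_hour: float) -> float:
    """Dollars to parse 10,000 pages at a given hourly GPU price and throughput."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return price_per_hour / pages_per_hour * 10_000


docling_l40s = cost_per_10k_pages(1.91, 2_400)  # L40S on-demand price, Docling throughput
azure = 10.0 / 1_000 * 10_000                   # $10 per 1,000 pages

print(f"Docling on L40S: ${docling_l40s:.2f} per 10K pages")
print(f"Azure prebuilt layout: ${azure:.2f} per 10K pages")
print(f"Ratio: {azure / docling_l40s:.1f}x")
```

Swapping in your own measured pages/hr is the important part: the throughput figures in the table are estimates for one document mix, and the ratio moves with them.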
Pricing fluctuates based on GPU availability. The prices above are as of 28 May 2026 and may have changed. Check current GPU pricing for live rates.
Layout Detection, Table Extraction, Formula Recognition, and Reading Order Benchmarks
On the DocLayNet benchmark, Docling's RT-DETR layout detection achieves above 85% mAP on the standard element classes (text, table, figure, section header, list, code, formula). Marker's Surya layout model is competitive on text and headers but scores lower on tables. MinerU's PDF-Extract-Kit pipeline matches Docling on layout and surpasses both on formula detection.
For table structure recognition, the FinTabNet and PubTabNet benchmarks put Docling's TableFormer at Tree Edit Distance Similarity (TEDS) above 91% on FinTabNet in ACCURATE mode. This is the number that matters for financial document RAG: cell-level accuracy determines whether you can query "what was Q3 2024 revenue" from a 10-K table and get the right number. Marker's basic table extraction TEDS is roughly 75-80% on the same benchmark, adequate for simple grid tables but not financial report complexity.
On IM2LaTeX formula recognition, MinerU's UniMERNet achieves over 90% BLEU for displayed equations in scientific papers. Docling and Marker both fall below 70% on complex multi-line integrals and matrix expressions, making MinerU the clear choice for scientific literature ingestion.
| Document type | Best framework | Reason |
|---|---|---|
| Financial reports (10-K, earnings) | Docling (ACCURATE mode) | TableFormer TEDS 91%+ on FinTabNet |
| Scientific papers | MinerU | UniMERNet formula recognition |
| Legal contracts | Docling | Reading order + section hierarchy |
| Slide decks (PPTX) | Docling | Native PPTX input support |
| Mixed scanned/native PDFs | Marker or Docling + DeepSeek-OCR routing | Speed vs. accuracy tradeoff |
Reading order accuracy on multi-column scientific PDFs is high for all three frameworks (above 90% on the ReadingBank benchmark), because all three use model-based layout detection rather than heuristic column splitting.
Multi-Language and Scanned Document Handling
Docling falls back to EasyOCR for scanned pages, which covers 80+ languages with reasonable accuracy on printed text. Marker's Surya OCR covers 90+ languages and benchmarks slightly higher than EasyOCR on non-Latin scripts in the Surya benchmark suite. MinerU uses a custom OCR model within PDF-Extract-Kit that covers CJK scripts (Chinese, Japanese, Korean) particularly well, which reflects OpenDataLab's focus on scientific literature in Chinese.
For documents that are primarily scanned, the better path is to route to DeepSeek-OCR on GPU cloud first, get the structured text output, and then pass it to Docling for layout structure assignment. This hybrid approach outperforms any built-in OCR fallback on low-quality scans, handwritten annotations, and mixed-script documents. The next section on hybrid pipelines covers the routing logic.
For multilingual corpora of native PDFs (text layer present), all three frameworks handle non-Latin PDFs without OCR. The language fallback only triggers for image-only pages.
Deploying Docling on Spheron H100 with Docker
The simplest production deployment runs Docling behind a FastAPI endpoint in Docker. The IBM research team publishes official Docker images at quay.io/ds4sd/docling.
Docker Compose setup:
version: "3.9"
services:
  docling:
    image: quay.io/ds4sd/docling:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    volumes:
      - ./input:/app/input
      - ./output:/app/output
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  api:
    # Note: api/main.py runs DocumentConverter in-process, so this service
    # needs the same GPU settings as above for conversion to run on the GPU.
    build: ./api
    ports:
      - "8080:8080"
    volumes:
      - ./input:/app/input
      - ./output:/app/output
    depends_on:
      - docling

FastAPI wrapper (api/main.py):
import asyncio
import json
import threading
import uuid
from pathlib import Path

from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

app = FastAPI()

# Enable OCR fallback and high-accuracy table structure recognition.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
# Serialize access to the shared converter across executor threads.
_converter_lock = threading.Lock()


def _run_convert(path: str):
    with _converter_lock:
        return converter.convert(path)


@app.post("/parse")
async def parse_document(file: UploadFile = File(...)):
    doc_id = str(uuid.uuid4())
    # Keep only the base name so an uploaded filename cannot escape /app/input.
    safe_name = Path(file.filename).name if file.filename else "upload.bin"
    input_path = Path(f"/app/input/{doc_id}_{safe_name}")
    content = await file.read()
    try:
        input_path.write_bytes(content)
        loop = asyncio.get_running_loop()
        # Run the blocking conversion off the event loop.
        result = await loop.run_in_executor(None, _run_convert, str(input_path))
    finally:
        input_path.unlink(missing_ok=True)
    doc = result.document
    return JSONResponse({
        "doc_id": doc_id,
        "page_count": len(doc.pages),
        "table_count": len(doc.tables),
        "document": json.loads(doc.model_dump_json()),
    })

Kubernetes deployment with GPU resource requests:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docling-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: docling-api
  template:
    metadata:
      labels:
        app: docling-api
    spec:
      containers:
        - name: docling-api
          image: your-registry/docling-api:latest
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
              cpu: "4"
          volumeMounts:
            - name: document-storage
              mountPath: /app/input
              subPath: input
            - name: document-storage
              mountPath: /app/output
              subPath: output
      volumes:
        - name: document-storage
          persistentVolumeClaim:
            claimName: docling-pvc
      runtimeClassName: nvidia

Deploy this to Spheron H100 instances for maximum table extraction throughput on financial and legal documents. The H100 SXM5's memory bandwidth (3.35 TB/s) cuts TableFormer inference time on large tables roughly in half compared to A100 PCIe, which shows up at scale when processing hundreds of 10-K filings. For Spheron instance provisioning and SSH access, see the Spheron instance types docs.
Deploying Marker and MinerU on Spheron L40S
Marker deployment:
Install from PyPI and wrap in a FastAPI endpoint:
# Install the CUDA build of PyTorch first so pip does not keep a default build.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# convert_single_pdf and load_all_models belong to the pre-1.0 Marker API.
pip install "marker-pdf<1.0"

import asyncio
import os
import tempfile
import threading

from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from marker.convert import convert_single_pdf
from marker.models import load_all_models

app = FastAPI()

# Load the Surya models once at startup and share them across requests.
models = load_all_models()
_models_lock = threading.Lock()


def _run_marker(path: str):
    with _models_lock:
        return convert_single_pdf(path, models, langs=["English"], batch_multiplier=2)


@app.post("/parse")
async def parse(file: UploadFile = File(...)):
    tmp_fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(tmp_fd, "wb") as f:
            f.write(await file.read())
        loop = asyncio.get_running_loop()
        # Run the blocking conversion off the event loop.
        full_text, images, metadata = await loop.run_in_executor(
            None, lambda: _run_marker(tmp_path)
        )
    finally:
        os.unlink(tmp_path)
    return JSONResponse({
        "text": full_text,
        "page_count": metadata.get("page_count", 0),
        "languages": metadata.get("languages", []),
    })

Marker's batch_multiplier=2 doubles the default model batch sizes, which raises page throughput at the cost of roughly 30% more VRAM. On a rented L40S with 48 GB, you can push batch_multiplier=4 without hitting memory limits.
MinerU deployment:
pip install "magic-pdf[full]"
# Verify the CLI is installed. Model weights are downloaded separately
# (requires a HuggingFace account and license click-through).
magic-pdf --help

MinerU's CLI processes single files or directories:

magic-pdf -p /input/document.pdf -o /output/document/ --method auto

--method auto lets MinerU detect whether a PDF is native text or scanned, routing accordingly. For batch processing, wrap this in a FastAPI endpoint with asyncio.create_subprocess_exec to avoid blocking the event loop:
import asyncio
import glob
import os
import shutil
import tempfile

from fastapi import FastAPI, UploadFile, File

app = FastAPI()


@app.post("/parse")
async def parse(file: UploadFile = File(...)):
    tmp_fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    out_dir = None
    try:
        # os.fdopen takes ownership of the descriptor and closes it on exit.
        with os.fdopen(tmp_fd, "wb") as f:
            f.write(await file.read())
        out_dir = tempfile.mkdtemp(dir="/output/")
        # Run the CLI as a subprocess so parsing does not block the event loop.
        proc = await asyncio.create_subprocess_exec(
            "magic-pdf",
            "-p", tmp_path,
            "-o", out_dir,
            "--method", "auto",
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await proc.communicate()
        markdown = ""
        if proc.returncode == 0:
            # Look for the generated Markdown anywhere under the output directory.
            md_files = glob.glob(os.path.join(out_dir, "**", "*.md"), recursive=True)
            if md_files:
                with open(md_files[0], encoding="utf-8") as fh:
                    markdown = fh.read()
        return {
            "status": "ok" if proc.returncode == 0 else "error",
            "markdown": markdown,
            "log": stderr.decode("utf-8", errors="replace"),
        }
    finally:
        os.unlink(tmp_path)
        if out_dir is not None:
            shutil.rmtree(out_dir, ignore_errors=True)

On an on-demand L40S running 4-5 MinerU workers, expect 1,600-2,000 pages/hr on a mix of native and scanned documents with equations. On pure scientific literature with dense formulas, throughput drops to 1,200-1,500 pages/hr as UniMERNet runs on each formula region.
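The endpoint above starts one subprocess per request, so nothing stops a burst of uploads from exceeding the 4-5 workers an L40S can hold. One way to cap in-flight jobs is an asyncio.Semaphore. The sketch below uses a placeholder job instead of the real magic-pdf call, and the cap of 4 is an assumption taken from the low end of that range.

```python
import asyncio

MAX_CONCURRENT_JOBS = 4  # assumed cap, matching the low end of the 4-5 worker range


async def run_bounded(semaphore: asyncio.Semaphore, job):
    """Run one parse job, waiting if the cap on in-flight jobs is reached."""
    async with semaphore:
        return await job()


async def demo(num_jobs: int = 10) -> int:
    """Run placeholder jobs and return the peak number running at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_JOBS)
    running = 0
    peak = 0

    async def fake_parse():
        # Stand-in for the magic-pdf subprocess call; sleeps instead of parsing.
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1

    await asyncio.gather(*(run_bounded(semaphore, fake_parse) for _ in range(num_jobs)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real service the semaphore would be created once at startup and the subprocess call wrapped in `async with semaphore:`, so excess requests queue instead of competing for VRAM.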
End-to-End RAG Ingestion Pipeline: Document Intelligence into ColPali into Vector DB
The most common production architecture combines document intelligence (for text-layer PDFs) with visual retrieval (for chart-heavy and diagram-heavy pages) and indexes both into a multi-vector Qdrant collection.
Input PDF
    │
    ├── Native PDF pages (has text layer)
    │       │
    │       ▼
    │   Docling / Marker / MinerU
    │   (layout detection + structure)
    │       │
    │       ▼
    │   Text chunks with metadata
    │   (table_id, section, page_num)
    │       │
    │       ▼
    │   TEI embedding model
    │   (dense vector per chunk)
    │
    └── Visual pages (charts, diagrams, scanned)
            │
            ▼
        ColPali / ColQwen2.5
        (visual patch embeddings)
            │
            ▼
        Multi-vector index
            │
            ▼
        Qdrant multi-vector collection
        (sparse + dense + colpali vectors)
            │
            ▼
        Query router
        (text query → dense retrieval)
        (image query → colpali retrieval)

The ColPali multimodal document RAG guide covers the visual retrieval side in detail, including ColQwen2.5 deployment and the Qdrant multi-vector collection schema. For the text side, Docling's DoclingDocument export maps cleanly onto a chunking strategy:
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")
doc = result.document

chunker = HierarchicalChunker()
chunks = list(chunker.chunk(doc))

for chunk in chunks:
    # chunk.text: the text content
    # chunk.meta.headings: parent section headers
    # chunk.meta.doc_items: source document elements with bbox and page
    first_item = chunk.meta.doc_items[0] if chunk.meta.doc_items else None
    # Page numbers live on each item's provenance entries, not on chunk.meta.origin.
    first_prov = first_item.prov[0] if first_item and first_item.prov else None
    payload = {
        "text": chunk.text,
        "page": first_prov.page_no if first_prov else None,
        "section": " > ".join(chunk.meta.headings) if chunk.meta.headings else None,
        "element_type": str(first_item.label) if first_item else None,
    }
    # embed and upsert to Qdrant

The HierarchicalChunker respects section boundaries and keeps table rows together, so you never split a table across two chunks. Each chunk includes provenance metadata: page number, bounding box, section hierarchy, and element type. At query time you can filter to element_type=table to retrieve only tabular chunks, which cuts hallucination rates on quantitative questions.
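The element-type filter mentioned above could be expressed in Qdrant's REST filter syntax roughly as follows. This is a sketch under assumptions: the payload key `element_type` comes from the chunking example, the stored value is assumed to be the string `table`, the collection is assumed to expose a single unnamed dense vector, and the helper name `table_only_search_body` is hypothetical.

```python
import json


def table_only_search_body(query_vector: list[float], limit: int = 5) -> dict:
    """Build a Qdrant search request body restricted to chunks tagged as tables."""
    return {
        "vector": query_vector,
        "limit": limit,
        "with_payload": True,
        "filter": {
            "must": [
                # Only return points whose payload marks them as table chunks.
                {"key": "element_type", "match": {"value": "table"}}
            ]
        },
    }


body = table_only_search_body([0.12, -0.03, 0.57])
print(json.dumps(body, indent=2))
```

The body would be posted to the collection's search endpoint. With named vectors, as in a multi-vector collection, the `vector` field needs the vector name as well, so adapt it to your schema.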
For the vector database layer, the self-host vector database guide covers Qdrant multi-vector setup on Spheron, including HNSW index tuning and co-location with vLLM.
The query router is simple: classify the query as text or visual, run the corresponding retrieval path, merge results, and pass to the LLM. A basic implementation uses a small classifier trained on query type labels, or a heuristic that routes queries containing "chart", "figure", "diagram", or "graph" to the ColPali path and everything else to the dense text path.
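The keyword heuristic described above fits in a few lines. This is a minimal sketch: the route labels `colpali` and `dense` and the function name `route_query` are illustrative, and a trained classifier would replace the regular expression in a more careful setup.

```python
import re

VISUAL_KEYWORDS = ("chart", "figure", "diagram", "graph")
# Word boundaries keep "paragraph" from matching "graph"; an optional "s" covers plurals.
_VISUAL_PATTERN = re.compile(r"\b(" + "|".join(VISUAL_KEYWORDS) + r")s?\b", re.IGNORECASE)


def route_query(query: str) -> str:
    """Return 'colpali' for visually oriented queries, 'dense' otherwise."""
    return "colpali" if _VISUAL_PATTERN.search(query) else "dense"


for q in ["Show the revenue chart for Q3", "What was Q3 2024 revenue?"]:
    print(q, "->", route_query(q))
```

Matching whole words matters here: a plain substring check would send every query containing "paragraph" to the visual path.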
Throughput vs. Cost: Pages per Second per Dollar on Spheron
| Solution | GPU | Pages/hr | $/hr | Cost/10K pages | Data residency |
|---|---|---|---|---|---|
| Docling on Spheron | L40S | 2,400 | $1.91 | $7.96 | Your GPU boundary |
| Docling on Spheron | H100 SXM5 | 4,800 | $3.12 | $6.50 | Your GPU boundary |
| Marker on Spheron | L40S | 3,000 | $1.91 | $6.37 | Your GPU boundary |
| MinerU on Spheron | L40S | 1,800 | $1.91 | $10.61 | Your GPU boundary |
| Azure Document Intelligence | Prebuilt layout | N/A | $10/1K pages | $100.00 | Azure region |
| AWS Textract | Forms + tables | N/A | $0.065/page | $650.00 | AWS region |
Against Azure Document Intelligence, self-hosting Docling on an L40S is roughly 12x cheaper at on-demand rates. At spot pricing on Spheron, the gap widens further. Against AWS Textract, the savings are even larger.
Data residency is the other factor that managed services cannot address. When you upload documents to Azure Document Intelligence or AWS Textract, the document leaves your infrastructure. For healthcare organizations processing clinical notes under HIPAA, for legal teams processing privileged documents, and for financial institutions under GDPR or PCI-DSS, that boundary matters. Running Docling or Marker on a Spheron GPU node keeps the document inside your compute boundary for the full parsing pipeline.
Hybrid Pipelines: Combining Document Intelligence with DeepSeek-OCR
Built-in OCR fallbacks in Docling (EasyOCR), Marker (Surya), and MinerU (PDF-Extract-Kit) work well for clean scanned documents. They struggle with low-quality scans, handwritten annotations, stamps, and mixed-script content. For those cases, routing to a vision-language OCR model before feeding Docling gives substantially better text quality.
The routing decision is fast: PyMuPDF's get_text() returns an empty string for image-only pages.
import fitz  # PyMuPDF


def classify_pages(pdf_path: str) -> dict[int, str]:
    """Return {page_num: 'native' | 'scanned'} for each page."""
    page_types = {}
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc):
            text = page.get_text().strip()
            # Pages with almost no extractable text are treated as scanned.
            page_types[page_num] = "native" if len(text) > 50 else "scanned"
    return page_types


# docling_client, deepseek_ocr_client, and merge_results are application-defined
# placeholders, not part of Docling or PyMuPDF.
async def hybrid_parse(pdf_path: str, docling_client, deepseek_ocr_client):
    page_types = classify_pages(pdf_path)
    native_pages = [p for p, t in page_types.items() if t == "native"]
    scanned_pages = [p for p, t in page_types.items() if t == "scanned"]
    results = {}
    if native_pages:
        # Docling handles native pages with full structure detection
        docling_result = await docling_client.parse(pdf_path, pages=native_pages)
        results["native"] = docling_result
    if scanned_pages:
        # DeepSeek-OCR handles scanned pages via vLLM vision endpoint
        ocr_result = await deepseek_ocr_client.ocr_pages(pdf_path, pages=scanned_pages)
        results["scanned"] = ocr_result
    return merge_results(results, page_types)

The threshold of 50 characters filters out pages with only a header or a page number, treating them as effectively image-only. Adjust based on your document set. A legal contract with a 40-character footer on an otherwise scanned page should go to the OCR path.
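The routing example calls merge_results without defining it. One possible shape is sketched here, assuming each client returns a plain dict of {page_num: text}. That return format is an assumption for illustration; real client responses will likely need adapting.

```python
def merge_results(results: dict, page_types: dict[int, str]) -> list[dict]:
    """Interleave native and scanned page outputs back into page order.

    Assumes results["native"] and results["scanned"] are dicts of
    {page_num: text}; this format is hypothetical.
    """
    merged = []
    for page_num in sorted(page_types):
        source = page_types[page_num]  # "native" or "scanned"
        text = results.get(source, {}).get(page_num)  # None if a page is missing
        merged.append({"page": page_num, "source": source, "text": text})
    return merged


page_types = {0: "native", 1: "scanned", 2: "native"}
results = {"native": {0: "Intro", 2: "Appendix"}, "scanned": {1: "Signed exhibit"}}
for entry in merge_results(results, page_types):
    print(entry)
```

Keeping the `source` field on each merged page makes it possible to audit later which pages went through the OCR path.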
See the DeepSeek-OCR on GPU cloud guide for the vLLM vision endpoint configuration and batch size tuning. For the orchestration layer that manages routing across many concurrent document jobs, the agentic RAG infrastructure guide covers queue-based job distribution with worker autoscaling.
For post-retrieval graph-based reasoning over the indexed documents, the GraphRAG deployment guide shows how to add entity extraction and community detection on top of a Qdrant collection.
Choosing Between Docling, Marker, and MinerU
The framework choice comes down to document type, output format requirements, and throughput targets.
Use Docling when the documents are financial reports, legal contracts, or enterprise Office files (DOCX, PPTX, XLSX). TableFormer's structural accuracy on complex tables is meaningfully better than the alternatives, and the DoclingDocument JSON object carries enough provenance to support filtered retrieval by element type.
Use Marker when you need maximum throughput on large PDF corpora where the documents are reasonably well-structured and the output goes directly to a text embedding model. The Markdown output requires less post-processing than raw DoclingDocument JSON for simple ingestion cases.
Use MinerU when the document corpus is scientific literature with equations, or when the primary language is Chinese, Japanese, or Korean. UniMERNet's formula extraction accuracy makes a real difference in STEM research RAG applications where query results depend on reading equations correctly.
For mixed enterprise document sets that include all three types, run Docling as the default and route identified scientific papers (by metadata, filename pattern, or a fast document classifier) to MinerU. Use the hybrid OCR routing for scanned documents in any category.
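The default-plus-override routing described above might look like the sketch below. The filename hints and the `doc_type` label are hypothetical stand-ins for whatever metadata or classifier output your pipeline has; only the Office-format rule comes directly from the comparison table.

```python
from pathlib import Path
from typing import Optional

# Formats that only Docling accepts natively, per the comparison table.
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx", ".html"}
# Hypothetical filename hints for scientific papers; tune these for your corpus.
SCIENTIFIC_HINTS = ("arxiv", "preprint")


def choose_framework(filename: str, doc_type: Optional[str] = None) -> str:
    """Pick a parser: Docling by default, MinerU for scientific PDFs."""
    path = Path(filename)
    if path.suffix.lower() in OFFICE_SUFFIXES:
        return "docling"  # the other frameworks do not take these inputs
    if doc_type == "scientific":
        return "mineru"   # label supplied by metadata or a classifier
    if any(hint in path.stem.lower() for hint in SCIENTIFIC_HINTS):
        return "mineru"
    return "docling"


for name in ["10-K_2024.pdf", "arxiv_2401.01234.pdf", "board_deck.pptx"]:
    print(name, "->", choose_framework(name))
```

Checking the file suffix first means an Office file is never sent to MinerU, even if it has been labeled as scientific.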
Running Docling, Marker, or MinerU on Spheron keeps your document ingestion pipeline inside your own GPU cloud boundary, meeting data residency requirements while cutting per-page costs by roughly 12x compared to Azure Document Intelligence at on-demand rates. Check H100 SXM5 availability on Spheron or view current GPU pricing for full on-demand and spot rates.
Quick Setup Guide
Deploy an H100 SXM5 or L40S instance on Spheron. For single-framework batch processing, an L40S at 48 GB handles concurrent Docling layout detection and table extraction without paging. For mixed workloads that also run embedding models, use H100 SXM5 for its memory bandwidth headroom.
Pull the IBM Docling Docker image, mount a shared volume for input PDFs and output JSON, and configure the DocumentConverter with TableFormerMode.ACCURATE for financial tables or TableFormerMode.FAST for speed-constrained pipelines.
For Marker, install marker-pdf and run convert_single_pdf with the Surya models loaded on the GPU. For MinerU, use the magic-pdf CLI with --method auto to auto-detect whether a PDF is native or scanned. Both frameworks export structured Markdown that preserves tables.
Wrap your chosen framework in a FastAPI service that accepts PDF file uploads, routes scanned pages to DeepSeek-OCR via vLLM, and merges output chunks before returning structured JSON ready for embedding.
Pipe DoclingDocument or Marker Markdown output through a chunking strategy that respects section and table boundaries. Embed with a text embedding model via TEI, then upsert to Qdrant with metadata fields for document_type, page_number, and table_id for filtered retrieval.
Run the benchmark suite on a 100-page sample from each document type (financial report, scientific paper, legal contract). Record pages/second and table extraction F1. Compare cost per 10K pages against the Azure Document Intelligence baseline.
Frequently Asked Questions
IBM Docling is an open-source document conversion library that uses a transformer-based TableFormer model for table extraction and a layout detection model for reading-order recovery. Marker is a pipeline-based converter using Surya layout and OCR models. MinerU is OpenDataLab's document understanding toolkit with strong support for formula recognition and multi-column layouts. All three run on GPU and produce structured output for RAG ingestion.
All three frameworks load layout detection and OCR models simultaneously. Plan for at least 8-12 GB VRAM for single-stream processing. For throughput at 30+ pages per minute, an L40S (48 GB) or H100 (80 GB) is recommended. The H100's higher memory bandwidth reduces table extraction latency on complex financial and legal documents.
Docling exports to DoclingDocument JSON, which preserves table structures, reading order, and section boundaries. Pass these structured chunks directly to a text embedding model served via Hugging Face TEI, then index into Qdrant or Milvus. For visual retrieval over the same documents, combine with ColPali at /blog/colpali-multimodal-document-rag-gpu-cloud/ to handle chart-heavy pages that text extraction misses.
Use a hybrid pipeline when your document set contains both machine-readable PDFs (where Docling/Marker native PDF parsing is faster and more accurate) and scanned or handwritten pages where a vision-language OCR model outperforms classical detection. Route documents by type at ingestion time and merge outputs before embedding.
On Spheron with an L40S at on-demand pricing, a throughput of roughly 1,800-3,000 pages/hour gives a per-10K-page cost of approximately $6-11 depending on framework and document complexity. Azure Document Intelligence charges $10 per 1,000 pages (prebuilt layout model), putting 10K pages at $100. Self-hosting on Spheron is therefore roughly 9-16x cheaper at these rates, and documents never leave your GPU boundary.
