Self-Host Document Intelligence on GPU Cloud: Docling, Marker, and MinerU Production Setup Guide for RAG Ingestion (2026)

OCR gives you text, but it destroys structure. Flat OCR output fails on tables, formulas, and multi-column layouts: the columns merge, table cells lose their row/column relationships, and equations turn into gibberish strings. RAG pipelines built on that output return wrong answers on structured documents, regardless of how good your retrieval model is. Document intelligence frameworks fix this by running layout detection and structure recognition before extraction. This guide covers three production-grade open-source options: IBM Docling, Marker, and MinerU. For the broader RAG stack these feed into, see the agentic RAG infrastructure guide. For vector database setup, the self-hosted vector database guide covers Qdrant, Milvus, and Weaviate deployment on Spheron.

Why Document Intelligence Is the Missing Layer Between OCR and RAG

Classical OCR pipelines read pixels left-to-right and top-to-bottom. A multi-column paper becomes a single garbled column. A financial table turns into rows of numbers with no column headers linking them. Equations become character sequences that embed as noise. The text is technically correct at the character level, but the structural information is gone.

Document intelligence adds a layer before extraction: layout detection that identifies text blocks, tables, figures, equations, and headers; reading-order recovery that determines the correct sequence across columns and pages; table structure recognition that maps cells to their row/column positions; and formula detection that routes math to a specialized model instead of character-level OCR.

The output is not a string but a structured document object: sections with hierarchy, tables with full cell coordinates, figures with captions, and formulas in a parseable format. That structure is what makes downstream chunking meaningful. A table chunk contains a complete table. A section chunk starts and ends at semantic boundaries. You embed structure, not fragments.
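
To make the contrast with flat text concrete, here is a minimal sketch of structure-aware chunking over a list of typed elements. The `Element` dataclass, its `kind` labels, and `chunk_by_structure` are hypothetical illustrations, not part of Docling, Marker, or MinerU, whose real document objects are considerably richer.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # hypothetical labels: "heading", "text", or "table"
    text: str


def chunk_by_structure(elements: list[Element]) -> list[dict]:
    """Group elements into chunks that break at headings and never split a table."""
    chunks: list[dict] = []
    section = None
    buffer: list[str] = []

    def flush() -> None:
        # Emit any buffered paragraph text as one chunk under the current section.
        if buffer:
            chunks.append({"section": section, "kind": "text", "text": "\n".join(buffer)})
            buffer.clear()

    for el in elements:
        if el.kind == "heading":
            flush()
            section = el.text
        elif el.kind == "table":
            flush()
            # A table always becomes exactly one chunk.
            chunks.append({"section": section, "kind": "table", "text": el.text})
        else:
            buffer.append(el.text)
    flush()
    return chunks
```

Each chunk carries its parent section, so a retrieved table still knows which heading it came from.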

For documents that are already machine-readable PDFs (text layer present), these frameworks extract structure directly from the PDF without running any OCR. OCR only fires on image-only pages or scanned documents. This distinction matters for speed: native PDF parsing runs at thousands of pages per minute, while OCR-dependent paths run at tens to hundreds of pages per minute depending on GPU throughput.

Docling vs. Marker vs. MinerU: Architecture and Model Backbones

All three frameworks run GPU-accelerated models, produce structured output, and integrate with RAG pipelines. Their differences show up at the document type level.

IBM Docling (DS4SD/docling on GitHub) uses a layout detection model (RT-DETR or the newer Heron model, depending on the release) to identify text blocks, tables, figures, code, and headers. Table structure recognition runs through TableFormer, a transformer trained specifically on table datasets such as PubTabNet, FinTabNet, and TableBank. Scanned pages fall back to EasyOCR or Tesseract. Output is a DoclingDocument JSON object with full provenance: every chunk knows its page, bounding box, and element type. Native input support extends to PDF, DOCX, PPTX, XLSX, and HTML.

Marker (VikParuchuri/marker) builds on the Surya model family: a layout detection model for block identification and a line-level OCR model for text. Its design prioritizes clean Markdown output with LaTeX equation blocks. Input is PDF or images. Marker is faster per page than Docling in pure throughput terms because its models are lighter, and the Markdown output is easier to chunk without post-processing. The trade-off is table fidelity on complex nested tables, where Docling's TableFormer has a structural advantage.

MinerU (opendatalab/MinerU) comes from OpenDataLab (Shanghai AI Laboratory) and uses the PDF-Extract-Kit pipeline internally. Its distinguishing capability is formula recognition via UniMERNet, which handles complex LaTeX equations in scientific papers better than either Docling or Marker. Multi-column scientific paper layout is also a strength. MinerU requires a one-time agreement to the license for the PDF-Extract-Kit model weights (free of charge), which you click through on HuggingFace before pulling the weights.

Criterion	Docling	Marker	MinerU
Layout model	RT-DETR / Heron	Surya	PDF-Extract-Kit
Table extraction	TableFormer (ACCURATE/FAST)	Basic	Good
Formula recognition	Limited	LaTeX via Surya	UniMERNet (strong)
Reading order	Yes	Yes	Yes
Output format	DoclingDocument JSON	Markdown + LaTeX	Markdown + JSON
Supported input	PDF, DOCX, PPTX, XLSX, HTML	PDF, images	PDF, images
Multi-language	Yes (EasyOCR)	Yes (Surya)	Yes
License	MIT	GPL-3.0	Apache-2.0 with conditions

Choose Docling for financial and legal documents with complex tables, DOCX/PPTX inputs, and pipelines that need the structured DoclingDocument object for downstream processing. Choose Marker for bulk PDF ingestion where speed and Markdown output are the priority. Note that Marker's model weights ship under a modified AI Pubs Open Rail-M license that requires a paid commercial license for organizations above $2M in funding or revenue. Factor this in if you are deploying at commercial scale. Choose MinerU for scientific literature with equations and multi-column layouts.

GPU Requirements: VRAM, Throughput, and Concurrency Targets

All three frameworks load multiple models simultaneously. Layout detection, OCR fallback, and (for Docling) the TableFormer model all reside in VRAM during processing. Plan accordingly.

Docling: The layout detection model occupies roughly 1 GB VRAM. TableFormer adds approximately 600 MB. An EasyOCR instance adds another 500 MB to 1.5 GB depending on the language models loaded. Model weights therefore account for roughly 2-3 GB; with the CUDA context, activations, and batch buffers on top, the total footprint for a single Docling worker runs 4-6 GB VRAM. That leaves room on an L40S for 6-8 parallel workers before memory pressure starts. On an H100 SXM5 you can run 10-14 workers comfortably alongside a co-located embedding model.

Marker: Surya layout plus Surya OCR runs 3-5 GB VRAM per worker. Per-worker throughput is higher than Docling's because the models are faster, so 5-7 Marker workers on an L40S match or exceed the aggregate throughput of 6-8 Docling workers.

MinerU: The formula recognition model (UniMERNet) adds roughly 2 GB on top of the base pipeline. Plan for 6-8 GB per worker. On an L40S you can run 4-5 MinerU workers concurrently without swapping.
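
The worker counts above can be sanity-checked with a small sizing helper. This is a rough sketch: the `max_workers` name and the 10% headroom fraction are assumptions for illustration, and the result is only a memory ceiling, not a throughput prediction.

```python
import math


def max_workers(gpu_vram_gb: float, per_worker_gb: float, headroom: float = 0.10) -> int:
    """Workers that fit in VRAM after reserving a fraction of it as headroom."""
    usable = gpu_vram_gb * (1.0 - headroom)
    return math.floor(usable / per_worker_gb)


# Worst-case per-worker footprints from the estimates above.
print("Docling on L40S:", max_workers(48, 6))
print("MinerU on L40S:", max_workers(48, 8))
```

Sizing against the worst-case footprint keeps the estimate conservative. Measure actual peak memory under your own document mix before raising the worker count.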

GPU	VRAM	Docling pages/hr	Marker pages/hr	MinerU pages/hr	On-Demand $/hr	Cost/10K pages (Docling)
L40S	48 GB	~2,400	~3,000	~1,800	$1.91	~$7.96
H100 SXM5	80 GB	~4,800	~5,500	~3,600	$3.12	~$6.50

Throughput estimates assume native PDF inputs with 20% scanned pages routed to OCR fallback, at a mix of 30% simple text, 40% table-heavy, and 30% multi-column. Pure native PDF parsing runs significantly faster; heavily scanned document sets run slower.

For the cost per 10K pages: at $1.91/hr and 2,400 pages/hr with Docling on the L40S, that is ($1.91 / 2,400) * 10,000 = $7.96 per 10K pages. Azure Document Intelligence prebuilt layout charges $10 per 1,000 pages, so $100 per 10K pages. Self-hosting on Spheron is roughly 12x cheaper before volume discounts.
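
The same arithmetic as a short helper, so you can plug in current prices and your own measured throughput. The function name is illustrative; the inputs are the figures quoted above.

```python
def cost_per_10k_pages(price_per_hour: float, pages_per_hour: float) -> float:
    """Dollar cost to parse 10,000 pages at a given hourly price and throughput."""
    return price_per_hour / pages_per_hour * 10_000


docling_l40s = cost_per_10k_pages(1.91, 2_400)
azure = 10.0 * 10  # $10 per 1,000 pages, so ten units of 1,000 pages
print(f"Docling on L40S: ${docling_l40s:.2f} per 10K pages")
print(f"Azure costs {azure / docling_l40s:.1f}x as much")
```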

Pricing fluctuates based on GPU availability. The prices above are based on 28 May 2026 and may have changed. Check current GPU pricing for live rates.

Layout Detection, Table Extraction, Formula Recognition, and Reading Order Benchmarks

On the DocLayNet benchmark, Docling's RT-DETR layout detection achieves above 85% mAP on the standard element classes (text, table, figure, section header, list, code, formula). Marker's Surya layout model is competitive on text and headers but scores lower on tables. MinerU's PDF-Extract-Kit pipeline matches Docling on layout and surpasses both on formula detection.

For table structure recognition, the FinTabNet and PubTabNet benchmarks put Docling's TableFormer at Tree Edit Distance Similarity (TEDS) above 91% on FinTabNet in ACCURATE mode. This is the number that matters for financial document RAG: cell-level accuracy determines whether you can query "what was Q3 2024 revenue" from a 10-K table and get the right number. Marker's basic table extraction TEDS is roughly 75-80% on the same benchmark, adequate for simple grid tables but not financial report complexity.

On IM2LaTeX formula recognition, MinerU's UniMERNet achieves over 90% BLEU for displayed equations in scientific papers. Docling and Marker both fall below 70% on complex multi-line integrals and matrix expressions, making MinerU the clear choice for scientific literature ingestion.

Document type	Best framework	Reason
Financial reports (10-K, earnings)	Docling (ACCURATE mode)	TableFormer TEDS 91%+ on FinTabNet
Scientific papers	MinerU	UniMERNet formula recognition
Legal contracts	Docling	Reading order + section hierarchy
Slide decks (PPTX)	Docling	Native PPTX input support
Mixed scanned/native PDFs	Marker or Docling + DeepSeek-OCR routing	Speed vs. accuracy tradeoff

Reading order accuracy on multi-column scientific PDFs is high for all three frameworks (above 90% on the ReadingBank benchmark), because all three use model-based layout detection rather than heuristic column splitting.

Multi-Language and Scanned Document Handling

Docling falls back to EasyOCR for scanned pages, which covers 80+ languages with reasonable accuracy on printed text. Marker's Surya OCR covers 90+ languages and benchmarks slightly higher than EasyOCR on non-Latin scripts in the Surya benchmark suite. MinerU uses a custom OCR model within PDF-Extract-Kit that covers CJK scripts (Chinese, Japanese, Korean) particularly well, which reflects OpenDataLab's focus on scientific literature in Chinese.

For documents that are primarily scanned, the better path is to route them to DeepSeek-OCR on GPU cloud first, take the structured text output, and then pass it to Docling for layout structure assignment. This hybrid approach outperforms any built-in OCR fallback on low-quality scans, handwritten annotations, and mixed-script documents. The hybrid pipelines section later in this guide covers the routing logic.

For multilingual corpora of native PDFs (text layer present), all three frameworks handle non-Latin PDFs without OCR. The language fallback only triggers for image-only pages.

Deploying Docling on Spheron H100 with Docker

The simplest production deployment runs Docling behind a FastAPI endpoint in Docker. The IBM research team publishes official Docker images at quay.io/ds4sd/docling.

Docker Compose setup:

yaml

version: "3.9"
services:
  docling:
    image: quay.io/ds4sd/docling:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    volumes:
      - ./input:/app/input
      - ./output:/app/output
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
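  api:
    build: ./api
    ports:
      - "8080:8080"
    volumes:
      - ./input:/app/input
      - ./output:/app/output
    # The FastAPI wrapper runs DocumentConverter in-process, so this
    # service needs GPU access as well.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    depends_on:
      - docling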

FastAPI wrapper (api/main.py):

python

import asyncio
import json
import threading
import uuid
from pathlib import Path

from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

app = FastAPI()

# OCR fallback for scanned pages, plus TableFormer in ACCURATE mode for complex tables.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Build the converter once at startup so the models stay loaded between requests.
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
# Serialize conversions: the shared converter is not assumed to be thread-safe.
_converter_lock = threading.Lock()


def _run_convert(path: str):
    with _converter_lock:
        return converter.convert(path)


@app.post("/parse")
async def parse_document(file: UploadFile = File(...)):
    doc_id = str(uuid.uuid4())
    safe_name = Path(file.filename).name if file.filename else "upload.bin"
    input_path = Path(f"/app/input/{doc_id}_{safe_name}")

    content = await file.read()

    try:
        input_path.write_bytes(content)
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(None, _run_convert, str(input_path))
    finally:
        input_path.unlink(missing_ok=True)

    doc = result.document

    return JSONResponse({
        "doc_id": doc_id,
        "page_count": len(doc.pages),
        "table_count": len(doc.tables),
        "document": json.loads(doc.model_dump_json()),
    })

Kubernetes deployment with GPU resource requests:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: docling-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: docling-api
  template:
    metadata:
      labels:
        app: docling-api
    spec:
      containers:
        - name: docling-api
          image: your-registry/docling-api:latest
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
              cpu: "4"
          volumeMounts:
            - name: document-storage
              mountPath: /app/input
              subPath: input
            - name: document-storage
              mountPath: /app/output
              subPath: output
      volumes:
        - name: document-storage
          persistentVolumeClaim:
            claimName: docling-pvc
      runtimeClassName: nvidia

Deploy this to Spheron H100 instances for maximum table extraction throughput on financial and legal documents. The H100 SXM5's memory bandwidth (3.35 TB/s) cuts TableFormer inference time on large tables roughly in half compared to A100 PCIe, which shows up at scale when processing hundreds of 10-K filings. For Spheron instance provisioning and SSH access, see the Spheron instance types docs.

Deploying Marker and MinerU on Spheron L40S

Marker deployment:

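Install from PyPI and wrap in a FastAPI endpoint. The wrapper below uses Marker's pre-1.0 Python API (load_all_models, convert_single_pdf, batch_multiplier). Marker 1.x replaced that API with a converter-class interface, so pin the version as shown or port the calls.

bash

pip install "marker-pdf<1.0"
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121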

python

import asyncio
import os
import tempfile
import threading
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from marker.convert import convert_single_pdf
from marker.models import load_all_models

app = FastAPI()
# Load the Surya models once at startup; every request shares them.
models = load_all_models()
# Serialize GPU work so concurrent requests do not run the shared models at once.
_models_lock = threading.Lock()


def _run_marker(path: str):
    with _models_lock:
        return convert_single_pdf(path, models, langs=["English"], batch_multiplier=2)


@app.post("/parse")
async def parse(file: UploadFile = File(...)):
    tmp_fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(tmp_fd, "wb") as f:
            f.write(await file.read())

        loop = asyncio.get_running_loop()
        full_text, images, metadata = await loop.run_in_executor(
            None, lambda: _run_marker(tmp_path)
        )
    finally:
        os.unlink(tmp_path)

    return JSONResponse({
        "text": full_text,
        "page_count": metadata.get("page_count", 0),
        "languages": metadata.get("languages", []),
    })

Setting batch_multiplier=2 doubles Marker's default batch sizes for each model forward pass, which raises page throughput at the cost of roughly 30% more VRAM. On an L40S GPU rental with 48 GB, you can push batch_multiplier=4 without hitting memory limits.

MinerU deployment:

bash

pip install "magic-pdf[full]"
# Model weights are a separate download (requires HuggingFace account + license click-through)
# Confirm the CLI is installed and on PATH
magic-pdf --help

MinerU's CLI processes single files or directories:

bash

magic-pdf -p /input/document.pdf -o /output/document/ --method auto

--method auto lets MinerU detect whether each page is native PDF text or scanned, routing accordingly. For batch processing, wrap this in a FastAPI endpoint with asyncio.create_subprocess_exec to avoid blocking the event loop:

python

import asyncio
import glob
import os
import shutil
import tempfile
from fastapi import FastAPI, UploadFile, File

app = FastAPI()


@app.post("/parse")
async def parse(file: UploadFile = File(...)):
    tmp_fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    out_dir = None
    try:
        # os.fdopen takes ownership of the descriptor and closes it on exit.
        with os.fdopen(tmp_fd, "wb") as f:
            f.write(await file.read())
        # Create the output directory only after the upload is on disk.
        out_dir = tempfile.mkdtemp(dir="/output/")

        # Run the CLI as a subprocess so the event loop stays free.
        proc = await asyncio.create_subprocess_exec(
            "magic-pdf",
            "-p", tmp_path,
            "-o", out_dir,
            "--method", "auto",
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await proc.communicate()

        markdown = ""
        if proc.returncode == 0:
            # Output is nested under out_dir; take the first Markdown file in sorted order.
            md_files = sorted(glob.glob(os.path.join(out_dir, "**", "*.md"), recursive=True))
            if md_files:
                with open(md_files[0], encoding="utf-8") as fh:
                    markdown = fh.read()

        return {
            "status": "ok" if proc.returncode == 0 else "error",
            "markdown": markdown,
            "log": stderr.decode("utf-8", errors="replace"),
        }
    finally:
        # Do not call os.close(tmp_fd) here: the descriptor is already closed,
        # and its number may have been reused by another open file.
        os.unlink(tmp_path)
        if out_dir is not None:
            shutil.rmtree(out_dir, ignore_errors=True)

For on-demand L40S access running MinerU at 4-5 workers, expect 1,600-2,000 pages/hr on a mix of native and scanned documents with equations. On pure scientific literature with dense formulas, throughput drops to 1,200-1,500 pages/hr as UniMERNet runs on each formula region.
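
One way to hold the service at a fixed worker count is an asyncio semaphore around each job. This is a minimal sketch: `run_bounded` and `MAX_WORKERS` are hypothetical names, and `worker` stands in for any coroutine that wraps the magic-pdf subprocess call.

```python
import asyncio

MAX_WORKERS = 4  # assumption: the low end of the 4-5 MinerU workers sized for an L40S


async def run_bounded(jobs, worker, limit: int = MAX_WORKERS):
    """Run worker(job) for every job, with at most `limit` running at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        # Wait for a free slot before starting this job.
        async with sem:
            return await worker(job)

    # gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(j) for j in jobs))
```

Capping concurrency in the service keeps a burst of uploads from pushing VRAM use past what the GPU was sized for; excess jobs wait for a free slot.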

End-to-End RAG Ingestion Pipeline: Document Intelligence into ColPali into Vector DB

The most common production architecture combines document intelligence (for text-layer PDFs) with visual retrieval (for chart-heavy and diagram-heavy pages) and indexes both into a multi-vector Qdrant collection.

Input PDF
    │
    ├── Native PDF pages (has text layer)
    │       │
    │       ▼
    │   Docling / Marker / MinerU
    │   (layout detection + structure)
    │       │
    │       ▼
    │   Text chunks with metadata
    │   (table_id, section, page_num)
    │       │
    │       ▼
    │   TEI embedding model
    │   (dense vector per chunk)
    │
    └── Visual pages (charts, diagrams, scanned)
            │
            ▼
        ColPali / ColQwen2.5
        (visual patch embeddings)
            │
            ▼
        Multi-vector index
            │
            ▼
    Qdrant multi-vector collection
    (sparse + dense + colpali vectors)
            │
            ▼
    Query router
    (text query → dense retrieval)
    (image query → colpali retrieval)

The ColPali multimodal document RAG guide covers the visual retrieval side in detail, including ColQwen2.5 deployment and the Qdrant multi-vector collection schema. For the text side, Docling's DoclingDocument export maps cleanly onto a chunking strategy:

python

from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")
doc = result.document

chunker = HierarchicalChunker()
chunks = list(chunker.chunk(doc))

for chunk in chunks:
    # chunk.text: the text content
    # chunk.meta.headings: parent section headers
    # chunk.meta.doc_items: source document elements with bbox and page
    first_item = chunk.meta.doc_items[0] if chunk.meta.doc_items else None
    # Page numbers live on each item's provenance list, not on chunk.meta.origin.
    prov = first_item.prov if first_item is not None else []
    payload = {
        "text": chunk.text,
        "page": prov[0].page_no if prov else None,
        "section": " > ".join(chunk.meta.headings) if chunk.meta.headings else None,
        "element_type": first_item.label.value if first_item is not None else None,
    }
    # embed and upsert to Qdrant

The HierarchicalChunker respects section boundaries and keeps table rows together, so you never split a table across two chunks. Each chunk includes provenance metadata: page number, bounding box, section hierarchy, and element type. At query time you can filter to element_type=table to retrieve only tabular chunks, which cuts hallucination rates on quantitative questions.
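
As an illustration of that query-time filter, the sketch below builds a request body in Qdrant's REST filter format. It assumes a single unnamed dense vector and a payload field `element_type` stored as the plain string "table"; the function name is hypothetical, and a multi-vector collection would also need the vector name.

```python
def table_only_search_body(query_vector: list[float], limit: int = 5) -> dict:
    """Build a Qdrant REST search body that only matches chunks labeled as tables."""
    return {
        "vector": query_vector,
        "limit": limit,
        "with_payload": True,
        "filter": {
            # "must" conditions are ANDed; this keeps only table chunks.
            "must": [
                {"key": "element_type", "match": {"value": "table"}}
            ]
        },
    }
```

The body is intended for a points search request against the collection. Check the exact endpoint and vector naming against the Qdrant version you deploy.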

For the vector database layer, the self-host vector database guide covers Qdrant multi-vector setup on Spheron, including HNSW index tuning and co-location with vLLM.

The query router is simple: classify the query as text or visual, run the corresponding retrieval path, merge results, and pass to the LLM. A basic implementation uses a small classifier trained on query type labels, or a heuristic that routes queries containing "chart", "figure", "diagram", or "graph" to the ColPali path and everything else to the dense text path.
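
The keyword heuristic fits in a few lines. This sketch uses the four terms named above; the `route_query` name and the "colpali" and "dense" return labels are placeholders for however your pipeline names its two retrieval paths.

```python
import re

VISUAL_TERMS = ("chart", "figure", "diagram", "graph")
# Word boundaries keep "paragraph" from matching "graph"; the optional "s" covers plurals.
_VISUAL_RE = re.compile(r"\b(" + "|".join(VISUAL_TERMS) + r")s?\b", re.IGNORECASE)


def route_query(query: str) -> str:
    """Return "colpali" for queries that mention visual content, else "dense"."""
    return "colpali" if _VISUAL_RE.search(query) else "dense"
```

A heuristic like this is cheap and easy to inspect, but it misroutes visual questions that avoid these words, which is where a small trained classifier earns its keep.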

Throughput vs. Cost: Pages per Second per Dollar on Spheron

Solution	GPU	Pages/hr	$/hr	Cost/10K pages	Data residency
Docling on Spheron	L40S	2,400	$1.91	$7.96	Your GPU boundary
Docling on Spheron	H100 SXM5	4,800	$3.12	$6.50	Your GPU boundary
Marker on Spheron	L40S	3,000	$1.91	$6.37	Your GPU boundary
MinerU on Spheron	L40S	1,800	$1.91	$10.61	Your GPU boundary
Azure Document Intelligence	Prebuilt layout	N/A	$10/1K pages	$100.00	Azure region
AWS Textract	Forms + tables	N/A	$0.065/page	$650.00	AWS region

Against Azure Document Intelligence, self-hosting Docling on an L40S is roughly 12x cheaper at on-demand rates ($7.96 versus $100.00 per 10K pages). At spot pricing on Spheron, the gap widens further. Against AWS Textract at $650.00 per 10K pages, the gap is roughly 80x.

Data residency is the other factor that managed services cannot address. When you upload documents to Azure Document Intelligence or AWS Textract, the document leaves your infrastructure. For healthcare organizations processing clinical notes under HIPAA, for legal teams processing privileged documents, and for financial institutions under GDPR or PCI-DSS, that boundary matters. Running Docling or Marker on a Spheron GPU node keeps the document inside your compute boundary for the full parsing pipeline.


Hybrid Pipelines: Combining Document Intelligence with DeepSeek-OCR

Built-in OCR fallbacks in Docling (EasyOCR), Marker (Surya), and MinerU (PDF-Extract-Kit) work well for clean scanned documents. They struggle with low-quality scans, handwritten annotations, stamps, and mixed-script content. For those cases, routing to a VLM-based OCR model before feeding Docling gives substantially better text quality. For a comparison of the leading VLM OCR models (PaddleOCR-VL-1.6, DeepSeek-OCR, GOT-OCR 2.0, and Granite-Docling) including VRAM sizing and cost-per-page math, see the open-source OCR VLM comparison.

The routing decision is fast: PyMuPDF's get_text() returns an empty string for image-only pages.

python

import fitz  # PyMuPDF
from pathlib import Path


def classify_pages(pdf_path: str) -> dict[int, str]:
    """Return {page_num: 'native' | 'scanned'} for each page."""
    page_types = {}
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc):
            text = page.get_text().strip()
            page_types[page_num] = "native" if len(text) > 50 else "scanned"
    return page_types


async def hybrid_parse(pdf_path: str, docling_client, deepseek_ocr_client):
    page_types = classify_pages(pdf_path)
    native_pages = [p for p, t in page_types.items() if t == "native"]
    scanned_pages = [p for p, t in page_types.items() if t == "scanned"]

    results = {}

    if native_pages:
        # Docling handles native pages with full structure detection
        docling_result = await docling_client.parse(pdf_path, pages=native_pages)
        results["native"] = docling_result

    if scanned_pages:
        # DeepSeek-OCR handles scanned pages via vLLM vision endpoint
        ocr_result = await deepseek_ocr_client.ocr_pages(pdf_path, pages=scanned_pages)
        results["scanned"] = ocr_result

    return merge_results(results, page_types)

The threshold of 50 characters filters out pages with only a header or a page number, treating them as effectively image-only. Adjust based on your document set. A legal contract with a 40-character footer on an otherwise scanned page should go to the OCR path.
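
The `merge_results` helper called in the routing code is not defined in this guide. One possible shape is sketched below, under the assumption that each path returns a dict of {page_num: text}; real Docling and DeepSeek-OCR client responses would need adapting to that shape first.

```python
def merge_results(results: dict, page_types: dict[int, str]) -> list[dict]:
    """Interleave per-page outputs from both paths back into document order.

    Assumes results["native"] and results["scanned"] are dicts of
    {page_num: text}; that shape is an assumption made for this sketch.
    """
    merged = []
    for page_num in sorted(page_types):
        source = page_types[page_num]
        # A page missing from its path's output is kept with text=None.
        text = results.get(source, {}).get(page_num)
        merged.append({"page": page_num, "source": source, "text": text})
    return merged
```

Recording which path produced each page makes it easier to audit OCR quality later and to re-run a single path if its model changes.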

See the DeepSeek-OCR on GPU cloud guide for the vLLM vision endpoint configuration and batch size tuning. For the orchestration layer that manages routing across many concurrent document jobs, the agentic RAG infrastructure guide covers queue-based job distribution with worker autoscaling.

For post-retrieval graph-based reasoning over the indexed documents, the GraphRAG deployment guide shows how to add entity extraction and community detection on top of a Qdrant collection.

Choosing Between Docling, Marker, and MinerU

The framework choice comes down to document type, output format requirements, and throughput targets.

Use Docling when the documents are financial reports, legal contracts, or enterprise Office files (DOCX, PPTX, XLSX). TableFormer's structural accuracy on complex tables is meaningfully better than the alternatives, and the DoclingDocument JSON object carries enough provenance to support filtered retrieval by element type.

Use Marker when you need maximum throughput on large PDF corpora where the documents are reasonably well-structured and the output goes directly to a text embedding model. The Markdown output requires less post-processing than raw DoclingDocument JSON for simple ingestion cases.

Use MinerU when the document corpus is scientific literature with equations, or when the primary language is Chinese, Japanese, or Korean. UniMERNet's formula extraction accuracy makes a real difference in STEM research RAG applications where query results depend on reading equations correctly.

For mixed enterprise document sets that include all three types, run Docling as the default and route identified scientific papers (by metadata, filename pattern, or a fast document classifier) to MinerU. Use the hybrid OCR routing for scanned documents in any category.
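
A default-plus-override router along those lines might look like the sketch below. The `doc_type` metadata key and the "arxiv" filename pattern are hypothetical examples; substitute whatever signals your corpus actually carries.

```python
from pathlib import Path
from typing import Optional

OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def pick_framework(filename: str, metadata: Optional[dict] = None) -> str:
    """Choose a parser: Docling by default, MinerU for identified scientific papers."""
    metadata = metadata or {}
    if Path(filename).suffix.lower() in OFFICE_SUFFIXES:
        return "docling"  # Office formats go to Docling's native input support
    if metadata.get("doc_type") == "scientific_paper":  # hypothetical metadata key
        return "mineru"
    if "arxiv" in filename.lower():  # hypothetical filename pattern
        return "mineru"
    return "docling"
```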

Running Docling, Marker, or MinerU on Spheron keeps your document ingestion pipeline inside your own GPU cloud boundary, meeting data residency requirements while cutting per-page costs by roughly 12x compared to Azure Document Intelligence at on-demand rates. Check H100 SXM5 availability on Spheron or view current GPU pricing for full on-demand and spot rates.

Quick Setup Guide

1. Size your GPU node
Deploy an H100 SXM5 or L40S instance on Spheron. For single-framework batch processing, an L40S at 48 GB handles concurrent Docling layout detection and table extraction without paging. For mixed workloads that also run embedding models, use H100 SXM5 for its memory bandwidth headroom.

2. Deploy Docling via Docker
Pull the IBM Docling Docker image, mount a shared volume for input PDFs and output JSON, and configure the DocumentConverter with TableFormerMode.ACCURATE for financial tables or TableFormerMode.FAST for speed-constrained pipelines.

3. Deploy Marker or MinerU as an alternative
For Marker, install marker-pdf and run convert_single_pdf with GPU-enabled Surya models. For MinerU, use the magic-pdf CLI with --method auto to auto-detect whether a PDF is native or scanned. Both frameworks export to structured Markdown with table fidelity.

4. Build the FastAPI ingestion endpoint
Wrap your chosen framework in a FastAPI service that accepts PDF file uploads, routes scanned pages to DeepSeek-OCR via vLLM, and merges output chunks before returning structured JSON ready for embedding.

5. Index into a vector database
Pipe DoclingDocument or Marker Markdown output through a chunking strategy that respects section and table boundaries. Embed with a text embedding model via TEI, then upsert to Qdrant with metadata fields for document_type, page_number, and table_id for filtered retrieval.

6. Validate throughput and quality
Run the benchmark suite on a 100-page sample from each document type (financial report, scientific paper, legal contract). Record pages/second and table extraction F1. Compare cost per 10K pages against the Azure Document Intelligence baseline.
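
For the validation step, the two measurements can be sketched as follows. Both helpers are illustrative: `parse_fn` stands in for whichever framework call you are timing, and the cell-level F1 over (row, col, text) tuples is a simple proxy for table accuracy, not the TEDS metric cited in the benchmarks.

```python
import time


def pages_per_second(parse_fn, documents: list[tuple[str, int]]) -> float:
    """Time parse_fn over (path, page_count) pairs and return pages per second."""
    start = time.perf_counter()
    total_pages = 0
    for path, page_count in documents:
        parse_fn(path)
        total_pages += page_count
    elapsed = time.perf_counter() - start
    return total_pages / elapsed if elapsed > 0 else float("inf")


def table_cell_f1(predicted: set, expected: set) -> float:
    """Cell-level F1 over (row, col, text) tuples."""
    if not predicted and not expected:
        return 1.0  # nothing to extract and nothing extracted
    true_pos = len(predicted & expected)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)
```

Run the timing on a warm service, after the models are loaded, so the first-request load time does not skew the pages/second figure.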

Frequently Asked Questions

What is IBM Docling and how does it differ from Marker and MinerU?

IBM Docling is an open-source document conversion library that uses a transformer-based TableFormer model for table extraction and a layout detection model for reading-order recovery. Marker is a pipeline-based converter using Surya layout and OCR models. MinerU is OpenDataLab's document understanding toolkit with strong support for formula recognition and multi-column layouts. All three run on GPU and produce structured output for RAG ingestion.

What GPU do I need to run Docling, Marker, or MinerU in production?

All three frameworks load layout detection and OCR models simultaneously. Plan for at least 8-12 GB VRAM for single-stream processing. For throughput at 30+ pages per minute, an L40S (48 GB) or H100 (80 GB) is recommended. The H100's higher memory bandwidth reduces table extraction latency on complex financial and legal documents.

How do I integrate Docling output into a RAG pipeline?

Docling exports to DoclingDocument JSON, which preserves table structures, reading order, and section boundaries. Pass these structured chunks directly to a text embedding model served via Hugging Face TEI, then index into Qdrant or Milvus. For visual retrieval over the same documents, combine with ColPali at /blog/colpali-multimodal-document-rag-gpu-cloud/ to handle chart-heavy pages that text extraction misses.

When should I use DeepSeek-OCR alongside Docling or Marker?

Use a hybrid pipeline when your document set contains both machine-readable PDFs (where Docling/Marker native PDF parsing is faster and more accurate) and scanned or handwritten pages where a vision-language OCR model outperforms classical detection. Route documents by type at ingestion time and merge outputs before embedding.

What is the cost per 10,000 pages parsed on Spheron versus Azure Document Intelligence?

On Spheron with an L40S at on-demand pricing, a throughput of roughly 1,800-3,000 pages/hour gives a per-10K-page cost of approximately $6-11 depending on framework and document complexity. Azure Document Intelligence charges $10 per 1,000 pages (prebuilt layout model), putting 10K pages at $100. Self-hosting with Docling or Marker on Spheron is 10-15x cheaper at scale, and documents never leave your GPU boundary.