> [!CAUTION]
> Part of the text in this repo was written by ChatGPT. Also, I haven't yet run all of the pipelines myself due to limited compute.
This repository provides an overview of notable pipelines and benchmarks for PDF/OCR document processing. Each entry includes a brief description and useful data.

Did you know that GitHub supports a table of contents by default? 🤔
Pipeline | OmniDocBench Overall Edit ↓ | olmOCR Eval ELO ↑ | Marker Bench LLM Score ↑ | READoc Overall Score ↑ | Actualize.pro Overall Score ↑ |
---|---|---|---|---|---|
MinerU | **0.150** | 1545.2 | - | 60.17 | **8** |
Marker | 0.336 | 1429.1 | **4.23916** | 63.57 | 6.5 |
Mathpix | 0.189 | - | 4.15626 | - | - |
DocLing | 0.589 | - | 3.70429 | - | 7.3 |
GOT-OCR | 0.289 | 1212.7 | - | - | - |
olmOCR | - | **1813.0** | - | - | - |
LlamaParse | - | - | 3.97619 | - | 7.1 |
MarkItDown | - | - | - | - | 7.78 |
Nougat | 0.453 | - | - | **81.42** | - |
Zerox | - | - | - | - | 7.9 |
Unstructured | - | - | - | - | 6.2 |
Pix2Text | - | - | - | 64.39 | - |
open-parse | - | - | - | - | - |
- **Bold** indicates the best result for a given metric.
- "-" means the pipeline was not evaluated in that benchmark.
- ⚠️ means the benchmark was conducted by the pipeline's own authors.
## MinerU

Primary Language: Python
License: AGPL-3.0
Description: MinerU is an open-source tool designed to convert PDFs into machine-readable formats, such as Markdown and JSON, facilitating seamless data extraction and further processing. Developed during the pre-training phase of InternLM, MinerU addresses symbol conversion challenges in scientific literature, making it invaluable for research and development in large language models. Key features include:
- Content Cleaning: Removes headers, footers, footnotes, and page numbers to ensure semantic coherence.
- Structure Preservation: Maintains the original document structure, including titles, paragraphs, and lists.
- Multimodal Extraction: Accurately extracts images, image descriptions, tables, and table captions.
- Formula Recognition: Converts recognized formulas into LaTeX format.
- Table Conversion: Transforms tables into LaTeX or HTML formats.
- OCR Capabilities: Detects scanned or corrupted PDFs and enables OCR functionality, supporting text recognition in 84 languages.
- Cross-Platform Compatibility: Operates on Windows, Linux, and Mac platforms, supporting both CPU and GPU environments.
## Marker

Primary Language: Python
License: GPL-3.0
Description: Marker “converts PDFs and images to markdown, JSON, and HTML quickly and accurately.” It is designed to handle a wide range of document types in all languages and produce structured outputs.
Benchmark Results: https://github.com/VikParuchuri/marker?tab=readme-ov-file#performance
API Details:
- API URL: https://www.datalab.to/
- Pricing: https://www.datalab.to/plans
- Average Price: $3 per 1,000 pages, with a minimum of $25 per month
Additional Notes: Demo available after registration on https://www.datalab.to/
## MarkItDown

Primary Language: Python
License: MIT
Description: MarkItDown is a Python-based utility developed by Microsoft for converting various file formats into Markdown. It supports a wide range of file types, including:
- Office Documents: Word (.docx), PowerPoint (.pptx), Excel (.xlsx)
- Media Files: Images (with EXIF metadata and OCR capabilities), Audio (with speech transcription)
- Web and Data Formats: HTML, CSV, JSON, XML
- Archives: ZIP files (with recursive content parsing)
- URLs: YouTube links
This versatility makes MarkItDown a valuable tool for tasks such as indexing, text analysis, and preparing content for Large Language Model (LLM) training. The utility offers both command-line and Python API interfaces, providing flexibility for various use cases. Additionally, MarkItDown features a plugin-based architecture, allowing for easy integration of third-party extensions to enhance its functionality.
## olmOCR

Primary Language: Python
License: Apache-2.0
Description: olmOCR is an open-source toolkit developed by the Allen Institute for AI, designed to convert PDFs and document images into clean, plain text suitable for large language model (LLM) training and other applications. Key features include:
- High Accuracy: Preserves reading order and supports complex elements such as tables, equations, and handwriting.
- Document Anchoring: Combines text and visual information to enhance extraction accuracy.
- Structured Content Representation: Utilizes Markdown to represent structured content, including sections, lists, equations, and tables.
- Optimized Pipeline: Compatible with SGLang and vLLM inference engines, enabling efficient scaling from single to multiple GPUs.
## LlamaParse

Primary Language: Python
License: Proprietary
Description: LlamaParse is a GenAI-native document parsing platform developed by LlamaIndex. It transforms complex documents—including PDFs, PowerPoint presentations, Word documents, and spreadsheets—into structured, LLM-ready formats. LlamaParse excels in accurately extracting and formatting tables, images, and other non-standard layouts, ensuring high-quality data for downstream applications such as Retrieval-Augmented Generation (RAG) and data processing. The platform supports over 10 file types and offers features like natural language parsing instructions, JSON output, and multilingual support.
API Details:
- API URL: https://api.cloud.llamaindex.ai/api/parsing/upload
- Pricing: https://docs.cloud.llamaindex.ai/llamaparse/usage_data
- Average Price: Free Plan: 1,000 pages per day; Paid Plan: 7,000 pages per week, with additional pages at $3 per 1,000 pages
## Mathpix

Primary Language: Not publicly available
License: Proprietary
Description: Mathpix offers advanced Optical Character Recognition (OCR) technology tailored for STEM content. Their services include the Convert API, which accurately digitizes images and PDFs containing complex elements such as mathematical equations, chemical diagrams, tables, and handwritten notes. The platform supports multiple output formats, including LaTeX, MathML, HTML, and Markdown, facilitating seamless integration into various applications and workflows. Additionally, Mathpix provides the Snipping Tool, a desktop application that allows users to capture and convert content from their screens into editable formats with a single keyboard shortcut.
API Details:
- API URL: https://docs.mathpix.com/
- Pricing: https://mathpix.com/pricing
- Average Price: $5 per 1000 pages
## Nougat

Primary Language: Python
License: MIT
Description: Nougat (Neural Optical Understanding for Academic Documents) is an open-source Visual Transformer model developed by Meta AI Research. It is designed to perform Optical Character Recognition (OCR) on scientific documents, converting PDFs into a machine-readable markup language. Nougat simplifies the extraction of complex elements such as mathematical expressions and tables, enhancing the accessibility of scientific knowledge. The model processes raw pixel data from document images and outputs structured markdown text, bridging the gap between human-readable content and machine-readable formats.
## GOT-OCR

Primary Language: Python
License: Apache-2.0
Description: GOT-OCR (General OCR Theory) is an open-source, unified end-to-end model designed to advance OCR to version 2.0. It supports a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas, and sheet music. The model is highly versatile, supporting various input types and producing structured outputs, making it well-suited for complex OCR tasks.
Benchmark Results: https://github.com/Ucas-HaoranWei/GOT-OCR2.0#benchmarks
## DocLing

Primary Language: Python
License: MIT
Description: DocLing is an open-source document processing pipeline developed by IBM Research. It simplifies the parsing of diverse document formats—including PDF, DOCX, PPTX, HTML, and images—and provides seamless integrations with the generative AI ecosystem. Key features include advanced PDF understanding, optical character recognition (OCR) support, and plug-and-play integrations with frameworks like LangChain and LlamaIndex.
## Zerox

Primary Language: TypeScript
License: MIT
Description: Zerox is an OCR and document extraction tool that leverages vision models to convert PDFs and images into structured Markdown format. It excels in handling complex layouts, including tables and charts, making it ideal for AI ingestion and further text analysis.
Benchmark Results: https://getomni.ai/ocr-benchmark
API Details:
- API URL: https://getomni.ai/
- Pricing: https://getomni.ai/pricing
- Average Price: Extract structured data: 'Startup' plan at $225 per month with 5000 pages included, after that $2 per 1000 pages
## Unstructured

Primary Language: Python
License: Apache-2.0
Description: Unstructured is an open-source library that provides components for ingesting and pre-processing unstructured data, including images and text documents such as PDFs, HTML, and Word documents. It transforms complex data into structured formats suitable for large language models and AI applications. The platform offers enterprise-grade connectors to seamlessly integrate various data sources, making it easier to extract and transform data for analysis and processing.
API Details:
- API URL: https://docs.unstructured.io/platform-api/overview
- Pricing: https://unstructured.io/developers
- Average Price: Basic Strategy: $2 per 1,000 pages, suitable for simple, text-only documents. Advanced Strategy: $20 per 1,000 pages, ideal for PDFs, images, and complex file types. Platinum/VLM Strategy: $30 per 1,000 pages, designed for challenging documents, including scanned and handwritten content with VLM API integration.
## Pix2Text

Primary Language: Python
License: MIT
Description: Pix2Text (P2T) is an open-source Python3 tool designed to recognize layouts, tables, mathematical formulas (LaTeX), and text in images, converting them into Markdown format. It serves as a free alternative to Mathpix, supporting over 80 languages, including English, Simplified Chinese, Traditional Chinese, and Vietnamese. P2T can also process entire PDF files, extracting content into structured Markdown, facilitating seamless conversion of visual content into text-based representations.
## Open Parse

Primary Language: Python
License: MIT
Description: Open Parse is a flexible, open-source library designed to enhance document chunking for Retrieval-Augmented Generation (RAG) systems. It visually analyzes document layouts to effectively group related content, surpassing traditional text-splitting methods. Key features include:
- Visually-Driven Analysis: Understands complex layouts for superior chunking.
- Markdown Support: Extracts headings, bold, and italic text into Markdown format.
- High-Precision Table Extraction: Converts tables into clean Markdown with high accuracy.
- Extensibility: Allows implementation of custom post-processing steps.
- Intuitive Design: Offers robust editor support for seamless integration.
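The core idea, keeping related content together instead of splitting at arbitrary character counts, can be illustrated with a minimal heading-aware splitter. This is a conceptual sketch only; Open Parse's actual chunking is visually driven and far more sophisticated:

```python
def chunk_by_headings(markdown: str) -> list[str]:
    """Split a Markdown document into chunks, starting a new chunk at
    each heading so a heading stays attached to the body that follows it."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A heading closes the previous chunk (unless we are at the start).
        if line.lstrip().startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Unlike a fixed-size splitter, this never yields a chunk whose heading landed in the previous chunk, which is the kind of structural coherence RAG retrieval benefits from.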
## Extractous

Primary Language: Rust
License: Apache-2.0
Description: Extractous is a high-performance, open-source library designed for efficient extraction of content and metadata from various document types, including PDF, Word, HTML, and more. Developed in Rust, it offers bindings for multiple programming languages, starting with Python. Extractous aims to provide a comprehensive solution for unstructured data extraction, enabling local and efficient processing without relying on external services or APIs. Key features include:
- High Performance: Leveraging Rust's capabilities, Extractous achieves faster processing speeds and lower memory utilization compared to traditional extraction libraries.
- Multi-Language Support: While the core is written in Rust, bindings are available for Python, with plans to support additional languages like JavaScript/TypeScript.
- Extensive Format Support: Through integration with Apache Tika, Extractous supports a wide range of file formats, ensuring versatility in data extraction tasks.
- OCR Integration: Incorporates Tesseract OCR to extract text from images and scanned documents, enhancing its ability to handle diverse content types.
Benchmark Results: https://github.com/yobix-ai/extractous-benchmarks
## OmniDocBench

OmniDocBench is "a benchmark for evaluating diverse document parsing in real-world scenarios," created by the MinerU developers. It establishes a comprehensive evaluation standard for document content extraction methods.

Notable features: OmniDocBench covers a wide variety of document types and layouts, comprising 981 PDF pages across 9 document types, 4 layout styles, and 3 languages. It provides rich annotations: over 20k block-level elements (paragraphs, headings, tables, etc.) and 80k+ span-level elements (lines, formulas, etc.), including reading order and various attribute tags for pages, text, and tables. The dataset undergoes strict quality control, combining manual annotation, intelligent assistance, and expert review for high accuracy. OmniDocBench also ships *evaluation code* for fair, end-to-end comparisons of document parsing methods. It supports multiple evaluation tasks (overall extraction, layout detection, table recognition, formula recognition, OCR text recognition) and standard metrics (Normalized Edit Distance, BLEU, METEOR, TEDS, COCO mAP/mAR, etc.) to benchmark performance across different aspects of document parsing.
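The headline metric in the tables below is Normalized Edit Distance, where lower is better. A minimal stdlib sketch of the idea (the benchmark's own implementation operates on matched markdown blocks, not raw strings):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (single-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means a perfect match."""
    if not pred and not gt:
        return 0.0
    return edit_distance(pred, gt) / max(len(pred), len(gt))
```

So an "Overall Edit" of 0.150 roughly means 15% of the output had to be edited to match the ground truth.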
### End-to-End Evaluation

End-to-end evaluation assesses a model's accuracy in parsing full PDF pages: the model's Markdown output for the entire page is used as the prediction and compared against the ground truth.
| Method Type | Method | Text Edit↓ EN / ZH | Formula Edit↓ EN / ZH | Formula CDM↑ EN / ZH | Table TEDS↑ EN / ZH | Table Edit↓ EN / ZH | Read Order Edit↓ EN / ZH | Overall Edit↓ EN / ZH |
|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | MinerU-0.9.3 | 0.061 / 0.211 | 0.278 / 0.577 | 66.9 / 49.5 | 78.6 / 62.1 | 0.180 / 0.344 | 0.079 / 0.288 | 0.150 / 0.355 |
| Pipeline Tools | Marker-1.2.3 | 0.080 / 0.315 | 0.530 / 0.883 | 20.1 / 16.8 | 67.6 / 49.2 | 0.619 / 0.685 | 0.114 / 0.340 | 0.336 / 0.556 |
| Pipeline Tools | Mathpix | 0.101 / 0.358 | 0.306 / 0.454 | 71.4 / 72.7 | 77.0 / 67.1 | 0.243 / 0.320 | 0.105 / 0.275 | 0.189 / 0.352 |
| Pipeline Tools | Docling | 0.416 / 0.987 | 0.999 / 1 | 0 / 0 | 61.3 / 25.0 | 0.627 / 0.810 | 0.313 / 0.837 | 0.589 / 0.909 |
| Expert VLMs | GOT-OCR | 0.191 / 0.315 | 0.360 / 0.528 | 81.8 / 51.4 | 53.2 / 47.2 | 0.459 / 0.520 | 0.143 / 0.280 | 0.289 / 0.411 |
| Expert VLMs | Nougat | 0.367 / 0.998 | 0.488 / 0.941 | 17.4 / 16.9 | 39.9 / 0 | 0.572 / 1 | 0.384 / 0.954 | 0.453 / 0.973 |
| General VLMs | GPT-4o | 0.146 / 0.409 | 0.425 / 0.606 | 76.4 / 48.2 | 72.0 / 62.9 | 0.234 / 0.329 | 0.128 / 0.251 | 0.233 / 0.399 |
| General VLMs | Qwen2-VL-72B | 0.253 / 0.251 | 0.468 / 0.572 | 54.9 / 60.9 | 59.5 / 66.4 | 0.551 / 0.518 | 0.254 / 0.223 | 0.381 / 0.391 |
| General VLMs | InternVL2-76B | 0.353 / 0.29 | 0.543 / 0.701 | 69.8 / 49.6 | 63.0 / 60.2 | 0.547 / 0.555 | 0.317 / 0.228 | 0.440 / 0.443 |
Comprehensive evaluation of document parsing algorithms on OmniDocBench: performance metrics for text, formula, table, and reading order extraction, with overall scores derived from ground truth comparisons.
## olmOCR Eval

The olmOCR project provides an evaluation toolkit (`runeval.py`) for side-by-side comparison of PDF conversion pipeline outputs. It lets researchers compare text extraction results from different pipeline versions directly against a gold-standard reference. The olmOCR authors also report evaluations in their technical report:

> We then sampled 2,000 comparison pairs (same PDF, different tool). We asked 11 data researchers and engineers at Ai2 to assess which output was the higher quality representation of the original PDF, focusing on reading order, comprehensiveness of content and representation of structured information. The user interface used is similar to that in Figure 5. Exact participant instructions are listed in Appendix B.
Bootstrapped Elo Ratings (95% CI)

Model | Elo Rating ± CI | 95% CI Range |
---|---|---|
olmOCR | 1813.0 ± 84.9 | [1605.9, 1930.0] |
MinerU | 1545.2 ± 99.7 | [1336.7, 1714.1] |
Marker | 1429.1 ± 100.7 | [1267.6, 1645.5] |
GOT-OCR | 1212.7 ± 82.0 | [1097.3, 1408.3] |
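Ratings like these are derived from pairwise human preferences via the standard Elo update. A minimal sketch of that mechanism (the report additionally bootstraps over match orderings to get confidence intervals, which is omitted here; the starting rating and K-factor are assumptions, not the paper's values):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

def rate(pairs, start: float = 1500.0, k: float = 32.0) -> dict:
    """Run Elo over a list of (winner, loser) comparison outcomes."""
    ratings: dict = {}
    for winner, loser in pairs:
        ra = ratings.setdefault(winner, start)
        rb = ratings.setdefault(loser, start)
        ratings[winner], ratings[loser] = elo_update(ra, rb, True, k)
    return ratings
```

A tool that wins most of its head-to-head comparisons, as olmOCR does in the table below, ends up with the highest rating.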
Table 7: Pairwise Win/Loss Statistics Between Models

Model Pair | Wins / Losses | Win Rate (%) |
---|---|---|
olmOCR vs. Marker | 49/31 | 61.3 |
olmOCR vs. GOT-OCR | 41/29 | 58.6 |
olmOCR vs. MinerU | 55/22 | 71.4 |
Marker vs. MinerU | 53/26 | 67.1 |
Marker vs. GOT-OCR | 45/26 | 63.4 |
GOT-OCR vs. MinerU | 38/37 | 50.7 |
Total | 452 | |
## Marker Benchmark

The Marker repository provides benchmark results comparing various PDF processing methods, scored both with a heuristic that aligns extracted text against ground-truth text segments and with an LLM-as-a-judge method.
Method | Avg Time (s) | Heuristic Score | LLM Score |
---|---|---|---|
marker | 2.83837 | 95.6709 | 4.23916 |
llamaparse | 23.348 | 84.2442 | 3.97619 |
mathpix | 6.36223 | 86.4281 | 4.15626 |
docling | 3.69949 | 86.7073 | 3.70429 |
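A rough, hypothetical approximation of such a heuristic alignment score using stdlib `difflib` — this is illustrative only, not Marker's actual scoring code:

```python
import difflib

def heuristic_score(pred: str, gt_segments: list[str]) -> float:
    """For each ground-truth segment, find its best-matching line in the
    extracted text; average the similarities, scaled to 0-100."""
    pred_lines = [ln for ln in pred.splitlines() if ln.strip()]
    if not gt_segments or not pred_lines:
        return 0.0
    total = 0.0
    for seg in gt_segments:
        # SequenceMatcher.ratio() is 2*M/T, where M is the number of
        # matched characters and T the combined length of both strings.
        total += max(
            difflib.SequenceMatcher(None, seg, ln, autojunk=False).ratio()
            for ln in pred_lines
        )
    return 100.0 * total / len(gt_segments)
```

On this scale, a perfect extraction scores 100; Marker's reported heuristic scores in the mid-80s to mid-90s are in the same spirit.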
## READoc

Methods | Text (Concat) | Text (Vocab) | Heading (Concat) | Heading (Tree) | Formula (Embed) | Formula (Isolate) | Table (Concat) | Table (Tree) | Reading Order (Block) | Reading Order (Token) | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
Baselines | |||||||||||
PyMuPDF4LLM | 66.66 | 74.27 | 27.86 | 20.77 | 0.07 | 0.02 | 23.27 | 15.83 | 87.70 | 89.09 | 40.55 |
Tesseract OCR | 78.85 | 76.51 | 1.26 | 0.30 | 0.12 | 0.00 | 0.00 | 0.00 | 96.70 | 97.59 | 35.13 |
Pipeline Tools | |||||||||||
MinerU | 84.15 | 84.76 | 62.89 | 39.15 | 62.97 | 71.02 | 0.00 | 0.00 | 98.64 | 97.72 | 60.17 |
Pix2Text | 85.85 | 83.72 | 63.23 | 34.53 | 43.18 | 37.45 | 54.08 | 47.35 | 97.68 | 96.78 | 64.39 |
Marker | 83.58 | 81.36 | 68.78 | 54.82 | 5.07 | 56.26 | 47.12 | 43.35 | 98.08 | 97.26 | 63.57 |
Expert Visual Models | |||||||||||
Nougat-small | 87.35 | 92.00 | 86.40 | 87.88 | 76.52 | 79.39 | 55.63 | 52.35 | 97.97 | 98.36 | 81.38 |
Nougat-base | 88.03 | 92.29 | 86.60 | 88.50 | 76.19 | 79.47 | 54.40 | 52.30 | 97.98 | 98.41 | 81.42 |
Vision-Language Models | |||||||||||
DeepSeek-VL-7B-Chat | 31.89 | 39.96 | 23.66 | 12.53 | 17.01 | 16.94 | 22.96 | 16.47 | 88.76 | 66.75 | 33.69 |
MiniCPM-Llama3-V2.5 | 58.91 | 70.87 | 26.33 | 7.68 | 16.70 | 17.90 | 27.89 | 24.91 | 95.26 | 93.02 | 43.95 |
LLaVa-1.6-Vicuna-13B | 27.51 | 37.09 | 8.92 | 6.27 | 17.80 | 11.68 | 23.78 | 16.23 | 76.63 | 51.68 | 27.76 |
InternVL-Chat-V1.5 | 53.06 | 68.44 | 25.03 | 13.57 | 33.13 | 24.37 | 40.44 | 34.35 | 94.61 | 91.31 | 47.83 |
GPT-4o-mini | 79.44 | 84.37 | 31.77 | 18.65 | 42.23 | 41.67 | 47.81 | 39.85 | 97.69 | 96.35 | 57.98 |
Table 3: Evaluation of various Document Structured Extraction systems on READOC-arXiv.
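The "Heading (Tree)" column compares the hierarchical heading structure of the output against the ground truth. A minimal illustrative sketch of the idea (not READoc's implementation, which uses tree-edit-based similarity): extract the heading tree from Markdown and measure how much of the reference tree was recovered.

```python
import re

def heading_tree(markdown: str) -> list[tuple[int, str]]:
    """Extract ATX headings as (level, title) pairs, preserving order."""
    tree = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if m:
            tree.append((len(m.group(1)), m.group(2)))
    return tree

def tree_overlap(pred_md: str, gt_md: str) -> float:
    """Fraction of ground-truth (level, title) nodes recovered exactly —
    a crude stand-in for a tree similarity score."""
    pred, gt = heading_tree(pred_md), heading_tree(gt_md)
    if not gt:
        return 1.0
    pred_set = set(pred)
    return sum(node in pred_set for node in gt) / len(gt)
```

This explains why plain OCR tools like Tesseract score near zero on heading metrics: they emit heading text without any `#` markers, so the recovered tree is empty.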
## Actualize.pro Benchmark

In the digital age, PDF documents remain a cornerstone for disseminating and archiving information. However, extracting meaningful data from these structured and unstructured formats continues to challenge modern AI systems. Our recent benchmarking study evaluated seven prominent PDF extraction tools to determine their capabilities across diverse document types and applications.
PDF Parser | Overall Score (out of 10) | Text Extraction Accuracy (Score out of 10) | Table Extraction Accuracy (Score out of 10) | Reading Order Accuracy (Score out of 10) | Markdown Conversion Accuracy (Score out of 10) | Code and Math Equations Extraction (Score out of 10) | Image Extraction Accuracy (Score out of 10) |
---|---|---|---|---|---|---|---|
MinerU | 8 | 9.3 | 7.3 | 8.7 | 8.3 | 6.5 | 7 |
Zerox | 7.9 | 8.7 | 7.7 | 9 | 8.7 | 7 | 6 |
MarkItDown | 7.78 | 9 | 6.83 | 9 | 7.67 | 7.83 | 5.83 |
Docling | 7.3 | 8.7 | 6.3 | 9 | 8 | 6.5 | 5 |
LlamaParse | 7.1 | 7.3 | 7.7 | 8.7 | 7.3 | 6 | 5.3 |
Marker | 6.5 | 7.3 | 5.7 | 7.3 | 6.7 | 4.5 | 6.7 |
Unstructured | 6.2 | 7.3 | 5 | 8.3 | 6.7 | 5 | 4.7 |
## Feature Comparison

Function | MinerU | PaddleOCR | Marker | Unstructured | gptpdf | Zerox | Chunkr | pdf-extract-api | Sparrow | LlamaParse | DeepDoc | MegaParse |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PDF and Image Parsing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Parsing of Other Formats (PPT, Excel, DOCX, etc.) | ✓ | - | - | ✓ | - | ✓ | ✓ | - | ✓ | ✓ | ✓ | ✓ |
Layout Analysis | ✓ | ✓ | ✓ | - | ✓ | - | ✓ | - | - | ✓ | ✓ | - |
Text Recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Image Recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Simple (Vertical/Horizontal/Hierarchical) Tables | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Complex Tables | - | - | - | - | - | - | - | - | - | - | - | - |
Formula Recognition | - | - | - | - | - | - | - | - | - | - | - | - |
HTML Output | ✓ | - | ✓ | ✓ | - | - | ✓ | - | - | - | ✓ | - |
Markdown Output | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ✓ |
JSON Output | ✓ | - | ✓ | ✓ | - | - | ✓ | ✓ | - | ✓ | ✓ | - |
## OmniAI OCR Benchmark

### JSON Accuracy
Model Provider | JSON Accuracy (%) |
---|---|
OmniAI | 91.7% |
Gemini 2.0 Flash | 86.1% |
Azure | 85.1% |
GPT-4o | 75.5% |
AWS Textract | 74.3% |
Claude Sonnet 3.5 | 69.3% |
Google Document AI | 67.8% |
GPT-4o Mini | 64.8% |
Unstructured | 50.8% |
### Cost per 1,000 Pages
Model Provider | Cost per 1,000 Pages ($) |
---|---|
GPT-4o Mini | 0.97 |
Gemini 2.0 Flash | 1.12 |
Google Document AI | 1.50 |
AWS Textract | 4.00 |
OmniAI | 10.00 |
Azure | 10.00 |
GPT-4o | 18.37 |
Claude Sonnet 3.5 | 19.93 |
Unstructured | 20.00 |
### Processing Time per Page
Model Provider | Average Latency (seconds) |
---|---|
Google Document AI | 3.19 |
Azure | 4.40 |
AWS Textract | 4.86 |
Unstructured | 7.99 |
OmniAI | 9.69 |
Gemini 2.0 Flash | 10.71 |
Claude Sonnet 3.5 | 18.42 |
GPT-4o Mini | 22.73 |
GPT-4o | 24.85 |
## Extractous Benchmarks

The charts in the benchmark repository report `extractous` speedup and memory efficiency relative to `unstructured-io`.