OpenDataLoader PDF

Local PDF parsing for RAG pipelines — converts PDFs to Markdown and JSON with accurate reading order, table extraction, and bounding boxes.

Overview

OpenDataLoader PDF converts PDFs into LLM-ready Markdown and JSON. It runs entirely locally with no GPU required — useful for RAG pipelines where privacy and determinism matter.

Why It Stands Out

Problem                             | How It Solves It
Multi-column text reads incorrectly | XY-Cut++ algorithm preserves correct reading order
Tables lose structure               | Border + cluster detection keeps rows/columns intact
Headers/footers pollute context     | Auto-filtered before output
No coordinates for citations        | Bounding box for every element
Cloud APIs = privacy concerns       | 100% local, no data leaves your machine
GPU required                        | Pure CPU, rule-based; runs anywhere
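
To give a feel for how a cut-based reading-order pass can work, here is a minimal sketch of the classic recursive XY-cut idea. It is not the XY-Cut++ implementation used by OpenDataLoader PDF, whose details are not described here. The sketch assumes boxes are (x1, y1, x2, y2) tuples with the origin at the top-left of the page, and the names widest_gap, xy_cut, and min_gap are illustrative only.

```python
from typing import List, Optional, Tuple

# Assumed layout: (x1, y1, x2, y2), origin at top-left, y grows downward.
Box = Tuple[float, float, float, float]


def widest_gap(boxes: List[Box], axis: int) -> Optional[Tuple[float, float]]:
    """Return (gap_size, cut_position) for the widest empty band on an axis.

    axis=0 projects boxes onto x (finds column gutters),
    axis=1 projects boxes onto y (finds gaps between rows).
    Returns None when the projections overlap everywhere.
    """
    intervals = sorted((b[axis], b[axis + 2]) for b in boxes)
    best = None
    reach = intervals[0][1]  # furthest extent covered so far
    for start, end in intervals[1:]:
        if start > reach:
            gap = start - reach
            if best is None or gap > best[0]:
                best = (gap, (reach + start) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursively splitting the page at its widest gap."""
    if len(boxes) <= 1:
        return list(boxes)

    candidates = []
    for axis in (1, 0):  # on equal gaps, prefer a top/bottom split
        found = widest_gap(boxes, axis)
        if found is not None and found[0] >= min_gap:
            candidates.append((found[0], axis, found[1]))

    if not candidates:
        # No usable gap: fall back to plain top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))

    _, axis, cut = max(candidates, key=lambda c: c[0])
    # The cut lies inside an empty band, so every box is on exactly one side.
    first = [b for b in boxes if b[axis + 2] <= cut]
    second = [b for b in boxes if b[axis] >= cut]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)
```

On a page with a full-width title above two columns, this reads the title first, then the whole left column, then the right column. A naive sort by vertical position would instead interleave paragraphs from the two columns.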

Key Features

  • Structured Output — JSON with semantic types (heading, paragraph, table, list, caption)
  • Bounding Boxes — Every element includes [x1, y1, x2, y2] coordinates for citations
  • Reading Order — XY-Cut++ algorithm handles multi-column layouts correctly
  • Noise Filtering — Headers, footers, hidden text, watermarks auto-removed
  • Deterministic — Same input always produces the same output (no LLM hallucinations)
  • Fast — 100+ pages per second on CPU
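
As a rough illustration of how typed elements with bounding boxes can back a citation lookup, the sketch below walks a nested element tree and collects the page and box of every element that mentions a phrase. The field names ("type", "page", "bbox", "content", "kids") and the sample data are assumptions made for this example, not the tool's documented schema; check the JSON the tool actually writes and adjust the keys accordingly.

```python
from typing import Any, Dict, Iterator, List

Element = Dict[str, Any]


def iter_elements(node: Element, children_key: str = "kids") -> Iterator[Element]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get(children_key, []):
        yield from iter_elements(child, children_key)


def find_citations(root: Element, phrase: str) -> List[Element]:
    """Return type, page, and bbox of each element whose text contains phrase."""
    needle = phrase.lower()
    hits = []
    for el in iter_elements(root):
        text = el.get("content", "")
        if isinstance(text, str) and needle in text.lower():
            hits.append(
                {"type": el.get("type"), "page": el.get("page"), "bbox": el.get("bbox")}
            )
    return hits


# Hand-written sample in the assumed shape; this is not real tool output.
sample = {
    "type": "document",
    "kids": [
        {"type": "heading", "page": 1, "bbox": [72, 700, 540, 730],
         "content": "Results"},
        {"type": "paragraph", "page": 1, "bbox": [72, 600, 540, 690],
         "content": "Accuracy improved by a wide margin."},
        {"type": "list", "page": 2, "bbox": [72, 480, 540, 560], "kids": [
            {"type": "paragraph", "page": 2, "bbox": [90, 500, 540, 520],
             "content": "Accuracy on held-out data"},
        ]},
    ],
}

for hit in find_citations(sample, "accuracy"):
    print(hit)
```

Each hit carries enough information to point a reader at the exact region of the source page, which is the job the bounding boxes are meant to do.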

Installation

Install the Python package with pip:

pip install -U opendataloader-pdf

Quick Start

import opendataloader_pdf

# Convert document.pdf and write the results into output/
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="markdown,json"  # comma-separated list of output formats
)

Use Cases

  • RAG pipelines — Convert PDFs to chunked markdown for vector databases
  • Document analysis — Extract tables and structured data from technical PDFs
  • Citation systems — Use bounding boxes to reference exact locations in source documents
  • Batch processing — Process large document collections locally without API costs
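
For the RAG use case, one simple way to chunk converted Markdown is to split it at headings so each chunk keeps its section title. The sketch below is one possible approach, not part of OpenDataLoader PDF. It assumes the Markdown uses "#"-style headings, and chunk_markdown is a hypothetical helper name. It does not special-case "#" lines inside fenced code blocks.

```python
import re
from typing import Dict, List

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def _flush(chunks: List[Dict[str, str]], heading: str, lines: List[str]) -> None:
    """Append the current section as a chunk unless it is completely empty."""
    body = "\n".join(lines).strip()
    if heading or body:
        chunks.append({"heading": heading, "text": body})


def chunk_markdown(text: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading-delimited section."""
    chunks: List[Dict[str, str]] = []
    heading = ""  # text before the first heading gets an empty heading
    lines: List[str] = []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            _flush(chunks, heading, lines)
            heading = match.group(2)
            lines = []
        else:
            lines.append(line)
    _flush(chunks, heading, lines)
    return chunks


markdown = "Intro line\n\n# Title\n\nFirst para.\n\n## Section A\n\nText A.\n"
for chunk in chunk_markdown(markdown):
    print(repr(chunk["heading"]), "->", repr(chunk["text"]))
```

Each chunk can then be embedded and stored with its heading as metadata. Long sections may still need a further split by length before they go into a vector database.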