OpenDataLoader PDF converts PDFs into LLM-ready Markdown and JSON. It runs entirely locally with no GPU required, which suits RAG pipelines where privacy and determinism matter.

| Problem | How It Solves It |
|---|---|
| Multi-column text reads incorrectly | XY-Cut++ algorithm preserves correct reading order |
| Tables lose structure | Border + cluster detection keeps rows/columns intact |
| Headers/footers pollute context | Auto-filtered before output |
| No coordinates for citations | Bounding box for every element |
| Cloud APIs = privacy concerns | 100% local, no data leaves your machine |
| GPU required | Pure CPU, rule-based — runs anywhere |
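
To show why a cut-based approach helps with multi-column pages, here is a minimal sketch of the classic recursive XY-cut idea. It is not XY-Cut++ and not this library's implementation. The `Box` type, the `split_on_gaps` and `xy_cut` helpers, the `min_gap` threshold, and the assumption that `y` grows downward from the top of the page are all illustrative choices.

```python
from typing import List, Tuple

# (x1, y1, x2, y2); this sketch assumes y grows downward from the page top.
Box = Tuple[float, float, float, float]


def split_on_gaps(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes into bands separated by empty gaps along one axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # farthest edge covered by the current band
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])      # empty gap found: start a new band
        else:
            groups[-1].append(box)    # overlaps or touches the current band
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[Box]:
    """Return boxes in an approximate reading order using recursive cuts."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (0, 1):  # try a column cut first, then a row cut
        groups = split_on_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g, min_gap)]
    # No clean cut exists: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


# Hypothetical page: a full-width title above two columns of two paragraphs each.
title = (50, 20, 550, 50)
left_1, left_2 = (50, 70, 290, 200), (50, 220, 290, 400)
right_1, right_2 = (310, 70, 550, 300), (310, 320, 550, 400)

page = [right_2, left_2, title, right_1, left_1]
reading_order = xy_cut(page)
naive_order = sorted(page, key=lambda b: (b[1], b[0]))
```

On this sample, `reading_order` finishes the left column before starting the right one, while `naive_order` interleaves the two columns because it only sorts by vertical position.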

Each element's position is given as `[x1, y1, x2, y2]` coordinates for citations.

Available via multiple package managers:

```bash
# Python
pip install -U opendataloader-pdf

# Node.js
npm install @opendataloader/pdf
```
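
Convert a PDF from Python:

```python
import opendataloader_pdf

# Convert document.pdf and write both Markdown and JSON into output/
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="markdown,json"
)
```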
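
Once the JSON is written, the bounding boxes can be collected for citations. The sketch below rests on assumptions: it supposes each element is a JSON object with a `bounding box` key holding `[x1, y1, x2, y2]`, a `content` key holding text, and children nested under `kids`. Those key names, the inline sample, and the `iter_boxed_elements` helper are illustrative and not confirmed here, so check them against a real output file before relying on them.

```python
from typing import Any, Dict, Iterator


def iter_boxed_elements(node: Any, bbox_key: str = "bounding box") -> Iterator[Dict[str, Any]]:
    """Yield every dict in a nested JSON structure that carries a 4-number box."""
    if isinstance(node, dict):
        box = node.get(bbox_key)
        if isinstance(box, list) and len(box) == 4:
            yield node
        for value in node.values():
            yield from iter_boxed_elements(value, bbox_key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_boxed_elements(item, bbox_key)


# Hypothetical sample standing in for json.load() on a file in output/.
sample = {
    "kids": [
        {
            "type": "heading",
            "page number": 1,
            "bounding box": [72.0, 700.0, 540.0, 720.0],
            "content": "Introduction",
        },
        {
            "type": "paragraph",
            "page number": 1,
            "bounding box": [72.0, 640.0, 540.0, 690.0],
            "content": "Body text.",
        },
    ]
}

# Pair each piece of text with its page and box so an answer can cite its source.
citations = [
    (el.get("content"), el.get("page number"), el["bounding box"])
    for el in iter_boxed_elements(sample)
]
```

Walking the tree recursively keeps the helper independent of how deeply elements are nested, and the `bbox_key` parameter makes it easy to adapt if the real key name differs.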