OpenDataLoader PDF

Local PDF parsing for RAG pipelines — converts PDFs to Markdown and JSON with accurate reading order, table extraction, and bounding boxes.

Overview

OpenDataLoader PDF converts PDFs into LLM-ready Markdown and JSON. It runs entirely locally with no GPU required — useful for RAG pipelines where privacy and determinism matter.

Why It Stands Out

Problem	How It Solves It
Multi-column text reads incorrectly	XY-Cut++ algorithm preserves correct reading order
Tables lose structure	Border + cluster detection keeps rows/columns intact
Headers/footers pollute context	Auto-filtered before output
No coordinates for citations	Bounding box for every element
Cloud APIs raise privacy concerns	100% local, no data leaves your machine
GPU required	Pure CPU, rule-based — runs anywhere
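
To give a sense of what reading-order recovery involves, here is a minimal sketch of the classic recursive XY-cut idea: split the page at the widest whitespace gap, trying a vertical cut (columns) first and a horizontal cut (rows) second, then recurse into each part. This is not the XY-Cut++ implementation used by OpenDataLoader PDF. The box layout (x1, y1, x2, y2) with a top-left origin, the `min_gap` threshold, and the function names are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (x1, y1, x2, y2) with the origin at the top-left corner (assumed here).
Box = Tuple[float, float, float, float]


def _split(boxes: List[Box], axis: int, min_gap: float) -> Optional[Tuple[List[Box], List[Box]]]:
    """Split boxes at the widest gap along one axis (0 = x, 1 = y).

    Returns (before, after), or None when no gap is at least min_gap wide.
    """
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_gap, cut = 0.0, None
    reach = ordered[0][hi]  # furthest extent covered so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap > best_gap:
            best_gap, cut = gap, i
        reach = max(reach, ordered[i][hi])
    if cut is None or best_gap < min_gap:
        return None
    return ordered[:cut], ordered[cut:]


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[Box]:
    """Return boxes in reading order using a plain recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try a vertical cut first (separates columns), then a horizontal one.
    for axis in (0, 1):
        parts = _split(boxes, axis, min_gap)
        if parts:
            first, second = parts
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    # No clear gap: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

On a two-column page this reads the whole left column before the right one, and a full-width heading blocks the vertical cut, so it is emitted before the columns beneath it.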

Key Features

Structured Output — JSON with semantic types (heading, paragraph, table, list, caption)
Bounding Boxes — Every element includes [x1, y1, x2, y2] coordinates for citations
Reading Order — XY-Cut++ algorithm handles multi-column layouts correctly
Noise Filtering — Headers, footers, hidden text, watermarks auto-removed
Deterministic — Same input always produces the same output (no LLM hallucinations)
Fast — 100+ pages per second on CPU
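
As a rough illustration of how typed elements and bounding boxes might feed a citation step, the sketch below walks a nested structure and builds a page-plus-coordinates locator for each paragraph. The key names (`type`, `page`, `bbox`, `content`, `children`) and the sample data are hypothetical; the actual JSON output may use different names and nesting, so check a real output file before adapting this.

```python
from typing import Any, Dict, Iterator

# Hypothetical element shape, for illustration only; the real JSON output
# may use different key names and nesting.
SAMPLE = {
    "type": "document",
    "children": [
        {"type": "heading", "page": 1, "bbox": [72, 700, 540, 720],
         "content": "Results"},
        {"type": "paragraph", "page": 1, "bbox": [72, 640, 540, 690],
         "content": "Accuracy improved by 4 points."},
    ],
}


def iter_elements(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every element that carries a bounding box, depth-first."""
    if "bbox" in node:
        yield node
    for child in node.get("children", []):
        yield from iter_elements(child)


def cite(element: Dict[str, Any]) -> str:
    """Format a source locator from the page number and [x1, y1, x2, y2] box."""
    return f"p.{element['page']} {element['bbox']}"


# Pair each paragraph's text with a locator that can be stored next to its
# embedding and shown to the user as a citation.
paragraphs = [e for e in iter_elements(SAMPLE) if e["type"] == "paragraph"]
citations = [(e["content"], cite(e)) for e in paragraphs]
```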

Installation

Available for Python via pip and for Node.js via npm:

pip install -U opendataloader-pdf

npm install @opendataloader/pdf

Quick Start

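import opendataloader_pdf

# Convert document.pdf and write the results to output/.
# The comma-separated format string requests both Markdown and JSON.
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="markdown,json",
)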

Use Cases

RAG pipelines — Convert PDFs to chunked markdown for vector databases
Document analysis — Extract tables and structured data from technical PDFs
Citation systems — Use bounding boxes to reference exact locations in source documents
Batch processing — Process large document collections locally without API costs
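
For the RAG use case, one simple way to chunk converted Markdown is to split it at headings so that each chunk keeps its section title. The sketch below assumes the Markdown uses ATX-style `#` headings; `chunk_by_heading` is a hypothetical helper, not part of OpenDataLoader PDF, and it does not special-case `#` lines inside fenced code.

```python
import re
from typing import Dict, List

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per heading that has body text.

    Text before the first heading is kept under an empty title; headings
    with no body text of their own are skipped.
    """
    chunks: List[Dict[str, str]] = []
    title, lines = "", []

    def flush() -> None:
        # Only emit a chunk when the section has non-blank content.
        if any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each resulting dictionary can be embedded as is, with the title prepended to the text if the retriever benefits from section context.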
