paperdown converts research papers from PDF to Markdown using Z.AI's GLM-OCR model and downloads referenced figure assets locally.
If you work with academic papers, you know that the OCR process itself is not the most difficult part. The real challenge is cleaning up the output. Tables can disappear, their structure can become jumbled, and formulas might be converted into meaningless text. This often means you spend more time correcting the output than working with it.
I used to rely on marker for PDF parsing and thought it was great. However, after converting the Batista et al. (2022) article one day, I discovered that Table 4 was missing, regardless of the settings or LLMs I used (via the --use-llm flag). I then switched to docling, and Table 4 reappeared, but all the formulas were gone. Furthermore, both tools require a GPU, and even on a Google Colab T4 instance, processing one article takes 4 to 5 minutes.
This project exists because, while docling and marker are both good tools, they can sometimes miss tables or mix up table structures in ways that require manual correction. I wanted a simple, reliable pipeline that produces a Markdown index file I can trust, local `figures/` and optional `tables/` folders, and the ability to process my entire library quickly on my laptop.
- Async OCR requests and batch PDF processing using the Z.AI API.
- Concurrent figure downloads for each PDF.
- Fast processing with separate controls for total pipeline concurrency and OCR API concurrency.
Note
This tool was designed for academic papers written in English. Parsing other kinds of PDFs, such as documents heavy in tables or figures or written in languages other than English, has not been tested.
Start by running:
```shell
paperdown --input path/to/paper.pdf
```

My preferred method is batch directory processing:
```shell
paperdown --input pdf/ --output md/ --workers 32 --ocr-workers 2 --overwrite
```

`--workers` controls how many PDFs are processed concurrently in batch mode. `--ocr-workers` controls concurrent OCR API calls. Effective OCR concurrency is `min(--workers, --ocr-workers)`.
Without `--overwrite`, an existing `<output>/<pdf_stem>/log.jsonl` marker skips the PDF. If the log marker is missing, paperdown treats the PDF as unprocessed and refreshes managed artifacts (`index.md`, `figures/`, and `tables/` when `--normalize-tables` is enabled). With `--overwrite`, paperdown replaces the whole `<output>/<pdf_stem>/` folder before processing.
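Based on the log-marker behavior above, a small shell sketch can preview which PDFs would still be processed on the next run. The `pdf/` and `md/` paths here follow the batch example and are not paperdown defaults:

```shell
# List PDFs that lack a <output>/<pdf_stem>/log.jsonl marker and
# would therefore be (re)processed on the next run without --overwrite.
for f in pdf/*.pdf; do
  stem=$(basename "$f" .pdf)
  [ -f "md/$stem/log.jsonl" ] || echo "$f"
done
```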
Install from crates.io:

```shell
cargo install paperdown
```

Install from source (this repository):

```shell
cargo install --git https://github.com/atsyplenkov/paperdown.git
```

```
$ paperdown --help
paperdown converts one PDF or a directory of PDFs into markdown output folders.

For each PDF, it creates:
- <output>/<pdf_stem>/index.md
- <output>/<pdf_stem>/figures/
- <output>/<pdf_stem>/tables/ (when `--normalize-tables` is enabled)
- <output>/<pdf_stem>/log.jsonl

API key lookup order:
1) ZAI_API_KEY from --env-file
2) ZAI_API_KEY from environment

Usage: paperdown [OPTIONS] --input <PATH>

Options:
      --input <PATH>             Input path: a single .pdf file or a directory containing .pdf files.
      --output <OUTPUT>          Output root directory for generated markdown folders. [default: md]
      --env-file <ENV_FILE>      Path to .env file checked first for ZAI_API_KEY, before environment fallback. [default: .env]
      --timeout <TIMEOUT>        HTTP timeout in seconds for OCR requests and figure downloads. [default: 180]
      --max-download-bytes <MAX_DOWNLOAD_BYTES>
                                 Maximum allowed size (bytes) for each downloaded figure file. [default: 20971520]
      --workers <WORKERS>        Maximum number of PDFs processed concurrently in batch mode. [default: 32]
      --ocr-workers <OCR_WORKERS>
                                 Maximum number of concurrent OCR API calls in batch mode; effective OCR concurrency is min(--workers, --ocr-workers). [default: 2]
  -v, --verbose                  Enable verbose progress messages on stderr.
      --overwrite                Replace the whole <output>/<pdf_stem>/ folder before processing.
      --normalize-tables         Normalize OCR HTML tables into Markdown and store raw HTML under tables/.
  -h, --help                     Print help (see a summary with '-h')
  -V, --version                  Print version
```
paperdown first looks for `ZAI_API_KEY` in the `--env-file`. If it is not found there, it falls back to the environment variables. To obtain a key, create an account in the Z.AI console and generate an API key from your account settings.
The easiest method is to set `ZAI_API_KEY` in your shell environment:

```shell
export ZAI_API_KEY="your-api-key"
paperdown --input path/to/paper.pdf
```

If you prefer a file, create a `.env` file in the project's root directory:

```
ZAI_API_KEY=your-api-key
```

Then run paperdown as usual, or specify a different file with `--env-file`.
Another example of a table that was parsed incorrectly from a paper is shown below. The paper by Van Rompaey et al. (2005) was converted to markdown incorrectly by marker after about 4 minutes of runtime on a T4 GPU in Google Colab. With LLM post-processing in marker (the `--use-llm` flag and a Gemini model), the table was parsed correctly, but compute time increased to about 5 minutes and the Gemini API call cost around $0.02. paperdown parsed the table correctly, returned the files after 24 seconds, and used 46,945 tokens, costing approximately $0.0014.
The tool records token usage in `log.jsonl` under the `usage` field. At $0.03 per 1,000,000 tokens (input and output combined), processing an average-sized scientific paper such as Batista et al. (2022), with `total_tokens = 79,080`, costs approximately $0.0023724, or about 0.24 cents per article.
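The arithmetic can be reproduced directly from a paper's total token count; a one-liner using the Batista et al. (2022) figure above:

```shell
# Cost in dollars = total_tokens * $0.03 / 1,000,000 tokens
awk 'BEGIN { printf "%.7f\n", 79080 * 0.03 / 1e6 }'  # prints 0.0023724
```

Swap in the `total_tokens` value from your own `log.jsonl` to estimate the cost of any run.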
- docling: my preference if you do not need tables, figures, or formulas.
- marker: good for extracting formulas with LLM post-processing.
- opendataloader-pdf: I have not tried it yet, but its benchmarks are very good.
MIT
