PyMuPDF4LLM is a lightweight extension for PyMuPDF that turns documents into clean, structured data with minimal setup. It includes layout analysis without any GPU requirement.
PyMuPDF4LLM makes it easy to extract document content in the format you need for LLM and RAG environments. It supports structured data extraction to Markdown, JSON, and plain text (TXT), as well as LlamaIndex and LangChain integration.
- Parsing of multiple document formats.
- Export of structured data as Markdown, JSON, or plain text.
- Support for multi-column pages.
- Support for image and vector graphics extraction.
- Layout analysis for better semantic understanding of document structure.
- Support for page chunking output.
- Integration with popular AI frameworks.
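As an illustration of the page chunking output listed above, the following is a minimal sketch, not a definitive recipe. It assumes that `to_markdown()` accepts `page_chunks=True` and then returns one dictionary per page with the Markdown stored under a `"text"` key. The helper names `load_page_chunks` and `chunk_texts` are hypothetical and exist only for this example.

```python
from typing import Any


def load_page_chunks(path: str) -> list[dict[str, Any]]:
    """Return one dictionary per page instead of a single Markdown string."""
    # Imported lazily so that chunk_texts() below can be used without the library.
    import pymupdf4llm

    # Assumption: page_chunks=True switches the result to a list of per-page dicts.
    return pymupdf4llm.to_markdown(path, page_chunks=True)


def chunk_texts(chunks: list[dict[str, Any]]) -> list[str]:
    """Collect the Markdown text of every chunk that is not empty."""
    # Assumption: each chunk keeps its Markdown under the "text" key.
    return [c["text"] for c in chunks if c.get("text", "").strip()]
```

A typical call would be `chunk_texts(load_page_chunks("input.pdf"))`, which yields a list of per-page Markdown strings that can be passed to an embedding or indexing step.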
$ pip install -U pymupdf4llm

This command will automatically install or upgrade PyMuPDF as required.
import pymupdf4llm
# convert the document to Markdown text
md_text = pymupdf4llm.to_markdown("input.pdf")
# now work with the output data, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.md").write_text(md_text, encoding="utf-8")

import pymupdf4llm
json_text = pymupdf4llm.to_json("input.pdf")
# now work with the output data, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.json").write_text(json_text, encoding="utf-8")

import pymupdf4llm
plain_text = pymupdf4llm.to_text("input.pdf")
# now work with the output data, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.txt").write_text(plain_text, encoding="utf-8")

Check out the PyMuPDF4LLM documentation for details on installation, features, sample code, and the full API.
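The three conversions above differ only in the function called and the output file name, so they can be folded into one helper. The following is a sketch under stated assumptions: `export_document` and its `converters` parameter are hypothetical names introduced for this example, and the default mapping assumes that `to_markdown`, `to_json`, and `to_text` each return a string, as the snippets above suggest.

```python
import pathlib
from typing import Callable, Optional


def export_document(
    source: str,
    target: str,
    converters: Optional[dict[str, Callable[[str], str]]] = None,
) -> pathlib.Path:
    """Convert `source` and write the result to `target` as UTF-8.

    The output format is chosen from the file suffix of `target`.
    """
    if converters is None:
        # Imported lazily; only needed when the default mapping is used.
        import pymupdf4llm

        converters = {
            ".md": pymupdf4llm.to_markdown,
            ".json": pymupdf4llm.to_json,
            ".txt": pymupdf4llm.to_text,
        }

    out_path = pathlib.Path(target)
    suffix = out_path.suffix.lower()
    if suffix not in converters:
        raise ValueError(f"unsupported output format: {suffix!r}")

    text = converters[suffix](source)
    # Pass the encoding explicitly so the result does not depend on the platform default.
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

For example, `export_document("input.pdf", "output.json")` would write the JSON form. Passing a custom `converters` mapping makes the helper usable with other conversion functions.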
Find our examples on GitHub.
For your AI application development, check out our integrations with popular frameworks.
You can get support for PyMuPDF4LLM via a number of options: