PyMuPDF4LLM

PyMuPDF4LLM is a lightweight extension for PyMuPDF that turns documents into clean, structured data with minimal setup. It includes layout analysis without any GPU requirement.

PyMuPDF4LLM makes it easy to extract document content in the format you need for LLM and RAG environments. It supports structured data extraction to Markdown, JSON and plain text, as well as LlamaIndex and LangChain integration.

Features

Parsing of multiple document formats.
Export of structured data as Markdown, JSON and plain text.
Support for multi-column pages.
Support for image and vector graphics extraction.
Layout analysis for better semantic understanding of document structure.
Support for page chunking output.
Integration with popular AI frameworks.

Installation

$ pip install -U pymupdf4llm

This command will automatically install or upgrade PyMuPDF as required.
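To confirm that both packages ended up in the active environment, a small check along the following lines can help. It is a minimal sketch that relies only on the standard library; the helper name `installed_version` is made up for this illustration and is not part of PyMuPDF4LLM.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report the versions of PyMuPDF4LLM and of PyMuPDF, which it depends on.
    for name in ("pymupdf4llm", "pymupdf"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```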

Execution

Markdown

import pathlib

import pymupdf4llm

# convert the document to a Markdown string
md_text = pymupdf4llm.to_markdown("input.pdf")

# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.md").write_text(md_text, encoding="utf-8")
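The page chunking feature listed above can be sketched as follows. This assumes the `page_chunks=True` option of `to_markdown`, which the PyMuPDF4LLM documentation describes as returning a list with one dictionary per page, each with a "text" entry; check the API documentation for the exact keys in your installed version. The helper names `load_page_chunks` and `non_empty_pages` are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, List, Tuple


def load_page_chunks(path: str) -> List[Dict[str, Any]]:
    """Convert a document and return one dictionary per page."""
    # Imported lazily so that the pure helper below can be used without it.
    import pymupdf4llm

    # Assumption: page_chunks=True yields a list of per-page dictionaries.
    return pymupdf4llm.to_markdown(path, page_chunks=True)


def non_empty_pages(chunks: List[Dict[str, Any]]) -> List[Tuple[int, str]]:
    """Return (page_index, markdown) pairs for pages that contain text."""
    pages = []
    for index, chunk in enumerate(chunks):
        # Assumption: the page's Markdown is stored under the "text" key.
        text = chunk.get("text", "").strip()
        if text:
            pages.append((index, text))
    return pages


if __name__ == "__main__":
    for index, text in non_empty_pages(load_page_chunks("input.pdf")):
        print(f"page index {index}: {len(text)} characters of Markdown")
```

Keeping the per-page text separate makes it straightforward to attach a page reference to each piece before it is passed to an embedding or retrieval step.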

JSON

import pathlib

import pymupdf4llm

# convert the document to a JSON string
json_text = pymupdf4llm.to_json("input.pdf")

# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.json").write_text(json_text, encoding="utf-8")
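Because the result is a JSON string, it can be loaded back into Python objects with the standard `json` module. The sketch below makes no assumption about the schema that `to_json` produces; it only reports the top-level shape so that you can explore the data. The helper name `summarize_json` is hypothetical.

```python
import json


def summarize_json(json_text: str) -> str:
    """Describe the top-level structure of a JSON string."""
    data = json.loads(json_text)
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


if __name__ == "__main__":
    import pymupdf4llm

    print(summarize_json(pymupdf4llm.to_json("input.pdf")))
```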

Plain Text

import pathlib

import pymupdf4llm

# convert the document to a plain text string
plain_text = pymupdf4llm.to_text("input.pdf")

# now work with the output data, e.g. store as a UTF-8 encoded file
pathlib.Path("output.txt").write_text(plain_text, encoding="utf-8")
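For RAG use, plain text is often cut into overlapping windows before indexing. The following is a minimal sketch of that idea in pure Python; it is not a PyMuPDF4LLM feature, and the function name `split_text` and its default sizes are assumptions chosen for illustration.

```python
from typing import List


def split_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once the current window has reached the end of the text.
        if start + size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    import pymupdf4llm

    pieces = split_text(pymupdf4llm.to_text("input.pdf"))
    print(f"{len(pieces)} chunks")
```

Character windows ignore sentence and paragraph boundaries, so a production pipeline would usually split on structure instead, for example on the page chunks or headings of the Markdown output.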

Documentation

Check out the PyMuPDF4LLM documentation for details on installation, features, sample code and the full API.

Examples

Find our examples on GitHub.

Integrations

For your AI application development, check out our integrations with popular frameworks.

Support

You can get support for PyMuPDF4LLM via a number of options:
