pymupdf/pymupdf4llm

PyMuPDF4LLM

License: MIT

PyMuPDF4LLM is a lightweight extension for PyMuPDF that turns documents into clean, structured data with minimal setup. It includes layout analysis without any GPU requirement.

PyMuPDF4LLM makes it easy to extract document content in the format you need for LLM and RAG environments. It supports structured data extraction to Markdown, JSON, and plain text, as well as integration with LlamaIndex and LangChain.

Features

  • Parsing of multiple document formats.
  • Export of structured data as Markdown, JSON, or plain text.
  • Support for multi-column pages.
  • Support for image and vector graphics extraction.
  • Layout analysis for better semantic understanding of document structure.
  • Support for page-chunked output.
  • Integration with popular AI frameworks.

Installation

$ pip install -U pymupdf4llm

This command will automatically install or upgrade PyMuPDF as required.
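
To confirm which versions ended up installed, one option is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of PyMuPDF4LLM.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report both the extension and the PyMuPDF distribution it depends on.
for name in ("pymupdf4llm", "pymupdf"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Returning `None` instead of raising keeps the check usable in scripts that only want to warn about a missing package.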

Execution

Markdown

import pymupdf4llm

# Convert the document to a Markdown string
md_text = pymupdf4llm.to_markdown("input.pdf")

# Now work with the output, e.g. store it as a UTF-8 encoded file
import pathlib
pathlib.Path("output.md").write_text(md_text, encoding="utf-8")
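
For RAG pipelines, the Markdown string is often split into smaller pieces before indexing. The following is a minimal sketch, not part of PyMuPDF4LLM, that splits text at ATX headings (lines starting with `#`) using only the standard library. The function name `split_by_heading` and the sample text are assumptions for illustration.

```python
import re
from typing import List

# ATX heading: 1-6 '#' characters followed by a space, at the start of a line.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_by_heading(md_text: str) -> List[str]:
    """Split Markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING.finditer(md_text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text before the first heading
    bounds = starts + [len(md_text)]
    sections = [md_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]  # drop empty sections


# Placeholder input; in practice pass the string returned by to_markdown().
sample = "Intro text\n\n# Title\n\nBody\n\n## Details\n\nMore body\n"
sections = split_by_heading(sample)
```

This simple pattern would also split at `#` comment lines inside fenced code blocks, so documents containing code may need a more careful splitter.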

JSON

import pymupdf4llm

json_text = pymupdf4llm.to_json("input.pdf")

# Now work with the output, e.g. store it as a UTF-8 encoded file
import pathlib
pathlib.Path("output.json").write_text(json_text, encoding="utf-8")
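
Since the JSON output is handled as text here, it can be parsed back into Python objects with the standard `json` module. The sketch below makes no assumption about the schema PyMuPDF4LLM emits; it only reports the top-level shape of whatever JSON it is given. The helper `describe_json` and the sample string are hypothetical.

```python
import json


def describe_json(json_text: str) -> str:
    """Summarize the top-level structure of a JSON string."""
    data = json.loads(json_text)
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


# Placeholder input; in practice pass the string returned by to_json().
print(describe_json('{"pages": [], "title": "demo"}'))
```

Inspecting the top level first is a cheap way to learn the real layout before writing code that depends on specific keys.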

Plain Text

import pymupdf4llm

plain_text = pymupdf4llm.to_text("input.pdf")

# Now work with the output, e.g. store it as a UTF-8 encoded file
import pathlib
pathlib.Path("output.txt").write_text(plain_text, encoding="utf-8")

Documentation

Check out the PyMuPDF4LLM documentation for details on installation, features, sample code, and the full API.

Examples

Find our examples on GitHub.

Integrations

For your AI application development, check out our integrations with popular frameworks.

Support

You can get support for PyMuPDF4LLM via a number of options:
