📄 PDF Extractor: Text, Tables, Images (Multi-Page)

This project extracts text, tables (as CSV), and images (as JPG) from PDF files using Python and OpenDataLab's pdf-extraction-kit. The extracted outputs are organized by PDF name, page number, and file type.

🚀 Features

✅ Extracts text from each page as .txt
✅ Extracts tables as .csv, maintaining structure
✅ Extracts images from each page as .jpg
✅ Organizes output by PDF file name and page
✅ Automatically processes all PDFs in the input/ folder
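
The folder scanning and per-page output organization described above could be sketched roughly as follows. This is only an illustration using the standard library: the helper names `find_pdfs` and `output_path`, the `page_N` folder naming, and the `kind_index.ext` file naming are assumptions, not the actual layout produced by main.py or the pdf-extraction-kit pipeline.

```python
from pathlib import Path

# Hypothetical mapping from output kind to file extension,
# following the formats listed in the features (.txt, .csv, .jpg).
EXTENSIONS = {"text": "txt", "table": "csv", "image": "jpg"}


def find_pdfs(input_dir: Path) -> list[Path]:
    """Return every PDF directly inside input_dir, sorted by name."""
    return sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_path(output_dir: Path, pdf: Path, page: int, kind: str, index: int = 1) -> Path:
    """Build an assumed output path: <output>/<pdf name>/page_<n>/<kind>_<index>.<ext>."""
    ext = EXTENSIONS[kind]
    return output_dir / pdf.stem / f"page_{page}" / f"{kind}_{index}.{ext}"


if __name__ == "__main__":
    input_dir = Path("input")
    if input_dir.is_dir():
        for pdf in find_pdfs(input_dir):
            # Show where the text of the first page would be written.
            print(output_path(Path("output"), pdf, page=1, kind="text"))
```

Under these assumptions, the second table on page 3 of input/report.pdf would be written to output/report/page_3/table_2.csv. Grouping by PDF name first and page second keeps all artifacts for one page in a single folder.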

📁 Project Structure

Pdfextract2/
├── input/                   # Put your PDFs here
├── output/                  # Generated text, images, tables (auto-created)
├── pdf-extraction-kit/      # Contains the OpenDataLab pipeline logic
├── main.py                  # Main script to run the extraction
├── README.md                # This file
├── requirements.txt         # Python dependencies
└── .gitignore               # Ignores output/ and temp files
