DebdeepGhosh2511/Pdfextract_kit

📄 PDF Extractor: Text, Tables, Images (Multi-Page)

This project extracts text, tables (as CSV), and images (as JPG) from PDF files using Python and OpenDataLab's pdf-extraction-kit. The extracted outputs are organized by PDF name, page number, and file type.


🚀 Features

  • ✅ Extracts text from each page as .txt
  • ✅ Extracts tables as .csv, maintaining structure
  • ✅ Extracts images from each page as .jpg
  • ✅ Organizes output by PDF file name and page
  • ✅ Automatically processes all PDFs in the input/ folder
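The last two features could be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the helper names `find_pdfs` and `output_path`, and the exact `output/<pdf name>/page_<n>/` layout, are assumptions. The README only states that outputs are organized by PDF name, page number, and file type.

```python
from pathlib import Path


def find_pdfs(input_dir):
    """Return every PDF directly inside input_dir, sorted by name."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_path(output_dir, pdf_path, page_number, kind, index=1):
    """Build a per-PDF, per-page output path (hypothetical layout)."""
    # Map each output kind to the extension listed in the features above.
    extensions = {"text": ".txt", "table": ".csv", "image": ".jpg"}
    page_dir = Path(output_dir) / Path(pdf_path).stem / f"page_{page_number}"
    return page_dir / f"{kind}_{index}{extensions[kind]}"
```

With this assumed layout, the second table on page 3 of `input/report.pdf` would be written to `output/report/page_3/table_2.csv`.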

📁 Project Structure

Pdfextract2/
├── input/                   # Put your PDFs here
├── output/                  # Generated text, images, tables (auto-created)
├── pdf-extraction-kit/      # Contains the OpenDataLab pipeline logic
├── main.py                  # Main script to run the extraction
├── README.md                # This file
└── .gitignore               # Ignore output/ and temp files
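
Because output/ is described as auto-created, a script can create any missing folders just before writing each file. Below is a minimal sketch, assuming an extracted table arrives as a list of rows. The helper name `save_table` is hypothetical and is not taken from main.py or pdf-extraction-kit.

```python
import csv
from pathlib import Path


def save_table(rows, csv_path):
    """Write rows (a list of lists) to csv_path, creating parent folders."""
    csv_path = Path(csv_path)
    # Create output/<pdf>/<page>/ on demand; do nothing if it already exists.
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return csv_path
```

Writing one list per table row keeps the row and column structure intact in the resulting .csv file.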
