PDF Analysis Tool - Streamlined Version

A Streamlit application that extracts text from PDF documents with Magic PDF and classifies each sentence with a BERT-based model.

Features

✅ BERT Model Support

BERT, RoBERTa, E5, and BGE model architectures
Custom model loading with automatic architecture detection
SSL handling for model downloads

✅ Magic PDF Processing

High-quality PDF text extraction
Sentence segmentation with page mapping
JSON-based content organization

✅ Enhanced UI

Document viewer with sentence highlighting
Page navigation
Sentence numbering and confidence scores
File browser

Interface Screenshots

[Screenshot: Tool Interface Overview]

[Screenshot: Analysis Results Display]

Quick Start

1. Install Dependencies

pip install -r requirements_refactored.txt

2. Install Magic PDF

Follow the Magic PDF installation guide to set up the PDF processing engine.

3. Run the Application

python run.py

4. Using the Tool

Load a Model: Upload your trained model file (.pth, .pt, or .bin); the filename should contain an architecture hint such as "bert" or "roberta"
Upload PDF: Select a PDF file to analyze
Process: Click "Process PDF" to extract and classify sentences
Review: Browse the sentences classified as useful, highlighted in the document viewer

Dependencies

streamlit>=1.28.0 - Web interface
torch>=2.0.0 - Deep learning framework
transformers>=4.30.0 - BERT models
pdf2image>=1.16.0 - PDF to image conversion
Pillow>=9.0.0 - Image processing
numpy>=1.21.0 - Numerical computing
pandas>=1.3.0 - Data manipulation
Magic PDF library - PDF text extraction

File Structure

├── run.py                       # Main entry point
├── app_fixed.py                 # Streamlined application
├── step1_magic_pdf_to_json.py   # Magic PDF processing
├── step2_json_clean.py          # JSON cleaning and sentence extraction
├── requirements_refactored.txt  # Dependencies
└── src/                         # UI components

How It Works

Model Loading: Load a pre-trained BERT-based model for sentence classification
PDF Processing: Magic PDF extracts text with page and structure information
Sentence Extraction: Text is cleaned and segmented into sentences
Classification: The model classifies each sentence as useful or not useful
Visualization: Results are displayed with highlighting and confidence scores
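
The extraction, classification, and filtering steps above can be sketched in a few lines. This is a minimal illustration, not the application's actual code: the names `Sentence`, `split_sentences`, `classify`, and `toy_score` are hypothetical, the regex-based splitter is an assumption, and `toy_score` stands in for the BERT classifier so the example runs without a model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Sentence:
    index: int  # running sentence number across the document
    page: int   # 1-based page the sentence came from
    text: str


def split_sentences(pages: Dict[int, str]) -> List[Sentence]:
    """Split page text into sentences, remembering the source page."""
    sentences: List[Sentence] = []
    for page in sorted(pages):
        # Naive split after ., ! or ? followed by whitespace (an assumption).
        for part in re.split(r"(?<=[.!?])\s+", pages[page].strip()):
            if part:
                sentences.append(Sentence(len(sentences) + 1, page, part))
    return sentences


def classify(
    sentences: List[Sentence],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[Sentence, float]]:
    """Keep sentences whose 'useful' score reaches the threshold."""
    scored = [(s, score_fn(s.text)) for s in sentences]
    return [(s, score) for s, score in scored if score >= threshold]


def toy_score(text: str) -> float:
    # Hypothetical stand-in for the BERT classifier:
    # longer sentences simply score higher.
    return min(len(text.split()) / 10.0, 1.0)


pages = {
    1: "Short one. This sentence has quite a few more words in it.",
    2: "Another brief line.",
}
useful = classify(split_sentences(pages), toy_score, threshold=0.5)
for sentence, score in useful:
    print(f"#{sentence.index} (page {sentence.page}, {score:.2f}): {sentence.text}")
```

Raising `threshold` keeps fewer, higher-confidence sentences; lowering it keeps more.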

Model Support

The application detects the model architecture from the model filename:

Files containing "bert" → BERT architecture
Files containing "roberta" → RoBERTa architecture
Files containing "e5" → E5 architecture
Files containing "bge" → BGE architecture
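
One way such filename matching could work is sketched below. It is an assumption about the approach, not the tool's actual implementation: `detect_architecture` and `ARCHITECTURE_HINTS` are hypothetical names, and the example filenames are made up. Note that "roberta" contains the substring "bert", so this sketch checks "roberta" first.

```python
from pathlib import Path
from typing import Optional

# Ordered so that "roberta" is tested before "bert", because the
# string "roberta" itself contains "bert".
ARCHITECTURE_HINTS = [
    ("roberta", "RoBERTa"),
    ("bert", "BERT"),
    ("e5", "E5"),
    ("bge", "BGE"),
]


def detect_architecture(model_path: str) -> Optional[str]:
    """Return the architecture implied by the filename, or None."""
    name = Path(model_path).name.lower()
    for hint, architecture in ARCHITECTURE_HINTS:
        if hint in name:
            return architecture
    return None


# Hypothetical filenames, for illustration only.
print(detect_architecture("checkpoints/roberta_sentence_cls.pt"))
print(detect_architecture("my_BERT_model.pth"))
print(detect_architecture("bge-small.bin"))
print(detect_architecture("classifier.bin"))
```

A filename without any hint yields `None` here, which corresponds to the "Ensure model filename contains architecture hint" item under Troubleshooting.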

Configuration

Environment Variables

TRANSFORMERS_OFFLINE=0          # Allow online model downloads
HF_HUB_DISABLE_TELEMETRY=1     # Disable telemetry
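
If you prefer to set these from Python rather than the shell, a minimal sketch follows. It assumes the assignments run at the very top of the entry script, before `transformers` is imported, which is the safe order because these variables are typically read at import time.

```python
import os

# Set these before importing transformers so the libraries see them.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("TRANSFORMERS_OFFLINE", "0")      # allow online model downloads
os.environ.setdefault("HF_HUB_DISABLE_TELEMETRY", "1")  # disable telemetry

for name in ("TRANSFORMERS_OFFLINE", "HF_HUB_DISABLE_TELEMETRY"):
    print(f"{name}={os.environ[name]}")
```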

Streamlit Settings

Maximum upload size: 10 GB
Wide layout mode
Expanded sidebar

Troubleshooting

Model Loading

Ensure the model filename contains an architecture hint ("bert", "roberta", "e5", or "bge")
Check available GPU memory when loading large models
Verify model file integrity

PDF Processing

Install Magic PDF dependencies
Check PDF file validity
Ensure sufficient system memory

Performance

Use a GPU for faster inference
Adjust the confidence threshold to filter results
Process large documents in sections
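
Processing a large document in sections comes down to splitting its page count into ranges and handling one range at a time. The helper below is a hypothetical sketch (`page_ranges` is not part of the tool); the inclusive, 1-based ranges it yields match the `first_page` and `last_page` parameters of pdf2image's `convert_from_path`, though whether the application uses them this way is an assumption.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, section_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges."""
    if section_size < 1:
        raise ValueError("section_size must be at least 1")
    for first in range(1, total_pages + 1, section_size):
        yield first, min(first + section_size - 1, total_pages)


for first, last in page_ranges(total_pages=23, section_size=10):
    print(f"process pages {first}-{last}")
```

For a 23-page document with sections of 10 pages, this yields the ranges 1-10, 11-20, and 21-23.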

License

This tool is provided for research and educational purposes.
