jay-zc/SM_classification

PDF Analysis Tool - Streamlined Version

A streamlined Streamlit application for analyzing PDF documents and classifying sentences using BERT-based models with Magic PDF processing.

Features

BERT Model Support

  • BERT, RoBERTa, E5, and BGE model architectures
  • Custom model loading with automatic architecture detection
  • SSL handling for model downloads

Magic PDF Processing

  • High-quality PDF text extraction
  • Sentence segmentation with page mapping
  • JSON-based content organization

Enhanced UI

  • Document viewer with sentence highlighting
  • Page navigation
  • Sentence numbering and confidence scores
  • File browser

Interface Screenshots

Tool Interface Overview

[Screenshot: Tool Interface 1]

Analysis Results Display

[Screenshot: Tool Interface 2]

Quick Start

1. Install Dependencies

pip install -r requirements_refactored.txt

2. Install Magic PDF

Follow the Magic PDF installation guide to set up the PDF processing engine.

3. Run the Application

python run.py

4. Using the Tool

  1. Load a Model: Upload your trained BERT model file (.pth, .pt, or .bin)
  2. Upload PDF: Select a PDF file to analyze
  3. Process: Click "Process PDF" to extract and classify sentences
  4. Review: Browse the sentences classified as useful, highlighted in the document viewer

Dependencies

  • streamlit>=1.28.0 - Web interface
  • torch>=2.0.0 - Deep learning framework
  • transformers>=4.30.0 - BERT models
  • pdf2image>=1.16.0 - PDF to image conversion
  • Pillow>=9.0.0 - Image processing
  • numpy>=1.21.0 - Numerical computing
  • pandas>=1.3.0 - Data manipulation
  • Magic PDF library - PDF text extraction

File Structure

├── run.py                    # Main entry point
├── app_fixed.py             # Streamlined application
├── step1_magic_pdf_to_json.py  # Magic PDF processing
├── step2_json_clean.py      # JSON cleaning and sentence extraction
├── requirements_refactored.txt # Dependencies
└── src/                     # UI components

How It Works

  1. Model Loading: Load pre-trained BERT models for sentence classification
  2. PDF Processing: Magic PDF extracts text with page and structure information
  3. Sentence Extraction: Text is cleaned and segmented into sentences
  4. Classification: BERT classifies each sentence as useful/not useful
  5. Visualization: Results are displayed with highlighting and confidence scores
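
The classification step can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the `SentenceResult` type, the `classify` helper, the label order (index 1 meaning "useful"), and the `toy_score` stand-in are all assumptions made for the example. In the real tool, the scores would come from the loaded BERT-style model.

```python
import math
from dataclasses import dataclass


@dataclass
class SentenceResult:
    page: int          # page the sentence was extracted from
    text: str          # cleaned sentence text
    label: str         # "useful" or "not useful"
    confidence: float  # softmax probability of the predicted label


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentences, score_fn, labels=("not useful", "useful")):
    """Label each (page, text) pair using the logits returned by score_fn."""
    results = []
    for page, text in sentences:
        probs = softmax(score_fn(text))
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append(SentenceResult(page, text, labels[best], probs[best]))
    return results


def toy_score(text):
    """Hypothetical stand-in scorer; a real run would call the loaded model."""
    return [0.0, 2.0] if "result" in text.lower() else [2.0, 0.0]


results = classify(
    [(1, "The results show a 5% gain."), (2, "See page 3.")],
    toy_score,
)
for r in results:
    print(f"p{r.page} [{r.label}, {r.confidence:.2f}] {r.text}")
```

Keeping the page number next to each sentence is what lets the viewer highlight results on the right page.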

Model Support

The application automatically detects model type from filename:

  • Files containing "bert" → BERT architecture
  • Files containing "roberta" → RoBERTa architecture
  • Files containing "e5" → E5 architecture
  • Files containing "bge" → BGE architecture
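
One detail is worth noting: the string "roberta" itself contains "bert", so a naive substring check can misclassify RoBERTa files. The sketch below shows one way to do filename-based detection with that ordering handled. The function name `detect_architecture` and the hint table are illustrative assumptions, not the application's real identifiers, and the actual matching logic may differ.

```python
from pathlib import Path

# Checked in order: "roberta" must come before "bert",
# because the string "roberta" contains "bert".
ARCHITECTURE_HINTS = (
    ("roberta", "RoBERTa"),
    ("bert", "BERT"),
    ("e5", "E5"),
    ("bge", "BGE"),
)


def detect_architecture(filename, default=None):
    """Return the architecture hinted at by the filename, or default if none matches."""
    name = Path(filename).name.lower()
    for hint, architecture in ARCHITECTURE_HINTS:
        if hint in name:
            return architecture
    return default


print(detect_architecture("my_roberta_classifier.pth"))
print(detect_architecture("bert-base-finetuned.bin"))
```

Only the final path component is inspected here, so a directory named `bert_models/` would not influence the result.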

Configuration

Environment Variables

TRANSFORMERS_OFFLINE=0          # Allow online model downloads
HF_HUB_DISABLE_TELEMETRY=1     # Disable telemetry
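
If the variables are not exported in the shell, they can also be set from Python at the top of the entry script. This is a sketch under the assumption that it runs before `transformers` is imported, since the Hugging Face libraries generally read these settings when first loaded. Whether `run.py` already does this is not stated here.

```python
import os

# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("TRANSFORMERS_OFFLINE", "0")      # allow online model downloads
os.environ.setdefault("HF_HUB_DISABLE_TELEMETRY", "1")  # disable telemetry

# Import transformers (or any module that imports it) only after this point.
```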

Streamlit Settings

  • Maximum upload size: 10GB
  • Wide layout mode
  • Expanded sidebar

Troubleshooting

Model Loading

  • Ensure the model filename contains an architecture hint ("bert", "roberta", "e5", or "bge")
  • Check GPU memory for large models
  • Verify model file integrity

PDF Processing

  • Install Magic PDF dependencies
  • Check PDF file validity
  • Ensure sufficient system memory

Performance

  • Use GPU for faster inference
  • Adjust confidence threshold for filtering
  • Process large documents in sections
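
The last two tips can be illustrated with two small helpers. Both are hypothetical sketches: the names `filter_by_confidence` and `page_sections`, the `(sentence, confidence)` tuple shape, and the default threshold of 0.5 are assumptions for the example rather than the tool's actual API.

```python
def filter_by_confidence(results, threshold=0.5):
    """Keep (sentence, confidence) pairs whose confidence is at or above threshold."""
    return [(sentence, conf) for sentence, conf in results if conf >= threshold]


def page_sections(num_pages, section_size):
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    for start in range(1, num_pages + 1, section_size):
        yield start, min(start + section_size - 1, num_pages)


scored = [("Sentence A", 0.91), ("Sentence B", 0.42), ("Sentence C", 0.77)]
kept = filter_by_confidence(scored, threshold=0.75)
sections = list(page_sections(num_pages=25, section_size=10))
```

Raising the threshold trades recall for precision. Splitting a long PDF into page ranges keeps peak memory bounded, at the cost of running the pipeline once per range.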

License

This tool is provided for research and educational purposes.
