A streamlined Streamlit application for analyzing PDF documents and classifying sentences using BERT-based models with Magic PDF processing.
✅ BERT Model Support
- BERT, RoBERTa, E5, and BGE model architectures
- Custom model loading with automatic architecture detection
- SSL handling for model downloads
✅ Magic PDF Processing
- High-quality PDF text extraction
- Sentence segmentation with page mapping
- JSON-based content organization
✅ Enhanced UI
- Document viewer with sentence highlighting
- Page navigation
- Sentence numbering and confidence scores
- File browser
pip install -r requirements_refactored.txt

Follow the Magic PDF installation guide to set up the PDF processing engine.
python run.py

- Load a Model: Upload your trained BERT model file (.pth, .pt, or .bin)
- Upload PDF: Select a PDF file to analyze
- Process: Click "Process PDF" to extract and classify sentences
- Review: Browse useful sentences with highlighting
- streamlit>=1.28.0 - Web interface
- torch>=2.0.0 - Deep learning framework
- transformers>=4.30.0 - BERT models
- pdf2image>=1.16.0 - PDF to image conversion
- Pillow>=9.0.0 - Image processing
- numpy>=1.21.0 - Numerical computing
- pandas>=1.3.0 - Data manipulation
- Magic PDF library - PDF text extraction
├── run.py # Main entry point
├── app_fixed.py # Streamlined application
├── step1_magic_pdf_to_json.py # Magic PDF processing
├── step2_json_clean.py # JSON cleaning and sentence extraction
├── requirements_refactored.txt # Dependencies
└── src/ # UI components
- Model Loading: Load pre-trained BERT models for sentence classification
- PDF Processing: Magic PDF extracts text with page and structure information
- Sentence Extraction: Text is cleaned and segmented into sentences
- Classification: BERT classifies each sentence as useful/not useful
- Visualization: Results displayed with highlighting and confidence scores
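The sentence extraction step can be illustrated with a minimal sketch. This is not the application's actual code: the function name `segment_pages` and the regex-based splitter are assumptions chosen to keep the example self-contained, and the input is assumed to be one text string per page.

```python
import re
from typing import List, Tuple

# Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def segment_pages(pages: List[str]) -> List[Tuple[int, str]]:
    """Split each page's text into sentences, keeping the 1-based page number."""
    sentences: List[Tuple[int, str]] = []
    for page_number, text in enumerate(pages, start=1):
        # Collapse line breaks and repeated spaces left over from extraction.
        cleaned = " ".join(text.split())
        for sentence in _SENTENCE_BOUNDARY.split(cleaned):
            if sentence:  # skip empty strings produced by blank pages
                sentences.append((page_number, sentence))
    return sentences
```

Each `(page, sentence)` pair could then be passed to the classifier, so that a highlighted result can be traced back to its page.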
The application automatically detects model type from filename:
- Files containing "bert" → BERT architecture
- Files containing "roberta" → RoBERTa architecture
- Files containing "e5" → E5 architecture
- Files containing "bge" → BGE architecture
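A minimal sketch of filename-based detection follows. The function name `detect_architecture` and the hint table are hypothetical, not taken from the application. One detail worth noting: "roberta" contains the substring "bert", so this sketch checks "roberta" first; whether the application orders its checks the same way is an assumption.

```python
from pathlib import Path
from typing import Optional

# Hypothetical hint table, checked in order.
# "roberta" comes before "bert" because it contains "bert" as a substring.
ARCHITECTURE_HINTS = [
    ("roberta", "RoBERTa"),
    ("bert", "BERT"),
    ("e5", "E5"),
    ("bge", "BGE"),
]


def detect_architecture(filename: str) -> Optional[str]:
    """Return the architecture hinted at by the filename, or None if no hint matches."""
    name = Path(filename).name.lower()  # ignore directories and letter case
    for hint, architecture in ARCHITECTURE_HINTS:
        if hint in name:
            return architecture
    return None
```

For example, `detect_architecture("roberta_sentences.pth")` returns `"RoBERTa"`, while a file named `model.pth` returns `None` and would need to be renamed.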
TRANSFORMERS_OFFLINE=0 # Allow online model downloads
HF_HUB_DISABLE_TELEMETRY=1 # Disable telemetry

- Maximum upload size: 10GB
- Wide layout mode
- Expanded sidebar
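If you prefer to set the two environment variables from Python rather than the shell, a sketch like the following may work. The helper name `apply_runtime_settings` is hypothetical, and the values typically need to be in place before `transformers` is imported.

```python
import os
from typing import MutableMapping


def apply_runtime_settings(env: MutableMapping[str, str] = os.environ) -> None:
    """Write the two settings into the given environment mapping."""
    env["TRANSFORMERS_OFFLINE"] = "0"      # allow online model downloads
    env["HF_HUB_DISABLE_TELEMETRY"] = "1"  # disable telemetry


# Call this at the top of the entry point, before importing transformers.
apply_runtime_settings()
```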
- Ensure the model filename contains an architecture hint ("bert", "roberta", "e5", or "bge")
- Check GPU memory for large models
- Verify model file integrity
- Install Magic PDF dependencies
- Check PDF file validity
- Ensure sufficient system memory
- Use GPU for faster inference
- Adjust confidence threshold for filtering
- Process large documents in sections
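The last two tips can be sketched in a few lines. Both helpers below are illustrative assumptions, not functions from the application: results are assumed to be `(sentence, confidence)` pairs, and the default threshold of 0.5 is arbitrary.

```python
from typing import Iterator, List, Sequence, Tuple


def filter_by_confidence(
    results: Sequence[Tuple[str, float]], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Keep only the sentences whose confidence is at or above the threshold."""
    return [(sentence, score) for sentence, score in results if score >= threshold]


def in_sections(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items, to bound memory per pass."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Raising the threshold shows fewer, higher-confidence sentences; classifying one section at a time keeps memory use roughly proportional to the section size rather than the whole document.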
This tool is provided for research and educational purposes.

