A powerful Python application for redacting sensitive information (SSNs, bank account numbers) from PDF documents. Handles both text-based and scanned image PDFs using OCR technology.
- Automatic SSN Redaction: Fully redacts Social Security Numbers (###-##-####)
- Bank Account Masking: Masks bank account numbers showing only last 4 digits
- Scanned PDF Support: Uses OCR (Tesseract) for scanned documents
- Batch Processing: Recursively process entire folder hierarchies
- Structure Preservation: Maintains original folder structure in output
- GUI Interface: User-friendly tkinter-based interface
- Detailed Logging: Generates processing logs with all actions taken
- Python 3.7 or higher
- Windows, macOS, or Linux
- Tesseract OCR: Required for processing scanned PDFs
- Download Tesseract-OCR
- On Windows: Install to
C:\Program Files\Tesseract-OCR\tesseract.exe - On Linux:
sudo apt-get install tesseract-ocr - On macOS:
brew install tesseract
cd pdf_cleaner_tool# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3 -m venv venv
source venv/bin/activatepip install -r requirements.txtDownload and install from the link above based on your operating system.
Windows: Verify the installation path matches in pdf_processor.py (default: C:\Program Files\Tesseract-OCR\tesseract.exe)
Run the graphical interface:
python gui.pySteps:
- Click "Browse" to select the input folder containing PDFs
- (Optional) Specify an output folder; defaults to "Cleaned Client Folder"
- Click "Start Processing"
- Processing log will be saved in the output folder as
processing_log.txt
from pdf_processor import process_folder
# Process entire folder recursively
process_folder("./input_folder", "./output_folder")pdf_cleaner_tool/
├── gui.py # GUI application entry point
├── pdf_processor.py # Core PDF redaction logic
├── file_utils.py # File utility functions
├── logger.py # Logging configuration
├── ocr_processor.py # OCR helper functions
├── requirements.txt # Python dependencies
├── README.md # This file
└── test_redaction.py # Testing utilities
- Pattern: XXX-XX-XXXX
- Action: Fully redacted with black box
- Example: 123-45-6789 → [REDACTED]
- Pattern: 8-16 consecutive digits
- Action: Masked to show only last 4 digits
- Example: 1234567890 → XXXXXX7890
- Redacted PDFs: Saved in output folder preserving original structure
- Non-PDF Files: Copied unchanged to output folder
- Processing Log:
processing_log.txtcontaining:- Start/end times
- Files processed successfully
- Files with errors or warnings
Ensure Tesseract-OCR is installed and the path in pdf_processor.py is correct:
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"Ensure tkinter is installed (usually included with Python):
# Windows
python -m pip install tk
# macOS/Linux
sudo apt-get install python3-tk # or brew install python-tk@3.xReinstall requirements:
pip install --upgrade -r requirements.txt- Scanned PDFs: OCR processing is slower (5-15 seconds per page depending on resolution)
- Text-based PDFs: Fast processing (< 1 second per page)
- Batch Processing: Processing time scales with number and size of PDFs
- Recommend 4GB+ RAM for processing large batches
- Processed PDFs are saved with compression (deflate)
- Original files are never modified; outputs are saved separately
- Redactions are permanent and cannot be undone
- Keep output folder in secure location
- Only detects SSNs in standard XXX-XX-XXXX format
- Bank account detection is basic (8-16 digit sequences)
- Watermarked or heavily encrypted PDFs may not process correctly
- OCR accuracy depends on scanned document quality
| OS | Python | Tesseract |
|---|---|---|
| Windows 10/11 | 3.7+ | 5.0+ |
| macOS 10.14+ | 3.7+ | 5.0+ |
| Ubuntu 18.04+ | 3.7+ | 4.0+ |
For issues or feature requests, review the processing log for detailed error messages. Check that:
- All Python dependencies are installed
- Tesseract OCR is properly installed
- Input PDFs are not corrupted
- Output folder has write permissions
[Specify your license here]
1.0.0 - January 2026