PDF Cleaner Tool

A powerful Python application for redacting sensitive information (SSNs, bank account numbers) from PDF documents. Handles both text-based and scanned image PDFs using OCR technology.

Features

Automatic SSN Redaction: Fully redacts Social Security Numbers (###-##-####)
Bank Account Masking: Masks bank account numbers showing only last 4 digits
Scanned PDF Support: Uses OCR (Tesseract) for scanned documents
Batch Processing: Recursively process entire folder hierarchies
Structure Preservation: Maintains original folder structure in output
GUI Interface: User-friendly tkinter-based interface
Detailed Logging: Generates processing logs with all actions taken

Prerequisites

System Requirements

Python 3.7 or higher
Windows, macOS, or Linux

External Dependencies

Tesseract OCR: Required for processing scanned PDFs
- Download Tesseract-OCR
- On Windows: Install to C:\Program Files\Tesseract-OCR\tesseract.exe
- On Linux: sudo apt-get install tesseract-ocr
- On macOS: brew install tesseract

Installation

1. Clone/Download the Project

cd pdf_cleaner_tool

2. Create Virtual Environment (Recommended)

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Python Dependencies

pip install -r requirements.txt

4. Install Tesseract OCR

Download and install from the link above based on your operating system.

Windows: Verify the installation path matches in pdf_processor.py (default: C:\Program Files\Tesseract-OCR\tesseract.exe)

Usage

GUI Application (Recommended)

Run the graphical interface:

python gui.py

Steps:

Click "Browse" to select the input folder containing PDFs
(Optional) Specify an output folder; defaults to "Cleaned Client Folder"
Click "Start Processing"
Processing log will be saved in the output folder as processing_log.txt

Programmatic Usage

from pdf_processor import process_folder

# Process entire folder recursively
process_folder("./input_folder", "./output_folder")

Project Structure

pdf_cleaner_tool/
├── gui.py                 # GUI application entry point
├── pdf_processor.py       # Core PDF redaction logic
├── file_utils.py          # File utility functions
├── logger.py              # Logging configuration
├── ocr_processor.py       # OCR helper functions
├── requirements.txt       # Python dependencies
├── README.md             # This file
└── test_redaction.py     # Testing utilities

Redaction Rules

Social Security Numbers (SSN)

Pattern: XXX-XX-XXXX
Action: Fully redacted with black box
Example: 123-45-6789 → [REDACTED]

Bank Account Numbers

Pattern: 8-16 consecutive digits
Action: Masked to show only last 4 digits
Example: 1234567890 → XXXXXX7890

Output

Redacted PDFs: Saved in output folder preserving original structure
Non-PDF Files: Copied unchanged to output folder
Processing Log: processing_log.txt containing:
- Start/end times
- Files processed successfully
- Files with errors or warnings

Troubleshooting

"Tesseract is not installed" Error

Ensure Tesseract-OCR is installed and the path in pdf_processor.py is correct:

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

GUI Not Opening

Ensure tkinter is installed (usually included with Python):

# Windows
python -m pip install tk

# macOS/Linux
sudo apt-get install python3-tk  # or brew install python-tk@3.x

"ModuleNotFoundError" for PDF libraries

Reinstall requirements:

pip install --upgrade -r requirements.txt

Performance Notes

Scanned PDFs: OCR processing is slower (5-15 seconds per page depending on resolution)
Text-based PDFs: Fast processing (< 1 second per page)
Batch Processing: Processing time scales with number and size of PDFs
Recommend 4GB+ RAM for processing large batches

Security Considerations

Processed PDFs are saved with compression (deflate)
Original files are never modified; outputs are saved separately
Redactions are permanent and cannot be undone
Keep output folder in secure location

Limitations

Only detects SSNs in standard XXX-XX-XXXX format
Bank account detection is basic (8-16 digit sequences)
Watermarked or heavily encrypted PDFs may not process correctly
OCR accuracy depends on scanned document quality

System Requirements by OS

OS	Python	Tesseract
Windows 10/11	3.7+	5.0+
macOS 10.14+	3.7+	5.0+
Ubuntu 18.04+	3.7+	4.0+

Support & Feedback

For issues or feature requests, review the processing log for detailed error messages. Check that:

All Python dependencies are installed
Tesseract OCR is properly installed
Input PDFs are not corrupted
Output folder has write permissions

License

[Specify your license here]

Version

1.0.0 - January 2026

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
test_files		test_files
.gitignore		.gitignore
INSTALLATION.md		INSTALLATION.md
README.md		README.md
file_utils.py		file_utils.py
gui.py		gui.py
install.bat		install.bat
install.ps1		install.ps1
installer.ps1		installer.ps1
interactive_redactor.py		interactive_redactor.py
logger.py		logger.py
main.py		main.py
ocr_processor.py		ocr_processor.py
pdf_processor.py		pdf_processor.py
requirements.txt		requirements.txt
run.bat		run.bat
test_redaction.py		test_redaction.py
uninstaller.ps1		uninstaller.ps1

Folders and files

Latest commit

History

Repository files navigation

PDF Cleaner Tool

Features

Prerequisites

System Requirements

External Dependencies

Installation

1. Clone/Download the Project

2. Create Virtual Environment (Recommended)

3. Install Python Dependencies

4. Install Tesseract OCR

Usage

GUI Application (Recommended)

Programmatic Usage

Project Structure

Redaction Rules

Social Security Numbers (SSN)

Bank Account Numbers

Output

Troubleshooting

"Tesseract is not installed" Error

GUI Not Opening

"ModuleNotFoundError" for PDF libraries

Performance Notes

Security Considerations

Limitations

System Requirements by OS

Support & Feedback

License

Version

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages