Skip to content

mabdullah-devx/Offline-pdf-redaction-tool

Repository files navigation

PDF Cleaner Tool

A powerful Python application for redacting sensitive information (SSNs, bank account numbers) from PDF documents. Handles both text-based and scanned image PDFs using OCR technology.

Features

  • Automatic SSN Redaction: Fully redacts Social Security Numbers (###-##-####)
  • Bank Account Masking: Masks bank account numbers showing only last 4 digits
  • Scanned PDF Support: Uses OCR (Tesseract) for scanned documents
  • Batch Processing: Recursively process entire folder hierarchies
  • Structure Preservation: Maintains original folder structure in output
  • GUI Interface: User-friendly tkinter-based interface
  • Detailed Logging: Generates processing logs with all actions taken

Prerequisites

System Requirements

  • Python 3.7 or higher
  • Windows, macOS, or Linux

External Dependencies

  • Tesseract OCR: Required for processing scanned PDFs
    • Download Tesseract-OCR
    • On Windows: Install to C:\Program Files\Tesseract-OCR\tesseract.exe
    • On Linux: sudo apt-get install tesseract-ocr
    • On macOS: brew install tesseract

Installation

1. Clone/Download the Project

cd pdf_cleaner_tool

2. Create Virtual Environment (Recommended)

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Python Dependencies

pip install -r requirements.txt

4. Install Tesseract OCR

Download and install from the link above based on your operating system.

Windows: Verify the installation path matches in pdf_processor.py (default: C:\Program Files\Tesseract-OCR\tesseract.exe)

Usage

GUI Application (Recommended)

Run the graphical interface:

python gui.py

Steps:

  1. Click "Browse" to select the input folder containing PDFs
  2. (Optional) Specify an output folder; defaults to "Cleaned Client Folder"
  3. Click "Start Processing"
  4. Processing log will be saved in the output folder as processing_log.txt

Programmatic Usage

from pdf_processor import process_folder

# Process entire folder recursively
process_folder("./input_folder", "./output_folder")

Project Structure

pdf_cleaner_tool/
├── gui.py                 # GUI application entry point
├── pdf_processor.py       # Core PDF redaction logic
├── file_utils.py          # File utility functions
├── logger.py              # Logging configuration
├── ocr_processor.py       # OCR helper functions
├── requirements.txt       # Python dependencies
├── README.md             # This file
└── test_redaction.py     # Testing utilities

Redaction Rules

Social Security Numbers (SSN)

  • Pattern: XXX-XX-XXXX
  • Action: Fully redacted with black box
  • Example: 123-45-6789 → [REDACTED]

Bank Account Numbers

  • Pattern: 8-16 consecutive digits
  • Action: Masked to show only last 4 digits
  • Example: 1234567890 → XXXXXX7890

Output

  • Redacted PDFs: Saved in output folder preserving original structure
  • Non-PDF Files: Copied unchanged to output folder
  • Processing Log: processing_log.txt containing:
    • Start/end times
    • Files processed successfully
    • Files with errors or warnings

Troubleshooting

"Tesseract is not installed" Error

Ensure Tesseract-OCR is installed and the path in pdf_processor.py is correct:

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

GUI Not Opening

Ensure tkinter is installed (usually included with Python):

# Windows
python -m pip install tk

# macOS/Linux
sudo apt-get install python3-tk  # or brew install python-tk@3.x

"ModuleNotFoundError" for PDF libraries

Reinstall requirements:

pip install --upgrade -r requirements.txt

Performance Notes

  • Scanned PDFs: OCR processing is slower (5-15 seconds per page depending on resolution)
  • Text-based PDFs: Fast processing (< 1 second per page)
  • Batch Processing: Processing time scales with number and size of PDFs
  • Recommend 4GB+ RAM for processing large batches

Security Considerations

  • Processed PDFs are saved with compression (deflate)
  • Original files are never modified; outputs are saved separately
  • Redactions are permanent and cannot be undone
  • Keep output folder in secure location

Limitations

  • Only detects SSNs in standard XXX-XX-XXXX format
  • Bank account detection is basic (8-16 digit sequences)
  • Watermarked or heavily encrypted PDFs may not process correctly
  • OCR accuracy depends on scanned document quality

System Requirements by OS

OS Python Tesseract
Windows 10/11 3.7+ 5.0+
macOS 10.14+ 3.7+ 5.0+
Ubuntu 18.04+ 3.7+ 4.0+

Support & Feedback

For issues or feature requests, review the processing log for detailed error messages. Check that:

  1. All Python dependencies are installed
  2. Tesseract OCR is properly installed
  3. Input PDFs are not corrupted
  4. Output folder has write permissions

License

[Specify your license here]

Version

1.0.0 - January 2026

About

Offline desktop PDF redaction tool with GUI and OCR. Automatically detects and redacts SSNs and masks bank account numbers while preserving folder structure.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors