DejaText

Identify and manage duplicate and near-duplicate text in files.

DejaText scans directories of .txt or .md files, identifies duplicated and near-duplicate text at the file, paragraph, and sentence level, and either reports findings or annotates files with markers for review.

How It Works

  1. Scan - Collects .txt/.md files (case-insensitive extensions, recursive)
  2. Parse - Separates YAML frontmatter from body content, splits into paragraphs and sentences
  3. Detect - Finds exact duplicates (SHA256 fingerprinting) and near-duplicates (MinHash + LSH)
  4. Act - Either generates reports or annotates files with {dup:...} markers
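
The exact-match half of the Detect step can be illustrated with a short sketch. This is not DejaText's own code: `fingerprint` and `find_exact_duplicates` are hypothetical helpers, the whitespace normalization is an assumption, and the `file#pN` location strings only imitate the reference format used in the markers.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of whitespace-normalized text."""
    # Assumption: collapse runs of whitespace so line wrapping does not matter.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_exact_duplicates(units):
    """Group (location, text) pairs by fingerprint; keep groups with 2+ members."""
    groups = defaultdict(list)
    for location, text in units:
        groups[fingerprint(text)].append(location)
    return [locations for locations in groups.values() if len(locations) > 1]


# Hypothetical input: two paragraphs differ only in whitespace.
paragraphs = [
    ("file1.md#p1", "The quick brown fox."),
    ("file1.md#p2", "Something else entirely."),
    ("file2.md#p1", "The quick   brown\nfox."),
]
print(find_exact_duplicates(paragraphs))
```

Hashing makes the comparison a dictionary lookup per unit rather than a pairwise text comparison, which is why exact matching stays cheap even on large folders.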

Commands

report - Detect and report (read-only)

python dejatext.py report ./notes --threshold 0.7

Generates dejatext_report.md and summary_report.csv without modifying source files.

cleanup - Detect and annotate

python dejatext.py cleanup ./notes --output-folder ./notes_cleanup

Copies files to the output directory and wraps duplicates in markers:

{dup:exact:100%:file1.md#p3}This is an exact duplicate paragraph.{/dup}
{dup:similar:85%:file1.md#s1}This is a near-duplicate sentence with changes.{/dup}

The text is preserved - you can review before removing.

strip - Remove flagged content

# Remove all flagged content
python dejatext.py strip ./notes_cleanup

# Remove only exact duplicates
python dejatext.py strip ./notes_cleanup --only exact

# Remove only high-confidence matches (>= 90% similar)
python dejatext.py strip ./notes_cleanup --threshold 90

# Preview what would be removed
python dejatext.py strip ./notes_cleanup --dry-run
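
How the strip filters combine can be sketched with a regular expression over the marker syntax shown earlier. This is a minimal illustration, not the tool's implementation: `MARKER` and `strip_markers` are hypothetical names, and leaving non-matching markers in place is an assumption about the intended behaviour.

```python
import re

# Matches {dup:TYPE:SCORE%:REFERENCE}text{/dup}, including multi-line text.
MARKER = re.compile(
    r"\{dup:(exact|similar):(\d+)%:([^}]+)\}(.*?)\{/dup\}", re.DOTALL
)


def strip_markers(text, only=None, threshold=0):
    """Remove flagged spans that pass both filters; return (new_text, removed)."""
    removed = 0

    def decide(match):
        nonlocal removed
        kind, score = match.group(1), int(match.group(2))
        if (only is None or kind == only) and score >= threshold:
            removed += 1
            return ""              # drop the marker and the text it wraps
        return match.group(0)      # assumed: other markers stay for later review

    return MARKER.sub(decide, text), removed


sample = (
    "Keep me.\n\n"
    "{dup:exact:100%:file1.md#p3}Exact copy.{/dup}\n\n"
    "{dup:similar:85%:file1.md#s1}Near copy.{/dup}\n"
)
cleaned, count = strip_markers(sample, only="exact")
print(count)
print(cleaned)
```

A dry run in this sketch amounts to calling `strip_markers` and reporting the count without writing the returned text back to disk.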

version - Show version

python dejatext.py version

Workflow

Manual Workflow

1. python dejatext.py report ./notes      # Generate report (optional, read-only)
2. python dejatext.py cleanup ./notes --output-folder ./notes_cleanup  # Annotate a copy with markers
3. Review dejatext_report.md              # See what was found
4. python dejatext.py strip ./notes_cleanup  # Remove flagged content

Automated Workflow (Shell Script)

./dejatext_workflow.sh /path/to/folder

Runs the complete workflow automatically:

  1. Generates report → {input}_report/
  2. Creates annotated copy → {input}_cleaned/
  3. Strips duplicates from cleaned copy

Works on both folders and individual files.

Options

| Option | Description | Default |
| --- | --- | --- |
| --output-folder | Output directory for reports/annotated files | dejatext_report or cleanup_output |
| --threshold | Similarity threshold for near-duplicates (0.0-1.0) | 0.7 |
| --check-files/--no-check-files | Check for file-level duplicates | True |
| --check-sentences/--no-check-sentences | Check for sentence-level duplicates | True |
| --check-paragraphs/--no-check-paragraphs | Check for paragraph-level duplicates | True |
| --min-sentence-words | Minimum words for sentence matching | 5 |
| --min-paragraph-words | Minimum words for paragraph matching | 10 |
| --verbose | Show detailed progress | False |

Strip-specific options

| Option | Description | Default |
| --- | --- | --- |
| --only | Only strip markers of this type: exact or similar | None (all) |
| --threshold | Only strip markers with similarity >= this percentage | 0 (all) |
| --dry-run | Show what would be stripped without modifying files | False |

Near-Duplicate Detection

DejaText uses MinHash + Locality-Sensitive Hashing to find text that's mostly the same but with minor edits:

  • Text is broken into overlapping word trigrams (shingles)
  • MinHash signatures approximate Jaccard similarity
  • LSH finds candidate pairs in sub-linear time
  • Candidates are verified with exact Jaccard similarity

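Example: with word trigrams, changing one word in the middle of a 25-word paragraph alters 3 of its 23 shingles, so the Jaccard similarity is 20/26 ≈ 77%; changing the second word alters only 2 shingles and gives 21/25 = 84%. The default 70% threshold catches a one-word edit at this length, and several changed words in longer paragraphs.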
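
The shingling, exact Jaccard, and MinHash steps can be sketched in plain Python to check figures like these by hand. This is an illustrative stand-in, not DejaText's code: the tool relies on datasketch, whereas `shingles`, `jaccard`, `minhash_signature`, and `estimate_jaccard` are hypothetical helpers, and lowercasing with whitespace tokenization is an assumption. The LSH bucketing step is left out.

```python
import hashlib


def shingles(text, k=3):
    """Overlapping word k-grams as a set (assumed: lowercase, whitespace split)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a or b else 1.0


def minhash_signature(items, num_perm=128):
    """Keep the minimum hash of a non-empty set under each salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash
        signature.append(min(
            hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# A synthetic 25-word paragraph with one middle word replaced.
words = [f"w{i}" for i in range(25)]
edited_words = words.copy()
edited_words[12] = "changed"

a = shingles(" ".join(words))
b = shingles(" ".join(edited_words))
exact = jaccard(a, b)  # 20 shared trigrams out of 26 distinct ones
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

Where the edit falls matters: a substitution touches every trigram that contains the changed word, so a word in the middle affects three shingles while the first or last word affects only one.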

Position-Based Marking

The cleanup command uses position-based reconstruction to accurately mark duplicates:

  • Duplicates are tracked by their paragraph/sentence index, not by text search
  • The body is reconstructed from parsed paragraph/sentence lists with markers inserted at specific positions
  • Keeper accuracy: the first occurrence (in natural sort order) always stays unmarked, regardless of text formatting variations
  • Handles near-duplicates with different formatting (e.g., *Ordinary Affects* vs Ordinary Affects[@citation] vs _Ordinary Affects_)
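
The idea behind position-based reconstruction can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual cleanup code: `annotate` is a hypothetical helper, paragraphs are assumed to be joined by blank lines, and the reference string is supplied by hand.

```python
def annotate(paragraphs, duplicates):
    """Rebuild a body from its paragraph list, wrapping flagged indices.

    duplicates maps a paragraph index to (kind, score, reference).
    """
    rebuilt = []
    for index, text in enumerate(paragraphs):
        if index in duplicates:
            kind, score, ref = duplicates[index]
            rebuilt.append(f"{{dup:{kind}:{score}%:{ref}}}{text}{{/dup}}")
        else:
            rebuilt.append(text)  # keepers and unique paragraphs pass through
    return "\n\n".join(rebuilt)


paragraphs = ["Intro paragraph.", "Repeated paragraph.", "Repeated paragraph."]
# Only index 2 is flagged; index 1 is the keeper even though the text is identical.
flags = {2: ("exact", 100, "notes.md#p1")}
print(annotate(paragraphs, flags))
```

A text search for "Repeated paragraph." would hit the keeper at index 1 first; working from indices avoids that ambiguity.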

YAML Frontmatter

DejaText automatically handles YAML frontmatter in markdown files:

  • Frontmatter is separated before duplicate detection (not compared)
  • Frontmatter is preserved when files are modified
  • Uses the python-frontmatter library for robust parsing
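
The separate-then-restore behaviour can be illustrated with a standard-library sketch. DejaText itself uses python-frontmatter; the hypothetical `split_frontmatter` below only locates the `---` fences and does not parse the YAML, so treat it as a rough stand-in.

```python
def split_frontmatter(text):
    """Return (frontmatter_block, body); the block includes both --- fences."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text                      # no frontmatter at all
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[: i + 1]), "".join(lines[i + 1:])
    return "", text                          # unclosed fence: treat as body


doc = "---\ntitle: Notes\n---\nBody text.\n"
front, body = split_frontmatter(doc)
# Only `body` would go through duplicate detection; `front` is written back unchanged.
print(repr(front))
print(repr(body))
```

Because the two parts concatenate back to the original text, the frontmatter survives any edits made to the body.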

Installation

Requirements

  • Python 3.8+
  • macOS (only for the clipboard scripts; the report script shows its diff in BBEdit)

Setup

git clone https://github.com/dtubb/DejaText.git
cd DejaText
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Make shell scripts executable (macOS/Linux)
chmod +x dejatext_workflow.sh
chmod +x dejatext_clipboard_report.sh
chmod +x dejatext_clipboard_delete.sh

Dependencies

  • typer - CLI framework
  • python-frontmatter - YAML frontmatter parsing
  • pysbd - Sentence boundary detection (handles abbreviations, titles, URLs)
  • datasketch - MinHash + LSH for near-duplicate detection

Shell Script Integration

Complete Workflow Script

./dejatext_workflow.sh /path/to/folder
./dejatext_workflow.sh /path/to/file.txt

Runs the full workflow (report → cleanup → strip):

  • Creates {input}_report/ - duplicate detection report
  • Creates {input}_cleaned/ - deduplicated files
  • Works on folders or individual files

Clipboard Processing (macOS)

Annotate only (review before removing):

./dejatext_clipboard_report.sh

Processes clipboard, adds {dup:...} markers, shows diff in BBEdit for review.

Auto-strip (remove all duplicates):

./dejatext_clipboard_delete.sh

Processes clipboard, strips ALL duplicates (exact + similar), replaces clipboard with cleaned content. Runs silently.


File Support

| File Type | Processing |
| --- | --- |
| .txt, .TXT | Full processing |
| .md, .MD | Full processing (YAML preserved) |
| Other files | Copied but not processed |

Files are processed in natural sort order (file1, file2, file10).
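
File collection and natural ordering can be sketched in a few lines. The names `natural_key`, `collect_files`, and `TEXT_SUFFIXES` are hypothetical, and sorting on the path relative to the scanned folder is an assumption about how the tool orders nested files.

```python
import re
from pathlib import Path

TEXT_SUFFIXES = {".txt", ".md"}


def natural_key(name):
    """Split digit runs from text so 'file10' sorts after 'file2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def collect_files(root):
    """Recursively gather .txt/.md files (any letter case) in natural order."""
    root = Path(root)
    files = [p for p in root.rglob("*")
             if p.is_file() and p.suffix.lower() in TEXT_SUFFIXES]
    return sorted(files, key=lambda p: natural_key(str(p.relative_to(root))))


print(sorted(["file10.md", "file2.md", "file1.md"], key=natural_key))
```

The order matters beyond display: the first occurrence in this order is the one that stays unmarked.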

Testing

source .venv/bin/activate
pytest tests/test_dejatext.py -v

Test Coverage: 81 tests covering:

  • Core detection algorithms (exact + near-duplicate)
  • CLI commands (report, cleanup, strip, version)
  • File parsing and YAML handling
  • Position-based marking (including keeper identification with formatting variations)
  • Full workflow integration tests

Marker Format Reference

{dup:TYPE:SCORE:REFERENCE}text{/dup}

| Field | Values | Meaning |
| --- | --- | --- |
| TYPE | exact, similar | Match type |
| SCORE | 100%, 85%, etc. | Similarity percentage |
| REFERENCE | file.md#p3, file.md#s1 | Where the kept version lives |
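
For scripts that post-process annotated files, the marker fields can be pulled apart with a regular expression. This is a sketch under assumptions: `Marker` and `parse_markers` are hypothetical, every reference is assumed to follow the `file#pN` or `file#sN` shape seen in the examples, and reading `p` as paragraph and `s` as sentence is an inference from those examples.

```python
import re
from typing import NamedTuple

MARKER = re.compile(
    r"\{dup:(exact|similar):(\d+)%:([^}#]+)#([ps])(\d+)\}(.*?)\{/dup\}",
    re.DOTALL,
)


class Marker(NamedTuple):
    kind: str    # "exact" or "similar"
    score: int   # similarity percentage
    file: str    # file holding the kept version
    level: str   # "p" or "s" (assumed: paragraph or sentence)
    index: int   # position of the kept unit within that file
    text: str    # the flagged text itself


def parse_markers(text):
    """Return one Marker per {dup:...}...{/dup} span found in text."""
    return [
        Marker(m[1], int(m[2]), m[3], m[4], int(m[5]), m[6])
        for m in MARKER.finditer(text)
    ]


annotated = (
    "{dup:exact:100%:file1.md#p3}This is an exact duplicate paragraph.{/dup}\n"
    "{dup:similar:85%:file1.md#s1}This is a near-duplicate sentence with changes.{/dup}\n"
)
for marker in parse_markers(annotated):
    print(marker.kind, marker.score, marker.file, marker.level, marker.index)
```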

About

DejaText is a Python-based command-line tool to scan directories of .txt or .md files, identify duplicated and optionally similar text segments (sentences, paragraphs, phrases, words) across or within files, and produce organized reports for easy review.
