Identify and manage duplicate and near-duplicate text in files.
DejaText scans directories of .txt or .md files, identifies duplicated and near-duplicate text at the file, paragraph, and sentence level, and either reports findings or annotates files with markers for review.
- Scan - Collects `.txt`/`.md` files (case-insensitive extensions, recursive)
- Parse - Separates YAML frontmatter from body content, splits into paragraphs and sentences
- Detect - Finds exact duplicates (SHA-256 fingerprinting) and near-duplicates (MinHash + LSH)
- Act - Either generates reports or annotates files with `{dup:...}` markers
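The exact-duplicate step can be pictured as grouping paragraphs by a hash of their normalized text. A minimal sketch of the idea — the lowercasing and whitespace-collapsing shown here are illustrative assumptions, not necessarily DejaText's actual normalization:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Key for exact-duplicate grouping: SHA-256 of normalized text.

    Normalization (lowercase, collapsed whitespace) is an assumption
    made for this sketch.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two paragraphs that differ only in spacing or case then map to the same key, so they land in the same duplicate group.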
```bash
python dejatext.py report ./notes --threshold 0.7
```

Generates `dejatext_report.md` and `summary_report.csv` without modifying source files.
```bash
python dejatext.py cleanup ./notes --output-folder ./notes_cleanup
```

Copies files to the output directory and wraps duplicates with markers:

```
{dup:exact:100%:file1.md#p3}This is an exact duplicate paragraph.{/dup}
{dup:similar:85%:file1.md#s1}This is a near-duplicate sentence with changes.{/dup}
```

The text is preserved, so you can review before removing.
```bash
# Remove all flagged content
python dejatext.py strip ./notes_cleanup

# Remove only exact duplicates
python dejatext.py strip ./notes_cleanup --only exact

# Remove only high-confidence matches (>= 90% similar)
python dejatext.py strip ./notes_cleanup --threshold 90

# Preview what would be removed
python dejatext.py strip ./notes_cleanup --dry-run
```

```bash
python dejatext.py version
```

Recommended workflow:

1. `python dejatext.py report ./notes` - Generate report (optional, read-only)
2. `python dejatext.py cleanup ./notes` - Annotates files with markers
3. Review `dejatext_report.md` - See what was found
4. `python dejatext.py strip ./notes_cleanup` - Remove flagged content

```bash
./dejatext_workflow.sh /path/to/folder
```

Runs the complete workflow automatically:

- Generates report → `{input}_report/`
- Creates annotated copy → `{input}_cleaned/`
- Strips duplicates from the cleaned copy
Works on both folders and individual files.
| Option | Description | Default |
|---|---|---|
| `--output-folder` | Output directory for reports/annotated files | `dejatext_report` or `cleanup_output` |
| `--threshold` | Similarity threshold for near-duplicates (0.0-1.0) | `0.7` |
| `--check-files`/`--no-check-files` | Check for file-level duplicates | `True` |
| `--check-sentences`/`--no-check-sentences` | Check for sentence-level duplicates | `True` |
| `--check-paragraphs`/`--no-check-paragraphs` | Check for paragraph-level duplicates | `True` |
| `--min-sentence-words` | Minimum words for sentence matching | `5` |
| `--min-paragraph-words` | Minimum words for paragraph matching | `10` |
| `--verbose` | Show detailed progress | `False` |
| Option | Description | Default |
|---|---|---|
| `--only` | Only strip markers of this type: `exact` or `similar` | None (all) |
| `--threshold` | Only strip markers with similarity >= this percentage | `0` (all) |
| `--dry-run` | Show what would be stripped without modifying files | `False` |
DejaText uses MinHash + Locality-Sensitive Hashing to find text that's mostly the same but with minor edits:
- Text is broken into overlapping word trigrams (shingles)
- MinHash signatures approximate Jaccard similarity
- LSH finds candidate pairs in sub-linear time
- Candidates are verified with exact Jaccard similarity
Example: Changing one word in a 25-word paragraph gives ~85% similarity. The default 70% threshold catches paragraphs with several words changed.
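The shingle-and-signature pipeline above can be sketched in pure Python. DejaText itself uses the `datasketch` library; the seeded-hash construction below is only to show the idea, not the tool's actual code:

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Overlapping word n-grams (trigrams by default)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set: set, num_perm: int = 64) -> list:
    """One minimum per seeded hash function; the fraction of matching
    slots between two signatures approximates Jaccard similarity."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a: set, b: set) -> float:
    """Verification step: true overlap of the shingle sets."""
    return len(a & b) / len(a | b)
```

In the real pipeline, LSH bands the signatures so only colliding pairs are compared at all, and those candidates are then confirmed with the exact Jaccard score.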
The cleanup command uses position-based reconstruction to accurately mark duplicates:
- Duplicates are tracked by their paragraph/sentence index, not by text search
- The body is reconstructed from parsed paragraph/sentence lists with markers inserted at specific positions
- Guarantees keeper accuracy: First occurrence (in natural sort order) remains unmarked regardless of text formatting variations
- Handles near-duplicates with different formatting (e.g., `*Ordinary Affects*` vs `Ordinary Affects[@citation]` vs `_Ordinary Affects_`)
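A minimal sketch of position-based marking — the function name and shape are hypothetical, though the marker layout follows the format DejaText emits:

```python
def annotate(paragraphs, flagged):
    """Rebuild the body from the parsed paragraph list, wrapping flagged indices.

    `flagged` maps paragraph index -> (type, score, reference). The keeper
    (first occurrence) is simply never present in `flagged`, so it stays
    unmarked no matter how its text is formatted.
    """
    out = []
    for i, text in enumerate(paragraphs):
        if i in flagged:
            kind, score, ref = flagged[i]
            out.append(f"{{dup:{kind}:{score}%:{ref}}}{text}{{/dup}}")
        else:
            out.append(text)
    return "\n\n".join(out)
```

Because markers are inserted by index rather than by searching for the duplicate's text, formatting differences between the keeper and its near-duplicates cannot cause the wrong span to be wrapped.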
DejaText automatically handles YAML frontmatter in markdown files:
- Frontmatter is separated before duplicate detection (not compared)
- Frontmatter is preserved when files are modified
- Uses the `python-frontmatter` library for robust parsing
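The separation step amounts to peeling off the leading `---` block before any comparison happens. A stdlib-only sketch of that idea (the real parsing is done by `python-frontmatter`, which handles more edge cases):

```python
def split_frontmatter(text: str):
    """Return (frontmatter, body); frontmatter is '' when there is none.

    Simplified sketch: assumes frontmatter is delimited by '---' lines
    at the very top of the file.
    """
    if text.startswith("---\n"):
        end = text.find("\n---\n", 4)
        if end != -1:
            return text[4:end], text[end + 5:]
    return "", text
```

Only the body half feeds the duplicate detector; the frontmatter half is reattached unchanged when files are written back.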
- Python 3.8+
- macOS (for clipboard scripts)
```bash
git clone https://github.com/dtubb/DejaText.git
cd DejaText
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Make shell scripts executable (macOS/Linux)
chmod +x dejatext_workflow.sh
chmod +x dejatext_clipboard_report.sh
chmod +x dejatext_clipboard_delete.sh
```

Dependencies:

- `typer` - CLI framework
- `python-frontmatter` - YAML frontmatter parsing
- `pysbd` - Sentence boundary detection (handles abbreviations, titles, URLs)
- `datasketch` - MinHash + LSH for near-duplicate detection
```bash
./dejatext_workflow.sh /path/to/folder
./dejatext_workflow.sh /path/to/file.txt
```

Runs the full workflow (report → cleanup → strip):

- Creates `{input}_report/` - duplicate detection report
- Creates `{input}_cleaned/` - deduplicated files
- Works on folders or individual files
Annotate only (review before removing):

```bash
./dejatext_clipboard_report.sh
```

Processes the clipboard, adds `{dup:...}` markers, and shows a diff in BBEdit for review.

Auto-strip (remove all duplicates):

```bash
./dejatext_clipboard_delete.sh
```

Processes the clipboard, strips ALL duplicates (exact + similar), and replaces the clipboard with the cleaned content. Runs silently.
| File Type | Processing |
|---|---|
| `.txt`, `.TXT` | Full processing |
| `.md`, `.MD` | Full processing (YAML preserved) |
| Other files | Copied but not processed |
Files are processed in natural sort order (file1, file2, file10).
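Natural sort order can be achieved with a small key function; a sketch of the idea, not DejaText's internal implementation:

```python
import re

def natural_key(name: str):
    """Split digit runs into ints so 'file10' sorts after 'file2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

print(sorted(["file10.md", "file2.md", "file1.md"], key=natural_key))
# ['file1.md', 'file2.md', 'file10.md']
```

This matters for keeper selection: since the first occurrence in sort order is the one left unmarked, a plain lexicographic sort would pick `file10` over `file2`.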
```bash
source .venv/bin/activate
pytest tests/test_dejatext.py -v
```

Test coverage: 81 tests covering:
- Core detection algorithms (exact + near-duplicate)
- CLI commands (report, cleanup, strip, version)
- File parsing and YAML handling
- Position-based marking (including keeper identification with formatting variations)
- Full workflow integration tests
```
{dup:TYPE:SCORE:REFERENCE}text{/dup}
```
| Field | Values | Meaning |
|---|---|---|
| TYPE | `exact`, `similar` | Match type |
| SCORE | `100%`, `85%`, etc. | Similarity percentage |
| REFERENCE | `file.md#p3`, `file.md#s1` | Where the kept version lives |
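Given this marker grammar, stripping can be done in a single regex pass. A hedged sketch of how the `strip` filters might behave — the command's actual internals may differ:

```python
import re

# One marker: {dup:TYPE:SCORE%:REFERENCE}text{/dup}
MARKER = re.compile(r"\{dup:(exact|similar):(\d+)%:[^}]*\}(.*?)\{/dup\}", re.DOTALL)

def strip_markers(text, only=None, threshold=0):
    """Remove spans matching the filters; unwrap (keep text) otherwise."""
    def repl(match):
        kind, score, inner = match.group(1), int(match.group(2)), match.group(3)
        if (only is None or kind == only) and score >= threshold:
            return ""      # drop the flagged duplicate entirely
        return inner       # keep the text, drop only the markers
    return MARKER.sub(repl, text)
```

Note the asymmetry: spans that fail a filter are unwrapped rather than left marked, so the output never contains leftover `{dup:...}` syntax.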