DejaText

Identify and manage duplicate and near-duplicate text in files.

DejaText scans directories of .txt or .md files, identifies exact and near-duplicate text at the file, paragraph, and sentence level, and either reports its findings or annotates the files with markers for review.

How It Works

1. Scan - Collects .txt/.md files (case-insensitive extensions, recursive)
2. Parse - Separates YAML frontmatter from body content, splits the body into paragraphs and sentences
3. Detect - Finds exact duplicates (SHA256 fingerprinting) and near-duplicates (MinHash + LSH)
4. Act - Either generates reports or annotates files with {dup:...} markers
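The exact-duplicate half of the Detect step can be sketched roughly as follows. This is a minimal illustration rather than DejaText's actual code: the names `fingerprint` and `find_exact_duplicates` are hypothetical, and collapsing whitespace before hashing is an assumption about normalization.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of the text after collapsing whitespace (assumed normalization)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_exact_duplicates(paragraphs):
    """Group paragraph indices by fingerprint and keep groups with more than one member.

    Indices inside each group are in document order, so the first one is the
    earliest occurrence.
    """
    groups = defaultdict(list)
    for index, paragraph in enumerate(paragraphs):
        groups[fingerprint(paragraph)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


paragraphs = [
    "This is an exact duplicate paragraph.",
    "Something unrelated.",
    "This is an  exact duplicate\nparagraph.",
]
print(find_exact_duplicates(paragraphs))  # [[0, 2]]
```

Because equal text always produces an equal digest, grouping by fingerprint finds exact duplicates in a single pass without comparing every pair of paragraphs.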

Commands

`report` - Detect and report (read-only)

python dejatext.py report ./notes --threshold 0.7

Generates dejatext_report.md and summary_report.csv without modifying source files.

`cleanup` - Detect and annotate

python dejatext.py cleanup ./notes --output-folder ./notes_cleanup

Copies files to output directory and wraps duplicates with markers:

{dup:exact:100%:file1.md#p3}This is an exact duplicate paragraph.{/dup}
{dup:similar:85%:file1.md#s1}This is a near-duplicate sentence with changes.{/dup}

The original text is preserved inside the markers, so you can review each match before removing anything.

`strip` - Remove flagged content

# Remove all flagged content
python dejatext.py strip ./notes_cleanup

# Remove only exact duplicates
python dejatext.py strip ./notes_cleanup --only exact

# Remove only high-confidence matches (>= 90% similar)
python dejatext.py strip ./notes_cleanup --threshold 90

# Preview what would be removed
python dejatext.py strip ./notes_cleanup --dry-run
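The filtering that `--only` and `--threshold` describe can be sketched like this. It is a hedged illustration based only on the marker format shown in this README: `strip_markers` and its parameters are hypothetical names, and leaving non-matching markers in place is an assumption.

```python
import re

# Matches {dup:TYPE:SCORE%:REFERENCE}text{/dup}
MARKER = re.compile(r"\{dup:(exact|similar):(\d+)%:([^}]*)\}(.*?)\{/dup\}", re.DOTALL)


def strip_markers(text, only=None, threshold=0):
    """Remove marked spans that pass the filters; return (new_text, removed_count).

    Markers that do not pass the filters are left untouched (assumed behavior).
    """
    removed = 0

    def replace(match):
        nonlocal removed
        kind, score = match.group(1), int(match.group(2))
        if (only is None or kind == only) and score >= threshold:
            removed += 1
            return ""
        return match.group(0)

    return MARKER.sub(replace, text), removed


sample = (
    "Keep me.\n"
    "{dup:exact:100%:file1.md#p3}Exact copy.{/dup}\n"
    "{dup:similar:85%:file1.md#s1}Near copy.{/dup}\n"
)
print(strip_markers(sample)[1])                # 2
print(strip_markers(sample, only="exact")[1])  # 1
print(strip_markers(sample, threshold=90)[1])  # 1
```

A dry run corresponds to computing the result and reporting the count without writing the new text back to disk.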

`version` - Show version

python dejatext.py version

Workflow

Manual Workflow

1. python dejatext.py report ./notes      # Generate report (optional, read-only)
2. python dejatext.py cleanup ./notes --output-folder ./notes_cleanup  # Annotates a copy with markers
3. Review dejatext_report.md              # See what was found
4. python dejatext.py strip ./notes_cleanup  # Remove flagged content

Automated Workflow (Shell Script)

./dejatext_workflow.sh /path/to/folder

Runs the complete workflow automatically:

Generates report → {input}_report/
Creates annotated copy → {input}_cleaned/
Strips duplicates from cleaned copy

Works on both folders and individual files.

Options

Option	Description	Default
`--output-folder`	Output directory for reports/annotated files	`dejatext_report` (report) or `cleanup_output` (cleanup)
`--threshold`	Similarity threshold for near-duplicates (0.0-1.0)	0.7
`--check-files/--no-check-files`	Check for file-level duplicates	True
`--check-sentences/--no-check-sentences`	Check for sentence-level duplicates	True
`--check-paragraphs/--no-check-paragraphs`	Check for paragraph-level duplicates	True
`--min-sentence-words`	Minimum words for sentence matching	5
`--min-paragraph-words`	Minimum words for paragraph matching	10
`--verbose`	Show detailed progress	False

Strip-specific options

Option	Description	Default
`--only`	Only strip markers of this type: `exact` or `similar`	None (all)
`--threshold`	Only strip markers with similarity >= this percentage (0-100, unlike the 0.0-1.0 detection threshold)	0 (all)
`--dry-run`	Show what would be stripped without modifying files	False

Near-Duplicate Detection

DejaText uses MinHash + Locality-Sensitive Hashing to find text that's mostly the same but with minor edits:

Text is broken into overlapping word trigrams (shingles)
MinHash signatures approximate Jaccard similarity
LSH finds candidate pairs in sub-linear time
Candidates are verified with exact Jaccard similarity
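The four steps above can be sketched in plain Python. DejaText itself lists datasketch as its MinHash + LSH dependency, so this from-scratch version is only a stand-in to show the idea; the function names, the 64 hash functions, and the 16 bands are all assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Set of overlapping word n-grams (assumes the text has at least n words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(shingle_set, num_perm=64):
    """Signature holding the minimum hash value under each of num_perm seeded hashes."""
    return [
        min(
            int.from_bytes(hashlib.sha256(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_perm)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures, bands=16):
    """Index pairs whose signatures agree on at least one whole band."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(set)
    for index, sig in enumerate(signatures):
        for band in range(bands):
            buckets[(band, tuple(sig[band * rows:(band + 1) * rows]))].add(index)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


texts = [
    "the quick brown fox jumps over the lazy dog near the quiet river bank",
    "the quick brown fox jumps over the sleepy dog near the quiet river bank",
    "completely different words appear in this unrelated sentence about tax law",
]
sets = [shingles(t) for t in texts]
sigs = [minhash(s) for s in sets]

print(estimate_jaccard(sigs[0], sigs[0]))  # 1.0 (identical signatures)
print(estimate_jaccard(sigs[0], sigs[1]))  # estimate for the near-duplicate pair
for a, b in sorted(lsh_candidates(sigs)):
    # Verification step: exact Jaccard on the shingle sets of each candidate pair
    print(a, b, len(sets[a] & sets[b]) / len(sets[a] | sets[b]))
```

Bucketing by band is what avoids comparing every pair: only items that land in the same bucket are ever verified. More rows per band makes candidate selection stricter, and more bands makes it more permissive.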

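Example: changing one word in a 25-word paragraph alters at most three of its 23 word trigrams, which gives a Jaccard similarity of roughly 77% (edit mid-paragraph) to 92% (edit at the first or last word). The default 0.7 threshold therefore catches single-word edits in a paragraph of this length, and edits to several words in longer paragraphs.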
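How a one-word edit translates into a similarity score depends on where the edit falls, because a word near either end of the paragraph belongs to fewer trigrams. The following worked example computes exact Jaccard similarity on word trigrams for a 25-word paragraph of placeholder words; the helper names are hypothetical and the real tool may tokenize differently.

```python
def word_trigrams(text):
    """Set of overlapping three-word shingles."""
    words = text.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def replace_word(text, position, new_word="CHANGED"):
    """Return the text with the word at `position` swapped out."""
    words = text.split()
    words[position] = new_word
    return " ".join(words)


original = " ".join(f"w{i}" for i in range(25))  # 25 distinct placeholder words

for position in (0, 1, 12):
    edited = replace_word(original, position)
    score = jaccard(word_trigrams(original), word_trigrams(edited))
    print(position, round(score, 3))
# 0 0.917   (1 of 23 trigrams changed: 22 / 24)
# 1 0.84    (2 changed: 21 / 25)
# 12 0.769  (3 changed: 20 / 26)
```

All three cases stay above the default 0.7 threshold.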

Position-Based Marking

The cleanup command uses position-based reconstruction to accurately mark duplicates:

Duplicates are tracked by their paragraph/sentence index, not by text search
The body is reconstructed from parsed paragraph/sentence lists with markers inserted at specific positions
Guarantees keeper accuracy: the first occurrence (in natural sort order) always remains unmarked, regardless of text formatting variations
Handles near-duplicates with different formatting (e.g., *Ordinary Affects* vs Ordinary Affects[@citation] vs _Ordinary Affects_)
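A minimal sketch of position-based reconstruction, assuming paragraphs are joined by blank lines. The function name `annotate`, the shape of the `duplicates` mapping, and the sample score and reference are all hypothetical and chosen only to illustrate the idea.

```python
def annotate(paragraphs, duplicates):
    """Rebuild the body, wrapping the paragraphs whose index appears in `duplicates`.

    `duplicates` maps a paragraph index to (match_type, score, reference).
    """
    rebuilt = []
    for index, paragraph in enumerate(paragraphs):
        if index in duplicates:
            kind, score, reference = duplicates[index]
            paragraph = f"{{dup:{kind}:{score}%:{reference}}}{paragraph}{{/dup}}"
        rebuilt.append(paragraph)
    return "\n\n".join(rebuilt)


paragraphs = [
    "*Ordinary Affects* is a book.",
    "Another thought.",
    "_Ordinary Affects_ is a book.",
]
# Paragraph 2 was flagged as similar to the first paragraph; values are illustrative.
duplicates = {2: ("similar", 85, "file1.md#p1")}
print(annotate(paragraphs, duplicates))
```

Because the marker is attached to index 2 rather than located by searching for the text, the first paragraph stays unmarked even though the two differ only in emphasis syntax.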

YAML Frontmatter

DejaText automatically handles YAML frontmatter in markdown files:

Frontmatter is separated before duplicate detection (not compared)
Frontmatter is preserved when files are modified
Uses the python-frontmatter library for robust parsing
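To show what "separated and preserved" means in practice, here is a simplified standard-library stand-in. DejaText uses python-frontmatter for this, so the function below is not its implementation; `split_frontmatter` is a hypothetical helper that only handles the common `---` fenced block at the top of a file.

```python
def split_frontmatter(text):
    """Return (frontmatter_block, body).

    The block includes its opening and closing --- lines, or is "" when the
    text does not start with frontmatter.
    """
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "".join(lines[:i + 1]), "".join(lines[i + 1:])
    return "", text


doc = "---\ntitle: Notes\n---\nBody paragraph.\n"
front, body = split_frontmatter(doc)
print(repr(front))  # '---\ntitle: Notes\n---\n'
print(repr(body))   # 'Body paragraph.\n'

# Only `body` would be scanned for duplicates; writing front + body back
# restores the file with its frontmatter intact.
assert front + body == doc
```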

Installation

Requirements

Python 3.8+
macOS (only needed for the clipboard scripts)

Setup

git clone https://github.com/dtubb/DejaText.git
cd DejaText
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Make shell scripts executable (macOS/Linux)
chmod +x dejatext_workflow.sh
chmod +x dejatext_clipboard_report.sh
chmod +x dejatext_clipboard_delete.sh

Dependencies

typer - CLI framework
python-frontmatter - YAML frontmatter parsing
pysbd - Sentence boundary detection (handles abbreviations, titles, URLs)
datasketch - MinHash + LSH for near-duplicate detection

Shell Script Integration

Complete Workflow Script

./dejatext_workflow.sh /path/to/folder
./dejatext_workflow.sh /path/to/file.txt

Runs the full workflow (report → cleanup → strip):

Creates {input}_report/ - duplicate detection report
Creates {input}_cleaned/ - deduplicated files
Works on folders or individual files

Clipboard Processing (macOS)

Annotate only (review before removing):

./dejatext_clipboard_report.sh

Processes the clipboard contents, adds {dup:...} markers, and shows a diff in BBEdit for review.

Auto-strip (remove all duplicates):

./dejatext_clipboard_delete.sh

Processes the clipboard contents, strips ALL duplicates (exact + similar), and replaces the clipboard with the cleaned content. Runs silently.

File Support

File Type	Processing
`.txt`, `.TXT`	Full processing
`.md`, `.MD`	Full processing (YAML preserved)
Other files	Copied but not processed

Files are processed in natural sort order (file1, file2, file10).
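Natural sort order, combined with the case-insensitive recursive scan, can be sketched as below. The names `natural_key` and `collect_files` are hypothetical, and sorting on the full path rather than the bare file name is an assumption.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares digit runs as numbers, so file2 sorts before file10."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def collect_files(root):
    """Recursively collect .txt/.md files (any letter case) in natural sort order."""
    paths = [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in {".txt", ".md"}
    ]
    return sorted(paths, key=lambda p: natural_key(p.as_posix()))


names = ["file10.md", "file2.md", "file1.md"]
print(sorted(names))                   # ['file1.md', 'file10.md', 'file2.md']
print(sorted(names, key=natural_key))  # ['file1.md', 'file2.md', 'file10.md']
```

The order matters because the first occurrence of a duplicate is the one that stays unmarked.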

Testing

source .venv/bin/activate
pytest tests/test_dejatext.py -v

Test Coverage: 81 tests covering:

Core detection algorithms (exact + near-duplicate)
CLI commands (report, cleanup, strip, version)
File parsing and YAML handling
Position-based marking (including keeper identification with formatting variations)
Full workflow integration tests

Marker Format Reference

{dup:TYPE:SCORE:REFERENCE}text{/dup}

Field	Values	Meaning
TYPE	`exact`, `similar`	Match type
SCORE	`100%`, `85%`, etc.	Similarity percentage
REFERENCE	`file.md#p3`, `file.md#s1`	Where the kept version lives
