Implement the Token/Word-Level Diff Tracking from the Detecting Superseded Pending Changes investigation and build some kind of UI for it. The idea is also to have a database of annotated revisions so that it can be queried and used for further processing.
Links for getting word-level diffs
Wikiwho API
MediaWiki REST API for diffs
Core Requirements
1. Word-Level annotation
NOTE: If you are using Wikiwho (wikiwho-api.wmcloud.org), its metadata can be used. The metadata format doesn't need to match this literally.
- Annotate article revision wikitext at the word/token level with metadata:
- Stable word id which identifies the word between revisions
- Revision ID where the word was added
- User who added the word
- Word definition: Text string separated by whitespace
- Annotation rules:
- If text is moved: preserve the original author attribution and id
- If a word is modified: attribute it to the user who modified it, create a new id for the word, and store the previous word id as metadata
- If annotation starts from the middle of the revision history: attribute all text of the starting revision to that revision's editor
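The annotation rules above can be sketched as follows. This is a minimal illustration, not a required design: the names `Word`, `annotate_first` and `annotate_next` are hypothetical, the diff comes from Python's standard `difflib` rather than Wikiwho, and move detection is a simple heuristic (a word removed in one place and inserted elsewhere with identical text counts as moved).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Word:
    word_id: int                            # stable id across revisions
    text: str                               # whitespace-separated token
    added_in_revision: int
    added_by: str
    previous_word_id: Optional[int] = None  # set when this word replaced another


_ids = count(1)  # hypothetical id source; a database sequence would also work


def annotate_first(text, rev_id, user):
    """Starting rule: all words of the first annotated revision go to its editor."""
    return [Word(next(_ids), token, rev_id, user) for token in text.split()]


def annotate_next(prev, text, rev_id, user):
    """Carry the annotations in `prev` forward onto the next revision's text."""
    new_tokens = text.split()
    old_tokens = [w.text for w in prev]
    opcodes = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False).get_opcodes()

    # Words that left their old position are candidates for a move.
    removed = {}
    for tag, i1, i2, _j1, _j2 in opcodes:
        if tag in ("delete", "replace"):
            for word in prev[i1:i2]:
                removed.setdefault(word.text, []).append(word)

    result = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            result.extend(prev[i1:i2])  # unchanged: keep id and author
            continue
        if tag == "delete":
            continue
        old_span = prev[i1:i2] if tag == "replace" else []
        for offset, token in enumerate(new_tokens[j1:j2]):
            if removed.get(token):
                # Moved: same text reappears elsewhere, keep id and author.
                result.append(removed[token].pop(0))
            else:
                # Modified (paired by position with an old word) or newly added.
                old = old_span[offset] if offset < len(old_span) else None
                result.append(
                    Word(next(_ids), token, rev_id, user, old.word_id if old else None)
                )
    return result
```

For example, going from "the quick brown fox" to "brown the fast fox" keeps the original id and author for "brown" (moved), while "fast" gets a new id, the second editor as author, and the id of "quick" as its previous word id.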
2. Utility functions required
2.1 Annotate Article History
annotate_article(article_title, start_revision=first, end_revision=latest)
- Annotate article text from a specific revision forward to the latest version
- Initial assumption: All text in start_revision is attributed to that revision's author
- Build attribution data through subsequent revisions
- Store the annotated data in a database so that it can be fetched later
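As a rough sketch of the storage step, the snippet below keeps one row per word position per revision. The table name, column names and helper functions are assumptions made for illustration, and `sqlite3` is used only to keep the example standalone; in the actual project this would more likely be a Django model with the same fields.

```python
import sqlite3

# Hypothetical schema: one row per word position in each annotated revision.
SCHEMA = """
CREATE TABLE IF NOT EXISTS annotated_word (
    article_title     TEXT    NOT NULL,
    revision_id       INTEGER NOT NULL,
    position          INTEGER NOT NULL,
    word_id           INTEGER NOT NULL,
    text              TEXT    NOT NULL,
    added_in_revision INTEGER NOT NULL,
    added_by          TEXT    NOT NULL,
    previous_word_id  INTEGER,
    PRIMARY KEY (article_title, revision_id, position)
)
"""


def store_revision(conn, article_title, revision_id, words):
    """Store one annotated revision, replacing any earlier copy of it.

    `words` is an ordered iterable of tuples:
    (word_id, text, added_in_revision, added_by, previous_word_id).
    """
    with conn:  # commit on success, roll back on error
        conn.execute(
            "DELETE FROM annotated_word WHERE article_title = ? AND revision_id = ?",
            (article_title, revision_id),
        )
        conn.executemany(
            "INSERT INTO annotated_word VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            [(article_title, revision_id, pos, *w) for pos, w in enumerate(words)],
        )


def load_revision(conn, article_title, revision_id):
    """Return the stored words of a revision in document order."""
    cursor = conn.execute(
        "SELECT word_id, text, added_in_revision, added_by, previous_word_id "
        "FROM annotated_word WHERE article_title = ? AND revision_id = ? "
        "ORDER BY position",
        (article_title, revision_id),
    )
    return cursor.fetchall()
```

Storing the position alongside the stable word id lets a revision be reconstructed in order, while queries such as "all words added by user X" only need the attribution columns.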
2.2 Get annotated revision
get_annotated_revision(article_title, target_revision, min_word_length=0)
- Return words from target_revision that:
- Were added in that revision
- Were deleted in that revision
- Were moved in that revision
- Still exist in the latest version
- Support filtering by minimum word length
2.3 Get persistent content from specific Revision
get_persistent_content(article_title, target_revision, min_word_length=0)
- Return words from target_revision that:
- Were added in that revision
- Were deleted in that revision
- Were moved in that revision
- Still exist in the latest version
- Support filtering by minimum word length
2.4 Get changed content from specific revision
get_revision_content(article_title, target_revision, min_word_length=0)
- Return words from target_revision that were:
- Added in that revision
- Deleted in that revision
- Moved in that revision
- Support filtering by minimum word length
- No requirement that content still exists in latest version
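The two lookups in 2.3 and 2.4 differ only in the final "still exists in the latest version" filter, which the sketch below tries to make explicit. It works on in-memory lists of `(word_id, text)` pairs instead of the database, and the function names and the move heuristic (ids present in both revisions but outside the diff's matching blocks) are assumptions for illustration.

```python
from difflib import SequenceMatcher


def classify_changes(parent, target):
    """Compare two annotated revisions given as ordered (word_id, text) lists."""
    parent_ids = [wid for wid, _ in parent]
    target_ids = [wid for wid, _ in target]
    in_parent, in_target = set(parent_ids), set(target_ids)

    added = [w for w in target if w[0] not in in_parent]
    deleted = [w for w in parent if w[0] not in in_target]

    # Ids inside the matching blocks kept their relative position.
    matcher = SequenceMatcher(a=parent_ids, b=target_ids, autojunk=False)
    stayed = set()
    for block in matcher.get_matching_blocks():
        stayed.update(target_ids[block.b:block.b + block.size])
    moved = [w for w in target if w[0] in in_parent and w[0] not in stayed]

    return {"added": added, "deleted": deleted, "moved": moved}


def revision_content(parent, target, min_word_length=0):
    """Changed words of `target`, with no requirement that they survive."""
    changes = classify_changes(parent, target)
    return {
        kind: [w for w in words if len(w[1]) >= min_word_length]
        for kind, words in changes.items()
    }


def persistent_content(parent, target, latest, min_word_length=0):
    """Changed words of `target` whose ids are still present in `latest`."""
    latest_ids = {wid for wid, _ in latest}
    changes = revision_content(parent, target, min_word_length)
    return {
        kind: [w for w in words if w[0] in latest_ids]
        for kind, words in changes.items()
    }
```

Because word ids are stable across revisions, the persistence check reduces to a set membership test against the latest revision.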
3. Management command interface
Make Django management commands for
- annotating article history
- getting annotated revisions
- getting changed content from specific revision
- getting changed content from specific revision which still exists in latest version
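One possible argument layout for these commands is sketched below. Django passes an argparse-compatible parser to a command's `add_arguments(self, parser)` method, so the helpers are written against plain `argparse` to stay runnable without a Django project. The helper names and option names are hypothetical.

```python
import argparse


def add_annotate_arguments(parser):
    """Arguments for a hypothetical command that annotates article history."""
    parser.add_argument("article_title")
    parser.add_argument("--start-revision", type=int, default=None,
                        help="revision id to start from (default: first revision)")
    parser.add_argument("--end-revision", type=int, default=None,
                        help="revision id to stop at (default: latest revision)")


def add_query_arguments(parser):
    """Arguments shared by the three hypothetical read-only commands."""
    parser.add_argument("article_title")
    parser.add_argument("target_revision", type=int)
    parser.add_argument("--min-word-length", type=int, default=0)


if __name__ == "__main__":
    # Standalone demonstration; inside Django the parser is supplied by the framework.
    demo = argparse.ArgumentParser(description="query an annotated revision")
    add_query_arguments(demo)
    print(demo.parse_args())
```

Inside a Django command class, `add_arguments` would simply call the matching helper, and `handle` would read the same option names from its keyword arguments.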
4. Web user interface
Make a separate page for viewing annotated data.
For the web UI, some older (now defunct) tools could be used as a target example of the idea. However, we would only colorize the wikicode and skip coloring the rendered version, because that is complex.
4.1 Article and Revision Selection
- Allow user to select:
- Wikipedia article
- Specific revision to display
4.2 Revision Metadata Display
- Show:
- List of users who have edited the article
- Annotated wikitext of the selected revision
4.3 Annotated text visualization
Display wikitext with words color-coded based on user-selected filter:
Filter Options:
- Words from specific revision: Highlight words originating from a chosen revision
- Words added after date: Highlight words added after a specific date
- Words by specific user: Highlight words added by a user selected from the list of users who have edited the article
- Words by automatically reviewed users: Highlight words added by users who are automatically reviewed (bots, automatically reviewed users)
Color-coding display
- Words should be colored/highlighted according to the active filter
- Clear visual distinction between matching and non-matching words
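The filters and the color-coding can be modeled as predicates over annotated words plus one rendering function, as in the sketch below. The dictionary keys, the CSS class names and the `added_at` timestamp are assumptions (the date would come from the revision metadata), and joining words with single spaces is a simplification that drops the original whitespace and line breaks of the wikitext.

```python
import html
from datetime import datetime


# Each filter returns a predicate that decides whether a word is highlighted.
def from_revision(revision_id):
    return lambda w: w["added_in_revision"] == revision_id


def added_after(cutoff):
    return lambda w: w["added_at"] > cutoff


def by_user(username):
    return lambda w: w["added_by"] == username


def by_auto_reviewed(auto_reviewed_users):
    return lambda w: w["added_by"] in auto_reviewed_users


def render(words, matches):
    """Wrap every word in a span whose class shows if the active filter matched."""
    parts = []
    for w in words:
        css = "word-match" if matches(w) else "word-other"
        parts.append(
            '<span class="{}" title="{}">{}</span>'.format(
                css, html.escape(w["added_by"]), html.escape(w["text"])
            )
        )
    return " ".join(parts)


if __name__ == "__main__":
    sample = [
        {"text": "[[Foo]]", "added_by": "Alice", "added_in_revision": 1,
         "added_at": datetime(2024, 1, 1)},
        {"text": "<b>", "added_by": "Bob", "added_in_revision": 2,
         "added_at": datetime(2024, 6, 1)},
    ]
    print(render(sample, by_user("Bob")))
```

Escaping the word text matters here because wikitext may contain raw HTML tags that must be shown as source rather than interpreted by the browser.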