Detecting Superseded Pending Changes (Token/Word-Level Diff Tracking) #114

@zache-fi

Description

Implement the token/word-level diff tracking from the Detecting Superseded Pending Changes investigation and build some kind of UI for it. The idea is also to have an annotated revision database that can be queried and used for further processing.

Links for getting word-level diffs:

  • WikiWho API
  • MediaWiki REST API for diffs

Core Requirements

1. Word-Level annotation

NOTE: If you are using WikiWho (wikiwho-api.wmcloud.org), its metadata can be used. The metadata format doesn't need to match this one literally.

  • Annotate article revision wikitext at the word/token level with metadata:
    • Stable word id which identifies the word between revisions
    • Revision ID where the word was added
    • User who added the word
  • Word definition: Text string separated by whitespace
  • Annotation rules:
    • If text is moved: preserve the original author attribution and id
    • If a word is modified: attribute it to the user who modified it, create a new id for the word, and store the previous word id as metadata
  • If annotation starts in the middle of the revision history, attribute all text of the starting revision to that revision's editor
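
The per-word metadata listed above could be modelled roughly as follows. This is only a sketch: the class name `AnnotatedWord`, its field names, and the helper `annotate_starting_revision` are hypothetical and not part of WikiWho or any existing code. It tokenizes on whitespace, as the word definition above says, and attributes every word of the starting revision to that revision's editor.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class AnnotatedWord:
    """One whitespace-separated token plus its attribution metadata (hypothetical schema)."""
    word_id: int                            # stable id that follows the word between revisions
    text: str                               # the token itself
    revision_id: int                        # revision where the word was added
    user: str                               # user who added the word
    previous_word_id: Optional[int] = None  # id of the word this one replaced, if any


def annotate_starting_revision(
    wikitext: str,
    revision_id: int,
    user: str,
    id_counter: Optional[Iterator[int]] = None,
) -> List[AnnotatedWord]:
    """Attribute every token of the starting revision to that revision's editor."""
    ids = id_counter if id_counter is not None else count(1)
    # str.split() with no arguments splits on any run of whitespace.
    return [AnnotatedWord(next(ids), token, revision_id, user) for token in wikitext.split()]
```

Passing a shared `id_counter` would let later revisions continue the same id sequence, so ids stay unique across the whole article history.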

2. Utility functions required

2.1 Annotate Article History

annotate_article(article_title, start_revision=first, end_revision=latest)
  • Annotate article text from a specific revision forward to the latest version
  • Initial assumption: All text in start_revision is attributed to that revision's author
  • Build attribution data through subsequent revisions
  • Store the annotated data in the database so it can be fetched later
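
One possible way to build the attribution data is to diff the token list of each revision against the previous one. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for WikiWho or the MediaWiki diff API. The names `Word` and `annotate_revisions` are hypothetical. Two assumptions to note: moved text is not detected here (a move shows up as a delete plus an insert), and the result is kept in memory instead of being written to a database.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import count
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Word:
    word_id: int
    text: str
    revision_id: int
    user: str
    previous_word_id: Optional[int] = None


def annotate_revisions(revisions: Iterable[Tuple[int, str, str]]) -> Dict[int, List[Word]]:
    """Annotate (revision_id, user, wikitext) tuples, given oldest first."""
    ids = count(1)
    history: Dict[int, List[Word]] = {}
    previous: List[Word] = []
    for revision_id, user, wikitext in revisions:
        tokens = wikitext.split()
        matcher = SequenceMatcher(a=[w.text for w in previous], b=tokens, autojunk=False)
        current: List[Word] = []
        for tag, a1, a2, b1, b2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words keep their id and original attribution.
                current.extend(previous[a1:a2])
            elif tag in ("replace", "insert"):
                for offset, token in enumerate(tokens[b1:b2]):
                    # For a replacement, remember which old word sat at this position.
                    replaced = tag == "replace" and a1 + offset < a2
                    old_id = previous[a1 + offset].word_id if replaced else None
                    current.append(Word(next(ids), token, revision_id, user, old_id))
            # "delete": the old words simply do not appear in the new revision.
        history[revision_id] = current
        previous = current
    return history
```

For the first revision the previous token list is empty, so every word is treated as inserted by that revision's author, which matches the initial assumption above.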

2.2 Get annotated revision

get_annotated_revision(article_title, target_revision, min_word_length=0)
  • Return words from target_revision that:
    • Were added in that revision
    • Were deleted in that revision
    • Were moved in that revision
    • Still exist in the latest version
  • Support filtering by minimum word length

2.3 Get persistent content from specific Revision

get_persistent_content(article_title, target_revision, min_word_length=0)
  • Return words from target_revision that:
    • Were added in that revision
    • Were deleted in that revision
    • Were moved in that revision
    • Still exist in the latest version
  • Support filtering by minimum word length
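
A minimal sketch of the "still exists in the latest version" check, assuming each annotated word is a dict with the hypothetical keys `word_id`, `text`, and `revision_id`. For brevity it only covers words added in the target revision. Deleted and moved words would need the parent revision as well.

```python
from typing import Dict, List


def persistent_words(
    target_words: List[Dict],
    latest_words: List[Dict],
    target_revision: int,
    min_word_length: int = 0,
) -> List[Dict]:
    """Words added in target_revision whose stable id still appears in the latest revision."""
    latest_ids = {w["word_id"] for w in latest_words}
    return [
        w
        for w in target_words
        if w["revision_id"] == target_revision   # added in the target revision
        and w["word_id"] in latest_ids           # still present in the latest version
        and len(w["text"]) >= min_word_length    # optional length filter
    ]
```

Matching on the stable word id instead of the text means a word that was removed and later retyped by someone else does not count as persistent.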

2.4 Get changed content from specific revision

get_revision_content(article_title, target_revision, min_word_length=0)
  • Return words from target_revision that were:
    • Added in that revision
    • Deleted in that revision
    • Moved in that revision
  • Support filtering by minimum word length
  • No requirement that content still exists in latest version
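
Added, deleted, and moved words could be told apart by comparing the stable word ids of the target revision with those of its parent revision. The sketch below is one heuristic, not a definitive method. The function name `classify_changes` and the dict keys are hypothetical, word ids are assumed to be unique within a revision, and `difflib.SequenceMatcher` is used to decide which shared words kept their relative order.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def classify_changes(
    parent_words: List[Dict],
    target_words: List[Dict],
    min_word_length: int = 0,
) -> Dict[str, List[Dict]]:
    """Split the difference between two annotated revisions into added/deleted/moved."""
    parent_ids = {w["word_id"] for w in parent_words}
    target_ids = {w["word_id"] for w in target_words}
    common = parent_ids & target_ids

    # Order of the shared words in each revision.
    parent_order = [w["word_id"] for w in parent_words if w["word_id"] in common]
    target_order = [w["word_id"] for w in target_words if w["word_id"] in common]

    # Shared words inside a matching block kept their relative order; the rest moved.
    matcher = SequenceMatcher(a=parent_order, b=target_order, autojunk=False)
    in_place = set()
    for block in matcher.get_matching_blocks():
        in_place.update(target_order[block.b:block.b + block.size])

    def long_enough(word: Dict) -> bool:
        return len(word["text"]) >= min_word_length

    return {
        "added": [w for w in target_words if w["word_id"] not in parent_ids and long_enough(w)],
        "deleted": [w for w in parent_words if w["word_id"] not in target_ids and long_enough(w)],
        "moved": [
            w for w in target_words
            if w["word_id"] in common and w["word_id"] not in in_place and long_enough(w)
        ],
    }
```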

3. Management command interface

Make Django management commands for

  • annotating article history
  • getting annotated revisions
  • getting changed content from specific revision
  • getting changed content from specific revision which still exists in latest version
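
The command-line arguments for such commands might look like the sketch below. Django passes an argparse-compatible parser to `BaseCommand.add_arguments`, so the argument definitions are shown here as a plain function that can be tried with `argparse` alone. The function name and the option names (`--start-revision`, `--end-revision`, `--min-word-length`) are assumptions chosen to mirror the utility function signatures above.

```python
import argparse


def add_annotation_arguments(parser: argparse.ArgumentParser) -> None:
    """Arguments shared by the annotation commands (hypothetical option names)."""
    parser.add_argument("article_title", help="Title of the Wikipedia article")
    parser.add_argument("--start-revision", type=int, default=None,
                        help="Revision id to start from (default: first revision)")
    parser.add_argument("--end-revision", type=int, default=None,
                        help="Revision id to stop at (default: latest revision)")
    parser.add_argument("--min-word-length", type=int, default=0,
                        help="Ignore words shorter than this")


# In a Django project this would typically be called from a command class, e.g.:
#
#   class Command(BaseCommand):
#       def add_arguments(self, parser):
#           add_annotation_arguments(parser)
#
#       def handle(self, *args, **options):
#           ...  # call annotate_article(options["article_title"], ...)

if __name__ == "__main__":
    demo_parser = argparse.ArgumentParser()
    add_annotation_arguments(demo_parser)
    print(demo_parser.parse_args(["Helsinki", "--start-revision", "123"]))
```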

4. Web user Interface

Make a separate page for viewing the annotated data.

For the web UI, some older tools (now defunct) could serve as examples of the idea. However, we would only colorize the wikicode and skip coloring the rendered version, because that is complex.

4.1 Article and Revision Selection

  • Allow user to select:
    • Wikipedia article
    • Specific revision to display

4.2 Revision Metadata Display

  • Show:
    • List of users who have edited the article
    • Annotated wikitext of the selected revision

4.3 Annotated text visualization

Display wikitext with words color-coded based on user-selected filter:

Filter Options:

  1. Words from specific revision: Highlight words originating from a chosen revision
  2. Words added after date: Highlight words added after a specific date
  3. Words by specific user: Highlight words added by a user selected from the list of users who have edited the article
  4. Words by automatically reviewed users: Highlight words added by users whose edits are automatically reviewed (bots and other automatically reviewed users)

Color-coding display

  • Words should be colored/highlighted according to the active filter
  • Clear visual distinction between matching and non-matching words
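
The filter-driven highlighting could be sketched as a predicate applied to each annotated word, with the result rendered as HTML spans. Everything here is illustrative: the function names, the CSS class names `word-match` and `word-other`, and the dict keys `text`, `user`, and `revision_id` are assumptions. Words are joined with single spaces, so the original whitespace and line breaks of the wikitext are not preserved in this simplified version.

```python
from html import escape
from typing import Callable, Dict, List


def by_user(username: str) -> Callable[[Dict], bool]:
    """Filter: words added by a specific user."""
    return lambda word: word["user"] == username


def by_revision(revision_id: int) -> Callable[[Dict], bool]:
    """Filter: words originating from a specific revision."""
    return lambda word: word["revision_id"] == revision_id


def render_annotated_wikitext(words: List[Dict], matches: Callable[[Dict], bool]) -> str:
    """Wrap every word in a span whose class says whether the active filter matched."""
    parts = []
    for word in words:
        css_class = "word-match" if matches(word) else "word-other"
        # escape() keeps wikitext such as <ref> from being interpreted as HTML.
        parts.append(
            f'<span class="{css_class}" title="{escape(word["user"])}">'
            f'{escape(word["text"])}</span>'
        )
    return " ".join(parts)
```

The two classes can then be styled with contrasting background colors so that matching and non-matching words are easy to tell apart.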
