Add benchmark tool for superseded additions detection (#113)#122

Open
xenacode-art wants to merge 6 commits into Wikimedia-Suomi:main from xenacode-art:feature/113-benchmark-superseded-clean

Conversation

@xenacode-art
Contributor

@xenacode-art xenacode-art commented Oct 27, 2025

Hi @zache-fi,
I've implemented the Django management command to compare two methods of detecting superseded additions in pending revisions:

  1. Current similarity-based method (using SequenceMatcher)
  2. Proposed word-level diff method (using MediaWiki REST API)

What I Added

Management Command

  • app/reviews/management/commands/benchmark_superseded.py (450 lines)
    • Compares both methods across sample revisions
    • Generates detailed statistics and JSON output
    • Provides diff URLs for manual review
    • Configurable sample size, threshold, and wiki

Documentation

  • BENCHMARK_SUPERSEDED.md (comprehensive guide)
    • Explains current implementation (autoreview.py:755-813)
    • Documents word-level diff approach
    • Usage examples and interpretation guide
    • Performance considerations and integration path

Supporting Files

  • app/reviews/management/__init__.py (package marker)
  • app/reviews/management/commands/__init__.py (package marker)
  • benchmark_results_example.json (sample output format)

Usage

python manage.py benchmark_superseded --wiki=1 --sample-size=50 --threshold=0.2 --output=results.json

Key Features

Similarity Method (Current):

  • Character-level text matching with SequenceMatcher
  • Normalizes wikitext (removes refs, templates, formatting)
  • Fast, no external dependencies
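As a rough sketch of how the similarity-based check works (the function names and the exact normalization rules here are illustrative, not the actual autoreview.py implementation):

```python
import re
from difflib import SequenceMatcher

def normalize_wikitext(text: str) -> str:
    """Strip refs, templates and basic formatting before comparison (simplified)."""
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # <ref>...</ref> pairs
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # non-nested {{templates}}
    text = re.sub(r"'{2,}", "", text)  # bold/italic quote markup
    return re.sub(r"\s+", " ", text).strip()

def is_addition_superseded(addition: str, current_text: str, threshold: float = 0.2) -> bool:
    """Treat an addition as superseded when its similarity to the current text
    falls below the threshold (hypothetical decision rule for illustration)."""
    ratio = SequenceMatcher(
        None, normalize_wikitext(addition), normalize_wikitext(current_text)
    ).ratio()
    return ratio < threshold
```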

Word-Level Method (Proposed):

  • Uses MediaWiki REST API visual diff endpoint
  • Tracks word-level changes and block moves
  • More precise semantic understanding
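A minimal sketch of fetching a structured diff from the MediaWiki REST API compare endpoint. The type-code interpretation in the comments reflects my reading of the API docs and should be verified against the benchmark's actual parsing:

```python
import json
from urllib.request import urlopen

def fetch_word_level_diff(lang: str, from_rev: int, to_rev: int) -> list[dict]:
    """Fetch the structured diff between two revisions from the MediaWiki REST API."""
    url = f"https://{lang}.wikipedia.org/w/rest.php/v1/revision/{from_rev}/compare/{to_rev}"
    with urlopen(url, timeout=30) as resp:
        # Each diff entry carries a "type": 0 = context, 1 = added, 2 = deleted,
        # 3 = changed (with word-level highlightRanges), 4/5 = moved paragraphs.
        return json.load(resp)["diff"]

def count_changes(diff: list[dict]) -> dict[str, int]:
    """Summarize how many diff entries were added, deleted, changed, or moved."""
    labels = {1: "added", 2: "deleted", 3: "changed", 4: "moved", 5: "moved"}
    counts = {"added": 0, "deleted": 0, "changed": 0, "moved": 0}
    for entry in diff:
        label = labels.get(entry.get("type"))
        if label:
            counts[label] += 1
    return counts
```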

Comparison Output:

  • Agreement rate between methods
  • Disagreement breakdown (similarity-only vs word-level-only approvals)
  • Per-revision results with diff URLs
  • JSON export for further analysis
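The agreement statistics could be computed along these lines; the per-revision result keys (`similarity`, `word_level`) are hypothetical stand-ins for whatever shape the command actually emits:

```python
def summarize_agreement(results: list[dict]) -> dict:
    """Compare per-revision verdicts from the two methods.

    Each result is assumed to hold boolean "superseded" verdicts under the
    keys "similarity" and "word_level" (illustrative names).
    """
    agree = sum(1 for r in results if r["similarity"] == r["word_level"])
    sim_only = sum(1 for r in results if r["similarity"] and not r["word_level"])
    word_only = sum(1 for r in results if r["word_level"] and not r["similarity"])
    total = len(results)
    return {
        "total": total,
        "agreement_rate": agree / total if total else 0.0,
        "similarity_only": sim_only,   # flagged only by the similarity method
        "word_level_only": word_only,  # flagged only by the word-level method
    }
```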

Testing

I validated the command structure with:

  • Python AST syntax checking (passed)
  • Django package structure (proper __init__.py files)

Addresses issue #113

Kaja Obinna and others added 3 commits October 27, 2025 10:11
…i#113)

Contributor

@zache-fi zache-fi left a comment

Even though I didn't verify the results, I was able to run the command, and based on that it would be useful:

  • if there were a warning that data must first be loaded via the web interface, because right now the user has to guess this
  • the --wiki parameter currently requires a NUMBER. It would be better if it accepted a language code and resolved the correct wiki from that.
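The language-code resolution could be sketched roughly as follows; in the real command the lookup table would come from the Wiki model rather than a dict, and the field name is an assumption:

```python
def resolve_wiki(code: str, wikis: dict[str, int]) -> int:
    """Map a language code (e.g. 'fi') to a wiki ID, with a helpful error.

    `wikis` stands in for a lookup built from the Wiki table; the real
    command would query the model (e.g. Wiki.objects.get(code=code)) instead.
    """
    try:
        return wikis[code]
    except KeyError:
        available = ", ".join(sorted(wikis))
        raise ValueError(f"Unknown wiki code '{code}'. Available codes: {available}")
```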

from django.core.management.base import BaseCommand
from pywikibot.comms import http

from app.reviews.models import PendingPage, PendingRevision, Wiki
Contributor


The app.reviews.models import throws ModuleNotFoundError: No module named 'app'. It is easy to fix by removing the app. part from the import, but I am not sure why it was required in the first place.

self, revision: PendingRevision, wiki: Wiki, threshold: float
) -> dict[str, Any]:
"""Compare similarity-based vs word-level diff methods."""
from app.reviews.autoreview import (
Contributor


These were refactored in 3a7f185, so the private _extract_additions, _get_parent_wikitext, etc. functions no longer exist; they are now the public utility functions extract_additions, get_parent_wikitext, and so on. I tried fixing these, but as far as I can tell the functions' return values changed as well, so I didn't continue testing further.

Address review comments from PR Wikimedia-Suomi#122:

1. Fix import statements - Remove 'app.' prefix
   - Changed: from app.reviews.models to reviews.models
   - Changed: from app.reviews.autoreview imports

2. Update to use refactored autoreview utility functions
   - Use extract_additions (was _extract_additions)
   - Use get_parent_wikitext (was _get_parent_wikitext)
   - Use normalize_wikitext (was _normalize_wikitext)
   - Use is_addition_superseded (was _is_addition_superseded)
   - Import from reviews.autoreview.utils.wikitext and reviews.autoreview.utils.similarity
   - Functions were refactored in commit 3a7f185

3. Change --wiki parameter to accept language code
   - Now accepts language codes (e.g., 'fi', 'en') instead of numeric Wiki ID
   - More user-friendly and intuitive
   - Provides helpful error message with available wiki codes if not found

4. Add data loading requirement warnings
   - Added note about needing to load data via web interface first
   - Improved error message when no suitable revisions found
   - Explains possible reasons for empty results

5. Update documentation
   - Updated BENCHMARK_SUPERSEDED.md to reflect all changes
   - Fixed function references (removed underscores)
   - Updated file location references for refactored code
   - Updated all usage examples to use language codes
@xenacode-art
Contributor Author

Hi @zache-fi,

I've addressed all the review comments you provided. The command now uses the refactored utility
functions, accepts language codes instead of numeric IDs, and includes warnings about data requirements.
All imports have been fixed and the documentation has been updated accordingly. Please let me know if
there's anything else that needs adjustment.

Thanks for the detailed feedback!

@xenacode-art
Contributor Author

Hi @zache-fi! 👋

Just wanted to follow up on this PR since I've addressed all the review feedback you provided:

✅ Fixed all import issues - Removed app. prefix
✅ Updated to refactored functions - Now using public utility functions from the new modular structure
✅ Changed --wiki parameter - Now accepts language codes (e.g., --wiki=fi) instead of numeric IDs
✅ Added data loading warnings - Users get clear messaging about prerequisites

The command is now more user-friendly and follows the current codebase architecture. All syntax checks are passing.

How it supports the roadmap:
This benchmark tool helps validate the effectiveness of superseded additions detection, which is part of
the autoreview pipeline that will run on Toolforge (Goal #1). Understanding which detection method works
best will help us optimize the production deployment.

Ready for another review when you have time! Let me know if there's anything else needed.
