
Adaptive Background Sync for Commons Metadata #440

Open
ayushshukla1807 wants to merge 7 commits into hatnote:master from ayushshukla1807:gsoc-2026-commons-sync

Conversation


@ayushshukla1807 ayushshukla1807 commented Mar 28, 2026

GSoC 2026 Blueprint — Phase 3: Adaptive Background Sync for Commons Metadata

Status: Active architecture blueprint, cited in my GSoC 2026 proposal (Phase 3, Weeks 9–10). Tracks the async import engine design.

Problem

The current labs.py get_files() implementation executes a synchronous MySQL query against the Commons replica database. For categories with large file counts (>5,000 images), this query:

  1. Times out: the Toolforge MySQL replica enforces a query timeout, so large categories come back with 0 results
  2. Blocks the WSGI thread: no other requests can be served on that thread while the query runs
  3. Ignores redirects: partially fixed in PR #428 ("fix: follow category redirects during file import"), but the redirect chain is still resolved in a single query

Proposed Architecture: Chunked Async Import

sequenceDiagram
    participant Coord as Coordinator
    participant API as POST /import
    participant Worker as Background Worker
    participant Commons as Commons Replica (MySQL)
    participant Cache as Import Cache (Redis/SQLite)

    Coord->>API: import_method=category, category=WLM_2024
    API-->>Coord: 202 Accepted {job_id: "abc123"}
    API->>Worker: enqueue(job_id, category)

    loop Chunked fetch (500 files/page)
        Worker->>Commons: SELECT ... LIMIT 500 OFFSET N
        Commons-->>Worker: Batch of file metadata
        Worker->>Cache: store batch
        Worker->>API: progress update (N/total)
    end

    Worker->>API: job complete
    Coord->>API: GET /import/abc123/status
    API-->>Coord: {status: "complete", files_imported: 8432}
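
The chunked fetch loop in the diagram could look roughly like the sketch below. It is only an illustration under stated assumptions: the `category_files` table, its columns, and the `iter_file_chunks` name are hypothetical, and `sqlite3` stands in for the Commons MySQL replica so the snippet runs on its own (a MySQL driver would use `%s` placeholders and the real replica schema).

```python
import sqlite3
from typing import Iterator, List, Tuple

CHUNK_SIZE = 500  # files per page, matching the diagram


def iter_file_chunks(
    conn: sqlite3.Connection, category: str, chunk_size: int = CHUNK_SIZE
) -> Iterator[List[Tuple[str]]]:
    """Yield batches of file rows for `category`, one LIMIT/OFFSET page at a time.

    `category_files` is a hypothetical stand-in table, not the replica schema.
    """
    offset = 0
    while True:
        # ORDER BY keeps the page boundaries stable between queries.
        rows = conn.execute(
            "SELECT file_name FROM category_files "
            "WHERE category = ? ORDER BY file_name LIMIT ? OFFSET ?",
            (category, chunk_size, offset),
        ).fetchall()
        if not rows:
            return
        yield rows
        if len(rows) < chunk_size:
            return  # a short page means this was the last one
        offset += chunk_size


def make_demo_connection(n_files: int, category: str = "WLM_2024") -> sqlite3.Connection:
    """Build an in-memory table with `n_files` fake rows for trying the loop out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE category_files (category TEXT, file_name TEXT)")
    conn.executemany(
        "INSERT INTO category_files VALUES (?, ?)",
        [(category, f"File_{i:05d}.jpg") for i in range(n_files)],
    )
    conn.commit()
    return conn
```

Each yielded batch corresponds to one "store batch" plus "progress update" step in the diagram. One caveat with `OFFSET`: the database still has to walk past the skipped rows, so later pages get slower; keyset pagination (filtering on the last seen sort key) would be an option if that turns out to matter.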

Why This Matters for GSoC

Large Wiki Loves competitions (WLM, WLE) routinely have 10,000–50,000 submissions. The current synchronous import is the single biggest operational blocker for using Montage at full competition scale. This is the highest-impact infrastructure improvement in the proposal.

Implementation Plan

| Component | Technology | Notes |
| --- | --- | --- |
| Job queue | Python threading.Thread or Celery | Start simple, Celery if needed |
| Progress tracking | SQLite job table | Lightweight, no new deps |
| Chunked query | LIMIT/OFFSET pagination | 500 files per chunk |
| Redirect resolution | Existing PR #428 logic | Reuse and extend |
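
As a rough sketch of the "start simple" row, a `threading.Thread` worker can record its progress in a SQLite job table. Everything named here (`import_jobs`, `start_import_job`, `get_job_status`, the `fetch_chunks` callable) is a hypothetical placeholder for illustration, not existing Montage code.

```python
import sqlite3
import threading
import uuid
from typing import Callable, Dict, Iterable, Tuple

JOB_SCHEMA = """
CREATE TABLE IF NOT EXISTS import_jobs (
    job_id TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    files_imported INTEGER NOT NULL DEFAULT 0
)
"""


def _connect(db_path: str) -> sqlite3.Connection:
    # Each thread opens its own connection; sqlite3 connections are not shared.
    conn = sqlite3.connect(db_path, timeout=30)
    conn.execute(JOB_SCHEMA)
    return conn


def _set_status(conn: sqlite3.Connection, job_id: str, status: str) -> None:
    with conn:  # commits on success
        conn.execute(
            "UPDATE import_jobs SET status = ? WHERE job_id = ?", (status, job_id)
        )


def _run_job(db_path: str, job_id: str, fetch_chunks: Callable[[], Iterable]) -> None:
    conn = _connect(db_path)
    try:
        for batch in fetch_chunks():
            with conn:  # one committed progress update per chunk
                conn.execute(
                    "UPDATE import_jobs SET files_imported = files_imported + ? "
                    "WHERE job_id = ?",
                    (len(batch), job_id),
                )
        _set_status(conn, job_id, "complete")
    except Exception:
        _set_status(conn, job_id, "failed")
    finally:
        conn.close()


def start_import_job(
    db_path: str, fetch_chunks: Callable[[], Iterable]
) -> Tuple[str, threading.Thread]:
    """Register a job row, start the worker thread, and return immediately."""
    job_id = uuid.uuid4().hex
    conn = _connect(db_path)
    with conn:
        conn.execute(
            "INSERT INTO import_jobs (job_id, status) VALUES (?, 'running')",
            (job_id,),
        )
    conn.close()
    thread = threading.Thread(
        target=_run_job, args=(db_path, job_id, fetch_chunks), daemon=True
    )
    thread.start()
    return job_id, thread


def get_job_status(db_path: str, job_id: str) -> Dict[str, object]:
    """Read the current job row; this is what a status endpoint could return."""
    conn = _connect(db_path)
    row = conn.execute(
        "SELECT status, files_imported FROM import_jobs WHERE job_id = ?",
        (job_id,),
    ).fetchone()
    conn.close()
    if row is None:
        return {"status": "unknown", "files_imported": 0}
    return {"status": row[0], "files_imported": row[1]}
```

In this sketch the POST handler would call `start_import_job` and reply with the `job_id`, while the status request would call `get_job_status`. A plain thread does not survive a process restart, which is one reason the plan keeps Celery as a fallback.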

This commit patches the admin_endpoints validation to gracefully handle empty POST bodies, and modifies MessageMiddleware to intercept MontageErrors so they correctly return 400 Bad Request JSON instead of bypassing CORS headers. Resolves Issue hatnote#357.
Bind :disabled='isLoading' on vote buttons to block concurrent clicks at DOM level before Vue's async re-render cycle can prevent them.
…atnote#325)

getRoundVotesStats was defined in jurorService but never called. Added onMounted fetch and post-vote refresh in VoteRating.vue and VoteYesNo.vue, with conditional rendering when round.show_stats is true.
@ayushshukla1807
Author

I am closing this PR to reduce repository noise. The core fixes relevant to my GSoC Proposal are being manually consolidated into PR #454 and PR #415 to make it substantially easier for the maintainers to review my code. The larger concepts discussed here will be implemented incrementally and manually if my proposal is accepted.

@ayushshukla1807
Author

I have stripped the AI formatting from the description and reopened this PR so I can manually improve its code over the coming days, fulfilling my promise.

@ayushshukla1807
Author

Closing this conceptual proposal. Consolidating my Open Source footprint to prioritize high-value, locally verified bug fixes for the current review window.

