Automating transformers uplift + fixes #5412

Open
jazpurTT wants to merge 15 commits into main from jazpur/auto-transformers-uplift

@jazpurTT

Contributor

Ticket

#3608

Problem description

Each new transformers release breaks the model test suite. Detecting which models regress, separating real failures from pre-existing ones, and patching the loaders + plumbing in tt-xla and tt_forge_models is a repetitive multi-day chore that blocks newer transformers versions from landing on main.

What's changed

End-to-end CI pipeline that helps automate the uplift + fixes:

  • transformers-uplift-fix Claude skill (.claude/skills/transformers-uplift-fix/SKILL.md) — receives a SCOPE (api-check / model-test-uplifts / model-perf-uplift) and a captured failure context, consults the upstream changelog between CURRENT_VERSION and TARGET_VERSION to identify root causes, and edits source under tt-xla + tt_forge_models only. Hard rules: no monkey-patching, no if version < X shims, no drive-by refactors, no git ops (the orchestrator owns every commit and push). Writes a fix summary to .github/transformers-uplift/fix-summary.md that the orchestrator uses as the commit body.
  • schedule-transformers-uplift.yml — nightly cron polls PyPI for the next stable transformers release, creates a transformers-uplift/<ver> WIP branch on tt-xla and tt_forge_models, bumps the pin in venv/requirements-dev.txt, and dispatches the orchestrator. Resolves a baseline schedule-nightly run on main to feed downstream filtering.
  • workflow-transformers-uplift.yml — orchestrates the fix loop:
    • api-check: pytest --collect-only sweep to surface import / signature breaks; Claude fixes via a self-bounded sub-loop (up to 5 retries).
    • test-and-fix-base — runs the curated baseline subset (33 models marked baseline_uplift) for a fast framework-level signal before paying for the full suite.
    • decide — if anything still fails and we're under MAX_ITERATIONS, self-redispatches iteration N+1; otherwise advances to the full passes.
    • test-and-fix-full — full model-test-passing.json suite.
    • test-and-fix-perf — perf-benchmark sweep across all runners (n150 / p150 / n300-llmbox / galaxy-wh-6u / qb2-blackhole) with regression check against the baseline nightly.
    • Each stage hands its captured failure context to the new transformers-uplift-fix Claude skill. Patches land on both tt-xla and the tt_forge_models submodule. Branch is sourced from one place (github.ref_name).
  • call-test-uplift.yml / call-perf-uplift.yml — reusable wrappers around call-test.yml and call-filtered-perf-tests.yml. Compare current failures against a baseline nightly run so Claude only sees uplift-induced regressions, not pre-existing failures.
  • manual-test-uplift.yml / manual-perf-uplift.yml — user-dispatchable wrappers for debugging a single suite/runner against an existing WIP branch.
  • Supporting scripts: detect-new-version.sh (PyPI next-stable picker), bump-transformers.sh (requirements update), extract-failures.py / extract-perf-failures.py (junit + log parsing into Claude-friendly context), run-api-check.sh, run-claude-fix.sh.
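
For reviewers who want a concrete picture of the baseline filtering, here is a minimal sketch of the idea behind extract-failures.py combined with the baseline comparison in call-test-uplift.yml: collect failed test ids from a junit report, then drop anything that already fails on the baseline nightly. It is not the code in this PR. The function names, the `classname::name` id format, and the sample test names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml: str) -> set[str]:
    """Return 'classname::name' ids of testcases that have a <failure> or <error> child."""
    root = ET.fromstring(junit_xml)
    failed = set()
    # iter() walks the whole tree, so <testsuites> and <testsuite> roots both work.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(f"{case.get('classname', '')}::{case.get('name', '')}")
    return failed


def uplift_regressions(current_xml: str, baseline_xml: str) -> set[str]:
    """Failures in the current run that do not also fail in the baseline run."""
    return failed_tests(current_xml) - failed_tests(baseline_xml)


# Hypothetical reports: test_yolov4 already fails on the baseline, test_bert is new.
BASELINE = """<testsuite>
  <testcase classname="tests.models" name="test_bert"/>
  <testcase classname="tests.models" name="test_yolov4"><failure message="pre-existing"/></testcase>
</testsuite>"""

CURRENT = """<testsuite>
  <testcase classname="tests.models" name="test_bert"><failure message="ImportError"/></testcase>
  <testcase classname="tests.models" name="test_yolov4"><failure message="pre-existing"/></testcase>
</testsuite>"""

print(sorted(uplift_regressions(CURRENT, BASELINE)))  # ['tests.models::test_bert']
```

A plain set difference keys only on the test id, so a test that fails on both runs for different reasons would be filtered out as well. Whether the real scripts also compare failure messages is not stated here.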

Why these 33 models for baseline_uplift

Curated for architectural diversity rather than count — one or two representatives per family that exercise the corners of the transformers API most likely to break on a release:

  • Encoder LMs — BERT-base-uncased (masked LM), ALBERT, RoBERTa-XLM, DistilBERT
  • Decoder / causal LMs — Falcon-3.1B, Mistral-Ministral-3B, Phi-2, Phi-3-mini-128K, Qwen-2.5-0.5B, Qwen-3-0.6B, Gemma-1.1-2B
  • Seq2seq — BART-large, MusicGen-small (audio seq2seq)
  • Vision encoders — ViT-base, Swin-S, DINOv2-small, ResNet-50, EfficientNet-B0, MobileNetV2
  • Detection / segmentation — YOLOv4, YOLOv7, YOLOS-small, YOLOP, DETR, OWL-ViT, Segformer-Mit-B0, MaskFormer-Swin-base
  • Multimodal — CLIP-base-patch16, SigLIP-base-patch16
  • Embeddings / NER — BGE-large-en, Sentencizer-XLM-RoBERTa
  • Generative / audio — Stable Diffusion UNet, SpeechT5 HiFiGAN vocoder

These hit Cache, attention, attention-mask, tokenizer, image-processor, audio-processor, and generation surfaces — i.e. the high-churn areas.

Checklist

  • New/Existing tests provide coverage for changes

@jazpurTT jazpurTT force-pushed the jazpur/auto-transformers-uplift branch from 3999880 to 997cc97 on June 29, 2026 15:40
@jazpurTT

Contributor Author

I ran an e2e test uplifting transformers 5.5.1 -> 5.9.0

  • api-check passed and needed no fixes.
  • Baseline tests failed and Claude applied a fix; the second run passed in full.
  • Model passing tests ran; this is the test and this is the commit with the fixes + message:
    • Claude applied fixes for some models
    • Claude was able to identify models that had the same failure on baseline and skipped them
    • A couple of models were identified as not transformers-related. These would need human review.
  • Perf benchmark ran; this is the test and this is the commit:
    • Small PCC drift: Claude made the change but also flagged it for human review.
    • The models that failed also failed on the baseline test, and Claude identified them as such.
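
The baseline behaviour in this run (fail, fix, rerun, pass) is what the decide stage from the description is meant to drive. A minimal sketch of that decision, assuming iterations are numbered from 1 and that "under MAX_ITERATIONS" means `iteration < max_iterations`; the `Next` enum and `decide` function are hypothetical names, not identifiers from the workflow:

```python
from enum import Enum


class Next(Enum):
    REDISPATCH = "redispatch"  # self-redispatch iteration N+1 of the fix loop
    ADVANCE = "advance"        # move on to the full passes


def decide(num_failures: int, iteration: int, max_iterations: int) -> Next:
    """Redispatch while something still fails and the iteration budget is not spent."""
    if num_failures > 0 and iteration < max_iterations:
        return Next.REDISPATCH
    return Next.ADVANCE


# Hypothetical numbers shaped like the run above: failures on iteration 1, none on 2.
print(decide(num_failures=3, iteration=1, max_iterations=3))  # Next.REDISPATCH
print(decide(num_failures=0, iteration=2, max_iterations=3))  # Next.ADVANCE
```

Under this reading, the loop also advances when failures remain but the budget is exhausted, so the full passes can start with known-failing baseline models.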

@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 33.81%. Comparing base (4bcaf64) to head (997cc97).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5412   +/-   ##
=======================================
  Coverage   33.81%   33.81%           
=======================================
  Files          37       37           
  Lines        4992     4992           
=======================================
  Hits         1688     1688           
  Misses       3304     3304           

@acicovicTT

Contributor

Please check .github/scripts/uplift/uplift-config.json and the corresponding test matrix that is invoked when a transformers uplift is detected (your 33 models are probably a better curated list than what is already there). We should not have two separate quality gates for transformers uplift, so either integrate with uplift-config.json, which is a pretty simple generic uplift test selection mechanism, or, if transformers uplift requires more complicated machinery, remove it from uplift-config.json.
