Automating transformers uplift + fixes by jazpurTT · Pull Request #5412 · tenstorrent/tt-xla

jazpurTT · 2026-06-29T15:35:06Z

Ticket

Problem description

Each new transformers release breaks the model test suite. Detecting which models regress, classifying real failures vs. pre-existing ones, and patching the loaders + plumbing in tt-xla and tt_forge_models is a repetitive multi-day chore that gates newer transformers versions from landing on main.

What's changed

End-to-end CI pipeline that helps automate the uplift + fixes:

transformers-uplift-fix Claude skill (.claude/skills/transformers-uplift-fix/SKILL.md) — receives a SCOPE (api-check / model-test-uplifts / model-perf-uplift) and a captured failure context, consults the upstream changelog between CURRENT_VERSION and TARGET_VERSION to identify root causes, and edits source under tt-xla + tt_forge_models only. Hard rules: no monkey-patching, no if version < X shims, no drive-by refactors, no git ops (the orchestrator owns every commit and push). Writes a fix summary to .github/transformers-uplift/fix-summary.md that the orchestrator uses as the commit body.
schedule-transformers-uplift.yml — nightly cron polls PyPI for the next stable transformers release, creates a transformers-uplift/<ver> WIP branch on tt-xla and tt_forge_models, bumps the pin in venv/requirements-dev.txt, and dispatches the orchestrator. Resolves a baseline schedule-nightly run on main to feed downstream filtering.
workflow-transformers-uplift.yml — orchestrates the fix loop:
- api-check — pytest --collect-only sweep to surface import / signature breaks; Claude fixes via a self-bounded sub-loop (up to 5 retries).
- test-and-fix-base — runs the curated baseline subset (33 models marked baseline_uplift) for a fast framework-level signal before paying for the full suite.
- decide — if anything still fails and we're under MAX_ITERATIONS, self-redispatches iteration N+1; otherwise advances to the full passes.
- test-and-fix-full — full model-test-passing.json suite.
- test-and-fix-perf — perf-benchmark sweep across all runners (n150 / p150 / n300-llmbox / galaxy-wh-6u / qb2-blackhole) with regression check against the baseline nightly.
- Each stage hands its captured failure context to the new transformers-uplift-fix Claude skill. Patches land on both tt-xla and the tt_forge_models submodule. Branch is sourced from one place (github.ref_name).
call-test-uplift.yml / call-perf-uplift.yml — reusable wrappers around call-test.yml and call-filtered-perf-tests.yml. Compare current failures against a baseline nightly run so Claude only sees uplift-induced regressions, not pre-existing failures.
manual-test-uplift.yml / manual-perf-uplift.yml — user-dispatchable wrappers for debugging a single suite/runner against an existing WIP branch.
Supporting scripts — detect-new-version.sh (PyPI next-stable picker), bump-transformers.sh (requirements update), extract-failures.py / extract-perf-failures.py (junit + log parsing into Claude-friendly context), run-api-check.sh, run-claude-fix.sh.
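
The decide step described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual implementation: the `Decision` type, the `decide` function, and the `MAX_ITERATIONS` value of 3 are assumptions made for the example (the PR names MAX_ITERATIONS but does not state its value).

```python
from dataclasses import dataclass

# Assumed value for illustration; the PR does not state the real limit.
MAX_ITERATIONS = 3


@dataclass(frozen=True)
class Decision:
    action: str          # "redispatch" (run another fix iteration) or "advance"
    next_iteration: int  # iteration number the next run should use


def decide(failing_tests: list[str], iteration: int,
           max_iterations: int = MAX_ITERATIONS) -> Decision:
    """Choose between another fix iteration and moving on to the full passes."""
    # Redispatch only if something still fails AND the iteration budget remains.
    if failing_tests and iteration < max_iterations:
        return Decision("redispatch", iteration + 1)
    # Either everything passed or the budget is spent: advance to the full passes.
    return Decision("advance", iteration)
```

For example, `decide(["test_bert"], 1)` asks for iteration 2, while `decide(["test_bert"], 3)` advances because the assumed budget is exhausted.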

Why these 33 models for `baseline_uplift`

Curated for architectural diversity rather than count — one or two representatives per family that exercise the corners of the transformers API most likely to break on a release:

Encoder LMs — BERT-base-uncased (masked LM), ALBERT, RoBERTa-XLM, DistilBERT
Decoder / causal LMs — Falcon-3.1B, Mistral-Ministral-3B, Phi-2, Phi-3-mini-128K, Qwen-2.5-0.5B, Qwen-3-0.6B, Gemma-1.1-2B
Seq2seq — BART-large, MusicGen-small (audio seq2seq)
Vision encoders — ViT-base, Swin-S, DINOv2-small, ResNet-50, EfficientNet-B0, MobileNetV2
Detection / segmentation — YOLOv4, YOLOv7, YOLOS-small, YOLOP, DETR, OWL-ViT, Segformer-Mit-B0, MaskFormer-Swin-base
Multimodal — CLIP-base-patch16, SigLIP-base-patch16
Embeddings / NER — BGE-large-en, Sentencizer-XLM-RoBERTa
Generative / audio — Stable Diffusion UNet, SpeechT5 HiFiGAN vocoder

These hit Cache, attention, attention-mask, tokenizer, image-processor, audio-processor, and generation surfaces — i.e. the high-churn areas.

Checklist

New/Existing tests provide coverage for changes

jazpurTT · 2026-06-29T15:52:48Z

I ran an e2e test uplifting transformers 5.5.1 -> 5.9.0

api-check passed and needed no fixes.
Baseline tests failed and Claude applied a fix; the second run passed in full.
Model passing tests ran, this is the test and this is the commit with the fixes + message:
- Claude applied fixes for some models
- Claude was able to identify models that had the same failure on baseline and skipped them
- A couple of models were identified as not transformers-related. These would need human review.
Perf benchmark ran, this is the test and this is the commit
- Small PCC drift, Claude made the change but also flagged it for human review.
- The models that failed also failed on the baseline run, and Claude identified them as such.
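
The baseline filtering that lets pre-existing failures be skipped can be sketched along these lines. This is a simplified sketch under stated assumptions, not the code in extract-failures.py: the function names and the test ids in the sample XML are hypothetical, and it matches failures by test id only, whereas the real pipeline may also compare the failure message.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml: str) -> set[str]:
    """Return ids of test cases that carry a <failure> or <error> child."""
    root = ET.fromstring(junit_xml)
    failed = set()
    # iter() walks the whole tree, so this works for <testsuite> or <testsuites> roots.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(f"{case.get('classname')}::{case.get('name')}")
    return failed


def uplift_regressions(current: set[str], baseline: set[str]) -> set[str]:
    """Keep only failures that are absent from the baseline nightly run."""
    return current - baseline


# Hypothetical junit reports: the uplift branch run and the baseline nightly on main.
CURRENT_XML = """<testsuite>
  <testcase classname="tests.models" name="test_bert"><failure message="ImportError"/></testcase>
  <testcase classname="tests.models" name="test_yolov4"><failure message="PCC mismatch"/></testcase>
  <testcase classname="tests.models" name="test_vit"/>
</testsuite>"""

BASELINE_XML = """<testsuite>
  <testcase classname="tests.models" name="test_bert"/>
  <testcase classname="tests.models" name="test_yolov4"><failure message="PCC mismatch"/></testcase>
  <testcase classname="tests.models" name="test_vit"/>
</testsuite>"""

regressions = uplift_regressions(failed_tests(CURRENT_XML), failed_tests(BASELINE_XML))
```

Here only `tests.models::test_bert` is left in `regressions`, since `test_yolov4` already fails on the baseline and would not be handed to the fix step.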

codecov-commenter · 2026-06-29T17:07:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 33.81%. Comparing base (4bcaf64) to head (997cc97).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #5412   +/-   ##
=======================================
  Coverage   33.81%   33.81%           
=======================================
  Files          37       37           
  Lines        4992     4992           
=======================================
  Hits         1688     1688           
  Misses       3304     3304


acicovicTT · 2026-06-30T06:55:02Z

Pls check .github/scripts/uplift/uplift-config.json and the corresponding test matrix that is invoked when transformers uplift is detected (your 33 models are probably a better curated list than what is already there), but we should not have two separate quality gates for transformers uplift, so either see if you can integrate with uplift-config.json, which is a pretty simple generic uplift test selection mechanism, OR if transformers uplift requires a more complicated machinery, remove it from uplift-config.json.

jazpurTT added 14 commits June 26, 2026 14:05

Transformers uplift automation pipeline - v1

080404a

testing api-check

624048f

testing base-coverage

07b009c

temp

483c468

testing nightly passing models

79129e2

test perf benchmark

01c2208

temp perf

776d036

refactor

65dca27

test iternations

c2ec35b

test e2e

6fe3889

reset tt-forge-models

3a5cbb3

Transformers uplift automation pipeline - v2

cd31c36

test

f31ecd7

debug push issue

6587c0d

jazpurTT requested review from AleksKnezevic, acicovicTT, jameszianxuTT, kmabeeTT, mrakitaTT, mstojkovicTT, ndrakulicTT, nsumrakTT, nvukobratTT, sdjukicTT, sgligorijevicTT, vkovinicTT, vmilosevic and vvukomanTT as code owners June 29, 2026 15:35

Transformers uplift automation pipeline - v3

997cc97

jazpurTT force-pushed the jazpur/auto-transformers-uplift branch from 3999880 to 997cc97 Compare June 29, 2026 15:40

jazpurTT wants to merge 15 commits into main from jazpur/auto-transformers-uplift

3 participants
