[Test] Mark kokoro/Kokoro-82M single-device inference KNOWN_FAILURE_X…#5348

Draft
saiarthiraguram wants to merge 1 commit into main from sai_arthi_raguram/kokoro

@saiarthiraguram

Contributor

Ticket

Github Issue

Problem description

Bring up hexgrad/Kokoro-82M (StyleTTS2 TTS / iSTFTNet vocoder) on the tt-xla
PyTorch runner. The model now has a tt-forge-models loader (companion PR:
tenstorrent/tt-forge-models#); this PR wires it into the tt-xla single-device
inference suite and records its status.

The model compiles and runs end-to-end on n150 with trained weights and produces
a finite, sane-magnitude waveform, but waveform PCC tops out at ~0.14 (0.99 is
required): the iSTFTNet vocoder (sine-phase cumsum + STFT/iSTFT over a long 1-D
waveform) is sensitive to bf16 tensor storage. Decoder output PCC is already
0.948 and the vocoder collapses it to 0.138. Enabling fp32_dest_acc_en + hifi4
gives a bit-identical result, so the error is storage-bound, not
accumulation-bound. Root cause and the full device-vs-CPU bisect are in #5332.
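
To make the storage-bound claim concrete, here is a minimal NumPy sketch. It is a toy, not Kokoro's actual source module or the device path bisected in #5332. It assumes a constant 200 Hz f0, a 24 kHz sample rate, and a 2 s signal, and it uses a hypothetical `to_bf16` helper that emulates bf16 storage by rounding float32 values to the nearest bf16 value. It compares a sine generated from an accumulated phase stored in fp32 against one stored in bf16, using PCC against a float64 reference.

```python
import numpy as np

def to_bf16(x):
    """Hypothetical helper: emulate bf16 storage by rounding float32 to the
    nearest bf16 value (ties to even). The result is returned as float32."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x.view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)
    rounded = (bits + np.uint32(0x7FFF) + lsb) & np.uint32(0xFFFF0000)
    return rounded.astype(np.uint32).view(np.float32)

def pcc(a, b):
    """Pearson correlation coefficient between two flattened arrays."""
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])

# Assumed toy parameters, chosen only for illustration.
SAMPLE_RATE = 24_000
F0_HZ = 200.0
n = 2 * SAMPLE_RATE  # 2 seconds of audio

# Reference: accumulate the phase in float64, then take the sine.
phase_ref = np.cumsum(np.full(n, 2.0 * np.pi * F0_HZ / SAMPLE_RATE))
wave_ref = np.sin(phase_ref)

# Same accumulated phase, but stored in fp32 or bf16 before the sine.
wave_fp32_phase = np.sin(phase_ref.astype(np.float32))
wave_bf16_phase = np.sin(to_bf16(phase_ref))

pcc_fp32_phase = pcc(wave_ref, wave_fp32_phase)
pcc_bf16_phase = pcc(wave_ref, wave_bf16_phase)
print(f"PCC, fp32-stored phase: {pcc_fp32_phase:.4f}")
print(f"PCC, bf16-stored phase: {pcc_bf16_phase:.4f}")
```

In this toy, the loss happens at the moment the phase is stored: once the accumulated phase passes 1024 rad, adjacent bf16 values are 8 rad apart, which is more than one full sine period. Higher accumulation precision upstream cannot recover that information, which matches the storage-bound behavior described above.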

What's changed

  • third_party/tt_forge_models — submodule pointer bumped to include the
    Kokoro loader (companion tt-forge-models PR). (Apply after that PR merges.)
  • tests/runner/test_config/torch/test_config_inference_single_device.yaml:
    new kokoro/pytorch-hexgrad/Kokoro-82M-single_device-inference entry,
    KNOWN_FAILURE_XFAIL, with a reason documenting the bf16 waveform-PCC wall and
    linking #5332 ("Kokoro-82M: whole-model PCC collapses (-0.0019, best 0.24) - bf16 tile storage quantizes large iSTFTNet sine-phase accumulation, causing catastrophic cancellation").
  • python_package/tt_torch/utils.py — Dynamo guard-repr patch fix:
    self.get(guard) → self.get(guard.name) (a prerequisite that unblocked
    compilation during bringup). (Drop this hunk if it has already landed on
    main from another change.)
  • tests/torch/ops/kokoro/ — two self-contained single-device op sanities
    reproducing the underlying device numerical-robustness gap (bf16
    InstanceNorm1d variance catastrophic-cancellation on a near-constant input →
    rsqrt → inf/FLT_MAX/nan): test_adain_sanity.py (minimal bare op) and
    test_adain_chain_sanity.py (Conv1d → AdaIN1d, the Kokoro structure). Both
    xfail(strict=False), #5332. (Currently staged on branch
    sai_arthi_raguram/kokoro-instancenorm-bf16-repro.)
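
For readers unfamiliar with the variance cancellation that the op sanities target, the following NumPy sketch shows the effect on a tiny near-constant input. It is an illustration under stated assumptions, not the content of test_adain_sanity.py. The hypothetical `to_bf16` helper emulates bf16 storage of every intermediate, and the one-pass formula E[x²] - E[x]² is an assumption chosen because it exhibits the cancellation; the PR does not state which formula the device kernel uses. The `eps` value matches the PyTorch InstanceNorm1d default of 1e-5.

```python
import numpy as np

def to_bf16(x):
    """Hypothetical helper: emulate bf16 storage by rounding float32 to the
    nearest bf16 value (ties to even). The result is returned as float32."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x.view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)
    rounded = (bits + np.uint32(0x7FFF) + lsb) & np.uint32(0xFFFF0000)
    return rounded.astype(np.uint32).view(np.float32)

EPS = np.float32(1e-5)  # PyTorch InstanceNorm1d default

# Near-constant input: a large offset with a small alternating ripple.
# Both values are exactly representable in bf16.
x = to_bf16(np.tile(np.array([100.0, 101.0], dtype=np.float32), 32))
true_var = float(np.var(x.astype(np.float64)))  # 0.25

mean = to_bf16(x.mean())

# One-pass variance (assumed formula): E[x^2] - E[x]^2, every result stored in bf16.
mean_of_squares = to_bf16(to_bf16(x * x).mean())
var_one_pass = to_bf16(mean_of_squares - to_bf16(mean * mean))

# Two-pass variance for contrast: center first, then square and average.
centered = to_bf16(x - mean)
var_two_pass = to_bf16(to_bf16(centered * centered).mean())

# Normalize with each variance estimate: y = (x - mean) * rsqrt(var + eps).
y_one_pass = to_bf16(centered * to_bf16(1.0 / np.sqrt(var_one_pass + EPS)))
y_two_pass = to_bf16(centered * to_bf16(1.0 / np.sqrt(var_two_pass + EPS)))

print("true variance:    ", true_var)
print("one-pass variance:", float(var_one_pass[0]))
print("two-pass variance:", float(var_two_pass[0]))
print("max |y| one-pass: ", float(np.abs(y_one_pass).max()))
print("max |y| two-pass: ", float(np.abs(y_two_pass).max()))
```

In this toy the one-pass variance collapses to exactly zero because both squared terms round to the same bf16 value, so the rsqrt scale is set by `eps` alone and the normalized output is far too large. The sketch stays finite; the inf/FLT_MAX/nan outcomes reported for the device come from the actual repro tests and #5332.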

Checklist

  • New/Existing tests provide coverage for changes — model node runs E2E on
    n150 (xfail on PCC); op sanities reproduce the root-cause gap in isolation.

Logs

debug_report.md
iter_n150_13_fp32acc_realw_run.log

…FAIL

Compiles + runs E2E on n150 with trained weights and produces a finite,
sane-magnitude waveform, but waveform PCC tops out at ~0.14 (need 0.99): the
iSTFTNet vocoder (sine-phase cumsum + STFT/iSTFT over a long 1-D waveform) is
sensitive to bf16 tensor storage. Decoder output PCC is already 0.948 and the
vocoder collapses it to 0.138; fp32_dest_acc_en+hifi4 is bit-identical
(storage-bound, not accumulation-bound). Tracked in #5332.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>