long_tail detection triggers too early, causing token repetition before forced EOS #519

@kiagentkronos-cell

Problem

The alignment_stream_analyzer in the Multilingual model forces EOS with long_tail=tensor(True) after only about 60-180 of the 1000 maximum sampling steps. The model repeats tokens right before it is cut off, which produces duplicated phrases in the generated audio.

Reproduction

Setup:

  • Model: ChatterboxMultilingualTTS.from_pretrained(device="cuda")
  • Mode: Zero-shot voice cloning with German reference WAV
  • Hardware: CUDA (DGX Spark / NVIDIA GB10)
  • Python 3.12, Linux arm64

Trigger: Split a long text into sentences and generate each sentence individually. In the run logged below, the analyzer terminated 5 of 8 sentences early.
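
A minimal sketch of the per-sentence loop described in the trigger. The splitter is a naive regex and `generate_per_sentence` is a hypothetical helper, not part of chatterbox. The commented-out model call assumes the import path `chatterbox.mtl_tts` and the `language_id` / `audio_prompt_path` keyword arguments, and `reference_de.wav` is a placeholder file name; adjust these to match the installed package.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def generate_per_sentence(text: str, synthesize: Callable[[str], object]) -> list:
    """Call `synthesize` once per sentence and collect the results in order."""
    return [synthesize(sentence) for sentence in split_sentences(text)]


# Assumed usage with the real model (not executed here):
#
# from chatterbox.mtl_tts import ChatterboxMultilingualTTS
# model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")
# wavs = generate_per_sentence(
#     long_german_text,
#     lambda s: model.generate(s, language_id="de", audio_prompt_path="reference_de.wav"),
# )
```

Passing the synthesis step as a callable keeps the loop testable without a GPU and makes it easy to wrap the call with logging when reproducing the early terminations.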

Observations (from logs)

Sentence 1/8: terminated at sample 60/1000
  forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False

Sentence 2/8: terminated at sample 156/1000
  forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False

Sentence 3/8: terminated at sample 104/1000
  forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False

Sentence 5/8: terminated at sample 179/1000
  forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False

Sentence 7/8: terminated at sample 143/1000
  Detected 3x repetition of token 4218
  forcing EOS token, long_tail=tensor(False), alignment_repetition=tensor(False), token_repetition=True
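
To quantify how often each condition fires, the observations above can be summarized with a small parser. This is a sketch that assumes the log layout is exactly as quoted in this issue (a "Sentence i/n: terminated at sample s/max" line followed by a "forcing EOS token" line); `parse_terminations` is a hypothetical helper, not part of chatterbox.

```python
import re
from collections import Counter

# Log excerpt copied from the observations in this issue.
LOG = """\
Sentence 1/8: terminated at sample 60/1000
  forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sentence 2/8: terminated at sample 156/1000
  forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sentence 3/8: terminated at sample 104/1000
  forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sentence 5/8: terminated at sample 179/1000
  forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sentence 7/8: terminated at sample 143/1000
  Detected 3x repetition of token 4218
  forcing EOS token, long_tail=tensor(False), alignment_repetition=tensor(False), token_repetition=True
"""

HEADER = re.compile(r"Sentence (\d+)/(\d+): terminated at sample (\d+)/(\d+)")
FLAGS = re.compile(
    r"long_tail=tensor\((True|False)\), "
    r"alignment_repetition=tensor\((True|False)\), "
    r"token_repetition=(True|False)"
)
FLAG_NAMES = ("long_tail", "alignment_repetition", "token_repetition")


def parse_terminations(log: str) -> list[dict]:
    """Return one record per terminated sentence with the flags that were True."""
    records: list[dict] = []
    current = None
    for line in log.splitlines():
        header = HEADER.search(line)
        if header:
            current = {
                "sentence": int(header.group(1)),
                "sample": int(header.group(3)),
                "max_samples": int(header.group(4)),
                "reasons": [],
            }
            records.append(current)
            continue
        flags = FLAGS.search(line)
        if flags and current is not None:
            current["reasons"] = [
                name for name, value in zip(FLAG_NAMES, flags.groups()) if value == "True"
            ]
    return records


records = parse_terminations(LOG)
reason_counts = Counter(reason for r in records for reason in r["reasons"])
long_tail_samples = [r["sample"] for r in records if "long_tail" in r["reasons"]]
print(reason_counts)
print("long_tail terminations at samples:", long_tail_samples)
```

For the excerpt above, four of the five terminations are attributed to long_tail and one to token_repetition.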

Impact

When Whisper transcribes the resulting audio, phrases appear duplicated at chunk boundaries where the analyzer forced EOS. This makes the output unusable for longer texts without manual post-processing.

Expected Behavior

The long_tail threshold should allow natural sentence completion before triggering. For German sentences of 50-120 characters, the model typically needs 200-400+ samples to complete. Early termination at 60-180 samples cuts off mid-generation and causes token repetition artifacts.

Suggested Fix

Make long_tail detection less sensitive by raising its thresholds in alignment_stream_analyzer, or make the thresholds configurable via model.generate() parameters. Per-language calibration might also help, as German phonetics may produce different alignment patterns than the English training data.
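
One way the configurable, per-language threshold could look is sketched below. Everything here is hypothetical: `LongTailConfig`, `is_long_tail`, `min_tail_frames`, `per_language` and the numeric values are illustrative names and numbers, not the actual chatterbox implementation or API. The sketch assumes the analyzer can count how many decoding steps have attended to the end of the text after the text was judged complete.

```python
from dataclasses import dataclass, field


@dataclass
class LongTailConfig:
    """Hypothetical knobs for the long_tail check (not part of chatterbox)."""

    # Default number of trailing steps tolerated before long_tail fires.
    min_tail_frames: int = 5
    # Optional per-language overrides, keyed by language id such as "de".
    per_language: dict[str, int] = field(default_factory=dict)

    def threshold_for(self, language_id: str) -> int:
        return self.per_language.get(language_id, self.min_tail_frames)


def is_long_tail(
    tail_frames: int,
    text_complete: bool,
    cfg: LongTailConfig,
    language_id: str,
) -> bool:
    """Fire only once the text is complete and the tail exceeds the threshold."""
    return text_complete and tail_frames >= cfg.threshold_for(language_id)


# Illustrative values only: a more tolerant threshold for German.
cfg = LongTailConfig(min_tail_frames=5, per_language={"de": 12})
print(is_long_tail(8, True, cfg, "en"))  # default threshold applies
print(is_long_tail(8, True, cfg, "de"))  # German override applies
```

Keeping the default unchanged and exposing overrides would leave existing behavior intact for other languages while allowing German to be calibrated separately.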

Environment

  • chatterbox pip package (latest as of May 2026)
  • CUDA 13.0, sm_121 (NVIDIA GB10 / DGX Spark)
  • Python 3.12, Linux (arm64)
  • Reference audio: ~10s German male voice clip
