Problem
The alignment_stream_analyzer in the Multilingual model forces EOS via long_tail=tensor(True) at very low sample counts (~60-180 out of 1000 max), causing the model to repeat tokens right before being cut off. This results in duplicated phrases in the generated audio.
Reproduction
Setup:
- Model: ChatterboxMultilingualTTS.from_pretrained(device="cuda")
- Mode: Zero-shot voice cloning with German reference WAV
- Hardware: CUDA (DGX Spark / NVIDIA GB10)
- Python 3.12, Linux arm64
Trigger: Split long text into sentences and generate each individually. Multiple sentences get terminated early by the analyzer.
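The trigger can be sketched roughly as below. This is a minimal sketch, not the exact script used for this report: split_sentences, generate_per_sentence and make_chatterbox_synth are hypothetical helper names, the regex splitter is deliberately naive, and the import path and generate() keyword arguments are assumptions that should be checked against the installed chatterbox version.

```python
import re
from typing import Any, Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def generate_per_sentence(text: str, synth: Callable[[str], Any]) -> List[Any]:
    """Call synth once per sentence and collect the results in order."""
    return [synth(sentence) for sentence in split_sentences(text)]


def make_chatterbox_synth(reference_wav: str) -> Callable[[str], Any]:
    """Build a per-sentence synthesizer around the multilingual model.

    Assumption: the import path and the generate() keyword names below
    (language_id, audio_prompt_path) match the installed chatterbox release.
    """
    from chatterbox.mtl_tts import ChatterboxMultilingualTTS  # assumed import path

    model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")
    return lambda sentence: model.generate(
        sentence, language_id="de", audio_prompt_path=reference_wav
    )
```

With a real reference clip, generate_per_sentence(long_text, make_chatterbox_synth("reference.wav")) yields one result per sentence, and the analyzer messages quoted in this report appear in the log while it runs.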
Observations (from logs)
Sentence 1/8: terminated at sample 60/1000
forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sentence 2/8: terminated at sample 156/1000
forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sentence 3/8: terminated at sample 104/1000
forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sentence 5/8: terminated at sample 179/1000
forcing EOS token, long_tail=tensor(True), alignment_repetition=tensor(False), token_repetition=False
Sentence 7/8: terminated at sample 143/1000
Detected 3x repetition of token 4218
forcing EOS token, long_tail=tensor(False), alignment_repetition=tensor(False), token_repetition=True
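To compare runs, the log excerpts above can be tabulated with a small parser. This is a sketch that assumes the two line formats appear exactly as quoted here; the "Sentence N/M: terminated at sample ..." line may come from a wrapper script rather than from chatterbox itself, and parse_terminations and count_reasons are hypothetical names.

```python
import re
from collections import Counter
from typing import Dict, List

STOP_RE = re.compile(r"Sentence (\d+)/(\d+): terminated at sample (\d+)/(\d+)")
EOS_RE = re.compile(
    r"forcing EOS token, long_tail=tensor\((True|False)\), "
    r"alignment_repetition=tensor\((True|False)\), "
    r"token_repetition=(True|False)"
)
FLAGS = ("long_tail", "alignment_repetition", "token_repetition")


def parse_terminations(log_text: str) -> List[Dict]:
    """Pair each 'terminated at sample' line with the next 'forcing EOS' line."""
    events, current = [], None
    for line in log_text.splitlines():
        stop = STOP_RE.search(line)
        if stop:
            current = {"sentence": int(stop.group(1)), "sample": int(stop.group(3))}
            continue
        eos = EOS_RE.search(line)
        if eos and current is not None:
            flags = dict(zip(FLAGS, (g == "True" for g in eos.groups())))
            events.append({**current, **flags})
            current = None
    return events


def count_reasons(events: List[Dict]) -> Counter:
    """Count how often each analyzer flag was set when EOS was forced."""
    return Counter(flag for e in events for flag in FLAGS if e[flag])
```

Run over the five events listed above, this makes it easy to see that long_tail accounts for four of the forced stops and token_repetition for one.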
Impact
When Whisper transcribes the resulting audio, phrases appear duplicated at chunk boundaries where the analyzer forced EOS. This makes the output unusable for longer texts without manual post-processing.
Expected Behavior
The long_tail threshold should allow natural sentence completion before triggering. For German sentences of 50-120 characters, the model typically needs 200-400+ samples to complete. Early termination at 60-180 samples cuts off mid-generation and causes token repetition artifacts.
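As a stopgap, suspiciously early stops can be flagged so the affected sentence can be regenerated or reviewed. The check below is only a heuristic built from the figures in this report (sentences of 50+ characters needing roughly 200+ samples); looks_truncated and its default cutoff are hypothetical and not part of chatterbox.

```python
def looks_truncated(sentence: str, stop_sample: int, min_samples: int = 200) -> bool:
    """Return True when a sentence of 50+ characters stopped before min_samples.

    The 50-character and 200-sample cutoffs come from the observations in
    this report, not from a calibrated measurement.
    """
    return len(sentence) >= 50 and stop_sample < min_samples
```

For example, an 80-character sentence that stopped at sample 60 would be flagged, while the same sentence stopping at sample 250 would not.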
Suggested Fix
Raise the long_tail thresholds in alignment_stream_analyzer so the check triggers later, or make them configurable via model.generate() parameters. Per-language calibration might also help, as German phonetics may produce different alignment patterns than the English training data.
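One possible shape for configurable, per-language thresholds is sketched below. Everything in it is hypothetical: the actual long_tail condition inside alignment_stream_analyzer is not shown in this report, so the tail_frames >= min_tail_frames comparison is a stand-in, and the numeric values are placeholders rather than calibrated settings.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TailThresholds:
    # Hypothetical knob: how many post-completion frames count as a "long tail".
    min_tail_frames: int


# Placeholder values for illustration only; not calibrated.
DEFAULT = TailThresholds(min_tail_frames=5)
PER_LANGUAGE: Dict[str, TailThresholds] = {"de": TailThresholds(min_tail_frames=10)}


def thresholds_for(
    language_id: str, overrides: Optional[Dict[str, TailThresholds]] = None
) -> TailThresholds:
    """Resolve thresholds: caller overrides first, then per-language, then default."""
    table = {**PER_LANGUAGE, **(overrides or {})}
    return table.get(language_id, DEFAULT)


def should_force_eos(
    tail_frames: int,
    language_id: str,
    overrides: Optional[Dict[str, TailThresholds]] = None,
) -> bool:
    """Stand-in for the long_tail check using the resolved threshold."""
    return tail_frames >= thresholds_for(language_id, overrides).min_tail_frames
```

The overrides argument stands in for whatever parameter generate() might expose, so callers could relax or tighten the check without patching the package.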
Environment
- chatterbox pip package (latest as of May 2026)
- CUDA 13.0, sm_121 (NVIDIA GB10 / DGX Spark)
- Python 3.12, Linux (arm64)
- Reference audio: ~10s German male voice clip