output_attentions=True forces eager attention for all models, blocking SDPA/flash attention optimization (unreleased fix on master)
Bug Report
Summary
In the latest PyPI release (v0.1.7 / tag v0.1.2), t3.py hardcodes output_attentions=True in both the initial forward pass and the generation loop. This forces PyTorch's eager attention implementation, disabling SDPA and flash_attention optimizations.
For English models, the AlignmentStreamAnalyzer that consumes these attention weights is None — so the attention outputs are computed but never used. This wastes GPU memory bandwidth and compute on every forward pass during autoregressive generation.
The fix is already on master (both calls changed to output_attentions=False) but has never been released, leaving all pip users affected.
Impact
Every Chatterbox user installing via pip install chatterbox-tts (which installs v0.1.7)
English models compute full attention weights that are never consumed — pure waste
Blocks SDPA/flash attention for T3 transformer, which is the most expensive component
On ROCm: forces eager attention mode, preventing AOTriton flash attention from being used
On CUDA: prevents torch.nn.functional.scaled_dot_product_attention from using Flash Attention 2 or memory-efficient attention kernels
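For context on why requesting attention weights rules out the fused kernels: flash-style kernels accumulate the output in a single pass over the keys and never materialize the softmax weights, so there is nothing to return. A toy plain-Python sketch of that general idea follows. It is not PyTorch's implementation, and the function names are made up for illustration.

```python
import math


def eager_attention(q, keys, values):
    """Single-query attention that materializes the full weight vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # this is what output_attentions exposes
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


def streaming_attention(q, keys, values):
    """Same result via online softmax: one pass, weights never stored."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        correction = math.exp(running_max - new_max)  # rescale earlier partial sums
        p = math.exp(s - new_max)
        denom = denom * correction + p
        acc = [a * correction + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


# Made-up toy inputs
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out_eager, weights = eager_attention(q, keys, values)
out_stream = streaming_attention(q, keys, values)
print(out_eager, out_stream)
```

Both functions produce the same output vector, but only the eager version has a weight vector to hand back, which is why asking for attention weights pins the model to the eager path.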
Reproduction
# Install latest release
# pip install chatterbox-tts==0.1.7
from chatterbox.models.t3.t3 import T3
from chatterbox.tts import ChatterboxTTS

# In t3.py, lines ~311 and ~362 (v0.1.2 tag):
#   output_attentions=True,  # <-- hardcoded, forces eager attention
#   ...
#   output_attentions=True,  # <-- same in generation loop

# With transformers >=4.36 (SDPA default), this causes:
#   ValueError: The `output_attentions` attribute is not supported
#   when using `attn_implementation` set to sdpa.
#   (Issue #339)

# With ROCm/AOTriton flash attention built from source:
#   Forces fallback to eager mode, losing flash attention speedup

# For English models, alignment_stream_analyzer is None, so these
# attention outputs are never consumed:
model = ChatterboxTTS.from_pretrained(device="cuda")
# model.alignment_stream_analyzer is None for English
# → attention weights computed and discarded every step
Root Cause
In src/chatterbox/models/t3/t3.py (v0.1.2 tag), lines ~311 and ~362:
output_attentions=True,
The AlignmentStreamAnalyzer that needs these weights is only instantiated for multilingual models. For English models, self.patched_model.alignment_stream_analyzer is None — but output_attentions=True is unconditional.
Fix (already on master)
The master branch already has this fixed:
output_attentions=False,
at both locations. However, this change has not been released — pip still installs the broken version.
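To check whether an installed copy is affected, something like the following could scan the module source for the hardcoded flag. This is a rough sketch: the helper names are hypothetical, the sample text is made up rather than copied from t3.py, and the module path is taken from the import shown in the reproduction.

```python
import importlib.util
import re
from pathlib import Path

FLAG = re.compile(r"output_attentions\s*=\s*True")


def find_hardcoded_flag(source: str) -> list[int]:
    """Return 1-based line numbers where output_attentions=True appears in code."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore anything after a comment marker
        if FLAG.search(code):
            hits.append(lineno)
    return hits


def installed_t3_path():
    """Locate the installed t3.py, or return None if chatterbox is not installed."""
    try:
        spec = importlib.util.find_spec("chatterbox.models.t3.t3")
    except (ImportError, ValueError):
        return None
    return Path(spec.origin) if spec and spec.origin else None


# Made-up sample, not the real file contents
sample = """\
tfmr_out = self.patched_model(
    output_attentions=True,
    output_hidden_states=True,
)
# output_attentions=True in a comment is ignored
"""
print(find_hardcoded_flag(sample))

path = installed_t3_path()
if path is not None:
    print(path, find_hardcoded_flag(path.read_text()))
```

An empty list for the installed file would suggest the copy no longer passes a literal True, though it says nothing about other ways the flag might be set.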
Suggested Fix
If a new release isn't imminent, the conditional fix is more precise:
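A minimal sketch of such a conditional, assuming the analyzer is reachable as self.patched_model.alignment_stream_analyzer as described under Root Cause. The helper and the stand-in objects are hypothetical and only illustrate the logic; they are not the Chatterbox code.

```python
from types import SimpleNamespace


def needs_attention_weights(patched_model) -> bool:
    """True only when an alignment analyzer exists to consume the weights."""
    return getattr(patched_model, "alignment_stream_analyzer", None) is not None


# Stand-ins for the real patched model (illustrative only)
english = SimpleNamespace(alignment_stream_analyzer=None)
multilingual = SimpleNamespace(alignment_stream_analyzer=object())

# In t3.py, both forward calls would then pass something like:
#   output_attentions=needs_attention_weights(self.patched_model),
print(needs_attention_weights(english))       # False
print(needs_attention_weights(multilingual))  # True
```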
This preserves multilingual alignment functionality while enabling SDPA/flash attention for English models.
Why This Matters for Performance
The T3 transformer is memory-bound during autoregressive generation — each step is a full transformer forward pass. Computing unused attention weights:
Allocates (batch, heads, seq_len, seq_len) tensors per layer every step
Forces the eager attention kernel instead of memory-efficient SDPA/flash
Stores the attention outputs in the BaseModelOutputWithPast object, where they are never accessed
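To put a rough number on the allocation, here is a back-of-the-envelope sketch. The shape values are hypothetical placeholders rather than T3's actual configuration, and the second case assumes a KV cache so that a decode step has a query length of 1.

```python
def attn_weight_bytes(batch, heads, q_len, kv_len, layers, bytes_per_elem=2):
    """Bytes needed to hold (batch, heads, q_len, kv_len) weights for every layer."""
    return batch * heads * q_len * kv_len * layers * bytes_per_elem


MIB = 1024 ** 2

# Hypothetical shapes, fp16 (2 bytes per element)
full = attn_weight_bytes(batch=2, heads=16, q_len=512, kv_len=512, layers=30)
step = attn_weight_bytes(batch=2, heads=16, q_len=1, kv_len=512, layers=30)

print(f"full (seq_len x seq_len): {full / MIB:.1f} MiB")  # 480.0 MiB
print(f"cached decode step:       {step / MIB:.2f} MiB")  # 0.94 MiB
```

The full-sequence case grows quadratically with seq_len, so the waste is largest on the initial forward pass and on any step that recomputes the whole sequence.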
In my testing (Radeon 8060S, gfx1151, ROCm 6.4.3):
With output_attentions=True: T3 runs at ~24 it/s (memory-bound ceiling)
With output_attentions=False: same it/s ceiling (still memory-bound), but enables SDPA/flash attention path for future optimization and reduces per-step memory allocation
On CUDA GPUs with flash attention available, the impact is larger — eager attention is significantly slower than flash for longer sequences.
Related Issues
output_attentions not supported when using voice references with transformers >=4.36 #339: SDPA compatibility crash with output_attentions=True (same root cause, different symptom)
flash_attn support? #359: request for flash attention support (directly blocked by this)
Environment
cc @resemble-ai — could a patch release be cut with this fix? It's already on master.