
mtl_tokenizer.json missing Persian characters: آ (U+0622), أ (U+0623), إ (U+0625) #527

@faridnasiri

Summary

The Persian fine-tune tokenizer (mtl_tokenizer.json, 2352 BPE tokens) is missing three alef variants required for Persian text. This causes the model to produce silence or garbled output for any word containing these characters. The most critical is آ (ALEF WITH MADDA ABOVE, U+0622) — it appears in hundreds of common Persian words as the word-initial long /ɒː/ vowel.


Impact

| Input | Expected | Actual |
| -- | -- | -- |
| آب (āb = water) | /ɒːb/ | silence + "b" |
| آسمان (āsemān = sky) | /ɒːsemɒːn/ | only "s(e)mān" |
| آمدن (āmadan = to come) | /ɒːmædæn/ | "m(a)d(a)n" |
| آن (ān = that) | /ɒːn/ | silence |

Interestingly, the v3/multilingual grapheme tokenizer (grapheme_mtl_merged_expanded_v1.json, 2454 tokens) does have all three: آ=idx 2356, أ=idx 2353, إ=idx 2354. The model's text_emb is [2454, 1024] (from T3Config.multilingual()), so the embeddings for these characters already exist in the weight matrix — they're just unreachable from the Persian subword tokenizer.
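For anyone who wants to reproduce the gap, here is a minimal sketch of a vocabulary check. It assumes the tokenizer file follows the Hugging Face `tokenizers` JSON layout, with the token-to-id map under `model.vocab`; that layout and the helper names `load_vocab` and `missing_chars` are assumptions for illustration, not something confirmed by the project.

```python
import json

# Code points this issue reports as missing from the Persian tokenizer.
ALEF_VARIANTS = {
    "\u0622": "ALEF WITH MADDA ABOVE",
    "\u0623": "ALEF WITH HAMZA ABOVE",
    "\u0625": "ALEF WITH HAMZA BELOW",
}


def load_vocab(path):
    """Read the token -> id map.

    Assumption: the file uses the layout {"model": {"vocab": {...}}}.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)["model"]["vocab"]


def missing_chars(vocab, chars=ALEF_VARIANTS):
    """Return the characters that have no single-character entry in vocab."""
    return [c for c in chars if c not in vocab]


if __name__ == "__main__":
    # Toy vocab holding only the two entries this issue says exist.
    toy_vocab = {"\u0627": 1456, "\u0653": 1457}
    for ch in missing_chars(toy_vocab):
        print(f"U+{ord(ch):04X} {ALEF_VARIANTS[ch]} is not in the vocab")
```

Running `missing_chars(load_vocab("mtl_tokenizer.json"))` against the real file should list all three code points if the report above holds.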


What we tried

  1. Embedding expansion — Added new token IDs 2352–2354, expanded text_emb to [2357, 1024], initialized new rows from alef (idx 1456). Failed: The model's text_emb is actually [2454, 1024], so our new indices fell inside existing rows (idx 2352 = € Euro sign embedding). Result: آ pronounced as random character.

  2. Token injection at v3 positions — Mapped آ→idx 2356 directly in the patched tokenizer (the v3 spot where the model already has an آ embedding). Failed: Adding tokens to a BPE vocab without updating the merge table causes the tokenizer to fragment the token into random subword pieces. Result: آ pronounced as "م و ش".
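The first failure could have been caught by comparing the proposed token ids with the embedding matrix the checkpoint actually ships, rather than with the tokenizer size. The sketch below is a plain-Python sanity check under that assumption; `check_new_token_ids` is a hypothetical helper, and the sizes in the example are the ones quoted in this issue (the assignment of ids to individual characters is illustrative).

```python
def check_new_token_ids(new_ids, emb_rows, tokenizer_size):
    """Classify proposed token ids against the loaded embedding matrix.

    new_ids:        mapping of new token -> proposed id
    emb_rows:       number of rows in text_emb (first dimension of its shape)
    tokenizer_size: number of tokens the current tokenizer defines
    """
    report = {}
    for token, idx in new_ids.items():
        if idx < tokenizer_size:
            report[token] = "collides with an existing tokenizer id"
        elif idx < emb_rows:
            # The row exists and was trained for some other token.
            report[token] = "lands on an already-trained embedding row"
        else:
            report[token] = "needs a new embedding row"
    return report


if __name__ == "__main__":
    # 2352 tokenizer entries, but text_emb has 2454 rows.
    proposed = {"\u0622": 2352, "\u0623": 2353, "\u0625": 2354}
    for token, status in check_new_token_ids(proposed, 2454, 2352).items():
        print(f"U+{ord(token):04X}: {status}")
```

With these numbers every proposed id is reported as landing on an already-trained row, which matches the symptom described in attempt 1.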


Workaround we settled on

Decompose the precomposed Unicode character into its constituent parts, both of which exist in the tokenizer:

آ (U+0622) = ا (U+0627 ALEF) + ٓ (U+0653 MADDAH ABOVE)

Both are in the tokenizer: ا at idx 1456, MADDAH ABOVE at idx 1457. The decomposition is applied as a text-level replacement before tokenization:

text = text.replace("\u0622", "\u0627\u0653")  # ALEF WITH MADDA ABOVE → ALEF + MADDAH ABOVE
text = text.replace("\u0623", "\u0627")        # ALEF WITH HAMZA ABOVE → ALEF
text = text.replace("\u0625", "\u0627")        # ALEF WITH HAMZA BELOW → ALEF

This works because the model was trained on diacritized Persian text and understands that ALEF + MADDAH ABOVE = long /ɒː/. The result has the correct vowel quality (matching the ا in صالح).
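The same replacements can be wrapped in one reusable pre-tokenization step. This is a minimal sketch rather than code from the project; `decompose_alef` is a hypothetical helper name. Code points are written as escapes so the precomposed and decomposed forms cannot be confused visually.

```python
import unicodedata

ALEF = "\u0627"          # ARABIC LETTER ALEF
MADDAH_ABOVE = "\u0653"  # ARABIC MADDAH ABOVE (combining mark)

# One translation table, applied in a single pass over the text.
_ALEF_MAP = str.maketrans({
    "\u0622": ALEF + MADDAH_ABOVE,  # ALEF WITH MADDA ABOVE -> ALEF + MADDAH
    "\u0623": ALEF,                 # ALEF WITH HAMZA ABOVE -> bare ALEF
    "\u0625": ALEF,                 # ALEF WITH HAMZA BELOW -> bare ALEF
})


def decompose_alef(text):
    """Rewrite the three alef variants into characters the tokenizer knows."""
    return text.translate(_ALEF_MAP)


if __name__ == "__main__":
    word = "\u0622\u0628"  # آب (water), with precomposed U+0622
    fixed = decompose_alef(word)
    print([f"U+{ord(c):04X}" for c in fixed])
    # For U+0622 the mapping matches Unicode canonical decomposition (NFD).
    print(unicodedata.normalize("NFD", "\u0622") == ALEF + MADDAH_ABOVE)
```

Two caveats. First, full NFD is not a drop-in replacement: it would turn أ and إ into ALEF plus a combining hamza (U+0654 / U+0655) instead of bare ALEF. Second, NFC recomposes ALEF + MADDAH ABOVE back into U+0622, so this step has to run after any NFC normalization in the text pipeline, not before it.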


Suggested proper fix

  1. Add آ, أ, إ to mtl_tokenizer.json with proper BPE merge rules trained on Persian text
  2. The model's text_emb is already [2454, 1024] with trained embeddings at positions 2353, 2354, 2356 — these should be usable as-is
  3. The merge table needs updating so the tokenizer can correctly tokenize words containing these characters as single coherent subwords rather than fragmenting them

Versions

  • Model: hootan09/ChatterBox  t3_fa.safetensors
  • Tokenizer: mtl_tokenizer.json (2352 tokens)
  • chatterbox-tts: 0.1.7
  • Tested on: RTX 5060 Ti (16 GB), CUDA 12.x

Note: The same fix applies to the Thomcles mirror (Thomcles/Chatterbox-TTS-Persian-Farsi) — the files are MD5-identical.
