Tokenizer encode ignores pre-tokenizer when using right to left (rtl) digit splits with add_tokens #1957
Description
Hello,
I am training a custom tokenizer with right-to-left (RTL) 3-digit splits, following this blog post. The tokenizer's vocabulary contains only digit groups and a few special tokens, so no training should be necessary.
If I write all the numbers from 0 to 999 to a file and train on it with the BPE trainer, the tokenizer works as expected. However, this becomes problematic when training on top of a text corpus: in that case, the natural way to guarantee that the digit-group tokens end up in the vocabulary is to add them with .add_tokens().
When doing so, pre_tokenize_str() correctly splits 1234567890 into ['1', '234', '567', '890'] (RTL), but encode() returns ['123', '456', '789', '0'] (LTR). The pre-tokenizer is bypassed during encoding for tokens added via add_tokens().
This suggests that added tokens are matched as literals before the pre-tokenizer runs, which is not documented for AddedToken or add_tokens. There is no warning or documented workaround for this interaction with Split pre-tokenizers, which makes it a silent correctness issue. It is unclear whether this is intended behavior or a bug. Either way, the question remains:
What is the correct way to guarantee specific tokens exist in the vocabulary while still having the pre-tokenizer respected during encoding, when training on a mixed text+number corpus?
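For reference, the split pattern itself does produce the RTL grouping when run through Python's own re module (I use a non-capturing group here so that re.findall returns whole matches rather than group captures):

```python
import re

# Same RTL 3-digit pattern as in the tokenizer below, but with a
# non-capturing group so re.findall returns the full matches.
RTL_3DIGIT_PATTERN = r"\d{1,3}(?=(?:\d{3})*\b)"

print(re.findall(RTL_3DIGIT_PATTERN, "1234567890"))
# ['1', '234', '567', '890']
```

So the discrepancy is not in the regex: the same pattern that drives the Split pre-tokenizer groups digits right-to-left.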
Minimal Example
"""Minimal reproduction of RTL digit tokenizer bug: pre_tokenizer correctly splits RTL but encode() returns LTR chunks."""
from tokenizers import Tokenizer, Regex, models, trainers, decoders
from tokenizers.pre_tokenizers import Sequence, Split, ByteLevel
from tokenizers import AddedToken, trainers
from transformers import PreTrainedTokenizerFast
# ── Config ──────────────────────────────────────────────
VOCAB_SIZE = 1024
MIN_FREQ = 1 # 1 works for the numbers
RTL_3DIGIT_PATTERN = r"\d{1,3}(?=(\d{3})*\b)"
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>", "<pad>"]
# ── Build tokenizer ───────────────────────────────────────────
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Sequence([
Split(pattern=Regex(RTL_3DIGIT_PATTERN), behavior="isolated", invert=False),
ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = decoders.ByteLevel()
# No training data — add digit tokens directly
trainer = trainers.BpeTrainer(
vocab_size=VOCAB_SIZE,
min_frequency=MIN_FREQ,
special_tokens=SPECIAL_TOKENS,
show_progress=True,
)
digit_tokens = [
AddedToken(str(i), single_word=False, lstrip=False, rstrip=False, normalized=False, special=False)
for i in range(1000) # 0..999 covers all 3-digit groups
]
tokenizer.add_tokens(digit_tokens)
# Wrap in HF fast tokenizer
hf_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
pad_token="<pad>",
)Issue
Then I get the following weird behavior:
print(hf_tokenizer.convert_ids_to_tokens(hf_tokenizer.encode("1234567890")))
# outputs: ['123', '456', '789', '0']
print(hf_tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str("1234567890"))
# outputs (correct): [('1', (0, 1)), ('234', (1, 4)), ('567', (4, 7)), ('890', (7, 10))]

encode("1234567890") should return ['1', '234', '567', '890'], consistent with what pre_tokenize_str produces.
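For comparison, here is a sketch of the trainer-based path that does work for me, i.e. getting the digit groups into the vocabulary via training rather than add_tokens(). Assumptions in this sketch: a synthetic corpus consisting of the strings 0 through 999, and a vocab_size of 2048 (larger than in the repro above) so BPE has room to merge every 3-digit group into a single token:

```python
from tokenizers import Regex, Tokenizer, decoders, models, trainers
from tokenizers.pre_tokenizers import Sequence, Split, ByteLevel

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = Sequence([
    Split(pattern=Regex(r"\d{1,3}(?=(\d{3})*\b)"), behavior="isolated", invert=False),
    ByteLevel(add_prefix_space=False, use_regex=False),
])
tok.decoder = decoders.ByteLevel()

# Train on every 3-digit group instead of calling add_tokens();
# vocab_size=2048 leaves room for all merges (10 digits, the
# intermediate two-digit tokens, 1000 three-digit tokens, specials).
trainer = trainers.BpeTrainer(
    vocab_size=2048,
    min_frequency=1,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    show_progress=False,
)
tok.train_from_iterator((str(i) for i in range(1000)), trainer=trainer)

print(tok.encode("1234567890").tokens)
# expected: ['1', '234', '567', '890'], consistent with pre_tokenize_str
```

Because the digit groups are learned as BPE merges instead of added tokens, encode() goes through the pre-tokenizer and the RTL grouping is preserved. But this workaround does not answer the question above for a mixed text+number corpus, where the trained vocabulary cannot be guaranteed to contain every group.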
Environment
- transformers==4.57.3
- tokenizers==0.22.1