Tokenizer encode ignores pre-tokenizer when using right to left (rtl) digit splits with add_tokens #1957
Description
Hello,
I am training a custom tokenizer with right-to-left (RTL) 3-digit splits, following this blog post. The tokenizer's vocabulary contains only digit groups and a few special tokens, so no training should be necessary.
If I write all the numbers from 0 to 999 to a file and train on it with the BPE trainer, the tokenizer works as expected. However, this becomes problematic when training on top of a text corpus: in that case, the natural way to guarantee that the digit-group tokens end up in the vocabulary is to add them with .add_tokens().
When doing so, pre_tokenize_str() correctly splits 1234567890 into ['1', '234', '567', '890'] (RTL), but encode() returns ['123', '456', '789', '0'] (LTR). The pre-tokenizer is bypassed during encoding for tokens added via add_tokens().
This suggests that added tokens are matched as literals before the pre-tokenizer runs, which is not documented for AddedToken or add_tokens. There is no warning or documented workaround for this interaction with Split pre-tokenizers, which makes it a silent correctness issue. It is unclear whether this is intended behavior or a bug. Either way, the question remains:
What is the correct way to guarantee specific tokens exist in the vocabulary while still having the pre-tokenizer respected during encoding, when training on a mixed text+number corpus?
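For reference, the split pattern itself does produce the RTL grouping when run through Python's own re module (I use a non-capturing group here so that re.findall returns whole matches rather than group captures):

```python
import re

# Same RTL 3-digit pattern as in the tokenizer below, but with a
# non-capturing group so re.findall returns the full matches.
RTL_3DIGIT_PATTERN = r"\d{1,3}(?=(?:\d{3})*\b)"

print(re.findall(RTL_3DIGIT_PATTERN, "1234567890"))
# ['1', '234', '567', '890']
```

So the discrepancy is not in the regex: the same pattern that drives the Split pre-tokenizer groups digits right-to-left.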
Minimal Example
"""Minimal reproduction of RTL digit tokenizer bug: pre_tokenizer correctly splits RTL but encode() returns LTR chunks."""
from tokenizers import Tokenizer, Regex, models, trainers, decoders
from tokenizers.pre_tokenizers import Sequence, Split, ByteLevel
from tokenizers import AddedToken, trainers
from transformers import PreTrainedTokenizerFast
# ── Config ──────────────────────────────────────────────
VOCAB_SIZE = 1024
MIN_FREQ = 1 # 1 works for the numbers
RTL_3DIGIT_PATTERN = r"\d{1,3}(?=(\d{3})*\b)"
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>", "<pad>"]
# ── Build tokenizer ───────────────────────────────────────────
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Sequence([
Split(pattern=Regex(RTL_3DIGIT_PATTERN), behavior="isolated", invert=False),
ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = decoders.ByteLevel()
# No training data — add digit tokens directly
trainer = trainers.BpeTrainer(
vocab_size=VOCAB_SIZE,
min_frequency=MIN_FREQ,
special_tokens=SPECIAL_TOKENS,
show_progress=True,
)
digit_tokens = [
AddedToken(str(i), single_word=False, lstrip=False, rstrip=False, normalized=False, special=False)
for i in range(1000) # 0..999 covers all 3-digit groups
]
tokenizer.add_tokens(digit_tokens)
# Wrap in HF fast tokenizer
hf_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
unk_token="<unk>",
bos_token="<s>",
eos_token="</s>",
pad_token="<pad>",
)Issue
Then I get the following weird behavior:
print(hf_tokenizer.convert_ids_to_tokens(hf_tokenizer.encode("1234567890")))
# outputs: ['123', '456', '789', '0']
print(hf_tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str("1234567890"))
# outputs (correct): [('1', (0, 1)), ('234', (1, 4)), ('567', (4, 7)), ('890', (7, 10))]

encode("1234567890") should return ['1', '234', '567', '890'], consistent with what pre_tokenize_str produces.
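For comparison, here is a sketch of the trainer-based path that does work for me, i.e. getting the digit groups into the vocabulary via training rather than add_tokens(). Assumptions in this sketch: a synthetic corpus consisting of the strings 0 through 999, and a vocab_size of 2048 (larger than in the repro above) so BPE has room to merge every 3-digit group into a single token:

```python
from tokenizers import Regex, Tokenizer, decoders, models, trainers
from tokenizers.pre_tokenizers import Sequence, Split, ByteLevel

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = Sequence([
    Split(pattern=Regex(r"\d{1,3}(?=(\d{3})*\b)"), behavior="isolated", invert=False),
    ByteLevel(add_prefix_space=False, use_regex=False),
])
tok.decoder = decoders.ByteLevel()

# Train on every 3-digit group instead of calling add_tokens();
# vocab_size=2048 leaves room for all merges (10 digits, the
# intermediate two-digit tokens, 1000 three-digit tokens, specials).
trainer = trainers.BpeTrainer(
    vocab_size=2048,
    min_frequency=1,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    show_progress=False,
)
tok.train_from_iterator((str(i) for i in range(1000)), trainer=trainer)

print(tok.encode("1234567890").tokens)
# expected: ['1', '234', '567', '890'], consistent with pre_tokenize_str
```

Because the digit groups are learned as BPE merges instead of added tokens, encode() goes through the pre-tokenizer and the RTL grouping is preserved. But this workaround does not answer the question above for a mixed text+number corpus, where the trained vocabulary cannot be guaranteed to contain every group.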
Environment
- transformers==4.57.3
- tokenizers==0.22.1