'[UNK]' during tokenization when word starts with 'q'

Hi,
I ran into an issue with the tokenizer when tokenizing text that starts with `q`. It turns out that `q` is missing from the `vocab.txt` file (`Q` is present).

> tokenizer.tokenize('q qnm')
> ['[UNK]', '[UNK]']
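
To illustrate why a single missing character turns the whole word into `[UNK]`, here is a minimal sketch of greedy longest-match-first WordPiece. The `wordpiece` function and the toy vocabulary are assumptions for illustration only; they are not the library's code or the real `vocab.txt`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece for one whitespace-free word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word becomes [UNK].
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical): lowercase "q" is absent, "Q" is present.
toy_vocab = {"Q", "n", "m", "##n", "##m"}

without_q = [p for w in "q qnm".split() for p in wordpiece(w, toy_vocab)]
with_q = [p for w in "q qnm".split() for p in wordpiece(w, toy_vocab | {"q"})]
print(without_q)
print(with_q)
```

Since no vocabulary entry matches at position 0 of `qnm`, the sketch gives up on the entire word instead of emitting `[UNK]` for just the first character.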

**Simple fix I tried**: Add `q` to the tokenizer with the `add_tokens` method (huggingface). This does not produce the correct tokenization.

> tokenizer.add_tokens(['q'])
> tokenizer.tokenize('q qnm')
> ['q', 'q', 'n', '##m']

Here `n` should be `##n`. Because `q` was added as a separate token, the tokenizer splits it off from the rest of the word and then tokenizes the remainder `nm` as if it were a new word. This is not a correct solution down the line.
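
The behaviour can be reproduced with a simplified model, assuming that added tokens are cut out of the text before WordPiece runs on the remaining chunks. The helper names and toy vocabulary below are hypothetical, and this is only a sketch of the effect, not the actual huggingface implementation.

```python
import re


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece for one whitespace-free word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while start < end:
            candidate = ("##" if start > 0 else "") + word[start:end]
            if candidate in vocab:
                break
            end -= 1
        if start == end:  # nothing matched at this position
            return [unk]
        pieces.append(candidate)
        start = end
    return pieces


def tokenize_with_added(text, vocab, added_tokens):
    """Cut the text at every added token, then run WordPiece on what is left."""
    alternatives = sorted(added_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in alternatives) + ")"
    out = []
    for chunk in re.split(pattern, text):
        if chunk in added_tokens:
            out.append(chunk)  # added tokens are emitted as-is
        else:
            for word in chunk.split():
                # The leftover chunk is treated as a fresh word, so its first
                # piece has no ## prefix.
                out.extend(wordpiece(word, vocab))
    return out


# Toy vocabulary (hypothetical): "q" is missing, as in the issue.
toy_vocab = {"n", "m", "##n", "##m"}
result = tokenize_with_added("q qnm", toy_vocab, {"q"})
print(result)
```

In this model the `nm` left over after removing `q` is tokenized from scratch, which is why `n` loses its `##` prefix.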

**Suggested solution**:
Add `q` directly to the `vocab.txt` file, so that it is treated as a regular vocabulary entry and the tokenization comes out correctly. (I appended it to the end of `vocab.txt` and updated the model's embedding size. I have not yet tested how this affects the model down the line.)

> tokenizer.tokenize('q qnm')
> ['q', 'q', '##n', '##m']
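
For reference, a minimal sketch of the vocab edit, assuming `vocab.txt` holds one token per line and that a token's id is its line index. The `append_token` helper and the toy file contents are hypothetical.

```python
import os
import tempfile


def append_token(vocab_path, token):
    """Append `token` to a one-token-per-line vocab file and return its id."""
    with open(vocab_path, encoding="utf-8") as f:
        content = f.read()
    tokens = content.split("\n")
    if tokens and tokens[-1] == "":
        tokens.pop()  # drop the empty entry produced by a trailing newline
    if token in tokens:
        return tokens.index(token)  # already present, nothing to do
    with open(vocab_path, "a", encoding="utf-8") as f:
        if content and not content.endswith("\n"):
            f.write("\n")  # make sure the new token starts on its own line
        f.write(token + "\n")
    return len(tokens)  # appended at the end, so the id is the old size


# Demo on a throwaway file with a toy vocabulary.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("[PAD]\n[UNK]\nQ\n##n\n##m\n")
    new_id = append_token(path, "q")
    repeat_id = append_token(path, "q")  # second call must not duplicate
    with open(path, encoding="utf-8") as f:
        final_tokens = f.read().split("\n")[:-1]

print(new_id, final_tokens)
```

After reloading the tokenizer from the edited file, the model side would need something like `model.resize_token_embeddings(len(tokenizer))`. The embedding row for the new id is untrained at that point, so some fine-tuning is presumably needed before `q` carries useful meaning.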

I hope you will release an updated tokenizer `vocab.txt` file that includes the token `q`.
