'[UNK]' during tokenization when word starts with 'q' #34

@shubhanshu786

Hi,
I ran into an issue with the tokenizer when tokenizing text that starts with q. I found that q is missing from the vocab.txt file (Q is present).

tokenizer.tokenize('q qnm')
['[UNK]', '[UNK]']

Simple fix I tried: add q to the tokenizer with the add_tokens method (huggingface), but it does not produce the correct tokenization.

tokenizer.add_tokens(['q'])
tokenizer.tokenize('q qnm')
['q', 'q', 'n', '##m']

Here n should be ##n. Because q was added separately, the tokenizer treats it as a standalone token and splits it off from the rest of the word, so the remainder nm is tokenized as if it started a new word. This is not a correct solution down the line.
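To make the difference concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting. It is not the library's implementation: `toy_vocab`, `wordpiece`, `tokenize`, and `tokenize_with_added` are hypothetical names made up for this illustration, and `tokenize_with_added` only imitates the behaviour described above (the added token is cut out of the text first, and the leftover is tokenized as a fresh word).

```python
import re

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # any unmatched position turns the whole word into [UNK]
        pieces.append(match)
        start = end
    return pieces

def tokenize(text, vocab):
    """Whitespace split, then WordPiece on each word."""
    return [p for w in text.split() for p in wordpiece(w, vocab)]

def tokenize_with_added(text, vocab, added):
    """Assumed imitation of add_tokens: cut the added token out first."""
    out = []
    for word in text.split():
        for chunk in re.split("(" + re.escape(added) + ")", word):
            if chunk == added:
                out.append(chunk)                     # emitted as a standalone token
            elif chunk:
                out.extend(wordpiece(chunk, vocab))   # leftover treated as a new word
    return out

# Toy vocabulary (an assumption, not the real vocab.txt): has Q but no q.
toy_vocab = {"[UNK]", "Q", "n", "##n", "##m"}

missing_q = tokenize("q qnm", toy_vocab)
added_q = tokenize_with_added("q qnm", toy_vocab, "q")
vocab_q = tokenize("q qnm", toy_vocab | {"q"})

print(missing_q)  # ['[UNK]', '[UNK]']
print(added_q)    # ['q', 'q', 'n', '##m']
print(vocab_q)    # ['q', 'q', '##n', '##m']
```

Only the third variant keeps n as a continuation piece, because q is matched inside the WordPiece loop instead of being removed before it runs.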

Suggested solution:
Add q to the vocab.txt file itself, which results in the correct tokenization. (I added it at the end of vocab.txt and updated the model's embedding size. I am not sure how this will behave with the model down the line; I have yet to test it.)

tokenizer.tokenize('q qnm')
['q', 'q', '##n', '##m']
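A rough sketch of the vocab.txt edit, assuming the usual one-token-per-line layout where the line index is the token id. `load_vocab` and `append_token` are hypothetical helpers written for this example, and the file contents below are a made-up stand-in for the real vocab.txt. Appending at the end keeps every existing id unchanged; the new id equals the old vocabulary size, which is why the model's embedding matrix needs one extra row (that row would start out untrained).

```python
import os
import tempfile

def load_vocab(path):
    """Map each token to its line index (assumed to be its id)."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

def append_token(path, token):
    """Append token as the last line if absent; return its id."""
    vocab = load_vocab(path)
    if token in vocab:
        return vocab[token]
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "a", encoding="utf-8") as f:
        if text and not text.endswith("\n"):
            f.write("\n")  # avoid gluing the new token onto the last line
        f.write(token + "\n")
    return len(vocab)

with tempfile.TemporaryDirectory() as tmp:
    # Made-up stand-in for vocab.txt: Q is present, q is not.
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("[PAD]\n[UNK]\nQ\n##n\n##m\n")

    before = load_vocab(vocab_path)
    new_id = append_token(vocab_path, "q")
    after = load_vocab(vocab_path)

print(new_id)      # 5
print(len(after))  # 6, so the embedding matrix needs 6 rows
```

With the Hugging Face library, the embedding resize mentioned above is usually done with model.resize_token_embeddings; that call is not part of this sketch.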

I hope you will release an updated tokenizer vocab.txt file that includes the token q.
