Hi,
I ran into an issue with the tokenizer when tokenizing text that starts with q. It turns out that lowercase q is missing from the vocab.txt file (uppercase Q is present), so any word beginning with q becomes [UNK]:
tokenizer.tokenize('q qnm')
['[UNK]', '[UNK]']
Simple fix I tried: adding q to the tokenizer with the add_tokens method (Hugging Face). This removes the [UNK] tokens but does not produce the correct tokenization:
tokenizer.add_tokens(['q'])
tokenizer.tokenize('q qnm')
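['q', 'q', 'n', '##m']
Here n should be ##n. Because q was added as a separate token, the tokenizer treats it as a standalone token and splits it off from the rest of the word, so the remainder nm is tokenized as if it were a new word. That is not a correct solution further down the line.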
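To illustrate why the two approaches differ, here is a minimal pure-Python sketch of greedy longest-match WordPiece. It is not the Hugging Face implementation: the toy vocabulary, the helper names wordpiece and tokenize, and the regex-based splitting on added tokens are all assumptions made for illustration. The only behaviour it tries to mimic is that added tokens are cut out of the text before the base vocabulary is consulted.

```python
import re

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word; whole word is unk on failure."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces

def tokenize(text, vocab, added=()):
    """Whitespace split, then (assumed) cut out added tokens, then WordPiece."""
    tokens = []
    for word in text.split():
        if added:
            pattern = "(" + "|".join(re.escape(t) for t in added) + ")"
            chunks = [c for c in re.split(pattern, word) if c]
        else:
            chunks = [word]
        for chunk in chunks:
            # Each leftover chunk is tokenized as if it started a new word.
            tokens.extend([chunk] if chunk in added else wordpiece(chunk, vocab))
    return tokens

toy_vocab = {"n", "m", "##n", "##m"}  # hypothetical vocabulary without "q"

missing = tokenize("q qnm", toy_vocab)                 # q absent everywhere
via_added = tokenize("q qnm", toy_vocab, added=("q",)) # q as an added token
in_vocab = tokenize("q qnm", toy_vocab | {"q"})        # q inside the base vocab

print(missing)    # ['[UNK]', '[UNK]']
print(via_added)  # ['q', 'q', 'n', '##m']
print(in_vocab)   # ['q', 'q', '##n', '##m']
```

In this sketch the ## prefix is only produced when a piece continues a word that WordPiece itself is splitting, which is why the added-token route loses it.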
Suggested solution:
Add q to the vocab.txt file itself, which results in the correct tokenization. (I appended it at the end of vocab.txt and updated the model's embedding size. I am not sure how this will behave with the model down the line; I have yet to test it.)
tokenizer.tokenize('q qnm')
['q', 'q', '##n', '##m']
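For anyone who wants to try the same workaround, the vocab.txt edit can be sketched with the standard library alone. The helper name append_token and the toy vocabulary contents are hypothetical; the sketch assumes the usual one-token-per-line layout where the line index is the token id. Appending at the end keeps all existing ids unchanged.

```python
import tempfile
from pathlib import Path

def append_token(vocab_path, token):
    """Append token to a one-token-per-line vocab file; return its id (line index)."""
    path = Path(vocab_path)
    text = path.read_text(encoding="utf-8")
    tokens = text.split("\n")
    if tokens and tokens[-1] == "":
        tokens.pop()  # drop the empty entry left by a trailing newline
    if token in tokens:
        return tokens.index(token)  # already present, leave the file alone
    needs_newline = bool(text) and not text.endswith("\n")
    with path.open("a", encoding="utf-8") as f:
        f.write(("\n" if needs_newline else "") + token + "\n")
    return len(tokens)

# Demo on a throwaway file with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    vocab_file = Path(tmp) / "vocab.txt"
    vocab_file.write_text("[PAD]\n[UNK]\nQ\n##n\n##m\n", encoding="utf-8")
    new_id = append_token(vocab_file, "q")
    again = append_token(vocab_file, "q")  # second call is a no-op
    lines = vocab_file.read_text(encoding="utf-8").split("\n")[:-1]

print(new_id, again, lines)
```

After such a change the model's embedding matrix needs one extra row. With Hugging Face transformers that is normally done via model.resize_token_embeddings(len(tokenizer)); the new row is freshly initialized, so the effect on model quality without further training is untested here.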
I hope you will release an updated tokenizer vocab.txt file that includes the token q.