It would be helpful to be able to see a log of the characters added to the NLLB tokenizer during onboarding or preprocessing. The tokenization stats reports (CSV/XLS) indicate the number of characters added to the tokenizer, but they don't provide a list of the characters added or the corpus file where they occurred. If the onboarding script and preprocess/experiment scripts could provide this information in a log file, it would make it much easier to track down character usage in a project that may need to be updated.
Wildebeest reports can be used to review the overall list of characters used in a corpus file, but they are not NLLB-aware and don't specifically point out the non-NLLB characters in the corpus file.
It would be particularly helpful if the log of these characters included supporting details (not just the character). Some helpful additional details would be:
- Corpus file where the character occurred
- Unicode value for the character (and Unicode character name)
- Number of occurrences of the character in the corpus file
- Line number / vref ID for each occurrence (possibly capped at a maximum number of occurrences per character, since this detail could get long)