Log the new characters added to the tokenizer during preprocessing #989

It would be helpful to see a log of the characters added to the NLLB tokenizer during onboarding or preprocessing. The tokenization stats reports (CSV/XLS) indicate the number of characters added to the tokenizer, but they don't list the characters themselves or the corpus files where they occurred. If the onboarding script and the preprocess/experiment scripts could write this information to a log file, it would be much easier to track down character usage in a project that may need to be updated.

Wildebeest reports can be used to review the overall list of characters used in a corpus file, but are not NLLB-aware and don't specifically point to the non-NLLB characters in the corpus file.

It would be particularly helpful if the log included supporting details for each character, not just the character itself. Useful details would be:
- Corpus file where the character occurred
- Unicode code point for the character (and Unicode character name)
- Number of occurrences of the character in the corpus file
- Line number / vref ID for each occurrence (possibly with a max limit for this particular detail)
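
To make the request concrete, here is a rough Python sketch of the kind of report described above. It is only an illustration, not the existing onboarding or preprocessing code. The names `NewCharRecord`, `find_new_chars`, and `format_report` are hypothetical, and `known_chars` is assumed to be a set of characters already covered by the tokenizer (for example, collected from its vocabulary before any new characters are added). Line numbers stand in for vref IDs here.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class NewCharRecord:
    """Details for one character that is not yet covered by the tokenizer."""
    char: str
    code_point: str          # e.g. "U+0253"
    name: str                # Unicode character name, or "<unnamed>"
    count: int = 0           # total occurrences in the corpus file
    line_numbers: List[int] = field(default_factory=list)  # capped sample


def find_new_chars(
    lines: Iterable[str],
    known_chars: Set[str],   # assumption: characters the tokenizer already handles
    max_lines: int = 5,      # cap on how many line numbers are kept per character
) -> Dict[str, NewCharRecord]:
    """Collect every non-whitespace character that is missing from known_chars."""
    records: Dict[str, NewCharRecord] = {}
    for line_no, line in enumerate(lines, start=1):
        for ch in line:
            if ch.isspace() or ch in known_chars:
                continue
            rec = records.get(ch)
            if rec is None:
                rec = NewCharRecord(
                    char=ch,
                    code_point=f"U+{ord(ch):04X}",
                    name=unicodedata.name(ch, "<unnamed>"),
                )
                records[ch] = rec
            rec.count += 1
            # Record each line once, and stop once the cap is reached.
            already_listed = bool(rec.line_numbers) and rec.line_numbers[-1] == line_no
            if not already_listed and len(rec.line_numbers) < max_lines:
                rec.line_numbers.append(line_no)
    return records


def format_report(corpus_file: str, records: Dict[str, NewCharRecord]) -> List[str]:
    """Render one tab-separated log line per character, most frequent first."""
    rows = sorted(records.values(), key=lambda r: (-r.count, r.char))
    return [
        "\t".join(
            [
                corpus_file,
                r.char,
                r.code_point,
                r.name,
                str(r.count),
                ",".join(str(n) for n in r.line_numbers),
            ]
        )
        for r in rows
    ]
```

Each returned line could be written to a log file next to the tokenization stats report, one block per corpus file. The cap on line numbers keeps the log readable when a character occurs thousands of times.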
