Log the new characters added to the tokenizer during preprocessing #989

@mmartin9684-sil

Description

It would be helpful to be able to see a log of the characters added to the NLLB tokenizer during onboarding or preprocessing. The tokenization stats reports (CSV/XLS) indicate the number of characters added to the tokenizer, but don't provide a list of the characters added or the corpus file where each one occurred. If the onboarding script and the preprocess/experiment scripts could write this information to a log file, it would help track down character usage in a project that may need to be updated.

Wildebeest reports can be used to review the overall list of characters used in a corpus file, but are not NLLB-aware and don't specifically point to the non-NLLB characters in the corpus file.

It would be particularly helpful if the log of these characters included supporting details (not just the character). Some helpful additional details would be:

  • Corpus file where the character occurred
  • Unicode value for the character (and Unicode character name)
  • Number of occurrences of the character in the corpus file
  • Line number / vref ID for each occurrence (possibly with a max limit for this particular detail).
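
To make the requested details concrete, here is a minimal sketch of what such a log could collect. It is not the project's implementation: `CharReport`, `find_unknown_chars`, `format_report`, `known_chars`, and `max_lines` are hypothetical names, and the sketch assumes the set of characters the tokenizer already covers is available as a plain Python set (how that set is derived from the NLLB tokenizer is left out). Line numbers stand in for vref IDs, and whitespace is skipped by assumption.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class CharReport:
    """Details for one character that is not in the known set (hypothetical structure)."""
    char: str
    count: int = 0
    line_numbers: List[int] = field(default_factory=list)

    @property
    def codepoint(self) -> str:
        return f"U+{ord(self.char):04X}"

    @property
    def name(self) -> str:
        # Some code points (e.g. control characters) have no Unicode name.
        return unicodedata.name(self.char, "<unnamed>")


def find_unknown_chars(
    lines: Iterable[str], known_chars: Set[str], max_lines: int = 10
) -> Dict[str, CharReport]:
    """Count characters outside known_chars and record up to max_lines line numbers each."""
    reports: Dict[str, CharReport] = {}
    for line_no, line in enumerate(lines, start=1):
        for ch in line:
            # Assumption: whitespace is never a candidate for a new token.
            if ch in known_chars or ch.isspace():
                continue
            report = reports.setdefault(ch, CharReport(ch))
            report.count += 1
            already_listed = bool(report.line_numbers) and report.line_numbers[-1] == line_no
            if not already_listed and len(report.line_numbers) < max_lines:
                report.line_numbers.append(line_no)
    return reports


def format_report(corpus_name: str, reports: Dict[str, CharReport]) -> List[str]:
    """Render one tab-separated row per character, ordered by code point."""
    rows = []
    for r in sorted(reports.values(), key=lambda r: ord(r.char)):
        lines = ",".join(str(n) for n in r.line_numbers)
        rows.append(f"{corpus_name}\t{r.char}\t{r.codepoint}\t{r.name}\t{r.count}\t{lines}")
    return rows


if __name__ == "__main__":
    sample = ["abc ɓ\n", "ɓɓ ŋ\n"]  # stand-in for the lines of a corpus file
    for row in format_report("sample.txt", find_unknown_chars(sample, set("abc"))):
        print(row)
```

Each row carries the corpus file, the character, its Unicode value and name, the occurrence count, and a capped list of line numbers, matching the four details listed above. Capping only the line list (not the count) keeps the log small for very frequent characters while still reporting the true total.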

Metadata

  • Assignees: No one assigned
  • Project status: 🔖 Ready
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests