Fix raw text paragraph break normalization #4884
kiankyars wants to merge 2 commits into unslothai:main
for more information, see https://pre-commit.ci
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa85bc3ee6
text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
text = re.sub(r"[^\S\n]+", " ", text)
Normalize whitespace before stripping non-ASCII chars
clean_text now strips non-ASCII characters before normalizing whitespace, so non-ASCII whitespace (for example, NBSP \u00A0 from HTML/PDF text) is deleted instead of being converted to a space. On this path, an input like "hello\u00A0world" becomes "helloworld", which corrupts word boundaries and downstream tokenization. The prior behavior preserved the separator because whitespace collapsing ran first.
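The ordering issue can be reproduced in isolation. The sketch below is not the PR's actual clean_text; the two helper names are hypothetical, and each one applies only the two substitutions quoted in the diff, in opposite orders. It relies on Python's \s matching Unicode whitespace such as NBSP for str patterns.

```python
import re


def strip_then_collapse(text: str) -> str:
    # Hypothetical helper: the order flagged in the review.
    # Non-ASCII characters are dropped first, so NBSP is deleted outright.
    text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
    # Collapse runs of non-newline whitespace to a single space.
    text = re.sub(r"[^\S\n]+", " ", text)
    return text


def collapse_then_strip(text: str) -> str:
    # Hypothetical helper: the order the review suggests.
    # NBSP counts as whitespace here, so it becomes a plain space...
    text = re.sub(r"[^\S\n]+", " ", text)
    # ...and survives the non-ASCII strip that follows.
    text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
    return text


sample = "hello\u00A0world"
print(repr(strip_then_collapse(sample)))  # 'helloworld'
print(repr(collapse_then_strip(sample)))  # 'hello world'
```

Both orders leave "\n\n" untouched, since the whitespace pattern excludes newlines, so swapping them should not affect the paragraph break handling this PR targets.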