Fix raw text paragraph break normalization by kiankyars · Pull Request #4884 · unslothai/unsloth

kiankyars · 2026-04-07T00:39:52Z

Summary

preserve paragraph breaks when cleaning raw text
normalize horizontal whitespace without collapsing newlines
add a regression check for repeated CRLF paragraph separators
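
A minimal sketch of what the summary describes, not the PR's actual implementation. The helper name `normalize_horizontal_whitespace` is hypothetical, and normalizing CRLF to `\n` before collapsing whitespace is an assumption; only the `[^\S\n]+` pattern comes from the diff.

```python
import re


def normalize_horizontal_whitespace(text: str) -> str:
    """Hypothetical helper illustrating the idea; not the PR's clean_text."""
    # Assumption: convert CRLF and bare CR line endings to "\n" first,
    # so paragraph separators like "\r\n\r\n" become "\n\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # [^\S\n] matches any whitespace character except a newline, so runs of
    # spaces and tabs collapse to one space while newlines are left alone.
    return re.sub(r"[^\S\n]+", " ", text)


sample = "First  paragraph.\r\n\r\nSecond\tparagraph."
result = normalize_horizontal_whitespace(sample)
print(repr(result))
```

The blank line between the two paragraphs survives because `\n` is excluded from the character class.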

Testing

python3 tests/test_raw_text.py

gemini-code-assist · 2026-04-07T00:39:58Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa85bc3ee6


chatgpt-codex-connector · 2026-04-07T00:41:56Z

unsloth/dataprep/raw_text.py

+        text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
+        text = re.sub(r"[^\S\n]+", " ", text)


Normalize whitespace before stripping non-ASCII chars

clean_text now removes non-ASCII characters before whitespace normalization, so non-ASCII whitespace (for example NBSP \u00A0 from HTML/PDF text) is deleted instead of converted to a space. In this path, inputs like "hello\u00A0world" become "helloworld", which corrupts word boundaries and downstream tokenization; prior behavior preserved the separator because whitespace collapsing happened first.
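
The ordering issue can be reproduced in isolation. The sketch below uses the two patterns from the diff; the function names `strip_then_normalize` and `normalize_then_strip` are hypothetical and stand in for the two possible orderings. They are not taken from `raw_text.py`.

```python
import re

# Patterns copied from the diff under review.
STRIP_NON_ASCII = r"[^\x20-\x7E\n\t]"
COLLAPSE_HORIZONTAL_WS = r"[^\S\n]+"


def strip_then_normalize(text: str) -> str:
    # Order used in the reviewed commit: NBSP is non-ASCII, so it is deleted
    # before the whitespace pass ever sees it.
    text = re.sub(STRIP_NON_ASCII, "", text)
    return re.sub(COLLAPSE_HORIZONTAL_WS, " ", text)


def normalize_then_strip(text: str) -> str:
    # Order suggested by the review: for str patterns, \s covers Unicode
    # whitespace, so NBSP becomes a plain space before the ASCII filter runs.
    text = re.sub(COLLAPSE_HORIZONTAL_WS, " ", text)
    return re.sub(STRIP_NON_ASCII, "", text)


sample = "hello\u00A0world"
print(repr(strip_then_normalize(sample)))   # word boundary lost
print(repr(normalize_then_strip(sample)))   # word boundary kept
```

Swapping the two `re.sub` calls is one way to address the comment. Whether other non-ASCII whitespace should be handled the same way is left open here.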


Fix raw text paragraph break normalization

f91ed4c

kiankyars requested review from danielhanchen and rolandtannous as code owners April 7, 2026 00:39

[pre-commit.ci] auto fixes from pre-commit.com hooks

fa85bc3


chatgpt-codex-connector bot reviewed Apr 7, 2026

kiankyars wants to merge 2 commits into unslothai:main from kiankyars:fix/raw-text-clean-text-newlines
