Convert Preprocess subdirectory to polars; drop keras_preprocessing; add pytest #175

Open
seanv507 wants to merge 7 commits into reczoo:main from seanv507:preprocess_polars

Conversation

@seanv507

Contributor

I have converted the pandas functions in the preprocess subdirectory to polars.

keras_preprocessing is a deprecated library that only supports numpy <2.
The only function used from it is pad_sequences, so I rewrote that in tokenizer.py using polars methods and removed the dependency.
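
For readers unfamiliar with pad_sequences, here is a minimal pure-Python sketch of the behaviour a replacement has to reproduce. It is not the polars code from this PR. The defaults (`padding="pre"`, `truncating="pre"`, fill value 0) mirror the Keras function, and the list-of-lists return type is a simplification made for illustration.

```python
def pad_sequences(sequences, maxlen, value=0, padding="pre", truncating="pre"):
    """Pad or truncate every sequence to exactly `maxlen` items.

    Illustrative reference only: returns plain lists rather than an array.
    """
    if padding not in ("pre", "post") or truncating not in ("pre", "post"):
        raise ValueError("padding and truncating must be 'pre' or 'post'")

    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                # Drop items from the front, keeping the last `maxlen` items.
                seq = seq[len(seq) - maxlen:]
            else:
                # Drop items from the end, keeping the first `maxlen` items.
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


if __name__ == "__main__":
    print(pad_sequences([[1, 2, 3], [4]], maxlen=2))
```

With the defaults, `[[1, 2, 3], [4]]` and `maxlen=2` gives `[[2, 3], [0, 4]]`: long sequences lose their oldest items and short ones are left-padded.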

I added pytest so that I could write some unit tests.

I noticed that line endings are inconsistent, mixing CRLF and LF (#173); in my local git, at least, I saw ^M in my changed lines.
I checked the preprocessed output of Criteo and KKBox and see almost no differences:
https://huggingface.co/datasets/seanv507/KKBox_x1_pre/tree/d93eea9 (orig pandas version)
https://huggingface.co/datasets/seanv507/KKBox_x1_pre/tree/5de1e9c (polars version)

The only difference (#174) is that the ordering among categories with the same frequency is not the same (i.e. one would need to order by frequency and then alphabetically).
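
A minimal sketch of the tie-breaking idea, in plain Python rather than the project's polars code. The function name `build_vocab`, the `min_freq` parameter, and starting indices at 1 (leaving 0 free for padding) are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(tokens, min_freq=1):
    """Map tokens to indices, ordered by frequency (descending), then alphabetically.

    Hypothetical helper: the alphabetical secondary key makes the order of
    equally frequent tokens independent of the order in which they were seen.
    """
    counts = Counter(tokens)
    kept = [(tok, n) for tok, n in counts.items() if n >= min_freq]
    # Negating the count sorts by descending frequency; the token itself breaks ties.
    kept.sort(key=lambda item: (-item[1], item[0]))
    # Index 0 is left unused here on the assumption that it is reserved for padding.
    return {tok: idx for idx, (tok, _) in enumerate(kept, start=1)}


if __name__ == "__main__":
    print(build_vocab(["b", "a", "c", "a", "b", "d"]))
```

Because ties are resolved by the token value, shuffling the input leaves the mapping unchanged, which is what would make the pandas and polars outputs comparable byte for byte.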

From memory, tokenizing the Criteo dataset took 15 minutes with the pandas code vs 1 minute with polars (on a 64GB machine).

@xpai

xpai commented Jun 6, 2026

Member

Thanks for the PR! I've resolved the LF/CRLF issues, but the PR cannot be merged due to the transform issues I mentioned in #172 (comment).


2 participants