🐊 Gators

Lightning-fast data preprocessing and feature engineering for machine learning

What is Gators?

Gators is a data preprocessing and feature engineering library built on top of Polars, designed to streamline your entire ML workflow from raw data to production-ready models. It leverages Polars' multi-core processing for speed.

Built by the PSP Data Team at PayPal, Gators makes data preprocessing and feature engineering both faster and simpler.

⚑ Key Features

  • 🚀 Lightning Fast: Built on Polars for multi-core parallel processing
  • 🔄 Unified API: Consistent sklearn-style .fit() and .transform() interface
  • 📦 Production Ready: Deploy the same Python code from notebook to production
  • 🎯 Comprehensive: 75+ preprocessing transformers spanning cleaning, encoding, feature generation, imputation, discretization, and scaling
  • 🔗 Pipeline Support: Chain transformers with the Pipeline class
  • 🎓 Easy to Learn: If you know sklearn, the interface will be familiar

🛠️ What Can Gators Do?

🧹 Data Cleaning

Clean and prepare your data with powerful transformers:

  • CastColumns - Convert column data types
  • CorrelationFilter - Remove highly correlated features
  • DropColumns - Remove specified columns
  • DropConstantColumns - Remove columns with constant values
  • DropDuplicateColumns - Remove duplicate columns
  • DropDuplicateRows - Remove duplicate rows
  • DropHighNaNRatio - Remove columns with high missing value ratio
  • DropLowCardinality - Remove low cardinality columns
  • HighCardinalityFilter - Filter high cardinality features
  • OutlierFilter - Detect and filter outliers
  • RenameColumns - Rename columns
  • Replace - Replace values in data
  • VarianceFilter - Remove low variance features
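
To illustrate what a filter such as DropHighNaNRatio does conceptually, here is a minimal sketch in plain Python rather than the Gators API. The function name `columns_above_null_ratio`, the dict-of-lists layout, and the strict `>` comparison are assumptions made for the example; the library's own threshold semantics may differ.

```python
def columns_above_null_ratio(columns, threshold):
    """Return the names of columns whose share of None values exceeds threshold."""
    to_drop = []
    for name, values in columns.items():
        if not values:
            continue  # skip empty columns to avoid dividing by zero
        ratio = sum(v is None for v in values) / len(values)
        if ratio > threshold:  # strict comparison is an assumption
            to_drop.append(name)
    return to_drop


# Hypothetical data: "coupon" is missing in 3 of 4 rows
data = {
    "amount": [10.0, None, 12.5, 9.0],
    "coupon": [None, None, None, "A1"],
}
dropped = columns_above_null_ratio(data, threshold=0.5)
cleaned = {name: values for name, values in data.items() if name not in dropped}
```

The list of columns to drop is computed once and can then be reapplied to new data, which mirrors the fit/transform split used throughout the library.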

🔢 Categorical Encoding

Transform categorical variables with advanced encoding techniques:

  • BinaryEncoder - Binary representation encoding
  • CatBoostEncoder - CatBoost-style encoding
  • CountEncoder - Frequency-based encoding
  • LeaveOneOutEncoder - Leave-one-out encoding
  • OneHotEncoder - Classic one-hot encoding
  • OrdinalEncoder - Order-based encoding
  • RareCategoryEncoder - Handle rare categories intelligently
  • TargetEncoder - Target-based encoding for supervised learning
  • WOEEncoder - Weight of Evidence encoding
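
As a rough illustration of target-based encoding, the sketch below maps each category to the mean of the target observed for it during fitting. It is plain Python, not the Gators TargetEncoder; the function names and the choice to fall back to the global mean for unseen categories are assumptions for this example.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn a category -> mean(target) mapping plus the global target mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


def transform_target_encoding(categories, mapping, default):
    """Replace each category by its learned mean; unseen categories get default."""
    return [mapping.get(category, default) for category in categories]


train_categories = ["a", "a", "b", "b", "b"]
train_targets = [1, 0, 1, 1, 1]
mapping, global_mean = fit_target_encoding(train_categories, train_targets)

# "c" was never seen during fitting, so it receives the global mean (assumption)
encoded = transform_target_encoding(["a", "b", "c"], mapping, global_mean)
```

In practice, target-based encoders usually add smoothing or out-of-fold estimation to limit target leakage; this sketch omits both for brevity.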

🎯 Feature Generation - Numeric

Create powerful numeric features:

  • ComparisonFeatures - Generate comparison features
  • ConditionFeatures - Create conditional features
  • DistanceFeatures - Calculate distance features
  • GroupLagFeatures - Generate lag features by group
  • GroupScalingFeatures - Scale features within groups
  • GroupStatisticsFeatures - Calculate group statistics
  • IsNull - Generate null indicator features
  • MathFeatures - Apply mathematical operations (add, subtract, multiply, divide)
  • PlanRotationFeatures - Rotate features in feature space
  • PolynomialFeatures - Generate polynomial combinations
  • RatioFeatures - Create ratio features between columns
  • RowStatisticsFeatures - Calculate row-wise statistics
  • RuleFeatures - Apply custom business rules
  • ScalarMathFeatures - Apply scalar operations

📝 Feature Generation - String

Extract insights from text data:

  • CharacterStatistics - Extract character-level statistics
  • CombineFeatures - Combine string features
  • Contains - Check if string contains pattern
  • Endswith - Check if string ends with pattern
  • ExtractSubstring - Extract substring from text
  • InteractionFeatures - Generate string interaction features
  • Length - Calculate string length
  • Lower - Convert text to lowercase
  • NGram - Generate n-gram features
  • Occurrences - Count pattern occurrences
  • PatternDetector - Detect patterns in text
  • Split - Split strings
  • SplitExtract - Split and extract from strings
  • Startswith - Check if string starts with pattern
  • Upper - Convert text to uppercase
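
The string transformers above correspond to simple per-value operations. The sketch below shows a few of them in plain Python, not through the Gators API; `char_ngrams` and `string_features` are hypothetical helpers, and treating n-grams at the character level is an assumption (the library's NGram may work differently).

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def string_features(text, pattern):
    """Compute a few pattern-based features for a single string."""
    return {
        "length": len(text),
        "contains": pattern in text,
        "startswith": text.startswith(pattern),
        "endswith": text.endswith(pattern),
        "occurrences": text.count(pattern),  # non-overlapping matches
    }


trigrams = char_ngrams("gators", 3)
features = string_features("pay-pal-pay", "pay")
```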

📅 Feature Generation - DateTime

Unlock temporal patterns:

  • BusinessTimeFeatures - Business hours/days calculations
  • CyclicFeatures - Circular encoding for cyclical time features
  • DiffFeatures - Calculate time differences
  • DurationToDatetime - Convert duration to datetime
  • HolidayFeatures - Detect and encode holidays
  • OrdinalFeatures - Extract year, month, day, hour, etc.
  • TimeBinFeatures - Bin times into categories
  • TimeWindowFeatures - Generate time window features
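
Circular encoding is worth a short illustration: mapping a periodic value onto the unit circle keeps hour 23 close to hour 0, which a raw integer does not. The sketch below is plain Python using only the standard library; `cyclic_encode` is a hypothetical helper, not the Gators CyclicFeatures API.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic value (e.g. hour of day) to a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


hour_0 = cyclic_encode(0, 24)
hour_12 = cyclic_encode(12, 24)
hour_23 = cyclic_encode(23, 24)

# Euclidean distances in the encoded space
dist_23_to_0 = math.dist(hour_23, hour_0)   # neighbouring hours across midnight
dist_12_to_0 = math.dist(hour_12, hour_0)   # opposite sides of the clock
```

With this encoding, 23:00 and 00:00 end up near each other, while noon and midnight are as far apart as two points on the circle can be.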

🔄 Missing Value Imputation

Handle missing data intelligently:

  • BooleanImputer - Impute boolean columns
  • GroupByImputer - Group-based imputation strategies
  • NumericImputer - Impute numeric columns (mean, median, mode, constant)
  • StringImputer - Impute string columns (mode, constant)
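
The key idea behind a fitted imputer is that the fill value is learned from the training data and then reused unchanged on new data. A minimal sketch in plain Python follows; `MedianImputerSketch` is a hypothetical class written for this example and is not the Gators NumericImputer.

```python
from statistics import median


class MedianImputerSketch:
    """Learn the median of the observed values, then fill None with it."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.fill_value_ = median(observed)
        return self

    def transform(self, values):
        return [self.fill_value_ if v is None else v for v in values]


train = [1.0, None, 3.0, 10.0]
imputer = MedianImputerSketch().fit(train)

filled_train = imputer.transform(train)
# New data is filled with the median learned from train, not recomputed
filled_new = imputer.transform([None, 7.0])
```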

📊 Discretization

Convert continuous variables into bins:

  • CustomDiscretizer - Custom bin edges
  • EqualLengthDiscretizer - Equal-width binning
  • EqualSizeDiscretizer - Equal-frequency binning
  • GeometricDiscretizer - Geometric progression binning
  • KMeansDiscretizer - K-means clustering-based binning
  • QuantileDiscretizer - Quantile-based binning
  • TreeBasedDiscretizer - Decision tree-based binning
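
The difference between equal-width and equal-frequency binning is easiest to see on skewed data. The sketch below is plain Python, not the Gators discretizers; the helper names and the exact edge conventions (interior edges only, right-closed lookup via `bisect_right`) are assumptions for this example.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior bin edges that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior bin edges chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def assign_bins(values, edges):
    """Return the 0-based bin index of each value given sorted interior edges."""
    return [bisect_right(edges, v) for v in values]


values = [1, 2, 3, 4, 5, 100]  # one large outlier

width_edges = equal_width_edges(values, 3)
width_bins = assign_bins(values, width_edges)

freq_edges = equal_frequency_edges(values, 3)
freq_bins = assign_bins(values, freq_edges)
```

With the outlier present, equal-width binning puts almost every value in the first bin and leaves the middle bin empty, while equal-frequency binning spreads the values evenly.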

⚖️ Feature Scaling

Normalize your features:

  • ArcsinSquarerootScaler - Arcsine square root transformation
  • ArcsinhScaler - Inverse hyperbolic sine transformation
  • BoxCox - Box-Cox power transformation
  • LogScaler - Logarithmic scaling
  • MinmaxScaler - Min-max normalization
  • PowerScaler - Power transformation
  • StandardScaler - Standardization (z-score normalization)
  • YeoJonhson - Yeo-Johnson power transformation
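
For reference, three of the transformations listed above can be written in a few lines. This is a plain-Python sketch and not the Gators scalers; the function names are hypothetical, and using the population standard deviation for the z-score is an assumption.

```python
import math


def standard_scale(values):
    """Z-score scaling: subtract the mean, divide by the population std (assumption)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]  # a constant column would divide by zero


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def arcsinh_scale(values):
    """Inverse hyperbolic sine: log-like compression that also accepts zero and negatives."""
    return [math.asinh(v) for v in values]


raw = [2.0, 4.0, 6.0]
z_scores = standard_scale(raw)
unit_range = min_max_scale(raw)
compressed = arcsinh_scale([-1000.0, 0.0, 1000.0])
```

The arcsinh transformation is a common alternative to a logarithm when a feature contains zeros or negative values, since it is defined on the whole real line and is symmetric around zero.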

🔗 Pipeline

Chain all transformers together:

  • Pipeline - sklearn-compatible pipeline for chaining transformers
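
To make the chaining idea concrete, here is a minimal sketch of how a fit/transform pipeline can work. It is plain Python operating on lists and is not the Gators Pipeline implementation; `AddConstant`, `CenterOnMean`, and `PipelineSketch` are hypothetical classes written for this example.

```python
class AddConstant:
    """Stateless step: adds a fixed constant to every value."""

    def __init__(self, constant):
        self.constant = constant

    def fit(self, X):
        return self  # nothing to learn

    def transform(self, X):
        return [x + self.constant for x in X]


class CenterOnMean:
    """Stateful step: learns the mean in fit() and subtracts it in transform()."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class PipelineSketch:
    """Applies named steps in order; each step is fitted on the previous step's output."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X):
        for _, step in self.steps:
            X = step.fit(X).transform(X)
        return X

    def transform(self, X):
        for _, step in self.steps:
            X = step.transform(X)
        return X


pipe = PipelineSketch([("shift", AddConstant(1.0)), ("center", CenterOnMean())])
train_out = pipe.fit_transform([1.0, 2.0, 3.0])

# New data reuses the mean learned during fit_transform
new_out = pipe.transform([10.0])
```

Because every fitted statistic lives on the step objects, the same pipeline object can be applied to new data without refitting, which is what makes the notebook-to-production reuse possible.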

🚀 Quick Start

import polars as pl
from gators.data_cleaning import DropHighNaNRatio, VarianceFilter
from gators.encoders import OneHotEncoder
from gators.imputers import NumericImputer
from gators.scalers import StandardScaler
from gators.pipeline import Pipeline

# Load your data
X = pl.read_csv("data.csv")

# Build a preprocessing pipeline
pipeline = Pipeline([
    ('drop_nan', DropHighNaNRatio(threshold=0.5)),
    ('impute', NumericImputer(strategy='median')),
    ('variance', VarianceFilter(threshold=0.01)),
    ('encode', OneHotEncoder()),
    ('scale', StandardScaler())
])

# Fit and transform
X_processed = pipeline.fit_transform(X)

# Deploy the same pipeline in production!

📦 Installation

pip install gators

Or install from source:

git clone https://github.com/paypal/gators.git
cd gators
pip install -e .

📚 Documentation

For detailed documentation, tutorials, and API reference, visit:

https://paypal.github.io/gators/

🎯 Use Cases

Gators is well suited to:

  • Fraud Detection - Extensive feature engineering for anomaly detection
  • Risk Modeling - Create powerful predictive features
  • Customer Analytics - Transform complex customer data
  • Time Series - Rich datetime feature engineering
  • NLP Tasks - String feature extraction and encoding

🀝 Contributing

We welcome contributions! Please check out our contributing guidelines.

📄 License

Gators is licensed under the Apache License 2.0. See the LICENSE file for details.

🙏 Credits

Developed by the PSP Data Team at PayPal.


Built by data scientists, for data scientists

About

Gators is a package to handle model building with big data and fast real-time pre-processing, even for a large number of QPS, using only Python.
