A production-oriented credit risk modeling workflow with full MLflow experiment tracking, feature engineering, model selection, and threshold tuning.
This project builds a machine learning pipeline to predict the probability of default for loan applicants. It focuses on practical model development and experiment management, using MLflow to track parameters, metrics, and model artifacts across multiple experiments.
This project is:
- A complete credit-scoring workflow from raw data to trained model
- An MLflow-tracked experiment suite with reproducible runs
- A baseline for risk modeling and cost-sensitive decision thresholds
This project is not:
- A production-ready scoring service
- A real-time inference system
- A fairness or regulatory compliance audit
MLflow is used throughout the modeling workflow to:
- Track experiment parameters and metrics
- Compare candidate models (baseline and advanced)
- Store trained artifacts and assessment outputs
- Make model selection decisions reproducible
The notebooks show how each experiment is logged and compared in MLflow.
```
data/            # Raw and output datasets (CSV tracked with Git LFS)
models/          # Model artifacts (downloaded from Hugging Face)
notebooks/       # End-to-end workflow notebooks
src/             # Training script and utilities
pyproject.toml   # Dependencies
install.sh       # Helper to set up the environment
```
`data/raw/` holds the original CSVs tracked via Git LFS. `data/output/` is for derived datasets created by notebooks or `src/train.py`. `models/` stores the model downloaded from Hugging Face. `mlruns/` is the local MLflow tracking directory (kept with a `.keep` placeholder).
- The raw CSVs are versioned with Git LFS.
- Derived datasets are written to `data/output/` and should not be committed.
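Git LFS tracking of this kind is normally configured in `.gitattributes`; a hypothetical entry covering the raw CSVs might look like:

```
data/raw/*.csv filter=lfs diff=lfs merge=lfs -text
```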
- Data prep (`01_data_preparation.ipynb`)
  - Feature selection and cleaning
  - Train/test split and dataset export
- Baseline experiments (`02_mlflow_experiments.ipynb`)
  - Baseline models and MLflow logging
  - Initial metric comparisons
- Algorithmic models (`03_model_comparison.ipynb`)
  - RandomForest, XGBoost, CatBoost comparisons
  - Cross-validation and MLflow tracking
- Hyperparameter optimization (`04_hyperparameter_optimization.ipynb`)
  - Optuna tuning
  - Threshold optimization for cost-sensitive decisions
  - Best model tracking in MLflow
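Cost-sensitive threshold tuning of the kind used in the last notebook can be sketched as a grid search over candidate thresholds that minimizes expected misclassification cost. The false-negative and false-positive costs below are illustrative assumptions, not the project's calibrated values:

```python
import numpy as np

def best_threshold(y_true, y_prob, cost_fn=5.0, cost_fp=1.0):
    """Pick the decision threshold minimizing total misclassification cost.

    cost_fn: assumed cost of approving a loan that defaults (false negative).
    cost_fp: assumed cost of rejecting a good applicant (false positive).
    """
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        costs.append(cost_fn * fn + cost_fp * fp)
    return thresholds[int(np.argmin(costs))]

# Toy example with well-separated default probabilities.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.9])
t = best_threshold(y_true, y_prob)
```

Because defaults are usually far costlier than rejections, the optimal threshold typically lands below the default 0.5; the chosen threshold and cost ratio are natural things to log as MLflow parameters.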
A lightweight training script replaces the old 00_* notebooks:

```
python src/train.py
```

It builds compact feature datasets and trains a baseline XGBoost model.
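The script itself isn't reproduced here; as a rough illustration of a comparable baseline on synthetic data, using scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the compact feature dataset built by the script.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# GradientBoostingClassifier stands in for the project's XGBoost baseline.
clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```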
The model is hosted publicly on Hugging Face and downloaded on demand:

```
python src/download_model.py
```

Python usage:
```python
from huggingface_hub import hf_hub_download
import joblib

path = hf_hub_download(
    repo_id="dworsleytonks/credit-scoring-xgb",
    filename="credit_scoring_xgb.pkl",
    local_dir="models",
    local_dir_use_symlinks=False,
)
model = joblib.load(path)
```

MIT License