A machine learning project predicting 30-day hospital readmission rates for BPJS (Indonesia's national health insurance) patients. Built with a production-grade ML pipeline, interactive dashboards, and web applications for model inference.
Hospital readmission is a critical healthcare indicator that impacts patient outcomes and healthcare costs. This project develops a predictive model using historical patient data from BPJS to identify high-risk patients who are likely to be readmitted within 30 days of discharge.
- Build a predictive model for 30-day readmission risk
- Analyze patient demographics, visit patterns, and clinical factors
- Identify key drivers of readmission
- Provide accessible interfaces for model inference and data exploration
- Source: Data Sample BPJS (Badan Penyelenggara Jaminan Sosial)
- Data Types:
- Peserta (Patient Demographics): Age, gender, marital status, insurance class, location
- FKTP (Primary Care Data): Outpatient visits, diagnoses, referral patterns
- FKRTL (Hospital Inpatient Data): Hospital admissions, discharge status, diagnoses, length of stay
- Target Variable: `readmitted_30d` (binary: readmission within 30 days)
- Time Period: 2021 data
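The label can be derived by comparing each discharge date against the same patient's next admission. A minimal sketch with pandas; the column names (`patient_id`, `admission_date`, `discharge_date`) are illustrative, not the actual FKRTL schema:

```python
import pandas as pd

# Toy inpatient records; real data comes from the FKRTL tables.
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "admission_date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-01", "2021-05-01"]),
    "discharge_date": pd.to_datetime(
        ["2021-01-10", "2021-01-25", "2021-02-07", "2021-05-06"]),
})

visits = visits.sort_values(["patient_id", "admission_date"])
# Days from this discharge to the same patient's next admission
next_admission = visits.groupby("patient_id")["admission_date"].shift(-1)
gap_days = (next_admission - visits["discharge_date"]).dt.days
# Last visit per patient has no next admission (NaN), which compares False
visits["readmitted_30d"] = (gap_days <= 30).astype(int)
```

Patient 1's first stay is followed by a new admission 10 days after discharge, so it is labeled positive; all other rows are negative.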
Interact with the model and explore the data through these applications:
| Application | URL | Type |
|---|---|---|
| EDA Dashboard | bpjs-eda-report.streamlit.app | Streamlit |
| Model Inference App | bpjs-next-app.vercel.app | Next.js |
- Overview: KPI cards, readmission distribution, dataset statistics
- Patient Demographics: Gender, age, marital status, insurance class, patient segments
- Hospital Visits (FKRTL): Admission trends, visit duration, top diagnoses, severity analysis
- Readmission Analysis: Readmission rates by patient segments, temporal trends
- Cost Analysis: Billing distribution, cost by diagnosis, high-cost cases
- Geographic Analysis: Province-level visit counts, readmission heatmap with choropleth
- Primary Care (FKTP): Visit trends, referral patterns, FKTP utilization
- Real-time model predictions
- Patient risk scoring
- Individual prediction explanations
- Responsive UI for mobile and desktop use
hospital_readmission_prediction/
├── data/
│ ├── raw/ # Original raw data files
│ ├── cleaned/ # Cleaned and standardized data
│ ├── interim/ # Intermediate processing files
│ ├── processed/ # Final train/test splits
│ └── feature_store/ # Feature matrices
├── src/
│ ├── data_loading.py # Load data from GCS
│ ├── data_validation.py # Great Expectations validation
│ ├── data_cleaning.py # Standardize column names, handle missing values
│ ├── feature_engineering.py # Create features, scaling, encoding
│ ├── model_training.py # Train XGBoost classifier
│ ├── model_evaluation.py # Evaluate metrics, confusion matrix
│ ├── model_inference.py # Model prediction pipeline
│ └── utils.py # Helper functions
├── notebooks/
│ ├── 00_data_loading.ipynb # Load and explore raw data
│ ├── 01_data_validation.ipynb # Data quality checks
│ ├── 02_data_cleaning.ipynb # Data preprocessing
│ ├── 03_eda.ipynb # Exploratory data analysis
│ ├── 04_feature_engineering.ipynb # Feature creation & engineering
│ ├── 05_baseline_models.ipynb # Baseline model comparison
│ ├── 06_xgboost_optimization.ipynb # Hyperparameter tuning
│ ├── 07_imbalance_handling.ipynb # Class imbalance techniques
│ └── 08_business_metrics.ipynb # Business impact & metrics
├── dashboard/ # Streamlit dashboard
├── gx/ # Great Expectations validation configs
├── artifacts/ # Model artifacts (model, scaler, encoders, metrics)
├── api/ # FastAPI inference server
├── dvc.yaml # DVC pipeline configuration
├── params.yaml # Model parameters and paths
└── requirements.txt # Python dependencies
The project uses Data Version Control (DVC) to orchestrate the ML pipeline:
data_loading → data_validation → data_cleaning → feature_engineering → model_training → model_evaluation
Pipeline Stages:
- data_loading: Fetch raw data from Google Cloud Storage (GCS)
- data_validation: Validate data quality using Great Expectations
- data_cleaning: Standardize column names, handle missing values
- feature_engineering: Create features, split train/test by time, scale & encode
- model_training: Train XGBoost classifier with optimized hyperparameters
- model_evaluation: Generate confusion matrix and metrics
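The stage graph above is declared in `dvc.yaml`. A minimal sketch of two stages (commands, paths, and stage bodies here are illustrative, not the project's actual configuration):

```yaml
stages:
  data_loading:
    cmd: python src/data_loading.py
    deps:
      - src/data_loading.py
    outs:
      - data/raw/
  data_cleaning:
    cmd: python src/data_cleaning.py
    deps:
      - src/data_cleaning.py
      - data/raw/
    outs:
      - data/cleaned/
```

DVC uses the `deps`/`outs` declarations to build the dependency graph, so `dvc repro` re-runs only stages whose inputs changed.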
- XGBoost: Gradient boosting classifier for binary readmission prediction
- scikit-learn: Preprocessing, scaling, encoding, model evaluation
- pandas & NumPy: Data manipulation and numerical computing
- Featuretools: Automated feature engineering (Deep Feature Synthesis)
- Great Expectations: Data quality validation rules and profiling
- pandas-profiling: Data profiling and exploratory analysis
- DVC (Data Version Control): ML pipeline orchestration and reproducibility
- MLflow: Experiment tracking and model versioning
- Optuna: Hyperparameter optimization with Bayesian search
- DagsHub: Centralized experiment management
- imbalanced-learn (imblearn): SMOTE, undersampling, oversampling techniques
- SHAP: Shapley Additive exPlanations for model interpretability
- Streamlit: Interactive EDA dashboard and data exploration
- Next.js: Modern web app for model inference and predictions
- Plotly: Interactive visualizations and charts
- Google Cloud Storage (GCS): Data storage and retrieval
- Vercel: Hosting for Next.js application
- Streamlit Cloud: Hosting for Streamlit dashboard
- FastAPI: High-performance Python API for model serving
- Uvicorn: ASGI web server
- Geopandas: Geographic data analysis for province-level insights
The project engineers two groups of features, then encodes and scales them:

- Temporal Features:
  - Time-based train/test split (80% training cutoff)
  - Days between consecutive visits
  - Visit duration (length of stay)
  - Visit frequency per patient
- Patient-Level Features:
  - Demographics: age, gender, marital status
  - Insurance & socioeconomic: insurance class, patient segment
  - Healthcare utilization: visit counts, top diagnoses, primary care patterns
  - Geographic factors: province, district
- Encoding & Scaling:
  - `LabelEncoder` for categorical variables
  - `StandardScaler` for numerical features
  - Artifacts saved for production inference
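The fit-and-persist pattern behind "artifacts saved for production inference" can be sketched as follows; the toy frame, values, and artifact filename are illustrative, not the project's real schema:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "insurance_class": ["KELAS 1", "KELAS 3", "KELAS 2", "KELAS 3"],
    "age": [34, 61, 47, 52],
})

# Fit transformers on training data only
encoder = LabelEncoder().fit(df["insurance_class"])
scaler = StandardScaler().fit(df[["age"]])

df["insurance_class"] = encoder.transform(df["insurance_class"])
df[["age"]] = scaler.transform(df[["age"]])

# Persist the fitted transformers so inference reuses the exact same mapping
path = os.path.join(tempfile.gettempdir(), "preproc_demo.joblib")
joblib.dump({"encoder": encoder, "scaler": scaler}, path)
```

At inference time the same `joblib.load` restores the encoder and scaler, guaranteeing that production inputs are transformed identically to training data.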
Why XGBoost?

- Handles both numerical and categorical features effectively
- Captures non-linear relationships and feature interactions
- Robust to class imbalance via the `scale_pos_weight` parameter
- Fast training and inference
- Built-in feature importance for interpretability
- Empirically strong performance on healthcare datasets
Tuned using Optuna with cross-validation:

- `max_depth`: Tree depth control
- `learning_rate`: Gradient boosting step size
- `n_estimators`: Number of boosting rounds
- `scale_pos_weight`: Class weight for imbalanced data
- Regularization: `reg_alpha`, `reg_lambda`, `gamma`
- Precision & Recall: Handle class imbalance
- AUC-PR: Focus on minority class (readmitted patients)
- Confusion Matrix: TP, FP, FN, TN analysis
- Business Metrics: Cost impact, patient risk segmentation
Challenge: Readmission events are rare (~10-15% of cases), biasing the model toward the negative class.
Solutions Implemented:
- `scale_pos_weight` in XGBoost to penalize false negatives
- SMOTE (Synthetic Minority Over-sampling Technique) for balanced training sets
- AUC-PR metric instead of accuracy (more sensitive to minority class)
- Stratified cross-validation to maintain class distribution
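A common convention, which this sketch assumes, is to set `scale_pos_weight` to the ratio of negative to positive training examples so errors on the rare readmission class cost proportionally more; the label counts below are illustrative:

```python
import numpy as np

# Illustrative training labels with a ~13% positive (readmission) rate
y_train = np.array([0] * 870 + [1] * 130)

neg, pos = np.bincount(y_train)
scale_pos_weight = neg / pos
print(scale_pos_weight)
```

The resulting value (870 / 130, roughly 6.7 here) is then passed to the XGBoost classifier as `scale_pos_weight`.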
Challenge: Three separate datasets (Peserta, FKTP, FKRTL) with complex relationships via patient ID and visit IDs.
Solutions Implemented:
- Multi-step data cleaning with consistent column naming across files
- Temporal alignment of records by patient visit dates
- Feature engineering using Featuretools for automated multi-table feature creation
- Data validation rules to detect linking errors and missing values
Challenge: Using future information to predict current events would violate causality.
Solutions Implemented:
- Time-based train/test split (80% quantile cutoff on discharge dates)
- Careful feature engineering to only use historical information
- No future diagnosis codes or visit data in feature set
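The 80%-quantile temporal split can be sketched as follows; the `discharge_date` column and date range are illustrative:

```python
import pandas as pd

# Toy visit-level frame ordered in time
df = pd.DataFrame({
    "discharge_date": pd.date_range("2021-01-01", periods=100),
    "feature": range(100),
})

# Cutoff at the 80th percentile of discharge dates
cutoff = df["discharge_date"].quantile(0.8)
train = df[df["discharge_date"] <= cutoff]
test = df[df["discharge_date"] > cutoff]
```

Because every test record is discharged strictly after every training record, features computed on the training window cannot peek at future events.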
Challenge: Many patients have incomplete healthcare records, sparse diagnoses, or missing demographics.
Solutions Implemented:
- Systematic imputation strategies (mean for numerical, mode for categorical)
- Feature selection to remove low-variance or highly sparse features
- Great Expectations validation to track data quality metrics
- Robust encoding for missing categorical values
Challenge: Readmission rates vary significantly across provinces due to healthcare infrastructure differences.
Solutions Implemented:
- Province-level analysis in EDA dashboard
- Geographic features (province, district) in model
- Stratified evaluation by region to ensure model fairness
Challenge: Healthcare models must be explainable for clinical adoption.
Solutions Implemented:
- XGBoost feature importance scores
- SHAP analysis for individual prediction explanations
- Confusion matrix visualization for clinical threshold validation
- Business metrics dashboard showing patient risk segments
Challenge: Making predictions accessible to non-technical users.
Solutions Implemented:
- FastAPI backend for efficient inference
- Streamlit dashboard for data exploration without coding
- Next.js web app for modern, responsive UI
- Model artifacts (scaler, encoders) stored for reproducible predictions
Challenge: ML pipelines are complex with many interdependent steps; easy to break reproducibility.
Solutions Implemented:
- DVC for pipeline orchestration and versioning
- params.yaml for centralized configuration
- MLflow for experiment tracking and model registry
- Documented dependency graph (dvc.yaml)
- Version-controlled code and notebooks
- Python 3.9+
- Google Cloud credentials (for data loading from GCS)
- Git & DVC
1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/hospital_readmission_prediction.git
   cd hospital_readmission_prediction
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/Scripts/activate  # Windows (Git Bash)
   # or
   source venv/bin/activate      # macOS/Linux
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Set up credentials (if using cloud data)

   ```bash
   # Add your GCS service account key
   export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-key.json"
   ```
```bash
# Run entire DVC pipeline
dvc repro

# Or run individual stages
dvc repro --single-item data_loading
dvc repro --single-item data_validation
# ... etc

# View pipeline DAG
dvc dag
```

To run the Streamlit dashboard locally:

```bash
cd dashboard
pip install -r requirements.txt
streamlit run app.py
```

The dashboard will open at http://localhost:8501
```bash
cd api
pip install -r requirements.txt
uvicorn main:app --reload --port 8000
```

API documentation: http://localhost:8000/docs
XGBoost Classifier Results (on test set):
- ROC-AUC Score: 0.956 -> Excellent discrimination between readmitted and non-readmitted patients
- Recall (Sensitivity): 92.3% -> Identifies 923 out of 1000 actual readmission cases
- Precision: 45.1% -> Of predicted readmissions, 451 are true positives per 1000 predictions
- F1-Score (Class 1): 0.606 -> Balanced metric accounting for precision-recall tradeoff
- Macro F1-Score: 0.756 -> Overall performance across both classes
- Accuracy: 84.7% -> Correct predictions on overall dataset
Based on test set of 5,530 patients:
| | Predicted No Readmission | Predicted Readmission | Total |
|---|---|---|---|
| Actual No Readmission | 4,037 (TN) | 790 (FP) | 4,827 |
| Actual Readmission | 54 (FN) | 649 (TP) | 703 |
| Total | 4,091 | 1,439 | 5,530 |
Key Insights:
- High Recall (92.3%): The model catches 92% of patients at risk of readmission, critical for clinical early intervention
- Lower Precision (45.1%): Among patients flagged as high-risk, only 45% will actually be readmitted. This is expected and acceptable for healthcare risk models (better to over-predict and prevent harm than miss at-risk patients)
- True Positives: 649 patients correctly identified as readmission risk
- False Negatives: Only 54 missed readmission cases -> very low, prioritizing patient safety
- False Positives: 790 patients flagged as high-risk but did not readmit -> acceptable overhead for preventive care programs
- True Negatives: 4,037 patients correctly identified as low-risk
This model is well-suited for high-sensitivity screening scenarios where:
- Missing a readmission case (false negative) is more costly than over-flagging low-risk patients
- Early intervention programs can tolerate some false positives
- The 92% recall ensures most at-risk patients are identified despite moderate precision
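The headline metrics above can be re-derived directly from the confusion matrix cells, which is a useful consistency check:

```python
# Cell counts from the confusion matrix reported above
tn, fp, fn, tp = 4037, 790, 54, 649

recall = tp / (tp + fn)                      # sensitivity on readmissions
precision = tp / (tp + fp)                   # share of flags that readmit
accuracy = (tp + tn) / (tn + fp + fn + tp)   # overall correct predictions
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 3), round(precision, 3),
      round(accuracy, 3), round(f1, 3))
# -> 0.923 0.451 0.847 0.606
```

These match the reported recall (92.3%), precision (45.1%), accuracy (84.7%), and class-1 F1 (0.606).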
Key insights from EDA:
- Readmission rates vary by patient demographics and geographic location
- Specific diagnoses show higher readmission risk
- Length of stay correlates with readmission probability
- Seasonal and temporal patterns in hospital utilization
- Geographic disparities in healthcare access and readmission rates
See dashboard README for interactive exploration.
The notebooks/ folder contains a complete narrative of the project:
| Notebook | Purpose |
|---|---|
| `00_data_loading.ipynb` | Load and explore raw data from GCS |
| `01_data_validation.ipynb` | Data quality checks and profiling |
| `02_data_cleaning.ipynb` | Standardize names, handle missing values |
| `03_eda.ipynb` | Statistical analysis and visualizations |
| `04_feature_engineering.ipynb` | Create and engineer features |
| `05_baseline_models.ipynb` | Compare baseline models (Logistic, RF, XGB) |
| `06_xgboost_optimization.ipynb` | Hyperparameter tuning with Optuna |
| `07_imbalance_handling.ipynb` | Test SMOTE and resampling techniques |
| `08_business_metrics.ipynb` | Business impact and risk segmentation |
All configuration is in params.yaml:
- Data paths and GCS locations
- Model hyperparameters (learning_rate, max_depth, etc.)
- Preprocessing parameters (scaling, encoding)
- Pipeline stage configurations
- `GOOGLE_APPLICATION_CREDENTIALS`: Path to GCS service account key
- `MLFLOW_TRACKING_URI`: MLflow server URL (optional)
- `DAGSHUB_USER_TOKEN`: DagsHub token for experiment tracking (optional)
Data quality is ensured through Great Expectations configurations:
```bash
# Run validation
great_expectations checkpoint run checkpoint_fkrtl
great_expectations checkpoint run checkpoint_fktp
great_expectations checkpoint run checkpoint_peserta

# View validation reports
open gx/uncommitted/data_docs/local_site/index.html
```

Validation checks include:

- Column presence and data types
- Missing value thresholds
- Statistical bounds on numerical columns
- Categorical value whitelist
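The four rule families above can be expressed as plain-pandas checks. The project itself encodes them as Great Expectations checkpoints; this is only a dependency-free analogue, and the toy frame, columns, and thresholds are illustrative:

```python
import pandas as pd

# Toy frame standing in for a cleaned FKRTL extract
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "length_of_stay": [3, 5, 2, 8],
})

# Column presence and data types
assert {"gender", "length_of_stay"} <= set(df.columns)
assert pd.api.types.is_numeric_dtype(df["length_of_stay"])

# Missing value threshold (at most 5% nulls)
assert df["length_of_stay"].isna().mean() <= 0.05

# Statistical bounds on numerical columns
assert df["length_of_stay"].between(0, 365).all()

# Categorical value whitelist
assert df["gender"].isin(["M", "F"]).all()
```

Great Expectations adds profiling, HTML data docs, and checkpoint orchestration on top of this kind of rule.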
- Clinical Risk Stratification: Identify high-risk patients for early intervention programs
- Resource Planning: Allocate beds and staff based on readmission forecasts
- Quality Improvement: Monitor readmission trends by facility and provider
- Cost Optimization: Target interventions to reduce costly readmissions
- Policy Analytics: Assess healthcare policies' impact on readmission rates
This project uses 2021 BPJS sample data for research and educational purposes.
- XGBoost Documentation: https://xgboost.readthedocs.io/
- DVC ML Pipeline Docs: https://dvc.org/
- Great Expectations: https://greatexpectations.io/
- Optuna Hyperparameter Tuning: https://optuna.org/
- SHAP Interpretability: https://shap.readthedocs.io/
- Streamlit Apps: https://streamlit.io/
- Healthcare ML Best Practices: https://arxiv.org/abs/2006.04185
Last Updated: March 2026