AI-powered predictive intelligence for diabetes risk screening and education
HealthIntel is a machine learning-based health screening tool designed for diabetes risk assessment using the Behavioral Risk Factor Surveillance System (BRFSS) dataset. The project includes comprehensive data processing, feature engineering, and model development workflows through Jupyter notebooks, with a Streamlit web application for risk prediction.
- Comprehensive Data Processing Pipeline: Complete BRFSS dataset analysis from raw data to model-ready features
- Research-Ready Notebooks: Extensive Jupyter notebooks covering EDA, feature engineering, and model development
- Interactive Streamlit Web App: User-friendly interface for diabetes risk assessment
- Model Explainability: SHAP (SHapley Additive exPlanations) integration for transparent predictions
- Statistical Feature Selection: Chi-square and correlation-based feature selection methods
- PDF Report Generation: Downloadable assessment reports with explanations
- Privacy-First Design: All computations performed locally, no data transmission
- Python 3.8+ (tested with 3.10-3.12)
- Windows/macOS/Linux
- 8GB+ RAM recommended for model loading
- Clone the repository
    git clone <your-repo-url>
    cd HealthIntel
- Create virtual environment
    python -m venv .venv
    # Windows PowerShell
    .venv\Scripts\Activate.ps1
    # Windows Command Prompt
    .venv\Scripts\activate.bat
    # macOS/Linux
    source .venv/bin/activate
- Install dependencies
    pip install --upgrade pip
    pip install -r requirements.txt
- Set up data (optional, for notebook development)
  - Download BRFSS datasets as specified in datasetLocations.txt
  - Place raw data in the data/raw/ directory
- Run the application
    streamlit run app.py
- Open your browser to the displayed URL (typically http://localhost:8501)
Note: The models directory is currently empty. You'll need to train models using the provided notebooks or add pre-trained models to use the Streamlit application.
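Before launching the app, it can help to confirm that at least one trained model is present. The snippet below is a minimal sketch of such a check, not part of app.py. It assumes models are saved as .joblib files in models/; the function name find_model_files is hypothetical.

```python
from pathlib import Path


def find_model_files(models_dir="models", pattern="*.joblib"):
    """Return the sorted model files found in models_dir (empty list if none)."""
    directory = Path(models_dir)
    if not directory.is_dir():
        # A missing directory is treated the same as an empty one.
        return []
    return sorted(directory.glob(pattern))


if __name__ == "__main__":
    found = find_model_files()
    if found:
        print("Models available:", ", ".join(p.name for p in found))
    else:
        print("No models found in models/. Train one with the notebooks first.")
```

Run it from the repository root so that the relative models/ path resolves correctly.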
HealthIntel/
├── app.py                      # Main Streamlit web application
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
├── datasetLocations.txt        # Dataset sources and structure
├── data/                       # Data storage
│   ├── raw/                    # Original BRFSS CSV files (2011-2015)
│   ├── processed/              # Cleaned and engineered datasets
│   └── train-test/             # Training and testing splits
├── models/                     # Trained ML models (currently empty)
├── results/                    # Model results and metrics (currently empty)
└── notebooks/                  # Jupyter notebooks for research
    ├── 01_data_merging_and_exploratory_analysis.ipynb
    ├── 02_feature_engineering.ipynb
    ├── 03_target_focused_dataset_preparation.ipynb
    ├── 04_statistical_feature_selection.ipynb
    ├── 05_final_preprocessing_and_train_test_split.ipynb (empty)
    ├── 06_baseline_models.ipynb (empty)
    ├── 07_hyperparameter_tuning.ipynb (empty)
    ├── 08_model_evaluation.ipynb (empty)
    ├── Diabetics_ModelTraining.ipynb    # Comprehensive model training
    ├── EDA.ipynb                        # Exploratory Data Analysis
    └── IT24100886_Impute_Cleaning_ChiSQR.ipynb
- app.py: Main Streamlit application with diabetes risk assessment interface
- data/: BRFSS dataset storage with raw, processed, and train-test splits
- notebooks/: Research notebooks covering the complete ML pipeline from EDA to model training
- requirements.txt: All required Python packages and dependencies
- datasetLocations.txt: Documentation of data sources and folder structure
- models/: Directory for trained models (currently empty; models need to be trained)
- results/: Directory for model performance metrics and results
The Streamlit application (app.py) is ready for diabetes risk assessment but requires trained models to function properly. The app is designed to work with various ML models and includes comprehensive SHAP explainability features.
The diabetes risk model is designed to evaluate 20 key health and demographic factors:
Clinical History
- Chronic condition count (0-10)
- Metabolic risk score (Low/Medium/High)
- Blood pressure medication status
- High cholesterol diagnosis
- Cardiovascular conditions count
- Kidney disease history
Health Status
- General health self-assessment (Excellent to Poor)
- Physical health bad days (categorical)
- Health-limited activity days
- Functional limitations due to health
Medical History
- Coronary heart disease
- Heart attack history
- Pneumonia vaccination
- Arthritis diagnosis
- Cancer screening compliance
Demographics & Access
- Age (21-82 years)
- Income level (8 categories)
- Education level (6 categories)
- Last routine checkup
- Last cholesterol check
Once models are trained, the application will provide:
- Risk Classification: Low/High risk with probability percentage
- Model Performance: Real-time display of sensitivity, specificity, precision, and AUC
- SHAP Explanations:
  - Waterfall plots showing feature contributions
  - Force plots visualizing decision factors
  - Top 5 contributing factors with impact direction
- PDF Reports: Downloadable assessment reports with complete analysis
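As an illustration of the risk classification and performance figures listed above, here is a minimal sketch of how they can be derived. It is not taken from app.py: the function names are hypothetical and the 0.5 decision threshold is an assumption. AUC is left out because it needs the full set of ranked scores rather than confusion-matrix counts.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, and precision from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Undefined ratios (no cases in the denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true cases that were flagged
        "specificity": ratio(tn, tn + fp),  # share of non-cases correctly cleared
        "precision": ratio(tp, tp + fp),    # share of flagged cases that are true
    }


def classify_risk(probability, threshold=0.5):
    """Map a predicted probability to a Low/High label plus a percentage string."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    label = "High" if probability >= threshold else "Low"
    return label, f"{probability * 100:.1f}%"
```

For a screening tool, a threshold below 0.5 trades precision for sensitivity, which is often the preferred direction, but the right value depends on the trained model and should be chosen on validation data.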
- Train Models First: Use the notebook pipeline to create trained models
- Save Models: Export models to the models/ directory
- Run Application: streamlit run app.py
- Test Interface: Input health parameters and review risk assessment
Current State: The project includes comprehensive data processing and model development notebooks, but trained models are not yet available in the repository.
- Data Processing Pipeline:
  - 01_data_merging_and_exploratory_analysis.ipynb: Data merging and initial EDA
  - 02_feature_engineering.ipynb: Comprehensive feature engineering (3.2MB)
  - 03_target_focused_dataset_preparation.ipynb: Target variable preparation
  - 04_statistical_feature_selection.ipynb: Feature selection using statistical methods
- Model Development:
  - Diabetics_ModelTraining.ipynb: Main model training notebook (1.1MB)
  - EDA.ipynb: Extensive exploratory data analysis (7.4MB)
  - IT24100886_Impute_Cleaning_ChiSQR.ipynb: Data imputation and Chi-square analysis
- Planned Notebooks (currently empty):
  - 05_final_preprocessing_and_train_test_split.ipynb
  - 06_baseline_models.ipynb
  - 07_hyperparameter_tuning.ipynb
  - 08_model_evaluation.ipynb
- Set up data: Follow instructions in datasetLocations.txt
- Run notebooks sequentially: Start with data processing, then model training
- Save trained models: Export to the models/ directory for use with the Streamlit app
Based on the Streamlit app code, the following models are supported:
- XGBoost (with TreeExplainer for SHAP)
- CatBoost
- Random Forest
- Extra Trees
- Stacking Ensemble
- Other scikit-learn compatible models
The notebooks/ directory contains comprehensive research workflows:
Data Processing Pipeline:
- 01_data_merging_and_exploratory_analysis.ipynb: Initial data merging and exploratory analysis
- 02_feature_engineering.ipynb: Comprehensive feature engineering (largest pipeline notebook, 3.2MB)
- 03_target_focused_dataset_preparation.ipynb: Target variable preparation and dataset focusing
- 04_statistical_feature_selection.ipynb: Statistical feature selection methods
Model Development:
- Diabetics_ModelTraining.ipynb: Main model training pipeline with comprehensive ML workflows
- EDA.ipynb: Extensive Exploratory Data Analysis (7.4MB of analysis)
- IT24100886_Impute_Cleaning_ChiSQR.ipynb: Data imputation, cleaning, and Chi-square analysis
Future Development (empty notebooks ready for development):
- 05_final_preprocessing_and_train_test_split.ipynb: Final preprocessing and data splitting
- 06_baseline_models.ipynb: Baseline model development
- 07_hyperparameter_tuning.ipynb: Hyperparameter optimization
- 08_model_evaluation.ipynb: Comprehensive model evaluation
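The Chi-square analysis used for feature selection can be illustrated with a small, dependency-free sketch. This is not the code from the notebooks; it only shows the Pearson chi-square statistic for a contingency table of a categorical feature against the diabetes target. The function name and the example counts are made up for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows; each row holds observed counts for one feature
    category across the target classes.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of feature and target.
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts: rows = feature category, columns = (no diabetes, diabetes)
example = [[30, 10], [20, 40]]
stat, dof = chi_square_statistic(example)
```

A larger statistic for the same degrees of freedom means a stronger association with the target. For p-values, scipy.stats.chi2_contingency is the usual choice; note that it applies a continuity correction to 2x2 tables by default, so its statistic can differ from this uncorrected one.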
- Data Collection: Survey data processing and cleaning
- Feature Engineering: Creating 20 key predictive features
- Model Training: Multiple algorithms with hyperparameter optimization (Optuna)
- Calibration: Probability calibration for reliable risk estimates
- Evaluation: Comprehensive performance metrics and validation
- Explainability: SHAP integration for model interpretability
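To make the calibration step more concrete, the sketch below shows one simple way to check whether predicted probabilities are reliable: group predictions into bins and compare the mean predicted risk with the observed rate in each bin, alongside the Brier score. It is an assumed diagnostic, not the calibration method used in the notebooks, and the function names are hypothetical.

```python
def reliability_table(probabilities, outcomes, n_bins=5):
    """Compare mean predicted probability with observed rate in equal-width bins.

    Returns a list of (bin_index, count, mean_predicted, observed_rate) tuples
    for the non-empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((index, len(members), mean_predicted, observed_rate))
    return rows


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)
```

In a well-calibrated model the two rate columns stay close in every bin, and a lower Brier score is better.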
To reproduce the complete research pipeline:
- Data Setup:
  - Download BRFSS data as specified in datasetLocations.txt
  - Place raw CSV files in the data/raw/ directory
- Data Processing (run in sequence):
  - 01_data_merging_and_exploratory_analysis.ipynb: Data merging and initial analysis
  - 02_feature_engineering.ipynb: Feature engineering pipeline
  - 03_target_focused_dataset_preparation.ipynb: Target preparation
  - 04_statistical_feature_selection.ipynb: Feature selection
- Analysis & Training:
  - EDA.ipynb: Comprehensive exploratory data analysis
  - IT24100886_Impute_Cleaning_ChiSQR.ipynb: Data cleaning and statistical analysis
  - Diabetics_ModelTraining.ipynb: Model training and evaluation
- Future Development:
  - Complete remaining notebooks (05-08) for full ML pipeline
  - Export trained models to the models/ directory
  - Test with Streamlit application
- Educational Tool: Designed for learning and research purposes
- Decision Support: Aids healthcare professionals, not a replacement
- Screening Only: Not a diagnostic device or medical advice provider
- Local Processing: All computations performed on user's device
- No Data Transmission: Patient information never leaves local environment
- No PII Required: Uses anonymized health indicators only
- Temporary Storage: No persistent storage of patient data
- Model Explainability: SHAP values explain each prediction
- Performance Metrics: Clear accuracy and reliability statistics
- Feature Descriptions: Plain-language explanations of all inputs
- Probability Calibration: Reliable risk probability estimates
- Not Medical Advice: Always consult healthcare professionals
- Screening Tool Only: Requires clinical confirmation (HbA1c, glucose tests)
- Population Limitations: Trained on specific survey data (BRFSS 2011-2015)
- Individual Variation: Results may not apply to all populations
- Clinical Validation Required: High-risk predictions need medical follow-up
Model Loading Errors
FileNotFoundError: Model file not found
- Root Cause: The models/ directory is currently empty
- Solution: Train models using the provided notebooks first
- Steps:
  - Run the notebook pipeline starting with data processing
  - Use Diabetics_ModelTraining.ipynb to train and save models
  - Export trained models to the models/ directory
- Alternative: Update app.py to handle missing models gracefully
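One possible way to handle missing models gracefully is sketched below. It is not the current app.py behavior. The helper name load_model_or_none is hypothetical, and the loader is passed in as a callable so the example stays dependency-free; in the app it would presumably be joblib.load.

```python
from pathlib import Path


def load_model_or_none(path, loader):
    """Return (model, message); model is None when the file is missing or unreadable.

    `loader` is any callable that takes a path and returns the loaded model.
    """
    model_path = Path(path)
    if not model_path.is_file():
        return None, (
            f"Model file not found: {model_path}. "
            "Train a model with the notebooks first."
        )
    try:
        return loader(model_path), "ok"
    except Exception as exc:  # corrupted or incompatible model file
        return None, f"Could not load {model_path.name}: {exc}"
```

The caller can then show the message in the interface (for example with st.warning) and skip the prediction step instead of crashing with a traceback.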
Import Errors
ModuleNotFoundError: No module named 'streamlit'
- Solution: Install missing dependencies
- Command: pip install streamlit pandas joblib scikit-learn xgboost
- Virtual Environment: Ensure you're in the correct environment
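To see which packages are actually missing from the active environment before reinstalling, a short standard-library check like the following can be used. It is a sketch with a hypothetical function name; the module list mirrors the pip command above, using import names rather than package names.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names for the packages in the pip command
# (scikit-learn is imported as "sklearn").
REQUIRED = ["streamlit", "pandas", "joblib", "sklearn", "xgboost"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

If the list is unexpectedly long, the most likely cause is that the virtual environment is not activated.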
Memory Issues
MemoryError: Unable to load model
- Cause: This may occur with large datasets during notebook execution
- Solution: Process data in chunks or reduce dataset size for development
- Note: The 02_feature_engineering.ipynb (3.2MB) and EDA.ipynb (7.4MB) notebooks contain extensive analysis and may need the recommended 8GB+ of RAM
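Processing data in chunks can look like the sketch below, which streams a CSV file a fixed number of rows at a time using only the standard library. The function names, the chunk size, and the column name passed by the caller are illustrative assumptions, not code from the notebooks.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=50_000):
    """Yield lists of row dicts, at most chunk_size rows at a time."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_values(path, column, chunk_size=50_000):
    """Tally the values of one column without holding the whole file in memory."""
    counts = {}
    for chunk in iter_csv_chunks(path, chunk_size):
        for row in chunk:
            value = row[column]
            counts[value] = counts.get(value, 0) + 1
    return counts
```

In a pandas-based notebook, the equivalent is to pass chunksize to pandas.read_csv, which returns an iterator of DataFrames instead of one large frame.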
SHAP Plotting Issues
- Solution: Restart Streamlit if plots don't render
- Command: Ctrl+C, then streamlit run app.py
- Check: Ensure matplotlib is properly installed
Windows PowerShell
- Activation: Use . .venv\Scripts\Activate.ps1 (note the dot-space)
- Alternative: Use VS Code Python environment selector
- Permission: May need to set execution policy: Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
- Train New Model: Use the notebook pipeline to develop additional models
- Save Model: Export as a .joblib file to the models/ directory
- Update App: Modify model loading section in app.py if needed
- Test Integration: Verify model loading and prediction pipeline
- Data Preparation: Adapt preprocessing notebooks for new health conditions
- Feature Engineering: Develop condition-specific features using 02_feature_engineering.ipynb as a template
- Model Training: Train models using the Diabetics_ModelTraining.ipynb framework
- UI Extension: Create new input forms and result displays
- Statistical Analysis: Leverage 04_statistical_feature_selection.ipynb for feature analysis
- Data Processing: Use the comprehensive preprocessing pipeline in notebooks
- Model Comparison: Implement multiple model evaluation using existing framework
- Research Extension: Add new notebooks following the numbered sequence (09, 10, etc.)
- Mobile Responsiveness: Optimize Streamlit app for tablet/phone usage
- Advanced Visualizations: Enhanced SHAP plots and interactive charts
- Model Selection: Allow users to choose between different trained models
- Batch Processing: Support for multiple patient assessments
- Fork Repository: Create your own copy
- Create Branch: git checkout -b feature/your-feature-name
- Make Changes: Implement improvements or fixes
- Test Locally:
  - For notebooks: Test in Jupyter environment
  - For app: Run streamlit run app.py (with trained models)
- Update Documentation: Modify README if needed
- Submit PR: Include description of changes
- Readable Functions: Keep functions small and well-documented
- Descriptive Naming: Use clear variable and function names
- Error Handling: Implement proper exception handling
- No Sensitive Data: Never log or expose patient information
- Comments: Explain complex logic and model decisions
- Notebook Standards: Follow the existing numbered sequence and structure
- Complete Pipeline: Finish empty notebooks (05-08) to complete the ML pipeline
- Model Improvements: Better algorithms or feature engineering
- UI/UX Enhancements: Improved user experience for Streamlit app
- Performance Optimization: Faster data processing and model training
- Documentation: Better guides and examples
- Testing: Unit tests and validation scripts
- Data Analysis: Enhanced EDA and statistical analysis
This project is provided for research and educational purposes. For distribution or commercial use beyond educational scope, please add an appropriate LICENSE file (MIT/Apache-2.0) and ensure compliance with all third-party data source licenses.
- Centers for Disease Control and Prevention (CDC) - BRFSS dataset
- Open Source Community - Essential libraries and frameworks:
- scikit-learn, XGBoost, CatBoost (Machine Learning)
- pandas, numpy (Data Processing)
- Streamlit (Web Framework)
- SHAP (Model Explainability)
- matplotlib, plotly (Visualization)
- reportlab (PDF Generation)
Behavioral Risk Factor Surveillance System (BRFSS)
- Source: CDC National Health Survey
- Years: 2011-2015 (survey data)
- Type: De-identified population health data
- Location: See datasetLocations.txt for download instructions and data structure
- Usage: Educational and research purposes under CDC data use guidelines
- Structure: Raw data in data/raw/, processed data in data/processed/, train-test splits in data/train-test/
For questions, issues, or contributions:
- Issues: Create a GitHub issue for bugs or feature requests
- Discussions: Use GitHub Discussions for questions and ideas
- Documentation: Check this README and notebook comments
- Medical Questions: Always consult qualified healthcare professionals
Remember: This tool is for educational purposes only and should never replace professional medical advice or clinical testing.
HealthIntel - Empowering informed health decisions through responsible AI