Skip to content

bernadette-burks/DiseasePrediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

74 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Disease Prediction Model 🩺

Python Machine Learning Healthcare Status

Machine learning model of clinical symptoms to determine likelihood of certain diseases using scikit-learn classification.

Author: Bernadette Burks
Date: September 14, 2025


πŸ“‘ Table of Contents


πŸ“Œ Overview

This Disease Prediction Model is based on a GeeksforGeeks tutorial:
https://www.geeksforgeeks.org/machine-learning/disease-prediction-using-machine-learning/

The primary goal of this model is to explore how machine learning can assist in identifying diseases based on symptom patterns in datasets.


πŸ“š Python Libraries

This project incorporates several foundational libraries widely used in data science workflows:

  • Pandas – Provides efficient data structures and visualization support built on NumPy.
  • NumPy – Enables fast numerical operations and compact dataset handling.
  • SciPy – Extends NumPy with additional scientific computing functionality.
  • Matplotlib – Offers highly customizable plotting tools compatible with most ML libraries.
  • Seaborn – Builds upon Matplotlib for more complex statistical visualizations.
  • Scikit-learn – Supplies integrated machine learning algorithms and model evaluation tools.

πŸ—‚ Dataset Details

The dataset used in this project is entitled improved_disease_dataset and contains a single file with 2,000 total entries.

Features Included

The dataset consists of symptom-based predictor variables, including:

  • fever
  • headache
  • nausea
  • vomiting
  • fatigue
  • joint_pain
  • skin_rash
  • cough
  • weight_loss
  • yellow_eyes

The outcome variable is:

  • disease (the predicted diagnosis)

πŸ“Š Training & Validation Approach

This dataset is cross-validated using 5-fold stratified k-fold validation, which defaults to:

  • 80% training data (4 folds)
  • 20% validation/testing data (1 fold)

This results in:

  • Training set: 1,600 rows
  • Validation set: 400 rows

🧹 Data Preparation & Cleaning

A recommended preprocessing step involves converting disease labels into numeric values to support early visualization and detection of class imbalance. Additionally, because certain disease categories were underrepresented, the dataset benefits from applying RandomOverSampler.


🧠 Model Training Methods

For cross-validation, the project evaluates three primary machine learning classifiers:

  • DecisionTreeClassifier()
  • RandomForestClassifier()
  • SVC()

Note: I found an error in the original author's code during the confusion matrix evaluation stage: the DecisionTreeClassifier() appears to be replaced with GaussianNB(). I fixed this in my version of the assignment.


πŸ“ˆ Model Evaluation

The project produces a final predictive function, predict_disease, which accepts symptom inputs and outputs a predicted diagnosis.

Across the trained models, two of three classifiers (67%) produced the same disease prediction given identical symptom sets.

While this performance level would require significant improvement before clinical deployment, it serves as a strong educational foundation and demonstrates the potential for future refinement in healthcare-oriented machine learning applications given a more robust dataset.


πŸ§ͺ Example Prediction Workflow

Once trained, the model can be used by inputting symptom features such as:

predict_disease(
    fever=1,
    headache=1,
    nausea=0,
    vomiting=0,
    fatigue=1,
    joint_pain=0,
    skin_rash=0,
    cough=1,
    weight_loss=0,
    yellow_eyes=0
)

πŸ“Š Results

Model performance was evaluated using confusion matrices and cross-validation accuracy.

Confusion Matrix Comparison

Model Notes
Random Forest Classifier Strong overall predictive consistency across symptom categories
Support Vector Classifier (SVC) Performed similarly to Random Forest on majority classes
Decision Tree Classifier Variance in outcome was likely skewed due to "noise" in dataset

Random Forest Confusion Matrix

SVC Confusion Matrix

Decision Tree Confusion Matrix

Combined Model Confusion Matrix

Confusion matrices provide insight into:

  • Correct vs. incorrect disease classifications
  • Class-level prediction strengths
  • Potential areas of misclassification due to symptom overlap

πŸ›  Key Skills Demonstrated

This project highlights several core machine learning and healthcare analytics skills:

  • Data preprocessing and label encoding
  • Handling class imbalance with oversampling
  • Model training with cross-validation
  • Comparing classifier performance
  • Confusion matrix evaluation
  • Applying ML concepts to clinical analysis

πŸš€ Future Improvements

This project serves as an excellent starting point for continued development. Future enhancements may include:

  • Expanding evaluation metrics beyond accuracy (precision, recall, F1-score)
  • Additional algorithms such as Gradient Boosting or XGBoost
  • Feature importance tools for enhanced understanding
  • Larger clinical datasets for stronger generalization
  • Hosting the model via a GUI or dashboard

πŸ”— References

Disease Prediction Using Machine Learning. (2025). GeeksforGeeks. Retrieved September 14, 2025 from https://www.geeksforgeeks.org/machine-learning/disease-prediction-using-machine-learning/

Ly, S. (2024). 8 Python Libraries You Must Know for Data Science. Simple Analytics. Retrieved September 14, 2025 from https://simpleanalytics.co.nz/blogs/8-python-libraries-you-must-know-for-data-science

SciPy. (n.d.). Retrieved September 14, 2025 from https://scipy.org/

About

Machine learning model of clinical symptoms to determine likelihood of certain diseases using scikit-learn classification 🩺

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors