Machine learning model of clinical symptoms to determine likelihood of certain diseases using scikit-learn classification.
Author: Bernadette Burks
Date: September 14, 2025
- Overview
- Python Libraries
- Dataset Details
- Data Preparation & Cleaning
- Model Training Methods
- Model Evaluation
- Example Prediction Workflow
- Results
- Key Skills Demonstrated
- Future Improvements
- Project Structure
- References
This Disease Prediction Model is based on a GeeksforGeeks tutorial:
https://www.geeksforgeeks.org/machine-learning/disease-prediction-using-machine-learning/
The primary goal of this model is to explore how machine learning can assist in identifying diseases based on symptom patterns in datasets.
This project incorporates several foundational libraries widely used in data science workflows:
- Pandas: Provides efficient data structures and visualization support built on NumPy.
- NumPy: Enables fast numerical operations and compact dataset handling.
- SciPy: Extends NumPy with additional scientific computing functionality.
- Matplotlib: Offers highly customizable plotting tools compatible with most ML libraries.
- Seaborn: Builds upon Matplotlib for more complex statistical visualizations.
- Scikit-learn: Supplies integrated machine learning algorithms and model evaluation tools.
The dataset used in this project is named improved_disease_dataset. It is a single file containing 2,000 entries.
The dataset consists of symptom-based predictor variables, including:
- fever
- headache
- nausea
- vomiting
- fatigue
- joint_pain
- skin_rash
- cough
- weight_loss
- yellow_eyes
The outcome variable is:
- disease (the predicted diagnosis)
The models are evaluated on this dataset with stratified 5-fold cross-validation. In each of the five iterations, the data is split into:
- 80% training data (4 folds)
- 20% validation/testing data (1 fold)
This results in:
- Training set: 1,600 rows
- Validation set: 400 rows
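The split sizes above can be checked with a short sketch. It uses synthetic stand-in data rather than the real dataset, and the five equally sized classes are an assumption made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data: 2,000 rows, 10 binary symptom columns,
# and 5 hypothetical disease classes of equal size (an assumption).
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(2000, 10))
y = np.repeat(np.arange(5), 400)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Record the (training, validation) row counts for each of the five folds.
fold_sizes = []
for train_idx, val_idx in skf.split(X, y):
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
```

Every row lands in the validation fold exactly once, so each model is trained and scored five times on different 1,600/400 partitions.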
A recommended preprocessing step is to encode the disease labels as numeric values, which makes it easier to plot the class distribution and spot class imbalance early. Because some disease categories are underrepresented, the dataset also benefits from oversampling with RandomOverSampler (from the imbalanced-learn package).
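The two preprocessing steps can be sketched as follows. This is a minimal illustration on a tiny made-up table, not the project's code: the disease names and counts are invented, and the oversampling is written by hand in pandas to show the idea. In the project itself, RandomOverSampler from imbalanced-learn would do this job through its fit_resample(X, y) method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up sample; the disease names and counts are illustrative only.
df = pd.DataFrame({
    "fever":   [1, 1, 0, 1, 0, 1, 0, 0],
    "cough":   [1, 0, 0, 1, 1, 1, 0, 1],
    "disease": ["flu", "flu", "flu", "flu", "flu",
                "malaria", "malaria", "hepatitis"],
})

# Convert string labels to integers (LabelEncoder sorts the classes).
encoder = LabelEncoder()
df["disease_code"] = encoder.fit_transform(df["disease"])

counts_before = df["disease"].value_counts()

# Random oversampling: top up each smaller class with randomly drawn
# duplicates until it matches the size of the largest class.
target = counts_before.max()
parts = []
for _, group in df.groupby("disease"):
    n_extra = target - len(group)
    if n_extra > 0:
        extra = group.sample(n=n_extra, replace=True, random_state=42)
        group = pd.concat([group, extra])
    parts.append(group)
balanced = pd.concat(parts, ignore_index=True)

counts_after = balanced["disease"].value_counts()
print(counts_before.to_dict())
print(counts_after.to_dict())
```

Oversampling only duplicates existing rows, so it balances the class counts without adding new information. It is usually applied to training data only, so that duplicated rows do not leak into validation folds.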
For cross-validation, the project evaluates three primary machine learning classifiers:
- DecisionTreeClassifier()
- RandomForestClassifier()
- SVC()
Note: I found an error in the original author's code at the confusion matrix evaluation stage, where GaussianNB() appears in place of DecisionTreeClassifier(). I corrected this in my version of the assignment so that DecisionTreeClassifier() is used throughout.
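A minimal sketch of how the three classifiers might be compared under stratified 5-fold cross-validation is shown below. The data is synthetic, with a label that depends on the first two columns so there is something to learn. The variable names and hyperparameters are assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 10 binary symptoms, 4 hypothetical classes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 10))
# The label depends on the first two symptoms so the models have signal.
y = X[:, 0] * 2 + X[:, 1]

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
}

# The same folds are reused for every model so the comparison is fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```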
The project produces a final predictive function, predict_disease, which accepts symptom inputs and outputs a predicted diagnosis.
Across the trained models, two of three classifiers (67%) produced the same disease prediction given identical symptom sets.
While this performance level would require significant improvement before clinical deployment, it serves as a strong educational foundation and demonstrates the potential for future refinement in healthcare-oriented machine learning applications given a more robust dataset.
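One way a function like predict_disease could combine the three classifiers is a majority vote over their individual predictions. The sketch below is an assumption about the design, not the project's actual implementation. The training data is synthetic, the labels disease_A, disease_B, and disease_C are placeholders, and the returned dictionary layout is hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SYMPTOMS = ["fever", "headache", "nausea", "vomiting", "fatigue",
            "joint_pain", "skin_rash", "cough", "weight_loss", "yellow_eyes"]

# Synthetic training data; the disease labels below are placeholders.
rng = np.random.default_rng(1)
X_train = pd.DataFrame(
    rng.integers(0, 2, size=(300, len(SYMPTOMS))), columns=SYMPTOMS
)
y_train = np.where(
    X_train["yellow_eyes"] == 1, "disease_A",
    np.where(X_train["cough"] == 1, "disease_B", "disease_C"),
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=1).fit(X_train, y_train),
    "random_forest": RandomForestClassifier(random_state=1).fit(X_train, y_train),
    "svc": SVC().fit(X_train, y_train),
}


def predict_disease(**symptoms):
    """Return each model's prediction plus the majority-vote diagnosis."""
    unknown = set(symptoms) - set(SYMPTOMS)
    if unknown:
        raise ValueError(f"Unknown symptoms: {sorted(unknown)}")
    # Symptoms that are not passed default to 0 (absent).
    row = pd.DataFrame([[symptoms.get(s, 0) for s in SYMPTOMS]], columns=SYMPTOMS)
    votes = {name: str(model.predict(row)[0]) for name, model in models.items()}
    # On a three-way tie, most_common returns the first label encountered.
    final = Counter(votes.values()).most_common(1)[0][0]
    return {"votes": votes, "final": final}


result = predict_disease(fever=1, headache=1, fatigue=1, cough=1)
print(result)
```

Returning the individual votes alongside the final answer makes it easy to see when the classifiers disagree, as in the two-of-three agreement described above.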
Once trained, the model can be used by inputting symptom features such as:

    predict_disease(
        fever=1,
        headache=1,
        nausea=0,
        vomiting=0,
        fatigue=1,
        joint_pain=0,
        skin_rash=0,
        cough=1,
        weight_loss=0,
        yellow_eyes=0
    )

Model performance was evaluated using confusion matrices and cross-validation accuracy.
| Model | Notes |
|---|---|
| Random Forest Classifier | Strong overall predictive consistency across symptom categories |
| Support Vector Classifier (SVC) | Performed similarly to Random Forest on majority classes |
| Decision Tree Classifier | Showed more variable results, likely due to noise in the dataset |
Confusion matrices provide insight into:
- Correct vs. incorrect disease classifications
- Class-level prediction strengths
- Potential areas of misclassification due to symptom overlap
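The sketch below shows how a confusion matrix yields these insights. It uses synthetic data with three hypothetical classes and a single hold-out split for brevity, whereas the project itself relies on cross-validation. All names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three hypothetical classes (0, 1, 2).
rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(600, 10))
y = np.where(X[:, 9] == 1, 0, np.where(X[:, 7] == 1, 1, 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

model = RandomForestClassifier(random_state=7).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# The diagonal holds the correct predictions, so accuracy = trace / total.
accuracy = accuracy_score(y_test, y_pred)
print(accuracy, np.trace(cm) / cm.sum())

# Per-class recall: the fraction of each true class predicted correctly.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```

Off-diagonal cells show which pairs of diseases the model confuses, which is where overlapping symptoms tend to appear.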
This project highlights several core machine learning and healthcare analytics skills:
- Data preprocessing and label encoding
- Handling class imbalance with oversampling
- Model training with cross-validation
- Comparing classifier performance
- Confusion matrix evaluation
- Applying ML concepts to clinical analysis
This project serves as an excellent starting point for continued development. Future enhancements may include:
- Expanding evaluation metrics beyond accuracy (precision, recall, F1-score)
- Additional algorithms such as Gradient Boosting or XGBoost
- Feature importance tools for enhanced understanding
- Larger clinical datasets for stronger generalization
- Hosting the model via a GUI or dashboard
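As a starting point for the first and third items, scikit-learn already provides per-class precision, recall, and F1-score through classification_report, and tree ensembles expose feature_importances_. The sketch below runs on synthetic data with placeholder classes. It illustrates the calls and is not a result from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

SYMPTOMS = ["fever", "headache", "nausea", "vomiting", "fatigue",
            "joint_pain", "skin_rash", "cough", "weight_loss", "yellow_eyes"]

# Synthetic data: the label depends only on yellow_eyes and cough.
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(600, len(SYMPTOMS)))
y = np.where(X[:, 9] == 1, 0, np.where(X[:, 7] == 1, 1, 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)
model = RandomForestClassifier(random_state=3).fit(X_train, y_train)

# Precision, recall, and F1-score for every class in one text table.
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)

# Impurity-based importances, sorted from most to least influential.
importances = sorted(
    zip(SYMPTOMS, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in importances:
    print(f"{name:12s} {score:.3f}")
```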
Disease Prediction Using Machine Learning. (2025). GeeksforGeeks. Retrieved September 14, 2025, from https://www.geeksforgeeks.org/machine-learning/disease-prediction-using-machine-learning/
Ly, S. (2024). 8 Python Libraries You Must Know for Data Science. Simple Analytics. Retrieved September 14, 2025, from https://simpleanalytics.co.nz/blogs/8-python-libraries-you-must-know-for-data-science
SciPy. (n.d.). Retrieved September 14, 2025, from https://scipy.org/



