This project focuses on building a machine learning-based system to predict whether a person is diabetic or not based on several medical diagnostic measurements. The aim of this academic project is to apply data preprocessing, exploratory data analysis, and classification algorithms to create a reliable prediction model for early identification of diabetes.
Diabetes is a chronic disease that affects millions of people globally. Early detection plays a crucial role in preventing complications. This project uses supervised machine learning techniques to predict diabetes using a structured dataset.
The work includes:
- Data preprocessing
- Feature cleaning
- Model building
- Performance evaluation
- Confusion matrix visualization
It demonstrates practical application of ML concepts in a real-world medical prediction problem.
- To analyze the diabetes dataset and understand key health indicators
- To apply machine learning algorithms for predicting diabetes
- To evaluate model accuracy, precision, recall, and F1-score
- To build a reusable and easy-to-run ML pipeline
The dataset contains the following health-related features:
- Pregnancies
- Glucose Level
- Blood Pressure
- Skin Thickness
- Insulin Level
- BMI
- Diabetes Pedigree Function
- Age
- Outcome (0 = No Diabetes, 1 = Diabetes)
The dataset is clean, structured, and suitable for classification tasks.
The entire workflow includes:
- Reads the CSV dataset
- Handles missing and zero values
- Replaces invalid values
- Splits data into train–test sets
- Scales features for better model performance
A Logistic Regression model is used because:
- It is simple
- Performs well for binary classification
- Easy to interpret
Model performance is measured using:
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix (visualized)
The model gives a balanced performance in detecting diabetic and non-diabetic cases. Evaluation metrics such as accuracy and F1-score demonstrate the model’s effectiveness.
A confusion matrix plot is also generated for better understanding of classification performance.
- Python
- NumPy
- Pandas
- Matplotlib
- Scikit-Learn
- Seaborn
- Joblib
Install dependencies:
bash pip install -r requirements.txt
Train the model:
bash python src/main.py --mode train
Evaluate the model:
bash python src/main.py --mode evaluate
This project demonstrates:
- Understanding of machine learning workflows
- Ability to clean and prepare raw data
- Implementation of classification algorithms
- Evaluation of model performance
- Knowledge of real-world medical dataset challenges
It is suitable for college submissions, machine learning courses, mini-projects, and academic presentations.
This Diabetes Prediction project successfully builds a machine learning model capable of predicting diabetes with good accuracy. The project showcases end-to-end ML development skills — from data preprocessing to final evaluation — making it an excellent academic demonstration of machine learning applications in healthcare.