This project uses the dopamine_pEC50.csv dataset and 8 descriptive features to predict which dopamine receptor subtype (D1–D5) a molecule is most likely to interact with.
To accomplish this, we built a heterogeneous ensemble of machine learning models and evaluated them using balanced accuracy, chosen as the most informative metric for interpreting multi-class confusion matrices.
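The ensemble makes its final prediction by majority vote among the models. The number of base learners is odd, which makes tied votes less likely, although with five classes a tie is still possible (for example, when the votes split evenly across several subtypes).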
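The voting and scoring steps described above can be sketched in base R. This is a minimal illustration, not the project's actual code: the helper names `majority_vote` and `per_class_balanced_accuracy` and the toy labels are hypothetical. The per-class score is computed as the mean of sensitivity and specificity, which is the definition `caret::confusionMatrix` reports; whether the project averages these values across classes is an assumption left open here. A deterministic tie-break is included as a precaution, since votes can split evenly when there are more than two classes.

```r
# Hypothetical helper: majority vote across base-learner predictions.
# `preds` has one row per molecule and one column per base learner.
majority_vote <- function(preds) {
  unname(apply(as.matrix(preds), 1, function(votes) {
    counts <- table(votes)
    winners <- names(counts)[counts == max(counts)]
    # If several labels share the top count, take the first in sorted order.
    winners[1]
  }))
}

# Hypothetical helper: one-vs-rest balanced accuracy for each class,
# defined as (sensitivity + specificity) / 2.
per_class_balanced_accuracy <- function(truth, predicted) {
  classes <- sort(unique(as.character(truth)))
  sapply(classes, function(cl) {
    sens <- mean(predicted[truth == cl] == cl)  # recall on class `cl`
    spec <- mean(predicted[truth != cl] != cl)  # recall on "not `cl`"
    (sens + spec) / 2
  })
}

# Toy example with three made-up learners and three molecules.
preds <- cbind(rf  = c("D1", "D2", "D3"),
               svm = c("D1", "D2", "D4"),
               nn  = c("D2", "D2", "D5"))
votes <- majority_vote(preds)

truth <- c("D1", "D1", "D2", "D2")
predicted <- c("D1", "D2", "D2", "D2")
scores <- per_class_balanced_accuracy(truth, predicted)
```

In the toy example the third molecule receives three different votes, so the tie-break decides its label. A real pipeline might instead break ties using predicted class probabilities.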
Below are all the packages needed to run this script.

    packages <- c("caret", "psych", "randomForest", "kernlab",
                  "neuralnet", "smotefamily", "glmnet", "caretEnsemble")

If not already installed, the script will install and load them during the initial run.
The dataset path is hardcoded in the script, so the file dopamine_pEC50.csv must be downloaded beforehand.
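The install-and-load behaviour described for the package list could look roughly like the sketch below. The script's actual mechanism is not shown in this README, so the helper names `missing_packages` and `ensure_packages` are hypothetical. Only base R functions are used (`requireNamespace`, `install.packages`, `library`).

```r
# Hypothetical helper: return the packages that cannot be loaded yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install whatever is missing, then attach everything.
ensure_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Quick check against packages that ship with R, so nothing is installed here.
base_check <- missing_packages(c("stats", "utils"))
```

Calling `ensure_packages(packages)` with the vector listed above would then install anything missing on the first run and simply load the packages on later runs.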
The main analysis is contained in pEC50_DopamineReceptor. repository
To run the project:
- Clone or download this repository.
- Make sure the dataset dopamine_pEC50.csv is in the project directory.
- Open the file MagroG.DA5030.Project.Rmd in RStudio.
- Click Knit to generate the full report, or run the code chunks interactively.
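As an alternative to the Knit button, the report can probably also be rendered from the R console. The sketch below assumes the `rmarkdown` package is installed (it is not in the package list above) and that both files sit in the current working directory. `knit_project` is a hypothetical helper, not part of the repository.

```r
# Hypothetical helper: render the report from the R console.
knit_project <- function(rmd = "MagroG.DA5030.Project.Rmd",
                         data = "dopamine_pEC50.csv") {
  # Fail early with a clear message if either file is missing.
  needed <- c(rmd, data)
  absent <- needed[!file.exists(needed)]
  if (length(absent) > 0) {
    stop("Missing file(s) in ", getwd(), ": ",
         paste(absent, collapse = ", "))
  }
  # rmarkdown::render() performs the rendering step that Knit triggers.
  rmarkdown::render(rmd)
}
```

Running `knit_project()` from the project directory would then produce the report, or stop with a message naming whichever file is missing.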
By default, the script loads dopamine_pEC50.csv.
To use a different dataset, update the file path in the R Markdown file where the data is loaded (currently line 56):

    df <- read.csv("YourFile.csv")

This project was developed as part of Northeastern University’s DA5030: Data Science course.
The dataset was obtained from Kaggle: pEC50 Prediction Dopamine ML, contributed by Bhawakshi.
Due to limitations in direct Kaggle-to-PyCharm integration, the dataset was mirrored to GitHub within this repository as dopamine_pEC50.csv for accessibility.