This project is a reproducible bioinformatics workflow built using Nextflow to process, analyze, and visualize chemical and protein features associated with biological activity (pActivity).
The pipeline integrates:
- Python-based data preprocessing
- Feature separation (protein vs chemical descriptors)
- Exploratory data analysis in R
- PCA and correlation analysis
- Automated reporting using RMarkdown
The goal is to investigate how molecular and protein-derived features relate to bioactivity and to prepare structured datasets for downstream predictive modeling.
The pipeline consists of four main stages (data cleaning, feature separation, exploratory analysis, and predictive modeling):
- Removes non-informative and redundant columns
- Handles categorical conversion
- Outputs a cleaned dataset
- Separates the dataset into:
  - Protein Features
  - Chemical Features
  - Labels (Chemical and target: pActivity)
- Produces multiple structured CSV files
- Descriptive statistics
- Distribution analysis
- PCA (Principal Component Analysis)
- Correlation Analysis
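The cleaning and feature-separation stages listed above can be illustrated with a minimal pandas sketch. It is not the project's `clean_data.py` or `split_feature_sets.py`: the function names, the `prot_` / `chem_` column prefixes, and the example columns are hypothetical, chosen only to show the idea. Only the target name `pActivity` comes from this README.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop constant and duplicated columns, then encode text columns as integer codes."""
    # A column with a single distinct value is non-informative.
    df = df.loc[:, df.nunique(dropna=False) > 1]

    # Keep only the first of any group of columns holding identical values.
    keep = []
    for col in df.columns:
        if not any(df[col].equals(df[kept]) for kept in keep):
            keep.append(col)
    df = df[keep].copy()

    # Convert non-numeric (categorical) columns to integer category codes.
    for col in df.columns:
        if not is_numeric_dtype(df[col]):
            df[col] = df[col].astype("category").cat.codes
    return df


def split_feature_sets(df, protein_prefix="prot_", chemical_prefix="chem_",
                       label_cols=("pActivity",)):
    """Split a cleaned table by column prefix (the prefixes are assumptions)."""
    return {
        "protein_features": df[[c for c in df.columns if c.startswith(protein_prefix)]],
        "chemical_features": df[[c for c in df.columns if c.startswith(chemical_prefix)]],
        "labels": df[list(label_cols)],
    }


# Toy input with one duplicated column and one constant column.
raw = pd.DataFrame({
    "prot_len": [310, 455, 310, 512],
    "prot_len_copy": [310, 455, 310, 512],
    "chem_logp": [1.2, 3.4, 0.8, 2.1],
    "chem_class": ["acid", "base", "acid", "base"],
    "assay": ["A", "A", "A", "A"],
    "pActivity": [5.1, 6.3, 4.9, 7.0],
})

cleaned = clean(raw)
parts = split_feature_sets(cleaned)
for name, table in parts.items():
    print(name, list(table.columns))
```

Each table in `parts` could then be written out with `DataFrame.to_csv` to produce the separate CSV files.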
Multiple supervised machine learning models are implemented using the caret
framework to predict pActivity from combined chemical and protein features.
The dataset is split into training (80%) and testing (20%) subsets to evaluate
generalizability.
The following models are trained and compared:
- Random Forest (RF)
- Support Vector Machine with radial basis kernel (SVM)
- Generalized Linear Model (GLM)
- Elastic Net (ENET)
Models are trained using repeated 5-fold cross-validation with hyperparameter tuning. Preprocessing steps, including centering, scaling, and PCA, are applied where appropriate.
Performance is evaluated on a held-out test set using:
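The resampling scheme (an 80/20 split followed by repeated 5-fold cross-validation on the training portion) can be sketched as follows. The project's models are trained with caret in R; this standard-library Python sketch only illustrates how the indices are partitioned. The helper names, the seed, and the number of repeats (3) are assumptions, since the README does not state them, and the split is a plain random shuffle rather than whatever partitioning the project uses.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and reserve a fraction of them as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_kfold(train_indices, k=5, repeats=3, seed=42):
    """Yield (repeat, fit_indices, validation_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        # Reshuffle before every repeat so the folds differ between repeats.
        shuffled = list(train_indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for held_out in range(k):
            validation = folds[held_out]
            fit = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield repeat, fit, validation


train_idx, test_idx = train_test_split_indices(100)
splits = list(repeated_kfold(train_idx))
print(len(train_idx), len(test_idx), len(splits))
```

The test indices are never passed to `repeated_kfold`, so hyperparameter tuning only ever sees the training portion.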
- Root Mean Square Error (RMSE)
- R-squared
- Mean Absolute Error (MAE)
- Pearson Correlation
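For reference, the four metrics above can be computed as in the sketch below. It uses only the Python standard library; the function name and the example values are illustrative. R-squared is computed here as the coefficient of determination, 1 - SS_res / SS_tot. The README does not say which definition the project reports, and some tools report the squared Pearson correlation instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R-squared, and Pearson correlation for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n

    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
        "pearson": cov / math.sqrt(ss_tot * ss_pred),
    }


# Illustrative values only, not results from this project.
metrics = regression_metrics([5.0, 6.0, 7.0, 8.0], [5.5, 5.5, 7.5, 7.5])
print(metrics)
```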
The Random Forest and SVM models achieve near-perfect predictive performance, which suggests potential overfitting. Further validation, including external testing, is required to confirm model robustness.
Bioactivity/
│
│
├── data/
│ ├── Multiple files splitting descriptive features
│ ├── raw_data.csv
│ └── target.csv
│
├── scripts/
│ ├── clean_data.py
│ ├── split_feature_sets.py
│ ├── predictive_models.py
│ ├── Predictive_Model.Rmd
│ ├── Sandbox.R
│ ├── Protein_Feature_Exploration.Rmd
│ └── Chem_Feature_Exploration.Rmd
│
├── results/
│ ├── REPORT_chemFeatureExploration/
│ ├── REPORT_proteinFeatureExploration/
│ └── REPORT_PredictiveModels/
│
│
├── main.nf # Nextflow pipeline
│
├── .nextflow/ # Nextflow cache (auto-generated)
├── work/ # Nextflow working directory (auto-generated)
│
└── README.md
- Java (required for Nextflow)
- Nextflow (>= 23.x recommended)
- Python (>= 3.8)
- R (>= 4.0)
Install required R packages:
install.packages(c(
  "rmarkdown",
  "corrplot",
  "ggplot2"
))
Install required Python packages:
pip install pandas numpy
Run the full workflow using:
nextflow run main.nf
- All intermediate files are handled automatically by Nextflow
- Each process runs in an isolated environment
- Outputs are reproducible and traceable via the work/ directory
- RMarkdown reports are parameterized for workflow compatibility
- Requires a working installation of Pandoc for RMarkdown rendering
- Large datasets may require increased memory allocation in Nextflow
- Feature engineering is specific to the current dataset
- Add feature selection pipeline
- Containerize workflow using Docker/Singularity
Gina Magro
Bioinformatics / Computational Biology Pipeline Project