gmagro24/Bioactivity-Project

Bioactivity Feature Engineering & Analysis Pipeline

Overview

This project is a reproducible bioinformatics workflow built using Nextflow to process, analyze, and visualize chemical and protein features associated with biological activity (pActivity).

The pipeline integrates:

  • Python-based data preprocessing
  • Feature separation (protein vs chemical descriptors)
  • Exploratory data analysis in R
  • PCA and correlation analysis
  • Automated reporting using RMarkdown

The goal is to investigate how molecular and protein-derived features relate to bioactivity and to prepare structured datasets for downstream predictive modeling.

Workflow Structure

The pipeline consists of four main stages:

1. Data Cleaning (Python)

  • Removes non-informative and redundant columns
  • Handles categorical conversion
  • Outputs a cleaned dataset
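
The cleaning steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the contents of `clean_data.py`: the function name `clean_columns`, the column names, and the rule that "non-informative" means constant and "redundant" means an exact duplicate are all assumptions made for the example.

```python
def clean_columns(rows):
    """Drop constant and duplicate columns, then integer-encode text columns.

    `rows` is a list of dicts sharing the same keys (one dict per record).
    The definitions of "non-informative" and "redundant" used here are
    assumptions for illustration only.
    """
    columns = list(rows[0].keys())

    # 1. Non-informative: a column with a single distinct value carries no signal.
    informative = [c for c in columns if len({r[c] for r in rows}) > 1]

    # 2. Redundant: keep only the first of any columns with identical values.
    seen, kept = set(), []
    for c in informative:
        signature = tuple(r[c] for r in rows)
        if signature not in seen:
            seen.add(signature)
            kept.append(c)

    # 3. Categorical conversion: map each distinct string to an integer code.
    cleaned = [{c: r[c] for c in kept} for r in rows]
    for c in kept:
        if any(isinstance(r[c], str) for r in cleaned):
            codes = {v: i for i, v in enumerate(sorted({r[c] for r in cleaned}))}
            for r in cleaned:
                r[c] = codes[r[c]]
    return cleaned


# Hypothetical records; the column names are made up for this example.
rows = [
    {"mol_weight": 180.2, "mol_weight_copy": 180.2, "assay": "A", "family": "kinase"},
    {"mol_weight": 250.7, "mol_weight_copy": 250.7, "assay": "A", "family": "gpcr"},
    {"mol_weight": 310.4, "mol_weight_copy": 310.4, "assay": "A", "family": "kinase"},
]
cleaned = clean_columns(rows)
```

In this example the constant `assay` column and the duplicated `mol_weight_copy` column are dropped, and `family` is replaced by integer codes.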

2. Feature Splitting (Python)

  • Separates dataset into:
    • Protein Features
    • Chemical Features
    • Labels (Chemical and target: pActivity)
  • Produces multiple structured CSV files
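
One way to picture the split is to group columns by name. The sketch below assumes a prefix naming convention (`prot_`, `chem_`) and a label column called `chemical_id`; none of these are taken from `split_feature_sets.py`, and only `pActivity` is named in this README.

```python
def split_feature_sets(header, protein_prefix="prot_", chem_prefix="chem_",
                       label_columns=("chemical_id", "pActivity")):
    """Group column names into protein, chemical, and label sets.

    The prefixes and the `chemical_id` label are hypothetical; adjust them
    to whatever convention the real dataset uses.
    """
    groups = {"protein": [], "chemical": [], "labels": [], "unassigned": []}
    for name in header:
        if name in label_columns:
            groups["labels"].append(name)
        elif name.startswith(protein_prefix):
            groups["protein"].append(name)
        elif name.startswith(chem_prefix):
            groups["chemical"].append(name)
        else:
            # Surface anything that matches no rule instead of dropping it silently.
            groups["unassigned"].append(name)
    return groups


def project(rows, columns):
    """Return the rows restricted to the given columns."""
    return [{c: r[c] for c in columns} for r in rows]


# Hypothetical header and a single record.
header = ["chemical_id", "prot_length", "chem_logP", "chem_weight", "pActivity"]
rows = [{"chemical_id": "C1", "prot_length": 412, "chem_logP": 2.1,
         "chem_weight": 180.2, "pActivity": 6.3}]

groups = split_feature_sets(header)
protein_rows = project(rows, groups["protein"])
```

Each group can then be written to its own CSV file, for example with `csv.DictWriter`.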

3. Protein and Chemical Feature Exploration (RMarkdown)

  • Descriptive statistics
  • Distribution analysis
  • PCA (Principal Component Analysis)
  • Correlation Analysis
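
The reports themselves are written in R, but the idea behind the correlation analysis can be shown in a few lines of Python. The sketch below computes pairwise Pearson correlations and flags strongly correlated feature pairs. The feature names, the values, and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for a constant column; treated as 0 here
    return cov / (sd_x * sd_y)


def correlated_pairs(features, threshold=0.9):
    """List feature pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for a, b in combinations(features, 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical descriptor values for four compounds.
features = {
    "logP": [1.0, 2.0, 3.0, 4.0],
    "mol_weight": [100.0, 210.0, 290.0, 405.0],
    "h_donors": [2.0, 1.0, 2.0, 1.0],
}
pairs = correlated_pairs(features)
```

Here only `logP` and `mol_weight` are flagged. Pairs like this are candidates for removal or for combining through PCA.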

4. Predictive Modeling

Multiple supervised machine learning models are implemented using the caret framework to predict pActivity from combined chemical and protein features. The dataset is split into training (80%) and testing (20%) subsets to evaluate generalizability.
The following models are trained and compared:

  • Random Forest (RF)
  • Support Vector Machine with radial basis kernel (SVM)
  • Generalized Linear Model (GLM)
  • Elastic Net (ENET)

Models are trained using repeated 5-fold cross-validation with hyperparameter tuning. Preprocessing steps, including centering, scaling, and PCA, are applied where appropriate.
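
The 80/20 split and repeated 5-fold cross-validation can be sketched at the index level as shown below. The project uses caret for this, so the code is only an illustration of the scheme. The number of repeats (3) and the seed are assumptions, since the README does not state them.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_kfold(indices, k=5, repeats=3, seed=42):
    """Yield (repeat, fold, train_indices, validation_indices).

    `repeats=3` is an assumed value, not taken from the project.
    """
    rng = random.Random(seed)
    for rep in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [shuffled[i::k] for i in range(k)]
        for i, validation in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield rep, i, train, validation


train_idx, test_idx = train_test_split_indices(100)
splits = list(repeated_kfold(train_idx, k=5, repeats=3))
```

Only the training indices are resampled, so the held-out 20% plays no part in tuning. To avoid leakage, centering, scaling, and PCA parameters should be estimated on each training fold and then applied to the matching validation fold.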

Performance is evaluated on the held-out test set using:

  • Root Mean Square Error (RMSE)
  • R-squared
  • Mean Absolute Error (MAE)
  • Pearson Correlation
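
For reference, the four metrics can be computed as in the sketch below. It uses the standard textbook definitions, with R-squared taken as 1 - SS_res / SS_tot. The function name and the example values are illustrative and are not project results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R-squared, and Pearson correlation for predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r_squared = 1 - ss_res / ss_tot  # coefficient of determination

    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(ss_tot * ss_pred)

    return {"rmse": rmse, "mae": mae, "r_squared": r_squared, "pearson": pearson}


# Made-up pActivity values, for illustration only.
metrics = regression_metrics([5.0, 6.0, 7.0, 8.0], [5.5, 6.0, 6.5, 8.0])
```

Note that caret's `postResample` reports R-squared as the squared correlation between predictions and observations by default, so its value can differ slightly from the definition used here.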

While Random Forest and SVM models achieve near-perfect predictive performance, these results suggest potential overfitting. Further validation, including external testing, is required to confirm model robustness.


Bioactivity/


├── data/
│ ├── Multiple files splitting descriptive features
│ ├── raw_data.csv
│ └── target.csv

├── scripts/
│ ├── clean_data.py
│ ├── split_feature_sets.py
│ ├── predictive_models.py
│ ├── Predictive_Model.Rmd
│ ├── Sandbox.R
│ ├── Protein_Feature_Exploration.Rmd
│ └── Chem_Feature_Exploration.Rmd

├── results/
│ ├── REPORT_chemFeatureExploration/
│ ├── REPORT_proteinFeatureExploration/
│ └── REPORT_PredictiveModels/


├── main.nf # Nextflow pipeline

├── .nextflow (auto-generated)
├── work/ # Nextflow working directory (auto-generated)

└── README.md


Requirements

System Dependencies

  • Java (required for Nextflow)
  • Nextflow (>= 23.x recommended)
  • Python (>= 3.8)
  • R (>= 4.0)

R Packages

Install required R packages:

install.packages(c(
    "rmarkdown",
    "corrplot",
    "ggplot2",
    "caret"  # used by the predictive modeling stage
))

Python Packages

pip install pandas numpy 

How to Run the Pipeline

Run the full workflow using:

nextflow run main.nf 

Notes on Reproducibility

  • All intermediate files are handled automatically by Nextflow
  • Each process runs in its own isolated work directory
  • Outputs are reproducible and traceable via the work/ directory
  • RMarkdown reports are parameterized for workflow compatibility

Known Limitations

  • Requires a working Pandoc installation for RMarkdown rendering
  • Large datasets may require increased memory allocation in Nextflow
  • Feature engineering steps are specific to the current dataset

Future Improvements

  • Add feature selection pipeline
  • Containerize workflow using Docker/Singularity

Author

Gina Magro
Bioinformatics / Computational Biology Pipeline Project
