Bioactivity Feature Engineering & Analysis Pipeline

Overview

This project is a reproducible bioinformatics workflow built using Nextflow to process, analyze, and visualize chemical and protein features associated with biological activity (pActivity).

The pipeline integrates:

Python-based data preprocessing
Feature separation (protein vs chemical descriptors)
Exploratory data analysis in R
PCA and correlation analysis
Automated reporting using RMarkdown

The goal is to investigate how molecular and protein-derived features relate to bioactivity and to prepare structured datasets for downstream predictive modeling.

Workflow Structure

The pipeline consists of four main stages:

1. Data Cleaning (Python)

Removes non-informative and redundant columns
Handles categorical conversion
Outputs a cleaned dataset
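
The cleaning steps above could look roughly like the following pandas sketch. This is not the contents of clean_data.py: the function name `clean_dataframe`, the toy column names, and the choice to integer-encode categorical columns are all assumptions made for illustration.

```python
import pandas as pd


def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Drop non-informative and redundant columns, then encode categoricals."""
    # Non-informative: columns with a single unique value (NaN counts as a value).
    informative = [c for c in df.columns if df[c].nunique(dropna=False) > 1]
    df = df[informative]

    # Redundant: columns whose values exactly duplicate an earlier column.
    is_duplicate = df.T.duplicated().to_numpy()
    df = df.loc[:, ~is_duplicate].copy()

    # Categorical conversion: replace non-numeric columns with integer codes.
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].astype("category").cat.codes
    return df


# Toy input; every column name here is hypothetical.
raw = pd.DataFrame({
    "mol_weight": [180.2, 151.2, 206.3],
    "mol_weight_copy": [180.2, 151.2, 206.3],  # redundant duplicate
    "assay_id": ["A1", "A1", "A1"],            # constant, non-informative
    "organism": ["human", "mouse", "human"],   # categorical
    "pActivity": [6.1, 5.4, 7.2],
})
cleaned = clean_dataframe(raw)
print(cleaned.columns.tolist())
```

Integer codes are only one possible encoding; one-hot encoding may be a better fit when the categories have no natural order.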

2. Feature Splitting (Python)

Separates dataset into:
- Protein Features
- Chemical Features
- Labels (Chemical and target: pActivity)
Produces multiple structured CSV files
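
As a rough illustration of the splitting stage, the sketch below separates columns by name prefix and writes one CSV per group. The prefixes `prot_` and `chem_`, the label column `chemical_id`, the output file names, and the function `split_feature_sets` are assumptions; the actual script may identify feature groups differently.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_feature_sets(df, label_cols, protein_prefix="prot_", chemical_prefix="chem_"):
    """Return protein, chemical, and label tables selected by column name."""
    protein_cols = [c for c in df.columns if c.startswith(protein_prefix)]
    chemical_cols = [c for c in df.columns if c.startswith(chemical_prefix)]
    return {
        "protein_features": df[protein_cols],
        "chemical_features": df[chemical_cols],
        "labels": df[label_cols],
    }


# Toy cleaned dataset; all column names are hypothetical.
cleaned = pd.DataFrame({
    "chemical_id": ["C1", "C2", "C3"],
    "prot_length": [320, 415, 290],
    "prot_hydrophobicity": [0.41, 0.37, 0.52],
    "chem_logp": [1.2, 3.4, 2.1],
    "chem_mol_weight": [180.2, 151.2, 206.3],
    "pActivity": [6.1, 5.4, 7.2],
})
parts = split_feature_sets(cleaned, label_cols=["chemical_id", "pActivity"])

# Write each table to its own CSV (a temporary directory is used here).
out_dir = Path(tempfile.mkdtemp())
for name, table in parts.items():
    table.to_csv(out_dir / f"{name}.csv", index=False)
```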

3. Protein and Chemical Feature Exploration (RMarkdown)

Descriptive statistics
Distribution analysis
PCA (Principal Component Analysis)
Correlation Analysis
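
The reports themselves are written in R. For readers who want to see the two core computations in isolation, here is a minimal NumPy sketch of PCA (via SVD on standardized columns) and a Pearson correlation matrix. The helper name `pca` and the synthetic data are assumptions, not part of the pipeline.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA via SVD on standardized columns.

    Returns the component scores and the explained-variance ratio
    of the retained components.
    """
    X = np.asarray(X, dtype=float)
    # Center and scale each feature to unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Synthetic features: the second column is strongly correlated with the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(50, 1)),
    rng.normal(size=(50, 1)),
])

scores, explained = pca(X, n_components=2)
corr = np.corrcoef(X, rowvar=False)  # feature-by-feature Pearson correlations
```

Standardizing before the SVD matters when descriptors are on very different scales; without it, the largest-magnitude features dominate the leading components.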

4. Predictive Modeling

Multiple supervised machine learning models are implemented using the caret framework to predict pActivity from combined chemical and protein features. The dataset is split into training (80%) and testing (20%) subsets to evaluate generalizability.
The following models are trained and compared:

Random Forest (RF)
Support Vector Machine with radial basis kernel (SVM)
Generalized Linear Model (GLM)
Elastic Net (ENET)

Models are trained using repeated 5-fold cross-validation with hyperparameter tuning. Preprocessing steps, including centering, scaling, and PCA, are applied where appropriate.
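
To make the resampling scheme concrete, the sketch below builds an 80/20 train/test split and then generates repeated 5-fold cross-validation splits on the training rows only. The pipeline relies on caret for this; the function names, the seed, and the number of repeats (3) are assumptions, since the repeat count is not stated here.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_kfold(indices, k=5, repeats=3, seed=42):
    """Yield (repeat, fold, train_idx, val_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        # Reshuffle before each repeat so the folds differ between repeats.
        shuffled = indices[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for f, val_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train_idx, val_idx


train_idx, test_idx = train_test_split_indices(100)
splits = list(repeated_kfold(train_idx, k=5, repeats=3))
print(len(train_idx), len(test_idx), len(splits))
```

Keeping the test rows out of the cross-validation loop means hyperparameters are tuned without ever seeing the held-out data.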

Performance is evaluated on a held-out test set using:

Root Mean Square Error (RMSE)
R-squared
Mean Absolute Error (MAE)
Pearson Correlation
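
For reference, the four metrics above can be computed as in this plain-Python sketch. The function name `regression_metrics` and the example values are made up. R-squared is computed here as the coefficient of determination (1 - SS_res / SS_tot); caret's `postResample` reports the squared correlation instead, so the two can differ.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, R-squared, MAE, and Pearson correlation for two sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n

    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total sum of squares
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))

    return {
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "Pearson": cov / math.sqrt(ss_tot * var_p),
    }


# Made-up observed and predicted pActivity values.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.5, 7.5, 7.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```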

Random Forest and SVM models achieve near-perfect predictive performance, which suggests potential overfitting rather than genuine generalization. Further validation, including external testing, is required to confirm model robustness.

Bioactivity/
│
│
├── data/
│ ├── (multiple CSV files produced by feature splitting)
│ ├── raw_data.csv
│ └── target.csv
│
├── scripts/
│ ├── clean_data.py
│ ├── split_feature_sets.py
│ ├── predictive_models.py
│ ├── Predictive_Model.Rmd
│ ├── Sandbox.R
│ ├── Protein_Feature_Exploration.Rmd
│ └── Chem_Feature_Exploration.Rmd
│
├── results/
│ ├── REPORT_chemFeatureExploration/
│ ├── REPORT_proteinFeatureExploration/
│ └── REPORT_PredictiveModels/
│
│
├── main.nf # Nextflow pipeline
│
├── .nextflow (auto-generated)
├── work/ # Nextflow working directory (auto-generated)
│
└── README.md

Requirements

System Dependencies

Java (required for Nextflow)
Nextflow (>= 23.x recommended)
Python (>= 3.8)
R (>= 4.0)

R Packages

Install required R packages:

install.packages(c(
    "rmarkdown",
    "corrplot",
    "ggplot2",
    "caret"  # used by the predictive modeling stage
))

Python Packages

pip install pandas numpy

How to Run the Pipeline

Run the full workflow using:

nextflow run main.nf

Notes on Reproducibility

All intermediate files are handled automatically by Nextflow
Each process runs in its own isolated task directory under work/
Outputs are reproducible and traceable via the work/ directory
RMarkdown reports are parameterized for workflow compatibility

Known Limitations

Requires a correct installation of Pandoc for RMarkdown rendering
Large datasets may require increased memory allocation in Nextflow
Feature engineering is specific to the current dataset

Future Improvements

Add feature selection pipeline
Containerize workflow using Docker/Singularity

Author

Gina Magro
Bioinformatics / Computational Biology Pipeline Project
