TanishDevs11/RT-Activity-Predictor

RT-Activity-Predictor

Reproducible submission repository for the Mandrake Bio Retroviral Wall Challenge. The project predicts reverse transcriptase (RT) activity from sequence, protein-family, embedding, and structure-derived features under leave-one-family-out (LOFO) validation.

The repository is packaged so a reviewer can reproduce the submitted CSV and verify the reported score with two commands after environment setup.

Submission Snapshot

Primary submission files:

  • submission.csv
  • results/calibrated_predictions_selected_features.csv
  • WRITEUP.md
  • requirements.txt

Required prediction schema:

rt_name,predicted_active,predicted_score

Final verified metrics on submission.csv:

PR-AUC:             0.9857
Weighted Spearman:  0.9652
CLS:                0.9754

The exact full-precision values recorded for the submitted artifact are:

PR-AUC = 0.985714
Weighted Spearman = 0.965244
CLS = 0.975372

Quick Reproduction

From the repository root:

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

python experiments\materialize_final_submission.py
python evaluation\evaluate.py --predictions submission.csv

Expected output:

PR-AUC:             0.9857
Weighted Spearman:  0.9652
CLS:                0.9754

On macOS/Linux, use:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

python experiments/materialize_final_submission.py
python evaluation/evaluate.py --predictions submission.csv

Reproducibility Contract

The final submission is reconstructed from LOFO out-of-fold predictions generated by the family-conditioned pipeline:

results/calibrated_predictions_selected_features.csv

experiments/materialize_final_submission.py rebuilds the final LOFO out-of-fold prediction vector from the preserved outputs of that pipeline, applies the fixed binary decision rule, and writes:

submission.csv
results/calibrated_predictions_selected_features.csv

Materialization does not modify predicted_score, which is the primary ranking output used for CLS evaluation. predicted_active is derived from predicted_score by a fixed decision rule defined inside the pipeline.

Artifact hashes after materialization (the two digests are identical, which indicates the files are byte-for-byte copies):

SHA256 submission.csv
B9CFADF6BF3AFED5BEA1E5D848D5E8F4EF510E322DB61373316A3D9EDC6CA84E

SHA256 results/calibrated_predictions_selected_features.csv
B9CFADF6BF3AFED5BEA1E5D848D5E8F4EF510E322DB61373316A3D9EDC6CA84E
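
The digests above can be recomputed locally. The following is a minimal sketch using only the Python standard library; the helper name sha256_of is an illustration and is not part of this repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


if __name__ == "__main__":
    targets = (
        "submission.csv",
        "results/calibrated_predictions_selected_features.csv",
    )
    for name in targets:
        if Path(name).exists():
            print(name, sha256_of(name))
        else:
            print(name, "not found; run the materialization script first")
```

Compare the printed values with the digests listed above. A mismatch can also come from line-ending conversion on checkout, so it is worth ruling that out before suspecting the data.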

Validation Commands

Check the submitted score (commands in this section use Windows path separators; on macOS/Linux, replace \ with /):

python evaluation\evaluate.py --predictions submission.csv

Check the CSV schema:

python -c "import pandas as pd; p=pd.read_csv('submission.csv'); print(list(p.columns)); print(len(p)); print(sorted(map(int, p.predicted_active.unique()))); print(int(p.isna().sum().sum()))"

Expected schema check:

['rt_name', 'predicted_active', 'predicted_score']
57
[0, 1]
0
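
For reviewers who prefer a check that reports each problem separately, the same schema test can be sketched with the standard library alone. This is an illustrative alternative to the one-liner above, not a script shipped in the repository; check_schema and the example_rt_* names are placeholders.

```python
import csv
import io
import math

EXPECTED_COLUMNS = ["rt_name", "predicted_active", "predicted_score"]


def check_schema(handle, expected_rows=57):
    """Return a list of problems found in a predictions CSV (empty if none)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        # DictReader stores surplus fields under the key None and pads
        # short rows with None values.
        if None in row or any(v is None or not v.strip() for v in row.values()):
            problems.append(f"data row {number}: missing or extra field")
            continue
        try:
            active = float(row["predicted_active"])
            score = float(row["predicted_score"])
        except ValueError:
            problems.append(f"data row {number}: non-numeric value")
            continue
        if active not in (0.0, 1.0):
            problems.append(f"data row {number}: predicted_active must be 0 or 1")
        if not math.isfinite(score):
            problems.append(f"data row {number}: predicted_score is not finite")
    return problems


# In-memory demonstration with two made-up rows.
sample = io.StringIO(
    "rt_name,predicted_active,predicted_score\n"
    "example_rt_a,1,0.91\n"
    "example_rt_b,0,0.12\n"
)
print(check_schema(sample, expected_rows=2))  # prints []
```

To run it on the real file, open submission.csv with newline="" and pass the handle to check_schema with the default expected_rows=57.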

Generate a confusion matrix against the included labels:

python experiments\confusion_matrix_eval.py

Expected classification counts:

TP: 20
FP: 1
TN: 35
FN: 1
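
As a quick cross-check, the counts above sum to the 57 enzymes and imply the precision, recall, and accuracy computed below. The sketch also shows how such counts are tallied from two binary vectors; it is illustrative and does not reproduce experiments/confusion_matrix_eval.py.

```python
def confusion_counts(y_true, y_pred):
    """Tally TP/FP/TN/FN for two equal-length sequences of 0/1 labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["FN" if truth == 1 else "TN"] += 1
    return counts


# Counts reported in this README.
reported = {"TP": 20, "FP": 1, "TN": 35, "FN": 1}
total = sum(reported.values())

precision = reported["TP"] / (reported["TP"] + reported["FP"])
recall = reported["TP"] / (reported["TP"] + reported["FN"])
accuracy = (reported["TP"] + reported["TN"]) / total

print(f"total={total} precision={precision:.4f} "
      f"recall={recall:.4f} accuracy={accuracy:.4f}")
# prints: total=57 precision=0.9524 recall=0.9524 accuracy=0.9649
```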

Run the shuffle-label leakage sanity check:

python experiments\shuffle_labels_test.py
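
The idea behind a shuffle-label check is that a pipeline free of leakage should score near chance once the labels are permuted. The sketch below shows only the general harness on toy data; train_and_score is a hypothetical callable standing in for a full LOFO retrain, and nothing here is taken from shuffle_labels_test.py.

```python
import random
from statistics import mean


def shuffled_label_scores(train_and_score, labels, n_rounds=20, seed=0):
    """Score the pipeline n_rounds times, each with freshly permuted labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        scores.append(train_and_score(shuffled))
    return scores


# Toy data: one feature that separates the true labels perfectly.
features = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9, 0.95]
true_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def toy_train_and_score(labels):
    """Accuracy of thresholding the feature at 0.5 against the given labels."""
    hits = sum(int(x > 0.5) == y for x, y in zip(features, labels))
    return hits / len(labels)


baseline = toy_train_and_score(true_labels)
null_scores = shuffled_label_scores(toy_train_and_score, true_labels)
print(f"true labels: {baseline:.2f}  shuffled mean: {mean(null_scores):.2f}")
```

A shuffled-label score that stays close to the true-label score would suggest that information about the target is reaching the model through some route other than the labels.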

Validation Strategy

All predictions were generated using strict leave-one-family-out cross-validation across the 7 RT families.

Each family was excluded from training before prediction, and CLS was computed on pooled out-of-fold predictions across all 57 enzymes.
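
The splitting scheme described above can be sketched as follows. This is a minimal pure-Python illustration, not the repository's implementation: the function names, the placeholder family labels, and the toy model are assumptions. scikit-learn's LeaveOneGroupOut produces the same kind of split.

```python
from collections import defaultdict


def leave_one_family_out(families):
    """Yield (held_out_family, train_indices, test_indices) per family."""
    by_family = defaultdict(list)
    for index, family in enumerate(families):
        by_family[family].append(index)
    for family, test_idx in by_family.items():
        train_idx = [i for i, f in enumerate(families) if f != family]
        yield family, train_idx, test_idx


def pooled_oof_predictions(families, fit_predict):
    """Collect out-of-fold scores so each item is predicted exactly once."""
    oof = [None] * len(families)
    for _, train_idx, test_idx in leave_one_family_out(families):
        # fit_predict must train on train_idx only and score test_idx.
        for i, score in zip(test_idx, fit_predict(train_idx, test_idx)):
            oof[i] = score
    return oof


families = ["A", "A", "B", "C", "C", "C"]  # placeholder family labels


def toy_fit_predict(train_idx, test_idx):
    # Stand-in model: score every held-out item with the training-set size.
    return [float(len(train_idx))] * len(test_idx)


oof = pooled_oof_predictions(families, toy_fit_predict)
print(oof)  # prints [4.0, 4.0, 5.0, 3.0, 3.0, 3.0]
```

The metric is then computed once on the pooled vector rather than averaged per fold, which matches the description of CLS being computed across all 57 enzymes.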

Method Summary

The modeling objective was to predict RT activity while respecting evolutionary family boundaries. The strongest candidate was selected using LOFO-style evaluation, where each RT family is held out from training before predicting that family. This prevents inflated performance from close family-level similarity.

The feature set combines:

  • sequence composition and charge features
  • structural descriptors from predicted RT structures
  • catalytic-site and cleft-geometry features
  • secondary-structure summaries
  • ESM2 embedding-derived similarity and ranking signals
  • family-conditioned and residual rank-correction signals

The final artifact was selected for the best balance between active/inactive separation and efficiency-aware ranking under the official CLS metric:

CLS = harmonic_mean(PR-AUC, Weighted Spearman)
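
As a worked check, the formula above reproduces the reported CLS from the two full-precision component scores. The snippet is a sketch of the arithmetic only; how the official evaluator treats a zero or negative component is not stated here, so the zero guard is an assumption.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores; returns 0.0 when they sum to zero."""
    if a + b == 0:
        return 0.0  # assumed guard, not taken from the official evaluator
    return 2 * a * b / (a + b)


pr_auc = 0.985714
weighted_spearman = 0.965244

cls = harmonic_mean(pr_auc, weighted_spearman)
print(f"CLS = {cls:.4f}")  # prints "CLS = 0.9754"
```

Because the harmonic mean is pulled toward the smaller of its inputs, a submission cannot reach a high CLS by excelling at classification while ranking poorly, or the reverse.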

Repository Layout

data/
  rt_sequences.csv                         Challenge labels, families, sequences, efficiency values
  family_splits.csv                        Family-level sample counts
  handcrafted_features.csv                 Precomputed sequence/biophysical features
  structure_features.csv                   Predicted-structure feature table
  topology_features.csv                    Structural/topological feature table
  esm2_embeddings.npz                      Precomputed ESM2 embeddings
  structures/                              Predicted protein structures

evaluation/
  evaluate.py                              Official-style CLS evaluator

experiments/
  materialize_final_submission.py          Rebuilds final submission CSVs
  family_conditioned_lofo_search.py        Main family-conditioned LOFO search lineage
  quantum_residual_ranker.py               Quantum-inspired residual ranking experiment
  lambdarank_residual_ranker.py            Active-subset LambdaRank-style residual experiment
  nested_residual_ranker.py                Strict nested-LOFO residual correction experiment
  confusion_matrix_eval.py                 Binary-label sanity report
  shuffle_labels_test.py                   Leakage sanity check

src/
  data_loader.py                           Data loading helpers
  feature_engineering.py                   Main feature construction pipeline
  structure_features.py                    Structure feature extraction utilities
  train_*.py                               Earlier model training scripts

results/
  calibrated_predictions_selected_features.csv
                                           Final LOFO prediction artifact
  *.csv, *.json, *.png                     Candidate outputs, diagnostics, and plots

Main Pipeline Lineage

The shortest reproducible path is:

results/calibrated_predictions_selected_features.csv
  -> experiments/materialize_final_submission.py
  -> submission.csv
  -> evaluation/evaluate.py

The main research/development scripts are:

  • experiments/family_conditioned_lofo_search.py: strongest family-conditioned hybrid LOFO search.
  • experiments/quantum_residual_ranker.py: local quantum-inspired residual ranking.
  • experiments/lambdarank_residual_ranker.py: active-subset rank residual experiment.
  • experiments/nested_residual_ranker.py: strict nested residual correction.
  • src/feature_engineering.py: central construction of sequence, structure, and topology features.

Several late-stage scripts are retained for auditability. They are deterministic given the included data and fixed seeds, but they are exploratory and are not required to regenerate the submitted CSV.

Requirements

Install all declared dependencies with:

python -m pip install -r requirements.txt

The final artifact reproduction path requires only the core scientific Python stack: numpy, pandas, scipy, and scikit-learn. The full requirements.txt also includes packages used by exploratory scripts, plots, structure utilities, ESM/embedding experiments, XGBoost baselines, and quantum-inspired ranking experiments.

Recommended interpreter:

Python 3.10-3.12 for broad package compatibility.

The final CSV materialization and evaluator were also verified in this workspace on:

Python 3.14.2

Fresh-Clone Checklist

  1. Install dependencies from requirements.txt.
  2. Run python experiments\materialize_final_submission.py.
  3. Confirm submission.csv has exactly 57 rows and columns rt_name,predicted_active,predicted_score.
  4. Run python evaluation\evaluate.py --predictions submission.csv.
  5. Confirm the evaluator reports CLS: 0.9754.
  6. Submit submission.csv, WRITEUP.md, and this repository as the runnable code package.

Notes

  • No external data download is required for the final reproduction path; all required inputs are included under data/ and results/.
  • The submitted predicted_score values are continuous rank scores. A higher score indicates that an enzyme is more likely to be active, more efficient, or both.
  • predicted_active is derived from predicted_score using a fixed pipeline rule and is included to satisfy the submission schema. The official ranking metrics use predicted_score.
  • The deadline text in the challenge instructions should be checked against the active submission form before upload.

About

Cross-lineage prediction of reverse transcriptase activity for prime editing using structure-derived features, protein language model embeddings, and ranking-aware LOFO evaluation.
