Panda-myj/PertDiff

PertDiff

This is the official code for "Enhancing Cross-Context Generalization in Drug Perturbation Prediction with a Multimodal Conditional Diffusion Framework". The data and results (including model.pt) are available on Zenodo, DOI 10.5281/zenodo.18427848.

1. Setup environment

conda env create -f environment.yml
conda activate PertDiff
pip install -r requirements.txt

Then change into the code directory and install the package:

pip install -e .

Download the sentence-transformers model from "sentence-transformers/all-MiniLM-L6-v2 · Hugging Face" and place it at PertDiff/sentence-transformers/all-MiniLM-L6-v2/.
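Before running the later steps, it can help to confirm that the model directory is populated. The sketch below is only an illustration: the helper name `missing_model_files` is hypothetical, and the two file names it looks for are an assumption about what a sentence-transformers checkpoint usually contains, not a list taken from this repository.

```python
from pathlib import Path

# Assumption: files typically shipped with a sentence-transformers checkpoint.
EXPECTED_FILES = ("config.json", "modules.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# Path taken from the setup instructions; adjust it to your checkout location.
missing = missing_model_files("PertDiff/sentence-transformers/all-MiniLM-L6-v2")
print("missing files:", missing or "none")
```

An empty result only shows that the listed files exist; it does not verify their contents.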

2. Data preparation

2.1. data overview

processed_data.h5 has 164 cell lines and 8316 drugs.

164 cell lines = 18 normal + 29 pool + 117 tumor

drug split:

train: 163 cells, 4989 drugs, 47142 pairs
valid: 148 cells, 1664 drugs, 15714 pairs
test: 161 cells, 1663 drugs, 15713 pairs

cell split:

train: 150 cells, 8315 drugs, 74229 pairs
valid: 7 cells, 1350 drugs, 2381 pairs
test: 7 cells, 1518 drugs, 1959 pairs

leave new cells out (strict cell split):

train: 10 cells, 7381 drugs, 23588 pairs
valid: 2 cells, 6903 drugs, 7034 pairs
test: 3 cells, 5467 drugs, 7441 pairs
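The counts above can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers for the drug split and the cell split and verifies that they are internally consistent: the drug counts of the drug split sum to 8316, the cell counts of the cell split sum to 164, and both splits cover the same number of (cell, drug) pairs. The variable names are illustrative only, and reading the matching sums as disjoint partitions is an interpretation rather than something the README states.

```python
# Counts copied from the data overview.
TOTAL_CELLS, TOTAL_DRUGS = 164, 8316
cell_groups = {"normal": 18, "pool": 29, "tumor": 117}

# (cells, drugs, pairs) per partition
drug_split = {
    "train": (163, 4989, 47142),
    "valid": (148, 1664, 15714),
    "test": (161, 1663, 15713),
}
cell_split = {
    "train": (150, 8315, 74229),
    "valid": (7, 1350, 2381),
    "test": (7, 1518, 1959),
}


def column_sum(split, index):
    """Sum one column (0 = cells, 1 = drugs, 2 = pairs) over all partitions."""
    return sum(row[index] for row in split.values())


# The three cell-line groups add up to the total number of cell lines.
assert sum(cell_groups.values()) == TOTAL_CELLS
# Drug split: drug counts sum to the total number of drugs.
assert column_sum(drug_split, 1) == TOTAL_DRUGS
# Cell split: cell counts sum to the total number of cell lines.
assert column_sum(cell_split, 0) == TOTAL_CELLS
# Both splits cover the same set size of (cell, drug) pairs.
assert column_sum(drug_split, 2) == column_sum(cell_split, 2)

total_pairs = column_sum(drug_split, 2)
for name, (_, _, pairs) in drug_split.items():
    print(f"drug split {name}: {pairs / total_pairs:.1%} of pairs")
```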

2.2. drug vectors

Prepare a smiles.csv file containing the SMILES strings of all drugs (column name 'smiles'), put it at /data/smiles.csv, then run:

python scripts/GetComVec_mpg.py

This generates the drug feature file at /data/mpg8316_768.pkl.
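A malformed smiles.csv is easier to catch before the embedding script runs. The following sketch uses only the standard library to confirm that the file has a 'smiles' column and no empty entries. The function name `read_smiles` is hypothetical and is not part of the repository; whether GetComVec_mpg.py performs similar checks itself is not stated here.

```python
import csv


def read_smiles(csv_path, column="smiles"):
    """Read the SMILES column of a CSV file and reject empty entries."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise ValueError(f"{csv_path} has no '{column}' column")
        # A short row yields None for missing fields, hence the `or ""`.
        smiles = [(row[column] or "").strip() for row in reader]
    empty_rows = [i for i, value in enumerate(smiles) if not value]
    if empty_rows:
        raise ValueError(f"empty SMILES at data rows {empty_rows}")
    return smiles
```

Calling `read_smiles` on the prepared file returns the list of strings, so `len(read_smiles(path))` gives a quick count to compare against the expected number of drugs.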

2.3. cell line description vectors

Prepare a cell_description.xlsx file containing a description of every cell line (columns 'cell_iname' and 'cell_description'), put it at /data/cell_description.xlsx, then run:

python scripts/GetCellVec_description.py

This generates the cell line feature file at /data/cell_164_384.pkl.

3. Model performance evaluation

We evaluate PertDiff under three data-splitting protocols to assess its robustness and generalization across drugs and cells. Model checkpoints are saved under /result/model/<split>/<seed>/<if_cell_vec>/. Predictions are stored as .h5ad files under /data/.
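The commands in the following subsections suggest a regular naming scheme for checkpoints and prediction files. The sketch below reconstructs that scheme from the example paths in this README (for instance model001500.pth and smile_split_description_978_predict.h5ad). The helper names `checkpoint_path` and `prediction_path` are hypothetical, and the six-digit zero padding of the epoch is inferred from the examples rather than read from the training code.

```python
from pathlib import PurePosixPath


def checkpoint_path(split, seed, if_cell_vec, epoch, root="../result/model"):
    """Build <root>/<split>/<seed>/<if_cell_vec>/model<epoch, 6 digits>.pth."""
    file_name = f"model{epoch:06d}.pth"
    return str(PurePosixPath(root) / split / str(seed) / if_cell_vec / file_name)


def prediction_path(split, seed, if_cell_vec, root="../data"):
    """Build <root>/<split>_<if_cell_vec>_<seed>_predict.h5ad."""
    return str(PurePosixPath(root) / f"{split}_{if_cell_vec}_{seed}_predict.h5ad")


print(checkpoint_path("smile_split", 978, "description", 1500))
print(prediction_path("smile_split", 978, "description"))
```

In the examples, the prediction file carries the split name passed to diff_sample.py, which can differ from the split used for training (for example bulk_ic50 versus bulk_ic50_train).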

3.1. drug split

Purpose: Evaluate generalization to unseen drugs.

data split

python scripts/myj_split.py --split_data_type smiles_split --train_cell_count None

This produces train_data_smile.h5, valid_data_smile.h5, and test_data_smile.h5.

options:

seed: 978, 364039, 20250410
if_cell_vec: description, no

train

python scripts/diff_train.py --split smile_split --seed 978 --if_cell_vec description --lr_adjust True --epoch 1500 --dropout 0.1 --lr 1e-5 --weight_decay 1e-6

sample

python scripts/diff_sample.py --split smile_split --seed 978 --if_cell_vec description --model_path "../result/model/smile_split/978/description/model001500.pth" 

evaluate

python scripts/diff_evaluate.py --split smile_split --predict_data "../data/smile_split_description_978_predict.h5ad"
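To cover every combination of the seeds and if_cell_vec settings listed under options, the training command can be generated rather than typed by hand. The sketch below only prints the command lines and does not launch any training. It assumes that the hyperparameters of the example command apply unchanged to all seeds and to both if_cell_vec settings, which the README does not state explicitly; the function name `train_command` is illustrative.

```python
import itertools
import shlex

SEEDS = (978, 364039, 20250410)
CELL_VEC_MODES = ("description", "no")


def train_command(seed, if_cell_vec, split="smile_split", epoch=1500):
    """Return the diff_train.py invocation as an argument list."""
    return [
        "python", "scripts/diff_train.py",
        "--split", split,
        "--seed", str(seed),
        "--if_cell_vec", if_cell_vec,
        "--lr_adjust", "True",
        "--epoch", str(epoch),
        # Assumption: same regularization and optimizer settings for every run.
        "--dropout", "0.1",
        "--lr", "1e-5",
        "--weight_decay", "1e-6",
    ]


commands = [
    train_command(seed, mode)
    for seed, mode in itertools.product(SEEDS, CELL_VEC_MODES)
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to execute the runs one after another.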

3.2. cell split

Purpose: Evaluate generalization to unseen cell lines.

data split

python scripts/myj_split.py --split_data_type cells_split --train_cell_count all

This produces train_data_cell.h5, valid_data_cell.h5, and test_data_cell.h5.

options:

seed: 978, 364039, 20250410
if_cell_vec: description, no

train

python scripts/diff_train.py --split cell_split --seed 978 --if_cell_vec description --lr_adjust True --epoch 300 --dropout 0.1 --lr 1e-5 --weight_decay 1e-6

sample

python scripts/diff_sample.py --split cell_split --seed 978 --if_cell_vec description --model_path "../result/model/cell_split/978/description/model000300.pth"

evaluate

python scripts/diff_evaluate.py --split cell_split --predict_data "../data/cell_split_description_978_predict.h5ad"

3.3. leave new cells out

Purpose: A stricter cross-cell evaluation, where entirely unseen cell types are excluded from training.

data split

python scripts/myj_split_leave_new_cells_out.py

This produces leave_new_cells_out_train_data.h5, leave_new_cells_out_valid_data.h5, and leave_new_cells_out_test_data.h5.

options:

seed: 20250525, 20250911, 20255202
if_cell_vec: description, no

train

python scripts/diff_train.py --split leave_new_cells_out --seed 20250525 --if_cell_vec description --lr_adjust True --epoch 300 --dropout 0.5 --lr 1e-5 --weight_decay 1e-6

sample

python scripts/diff_sample.py --split leave_new_cells_out --seed 20250525 --if_cell_vec description --model_path "../result/model/leave_new_cells_out/20250525/description/model000300.pth"

evaluate

python scripts/diff_evaluate.py --split leave_new_cells_out --predict_data "../data/leave_new_cells_out_description_20250525_predict.h5ad"

4. Downstream applications

4.1. drug response prediction

4.1.1. under cross drug

options:

if_cell_vec: description, no

data split

python scripts/myj_split_exp2.py

This produces exp_2_train_data.h5, exp_2_valid_data.h5, exp_2_test_data.h5, and exp_2_external_data.h5. Use exp_2_train_data.h5 to train the model and exp_2_external_data.h5 for inference on external data.

train and evaluate

python scripts/diff_train.py --split exp_2 --seed 895834 --if_cell_vec description --lr_adjust True --epoch 1500 --dropout 0.3 --lr 1e-5 --weight_decay 1e-6
python scripts/diff_sample.py --split exp_2 --seed 895834 --if_cell_vec description --model_path "../result/model/exp_2/895834/description/model001500.pth"
python scripts/diff_evaluate.py --split exp_2 --predict_data "../data/exp_2_description_895834_predict.h5ad"

external data inference

python scripts/diff_sample.py --split exp_2_external --seed 895834 --if_cell_vec no --model_path "../result/model/exp_2/895834/no/model001500.pth"

drug response prediction analysis

python drug_response_prediction.py

4.1.2. under cross cell

bulk_ic50 is the drug response prediction task under the cross-cell setting.

data split

python scripts/myj_split_bulk_ic50.py

This produces bulk_ic50_train_data.h5, bulk_ic50_valid_data.h5, and bulk_ic50_test_data.h5.

python scripts/ic50_get_h5.py

This produces bulk_ic50_infer.h5 from bulk_14cell_114smiles_1483sig.csv.

options:

if_cell_vec: description, no

bulk_ic50_cell_split_train

python scripts/GetComVec_mpg.py # csv is bulk_ic50_smiles.csv

python scripts/diff_train.py --split bulk_ic50 --seed 364039 --if_cell_vec description --lr_adjust True --epoch 300 --dropout 0.13 --lr 1e-5 --weight_decay 1e-6
python scripts/diff_sample.py --split bulk_ic50_train --seed 364039 --if_cell_vec description --model_path "../result/model/bulk_ic50/364039/description/model000300.pth"
python scripts/diff_evaluate.py --split bulk_ic50_train --predict_data "../data/bulk_ic50_train_description_364039_predict.h5ad"

bulk_ic50_cell_split_infer

python scripts/diff_sample.py --split bulk_ic50_infer --seed 364039 --if_cell_vec description --model_path "../result/model/bulk_ic50/364039/description/model000300.pth"

bulk_ic50_evaluate

python scripts/bulk_ic50_evaluate.py

4.2. drug repurposing

exp_3 is the drug repurposing experiment: the trained cross-drug model is used to predict the perturbations induced by PRISM drugs.

options:

seed: 100, 1000, 10000
if_cell_vec: description, no

predict perturbations for PRISM drugs

python scripts/GetComVec_mpg.py # csv is smiles_PRISM.csv
python scripts/GetExp3_YAPC_data_h5.py

This produces exp3_test_data.h5.

sample

python scripts/diff_sample.py --split exp_3 --if_cell_vec no --model_path "../result/model/smile_split/364039/no/model001500.pth" --seed 10000

drug repurposing analysis

python scripts/drug_repurposing.py

4.3. clinical validation

options:

GSE: GSE25055 FILE: GSE25055_2w
GSE: GSE32646 FILE: GSE32646_5w
GSE: GSE20194 FILE: GSE20194_2w

clean data and get h5 file

python scripts/GetComVec_mpg.py # csv is smiles_clinical.csv
python scripts/GetCellVec_description.py #xlsx is GSE_description.xlsx
python scripts/clinical_get_978.py --GSE GSE25055 --FILE GSE25055_2w
python scripts/clinical_get_h5.py --GSE GSE25055
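The commands above are written for GSE25055. A small generator can produce the two dataset-specific preprocessing commands for all three GSE/FILE pairs from the options list. This is a sketch under the assumption that clinical_get_978.py and clinical_get_h5.py accept the other two datasets with the same flags, and that the two embedding scripts only need to run once; neither point is confirmed here. The function name `preprocessing_commands` is hypothetical.

```python
import shlex

# GSE accession -> FILE value, copied from the options list.
GSE_FILES = {
    "GSE25055": "GSE25055_2w",
    "GSE32646": "GSE32646_5w",
    "GSE20194": "GSE20194_2w",
}


def preprocessing_commands(gse, file_name):
    """Return the two dataset-specific preprocessing commands as argument lists."""
    return [
        ["python", "scripts/clinical_get_978.py", "--GSE", gse, "--FILE", file_name],
        ["python", "scripts/clinical_get_h5.py", "--GSE", gse],
    ]


# Print the commands only; nothing is executed here.
for gse, file_name in GSE_FILES.items():
    for cmd in preprocessing_commands(gse, file_name):
        print(shlex.join(cmd))
```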

Use the cell split model to predict the clinical data:

python scripts/diff_sample.py --split clinical --seed 364039 --if_cell_vec clinical --clinical GSE25055 --model_path "../result/model/cell_split/364039/description/model000300.pth"

Evaluate the clinical predictions with e-distance:

python scripts/clinical_evaluate.py --GSE GSE25055
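For readers unfamiliar with the metric, the sketch below computes the textbook energy distance between two samples, 2·E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖, using plain Euclidean distances and only the standard library. It is not taken from clinical_evaluate.py: the exact variant used there is not described in this README, and some implementations use squared Euclidean distances or a different normalization. The function names are illustrative.

```python
import math
from itertools import product


def mean_pairwise_distance(xs, ys):
    """Mean Euclidean distance over all pairs (x, y) with x in xs and y in ys."""
    total = sum(math.dist(x, y) for x, y in product(xs, ys))
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Estimate 2 E|X-Y| - E|X-X'| - E|Y-Y'| (self-pairs included)."""
    return (
        2.0 * mean_pairwise_distance(xs, ys)
        - mean_pairwise_distance(xs, xs)
        - mean_pairwise_distance(ys, ys)
    )


# Toy expression profiles: each tuple is one sample, each entry one gene.
predicted = [(0.0, 1.0), (1.0, 1.0), (0.5, 0.0)]
observed = [(2.0, 1.0), (3.0, 1.0), (2.5, 0.0)]
print(energy_distance(predicted, observed))
```

Identical samples give a value of zero, and the value grows as the two distributions move apart, so lower is better when comparing predicted with observed profiles.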
