This repository contains the code accompanying the paper GRAFT: Gated Retrieval-Augmented Fine-Tuning for Relation Extraction. GRAFT augments biomedical relation extraction with externally retrieved context, using a learned gating mechanism to balance the contribution of input text and retrieved snippets. The framework supports both encoder backbones (BiomedBERT) and causal LLMs (Llama-3.x-Instruct).
GRAFT-RE/
├── README.md
├── requirements.txt
├── Retriever/ # MedCPT-based retrieval (entity-wise, pairwise, merge)
├── BERT-based/ # BiomedBERT fine-tuning with GRAFT
├── Llama-based/ # Llama fine-tuning with GRAFT (one script per dataset)
├── Datasets/ # CDR, BioRED, ChemProt (encrypted, see below)
├── keys/
│   └── public_key.pem # RSA-2048 public key, used for encryption
└── tools/
    ├── encrypt_datasets.py
    └── decrypt_datasets.py
pip install -r requirements.txt

The encryption tooling additionally requires the cryptography package.
All files under Datasets/ are stored in encrypted form (.enc
suffix). Each file is encrypted separately with a hybrid scheme:
- A fresh random 256-bit AES key per file.
- File body encrypted with AES-256-GCM.
- AES key encrypted with RSA-OAEP (SHA-256) using a 2048-bit public key.
The matching RSA private key is required to decrypt. Contact the authors to request access.
The public key (keys/public_key.pem) is bundled in this repository
and is sufficient to encrypt additional files. The private key is
not in this repository.
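
The hybrid scheme described above can be sketched with the cryptography package. This is a minimal illustration and not the code in tools/encrypt_datasets.py: the function names hybrid_encrypt and hybrid_decrypt are hypothetical, and the use of MGF1 with SHA-256 and an empty OAEP label is an assumption, since only the OAEP hash is stated here.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def _oaep():
    # Assumption: MGF1 also uses SHA-256 and no OAEP label is set.
    return padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None,
    )


def hybrid_encrypt(plaintext, public_key):
    """Encrypt bytes with a fresh AES-256-GCM key, then wrap that key with RSA-OAEP."""
    aes_key = AESGCM.generate_key(bit_length=256)  # fresh random key per call
    nonce = os.urandom(12)                         # 96-bit GCM nonce
    # AESGCM.encrypt returns the ciphertext with the 16-byte tag appended.
    ciphertext = AESGCM(aes_key).encrypt(nonce, plaintext, None)
    wrapped_key = public_key.encrypt(aes_key, _oaep())
    return wrapped_key, nonce, ciphertext


def hybrid_decrypt(wrapped_key, nonce, ciphertext, private_key):
    """Unwrap the AES key with the RSA private key, then decrypt and authenticate."""
    aes_key = private_key.decrypt(wrapped_key, _oaep())
    return AESGCM(aes_key).decrypt(nonce, ciphertext, None)


if __name__ == "__main__":
    # Throwaway key pair for demonstration only; it is unrelated to the repository's key.
    demo_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    parts = hybrid_encrypt(b"example record", demo_key.public_key())
    assert hybrid_decrypt(*parts, demo_key) == b"example record"
```

To use the bundled key instead of a throwaway one, the public key can be loaded with serialization.load_pem_public_key from the same package.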
python tools/decrypt_datasets.py \
--privkey /path/to/private_key.pem \
--root Datasets

This writes the original files alongside the .enc files. Add
--delete-encrypted to remove the .enc files after a successful
decryption.
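
As a quick sanity check after decryption, one could list any .enc file that still lacks a plaintext sibling. The helper below is a hypothetical sketch, not part of tools/. It assumes the plaintext name is the encrypted name with the trailing .enc removed, and it is only meaningful when --delete-encrypted was not used.

```python
from pathlib import Path


def missing_plaintext(root):
    """Return the .enc files under root that have no decrypted sibling file."""
    missing = []
    for enc_path in sorted(Path(root).rglob("*.enc")):
        # Assumption: "train.json.enc" decrypts to "train.json" in the same directory.
        plain_path = enc_path.with_suffix("")
        if not plain_path.is_file():
            missing.append(enc_path)
    return missing


if __name__ == "__main__":
    for path in missing_plaintext("Datasets"):
        print(f"not yet decrypted: {path}")
```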
python tools/encrypt_datasets.py \
--pubkey keys/public_key.pem \
--root Datasets \
--delete-originals

Each encrypted file is a single binary blob:
[4 bytes BE] length N of the RSA-encrypted AES key
[N bytes] RSA-OAEP(SHA-256) encryption of the 32-byte AES key
[12 bytes] AES-GCM nonce
[...] AES-256-GCM ciphertext (16-byte authentication tag appended)
The format is self-contained: each file can be decrypted independently of the others.
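
The layout above can be read and written with the standard struct module. The following is a sketch under the stated layout only. The names pack_blob and unpack_blob are hypothetical, and the bounds checks are an addition for illustration rather than a description of what the repository's tools do.

```python
import struct

NONCE_LEN = 12  # AES-GCM nonce size from the layout
TAG_LEN = 16    # GCM authentication tag appended to the ciphertext


def pack_blob(wrapped_key, nonce, ciphertext):
    """Serialize the three components into a single binary blob."""
    if len(nonce) != NONCE_LEN:
        raise ValueError("nonce must be 12 bytes")
    # ">I" is a 4-byte big-endian unsigned integer holding the wrapped-key length.
    return struct.pack(">I", len(wrapped_key)) + wrapped_key + nonce + ciphertext


def unpack_blob(blob):
    """Split a blob into (wrapped_key, nonce, ciphertext_with_tag)."""
    if len(blob) < 4:
        raise ValueError("blob too short for the length prefix")
    (key_len,) = struct.unpack_from(">I", blob, 0)
    key_end = 4 + key_len
    nonce_end = key_end + NONCE_LEN
    if len(blob) < nonce_end + TAG_LEN:
        raise ValueError("blob is truncated")
    return blob[4:key_end], blob[key_end:nonce_end], blob[nonce_end:]
```

For a 2048-bit RSA key the wrapped AES key is 256 bytes, so the length prefix would read 00 00 01 00.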
python Retriever/entity_wise.py \
--dataset_path {dataset_path} \
--dataset_name {dataset_name} \
--corpus {Wikipedia|PubMed} \
--fname {output_filename}

python Retriever/pair_wise.py \
--dataset_path {dataset_path} \
--dataset_name {dataset_name} \
--corpus PubMed \
--fname {output_filename}

python Retriever/merge.py \
--wiki_path {wiki_path} \
--pubmed_path {pubmed_path} \
--fname {output_filename}

python BERT-based/main.py \
--task {task} \
--document_path {document_path} \
--do_eval --eval_test \
--model microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract \
--do_lower_case \
--add_new_tokens \
--output_dir {output_dir} \
--num_docs {num_docs} \
--train_batch_size {batch_size} \
--eval_batch_size {batch_size} \
--learning_rate {lr} \
--num_train_epochs {num_epoch} \
--context_window {0|100} \
--max_seq_length 512 \
--drop_out {drop_out} \
--file_dir {Dataset_dir} \
--dev_file val.json \
--test_file test.json

One script per dataset: llm_cdr.py, llm_biored.py, llm_chemprot.py.
Example for ChemProt:
python Llama-based/llm_chemprot.py \
--output_dir {output_dir} \
--dataset_path Datasets/chemprot/chemprot_cls.py \
--document_path {document_path} \
--num_docs {num_docs} \
--per_device_train_batch_size {batch_size} \
--gradient_accumulation_steps {accumulation_steps} \
--num_train_epochs {num_epochs} \
--learning_rate {lr} \
--optim paged_adamw_8bit

Run tools/decrypt_datasets.py first so that the Datasets/... paths resolve
to the plaintext files the loader scripts expect.
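
When choosing values for --per_device_train_batch_size and --gradient_accumulation_steps in the ChemProt command above, it can help to compute the resulting number of examples per optimizer step. The sketch below assumes the script forwards these flags to Hugging Face training arguments with their usual meaning, which is not confirmed here. The function name and the example values are illustrative only.

```python
def effective_batch_size(per_device_train_batch_size,
                         gradient_accumulation_steps,
                         num_devices=1):
    """Examples contributing to each optimizer step under gradient accumulation."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


if __name__ == "__main__":
    # Illustrative values, not recommended settings: 4 per device, 8 accumulation steps.
    print(effective_batch_size(4, 8))  # prints 32
```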
If you use this code, please cite the GRAFT paper.