yuhangjiang22/GRAFT-RE

GRAFT-RE: Gated Retrieval-Augmented Fine-Tuning for Relation Extraction

This repository contains the code accompanying the paper GRAFT: Gated Retrieval-Augmented Fine-Tuning for Relation Extraction. GRAFT augments biomedical relation extraction with externally retrieved context, using a learned gating mechanism to balance the contribution of input text and retrieved snippets. The framework supports both encoder backbones (BiomedBERT) and causal LLMs (Llama-3.x-Instruct).

Repository layout

GRAFT-RE/
├── README.md
├── requirements.txt
├── Retriever/          # MedCPT-based retrieval (entity-wise, pairwise, merge)
├── BERT-based/         # BiomedBERT fine-tuning with GRAFT
├── Llama-based/        # Llama fine-tuning with GRAFT (one script per dataset)
├── Datasets/           # CDR, BioRED, ChemProt (encrypted, see below)
├── keys/
│   └── public_key.pem  # RSA-2048 public key, used for encryption
└── tools/
    ├── encrypt_datasets.py
    └── decrypt_datasets.py

Setup

pip install -r requirements.txt

The encryption tooling additionally requires the cryptography package.

Datasets

All files under Datasets/ are stored encrypted (.enc suffix). Each file is encrypted independently with a hybrid scheme:

  • A fresh random 256-bit AES key per file.
  • File body encrypted with AES-256-GCM.
  • AES key encrypted with RSA-OAEP (SHA-256) using a 2048-bit public key.

The matching RSA private key is required to decrypt. Contact the authors to request access.

The public key (keys/public_key.pem) is bundled in this repository and is sufficient to encrypt additional files. The private key is not in this repository.
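The hybrid scheme described above can be sketched with the cryptography package as follows. This is an illustrative round trip, not the code in tools/: the function names hybrid_encrypt and hybrid_decrypt are hypothetical, the 12-byte nonce follows the file format documented below, and the MGF1 hash is assumed to be SHA-256 because the README names only the OAEP hash.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumption: MGF1 also uses SHA-256; only the OAEP hash is documented.
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def hybrid_encrypt(plaintext: bytes, public_key):
    """Return (wrapped_key, nonce, ciphertext_with_tag) for one file body."""
    aes_key = AESGCM.generate_key(bit_length=256)  # fresh AES key per file
    nonce = os.urandom(12)                         # 96-bit GCM nonce
    # AESGCM.encrypt appends the 16-byte authentication tag to the ciphertext.
    ciphertext = AESGCM(aes_key).encrypt(nonce, plaintext, None)
    wrapped_key = public_key.encrypt(aes_key, OAEP)
    return wrapped_key, nonce, ciphertext


def hybrid_decrypt(wrapped_key: bytes, nonce: bytes, ciphertext: bytes, private_key) -> bytes:
    """Unwrap the AES key with RSA-OAEP, then decrypt and verify the body."""
    aes_key = private_key.decrypt(wrapped_key, OAEP)
    return AESGCM(aes_key).decrypt(nonce, ciphertext, None)


if __name__ == "__main__":
    # Throwaway key pair for demonstration only; the real private key is held by the authors.
    demo_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    parts = hybrid_encrypt(b"example", demo_key.public_key())
    assert hybrid_decrypt(*parts, demo_key) == b"example"
```

Because only the small AES key goes through RSA, file size is not limited by the RSA modulus, and anyone holding the public key can encrypt without being able to decrypt.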

Decrypt before use

python tools/decrypt_datasets.py \
    --privkey /path/to/private_key.pem \
    --root Datasets

This writes the original files alongside the .enc files. Add --delete-encrypted to remove the .enc files after a successful decryption.
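To confirm that decryption covered every file, a small check like the following may help. It is a sketch that assumes each plaintext file has the same name as its .enc counterpart with the suffix stripped (for example, train.json next to train.json.enc); the helper name missing_plaintexts is hypothetical and not part of tools/.

```python
from pathlib import Path


def missing_plaintexts(root):
    """Return the .enc files under root that have no decrypted sibling."""
    missing = []
    for enc in sorted(Path(root).rglob("*.enc")):
        # with_suffix("") drops only the final ".enc" suffix.
        if not enc.with_suffix("").exists():
            missing.append(enc)
    return missing


if __name__ == "__main__":
    for path in missing_plaintexts("Datasets"):
        print(f"not yet decrypted: {path}")
```

An empty result means every encrypted file has a plaintext counterpart alongside it.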

Re-encrypt (e.g., before redistribution)

python tools/encrypt_datasets.py \
    --pubkey keys/public_key.pem \
    --root Datasets \
    --delete-originals

File format

Each encrypted file is a single binary blob:

[4 bytes BE]  length N of the RSA-encrypted AES key
[N bytes]     RSA-OAEP(SHA-256) encryption of the 32-byte AES key
[12 bytes]    AES-GCM nonce
[...]         AES-256-GCM ciphertext (16-byte authentication tag appended)

The format is self-contained: each file can be decrypted independently of the others.
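The layout above can be read and written with the standard library alone. The sketch below only splits and joins the fields and performs no cryptography; split_blob and join_blob are hypothetical names, not functions from tools/.

```python
import struct

NONCE_LEN = 12  # AES-GCM nonce size from the layout above
TAG_LEN = 16    # authentication tag appended to the ciphertext


def split_blob(blob: bytes):
    """Split a blob into (wrapped_key, nonce, ciphertext_with_tag)."""
    if len(blob) < 4:
        raise ValueError("blob too short for the length prefix")
    (key_len,) = struct.unpack(">I", blob[:4])  # 4-byte big-endian length N
    nonce_start = 4 + key_len
    body_start = nonce_start + NONCE_LEN
    if len(blob) < body_start + TAG_LEN:
        raise ValueError("blob is truncated")
    return blob[4:nonce_start], blob[nonce_start:body_start], blob[body_start:]


def join_blob(wrapped_key: bytes, nonce: bytes, ciphertext_with_tag: bytes) -> bytes:
    """Assemble the fields back into the single-blob layout."""
    if len(nonce) != NONCE_LEN:
        raise ValueError("nonce must be 12 bytes")
    return struct.pack(">I", len(wrapped_key)) + wrapped_key + nonce + ciphertext_with_tag
```

With a 2048-bit RSA key the wrapped AES key is 256 bytes, so the length prefix of a typical file would be 00 00 01 00.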

Run the retriever

Entity-wise retrieval

python Retriever/entity_wise.py \
    --dataset_path  {dataset_path} \
    --dataset_name  {dataset_name} \
    --corpus        {Wikipedia|PubMed} \
    --fname         {output_filename}

Entity-pair (pairwise) retrieval

python Retriever/pair_wise.py \
    --dataset_path  {dataset_path} \
    --dataset_name  {dataset_name} \
    --corpus        PubMed \
    --fname         {output_filename}

Merge PubMed and Wikipedia results

python Retriever/merge.py \
    --wiki_path    {wiki_path} \
    --pubmed_path  {pubmed_path} \
    --fname        {output_filename}

Fine-tuning

BERT-based GRAFT

python BERT-based/main.py \
    --task                  {task} \
    --document_path         {document_path} \
    --do_eval --eval_test \
    --model                 microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract \
    --do_lower_case \
    --add_new_tokens \
    --output_dir            {output_dir} \
    --num_docs              {num_docs} \
    --train_batch_size      {batch_size} \
    --eval_batch_size       {batch_size} \
    --learning_rate         {lr} \
    --num_train_epochs      {num_epoch} \
    --context_window        {0|100} \
    --max_seq_length        512 \
    --drop_out              {drop_out} \
    --file_dir              {Dataset_dir} \
    --dev_file              val.json \
    --test_file             test.json

Llama-based GRAFT

One script per dataset: llm_cdr.py, llm_biored.py, llm_chemprot.py. Example for ChemProt:

python Llama-based/llm_chemprot.py \
    --output_dir                    {output_dir} \
    --dataset_path                  Datasets/chemprot/chemprot_cls.py \
    --document_path                 {document_path} \
    --num_docs                      {num_docs} \
    --per_device_train_batch_size   {batch_size} \
    --gradient_accumulation_steps   {accumulation_steps} \
    --num_train_epochs              {num_epochs} \
    --learning_rate                 {lr} \
    --optim                         paged_adamw_8bit
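
When choosing values for the batch-size and accumulation flags, it can help to compute the resulting effective batch size. The sketch below assumes the script follows the usual Hugging Face Trainer convention, in which gradients are accumulated over several per-device batches before each optimizer step; the helper name and the example numbers are hypothetical.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


if __name__ == "__main__":
    # Hypothetical settings: batch size 4 per device, 8 accumulation steps, 1 GPU.
    print(effective_batch_size(4, 8))
```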

Run tools/decrypt_datasets.py first so the Datasets/... paths resolve to the plaintext files the loader scripts expect.

Citation

If you use this code, please cite the GRAFT paper.
