GRAFT-RE: Gated Retrieval-Augmented Fine-Tuning for Relation Extraction

This repository contains the code accompanying the paper "GRAFT: Gated Retrieval-Augmented Fine-Tuning for Relation Extraction". GRAFT augments biomedical relation extraction with externally retrieved context and uses a learned gating mechanism to balance the contributions of the input text and the retrieved snippets. The framework supports both encoder backbones (BiomedBERT) and causal LLMs (Llama-3.x-Instruct).

Repository layout

GRAFT-RE/
├── README.md
├── requirements.txt
├── Retriever/          # MedCPT-based retrieval (entity-wise, pairwise, merge)
├── BERT-based/         # BiomedBERT fine-tuning with GRAFT
├── Llama-based/        # Llama fine-tuning with GRAFT (one script per dataset)
├── Datasets/           # CDR, BioRED, ChemProt (encrypted, see below)
├── keys/
│   └── public_key.pem  # RSA-2048 public key, used for encryption
└── tools/
    ├── encrypt_datasets.py
    └── decrypt_datasets.py

Setup

pip install -r requirements.txt

The encryption tooling under tools/ additionally requires the cryptography package (pip install cryptography).

Datasets

All files under Datasets/ are stored encrypted (.enc suffix). Each file is encrypted independently with a hybrid scheme:

A fresh random 256-bit AES key per file.
File body encrypted with AES-256-GCM.
AES key encrypted with RSA-OAEP (SHA-256) using a 2048-bit public key.

The matching RSA private key is required to decrypt. Contact the authors to request access.

The public key (keys/public_key.pem) is bundled in this repository and is sufficient to encrypt additional files. The private key is not in this repository.
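
The hybrid scheme described above can be sketched with the cryptography package. This is a minimal illustration under stated assumptions, not the code in tools/encrypt_datasets.py: the helper name encrypt_blob is hypothetical, the use of SHA-256 for the MGF1 mask function and the absence of an OAEP label are assumptions, and the output layout follows the File format section of this README.

```python
import os
import struct

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def encrypt_blob(plaintext: bytes, public_key) -> bytes:
    """Hybrid-encrypt one file body (hypothetical helper, not the repo's code)."""
    # Fresh random 256-bit AES key and 12-byte nonce, used for this file only.
    aes_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)

    # AES-256-GCM; the returned bytes end with the 16-byte authentication tag.
    ciphertext = AESGCM(aes_key).encrypt(nonce, plaintext, None)

    # Wrap the AES key with RSA-OAEP (SHA-256).
    # Assumption: MGF1 also uses SHA-256 and no label is set.
    wrapped_key = public_key.encrypt(
        aes_key,
        padding.OAEP(
            mgf=padding.MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None,
        ),
    )

    # [4-byte big-endian length][wrapped key][12-byte nonce][ciphertext + tag]
    return struct.pack(">I", len(wrapped_key)) + wrapped_key + nonce + ciphertext
```

In this sketch, public_key would be obtained by passing the bytes of keys/public_key.pem to cryptography.hazmat.primitives.serialization.load_pem_public_key. Because only the public key is needed, anyone with the repository can encrypt, while decryption still requires the private key held by the authors.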

Decrypt before use

python tools/decrypt_datasets.py \
    --privkey /path/to/private_key.pem \
    --root Datasets

This writes the original files alongside the .enc files. Add --delete-encrypted to remove the .enc files after a successful decryption.

Re-encrypt (e.g., before redistribution)

python tools/encrypt_datasets.py \
    --pubkey keys/public_key.pem \
    --root Datasets \
    --delete-originals

File format

Each encrypted file is a single binary blob:

[4 bytes BE]  length N of the RSA-encrypted AES key
[N bytes]     RSA-OAEP(SHA-256) encryption of the 32-byte AES key
[12 bytes]    AES-GCM nonce
[...]         AES-256-GCM ciphertext (16-byte authentication tag appended)

The format is self-contained: each file carries its own wrapped AES key and nonce, so it can be decrypted independently of the others.
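
To make the layout concrete, the following sketch splits a .enc blob into its three fields using only the standard library. The names EncBlob and split_enc_blob are hypothetical and are not part of tools/decrypt_datasets.py; the sketch only parses the container and performs no decryption.

```python
import struct
from typing import NamedTuple


class EncBlob(NamedTuple):
    wrapped_key: bytes  # RSA-OAEP encryption of the AES key (N bytes)
    nonce: bytes        # 12-byte AES-GCM nonce
    ciphertext: bytes   # AES-256-GCM ciphertext with the 16-byte tag appended


def split_enc_blob(blob: bytes) -> EncBlob:
    """Split a .enc blob into its fields (hypothetical helper)."""
    if len(blob) < 4:
        raise ValueError("blob too short for the 4-byte length prefix")

    # First 4 bytes: big-endian length N of the wrapped AES key.
    (key_len,) = struct.unpack(">I", blob[:4])
    key_end = 4 + key_len
    nonce_end = key_end + 12

    # The ciphertext must contain at least the 16-byte GCM tag.
    if len(blob) < nonce_end + 16:
        raise ValueError("blob truncated: missing key, nonce, or GCM tag")

    return EncBlob(blob[4:key_end], blob[key_end:nonce_end], blob[nonce_end:])
```

With a 2048-bit RSA key, N is 256, so the fixed overhead per file is 4 + 256 + 12 + 16 = 288 bytes on top of the plaintext length. The wrapped key would then be unwrapped with the RSA private key, and the nonce and ciphertext passed to AES-256-GCM.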

Run the retriever

Entity-wise retrieval

python Retriever/entity_wise.py \
    --dataset_path  {dataset_path} \
    --dataset_name  {dataset_name} \
    --corpus        {Wikipedia|PubMed} \
    --fname         {output_filename}

Entity-pair (pairwise) retrieval

python Retriever/pair_wise.py \
    --dataset_path  {dataset_path} \
    --dataset_name  {dataset_name} \
    --corpus        PubMed \
    --fname         {output_filename}

Merge PubMed and Wikipedia results

python Retriever/merge.py \
    --wiki_path    {wiki_path} \
    --pubmed_path  {pubmed_path} \
    --fname        {output_filename}

Fine-tuning

BERT-based GRAFT

python BERT-based/main.py \
    --task                  {task} \
    --document_path         {document_path} \
    --do_eval --eval_test \
    --model                 microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract \
    --do_lower_case \
    --add_new_tokens \
    --output_dir            {output_dir} \
    --num_docs              {num_docs} \
    --train_batch_size      {batch_size} \
    --eval_batch_size       {batch_size} \
    --learning_rate         {lr} \
    --num_train_epochs      {num_epoch} \
    --context_window        {0|100} \
    --max_seq_length        512 \
    --drop_out              {drop_out} \
    --file_dir              {dataset_dir} \
    --dev_file              val.json \
    --test_file             test.json

Llama-based GRAFT

One script per dataset: llm_cdr.py, llm_biored.py, llm_chemprot.py. Example for ChemProt:

python Llama-based/llm_chemprot.py \
    --output_dir                    {output_dir} \
    --dataset_path                  Datasets/chemprot/chemprot_cls.py \
    --document_path                 {document_path} \
    --num_docs                      {num_docs} \
    --per_device_train_batch_size   {batch_size} \
    --gradient_accumulation_steps   {accumulation_steps} \
    --num_train_epochs              {num_epochs} \
    --learning_rate                 {lr} \
    --optim                         paged_adamw_8bit

Run tools/decrypt_datasets.py first so the Datasets/... paths resolve to the plaintext files the loader scripts expect.

Citation

If you use this code, please cite the GRAFT paper.
