
Issues on reproducing the paper results #11

Description

@jacqueline-weng

I was able to run the code for all three stages, each of which requires a new virtual environment. The trouble started when training the safety neurons.
I ran Llama3-8B-Instruct with the following requirements:
transformers==4.38.2
peft==0.10.0
trl==0.9.6
accelerate==0.43.2
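
A quick way to confirm that an environment matches these pins is sketched below. It only uses the standard library; the helper names `installed_version` and `find_mismatches` are made up for illustration, and the pins are copied verbatim from the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied verbatim from the list above.
PINNED = {
    "transformers": "4.38.2",
    "peft": "0.10.0",
    "trl": "0.9.6",
    "accelerate": "0.43.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found}")
```

Running the script inside the training environment prints one line per package whose installed version differs from the pin, and nothing if everything matches.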
In addition to replacing /conda/env/path/site-packages/transformers/trainer.py with the transformers/trainer.py provided in this repo, you also need to append the definition of activate_neurons to /conda/env/path/site-packages/transformers/training_args.py at line 2786.

Image

After training, I compared the parameters and found that the saved checkpoint was identical to the original model. This can be fixed by changing the saving logic as follows:

from peft import PeftModel

# Only the main process should write the checkpoint.
is_main_process = True
if hasattr(trainer, "accelerator"):
    is_main_process = trainer.accelerator.is_main_process

if is_main_process:
    # Unwrap any distributed wrapper so the trained weights are the ones saved.
    try:
        model_to_save = trainer.accelerator.unwrap_model(trainer.model)
    except Exception:
        model_to_save = trainer.model

    if isinstance(model_to_save, PeftModel):
        model_to_save.save_pretrained(output_dir)
    else:
        trainer.save_model(output_dir)
    # The tokenizer is saved alongside the model in both cases.
    tokenizer.save_pretrained(output_dir)
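
For reference, the kind of check that shows whether a checkpoint differs from the base model can be sketched as below. This is a minimal illustration on toy data, not the script used for this issue: the function name `changed_parameters`, the parameter names, and the values are all hypothetical, and parameters are represented as flat lists of floats. With real checkpoints one would more likely compare the tensors in the two `state_dict()` mappings directly, for example with `torch.equal`.

```python
def changed_parameters(base_state, tuned_state, atol=0.0):
    """Return the sorted names of parameters whose values differ by more than atol.

    Both arguments map parameter names to flat sequences of floats.
    """
    changed = []
    for name, base_values in base_state.items():
        tuned_values = tuned_state.get(name)
        if tuned_values is None or len(tuned_values) != len(base_values):
            changed.append(name)  # a missing or resized parameter counts as changed
            continue
        if any(abs(b - t) > atol for b, t in zip(base_values, tuned_values)):
            changed.append(name)
    return sorted(changed)


# Toy stand-ins for two state dicts (hypothetical names and values).
base = {"layer0.mlp.weight": [0.10, -0.20, 0.30], "layer0.mlp.bias": [0.0, 0.0]}
tuned = {"layer0.mlp.weight": [0.10, -0.25, 0.30], "layer0.mlp.bias": [0.0, 0.0]}

print(changed_parameters(base, tuned))
```

An empty list means the saved checkpoint is identical to the base model, which is the symptom described above.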

However, I was not able to reproduce the results in the paper. Here are the training logs for the safety-neuron-tuned version and the all-parameter-tuned version:
Safety-neuron-tuned version:
Image
All-parameter-tuned version:

Image

SFT data: 50 samples randomly selected from the training data in the Circuit Breakers repo (https://arxiv.org/pdf/2406.04313):

circuit_breakers_train_sample50.json
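
A reproducible way to draw such a subset could look like the sketch below. The seed, the function names, and the assumption that the source file is a top-level JSON list are all illustrative and are not taken from this issue or from the Circuit Breakers repo.

```python
import json
import random


def sample_records(records, k=50, seed=0):
    """Return k records drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(records, k)


def write_sample(src_path, dst_path, k=50, seed=0):
    """Read a JSON list from src_path and write a k-record sample to dst_path."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)  # assumed: the file holds a top-level JSON list
    subset = sample_records(records, k, seed)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(subset, f, ensure_ascii=False, indent=2)
    return subset
```

Fixing the seed makes the 50-sample subset identical across runs, which helps when comparing results between environments.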

I compared math capability on gsm8k-250 (English) and safety on MultiJail-EN.
The safe/unsafe/invalid tags are assigned using the prompt provided in the MultiJail paper (https://openreview.net/pdf?id=vESNKdEMGp).
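
For clarity on how such numbers are usually summarized, here is a minimal sketch that turns per-sample tags into rates and per-sample answers into exact-match accuracy. The function names are hypothetical and the inputs in the usage note are toy values, not results from this issue.

```python
from collections import Counter

LABELS = ("safe", "unsafe", "invalid")


def label_rates(labels):
    """Return the fraction of samples carrying each of the three tags."""
    if not labels:
        raise ValueError("no labels given")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    return {label: counts[label] / total for label in LABELS}


def accuracy(predictions, references):
    """Return the exact-match accuracy between two equally long answer lists."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

For example, `label_rates(["safe", "safe", "unsafe", "invalid"])` gives 0.5 for safe and 0.25 each for unsafe and invalid.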

The result was rather weird:

Image

I noticed that the paper does not include a comparison with all-parameter tuning on the same SFT data. I expected all-parameter SFT to perform worse on math tasks and to be equally safe or less safe than the safety-neuron-tuned version.

Could you provide more information on the running environment so that I can replicate the experimental results? Alternatively, could you open-source your tuned models for reference?

Thanks
