@zhaoyiran924, thanks for sharing your great work! I have a question about the implementation of train_neuron.py and its data format.
Current Implementation
According to the paper, I understood that the neurons are trained on raw Wikipedia passages with a next-token prediction objective,
while the code currently seems to use a question-answer format for training:
    def formatting_prompts_func(example):
        output_texts = []
        for i in range(len(example['original_question'])):
            text = f"{example['original_question'][i]}. {example['response'][i]}"
            output_texts.append(text)
        return output_texts
Questions
- Could you please confirm the format in which the wiki data is passed,
  or share an example of how the Wikipedia documents are preprocessed to get the 'original_question' and 'response' fields?
- Would it make more sense to use a simpler format for plain text documents, like:

    def formatting_prompts_func(example):
        return example['text']
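
To make the difference concrete, here is a minimal sketch comparing the two formats on toy batches. It assumes the function receives a batched dict of lists (as the loop in the current code suggests) and that plain documents live in a 'text' column. The names format_qa and format_plain and the sample values are made up for illustration and are not from the repository.

```python
def format_qa(example):
    """Question-answer formatting, mirroring the current function."""
    return [
        f"{q}. {r}"
        for q, r in zip(example["original_question"], example["response"])
    ]


def format_plain(example):
    """Plain-text formatting: pass each passage through unchanged."""
    return list(example["text"])


# Toy batches with invented values, for illustration only.
qa_batch = {
    "original_question": ["What is the capital of France"],
    "response": ["Paris is the capital of France."],
}
text_batch = {"text": ["Paris is the capital and largest city of France."]}

print(format_qa(qa_batch))
print(format_plain(text_batch))
```

With the first format, every training string starts with a question followed by ". " and the response. With the second, the model would see the passage exactly as stored.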
Thanks for your help.