Dear authors,
Thank you for your insightful work! I have a question about the ablation experiments in the paper. Table 4 presents two ablation experiments, distinguished by whether inference was performed during training. However, why are all the benchmark scores in the last row of both experiments the same?
Does that mean that, for SFT+RL, it makes no difference whether or not we use reasoning in the training stages?
Any response would be appreciated!