Hi, thank you for releasing this work and for making the training code and configuration files public.
I am trying to reproduce the reported results using the official YAML configuration files provided in this repository, but I have not been able to get close to the numbers reported in the paper / README. The gap remains significant even after carefully following the training setup.
Here is what I have done so far:
- Used the publicly released YAML configs without modification
- Followed the documented training procedure
- Trained for the full number of steps / epochs as specified
- Verified that there are no obvious implementation or environment issues
Despite this, the final performance is still far below the reported numbers, and the difference appears too large to be explained by random seed variance alone.
Given this, I would like to ask:
- Were the reported results obtained strictly with the current public YAML configs, or were there additional (possibly unpublished) changes or hyperparameter tweaks?
- Would it be possible to release pretrained / finetuned checkpoints corresponding to the reported results, for verification and comparison?
- Are there any critical details (e.g., specific random seeds, initialization, data preprocessing, or training tricks) that are not yet documented?
Releasing checkpoints would greatly help the community verify correctness and better understand the intended training setup, especially given the current reproduction gap.
Thanks again for your work and for any clarification you can provide!
Best regards