Hi, thanks for your great work and for releasing this project.
I am trying to reproduce the experimental results reported in the paper, especially the baseline setting with a 30% retention rate. I have a few questions regarding the exact implementation details.
The paper reports results for the 30% retention baseline on ScreenSpot-Pro, ScreenSpot-v2, and OSWorld-G. In my reproduction, the OSWorld-G results are generally consistent with the reported numbers, but I see a large gap on ScreenSpot-Pro and ScreenSpot-v2.
Could you please clarify the following details?
- For the 30% retention baseline, what is the exact implementation used?
  - How are the retained tokens/regions selected?
  - Is the 30% retention applied before or after any filtering, resizing, or preprocessing?
  - Are there any dataset-specific settings for ScreenSpot-Pro, ScreenSpot-v2, or OSWorld-G?
- What is the exact prompt template used for this baseline?
  - Is it the same across ScreenSpot-Pro, ScreenSpot-v2, and OSWorld-G?
  - Are there any additional system prompts, grounding instructions, or formatting constraints?
- Would it be possible to release the corresponding baseline code and prompt configuration?
  - This would be very helpful for reproducing the reported results and ensuring a fair comparison.
Since OSWorld-G roughly matches the reported results while ScreenSpot-Pro and ScreenSpot-v2 do not, I suspect a difference in the preprocessing, prompt format, or token-retention implementation.
Thanks again for your work. I would really appreciate any clarification or released configuration files that could help reproduce the 30% retention baseline.
