Hi, great work! I have a few questions I would like to ask.
From your paper, my understanding is that you first construct a GCoT dataset based on the VSI benchmark, and then train GS-Reasoner on this GCoT data. During inference, GS-Reasoner grounds the relevant objects and then performs subsequent reasoning.
I would like to check whether the same GCoT process is also applied to other 3D reasoning tasks such as SQA3D or Scan2Cap. The paper says "We attribute these gains to explicitly predicting coordinates for 3D visual grounding, which forces the model to better capture geometric and positional cues, thereby improving dense captioning performance." This made me wonder: for tasks like Scan2Cap, did you construct a GCoT-style dataset (e.g., GCoT-Scan2Cap) to train the model on grounding-augmented reasoning? Or does GS-Reasoner automatically output and leverage coordinate predictions for these tasks even without building a separate GCoT dataset beyond VSI?
Thank you very much, and I look forward to your reply!