Visual Grounding as intermediate CoT besides VSI #7

Hi, great work! I have a few questions I would like to ask.

From your paper, my understanding is that you first construct a GCoT dataset based on the VSI benchmark, and then train GS-Reasoner on this GCoT data. During inference, GS-Reasoner grounds the relevant objects and then performs subsequent reasoning.

I would like to confirm whether the same GCoT process is also applied to other 3D reasoning tasks such as SQA3D or Scan2Cap. The paper says: "We attribute these gains to explicitly predicting coordinates for 3D visual grounding, which forces the model to better capture geometric and positional cues, thereby improving dense captioning performance." This made me wonder: for tasks like Scan2Cap, did you construct a GCoT-style dataset (e.g., GCoT-Scan2Cap) to train the model on grounding-augmented reasoning? Or does GS-Reasoner output and use coordinate predictions for these tasks on its own, without a separate GCoT dataset beyond VSI?

Thank you very much, and I look forward to your reply!
