Visual Grounding as intermediate CoT besides VSI #7

Description

@Jimntu

Hi, great work! I have a question about how GCoT is used beyond VSI.

From your paper, my understanding is that you first construct a GCoT dataset based on the VSI benchmark, and then train GS-Reasoner on this GCoT data. During inference, GS-Reasoner grounds the relevant objects and then performs subsequent reasoning.

I would like to check whether the same GCoT process is also applied to other 3D reasoning tasks such as SQA3D or Scan2Cap. The paper says "We attribute these gains to explicitly predicting coordinates for 3D visual grounding, which forces the model to better capture geometric and positional cues, thereby improving dense captioning performance." This made me wonder: for tasks like Scan2Cap, did you construct a GCoT-style dataset (e.g., GCoT-Scan2Cap) to train the model on grounding-augmented reasoning? Or does GS-Reasoner automatically output and leverage coordinate predictions for these tasks even without building a separate GCoT dataset beyond VSI?

Thank you very much, and I look forward to your reply!
