Cannot understand choice of mm_hidden_size 1024

I am trying to understand how the spatial and temporal features fit into the projection layer. According to the config file on Hugging Face, `mm_hidden_size` is set to 1024.

Hugging Face link: https://huggingface.co/mmaaz60/LLaVA-7B-Lightening-v1-1/blob/main/config.json

![image](https://github.com/user-attachments/assets/30a57ee7-a35d-4980-9cab-0f071d39cce7)


From what I understand, 100 frames are sampled per video and the CLIP encoder outputs a 1024-dimensional feature for each patch. Averaging over time gives a tensor of shape (number of patches, 1024), and averaging over the patches of each frame gives a tensor of shape (100, 1024), with one row per frame. Does this mean the input to the projection layer has shape (number of patches + 100, 1024)?

I don't understand how a projection layer with an input size of 1024 accepts an input of this shape.
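To make the shapes in my question concrete, here is a minimal NumPy sketch of the pooling as I understand it. It is not the repository's code. The patch count (256), the output size (4096), the concatenation order, and the helper names `pool_spatio_temporal` and `project` are all my assumptions for illustration. One thing I am sure of is that a linear layer acts only on the last axis, so in this sketch the 1024 has to match the feature dimension, not the number of tokens.

```python
import numpy as np

# Sizes used for illustration only.
NUM_FRAMES = 100        # frames sampled per video, as described in the question
NUM_PATCHES = 256       # ASSUMPTION: patch tokens per frame (e.g. a 16 x 16 grid)
MM_HIDDEN_SIZE = 1024   # mm_hidden_size from config.json
LLM_HIDDEN_SIZE = 4096  # ASSUMPTION: hidden size expected by the language model


def pool_spatio_temporal(frame_features):
    """Hypothetical helper: frame_features has shape (frames, patches, dim)."""
    temporal = frame_features.mean(axis=0)  # average over frames  -> (patches, dim)
    spatial = frame_features.mean(axis=1)   # average over patches -> (frames, dim)
    # ASSUMPTION: temporal tokens first, then spatial tokens.
    return np.concatenate([temporal, spatial], axis=0)  # (patches + frames, dim)


def project(tokens, weight, bias):
    """Hypothetical linear projection applied to the last axis of each token."""
    return tokens @ weight + bias


rng = np.random.default_rng(0)
# Random stand-in for per-patch encoder features; real features would come from CLIP.
features = rng.standard_normal(
    (NUM_FRAMES, NUM_PATCHES, MM_HIDDEN_SIZE), dtype=np.float32
)
weight = rng.standard_normal((MM_HIDDEN_SIZE, LLM_HIDDEN_SIZE), dtype=np.float32)
bias = np.zeros(LLM_HIDDEN_SIZE, dtype=np.float32)

video_tokens = pool_spatio_temporal(features)
projected = project(video_tokens, weight, bias)

print(video_tokens.shape)  # (NUM_PATCHES + NUM_FRAMES, MM_HIDDEN_SIZE)
print(projected.shape)     # (NUM_PATCHES + NUM_FRAMES, LLM_HIDDEN_SIZE)
```

If this reading is right, the number of tokens (patches + frames) only changes the sequence length, and the projection layer would see each 1024-dimensional token independently. I would appreciate confirmation that this matches the actual implementation.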


