I'm trying to understand how the spatial and temporal features fit into the projection layer. According to the config file on Hugging Face, mm_hidden_size is 1024.
Hugging Face link: https://huggingface.co/mmaaz60/LLaVA-7B-Lightening-v1-1/blob/main/config.json

From what I understand, 100 frames are sampled from each video, and the CLIP encoder outputs a 1024-dimensional vector for each patch. Averaging over time (temporal mean) gives features of shape (number of patches, 1024), and averaging over the patches of each frame (spatial mean) gives features of shape (100, 1024), one row per frame. Does this mean the input to the projection layer has shape (number of patches + 100, 1024)?
I don't understand how a projection layer with an input size of 1024 accepts an input of this shape.
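
To make the shapes concrete, here is a minimal NumPy sketch of my understanding. It is not taken from the repository. It assumes the projection is a plain linear layer, which would act only on the last dimension, so each of the (number of patches + 100) rows would be projected independently. The patch count (16) and the output size (4096) are hypothetical values chosen for illustration, and the random weights stand in for the real ones.

```python
import numpy as np

num_frames = 100    # frames sampled per video (from the question)
num_patches = 16    # hypothetical patch count, kept small for the demo
clip_dim = 1024     # mm_hidden_size from the config
llm_dim = 4096      # assumed output size of the projection layer

rng = np.random.default_rng(0)

# Stand-in for per-frame CLIP patch features: (frames, patches, clip_dim).
frame_features = rng.standard_normal(
    (num_frames, num_patches, clip_dim)
).astype(np.float32)

# Temporal mean: average over frames -> (num_patches, clip_dim).
temporal = frame_features.mean(axis=0)

# Spatial mean: average over patches -> (num_frames, clip_dim).
spatial = frame_features.mean(axis=1)

# Stack both sets of rows -> (num_patches + num_frames, clip_dim).
video_tokens = np.concatenate([temporal, spatial], axis=0)

# A linear layer only constrains the last dimension (clip_dim);
# the number of rows can be anything.
weight = rng.standard_normal((clip_dim, llm_dim)).astype(np.float32)
bias = np.zeros(llm_dim, dtype=np.float32)
projected = video_tokens @ weight + bias

print(temporal.shape, spatial.shape)
print(video_tokens.shape, projected.shape)
```

If this reading is right, the 1024 only has to match the last dimension of the input, and the row count (number of patches + 100) simply becomes the number of tokens after projection.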