I am using a pretrained backbone model to extract features from images, and I want to use these features to detect unknown classes in an Open-Set Recognition setting. The backbone outputs a tensor of shape (batch size, number of patches, embedding dimension). Do you have any ideas on how to process this tensor to make it more suitable for unknown class detection? One idea I have is to average the embedding vectors over all patches and treat the result as a single vector per image. When I visualized the distribution of these per-image vectors, it seemed to align closely with human perception. If you have any other good ideas, I would appreciate your suggestions.
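The averaging idea can be sketched as follows. This is only a minimal illustration, assuming the backbone output is available as a NumPy array; the function name `mean_pool_patches` and the shapes used (4 images, 196 patches, 768 dimensions) are hypothetical placeholders, and the random array merely stands in for real backbone features.

```python
import numpy as np


def mean_pool_patches(features: np.ndarray) -> np.ndarray:
    """Average patch embeddings into one vector per image.

    features: array of shape (batch_size, num_patches, embed_dim)
    returns:  array of shape (batch_size, embed_dim)
    """
    if features.ndim != 3:
        raise ValueError(f"expected a 3-D array, got shape {features.shape}")
    # Axis 1 is the patch axis, so the mean collapses all patches of an image.
    return features.mean(axis=1)


# Random stand-in for the backbone output (shapes are made up for illustration).
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 196, 768)).astype(np.float32)

image_vectors = mean_pool_patches(features)
print(image_vectors.shape)  # (4, 768)
```

With a PyTorch tensor, the equivalent step would be `features.mean(dim=1)`.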