Hi and thanks a lot for releasing ProTrek and the pretrained FAISS indices!
I am currently using the following FAISS index for retrieval:
faiss_index/SwissProt/ProTrek_650M_UniRef50/sequence/sequence.index
The README says that sequence embeddings for SwissProt are stored in this FAISS index. However, the directory name contains UniRef50, so I am unsure what the index actually contains.
Could you please help clarify the following points?
- Does sequence.index contain embeddings for SwissProt entries only, for UniRef50 cluster representatives only, or for some combination of both?
- For the corresponding ids.tsv in the same folder:
  - Do these IDs correspond to SwissProt accessions, UniRef50 cluster IDs, or something else?
  - Is there any additional mapping file I should use to interpret the IDs (e.g., mapping UniRef50 clusters back to SwissProt)?
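For context, here is a minimal sketch of how the IDs could be sorted by format on the user side. It assumes that ids.tsv holds one identifier in the first tab-separated column of each line, which is a guess about the file layout and not something the README confirms. The function names are hypothetical; the accession pattern is the one UniProt publishes for UniProtKB accessions.

```python
import re
from collections import Counter

# UniProtKB accession format (6 or 10 characters), as published by UniProt.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def classify_id(identifier):
    """Label one identifier by its format (hypothetical helper)."""
    identifier = identifier.strip()
    if identifier.startswith("UniRef50_"):
        return "uniref50_cluster"
    if UNIPROT_ACCESSION.match(identifier):
        return "uniprot_accession"
    return "other"


def summarize_ids(lines):
    """Count ID formats, assuming the ID is the first tab-separated column."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        first_column = line.rstrip("\n").split("\t")[0]
        counts[classify_id(first_column)] += 1
    return counts


# Example usage (path is relative to the index folder):
# with open("ids.tsv") as handle:
#     print(summarize_ids(handle))
```

Note that a plain UniProtKB accession does not by itself show whether an entry is reviewed (SwissProt) or unreviewed (TrEMBL), so this check only separates UniRef50 cluster IDs from accessions.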
Thanks for your help!