Hey!
Great job on Arena! In an era of saturated benchmarks, I think having an actual large-scale, vibes-based evaluation is very important.
I was wondering, would you entertain adding models that are only available for one of the three categories? I think retrieval is by far the most popular use for embeddings nowadays, so I could see it making sense, but I can also understand if not.
If so, I'd be happy to contribute a ColBERT implementation. We're working on potential English proof-of-concepts with the ColBERTv2.5 recipe, which I think could be very interesting to try out in this benchmark!
With compression and similar tricks, the indexes should also be within roughly the same order of magnitude as those for 1024-dim vectors, so it shouldn't be too much of a storage nightmare.
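
For a rough sense of scale, here's a back-of-the-envelope sketch in Python. The numbers are assumptions rather than measurements: 128-dim token vectors with ColBERTv2-style residual compression (2 bits per dimension plus a 4-byte centroid id), a hypothetical average of 150 tokens per passage, and float32 storage for the 1024-dim dense vectors. The function names are made up for illustration.

```python
# Back-of-the-envelope index size per passage. All defaults are assumptions.


def dense_bytes_per_passage(dim: int = 1024, bytes_per_value: int = 4) -> int:
    """Single-vector model: one float32 vector per passage."""
    return dim * bytes_per_value


def colbert_bytes_per_passage(
    tokens: int = 150,           # assumed average passage length in tokens
    dim: int = 128,              # assumed per-token embedding size
    bits_per_dim: int = 2,       # assumed residual quantization level
    centroid_id_bytes: int = 4,  # assumed size of the centroid index
) -> int:
    """Late-interaction model: one compressed vector per token."""
    residual_bytes = dim * bits_per_dim // 8
    return tokens * (residual_bytes + centroid_id_bytes)


dense = dense_bytes_per_passage()
late = colbert_bytes_per_passage()
print(f"dense: {dense} B, ColBERT-style: {late} B, ratio: {late / dense:.2f}")
```

Under those assumptions it comes out to about 5.4 KB per passage versus about 4 KB for the dense vector. Longer passages or more bits per dimension would push the ratio up, so the real figure depends on the corpus and the settings used.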