Add LMEB Evaluation Results of 13 Embedding Models #583

Merged: Samoed merged 13 commits into embeddings-benchmark:main from ItsukiFujii:main on Jul 2, 2026
@ItsukiFujii

Contributor

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset including training splits. If I have, I have disclosed it clearly.

The 13 models are:

  1. BAAI__bge-large-en-v1.5
  2. BAAI__bge-m3
  3. BAAI__bge-multilingual-gemma2
  4. HIT-TMG__KaLM-embedding-multilingual-mini-instruct-v1.5
  5. intfloat__e5-mistral-7b-instruct
  6. intfloat__multilingual-e5-large-instruct
  7. jinaai__jina-embeddings-v5-text-small-retrieval
  8. KaLM-Embedding__KaLM-embedding-multilingual-mini-instruct-v2.5
  9. nvidia__NV-Embed-v2
  10. Qwen__Qwen3-Embedding-0.6B
  11. Qwen__Qwen3-Embedding-4B
  12. Qwen__Qwen3-Embedding-8B
  13. tencent__KaLM-Embedding-Gemma3-12B-2511

@github-actions

github-actions Bot commented Jul 2, 2026


Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: BAAI/bge-large-en-v1.5, BAAI/bge-m3, BAAI/bge-multilingual-gemma2, HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5, KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5, Qwen/Qwen3-Embedding-0.6B, Qwen/Qwen3-Embedding-4B, Qwen/Qwen3-Embedding-8B, intfloat/e5-mistral-7b-instruct, intfloat/multilingual-e5-large-instruct, jinaai/jina-embeddings-v5-text-small, nvidia/NV-Embed-v2, tencent/KaLM-Embedding-Gemma3-12B-2511

Results for BAAI/bge-large-en-v1.5

task_name BAAI/bge-large-en-v1.5 Max result Model with max result In Training Data
ConvoMem .338 False
CovidQA .516 False
DeepPlanning .437 False
EPBench .659 False
ESGReports .324 False
Gorilla .291 False
KnowMeBench .262 False
LMEBMLDR .688 False
LMEB_SciFact .693 False
LoCoMo .253 False
LongMemEval .574 False
LooGLE .380 False
MemBench .543 False
MemGovern .725 False
NovelQA .110 False
PeerQA .182 False
ProceduralMemBench .502 False
QASPER .372 False
REALTALK .341 False
ReMe .567 False
TMD .149 False
ToolBench .369 False
Average .421 nan -

Results for BAAI/bge-m3

task_name BAAI/bge-m3 Max result Model with max result In Training Data
ConvoMem .552 False
CovidQA .661 False
DeepPlanning .528 False
EPBench .829 False
ESGReports .357 False
Gorilla .290 False
KnowMeBench .420 False
LMEBMLDR .773 False
LMEB_SciFact .712 False
LoCoMo .418 False
LongMemEval .647 False
LooGLE .525 False
MemBench .592 False
MemGovern .805 False
NovelQA .270 False
PeerQA .285 False
ProceduralMemBench .482 False
QASPER .438 False
REALTALK .418 False
ReMe .544 False
TMD .244 False
ToolBench .414 False
Average .509 nan -

Results for BAAI/bge-multilingual-gemma2

task_name BAAI/bge-multilingual-gemma2 Max result Model with max result In Training Data
ConvoMem .668 False
CovidQA .799 False
DeepPlanning .538 False
EPBench .883 False
ESGReports .468 False
Gorilla .454 False
KnowMeBench .536 False
LMEBMLDR .854 False
LMEB_SciFact .839 False
LoCoMo .560 False
LongMemEval .817 False
LooGLE .645 False
MemBench .725 False
MemGovern .888 False
NovelQA .376 False
PeerQA .327 False
ProceduralMemBench .524 False
QASPER .528 False
REALTALK .482 False
ReMe .648 False
TMD .307 False
ToolBench .626 False
Average .613 nan -

Results for HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5

task_name HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5 Max result Model with max result In Training Data
ConvoMem .660 False
CovidQA .753 False
DeepPlanning .553 False
EPBench .798 False
ESGReports .436 False
Gorilla .348 False
KnowMeBench .409 False
LMEBMLDR .747 False
LMEB_SciFact .767 False
LoCoMo .400 False
LongMemEval .770 False
LooGLE .571 False
MemBench .674 False
MemGovern .870 False
NovelQA .272 False
PeerQA .322 False
ProceduralMemBench .468 False
QASPER .518 False
REALTALK .393 False
ReMe .633 False
TMD .206 False
ToolBench .575 False
Average .552 nan -

Results for KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5

task_name KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5 Max result Model with max result In Training Data
ConvoMem .628 False
CovidQA .739 False
DeepPlanning .512 False
EPBench .852 False
ESGReports .423 False
Gorilla .333 False
KnowMeBench .400 False
LMEBMLDR .787 False
LMEB_SciFact .817 False
LoCoMo .418 False
LongMemEval .750 False
LooGLE .577 False
MemBench .695 False
MemGovern .852 False
NovelQA .273 False
PeerQA .303 False
ProceduralMemBench .519 False
QASPER .461 False
REALTALK .386 False
ReMe .643 False
TMD .168 False
ToolBench .549 False
Average .549 nan -

Results for Qwen/Qwen3-Embedding-0.6B

task_name Qwen/Qwen3-Embedding-0.6B Max result Model with max result In Training Data
ConvoMem .631 False
CovidQA .741 False
DeepPlanning .534 False
EPBench .822 False
ESGReports .368 False
Gorilla .422 False
KnowMeBench .368 False
LMEBMLDR .763 False
LMEB_SciFact .801 False
LoCoMo .343 False
LongMemEval .731 False
LooGLE .557 False
MemBench .705 False
MemGovern .880 False
NovelQA .225 False
PeerQA .270 False
ProceduralMemBench .542 False
QASPER .473 False
REALTALK .375 False
ReMe .645 False
TMD .277 False
ToolBench .534 False
Average .546 nan -

Results for Qwen/Qwen3-Embedding-4B

task_name Qwen/Qwen3-Embedding-4B Max result Model with max result In Training Data
ConvoMem .547 False
CovidQA .768 False
DeepPlanning .539 False
EPBench .736 False
ESGReports .404 False
Gorilla .428 False
KnowMeBench .400 False
LMEBMLDR .769 False
LMEB_SciFact .800 False
LoCoMo .301 False
LongMemEval .502 False
LooGLE .585 False
MemBench .661 False
MemGovern .897 False
NovelQA .252 False
PeerQA .270 False
ProceduralMemBench .544 False
QASPER .499 False
REALTALK .348 False
ReMe .672 False
TMD .291 False
ToolBench .516 False
Average .533 nan -

Results for Qwen/Qwen3-Embedding-8B

task_name Qwen/Qwen3-Embedding-8B Max result Model with max result In Training Data
ConvoMem .556 False
CovidQA .791 False
DeepPlanning .526 False
EPBench .797 False
ESGReports .429 False
Gorilla .452 False
KnowMeBench .419 False
LMEBMLDR .776 False
LMEB_SciFact .787 False
LoCoMo .360 False
LongMemEval .777 False
LooGLE .594 False
MemBench .711 False
MemGovern .895 False
NovelQA .278 False
PeerQA .297 False
ProceduralMemBench .522 False
QASPER .472 False
REALTALK .407 False
ReMe .682 False
TMD .290 False
ToolBench .489 False
Average .559 nan -

Results for intfloat/e5-mistral-7b-instruct

task_name intfloat/e5-mistral-7b-instruct Max result Model with max result In Training Data
ConvoMem .686 False
CovidQA .611 False
DeepPlanning .401 False
EPBench .742 False
ESGReports .484 False
Gorilla .432 False
KnowMeBench .467 False
LMEBMLDR .711 False
LMEB_SciFact .818 False
LoCoMo .468 False
LongMemEval .782 False
LooGLE .519 False
MemBench .690 False
MemGovern .870 False
NovelQA .267 False
PeerQA .318 False
ProceduralMemBench .514 False
QASPER .509 False
REALTALK .429 False
ReMe .616 False
TMD .264 False
ToolBench .550 False
Average .552 nan -

Results for intfloat/multilingual-e5-large-instruct

task_name intfloat/multilingual-e5-large-instruct Max result Model with max result In Training Data
ConvoMem .684 False
CovidQA .691 False
DeepPlanning .485 False
EPBench .754 False
ESGReports .350 False
Gorilla .353 False
KnowMeBench .435 False
LMEBMLDR .813 False
LMEB_SciFact .791 False
LoCoMo .498 False
LongMemEval .697 False
LooGLE .536 False
MemBench .672 False
MemGovern .846 False
NovelQA .286 False
PeerQA .328 False
ProceduralMemBench .519 False
QASPER .514 False
REALTALK .448 False
ReMe .627 False
TMD .269 False
ToolBench .492 False
Average .549 nan -

Results for jinaai/jina-embeddings-v5-text-small

task_name jinaai/jina-embeddings-v5-text-small Max result Model with max result In Training Data
ConvoMem .604 False
CovidQA .729 False
DeepPlanning .484 False
EPBench .825 False
ESGReports .378 False
Gorilla .345 False
KnowMeBench .408 False
LMEBMLDR .765 False
LMEB_SciFact .780 False
LoCoMo .393 False
LongMemEval .747 False
LooGLE .517 False
MemBench .727 False
MemGovern .866 False
NovelQA .233 False
PeerQA .306 False
ProceduralMemBench .491 False
QASPER .510 False
REALTALK .369 False
ReMe .650 False
TMD .219 False
ToolBench .582 False
Average .542 nan -

Results for nvidia/NV-Embed-v2

task_name nvidia/NV-Embed-v2 Max result Model with max result In Training Data
ConvoMem .710 False
CovidQA .772 False
DeepPlanning .492 False
EPBench .806 False
ESGReports .537 False
Gorilla .430 False
KnowMeBench .565 False
LMEBMLDR .855 False
LMEB_SciFact .818 False
LoCoMo .512 False
LongMemEval .758 False
LooGLE .657 False
MemBench .708 False
MemGovern .924 False
NovelQA .405 False
PeerQA .340 False
ProceduralMemBench .523 False
QASPER .572 False
REALTALK .481 False
ReMe .603 False
TMD .191 False
ToolBench .542 False
Average .600 nan -

Results for tencent/KaLM-Embedding-Gemma3-12B-2511

task_name tencent/KaLM-Embedding-Gemma3-12B-2511 Max result Model with max result In Training Data
ConvoMem .641 False
CovidQA .781 False
DeepPlanning .550 False
EPBench .895 False
ESGReports .483 False
Gorilla .500 False
KnowMeBench .521 False
LMEBMLDR .764 False
LMEB_SciFact .801 False
LoCoMo .426 False
LongMemEval .810 False
LooGLE .581 False
MemBench .738 False
MemGovern .913 False
NovelQA .318 False
PeerQA .313 False
ProceduralMemBench .526 False
QASPER .501 False
REALTALK .445 False
ReMe .665 False
TMD .340 False
ToolBench .650 False
Average .598 nan -

@ItsukiFujii

Contributor Author
| setting | model | MTEB Framework | Paper Reported | abs_delta |
|---|---|---|---|---|
| w_inst | KaLM-Embedding-Gemma3 | 59.83 | 60.10 | 0.27 |
| w_inst | NV-Embed-v2 | 60.00 | 60.25 | 0.25 |
| w_inst | Qwen3-Embedding-0.6B | 54.56 | 54.71 | 0.15 |
| w_inst | multilingual-e5-large-instruct | 54.95 | 55.06 | 0.11 |
| w_inst | bge-multilingual-gemma2 | 61.32 | 61.41 | 0.09 |
| w_inst | Qwen3-Embedding-4B | 53.30 | 53.23 | 0.07 |
| w_inst | bge-m3 (Dense) | 50.93 | 50.88 | 0.05 |
| w_inst | jina-v5-text-small | 54.22 | 54.18 | 0.04 |
| w_inst | bge-large-en-v1.5 | 42.14 | 42.17 | 0.03 |
| w_inst | KaLM-Embedding-V1.5 | 55.19 | 55.21 | 0.02 |
| w_inst | KaLM-Embedding-V2.5 | 54.93 | 54.92 | 0.01 |
| w_inst | e5-mistral-7b-instruct | 55.22 | 55.21 | 0.01 |
| w_inst | Qwen3-Embedding-8B | 55.94 | 55.94 | 0.00 |

As the table shows, the scores obtained with the MTEB evaluation framework closely match those originally reported in the paper: the absolute difference is at most 0.27 points across all 13 models.
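For anyone who wants to re-check the comparison, here is a minimal sketch that recomputes the `abs_delta` column from the two score columns. The values are copied from the table in this comment; the `ROWS` list and the `abs_delta` helper are illustrative names only and are not part of mteb or of the scripts used to produce these results.

```python
# (model, MTEB framework score, paper-reported score), copied from the table.
# ROWS and abs_delta are hypothetical names used only for this check.
ROWS = [
    ("KaLM-Embedding-Gemma3", 59.83, 60.10),
    ("NV-Embed-v2", 60.00, 60.25),
    ("Qwen3-Embedding-0.6B", 54.56, 54.71),
    ("multilingual-e5-large-instruct", 54.95, 55.06),
    ("bge-multilingual-gemma2", 61.32, 61.41),
    ("Qwen3-Embedding-4B", 53.30, 53.23),
    ("bge-m3 (Dense)", 50.93, 50.88),
    ("jina-v5-text-small", 54.22, 54.18),
    ("bge-large-en-v1.5", 42.14, 42.17),
    ("KaLM-Embedding-V1.5", 55.19, 55.21),
    ("KaLM-Embedding-V2.5", 54.93, 54.92),
    ("e5-mistral-7b-instruct", 55.22, 55.21),
    ("Qwen3-Embedding-8B", 55.94, 55.94),
]


def abs_delta(mteb_score: float, paper_score: float) -> float:
    """Absolute difference between the two scores, rounded to two decimals."""
    return round(abs(mteb_score - paper_score), 2)


# Map each model to its recomputed delta.
deltas = {name: abs_delta(mteb, paper) for name, mteb, paper in ROWS}

# Model with the largest gap between the framework and the paper.
worst_model = max(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name:32s} {delta:.2f}")
print(f"largest gap: {worst_model} ({deltas[worst_model]:.2f})")
```

Rounding to two decimals matches the precision of the reported scores and avoids floating-point noise in the subtraction.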

@Samoed

Samoed commented Jul 2, 2026

Member

Great! Thank you for submitting!

Samoed merged commit 3228a76 into embeddings-benchmark:main on Jul 2, 2026
3 checks passed