Hi @kozistr, I've been testing this PR on an NVIDIA RTX 5090 (sm_120) and also on CPU. While the 0.6B model works well, I've encountered a build blockage and a potential scoring discrepancy in the 8B model.
1. Build Issue (Fixed locally)
The build initially failed with
2. Scoring Observation (8B vs 0.6B)
After building the binary with
Test Case:
A Quick Question on My Setup
I noticed that for TEI I am using the official Qwen3-Reranker weights, whereas my vLLM reference uses the
Since the inflated score (0.27 vs 0.0006) persists on both CPU and CUDA backends in TEI, I was wondering:
I'd appreciate it if you could point me in the right direction if I'm doing something wrong. Happy to provide more logs for debugging! Build Env: RTX 5090 | CUDA_COMPUTE_CAP=120 | Target: gRPC |
|
Hi @kozistr, to provide more context for the benchmark results, I would like to share the exact prompt construction logic used in my vLLM reference. Crucially, this structure strictly follows the official recommendation for the Qwen3-Reranker models. This may be useful for comparing the data preprocessing between the two backends.
vLLM Prompt Construction (Python):
from typing import Tuple  # needed for the return annotation below

def _build_prompts(self, query: str, document: str, instruction: str) -> Tuple[str, str]:
    # Part 1: System and User Input
    query_str = (
        f'<|im_start|>system\n'
        f'Judge whether the Document meets the requirements based on the Query and the Instruct provided. '
        f'Note that the answer can only be "yes" or "no".<|im_end|>\n'
        f'<|im_start|>user\n'
        f'<Instruct>: {instruction}\n'
        f'<Query>: {query}\n'
    )
    # Part 2: Document and Assistant/Think Block
    formatted_doc = (
        f"<Document>: {document}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>\n\n</think>\n\n"
    )
    # The two halves are returned separately; concatenating them yields the full prompt.
    return query_str, formatted_doc
A quick question about TEI's expected input:
Additionally, I would like to share an objective observation regarding the 0.6B model's performance.
Observation on Ranking Quality:
This observation might indicate that, even though the 0.6B variant doesn't exhibit the extreme numerical drift seen in the 8B model, there could still be a discrepancy in the inference pipeline (such as tokenization handling, prompt formatting differences, or embedding pooling) affecting the overall ranking accuracy. I am currently gathering more detailed metrics to quantify this NDCG drop. I hope this information is helpful for evaluating the current implementation's alignment with the original model's expected behavior. |
Hi @Kyogre-Primal! Thanks for the report! I looked into it and found that there are some unintended behaviors in the code. In short, So, I've just fixed the bug and checked that it worked!
Currently, I've hard-coded them into the TEI code. You can find them here! This means you can simply send the raw texts to TEI! Thanks! |
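For readers following along, here is a minimal sketch of what "sending raw texts" could look like from a client. It assumes an HTTP build of TEI exposing its `/rerank` route at a hypothetical `http://localhost:8080` (the build discussed in this thread targeted gRPC, so adjust accordingly). The helper names `build_rerank_request` and `rerank` are illustrative and not part of TEI.

```python
import json
import urllib.request


def build_rerank_request(base_url: str, query: str, texts: list) -> urllib.request.Request:
    """Build a POST request for the /rerank route using raw, untemplated text."""
    # No ChatML markers are added here: the thread states that the prompt
    # template is applied on the server side.
    payload = {"query": query, "texts": texts}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(base_url: str, query: str, texts: list):
    """Send the request and decode the JSON response (requires a running server)."""
    req = build_rerank_request(base_url, query, texts)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical local endpoint; this only builds the request, it does not send it.
    req = build_rerank_request(
        "http://localhost:8080",
        "What is the capital of China?",
        ["The capital of Korea is Seoul."],
    )
    print(req.full_url, req.data.decode("utf-8"))
```

Keeping the payload free of template markers avoids applying the ChatML wrapper twice if the server already adds it.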
|
+) I've just updated And I opened PRs to include those configs in the config.json. Please use the following models instead of the main branch model!
|
|
Hi @kozistr , thank you so much for the super quick turnaround and for looking into the issues I raised! It's great to hear that the 8B scoring drift was just related to the tied-embeddings config. Also, thanks for clarifying that TEI expects raw text for the prompts. I haven't had a chance to test your latest commits locally yet, but I will pull the updates and re-run my benchmarks (including the 0.6B NDCG tests) as soon as possible. Really appreciate your hard work and responsiveness on this! 🚀 |
|
Hi @kozistr, I've completed my local benchmarks and I'm happy to report that the results look solid and perfectly align with expectations! To ensure the prompt template was applied correctly for my specific evaluation, I bypassed the hardcoded Rust templates by passing the ChatML structure via a custom
{
"prompts": {
"query": "<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: ",
"document": "\n<Document>: ",
"prefix": "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n",
"suffix": "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
}
}
Evaluation Context:
Qwen3-Reranker-0.6B
Qwen3-Reranker-8B
A quick observation on performance:
BGE-Reranker-Large (Baseline)
Of course, this trade-off between speed and accuracy is completely expected given the CausalLM architecture and the parameter sizes of the Qwen3 models.
Overall, this is a fantastic and highly anticipated addition to TEI. The implementation is working great. Thank you so much for your hard work and responsiveness on this PR! 🚀 |
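As a sanity check on the config above, the sketch below assembles a prompt from the four `prompts` fields and compares it with the string produced by the vLLM-style helper posted earlier in the thread. The concatenation order (`prefix`, `query`, query text, `document`, document text, `suffix`) is an assumption made for illustration and is not confirmed TEI behavior. The function names are hypothetical.

```python
# Prompt pieces copied from the JSON config posted above.
PROMPTS = {
    "query": "<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: ",
    "document": "\n<Document>: ",
    "prefix": (
        "<|im_start|>system\n"
        "Judge whether the Document meets the requirements based on the Query and the Instruct provided. "
        'Note that the answer can only be "yes" or "no".<|im_end|>\n'
        "<|im_start|>user\n"
    ),
    "suffix": "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n",
}

# The instruction embedded in PROMPTS["query"].
INSTRUCTION = "Given a web search query, retrieve relevant passages that answer the query"


def assemble_from_config(prompts: dict, query: str, document: str) -> str:
    # ASSUMPTION: the pieces are concatenated in this order.
    return "".join([
        prompts["prefix"], prompts["query"], query,
        prompts["document"], document, prompts["suffix"],
    ])


def assemble_vllm_style(query: str, document: str, instruction: str) -> str:
    # Mirrors the two-part construction from the vLLM reference in this thread.
    query_str = (
        "<|im_start|>system\n"
        "Judge whether the Document meets the requirements based on the Query and the Instruct provided. "
        'Note that the answer can only be "yes" or "no".<|im_end|>\n'
        "<|im_start|>user\n"
        f"<Instruct>: {instruction}\n"
        f"<Query>: {query}\n"
    )
    formatted_doc = (
        f"<Document>: {document}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n\n</think>\n\n"
    )
    return query_str + formatted_doc


if __name__ == "__main__":
    q = "What is the capital of China?"
    d = "The capital of Korea is Seoul."
    same = assemble_from_config(PROMPTS, q, d) == assemble_vllm_style(q, d, INSTRUCTION)
    print("prompts identical:", same)
```

Under that ordering assumption the two strings match character for character, because the newline that ends the query half in the vLLM helper is carried by the leading `\n` of the `document` field in the config.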
The build was failing with `error[E0432]` due to unresolved imports for `ClassificationHead` and `Qwen3ClassificationHead` in `flash_qwen3.rs`. This commit updates the import paths to correctly reference the `qwen3` module instead of the root `models` module, preventing namespace collisions and fixing the build block.
No worries :) Thanks for running the benchmark, and of course, for the fix. Feel free to ping me if you spot any bugs or have any ideas for the current design :) |
|
LGTM |
|
Hi @kozistr, Hope everything is going well! I noticed this PR hasn't been merged yet, and I was thinking about a potential upstream blocker that might affect the user experience once it lands. Since the current implementation in To prevent this from blocking users from using the official weights out-of-the-box (and to decouple this PR from Qwen's slow review process), do you think we could provide a fallback mechanism directly within TEI? For example:
This way, users can just pull the official Let me know what you think about this approach! I'd be happy to help test the fallback logic if you decide to implement it. |
Hi @Kyogre-Primal! For now, we can just load the model with a specific revision!
I've tried it, but the problem is that the Qwen embedding and reranker models have almost identical configurations, so it's hard to tell them apart just by looking at the config file :(
As you mentioned, it might be a good idea to specify the revision number in the README, so users know which version to use! Thanks! |
|
Hi @kozistr! Thanks for the explanation! The My only remaining thought is that new users might not immediately know which revision to use — so documenting the recommended revision in the README (as you mentioned) would definitely help. If there's anything I can do to help with that, happy to open a follow-up PR. Overall, great work on this PR — Qwen3-Reranker support is a valuable addition. Thanks again for all the effort! 🎉 |
What does this PR do?
Fixes #643
Fixes #691
Fixes #763
Related to #795, #730
Note
I added the qwen3 prompts (prefix, suffix, query, document) to the `prompts` variable, which saves the prompts from the sentence transformer config, primarily to maintain the current structure as much as possible.
I think it'd be better to refactor this using a template in the future!
Add -> I found that `label2id` and `id2label` settings when the model is a classifier and does not have the. `Qwen3-Reranker` has the same config as `Qwen3-Embedding`. So, to distinguish them, I manually input the `label2id` and `id2label` configs into the `config.json`. `Dual` input as well.
Feel free to leave a comment or feedback!
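To illustrate the idea in the note above, here is a small sketch of telling two otherwise identical configs apart by the presence of label maps. It is not the TEI implementation. The function `looks_like_classifier`, the sample `architectures` value, and the label names are hypothetical placeholders chosen for the example.

```python
def looks_like_classifier(config: dict) -> bool:
    """Hypothetical check: treat a config as a classifier/reranker only if
    both label maps are present and non-empty."""
    return bool(config.get("id2label")) and bool(config.get("label2id"))


# Illustrative configs only; real config.json files contain many more fields.
embedding_config = {
    "architectures": ["Qwen3ForCausalLM"],  # assumed identical for both models
    "model_type": "qwen3",
}

# Same base config, with label maps added by hand as the note describes.
reranker_config = {
    **embedding_config,
    "id2label": {"0": "LABEL_0"},  # placeholder label name
    "label2id": {"LABEL_0": 0},
}

if __name__ == "__main__":
    print("embedding:", looks_like_classifier(embedding_config))
    print("reranker:", looks_like_classifier(reranker_config))
```

This also shows why the official main-branch weights are ambiguous in this scheme: without the added fields, nothing in the config marks the model as a reranker.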
Output (tested on CPU)
query: "What is the capital of China?"
document: "The capital of Korea is Seoul."
Transformer: score 0.001029968
TEI: (f32) 0.0011602141, (f16) 0.001133569
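For context on how such scores can be compared, the sketch below does two things. It computes a yes-probability from a pair of "yes"/"no" logits with a two-way softmax, which is how the Qwen3-Reranker model card derives its score. It also measures the relative gap between the Transformer reference and the TEI outputs listed above. Whether TEI computes the score in exactly this way is an assumption, and the helper names are illustrative.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the candidate-token logits, returning P("yes")."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def relative_gap(reference: float, observed: float) -> float:
    """Absolute difference expressed as a fraction of the reference score."""
    return abs(observed - reference) / abs(reference)


if __name__ == "__main__":
    transformers_score = 0.001029968  # values from the output above
    tei_f32 = 0.0011602141
    tei_f16 = 0.001133569
    print("f32 relative gap:", relative_gap(transformers_score, tei_f32))
    print("f16 relative gap:", relative_gap(transformers_score, tei_f16))
    # Equal logits give an undecided score of one half.
    print("P(yes) for equal logits:", yes_probability(0.0, 0.0))
```

Because the score is a softmax output close to zero, a small shift in the underlying logits shows up as a noticeable relative gap even when the absolute difference is tiny.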
Before submitting
`insta` snapshots?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@alvarobartt