
Support Qwen3-Reranker #835

Open
kozistr wants to merge 18 commits into huggingface:main from kozistr:feature/qwen3-reranker

Conversation

@kozistr
Contributor

@kozistr kozistr commented Feb 20, 2026

What does this PR do?

Fixes #643
Fixes #691
Fixes #763
Related to #795, #730

Note

I added the Qwen3 prompts (prefix, suffix, query, document) to the prompts variable, which stores the prompts loaded from the sentence-transformers config, primarily to preserve the current structure as much as possible.

I think it'd be better to refactor this using a template in the future!

Feel free to leave a comment or feedback!

Output (tested on CPU)

  • query: "What is the capital of China?"

  • document: "The capital of Korea is Seoul."

  • Transformer: score 0.001029968

  • TEI: (f32) 0.0011602141, (f16) 0.001133569
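For reference, TEI's /rerank route takes the raw query and documents (the prompt template is applied server-side). A minimal sketch of the request body used for this kind of test — the local endpoint URL is an assumption, and the helper name is mine:

```python
import json

def build_rerank_payload(query: str, texts: list[str]) -> str:
    # Body for POST /rerank on a locally running TEI server
    # (e.g. http://localhost:8080/rerank); only raw strings are sent,
    # no manual ChatML formatting.
    return json.dumps({"query": query, "texts": texts, "raw_scores": False})

payload = build_rerank_payload(
    "What is the capital of China?",
    ["The capital of Korea is Seoul."],
)
print(payload)
```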

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@alvarobartt

@Kyogre-Primal

Kyogre-Primal commented Mar 19, 2026

Hi @kozistr ,

I've been testing this PR on an NVIDIA RTX 5090 (sm_120) and also on CPU. While the 0.6B model works well, I've encountered a build failure and a potential scoring discrepancy with the 8B model.

1. Build Issue (Fixed locally)

The build initially failed with error[E0432]: unresolved imports in backends/candle/src/models/flash_qwen3.rs. It seems ClassificationHead and Qwen3ClassificationHead are not correctly accessible from crate::models. I bypassed this locally by updating the imports to crate::models::qwen3::....

2. Scoring Observation (8B vs 0.6B)

After building the binary with CUDA_COMPUTE_CAP=120 (and also testing on CPU), I noticed that the 8B model in TEI seems to over-score irrelevant pairs significantly compared to the 0.6B version and vLLM.

Test Case:
Query: "What is the capital of China?" | Doc: "The capital of Korea is Seoul."

| Model Variant | Backend | Score | Weight Source |
| --- | --- | --- | --- |
| Qwen3-0.6B | TEI (Candle) | 0.0011513996 | Qwen/Qwen3-Reranker-0.6B |
| Qwen3-0.6B | vLLM | 0.00035568897 | tomaarsen/...-0.6B-seq-cls |
| Qwen3-8B | TEI (Candle) | 0.2746431 | Qwen/Qwen3-Reranker-8B |
| Qwen3-8B | vLLM | 0.00067309104 | tomaarsen/...-8B-seq-cls |

A Quick Question on My Setup

I noticed that for TEI I am using the official Qwen3-Reranker weights, whereas my vLLM reference uses the tomaarsen/Qwen3-Reranker-x-seq-cls variants (which I previously verified as working correctly).

Since the inflated score (0.27 vs 0.0006) persists on both CPU and CUDA backends in TEI, I was wondering:

  • Could there be a mapping issue in ClassificationHead when loading the official 8B weights (possibly related to GQA or hidden dimension size)?
  • Or is there a specific configuration I might have missed in the environment to align the output with the seq-cls reference?

I'd appreciate it if you could point me in the right direction if I'm doing something wrong. Happy to provide more logs for debugging!


Build Env: RTX 5090 | CUDA_COMPUTE_CAP=120 | Target: gRPC

@Kyogre-Primal

Kyogre-Primal commented Mar 19, 2026

Hi @kozistr ,

To provide more context for the benchmark results, I would like to share the exact prompt construction logic used in my vLLM reference. Crucially, this structure strictly follows the official recommendation for the Qwen3-Reranker models. This may be useful for comparing the data preprocessing between the two backends.

vLLM Prompt Construction (Python):

def _build_prompts(self, query: str, document: str, instruction: str) -> Tuple[str, str]:
    # Part 1: System and User Input
    query_str = (
        f'<|im_start|>system\n'
        f'Judge whether the Document meets the requirements based on the Query and the Instruct provided. '
        f'Note that the answer can only be "yes" or "no".<|im_end|>\n'
        f'<|im_start|>user\n'
        f'<Instruct>: {instruction}\n'
        f'<Query>: {query}\n'
    )
    
    # Part 2: Document and Assistant/Think Block
    formatted_doc = (
        f"<Document>: {document}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>\n\n</think>\n\n"
    )
    return query_str, formatted_doc

A quick question about TEI's expected input:
When using general-purpose engines like vLLM, it's common to build these templates manually. However, since TEI's /rerank API naturally separates the query and texts fields, I wasn't quite sure if I should pass the raw text or pre-format it.

For my TEI tests, I simply passed the raw strings, wondering if TEI might automatically apply the model's chat_template under the hood. If TEI currently uses raw concatenation for Qwen3 instead of the ChatML structure, I'm curious if this difference in input format might be a contributing factor to the 8B model's score drift and the 0.6B's quality drop. Could you kindly let me know the recommended way to pass inputs for such instruction-tuned models in TEI?

Additionally, I would like to share an objective observation regarding the 0.6B model's performance.

Observation on Ranking Quality:
Although the individual score for the 0.6B model (approx. 0.001) appears to be within a reasonable numerical range compared to vLLM, my further testing indicates a noticeable drop in NDCG scores when evaluated against established benchmarks for this model.

This observation might indicate that, even though the 0.6B variant doesn't exhibit the extreme numerical drift seen in the 8B model, there could still be a discrepancy in the inference pipeline (such as tokenization handling, prompt formatting differences, or embedding pooling) affecting the overall ranking accuracy.

I am currently gathering more detailed metrics to quantify this NDCG drop. I hope this information is helpful for evaluating the current implementation's alignment with the original model's expected behavior.

@kozistr
Contributor Author

kozistr commented Mar 21, 2026

> To provide more context for the benchmark results, I would like to share the exact prompt construction logic used in my vLLM reference. […] Could you kindly let me know the recommended way to pass inputs for such instruction-tuned models in TEI?

Hi @Kyogre-Primal! Thanks for the report! I looked into it and found some unintended behaviors in the code. In short, Qwen3-Reranker-8B isn't supposed to use tied embeddings, but it turned out they were actually enabled.

So, I've just fixed the bug and checked that it worked!

Query: "What is the capital of China?" | Doc: "The capital of Korea is Seoul."
Transformer: 0.00022876262664794922
TEI: 0.0002287133
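For context on what these numbers represent: the Qwen3-Reranker score is, per the model's usage recipe, the probability of the "yes" token under a softmax over the "yes"/"no" token logits at the final position. A minimal sketch of that reduction (the logit values below are made up for illustration):

```python
import math

def yes_probability(no_logit: float, yes_logit: float) -> float:
    # Numerically stable two-way softmax; the "yes" probability is
    # the relevance score returned to the caller.
    m = max(no_logit, yes_logit)
    e_no = math.exp(no_logit - m)
    e_yes = math.exp(yes_logit - m)
    return e_yes / (e_no + e_yes)

# Equal logits give 0.5; a strongly negative "yes" logit gives a
# score near zero, as with the irrelevant pair above.
print(yes_probability(0.0, 0.0))   # → 0.5
print(yes_probability(4.0, -4.0))
```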

> Could you kindly let me know the recommended way to pass inputs for such instruction-tuned models in TEI?

Currently, I've hard-coded them into the TEI code. You can find them here!

That means you can simply send raw texts to TEI!

Thanks!

@kozistr
Contributor Author

kozistr commented Mar 21, 2026

+) I've just updated get_backend_model_type to recognize the model as a classifier when config.json contains the label2id and id2label fields and the architecture name ends with CausalLM.

And I've opened PRs to add those fields to the config.json. Please use the following models instead of the main-branch models!
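Roughly, the detection rule looks like this (Python sketch with a hypothetical helper name; the actual TEI implementation is in Rust):

```python
def looks_like_classifier(config: dict) -> bool:
    # Treat a CausalLM checkpoint as a (re)ranking classifier when
    # the config carries explicit label mappings.
    has_labels = "label2id" in config and "id2label" in config
    archs = config.get("architectures") or []
    return has_labels and any(a.endswith("CausalLM") for a in archs)

reranker_cfg = {
    "architectures": ["Qwen3ForCausalLM"],
    "label2id": {"no": 0, "yes": 1},
    "id2label": {"0": "no", "1": "yes"},
}
plain_lm_cfg = {"architectures": ["Qwen3ForCausalLM"]}
print(looks_like_classifier(reranker_cfg))   # → True
print(looks_like_classifier(plain_lm_cfg))   # → False
```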

@Kyogre-Primal

Hi @kozistr , thank you so much for the super quick turnaround and for looking into the issues I raised! It's great to hear that the 8B scoring drift was just related to the tied-embeddings config. Also, thanks for clarifying that TEI expects raw text for the prompts.

I haven't had a chance to test your latest commits locally yet, but I will pull the updates and re-run my benchmarks (including the 0.6B NDCG tests) as soon as possible. Really appreciate your hard work and responsiveness on this! 🚀

@Kyogre-Primal

Kyogre-Primal commented Mar 24, 2026

Hi @kozistr,

I've completed my local benchmarks and I'm happy to report that the results look solid and perfectly align with expectations!

To ensure the prompt template was applied correctly for my specific evaluation, I bypassed the hardcoded Rust templates by passing the ChatML structure via a custom config_sentence_transformers.json in the model directory. This approach worked flawlessly.

{
  "prompts": {
    "query": "<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: ",
    "document": "\n<Document>: ",
    "prefix": "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n",
    "suffix": "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
  }
}
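Assembled in order (prefix + query prompt + query + document prompt + document + suffix), these fields reproduce the ChatML structure from my earlier vLLM snippet. A quick sketch of the concatenation, assuming that ordering:

```python
prompts = {
    "query": "<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: ",
    "document": "\n<Document>: ",
    "prefix": (
        "<|im_start|>system\nJudge whether the Document meets the requirements "
        'based on the Query and the Instruct provided. Note that the answer can '
        'only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
    ),
    "suffix": "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n",
}

def render(p: dict, query: str, document: str) -> str:
    # Concatenation order assumed from the config's field roles.
    return p["prefix"] + p["query"] + query + p["document"] + document + p["suffix"]

text = render(prompts, "What is the capital of China?", "The capital of Korea is Seoul.")
print(text)
```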

Evaluation Context:
For the ranking quality metrics, the NDCG scores presented below were calculated based on a custom dataset consisting of 5,000 query-document pairs annotated by GPT-5.2. Here are the benchmark results for both variants on my setup:

Qwen3-Reranker-0.6B

concurrency       succ%  p50(ms)  p95(ms)  p99(ms)    req/s      @1      @3      @5     @10     @20     @30     @50
----------------------------------------------------------------------------------------------------------------------
1                100.0%      9.6     41.0    107.4     71.4  0.8358  0.8290  0.8222  0.8299  0.8442  0.8666  0.9174

Qwen3-Reranker-8B

concurrency       succ%  p50(ms)  p95(ms)  p99(ms)    req/s      @1      @3      @5     @10     @20     @30     @50
----------------------------------------------------------------------------------------------------------------------
1                100.0%     71.8    318.2    710.0      9.9  0.8474  0.8514  0.8437  0.8510  0.8644  0.8824  0.9292

A quick observation on performance:
While the ranking quality (especially for the 8B model) is excellent, the latency and throughput drop is quite noticeable when compared to traditional encoder-based rerankers. For context, here is my baseline for bge-reranker-large on the same machine:

BGE-Reranker-Large (Baseline)

concurrency       succ%  p50(ms)  p95(ms)  p99(ms)    req/s      @1      @3      @5     @10     @20     @30     @50
----------------------------------------------------------------------------------------------------------------------
1                100.0%      4.0      6.6     10.8    237.4  0.8289  0.8232  0.8161  0.8205  0.8356  0.8563  0.9155

Of course, this trade-off between speed and accuracy is completely expected given the CausalLM architecture and the parameter sizes of the Qwen3 models.

Overall, this is a fantastic and highly anticipated addition to TEI. The implementation is working great. Thank you so much for your hard work and responsiveness on this PR! 🚀

The build was failing with `error[E0432]` due to unresolved imports for `ClassificationHead` and `Qwen3ClassificationHead` in `flash_qwen3.rs`. This commit updates the import paths to correctly reference the `qwen3` module instead of the root `models` module, preventing namespace collisions and fixing the build block.
@kozistr
Contributor Author

kozistr commented Mar 24, 2026

> I've completed my local benchmarks and I'm happy to report that the results look solid and perfectly align with expectations! […]

No worries :) Thanks for running the benchmark, and of course, for the fix.

Feel free to ping me if you spot any bugs or have any ideas for the current design :)

@alvarobartt alvarobartt modified the milestone: v1.10.0 Mar 31, 2026
@linshuai277

LGTM

@Kyogre-Primal

Hi @kozistr,

Hope everything is going well! I noticed this PR hasn't been merged yet, and I was thinking about a potential upstream blocker that might affect the user experience once it lands.

Since the current implementation in get_backend_model_type relies on the Qwen3 config.json having the label2id and id2label fields, users currently need the Qwen team to merge your Hugging Face PRs (like HF discussion #10 and #23). However, from what I've seen, the upstream Qwen maintainers are quite unresponsive and rarely merge community config updates.

To prevent this from blocking users from using the official weights out-of-the-box (and to decouple this PR from Qwen's slow review process), do you think we could provide a fallback mechanism directly within TEI?

For example:

  • If label2id and id2label are missing in the config, but the model's architectures or _name_or_path indicates it is a Qwen3 Reranker variant, could we programmatically inject/assume the default classification mapping?

This way, users can just pull the official Qwen/Qwen3-Reranker-8B without needing to modify the config.json themselves or use a forked repository. I suspect having it work seamlessly with the official unmodified weights might also help get this PR smoothly reviewed and merged by the TEI maintainers.

Let me know what you think about this approach! I'd be happy to help test the fallback logic if you decide to implement it.
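A hypothetical sketch of the fallback I have in mind (the helper name and the injected default labels are purely illustrative, not TEI code):

```python
DEFAULT_LABELS = {"id2label": {"0": "LABEL_0"}, "label2id": {"LABEL_0": 0}}

def with_reranker_fallback(config: dict) -> dict:
    # If label mappings are already present, keep the config as-is.
    if "label2id" in config and "id2label" in config:
        return config
    # Otherwise, inject defaults only when the checkpoint name marks
    # it as a Qwen3 reranker variant.
    name = config.get("_name_or_path", "")
    if "Qwen3-Reranker" in name:
        return {**config, **DEFAULT_LABELS}
    return config

cfg = with_reranker_fallback({"_name_or_path": "Qwen/Qwen3-Reranker-8B"})
print("label2id" in cfg)   # → True
```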

@kozistr
Contributor Author

kozistr commented Apr 7, 2026

> Since the current implementation in get_backend_model_type relies on the Qwen3 config.json having the label2id and id2label fields, users currently need the Qwen team to merge your Hugging Face PRs. […] Do you think we could provide a fallback mechanism directly within TEI?

Hi @Kyogre-Primal! For now, we can just load the model with a specific revision!

      --revision <REVISION>
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

          [env: REVISION=]

> _name_or_path indicates it is a Qwen3 Reranker variant

I've tried it, but the problem is that Qwen embedding and reranker models have almost identical configurations, so it's hard to tell them apart just by looking at the config file :(

As you mentioned, it might be a good idea to specify the revision number in the README so users know which version to use!

thanks!

@Kyogre-Primal

Hi @kozistr! Thanks for the explanation! The --revision approach makes sense as a pragmatic solution, especially given the difficulty of distinguishing reranker from embedding models by config alone.

My only remaining thought is that new users might not immediately know which revision to use — so documenting the recommended revision in the README (as you mentioned) would definitely help. If there's anything I can do to help with that, happy to open a follow-up PR.

Overall, great work on this PR — Qwen3-Reranker support is a valuable addition. Thanks again for all the effort! 🎉



Development

Successfully merging this pull request may close these issues.

  • Qwen3-Reranker-0.6B start error
  • Any plan to support qwen3 reranker model in TEI
  • Feature Request: Add Support for Qwen3-Reranker Model

4 participants