
Support Qwen3-Reranker #835

Open
kozistr wants to merge 18 commits into huggingface:main from kozistr:feature/qwen3-reranker

Conversation

@kozistr
Contributor

@kozistr kozistr commented Feb 20, 2026

What does this PR do?

Fixes #643
Fixes #691
Fixes #763
Related to #795, #730

Note

I added the Qwen3 prompts (prefix, suffix, query, document) to the prompts variable, which stores the prompts loaded from the sentence-transformers config, primarily to preserve the current structure as much as possible.

I think it'd be better to refactor this using a template in the future!

Feel free to leave a comment or feedback!

Output (tested on CPU)

  • query: "What is the capital of China?"

  • document: "The capital of Korea is Seoul."

  • Transformer: score 0.001029968

  • TEI: (f32) 0.0011602141, (f16) 0.001133569
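For reference, TEI's /rerank route takes the raw query and documents (the prompt template is applied server-side). A minimal sketch of the request body used for this kind of test — the local endpoint URL is an assumption, and the helper name is mine:

```python
import json

def build_rerank_payload(query: str, texts: list[str]) -> str:
    # Body for POST /rerank on a locally running TEI server
    # (e.g. http://localhost:8080/rerank); only raw strings are sent,
    # no manual ChatML formatting.
    return json.dumps({"query": query, "texts": texts, "raw_scores": False})

payload = build_rerank_payload(
    "What is the capital of China?",
    ["The capital of Korea is Seoul."],
)
print(payload)
```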

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@alvarobartt

@Kyogre-Primal

Kyogre-Primal commented Mar 19, 2026

Hi @kozistr ,

I've been testing this PR on an NVIDIA RTX 5090 (sm_120) and also on CPU. While the 0.6B model works well, I've encountered a build failure and a potential scoring discrepancy with the 8B model.

1. Build Issue (Fixed locally)

The build initially failed with error[E0432]: unresolved imports in backends/candle/src/models/flash_qwen3.rs. It seems ClassificationHead and Qwen3ClassificationHead are not correctly accessible from crate::models. I bypassed this locally by updating the imports to crate::models::qwen3::....

2. Scoring Observation (8B vs 0.6B)

After building the binary with CUDA_COMPUTE_CAP=120 (and also testing on CPU), I noticed that the 8B model in TEI seems to over-score irrelevant pairs significantly compared to the 0.6B version and vLLM.

Test Case:
Query: "What is the capital of China?" | Doc: "The capital of Korea is Seoul."

| Model Variant | Backend | Score | Weight Source |
| --- | --- | --- | --- |
| Qwen3-0.6B | TEI (Candle) | 0.0011513996 | Qwen/Qwen3-Reranker-0.6B |
| Qwen3-0.6B | vLLM | 0.00035568897 | tomaarsen/...-0.6B-seq-cls |
| Qwen3-8B | TEI (Candle) | 0.2746431 | Qwen/Qwen3-Reranker-8B |
| Qwen3-8B | vLLM | 0.00067309104 | tomaarsen/...-8B-seq-cls |

A Quick Question on My Setup

I noticed that for TEI I am using the official Qwen3-Reranker weights, whereas my vLLM reference uses the tomaarsen/Qwen3-Reranker-x-seq-cls variants (which I previously verified as working correctly).

Since the inflated score (0.27 vs 0.0006) persists on both CPU and CUDA backends in TEI, I was wondering:

  • Could there be a mapping issue in ClassificationHead when loading the official 8B weights (possibly related to GQA or hidden dimension size)?
  • Or is there a specific configuration I might have missed in the environment to align the output with the seq-cls reference?

I'd appreciate it if you could point me in the right direction if I'm doing something wrong. Happy to provide more logs for debugging!


Build Env: RTX 5090 | CUDA_COMPUTE_CAP=120 | Target: gRPC

@Kyogre-Primal

Kyogre-Primal commented Mar 19, 2026

Hi @kozistr ,

To provide more context for the benchmark results, I would like to share the exact prompt construction logic used in my vLLM reference. Crucially, this structure strictly follows the official recommendation for the Qwen3-Reranker models. This may be useful for comparing the data preprocessing between the two backends.

vLLM Prompt Construction (Python):

def _build_prompts(self, query: str, document: str, instruction: str) -> Tuple[str, str]:
    # Part 1: System and User Input
    query_str = (
        f'<|im_start|>system\n'
        f'Judge whether the Document meets the requirements based on the Query and the Instruct provided. '
        f'Note that the answer can only be "yes" or "no".<|im_end|>\n'
        f'<|im_start|>user\n'
        f'<Instruct>: {instruction}\n'
        f'<Query>: {query}\n'
    )
    
    # Part 2: Document and Assistant/Think Block
    formatted_doc = (
        f"<Document>: {document}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>\n\n</think>\n\n"
    )
    return query_str, formatted_doc

A quick question about TEI's expected input:
When using general-purpose engines like vLLM, it's common to build these templates manually. However, since TEI's /rerank API naturally separates the query and texts fields, I wasn't quite sure if I should pass the raw text or pre-format it.

For my TEI tests, I simply passed the raw strings, wondering if TEI might automatically apply the model's chat_template under the hood. If TEI currently uses raw concatenation for Qwen3 instead of the ChatML structure, I'm curious if this difference in input format might be a contributing factor to the 8B model's score drift and the 0.6B's quality drop. Could you kindly let me know the recommended way to pass inputs for such instruction-tuned models in TEI?

Additionally, I would like to share an objective observation regarding the 0.6B model's performance.

Observation on Ranking Quality:
Although the individual score for the 0.6B model (approx. 0.001) appears to be within a reasonable numerical range compared to vLLM, my further testing indicates a noticeable drop in NDCG scores when evaluated against established benchmarks for this model.

This observation might indicate that, even though the 0.6B variant doesn't exhibit the extreme numerical drift seen in the 8B model, there could still be a discrepancy in the inference pipeline (such as tokenization handling, prompt formatting differences, or embedding pooling) affecting the overall ranking accuracy.

I am currently gathering more detailed metrics to quantify this NDCG drop. I hope this information is helpful for evaluating the current implementation's alignment with the original model's expected behavior.

@kozistr
Contributor Author

kozistr commented Mar 21, 2026

> To provide more context for the benchmark results, I would like to share the exact prompt construction logic used in my vLLM reference. […] Could you kindly let me know the recommended way to pass inputs for such instruction-tuned models in TEI?

Hi @Kyogre-Primal! Thanks for the report! I looked into it and found some unintended behaviors in the code. In short, Qwen3-Reranker-8B isn't supposed to use tied embeddings, but it turned out they were actually enabled.

So, I've just fixed the bug and checked that it worked!

Query: "What is the capital of China?" | Doc: "The capital of Korea is Seoul."
Transformer: 0.00022876262664794922
TEI: 0.0002287133
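For context on what these numbers represent: the Qwen3-Reranker score is, per the model's usage recipe, the probability of the "yes" token under a softmax over the "yes"/"no" token logits at the final position. A minimal sketch of that reduction (the logit values below are made up for illustration):

```python
import math

def yes_probability(no_logit: float, yes_logit: float) -> float:
    # Numerically stable two-way softmax; the "yes" probability is
    # the relevance score returned to the caller.
    m = max(no_logit, yes_logit)
    e_no = math.exp(no_logit - m)
    e_yes = math.exp(yes_logit - m)
    return e_yes / (e_no + e_yes)

# Equal logits give 0.5; a strongly negative "yes" logit gives a
# score near zero, as with the irrelevant pair above.
print(yes_probability(0.0, 0.0))   # → 0.5
print(yes_probability(4.0, -4.0))
```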

> Could you kindly let me know the recommended way to pass inputs for such instruction-tuned models in TEI?

Currently, I've hard-coded them into the TEI code. You can find them here!

That means you can simply send raw texts to TEI!

Thanks!

@kozistr
Contributor Author

kozistr commented Mar 21, 2026

+) I've just updated get_backend_model_type to recognize the model as a classifier when config.json contains the label2id and id2label fields and the architecture name ends with CausalLM.

And I've opened PRs to add those fields to the config.json. Please use the following models instead of the main-branch models!
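Roughly, the detection rule looks like this (Python sketch with a hypothetical helper name; the actual TEI implementation is in Rust):

```python
def looks_like_classifier(config: dict) -> bool:
    # Treat a CausalLM checkpoint as a (re)ranking classifier when
    # the config carries explicit label mappings.
    has_labels = "label2id" in config and "id2label" in config
    archs = config.get("architectures") or []
    return has_labels and any(a.endswith("CausalLM") for a in archs)

reranker_cfg = {
    "architectures": ["Qwen3ForCausalLM"],
    "label2id": {"no": 0, "yes": 1},
    "id2label": {"0": "no", "1": "yes"},
}
plain_lm_cfg = {"architectures": ["Qwen3ForCausalLM"]}
print(looks_like_classifier(reranker_cfg))   # → True
print(looks_like_classifier(plain_lm_cfg))   # → False
```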

@Kyogre-Primal

Hi @kozistr , thank you so much for the super quick turnaround and for looking into the issues I raised! It's great to hear that the 8B scoring drift was just related to the tied-embeddings config. Also, thanks for clarifying that TEI expects raw text for the prompts.

I haven't had a chance to test your latest commits locally yet, but I will pull the updates and re-run my benchmarks (including the 0.6B NDCG tests) as soon as possible. Really appreciate your hard work and responsiveness on this! 🚀

@Kyogre-Primal

Kyogre-Primal commented Mar 24, 2026

Hi @kozistr,

I've completed my local benchmarks and I'm happy to report that the results look solid and perfectly align with expectations!

To ensure the prompt template was applied correctly for my specific evaluation, I bypassed the hardcoded Rust templates by passing the ChatML structure via a custom config_sentence_transformers.json in the model directory. This approach worked flawlessly.

{
  "prompts": {
    "query": "<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: ",
    "document": "\n<Document>: ",
    "prefix": "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n",
    "suffix": "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
  }
}
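Assembled in order (prefix + query prompt + query + document prompt + document + suffix), these fields reproduce the ChatML structure from my earlier vLLM snippet. A quick sketch of the concatenation, assuming that ordering:

```python
prompts = {
    "query": "<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: ",
    "document": "\n<Document>: ",
    "prefix": (
        "<|im_start|>system\nJudge whether the Document meets the requirements "
        'based on the Query and the Instruct provided. Note that the answer can '
        'only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
    ),
    "suffix": "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n",
}

def render(p: dict, query: str, document: str) -> str:
    # Concatenation order assumed from the config's field roles.
    return p["prefix"] + p["query"] + query + p["document"] + document + p["suffix"]

text = render(prompts, "What is the capital of China?", "The capital of Korea is Seoul.")
print(text)
```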

Evaluation Context:
For the ranking quality metrics, the NDCG scores presented below were calculated based on a custom dataset consisting of 5,000 query-document pairs annotated by GPT-5.2. Here are the benchmark results for both variants on my setup:

Qwen3-Reranker-0.6B

concurrency       succ%  p50(ms)  p95(ms)  p99(ms)    req/s      @1      @3      @5     @10     @20     @30     @50
----------------------------------------------------------------------------------------------------------------------
1                100.0%      9.6     41.0    107.4     71.4  0.8358  0.8290  0.8222  0.8299  0.8442  0.8666  0.9174

Qwen3-Reranker-8B

concurrency       succ%  p50(ms)  p95(ms)  p99(ms)    req/s      @1      @3      @5     @10     @20     @30     @50
----------------------------------------------------------------------------------------------------------------------
1                100.0%     71.8    318.2    710.0      9.9  0.8474  0.8514  0.8437  0.8510  0.8644  0.8824  0.9292

A quick observation on performance:
While the ranking quality (especially for the 8B model) is excellent, the latency and throughput drop is quite noticeable when compared to traditional encoder-based rerankers. For context, here is my baseline for bge-reranker-large on the same machine:

BGE-Reranker-Large (Baseline)

concurrency       succ%  p50(ms)  p95(ms)  p99(ms)    req/s      @1      @3      @5     @10     @20     @30     @50
----------------------------------------------------------------------------------------------------------------------
1                100.0%      4.0      6.6     10.8    237.4  0.8289  0.8232  0.8161  0.8205  0.8356  0.8563  0.9155

Of course, this trade-off between speed and accuracy is completely expected given the CausalLM architecture and the parameter sizes of the Qwen3 models.

Overall, this is a fantastic and highly anticipated addition to TEI. The implementation is working great. Thank you so much for your hard work and responsiveness on this PR! 🚀

The build was failing with `error[E0432]` due to unresolved imports for `ClassificationHead` and `Qwen3ClassificationHead` in `flash_qwen3.rs`. This commit updates the import paths to correctly reference the `qwen3` module instead of the root `models` module, preventing namespace collisions and fixing the build block.
@kozistr
Contributor Author

kozistr commented Mar 24, 2026

> I've completed my local benchmarks and I'm happy to report that the results look solid and perfectly align with expectations! […]

No worries :) Thanks for running the benchmark, and of course, for the fix.

Feel free to ping me if you spot any bugs or have any ideas for the current design :)

@alvarobartt alvarobartt modified the milestone: v1.10.0 Mar 31, 2026
@linshuai277

LGTM

@Kyogre-Primal

Hi @kozistr,

Hope everything is going well! I noticed this PR hasn't been merged yet, and I was thinking about a potential upstream blocker that might affect the user experience once it lands.

Since the current implementation in get_backend_model_type relies on the Qwen3 config.json having the label2id and id2label fields, users currently need the Qwen team to merge your Hugging Face PRs (like HF discussion #10 and #23). However, from what I've seen, the upstream Qwen maintainers are quite unresponsive and rarely merge community config updates.

To prevent this from blocking users from using the official weights out-of-the-box (and to decouple this PR from Qwen's slow review process), do you think we could provide a fallback mechanism directly within TEI?

For example:

  • If label2id and id2label are missing in the config, but the model's architectures or _name_or_path indicates it is a Qwen3 Reranker variant, could we programmatically inject/assume the default classification mapping?

This way, users can just pull the official Qwen/Qwen3-Reranker-8B without needing to modify the config.json themselves or use a forked repository. I suspect having it work seamlessly with the official unmodified weights might also help get this PR smoothly reviewed and merged by the TEI maintainers.

Let me know what you think about this approach! I'd be happy to help test the fallback logic if you decide to implement it.
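A hypothetical sketch of the fallback I have in mind (the helper name and the injected default labels are purely illustrative, not TEI code):

```python
DEFAULT_LABELS = {"id2label": {"0": "LABEL_0"}, "label2id": {"LABEL_0": 0}}

def with_reranker_fallback(config: dict) -> dict:
    # If label mappings are already present, keep the config as-is.
    if "label2id" in config and "id2label" in config:
        return config
    # Otherwise, inject defaults only when the checkpoint name marks
    # it as a Qwen3 reranker variant.
    name = config.get("_name_or_path", "")
    if "Qwen3-Reranker" in name:
        return {**config, **DEFAULT_LABELS}
    return config

cfg = with_reranker_fallback({"_name_or_path": "Qwen/Qwen3-Reranker-8B"})
print("label2id" in cfg)   # → True
```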

@kozistr
Contributor Author

kozistr commented Apr 7, 2026

> Since the current implementation in get_backend_model_type relies on the Qwen3 config.json having the label2id and id2label fields, users currently need the Qwen team to merge your Hugging Face PRs. […] Do you think we could provide a fallback mechanism directly within TEI?

Hi @Kyogre-Primal! For now, we can just load the model with a specific revision!

      --revision <REVISION>
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

          [env: REVISION=]

> _name_or_path indicates it is a Qwen3 Reranker variant

I've tried it, but the problem is that Qwen embedding and reranker models have almost identical configurations, so it's hard to tell them apart just by looking at the config file :(

As you mentioned, it might be a good idea to specify the revision number in the README so users know which version to use!

thanks!

@Kyogre-Primal

Hi @kozistr! Thanks for the explanation! The --revision approach makes sense as a pragmatic solution, especially given the difficulty of distinguishing reranker from embedding models by config alone.

My only remaining thought is that new users might not immediately know which revision to use — so documenting the recommended revision in the README (as you mentioned) would definitely help. If there's anything I can do to help with that, happy to open a follow-up PR.

Overall, great work on this PR — Qwen3-Reranker support is a valuable addition. Thanks again for all the effort! 🎉



Development

Successfully merging this pull request may close these issues.

  • Qwen3-Reranker-0.6B start error
  • Any plan to support qwen3 reranker model in TEI
  • Feature Request: Add Support for Qwen3-Reranker Model

4 participants