[LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support. by huangzhengxiang · Pull Request #4478 · alibaba/MNN

huangzhengxiang · 2026-05-28T02:43:27Z

Description

This patch adds end-to-end support for Qwen3-ASR-0.6B in MNN across both the Python export pipeline and the C++ runtime. On the export side, it introduces qwen3_asr model registration, mapping, and audio wrapper support so the model can be exported correctly as llm.mnn + audio.mnn. On the runtime side, it adds a dedicated qwen3_asr audio encoder path and fixes the input-shape mismatch between the exported audio encoder and the generic whisper runtime path. It also adds protection for empty input_ids so invalid inputs fail cleanly instead of crashing.

The patch has been validated end to end: build-linux successfully builds llm_demo and MNNConvert, the exported Qwen3-ASR package runs on real audio, and outputs match the Python baseline across multiple samples.

Besides, this PR fixes a set of runtime/export mismatches that caused large divergence between the C++ embedding pipeline and the Python/HuggingFace reference for Qwen3-VL-Embedding, especially on multimodal inputs.

The main issues were not in one place. They spanned:

chat template rendering in the C++ tokenizer/Jinja path
tokenizer post-processing and segmented multimodal encode behavior
vision preprocessing interpolation
stale/incorrect assumptions in the embedding runtime path for
visual models
repeated allocations in the Qwen3-VL vision preprocess path

After these fixes:

text-only embedding aligns with the Python reference
multimodal token structure aligns with the Python processor
behavior
Qwen3-VL vision-side inputs are structurally aligned
end-to-end image embedding is much closer to Python, with the
remaining difference reduced to small numeric drift rather than
structural mismatch

Root Causes Fixed

C++ chat template rendering did not support join

The exported template uses messages | map(attribute='content') |
join('').
The C++ Jinja subset supported map(...) but not join.
As a result, arrays were rendered via JSON dump and user content
became ["hello world"] instead of hello world.

Tokenizer post-processor behavior was not preserved end-to-end

The missing trailing special token came from tokenizer
post_processor behavior in tokenizer.json.
Export/runtime only partially preserved tokenizer behavior, so the
C++ path could miss the final special token.

Multimodal prompt assembly applied tokenizer post-processing at
the wrong granularity

Omni::tokenizer_encode(const MultimodalPrompt&) split text around
... and called Tokenizer::encode() on each segment.
Once tokenizer post-processing was restored, each segment
incorrectly received sequence-level post-processing, inserting a
trailing special token before image blocks.
The correct behavior is to raw-encode segments, concatenate
multimodal content, then apply post-processing once to the full
sequence.

Qwen3-VL image preprocessing requested cubic interpolation but the
CV image-process path had no cubic sampler

FilterType_BICUBIC previously fell through to nearest-neighbor
behavior in the CV image processing path.
This caused large mismatch in patch tensors versus Python.

Qwen3-VL embedding runtime needed a visual-aware embedding path

Embedding::createEmbedding(...) always returned plain Embedding,
which is insufficient for visual embedding models.
The visual embedding path needs Omni behavior even when used in
embedding mode.

What Changed

Tokenizer / template alignment

Added join support to the C++ Jinja implementation.
Generalized tokenizer export/runtime handling of single-sequence
TemplateProcessing.
Added a post-processing-aware tokenizer API so callers can choose
whether to apply post-processing.
Changed multimodal prompt assembly to:
- encode text segments without post-processing
- assemble multimodal ids
- apply tokenizer post-processing once at the end

Files:

transformers/llm/engine/src/tokenizer/jinja.hpp
transformers/llm/engine/src/tokenizer/tokenizer.hpp
transformers/llm/engine/src/tokenizer/tokenizer.cpp
transformers/llm/export/utils/tokenizer.py

Embedding runtime fixes

Embedding::createEmbedding(...) now instantiates Omni for visual
embedding models.
Embedding::load() now sets external weight file explicitly before
module load.
Omni now derives from Embedding, not directly from Llm, so visual
embedding models can reuse embedding APIs while still taking the
multimodal runtime path.
Added Omni::ids_embedding(...) and embedding-mode forwarding
support.

Files:

transformers/llm/engine/src/embedding.cpp
transformers/llm/engine/src/omni.cpp
transformers/llm/engine/src/omni.hpp
transformers/llm/engine/include/llm/llm.hpp

Qwen3-VL preprocessing alignment

Qwen3-VL image preprocessing now uses actual image dimensions from
input tensor metadata.
For Qwen3-VL, the vision resize path now uses cubic interpolation
instead of linear.
Added reusable tensor caches for Qwen vision preprocess
intermediates:
- position_ids
- attention masks
- window index
- idx_tensor
- weight_tensor

Files:

transformers/llm/engine/src/omni.cpp
transformers/llm/engine/src/omni.hpp

CV bicubic implementation

Implemented cubic samplers for C1/C3/C4 image formats in the CPU
image-process path.
Routed FilterType_BICUBIC to real cubic samplers instead of falling
back to nearest.

Files:

source/backend/cpu/compute/ImageProcessFunction.hpp
source/backend/cpu/compute/ImageProcessFunction.cpp
source/cv/ImageProcessUtils.cpp

Developer / workflow updates

Kept the tokenizer demo enhancement for explicit encode-mode
debugging.
Updated skills with the main lesson from this debugging session:
for multimodal alignment bugs, compare real C++ runtime inputs/
outputs first instead of inferring from export-side behavior.

Files:

transformers/llm/engine/demo/tokenizer_demo.cpp
skills/support-new-llm/SKILL.md
skills/test-ci/SKILL.md

Why This Fix Is Correct

The fixes were validated against the real Python reference path, not
only export-side assumptions.

Text path:

C++ tokenizer/chat-template output was compared directly against
the HuggingFace reference.
The previously missing final special token and malformed user-
content rendering were both reproduced and then fixed.

Multimodal token path:

The previous extra token before the image block was traced to per-
segment post-processing.
After moving post-processing to the final assembled sequence, C++
multimodal token length aligned with Python.

Vision preprocessing:

position_ids, idx_tensor, and weight_tensor align with Python.
Patch tensors moved from large mismatch to close numeric agreement
after real cubic sampling was added.

End-to-end effect:

The remaining difference is no longer a structural/tokenization/
preprocessing bug; it is reduced to small numerical deviation.

Performance Impact

This PR also removes unnecessary overhead in the Qwen3-VL vision
preprocess path:

avoids repeated allocation of several fixed-shape preprocess
tensors
keeps the vision preprocess path cheaper for repeated requests

It also clarifies the main performance bottleneck observed during
profiling:

the dominant image-path cost is in visual.mnn forward, not in CV
preprocessing

Testing

Verified locally:

rebuilt embedding_demo
confirmed text embedding alignment after tokenizer/template fixes
confirmed multimodal token structure alignment after segmented-
encode fix
confirmed Qwen3-VL image preprocessing structural alignment
improvements
confirmed end-to-end image embedding moved close to Python
reference
rebuilt after cleanup to ensure no temporary profiling changes
remained

Build check:

make -j8 embedding_demo

Notes

This PR intentionally keeps the substantive runtime/export fixes and
removes temporary profiling prints used during debugging.

Module

LLM

Type

Feature

Checklist

Code compiles without errors
Tested on relevant platform(s)

…eprocessing

wangzhaode

Review Summary

整体代码质量不错，特别是 tokenizer post_processor 对齐、Jinja join filter 支持、bicubic 插值和 vision 缓存优化都很好。

核心问题：ASR chat template 硬编码

最大的问题是 llm.cpp 中为 qwen3_asr 硬编码了 chat template 构建逻辑（build_qwen3_asr_prompt、try_build_qwen3_asr_chat_prompt 等约 70 行代码）。这违反了 MNN LLM 引擎的设计原则——chat template 应该通过 Jinja 模板在导出时配置到 config.json，而不是在 C++ runtime 中为特定模型写特殊分支。

建议：将 ASR 的 prompt 模板写到 config.json 的 jinja.chat_template 中，如果模型本身没有合适的模板，在 Python 导出脚本中构造一个。这样不需要修改 C++ 代码就能支持未来类似的 ASR 模型。

其他需要关注的点

Omni 继承链变更 (Omni : Embedding : Llm)：架构上合理，但需要确认 JNI/iOS bridge 等外部调用方不受影响
Debug 信息过多：forwardRaw 中新增了大量 dims 打印，建议精简或用 MNN_DEBUG 宏控制
shapeMutable 改为配置驱动：is_weight_eager_release() 这个逻辑需要确保默认行为不变

可以直接合入的部分

bicubic 插值 (ImageProcessFunction.cpp)
Jinja join filter (jinja.hpp)
tokenizer post_processor 导出和加载 (tokenizer.cpp/py)
vision 缓存复用优化
null 检查和错误日志增强

请先解决 chat template 硬编码问题后再合入。

wangzhaode · 2026-06-04T09:42:10Z

    std::vector<Express::VARP> outputs = selectModule->onForward(inputs);

    if (outputs.empty()) {
+        MNN_ERROR("[Error]: onForward returned no outputs. seqLen=%d, inDecode=%d, inputs=%zu, moduleKey=(%d,%d)\n",


优化一下Debug信息，可以减少一些

wangzhaode · 2026-06-04T09:42:28Z

+    return contains_audio_tag(text);
+}
+
+static inline std::string build_qwen3_asr_prompt(const std::string& audio_prompt,


build_prompt不要加到这里；还是直接用jinja来实现吧

wangzhaode · 2026-06-04T09:43:28Z

    MNN::Express::ExecutorScope s(mExecutor);
    auto prompt = user_content;
-    if (mConfig->use_template()) {
+    if (use_qwen3_asr_audio_template(mConfig, user_content)) {


不要针对某个模型单独写template 还是写到config.json里，如果模型本身没有，就在导出时Python脚本里构造一下

wangzhaode · 2026-06-04T09:43:46Z

    }
-    auto prompt = apply_chat_template(chat_prompts);
+    std::string prompt;
+    bool use_asr_prompt = try_build_qwen3_asr_chat_prompt(mConfig, chat_prompts, true, &prompt);


同上不要单独写chat_template

wangzhaode · 2026-06-04T09:43:56Z

        // asymmetry: the template adds <think> to the LAST assistant message only
        // when true. Using false renders all messages consistently.
-        auto prompt_for_compare = mTokenizer->apply_chat_template(chat_prompts, false);
+        std::string prompt_for_compare;


wangzhaode · 2026-06-12T03:52:39Z

    }

+    bool asr_use_audio_template() const {
+        return config_.value("asr_use_audio_template", false);


这个是否有必要单独加一个参数

wangzhaode · 2026-06-12T03:54:24Z

-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True, use_fast=False)
-        except:
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True, use_fast=True)
+        prefer_fast = model_type in ('qwen3_vl', 'qwen3_vl_moe')


这里 prefer_fast 直接按照模型类型选择吗

wangzhaode · 2026-06-12T03:55:36Z

+
+---
+
+## 13. 外部包模型的注册与复合配置嵌套


这里增加的内容可以简化一下其实2~3行就能说明

wangzhaode · 2026-06-12T03:56:05Z

+
+---
+
+## 15. Audio encoder 导出接口与 C++ runtime 输入约定不一致


这个问题也不用太大篇幅最多用5~6行就可以了

huangzhengxiang · 2026-06-24T05:54:04Z

resync the alibaba/MNN master branch.

huangzhengxiang and others added 2 commits May 28, 2026 10:37

[LLM:Feature] Add Qwen3-ASR-0.6B export and runtime support

b2984a3

Merge branch 'alibaba:master' into master

2f8c3c1

wangzhaode self-assigned this May 28, 2026

[LLM:Feature] Support and align Qwen3-VL embedding runtime with HF pr…

72238a5

…eprocessing

huangzhengxiang changed the title ~~[LLM:Feature] Add Qwen3-ASR-0.6B export and runtime support~~ [LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support. May 29, 2026

huangzhengxiang and others added 5 commits May 29, 2026 19:28

Merge remote-tracking branch 'origin/master'

63329bb

[LLM:Feature] update audio chat template

ab2e01c

merge main

d9d2621

Merge branch 'alibaba:master' into master

9ec9cd4

merge master

424988d

huangzhengxiang force-pushed the master branch from 1ad5f42 to 424988d Compare June 5, 2026 06:50

huangzhengxiang added 3 commits June 5, 2026 15:48

[LLM:Bugfix] fix vl embedding merge bugs

859287c

merge remote master

014a353

update skill

a97a8aa

wangzhaode requested changes Jun 11, 2026

View reviewed changes

wangzhaode reviewed Jun 12, 2026

View reviewed changes

huangzhengxiang added 2 commits June 12, 2026 15:01

[Refactor] move asr chat template to config.json jinja

c38f40c

merge remote master

8d6dcc5

huangzhengxiang requested a review from wangzhaode June 12, 2026 07:10

huangzhengxiang and others added 3 commits June 15, 2026 21:47

Merge branch 'alibaba:master' into master

6de5073

[Bugfix] fix set_config neglection of mllm config

e5b8ccb

sync remote master

d1070e1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support.#4478

[LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support.#4478
huangzhengxiang wants to merge 16 commits into
alibaba:masterfrom
Embedded-AI-Systems:master

huangzhengxiang commented May 28, 2026 •

edited

Loading

Uh oh!

wangzhaode left a comment

Uh oh!

wangzhaode Jun 4, 2026

Uh oh!

wangzhaode Jun 4, 2026

Uh oh!

wangzhaode Jun 4, 2026

Uh oh!

wangzhaode Jun 4, 2026

Uh oh!

wangzhaode Jun 4, 2026

Uh oh!

wangzhaode Jun 12, 2026

Uh oh!

wangzhaode Jun 12, 2026

Uh oh!

wangzhaode Jun 12, 2026

Uh oh!

wangzhaode Jun 12, 2026

Uh oh!

huangzhengxiang commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		---

		## 15. Audio encoder 导出接口与 C++ runtime 输入约定不一致

Conversation

huangzhengxiang commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Module

Type

Checklist

Uh oh!

wangzhaode left a comment

Choose a reason for hiding this comment

Review Summary

核心问题：ASR chat template 硬编码

其他需要关注的点

可以直接合入的部分

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

huangzhengxiang commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

huangzhengxiang commented May 28, 2026 •

edited

Loading