[LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support.#4478
[LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support.#4478huangzhengxiang wants to merge 16 commits into
Conversation
wangzhaode
left a comment
There was a problem hiding this comment.
Review Summary
整体代码质量不错,特别是 tokenizer post_processor 对齐、Jinja join filter 支持、bicubic 插值和 vision 缓存优化都很好。
核心问题:ASR chat template 硬编码
最大的问题是 llm.cpp 中为 qwen3_asr 硬编码了 chat template 构建逻辑(build_qwen3_asr_prompt、try_build_qwen3_asr_chat_prompt 等约 70 行代码)。这违反了 MNN LLM 引擎的设计原则——chat template 应该通过 Jinja 模板在导出时配置到 config.json,而不是在 C++ runtime 中为特定模型写特殊分支。
建议:将 ASR 的 prompt 模板写到 config.json 的 jinja.chat_template 中,如果模型本身没有合适的模板,在 Python 导出脚本中构造一个。这样不需要修改 C++ 代码就能支持未来类似的 ASR 模型。
其他需要关注的点
- Omni 继承链变更 (Omni : Embedding : Llm):架构上合理,但需要确认 JNI/iOS bridge 等外部调用方不受影响
- Debug 信息过多:forwardRaw 中新增了大量 dims 打印,建议精简或用 MNN_DEBUG 宏控制
- shapeMutable 改为配置驱动:is_weight_eager_release() 这个逻辑需要确保默认行为不变
可以直接合入的部分
- bicubic 插值 (ImageProcessFunction.cpp)
- Jinja join filter (jinja.hpp)
- tokenizer post_processor 导出和加载 (tokenizer.cpp/py)
- vision 缓存复用优化
- null 检查和错误日志增强
请先解决 chat template 硬编码问题后再合入。
| std::vector<Express::VARP> outputs = selectModule->onForward(inputs); | ||
|
|
||
| if (outputs.empty()) { | ||
| MNN_ERROR("[Error]: onForward returned no outputs. seqLen=%d, inDecode=%d, inputs=%zu, moduleKey=(%d,%d)\n", |
| return contains_audio_tag(text); | ||
| } | ||
|
|
||
| static inline std::string build_qwen3_asr_prompt(const std::string& audio_prompt, |
There was a problem hiding this comment.
build_prompt不要加到这里;还是直接用jinja来实现吧
| MNN::Express::ExecutorScope s(mExecutor); | ||
| auto prompt = user_content; | ||
| if (mConfig->use_template()) { | ||
| if (use_qwen3_asr_audio_template(mConfig, user_content)) { |
There was a problem hiding this comment.
不要针对某个模型单独写template 还是写到config.json里,如果模型本身没有,就在导出时Python脚本里构造一下
| } | ||
| auto prompt = apply_chat_template(chat_prompts); | ||
| std::string prompt; | ||
| bool use_asr_prompt = try_build_qwen3_asr_chat_prompt(mConfig, chat_prompts, true, &prompt); |
| // asymmetry: the template adds <think> to the LAST assistant message only | ||
| // when true. Using false renders all messages consistently. | ||
| auto prompt_for_compare = mTokenizer->apply_chat_template(chat_prompts, false); | ||
| std::string prompt_for_compare; |
| } | ||
|
|
||
| bool asr_use_audio_template() const { | ||
| return config_.value("asr_use_audio_template", false); |
| self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True, use_fast=False) | ||
| except: | ||
| self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True, use_fast=True) | ||
| prefer_fast = model_type in ('qwen3_vl', 'qwen3_vl_moe') |
There was a problem hiding this comment.
这里 prefer_fast 直接按照模型类型选择吗
|
|
||
| --- | ||
|
|
||
| ## 13. 外部包模型的注册与复合配置嵌套 |
There was a problem hiding this comment.
这里增加的内容可以简化一下 其实2~3行就能说明
|
|
||
| --- | ||
|
|
||
| ## 15. Audio encoder 导出接口与 C++ runtime 输入约定不一致 |
There was a problem hiding this comment.
这个问题也不用太大篇幅 最多用5~6行就可以了
|
resync the alibaba/MNN master branch. |
Description
This patch adds end-to-end support for Qwen3-ASR-0.6B in MNN across both the Python export pipeline and the C++ runtime. On the export side, it introduces qwen3_asr model registration, mapping, and audio wrapper support so the model can be exported correctly as llm.mnn + audio.mnn. On the runtime side, it adds a dedicated qwen3_asr audio encoder path and fixes the input-shape mismatch between the exported audio encoder and the generic whisper runtime path. It also adds protection for empty input_ids so invalid inputs fail cleanly instead of crashing.
The patch has been validated end to end: build-linux successfully builds llm_demo and MNNConvert, the exported Qwen3-ASR package runs on real audio, and outputs match the Python baseline across multiple samples.
Besides, this PR fixes a set of runtime/export mismatches that caused large divergence between the C++ embedding pipeline and the Python/HuggingFace reference for Qwen3-VL-Embedding, especially on multimodal inputs.
The main issues were not in one place. They spanned:
visual models
After these fixes:
behavior
remaining difference reduced to small numeric drift rather than
structural mismatch
Root Causes Fixed
join('').
became ["hello world"] instead of hello world.
post_processor behavior in tokenizer.json.
C++ path could miss the final special token.
the wrong granularity
incorrectly received sequence-level post-processing, inserting a
trailing special token before image blocks.
multimodal content, then apply post-processing once to the full
sequence.
CV image-process path had no cubic sampler
behavior in the CV image processing path.
which is insufficient for visual embedding models.
embedding mode.
What Changed
Tokenizer / template alignment
TemplateProcessing.
whether to apply post-processing.
Files:
Embedding runtime fixes
embedding models.
module load.
embedding models can reuse embedding APIs while still taking the
multimodal runtime path.
support.
Files:
Qwen3-VL preprocessing alignment
input tensor metadata.
instead of linear.
intermediates:
Files:
CV bicubic implementation
image-process path.
back to nearest.
Files:
Developer / workflow updates
debugging.
for multimodal alignment bugs, compare real C++ runtime inputs/
outputs first instead of inferring from export-side behavior.
Files:
Why This Fix Is Correct
The fixes were validated against the real Python reference path, not
only export-side assumptions.
Text path:
the HuggingFace reference.
content rendering were both reproduced and then fixed.
Multimodal token path:
segment post-processing.
multimodal token length aligned with Python.
Vision preprocessing:
after real cubic sampling was added.
End-to-end effect:
preprocessing bug; it is reduced to small numerical deviation.
Performance Impact
This PR also removes unnecessary overhead in the Qwen3-VL vision
preprocess path:
tensors
It also clarifies the main performance bottleneck observed during
profiling:
preprocessing
Testing
Verified locally:
encode fix
improvements
reference
remained
Build check:
Notes
This PR intentionally keeps the substantive runtime/export fixes and
removes temporary profiling prints used during debugging.
Module
Type
Checklist