Skip to content

[LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support.#4478

Open
huangzhengxiang wants to merge 16 commits into
alibaba:masterfrom
Embedded-AI-Systems:master
Open

[LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support.#4478
huangzhengxiang wants to merge 16 commits into
alibaba:masterfrom
Embedded-AI-Systems:master

Conversation

@huangzhengxiang

@huangzhengxiang huangzhengxiang commented May 28, 2026

Copy link
Copy Markdown
Contributor

Description

This patch adds end-to-end support for Qwen3-ASR-0.6B in MNN across both the Python export pipeline and the C++ runtime. On the export side, it introduces qwen3_asr model registration, mapping, and audio wrapper support so the model can be exported correctly as llm.mnn + audio.mnn. On the runtime side, it adds a dedicated qwen3_asr audio encoder path and fixes the input-shape mismatch between the exported audio encoder and the generic whisper runtime path. It also adds protection for empty input_ids so invalid inputs fail cleanly instead of crashing.

The patch has been validated end to end: build-linux successfully builds llm_demo and MNNConvert, the exported Qwen3-ASR package runs on real audio, and outputs match the Python baseline across multiple samples.

Besides, this PR fixes a set of runtime/export mismatches that caused large divergence between the C++ embedding pipeline and the Python/HuggingFace reference for Qwen3-VL-Embedding, especially on multimodal inputs.

The main issues were not in one place. They spanned:

  • chat template rendering in the C++ tokenizer/Jinja path
  • tokenizer post-processing and segmented multimodal encode behavior
  • vision preprocessing interpolation
  • stale/incorrect assumptions in the embedding runtime path for
    visual models
  • repeated allocations in the Qwen3-VL vision preprocess path

After these fixes:

  • text-only embedding aligns with the Python reference
  • multimodal token structure aligns with the Python processor
    behavior
  • Qwen3-VL vision-side inputs are structurally aligned
  • end-to-end image embedding is much closer to Python, with the
    remaining difference reduced to small numeric drift rather than
    structural mismatch

Root Causes Fixed

  1. C++ chat template rendering did not support join
  • The exported template uses messages | map(attribute='content') |
    join('').
  • The C++ Jinja subset supported map(...) but not join.
  • As a result, arrays were rendered via JSON dump and user content
    became ["hello world"] instead of hello world.
  1. Tokenizer post-processor behavior was not preserved end-to-end
  • The missing trailing special token came from tokenizer
    post_processor behavior in tokenizer.json.
  • Export/runtime only partially preserved tokenizer behavior, so the
    C++ path could miss the final special token.
  1. Multimodal prompt assembly applied tokenizer post-processing at
    the wrong granularity
  • Omni::tokenizer_encode(const MultimodalPrompt&) split text around
    ... and called Tokenizer::encode() on each segment.
  • Once tokenizer post-processing was restored, each segment
    incorrectly received sequence-level post-processing, inserting a
    trailing special token before image blocks.
  • The correct behavior is to raw-encode segments, concatenate
    multimodal content, then apply post-processing once to the full
    sequence.
  1. Qwen3-VL image preprocessing requested cubic interpolation but the
    CV image-process path had no cubic sampler
  • FilterType_BICUBIC previously fell through to nearest-neighbor
    behavior in the CV image processing path.
  • This caused large mismatch in patch tensors versus Python.
  1. Qwen3-VL embedding runtime needed a visual-aware embedding path
  • Embedding::createEmbedding(...) always returned plain Embedding,
    which is insufficient for visual embedding models.
  • The visual embedding path needs Omni behavior even when used in
    embedding mode.

What Changed

Tokenizer / template alignment

  • Added join support to the C++ Jinja implementation.
  • Generalized tokenizer export/runtime handling of single-sequence
    TemplateProcessing.
  • Added a post-processing-aware tokenizer API so callers can choose
    whether to apply post-processing.
  • Changed multimodal prompt assembly to:
    • encode text segments without post-processing
    • assemble multimodal ids
    • apply tokenizer post-processing once at the end

Files:

  • transformers/llm/engine/src/tokenizer/jinja.hpp
  • transformers/llm/engine/src/tokenizer/tokenizer.hpp
  • transformers/llm/engine/src/tokenizer/tokenizer.cpp
  • transformers/llm/export/utils/tokenizer.py

Embedding runtime fixes

  • Embedding::createEmbedding(...) now instantiates Omni for visual
    embedding models.
  • Embedding::load() now sets external weight file explicitly before
    module load.
  • Omni now derives from Embedding, not directly from Llm, so visual
    embedding models can reuse embedding APIs while still taking the
    multimodal runtime path.
  • Added Omni::ids_embedding(...) and embedding-mode forwarding
    support.

Files:

  • transformers/llm/engine/src/embedding.cpp
  • transformers/llm/engine/src/omni.cpp
  • transformers/llm/engine/src/omni.hpp
  • transformers/llm/engine/include/llm/llm.hpp

Qwen3-VL preprocessing alignment

  • Qwen3-VL image preprocessing now uses actual image dimensions from
    input tensor metadata.
  • For Qwen3-VL, the vision resize path now uses cubic interpolation
    instead of linear.
  • Added reusable tensor caches for Qwen vision preprocess
    intermediates:
    • position_ids
    • attention masks
    • window index
    • idx_tensor
    • weight_tensor

Files:

  • transformers/llm/engine/src/omni.cpp
  • transformers/llm/engine/src/omni.hpp

CV bicubic implementation

  • Implemented cubic samplers for C1/C3/C4 image formats in the CPU
    image-process path.
  • Routed FilterType_BICUBIC to real cubic samplers instead of falling
    back to nearest.

Files:

  • source/backend/cpu/compute/ImageProcessFunction.hpp
  • source/backend/cpu/compute/ImageProcessFunction.cpp
  • source/cv/ImageProcessUtils.cpp

Developer / workflow updates

  • Kept the tokenizer demo enhancement for explicit encode-mode
    debugging.
  • Updated skills with the main lesson from this debugging session:
    for multimodal alignment bugs, compare real C++ runtime inputs/
    outputs first instead of inferring from export-side behavior.

Files:

  • transformers/llm/engine/demo/tokenizer_demo.cpp
  • skills/support-new-llm/SKILL.md
  • skills/test-ci/SKILL.md

Why This Fix Is Correct

The fixes were validated against the real Python reference path, not
only export-side assumptions.

Text path:

  • C++ tokenizer/chat-template output was compared directly against
    the HuggingFace reference.
  • The previously missing final special token and malformed user-
    content rendering were both reproduced and then fixed.

Multimodal token path:

  • The previous extra token before the image block was traced to per-
    segment post-processing.
  • After moving post-processing to the final assembled sequence, C++
    multimodal token length aligned with Python.

Vision preprocessing:

  • position_ids, idx_tensor, and weight_tensor align with Python.
  • Patch tensors moved from large mismatch to close numeric agreement
    after real cubic sampling was added.

End-to-end effect:

  • The remaining difference is no longer a structural/tokenization/
    preprocessing bug; it is reduced to small numerical deviation.

Performance Impact

This PR also removes unnecessary overhead in the Qwen3-VL vision
preprocess path:

  • avoids repeated allocation of several fixed-shape preprocess
    tensors
  • keeps the vision preprocess path cheaper for repeated requests

It also clarifies the main performance bottleneck observed during
profiling:

  • the dominant image-path cost is in visual.mnn forward, not in CV
    preprocessing

Testing

Verified locally:

  • rebuilt embedding_demo
  • confirmed text embedding alignment after tokenizer/template fixes
  • confirmed multimodal token structure alignment after segmented-
    encode fix
  • confirmed Qwen3-VL image preprocessing structural alignment
    improvements
  • confirmed end-to-end image embedding moved close to Python
    reference
  • rebuilt after cleanup to ensure no temporary profiling changes
    remained

Build check:

  • make -j8 embedding_demo

Notes

This PR intentionally keeps the substantive runtime/export fixes and
removes temporary profiling prints used during debugging.

Module

  • LLM

Type

  • Feature

Checklist

  • Code compiles without errors
  • Tested on relevant platform(s)

@wangzhaode wangzhaode self-assigned this May 28, 2026
@huangzhengxiang huangzhengxiang changed the title [LLM:Feature] Add Qwen3-ASR-0.6B export and runtime support [LLM:Feature] Add Qwen3-ASR and Qwen3-VL-Embedding export and runtime support. May 29, 2026

@wangzhaode wangzhaode left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

整体代码质量不错,特别是 tokenizer post_processor 对齐、Jinja join filter 支持、bicubic 插值和 vision 缓存优化都很好。

核心问题:ASR chat template 硬编码

最大的问题是 llm.cpp 中为 qwen3_asr 硬编码了 chat template 构建逻辑(build_qwen3_asr_prompt、try_build_qwen3_asr_chat_prompt 等约 70 行代码)。这违反了 MNN LLM 引擎的设计原则——chat template 应该通过 Jinja 模板在导出时配置到 config.json,而不是在 C++ runtime 中为特定模型写特殊分支。

建议:将 ASR 的 prompt 模板写到 config.json 的 jinja.chat_template 中,如果模型本身没有合适的模板,在 Python 导出脚本中构造一个。这样不需要修改 C++ 代码就能支持未来类似的 ASR 模型。

其他需要关注的点

  1. Omni 继承链变更 (Omni : Embedding : Llm):架构上合理,但需要确认 JNI/iOS bridge 等外部调用方不受影响
  2. Debug 信息过多:forwardRaw 中新增了大量 dims 打印,建议精简或用 MNN_DEBUG 宏控制
  3. shapeMutable 改为配置驱动:is_weight_eager_release() 这个逻辑需要确保默认行为不变

可以直接合入的部分

  • bicubic 插值 (ImageProcessFunction.cpp)
  • Jinja join filter (jinja.hpp)
  • tokenizer post_processor 导出和加载 (tokenizer.cpp/py)
  • vision 缓存复用优化
  • null 检查和错误日志增强

请先解决 chat template 硬编码问题后再合入。

std::vector<Express::VARP> outputs = selectModule->onForward(inputs);

if (outputs.empty()) {
MNN_ERROR("[Error]: onForward returned no outputs. seqLen=%d, inDecode=%d, inputs=%zu, moduleKey=(%d,%d)\n",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

优化一下Debug信息,可以减少一些

Comment thread transformers/llm/engine/src/llm.cpp Outdated
return contains_audio_tag(text);
}

static inline std::string build_qwen3_asr_prompt(const std::string& audio_prompt,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_prompt不要加到这里;还是直接用jinja来实现吧

Comment thread transformers/llm/engine/src/llm.cpp Outdated
MNN::Express::ExecutorScope s(mExecutor);
auto prompt = user_content;
if (mConfig->use_template()) {
if (use_qwen3_asr_audio_template(mConfig, user_content)) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不要针对某个模型单独写template 还是写到config.json里,如果模型本身没有,就在导出时Python脚本里构造一下

Comment thread transformers/llm/engine/src/llm.cpp Outdated
}
auto prompt = apply_chat_template(chat_prompts);
std::string prompt;
bool use_asr_prompt = try_build_qwen3_asr_chat_prompt(mConfig, chat_prompts, true, &prompt);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同上 不要单独写chat_template

Comment thread transformers/llm/engine/src/llm.cpp Outdated
// asymmetry: the template adds <think> to the LAST assistant message only
// when true. Using false renders all messages consistently.
auto prompt_for_compare = mTokenizer->apply_chat_template(chat_prompts, false);
std::string prompt_for_compare;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同上

}

bool asr_use_audio_template() const {
return config_.value("asr_use_audio_template", false);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个是否有必要单独加一个参数

self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True, use_fast=False)
except:
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True, use_fast=True)
prefer_fast = model_type in ('qwen3_vl', 'qwen3_vl_moe')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里 prefer_fast 直接按照模型类型选择吗


---

## 13. 外部包模型的注册与复合配置嵌套

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里增加的内容可以简化一下 其实2~3行就能说明


---

## 15. Audio encoder 导出接口与 C++ runtime 输入约定不一致

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个问题也不用太大篇幅 最多用5~6行就可以了

@huangzhengxiang

Copy link
Copy Markdown
Contributor Author

resync the alibaba/MNN master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants