Performance bottleneck when profiling Qwen2.5 on QNN #4489

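Compared with the original conversion logic, I made the following changes, mainly so that everything runs in a single QNN graph:
1. The step before lm_head that picks out the features at logits_index now uses a gather. Previously the model was split into two graphs at this point; now there is only one.
2. Adopted the max_history_token approach.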
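To make change 1 concrete, here is a minimal sketch in plain Python of gathering the rows at logits_index before the lm_head projection. It is only an illustration, not the MNN converter code: the helper names (`matmul`, `gather_rows`, `lm_head_on_gathered`), the tiny shapes, and the assumption that logits_index is a list of token positions are all hypothetical.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gather_rows(hidden, indices):
    """Select rows of `hidden` by position, like a gather along axis 0."""
    return [hidden[i] for i in indices]


def lm_head_on_gathered(hidden, lm_head_weight, logits_index):
    """Gather the requested token rows first, then project only those rows."""
    selected = gather_rows(hidden, logits_index)
    return matmul(selected, lm_head_weight)


# Toy sizes (assumed for illustration): seq_len=3, hidden=2, vocab=3.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lm_head_weight = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]
logits_index = [2]  # assumed: position of the last valid token

gathered_logits = lm_head_on_gathered(hidden, lm_head_weight, logits_index)
full_logits = matmul(hidden, lm_head_weight)  # projecting every row, for comparison
```

Gathering first gives the same logits as projecting every row and indexing afterwards, while the projection itself only touches the selected rows.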

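Overall performance is still fairly poor, quite a bit slower than the pure OpenCL solution.
I profiled it, and most of the time goes to the QKV matmul and softmax computed over the full max_token length. The matmul and softmax over only the valid length are not expensive.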
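A rough cost model shows why padding to the maximum length dominates. This is a back-of-the-envelope sketch, not a measurement: the function name and every number (head count, head dimension, a maximum of 4096 tokens, 128 valid tokens) are assumptions picked for illustration and are not taken from the actual Qwen2.5 configuration or from my profile.

```python
def attention_cost(query_len, kv_len, num_heads, head_dim):
    """Return (matmul_flops, softmax_elements) for one attention layer.

    Counts the Q @ K^T and P @ V matmuls at 2 FLOPs per multiply-add, and
    the number of score elements that softmax has to process.
    """
    qk_flops = 2 * query_len * kv_len * head_dim * num_heads
    pv_flops = 2 * query_len * kv_len * head_dim * num_heads
    softmax_elements = query_len * kv_len * num_heads
    return qk_flops + pv_flops, softmax_elements


# Hypothetical decode step: one query token, illustrative model sizes.
NUM_HEADS, HEAD_DIM = 16, 128
MAX_HISTORY_TOKEN = 4096  # assumed padded KV length
VALID_LENGTH = 128        # assumed number of tokens actually in the history

padded_flops, padded_softmax = attention_cost(1, MAX_HISTORY_TOKEN, NUM_HEADS, HEAD_DIM)
valid_flops, valid_softmax = attention_cost(1, VALID_LENGTH, NUM_HEADS, HEAD_DIM)

flops_ratio = padded_flops / valid_flops
softmax_ratio = padded_softmax / valid_softmax
print(f"matmul work ratio: {flops_ratio:.0f}x, softmax work ratio: {softmax_ratio:.0f}x")
```

Under these assumptions both the matmul and softmax work scale linearly with the padded length, so the overhead relative to the valid length is simply the ratio of the two lengths.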
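Is there any plan to support custom Hexagon operators for this part?
My guess is that without max_history_token, falling back directly to the CPU attention op would perform about the same as writing a custom CPU op. Still, wouldn't a custom NPU operator be faster overall?
Does MNN plan to write custom Hexagon ops?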
