webgpu: bypass manual mRoPE for text-only Qwen3.5 when GQA fuses RoPE#2245
Draft
qjia7 wants to merge 1 commit into
Text-only mRoPE collapses to standard 1D RoPE because Qwen3_5TextRotaryEmbedding expands a 2D position_ids to 3 identical axes and apply_interleaved_mrope returns freqs[0] unchanged. When GQA can perform fused RoPE we therefore bypass the manual mRoPE subgraph entirely, which removes the Shape -> Memcpy path that blocks WebGPU graph capture.
self.attention_attrs["q_norm"] = True
self.attention_attrs["k_norm"] = True
super().make_attention_init(config)
super().make_attention_init()

super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)

def make_attention_init(self, config):
def make_attention_init(self):
Summary
For text-only Qwen3.5, multimodal RoPE (mRoPE) collapses to standard 1D RoPE:
`Qwen3_5TextRotaryEmbedding` expands a 2D `position_ids` into 3 identical axes, and
`apply_interleaved_mrope` returns `freqs[0]` unchanged. The manual mRoPE subgraph
(Shape → Expand → interleaved cos/sin caches → custom kernel) is
therefore equivalent to a plain fused-RoPE pass inside GQA.
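The collapse can be checked with a small pure-Python sketch. It is not the model's code: `rope_angles` and `interleaved_mrope` are hypothetical helpers, the interleaving (every third frequency slot taken from the second or third axis) is a simplified assumption about how `apply_interleaved_mrope` behaves, and `head_dim` and `mrope_section` are made-up values.

```python
def rope_angles(positions, head_dim, base=10000.0):
    """Standard 1D RoPE angles: angle[p][i] = position_p * base**(-2i / head_dim)."""
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    return [[p * f for f in inv_freq] for p in positions]


def interleaved_mrope(freqs, mrope_section):
    """Hypothetical interleaving: start from axis 0, then overwrite every
    third slot (offsets 1 and 2) with the values from axes 1 and 2."""
    merged = [row[:] for row in freqs[0]]
    for axis, offset in ((1, 1), (2, 2)):
        length = mrope_section[axis] * 3
        for row_idx, row in enumerate(merged):
            for i in range(offset, min(length, len(row)), 3):
                row[i] = freqs[axis][row_idx][i]
    return merged


positions = [0, 1, 2, 3, 4]
head_dim = 16              # assumed value; gives 8 frequency slots
mrope_section = [4, 2, 2]  # assumed split of the 8 slots across the 3 axes

# Text-only case: the same positions are copied onto all three axes.
text_only = [rope_angles(positions, head_dim) for _ in range(3)]
collapses = interleaved_mrope(text_only, mrope_section) == rope_angles(positions, head_dim)

# Multimodal case: one axis carries different positions, so the result changes.
mixed = [
    rope_angles(positions, head_dim),
    rope_angles([0, 0, 0, 0, 0], head_dim),
    rope_angles(positions, head_dim),
]
differs = interleaved_mrope(mixed, mrope_section) != rope_angles(positions, head_dim)

print(collapses, differs)
```

With identical axes, every overwritten slot receives the value it already held, so the merged angles match plain 1D RoPE exactly. That equality is what lets the fused path stand in for the manual subgraph.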
When the GQA operator supports fused RoPE (`use_rope_in_attn=True`, e.g. on
WebGPU), this PR detects the text-only case and routes through the fused path,
bypassing the manual mRoPE subgraph entirely. This removes the
Shape → Memcpy node that reads a dynamic tensor shape at runtime, which is the path that
prevents WebGPU graph capture on Qwen3.5 text-only models.
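The routing decision can be sketched as follows. This is an illustration under assumptions, not the builder's code: `is_text_only`, `use_rope_in_attn`, and `use_text_only_fused_rope` are names from this PR, while `RopePlan`, `plan_rope`, `build_mrope_subgraph`, and `position_ids_input` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RopePlan:
    """Hypothetical summary of which RoPE path the builder would take."""
    use_text_only_fused_rope: bool
    build_mrope_subgraph: bool
    position_ids_input: Optional[str]


def plan_rope(is_text_only: bool, use_rope_in_attn: bool) -> RopePlan:
    # The fused path applies only when both conditions hold.
    fused = is_text_only and use_rope_in_attn
    return RopePlan(
        use_text_only_fused_rope=fused,
        # The manual mRoPE subgraph is needed whenever the fused path is not taken.
        build_mrope_subgraph=not fused,
        # With fused RoPE, no position_ids tensor is fed into the data flow.
        position_ids_input=None if fused else "position_ids",
    )


for text_only in (True, False):
    for rope_in_attn in (True, False):
        print(text_only, rope_in_attn, plan_rope(text_only, rope_in_attn))
```

Only the text-only case with fused RoPE skips the manual subgraph; every other combination keeps the existing behaviour.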
Changes (`src/python/py/models/builders/qwen.py` only):
- `use_text_only_fused_rope` flag: true when `is_text_only` and `use_rope_in_attn`.
- `make_rotary_embedding_caches()` (standard 2D cos/sin for GQA), skip mRoPE config, leave `use_rope_in_attn=True`.
- `make_position_ids_reformatting`: early-return `None` when fused RoPE is active (no `position_ids` tensor on the data flow).

Test plan
qwen3-0.6b, qwen2.5-0.5b-instruct