optimize for mlu grouprmsnorm by yuer-cn · Pull Request #285 · XPU-Forces/mojo_opset

yuer-cn · 2026-05-08T10:27:38Z

No description provided.

gemini-code-assist

Code Review

This pull request optimizes the RMSNorm kernel by introducing autotuning for BLOCK_M, vectorizing operations with 2D blocks, and adding a tl.dot optimization path for MLU hardware. Feedback suggests constraining the autotuner to prevent hardware resource exhaustion when both block dimensions are large. Additionally, it is recommended to transpose the vector rather than the data matrix in the dot product for better efficiency and to correct a misleading error message in the new input size assertion.

gemini-code-assist · 2026-05-08T10:30:02Z

-@triton.jit
+
+def cfggen():
+    block_m = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]


The autotuning configuration includes BLOCK_M values up to 2048. When combined with BLOCK_N up to 8192, the resulting 2D block size (2048 * 8192 = 16M elements) is likely to exceed the hardware's LRAM capacity or Triton's internal limits for register allocation, which may lead to compilation failures or significant performance degradation due to spilling. Consider capping BLOCK_M or adding a constraint to the autotuner to keep the total block size within reasonable limits (e.g., BLOCK_M * BLOCK_N <= 65536).

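MLU needs a large enough amount of data per pass to use its on-chip resources effectively; tuning will select a configuration that is executable and gives the best performance.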
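
The cap suggested in the review could be sketched as a plain filter over the candidate grid before it reaches the autotuner. This is a minimal sketch, not the PR's code: the BLOCK_N candidates, the `MAX_BLOCK_ELEMS` name, and the `pruned_configs` helper are assumptions for illustration, and 65536 is only the example figure from the review comment. On MLU a larger budget may be appropriate, as the author's reply points out.

```python
from itertools import product

# BLOCK_M candidates copied from cfggen() in the diff.
BLOCK_M_CANDIDATES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
# Assumed BLOCK_N candidates; the review only mentions an upper bound of 8192.
BLOCK_N_CANDIDATES = [1024, 2048, 4096, 8192]
# Hypothetical element budget, taken from the reviewer's example.
MAX_BLOCK_ELEMS = 65536


def pruned_configs(max_elems=MAX_BLOCK_ELEMS):
    """Return the (BLOCK_M, BLOCK_N) pairs whose 2D block fits the budget."""
    return [
        {"BLOCK_M": m, "BLOCK_N": n}
        for m, n in product(BLOCK_M_CANDIDATES, BLOCK_N_CANDIDATES)
        if m * n <= max_elems
    ]


configs = pruned_configs()
```

Each dictionary could then be used as the keyword arguments of a `triton.Config`. Raising `max_elems` keeps the larger blocks in play while still excluding the extreme 2048 x 8192 combination.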

gemini-code-assist · 2026-05-08T10:30:02Z

+
+        if USE_DOT:
+            sq = x_f32 * x_f32
+            acc = tl.reshape(tl.dot(ones, tl.trans(sq), allow_tf32=False), (BLOCK_M,))


In Triton kernels, transposing a large data matrix in registers can be expensive. Since ones is a vector, it is more efficient to transpose ones instead of sq to achieve the same reduction sum. This avoids the overhead of transposing the (BLOCK_M, BLOCK_N) matrix while still leveraging the tl.dot optimization for MLU hardware.

acc = tl.reshape(tl.dot(sq, tl.trans(ones), allow_tf32=False), (BLOCK_M,))

References

On MLU hardware, use tl.dot() for matrix multiplication in Triton kernels, as triton-mlu can utilize tensor cores for acceleration, even for small matrix dimensions.

For performance reasons, sq needs to take part in the computation as the weight matrix.
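
Both operand orders discussed in this thread compute the same per-row sum of squares; they differ only in which operand holds the data matrix. The sketch below checks that equivalence in plain Python, outside Triton. The `matmul` and `transpose` helpers and the (1, BLOCK_N) shape of `ones` are assumptions for illustration; the actual shape of `ones` in the kernel is not shown in the diff excerpt, and the sketch says nothing about which order is faster on MLU.

```python
def matmul(a, b):
    """Naive matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


BLOCK_M, BLOCK_N = 2, 3
# Stand-in for x_f32 * x_f32, shape (BLOCK_M, BLOCK_N).
sq = [[1.0, 4.0, 9.0], [16.0, 25.0, 36.0]]
# Assumed shape (1, BLOCK_N); the real kernel may use a different shape.
ones = [[1.0] * BLOCK_N]

# Order used in the PR: ones @ sq^T, giving shape (1, BLOCK_M).
pr_order = matmul(ones, transpose(sq))[0]
# Order suggested in the review: sq @ ones^T, giving shape (BLOCK_M, 1).
review_order = [row[0] for row in matmul(sq, transpose(ones))]

# Reference reduction: plain per-row sum.
row_sums = [sum(row) for row in sq]
```

In the PR's order the data matrix is the second operand of the product, which matches the author's point that sq should act as the weight matrix.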

gemini-code-assist Bot reviewed May 8, 2026

optimize for mlu grouprmsnorm

f0f06a4

yuer-cn force-pushed the opt_grouprmsnorm branch from e3a355f to f0f06a4 Compare May 15, 2026 01:22

Update the printed message

39d4790

yuer-cn wants to merge 2 commits into master from opt_grouprmsnorm

1 participant
