[KMCompiler][ttx]Optimize rms_norm for small cols by YangLong114514 · Pull Request #363 · XPU-Forces/mojo_opset

YangLong114514 · 2026-06-15T03:44:59Z

Description

This PR optimizes the rms_norm operator for small column counts on the Ascend platform.

Changes

Added _rmsnorm_infer_small_cols_kernel for n_cols <= 2048.
Updated the BLOCK_SIZE_M selection logic for inference.
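
The dispatch described above can be sketched roughly as follows. This is only an illustration of the `n_cols <= 2048` rule, not the code in `rmsnorm.py`: the function `select_rmsnorm_kernel`, the constant `SMALL_COLS_THRESHOLD`, and the fallback name `_rmsnorm_infer_kernel` are hypothetical names introduced here.

```python
# Threshold taken from the PR description: the small-cols kernel covers n_cols <= 2048.
SMALL_COLS_THRESHOLD = 2048


def select_rmsnorm_kernel(n_cols: int) -> str:
    """Return the name of the kernel an inference call would dispatch to."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    if n_cols <= SMALL_COLS_THRESHOLD:
        return "_rmsnorm_infer_small_cols_kernel"
    # Hypothetical name for the pre-existing general path; not taken from the PR.
    return "_rmsnorm_infer_kernel"
```

Under this rule the `(8, 256, 2048)` shape takes the new kernel and the `(8, 256, 4096)` shape stays on the general path, assuming the last dimension is the column count.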

Performance

Measured on Ascend 910B with Triton 3.2.x (FlagTree) and CANN 8.5.0:

Shape	Dtype	Before	After	Speedup
(1, 8, 128)	float32	4.1424	2.1648	1.91
(4, 32, 256)	float32	5.3104	3.1248	1.70
(8, 256, 2048)	float32	25.9360	25.3568	1.02
(8, 256, 4096)	float32	38.6848	38.6608	1.00
(1, 16, 4096)	float32	6.6352	4.68	1.42
(4, 32, 2048)	float32	5.944	5.4448	1.09
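
For reference, the speedup column is consistent with before divided by after, rounded to two decimals. A small check, using only the numbers from the table (the timing unit is not stated in the PR, and the helper name `speedup` is introduced here for illustration):

```python
# (shape, before, after) copied from the table above; the unit is not stated in the PR.
MEASUREMENTS = [
    ("(1, 8, 128)", 4.1424, 2.1648),
    ("(4, 32, 256)", 5.3104, 3.1248),
    ("(8, 256, 2048)", 25.9360, 25.3568),
    ("(8, 256, 4096)", 38.6848, 38.6608),
    ("(1, 16, 4096)", 6.6352, 4.68),
    ("(4, 32, 2048)", 5.944, 5.4448),
]


def speedup(before: float, after: float) -> float:
    """Speedup as the ratio of the old time to the new time."""
    return before / after


if __name__ == "__main__":
    for shape, before, after in MEASUREMENTS:
        print(f"{shape:>16}  {speedup(before, after):.2f}x")
```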

Accuracy test

platform linux -- Python 3.11.13, pytest-8.3.2, pluggy-1.6.0
rootdir: /data/baai_user_home/jstar/mojo-work/mojo_opset
configfile: pytest.ini
plugins: xdist-3.6.1, anyio-4.10.0
collected 10 items

test_normalization.py::test_rmsnorm[1e-05-dtype0-shape0] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype0-shape1] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype0-shape2] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype0-shape3] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype0-shape4] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype1-shape0] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype1-shape1] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype1-shape2] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype1-shape3] PASSED
test_normalization.py::test_rmsnorm[1e-05-dtype1-shape4] PASSED

============================ 10 passed in 24.16s ==========================
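
The reference implementation used by `test_rmsnorm` is not shown in this PR. As a reminder of what is being checked, here is a minimal pure-Python sketch of the standard RMSNorm definition over one row, `y_i = x_i / sqrt(mean(x^2) + eps) * w_i`. The function name `rms_norm_row` is hypothetical, and the default `eps` of `1e-5` is only assumed from the test IDs.

```python
import math
from typing import Sequence


def rms_norm_row(x: Sequence[float], weight: Sequence[float], eps: float = 1e-5) -> list[float]:
    """Apply RMSNorm to a single row (the last dimension of the input)."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    if not x:
        raise ValueError("x must be non-empty")
    # Mean of squares over the row, then the reciprocal root with eps for stability.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

A kernel output can then be compared against this row by row within a dtype-dependent tolerance.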

gemini-code-assist

Code Review

This pull request introduces a new Triton kernel _rmsnorm_infer_small_cols_kernel to optimize RMSNorm inference for small column sizes, and updates the implementation to dynamically calculate BLOCK_SIZE_M and conditionally dispatch the appropriate kernel. The review feedback highlights critical issues where block sizes (BLOCK_SIZE_M and BLOCK_SIZE_N) may not be powers of two, which would cause Triton compilation failures, and provides actionable suggestions to resolve them.
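
The power-of-two concern comes from Triton requiring the extent of `tl.arange` to be a power of two. One way to guarantee that is to round the block sizes on the host before launch. The sketch below is an assumption about how that could look, not the fix applied in this PR: `pick_block_sizes` and its `max_block_elems` budget are hypothetical, and the pure-Python `next_power_of_2` mirrors what `triton.next_power_of_2` provides.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n - 1).bit_length()


def prev_power_of_2(n: int) -> int:
    """Largest power of two <= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n.bit_length() - 1)


def pick_block_sizes(n_rows: int, n_cols: int, max_block_elems: int = 8192) -> tuple[int, int]:
    """Return (BLOCK_SIZE_M, BLOCK_SIZE_N), both powers of two.

    max_block_elems is a hypothetical per-block element budget, not a value from the PR.
    """
    # Round columns up so one block covers a full row; out-of-range lanes need masking.
    block_n = next_power_of_2(n_cols)
    # Round rows down so the block stays within both the budget and the row count.
    rows_in_budget = max(1, min(n_rows, max_block_elems // block_n))
    block_m = prev_power_of_2(rows_in_budget)
    return block_m, block_n
```

Rounding `BLOCK_SIZE_N` up means the kernel must mask columns beyond `n_cols`; rounding `BLOCK_SIZE_M` down only changes how many programs are launched.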


[KMCompiler]opt for rms_norm

eddd4e7

YangLong114514 changed the title ~~[KMCompiler]Optimize rms_norm for small cols~~ [KMCompiler][ttx]Optimize rms_norm for small cols Jun 15, 2026

gemini-code-assist Bot reviewed Jun 15, 2026


Comment thread mojo_opset/backends/ttx/kernels/npu/rmsnorm.py Outdated

Comment thread mojo_opset/backends/ttx/kernels/npu/rmsnorm.py

review update

652d3e1

YangLong114514 wants to merge 2 commits into XPU-Forces:master from YangLong114514:KMCompiler-RmsNorm
