[KMCompiler][ttx] Optimize rms_norm for small cols #363
Code Review
This pull request introduces a new Triton kernel _rmsnorm_infer_small_cols_kernel to optimize RMSNorm inference for small column sizes, and updates the implementation to dynamically calculate BLOCK_SIZE_M and conditionally dispatch the appropriate kernel. The review feedback highlights critical issues where block sizes (BLOCK_SIZE_M and BLOCK_SIZE_N) may not be powers of two, which would cause Triton compilation failures, and provides actionable suggestions to resolve them.
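The power-of-two concern can be illustrated with a small helper. This is a minimal sketch, not the code from this PR: the names `next_power_of_2` and `is_power_of_2` are illustrative, and it is written in plain Python so it runs without Triton installed. It relies on the fact that Triton's `tl.arange` requires a power-of-two range, so a block size derived directly from `n_cols` (for example 1500) would need to be rounded up and the extra lanes masked.

```python
def is_power_of_2(n: int) -> bool:
    """Return True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_2(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be positive")
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # so shifting 1 by that amount yields the first power of two >= n.
    return 1 << (n - 1).bit_length()


# Hypothetical example: a column count that is not a power of two.
n_cols = 1500
BLOCK_SIZE_N = next_power_of_2(n_cols)  # rounded up; lanes >= n_cols must be masked
```

Inside a kernel, the rounded-up block is typically paired with a mask such as `cols < n_cols` on loads and stores so the padding lanes never touch memory.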
Description
The rms_norm operator has been optimized for the Ascend platform.
Changes
Added _rmsnorm_infer_small_cols_kernel for n_cols <= 2048.
Updated the BLOCK_SIZE_M selection logic for inference.
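The two changes above could be combined into a dispatch step along the following lines. This is a hedged sketch under stated assumptions rather than the PR's actual logic: only the 2048 threshold and the name `_rmsnorm_infer_small_cols_kernel` come from the description, while `ELEMS_PER_BLOCK`, `select_launch_config`, and the `default_rmsnorm_kernel` label are hypothetical. It keeps both block sizes at powers of two, as the review requests.

```python
SMALL_COLS_THRESHOLD = 2048  # from the PR description
ELEMS_PER_BLOCK = 8192       # hypothetical per-program element budget, not from the PR


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def prev_power_of_2(n: int) -> int:
    """Largest power of two <= n, for n >= 1."""
    return 1 << (n.bit_length() - 1)


def select_launch_config(n_rows: int, n_cols: int):
    """Return (kernel_name, BLOCK_SIZE_M, BLOCK_SIZE_N) for an inference call."""
    if n_rows < 1 or n_cols < 1:
        raise ValueError("n_rows and n_cols must be positive")
    block_n = next_power_of_2(n_cols)
    if n_cols <= SMALL_COLS_THRESHOLD:
        # Small rows: pack several rows into one program, staying within the budget.
        block_m = prev_power_of_2(max(1, ELEMS_PER_BLOCK // block_n))
        # No point in a row tile larger than the (rounded-up) number of rows.
        block_m = min(block_m, next_power_of_2(n_rows))
        return ("_rmsnorm_infer_small_cols_kernel", block_m, block_n)
    # Assumption: the existing kernel handles one row per program and
    # iterates over columns in chunks of at most ELEMS_PER_BLOCK.
    return ("default_rmsnorm_kernel", 1, min(block_n, ELEMS_PER_BLOCK))
```

With these assumed constants, a 4096 x 128 input would map to the small-cols kernel with a 64 x 128 tile, while a 16 x 4096 input falls back to the default path.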
Performance
Measured on Ascend 910B with FlagTree's Triton 3.2.x and CANN 8.5.0:
Accuracy test