[KMCompiler][ttx] Optimize gelu with rowwise nomask kernels by YangLong114514 · Pull Request #364 · XPU-Forces/mojo_opset

YangLong114514 · 2026-06-15T07:57:53Z

Description

The gelu operator has been optimized for the Ascend platform.

Changes

Updated the BLOCK_SIZE_N calculation to use a capped power-of-two tile size. For GELU, the cap is set to 1024 to reduce UB (Unified Buffer) pressure from the tanh approximation path.
Added dynamic grid calculation based on the autotuned BLOCK_SIZE_M, reducing empty programs for small shapes.
Added no-mask forward kernels for fully divisible row/column tiles, removing mask construction and masked tl.load / tl.store overhead.
Added a no-mask single-tile forward kernel for shapes where n_cols == BLOCK_SIZE_N, eliminating the inner column loop for common benchmark shapes.
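
The host-side logic described in the list above can be sketched in plain Python. This is a minimal illustration and not the PR's code: the helper names (`capped_block_size_n`, `grid_size`, `select_kernel`), the clamping of the grid to the core count, and the requirement that the single-tile path also needs divisible rows are all assumptions. Only the 1024 cap and the `n_cols == BLOCK_SIZE_N` condition come from the description.

```python
GELU_MAX_BLOCK_SIZE_N = 1024  # cap stated in the PR description


def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def capped_block_size_n(n_cols: int, cap: int = GELU_MAX_BLOCK_SIZE_N) -> int:
    """Power-of-two column tile size, never larger than `cap`."""
    return min(next_power_of_2(n_cols), cap)


def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return (a + b - 1) // b


def grid_size(n_rows: int, block_size_m: int, num_cores: int) -> int:
    # Assumption: one program per row tile, clamped to the core count,
    # so small shapes do not launch programs that have no rows to process.
    return min(cdiv(n_rows, block_size_m), num_cores)


def select_kernel(n_rows: int, n_cols: int, block_size_m: int, block_size_n: int) -> str:
    """Pick a kernel variant; the returned labels are illustrative only."""
    rows_divisible = n_rows % block_size_m == 0
    if rows_divisible and n_cols == block_size_n:
        return "nomask_single"  # one column tile per row, no inner column loop
    if rows_divisible and n_cols % block_size_n == 0:
        return "nomask"         # full tiles everywhere, no masks needed
    return "masked"             # fallback with masked loads and stores
```

For example, under these assumptions a (128, 256) input with BLOCK_SIZE_M = 32 gets a 256-wide column tile and takes the single-tile path, while a row count of 100 falls back to the masked kernel because 100 is not a multiple of 32.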

Performance

Measured on Ascend 910C with the FlagTree build of Triton 3.2.x:

shape	before	after	speedup = before/after
(16, 16)	2.560	1.512	1.69
(32, 64)	2.728	1.516	1.80
(128, 128)	3.324	2.036	1.63
(128, 256)	3.7602	2.400	1.57
(256, 256)	4.4282	3.304	1.34
(512, 256)	4.984	4.084	1.22

Accuracy test

The GeLU correctness tests in the Mojo correctness verification directory pass for all three data types: float32, float16, and bf16.

gemini-code-assist

Code Review

This pull request optimizes the GELU forward pass on NPU by introducing specialized no-mask kernels (_gelu_fwd_nomask_kernel and _gelu_fwd_nomask_single_kernel) when dimensions align, and refactors grid and block size calculations. The feedback suggests reusing the cached get_num_cores utility from .utils instead of dynamically querying device properties on every grid calculation, and dynamically computing GELU_TANH_MAX_BLOCK_SIZE_M from the autotune configurations to prevent potential out-of-bounds bugs if the configurations are updated in the future.
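
The two review suggestions can be illustrated with a small sketch. It is not the repository's code: `AUTOTUNE_CONFIGS` is a hypothetical stand-in for the kernel's autotune configuration list, `cached_num_cores` is a hypothetical stand-in for the real `get_num_cores` in `.utils` (whose implementation is not shown here), and the core count of 24 is a placeholder value.

```python
from functools import lru_cache

# Hypothetical stand-in for the kernel's autotune configurations.
AUTOTUNE_CONFIGS = [
    {"BLOCK_SIZE_M": 1},
    {"BLOCK_SIZE_M": 2},
    {"BLOCK_SIZE_M": 4},
]

# Derive the maximum from the configs instead of hard-coding it, so the
# constant stays correct if a larger BLOCK_SIZE_M is added later.
GELU_TANH_MAX_BLOCK_SIZE_M = max(cfg["BLOCK_SIZE_M"] for cfg in AUTOTUNE_CONFIGS)

query_calls = []  # records each simulated device query, for illustration


def _query_device_core_count() -> int:
    """Hypothetical device-property lookup; returns a placeholder value."""
    query_calls.append(1)
    return 24


@lru_cache(maxsize=None)
def cached_num_cores() -> int:
    """Query the device once and reuse the result on later grid calculations."""
    return _query_device_core_count()
```

With this shape, repeated grid calculations call `cached_num_cores()` freely and the device is queried only on the first call.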

[KMCompiler] Optimize gelu with rowwise nomask kernels

4154d44

gemini-code-assist Bot reviewed Jun 15, 2026

3 comment threads on mojo_opset/backends/ttx/kernels/npu/gelu.py (outdated)

YangLong114514 changed the title from "[KMCompiler] Optimize gelu with rowwise nomask kernels" to "[KMCompiler][ttx] Optimize gelu with rowwise nomask kernels" on Jun 15, 2026

[KMCompiler] Address GeLU review feedback

fce218d

YangLong114514 wants to merge 2 commits into XPU-Forces:master from YangLong114514:KMCompiler-GeLU
