[KMCompiler][ttx] Optimize silu with rowwise nomask kernels #365

Open
YangLong114514 wants to merge 2 commits into XPU-Forces:master from YangLong114514:KMcompiler-SiLU

@YangLong114514

Description

The silu operator has been optimized for the Ascend platform.

Changes

  1. Updated the BLOCK_SIZE_N calculation to use a capped power-of-two tile size, reducing unstable tile-size jumps around boundary shapes.

  2. Added dynamic grid calculation based on the autotuned BLOCK_SIZE_M, avoiding unnecessary empty programs for small shapes.

  3. Added no-mask forward kernels for shapes that are fully divisible by the row and column tiles, removing mask construction and masked tl.load / tl.store overhead.

  4. Added a no-mask single-tile forward kernel for shapes where n_cols == BLOCK_SIZE_N, eliminating the inner column loop for common benchmark shapes.
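
The host-side decisions described in the four changes above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: the helper names, the cap of 1024, and the rule that the grid is limited by the core count are all hypothetical.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def capped_block_size_n(n_cols: int, cap: int = 1024) -> int:
    """Column tile size: a power of two, never larger than `cap`.

    The cap value 1024 is an assumption made for this sketch.
    """
    return min(next_power_of_two(n_cols), cap)


def grid_size(n_rows: int, block_size_m: int, num_cores: int) -> int:
    """Number of programs to launch.

    Assumes each program handles one or more row tiles, so launching more
    programs than there are row tiles would only create idle programs.
    """
    row_tiles = -(-n_rows // block_size_m)  # ceiling division
    return min(row_tiles, num_cores)


def pick_kernel(n_rows: int, n_cols: int, block_size_m: int, block_size_n: int) -> str:
    """Choose a forward-kernel variant from the shape and the tile sizes."""
    if n_rows % block_size_m or n_cols % block_size_n:
        return "masked"          # a partial tile exists, masks are required
    if n_cols == block_size_n:
        return "nomask_single"   # one column tile per row, no inner loop
    return "nomask"              # full tiles only, loop over column tiles
```

For example, under these assumptions a (16, 16) input with `block_size_m = 8` gets a column tile of 16, two programs, and the single-tile no-mask path.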

Performance

Measured on an Ascend 910C with the FlagTree build of Triton 3.2.x:

| shape | before | after | speedup (before / after) |
| --- | --- | --- | --- |
| (16, 16) | 2.696 | 1.436 | 1.88 |
| (32, 64) | 2.536 | 1.280 | 1.98 |
| (128, 128) | 3.188 | 1.884 | 1.69 |
| (512, 256) | 3.500 | 2.656 | 1.32 |
| (1024, 1024) | 6.996 | 5.280 | 1.33 |
| (2048, 512) | 6.784 | 5.288 | 1.28 |
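
As a quick sanity check, the speedup column can be recomputed from the before and after timings. The snippet below copies the reported numbers and makes no assumption about their unit.

```python
# (shape, before, after, reported speedup), copied from the results above.
rows = [
    ((16, 16), 2.696, 1.436, 1.88),
    ((32, 64), 2.536, 1.280, 1.98),
    ((128, 128), 3.188, 1.884, 1.69),
    ((512, 256), 3.500, 2.656, 1.32),
    ((1024, 1024), 6.996, 5.280, 1.33),
    ((2048, 512), 6.784, 5.288, 1.28),
]

# Speedup is defined as before / after, so larger means faster after the change.
recomputed = {shape: before / after for shape, before, after, _ in rows}

# Largest gap between a recomputed ratio and the reported (rounded) value.
max_deviation = max(
    abs(before / after - reported) for _, before, after, reported in rows
)
```

Every recomputed ratio agrees with the reported value to within rounding at two decimal places.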

Accuracy test

The SiLU operator passes the tests in the Mojo correctness verification directory for all three data types: float32, float16, and bf16.
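
For context, SiLU is defined as `silu(x) = x * sigmoid(x)`. A minimal scalar reference that a kernel's output could be compared against might look like the sketch below. It is not the test code from the Mojo verification directory, and `silu_reference` is a hypothetical name.

```python
import math


def silu_reference(x: float) -> float:
    """Scalar SiLU, x * sigmoid(x), written to avoid overflow in exp()."""
    if x >= 0:
        # exp(-x) <= 1 here, so it cannot overflow.
        return x / (1.0 + math.exp(-x))
    # For negative x, exp(x) <= 1, so use the algebraically equivalent form.
    e = math.exp(x)
    return x * e / (1.0 + e)
```

Splitting on the sign of `x` keeps the exponential argument non-positive, which matters for large negative inputs where `exp(-x)` would overflow.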

@gemini-code-assist (Bot) left a comment

Code Review

This pull request optimizes the SiLU activation implementation for NPU by introducing autotuned, non-masked forward kernels (_silu_fwd_nomask_kernel and _silu_fwd_nomask_single_kernel) along with helper functions for dynamic grid and block size calculation. The review feedback points out a critical issue where an empty tensor (with zero rows) would result in an invalid grid size of 0, causing a runtime error. Additionally, it recommends reusing the cached get_num_cores utility from .utils instead of querying NPU device properties directly on every kernel launch to eliminate unnecessary CPU overhead.
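
The two review points can be illustrated with a small host-side sketch. It assumes the grid is derived from the row-tile count and the core count. `_query_device_core_count`, `cached_num_cores`, and `launch_plan` are hypothetical names, and the core count of 24 is a placeholder, not a real device figure. The repository's own `get_num_cores` utility is not reproduced here.

```python
from functools import lru_cache


def _query_device_core_count() -> int:
    """Hypothetical stand-in for a device-properties query that is slow to call."""
    return 24  # placeholder value for this sketch


@lru_cache(maxsize=None)
def cached_num_cores() -> int:
    """Query the device once, then serve the cached value on later launches."""
    return _query_device_core_count()


def launch_plan(n_rows: int, block_size_m: int):
    """Return the launch grid as a tuple, or None when there is nothing to do."""
    if n_rows == 0:
        # An empty tensor would otherwise yield a grid of size 0, which is invalid.
        return None
    row_tiles = -(-n_rows // block_size_m)  # ceiling division
    return (min(row_tiles, cached_num_cores()),)
```

A caller would return the (empty) output tensor directly when `launch_plan` yields `None`, and launch the kernel with the returned grid otherwise.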

Comment thread mojo_opset/backends/ttx/kernels/npu/silu.py Outdated
Comment thread mojo_opset/backends/ttx/kernels/npu/silu.py Outdated
Comment thread mojo_opset/backends/ttx/kernels/npu/silu.py Outdated
@YangLong114514 changed the title from "[KMCompiler] Optimize silu with rowwise nomask kernels" to "[KMCompiler][ttx] Optimize silu with rowwise nomask kernels" on Jun 15, 2026