[RMSNorm] Fix JIT recompilation by removing tl.constexpr on rows_per_program & Cleanup Block kernel interface by niyunsheng · Pull Request #988 · linkedin/Liger-Kernel

niyunsheng · 2025-12-24T04:34:22Z

Summary

This PR stops _rms_norm_backward_kernel from being recompiled unnecessarily by the JIT and cleans up the interface of _block_rms_norm_backward_kernel.

Avoid JIT Recompilation: Removes tl.constexpr from the rows_per_program argument in _rms_norm_backward_kernel.
Interface Cleanup: Removes the unused rows_per_program argument from _block_rms_norm_backward_kernel.

Details

Fix for Dynamic Shapes in _rms_norm_backward_kernel. Currently, rows_per_program is marked as tl.constexpr, but it is used within a standard dynamic range loop (not tl.static_range).
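
To illustrate why the loop bounds are runtime values, here is a minimal plain-Python sketch of how rows might be split across programs. It is not the kernel source: the helper name row_ranges, the parameter num_programs, and the formula rows_per_program = ceil(n_rows / num_programs) are assumptions made for illustration.

```python
import math


def row_ranges(n_rows: int, num_programs: int):
    """Return rows_per_program and the (row_start, row_end) pair of each program.

    Hypothetical helper: it mimics, in plain Python, a launch where each
    program id handles one contiguous slice of rows.
    """
    # Assumed host-side computation; the value changes whenever n_rows changes.
    rows_per_program = math.ceil(n_rows / num_programs)

    ranges = []
    for program_id in range(num_programs):
        # Both bounds depend on values only known at launch time, so a loop
        # over range(row_start, row_end) cannot be unrolled ahead of time.
        row_start = program_id * rows_per_program
        row_end = min((program_id + 1) * rows_per_program, n_rows)
        ranges.append((row_start, row_end))
    return rows_per_program, ranges
```

Because row_start and row_end come from program_id and n_rows, knowing rows_per_program at compile time does not give the compiler a fixed trip count for the loop.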

Issue: The tl.constexpr hint provides no loop unrolling benefits in this context because the loop bounds are determined at runtime (dependent on n_rows and program_id). However, Triton still specializes the kernel on every tl.constexpr value, so each distinct rows_per_program yields a separate compiled variant.
Impact: In dynamic shape scenarios (where rows_per_program changes with input size), this unnecessarily triggers JIT recompilation for every new rows_per_program value, causing severe cache thrashing and CPU overhead without any performance gain.
Fix: Removing tl.constexpr allows the compiled kernel to be reused across different rows_per_program values.
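
The effect can be sketched with a toy cache in plain Python. This is a simplified model, not Triton's implementation: the class ToyJit, the helper count_compiles, and the BLOCK_SIZE value are hypothetical, and a real cache key includes more than the constexpr values (argument types, for example).

```python
class ToyJit:
    """Toy model of a specializing JIT cache (NOT Triton's actual code)."""

    def __init__(self, constexpr_names):
        # Names of the arguments treated as compile-time constants.
        self.constexpr_names = tuple(constexpr_names)
        self.cache = {}
        self.compile_count = 0

    def launch(self, **kwargs):
        # Only the values of constexpr arguments take part in the cache key;
        # every other argument is passed to the compiled binary at runtime.
        key = tuple((name, kwargs[name]) for name in self.constexpr_names)
        if key not in self.cache:
            self.compile_count += 1  # cache miss: simulate a compilation
            self.cache[key] = f"binary#{self.compile_count}"
        return self.cache[key]


def count_compiles(constexpr_names, rows_per_program_values):
    """Launch once per rows_per_program value and report how many compiles ran."""
    jit = ToyJit(constexpr_names)
    for rows_per_program in rows_per_program_values:
        jit.launch(BLOCK_SIZE=1024, rows_per_program=rows_per_program)
    return jit.compile_count
```

In this model, count_compiles(["BLOCK_SIZE", "rows_per_program"], values) compiles once per distinct value, while count_compiles(["BLOCK_SIZE"], values) compiles once in total, which is the behavior the fix aims for.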

Cleanup in _block_rms_norm_backward_kernel. The rows_per_program argument was unused in the block-wise implementation. It has been removed to avoid signature pollution and confusion.

Testing Done

Verified that the changes do not introduce performance regressions. In the benchmark below, latency stays between 0.12 and 0.18 ms for hidden sizes up to 16384 and rises to 1.37 ms at 32768.

Performance Benchmark:

Hidden Size	Latency (ms)	P50 (ms)
1024.00	0.13	0.11
2048.00	0.12	0.12
4096.00	0.12	0.12
8192.00	0.12	0.11
16384.00	0.18	0.18
32768.00	1.37	1.39

Hardware Type: NVIDIA A100-SXM4-80GB
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

Tcc0403

great catch!

niyunsheng added 2 commits December 24, 2025 11:07

rm useless param rows_per_program for block_rms_norm_backward_kernel

2b9e593

rm tl.constexpr for rows_per_program in _rms_norm_backward_kernel

4ab50aa

Tcc0403 approved these changes Dec 24, 2025


Tcc0403 merged commit 77949e0 into linkedin:main Dec 24, 2025
3 of 7 checks passed

niyunsheng deleted the rms_norm_block_backward branch December 25, 2025 00:43
