[ttx/npu] Optimize lightning_indexer_kernel by deferring k_scale to post-dot by lyujheng · Pull Request #317 · XPU-Forces/mojo_opset

lyujheng · 2026-05-21T03:42:37Z

Description

Optimize lightning_indexer_kernel by deferring the k_scale application: instead of pre-scaling
the K tensor before the dot product, scale the dot-product result afterwards, fully leveraging AIC and AIV core memory bandwidth.

Changes

- Remove per-element K tensor pre-scaling (k = k * k_scale[:, None])
- Apply k_scale after QK dot product as a lightweight broadcast multiply
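
The two changes above rely on the per-row scale factoring out of the dot product: sum_d q[m,d] * (k[n,d] * s[n]) equals s[n] * sum_d q[m,d] * k[n,d]. The following is a minimal NumPy sketch of that identity, not the kernel itself. The shapes (Q as M x D, K as N x D), the tile sizes, and the assumption that k_scale holds one value per K row are illustrative choices inferred from the `k_scale[:, None]` expression, not taken from the kernel source.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D = 4, 6, 8  # illustrative tile sizes, not the kernel's block sizes

q = rng.standard_normal((M, D)).astype(np.float32)
k = rng.standard_normal((N, D)).astype(np.float32)
# Assumption: one scale per K row, matching k * k_scale[:, None] in the PR text.
k_scale = rng.uniform(0.5, 2.0, size=N).astype(np.float32)

# Before: scale every element of K, then take the dot product.
scores_pre = q @ (k * k_scale[:, None]).T

# After: take the dot product on the unscaled K, then scale each output column.
scores_post = (q @ k.T) * k_scale[None, :]

# The two results agree up to floating-point rounding.
max_abs_diff = float(np.abs(scores_pre - scores_post).max())
print(np.allclose(scores_pre, scores_post, rtol=1e-4, atol=1e-4), max_abs_diff)
```

The two orderings are algebraically identical but round differently, so in low-precision dtypes the outputs need not match bit for bit; that is why the accuracy tests are still worth running after the change.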

Performance

Benchmarked on an Ascend 910C NPU with triton-ascend 3.2.0 (device latency):

Shape (B, M/N, K)	dtype	Before (us)	After (us)	Speedup
(128, 256, 128)	bf16	23041.37	3775.62	6.10x
(128, 256, 128)	fp16	23086.30	3776.70	6.11x
(128, 256, 128)	fp32	12731.75	4275.99	2.98x
(24, 1024, 128)	bf16	81850.27	22634.35	3.62x
(24, 1024, 128)	fp16	81893.52	22645.16	3.62x
(24, 1024, 128)	fp32	65051.82	28828.05	2.26x
(24, 16384, 128)	bf16	1410.02	420.78	3.35x
(24, 16384, 128)	fp16	1402.41	420.66	3.33x
(24, 16384, 128)	fp32	1041.73	502.65	2.07x

Average speedup: ~3.6x (bf16/fp16 gains significantly higher than fp32).
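
As a sanity check, the Speedup column can be recomputed as Before / After from the latencies in the table above. The short script below is only a sketch for readers who want to reproduce the ratios and compare dtypes; the variable names are hypothetical and the numbers are copied from the table.

```python
from collections import defaultdict

# (shape, dtype, before_us, after_us, reported_speedup), copied from the table
rows = [
    ((128, 256, 128), "bf16", 23041.37, 3775.62, 6.10),
    ((128, 256, 128), "fp16", 23086.30, 3776.70, 6.11),
    ((128, 256, 128), "fp32", 12731.75, 4275.99, 2.98),
    ((24, 1024, 128), "bf16", 81850.27, 22634.35, 3.62),
    ((24, 1024, 128), "fp16", 81893.52, 22645.16, 3.62),
    ((24, 1024, 128), "fp32", 65051.82, 28828.05, 2.26),
    ((24, 16384, 128), "bf16", 1410.02, 420.78, 3.35),
    ((24, 16384, 128), "fp16", 1402.41, 420.66, 3.33),
    ((24, 16384, 128), "fp32", 1041.73, 502.65, 2.07),
]

by_dtype = defaultdict(list)
for shape, dtype, before_us, after_us, reported in rows:
    speedup = before_us / after_us
    # Each recomputed ratio should agree with the reported, rounded value.
    assert abs(speedup - reported) < 0.01, (shape, dtype, speedup)
    by_dtype[dtype].append(speedup)

# Per-dtype arithmetic mean of the recomputed speedups.
mean_speedup = {dtype: sum(vals) / len(vals) for dtype, vals in by_dtype.items()}
for dtype, value in mean_speedup.items():
    print(f"{dtype}: mean speedup {value:.2f}x")
```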

Accuracy

All accuracy tests passed across all test shapes and dtypes (bf16/fp16/fp32).


gemini-code-assist

Code Review

This pull request optimizes the lightning_indexer_kernel by moving the k_scale multiplication to occur after the dot product instead of before. This change reduces the computational overhead of element-wise multiplications by performing the scaling on the result of the dot product. I have no further feedback to provide.

lyujheng · 2026-06-01T03:30:54Z

Hi @wwens7, could you help review this PR when you have a chance?

This PR optimizes lightning_indexer_kernel by fusing k_scale into post-dot scaling, achieving ~3.6x average speedup on Ascend 910C NPU. All accuracy tests passed locally across bf16/fp16/fp32.

The change is straightforward: it moves the scaling from a pre-dot element-wise multiply to a post-dot broadcast, which reduces AIC-AIV data transfer overhead by avoiding global memory round-trips.

Thanks!

Neuromancer42

LGTM

[ttx/npu] optimize lightning_indexer_kernel by fusing k_scale into post-dot scaling

a23efda

Move k_scale from pre-scaling K to post-dot application to better utilize AIC/AIV core memory bandwidth.

gemini-code-assist Bot reviewed May 21, 2026

Neuromancer42 requested review from mazixuan-byted and zhangjihang-BD June 4, 2026 02:59

Neuromancer42 approved these changes Jun 4, 2026

Neuromancer42 requested a review from Minghui-BD June 4, 2026 05:49

Merge branch 'XPU-Forces:master' into lvzheng/ttx-optimize-lightning-indexer

bfae95d

lyujheng wants to merge 2 commits into XPU-Forces:master from lyujheng:lvzheng/ttx-optimize-lightning-indexer


2 participants
