[ttx/npu] Optimize lightning_indexer_kernel by deferring k_scale to post-dot #317

Open
lyujheng wants to merge 2 commits into XPU-Forces:master from lyujheng:lvzheng/ttx-optimize-lightning-indexer

@lyujheng

Description

Optimize lightning_indexer_kernel by deferring k_scale: instead of pre-scaling the K tensor,
apply the scale to the result of the QK dot product. This makes better use of AIC and AIV core memory bandwidth.

Changes

  • Remove per-element K tensor pre-scaling (k = k * k_scale[:, None])
  • Apply k_scale after QK dot product as a lightweight broadcast multiply
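
The two changes above rest on the identity sum_d q[m,d] * (k[n,d] * k_scale[n]) = k_scale[n] * sum_d q[m,d] * k[n,d]. A minimal plain-Python sketch of that equivalence follows. It is not the Triton kernel: the helper names (`qk_scores`, `scores_prescaled`, `scores_postscaled`) and the sample values are hypothetical, and it assumes `k_scale` holds one scale per K row.

```python
import math

def qk_scores(q, k):
    """Dot-product scores: scores[m][n] = sum_d q[m][d] * k[n][d]."""
    return [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k] for q_row in q]

def scores_prescaled(q, k, k_scale):
    # Before: multiply every element of K by its row scale, then take the dot product.
    k_scaled = [[x * s for x in k_row] for k_row, s in zip(k, k_scale)]
    return qk_scores(q, k_scaled)

def scores_postscaled(q, k, k_scale):
    # After: dot product on the unscaled K, then scale column n of the scores by k_scale[n].
    raw = qk_scores(q, k)
    return [[v * s for v, s in zip(row, k_scale)] for row in raw]

# Illustrative values only (M=2 queries, N=4 keys, head dim 3).
q = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
k = [[2.0, 0.0, 1.0], [-1.0, 3.0, 0.5], [4.0, 1.0, -2.0], [0.0, 0.25, 8.0]]
k_scale = [0.5, 2.0, 0.25, 4.0]

pre = scores_prescaled(q, k, k_scale)
post = scores_postscaled(q, k, k_scale)
assert all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row_a, row_b in zip(pre, post)
    for a, b in zip(row_a, row_b)
)
```

The two orderings agree mathematically. In bf16 or fp16 they may round differently, so the accuracy tests remain the check that matters.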

Performance

Benchmark on Ascend 910C NPU with triton-ascend 3.2.0 (device latency):

| Shape (B, M/N, K) | dtype | Before (us) | After (us) | Speedup |
|---|---|---|---|---|
| (128, 256, 128) | bf16 | 23041.37 | 3775.62 | 6.10x |
| (128, 256, 128) | fp16 | 23086.30 | 3776.70 | 6.11x |
| (128, 256, 128) | fp32 | 12731.75 | 4275.99 | 2.98x |
| (24, 1024, 128) | bf16 | 81850.27 | 22634.35 | 3.62x |
| (24, 1024, 128) | fp16 | 81893.52 | 22645.16 | 3.62x |
| (24, 1024, 128) | fp32 | 65051.82 | 28828.05 | 2.26x |
| (24, 16384, 128) | bf16 | 1410.02 | 420.78 | 3.35x |
| (24, 16384, 128) | fp16 | 1402.41 | 420.66 | 3.33x |
| (24, 16384, 128) | fp32 | 1041.73 | 502.65 | 2.07x |

Average speedup: ~3.6x (bf16/fp16 gains significantly higher than fp32).
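
The per-row speedups are simply Before / After. The PR does not state how the average was taken, so the sketch below recomputes the ratios from the reported latencies and prints both an arithmetic and a geometric mean as a rough cross-check. The variable names are illustrative.

```python
import math

# (before_us, after_us) pairs copied from the benchmark results, in table order.
latencies = [
    (23041.37, 3775.62),
    (23086.30, 3776.70),
    (12731.75, 4275.99),
    (81850.27, 22634.35),
    (81893.52, 22645.16),
    (65051.82, 28828.05),
    (1410.02, 420.78),
    (1402.41, 420.66),
    (1041.73, 502.65),
]

# Speedup per configuration is the latency ratio before / after.
speedups = [before / after for before, after in latencies]

arithmetic_mean = sum(speedups) / len(speedups)
geometric_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

print(f"arithmetic mean speedup: {arithmetic_mean:.2f}x")
print(f"geometric mean speedup:  {geometric_mean:.2f}x")
```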

Accuracy

All accuracy tests passed across all test shapes and dtypes (bf16/fp16/fp32).

…st-dot scaling

Move k_scale from pre-scaling K to post-dot application to better utilize AIC/AIV core memory bandwidth.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request optimizes the lightning_indexer_kernel by moving the k_scale multiplication to occur after the dot product instead of before. This change reduces the computational overhead of element-wise multiplications by performing the scaling on the result of the dot product. I have no further feedback to provide.

@lyujheng

lyujheng commented Jun 1, 2026

Author

Hi @wwens7, could you help review this PR when you have a chance?

This PR optimizes lightning_indexer_kernel by fusing k_scale into post-dot scaling, achieving ~3.6x average speedup on Ascend 910C NPU. All accuracy tests passed locally across bf16/fp16/fp32.

The change is straightforward: it moves the scaling from a pre-dot element-wise multiply to a post-dot broadcast multiply, which reduces AIC-AIV data transfer overhead by avoiding global memory round-trips.

Thanks!

@Neuromancer42 Neuromancer42 left a comment

Collaborator

LGTM

@Neuromancer42 Neuromancer42 requested a review from Minghui-BD June 4, 2026 05:49