flash norm + blackwell softmax by Lazarus-931 · Pull Request #4 · patrick-toulme/pyptx

Lazarus-931 · 2026-05-02T14:34:00Z

Per the request in #2, added FlashNorm.

The per-kernel gain is 5-7%. I didn't test it fully, as @fm1320 had said, but I do think the 12% gains materialize in a 256-token decode with 33 norm calls. Per the paper, I split it into two separate @_kernel calls, one computing xW and another computing RMS(x).
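For readers unfamiliar with the weight-folding trick referenced in #2, the following is a minimal pure-Python sketch of why the two pieces (xW and RMS(x)) can be computed independently and combined at the end. It is not the PR's kernel code: the function names (`rms`, `matvec`, `fold_weights`, `flash_norm`) and the `eps` value are illustrative assumptions.

```python
import math

def rms(x, eps=1e-6):
    """Root mean square of a vector; eps is an assumed stability term."""
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m) -> length m."""
    n, m = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]

def rmsnorm_then_linear(x, g, W):
    """Reference path: normalize x, scale by norm weights g, then project."""
    r = rms(x)
    normed = [x[i] / r * g[i] for i in range(len(x))]
    return matvec(normed, W)

def fold_weights(g, W):
    """Fold the norm weights into the projection: W'[i][j] = g[i] * W[i][j]."""
    return [[g[i] * w for w in row] for i, row in enumerate(W)]

def flash_norm(x, W_folded):
    """Compute x @ W' and RMS(x) independently, then divide at the end."""
    y = matvec(x, W_folded)  # stands in for the projection kernel
    r = rms(x)               # stands in for the RMS reduction kernel
    return [v / r for v in y]
```

Under these assumptions, `flash_norm(x, fold_weights(g, W))` matches `rmsnorm_then_linear(x, g, W)` up to floating-point rounding, because the scalar 1/RMS(x) commutes with the matrix product. The folding happens once, ahead of time, so the norm itself becomes weightless.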

Also included a softmax for Blackwell. It is ~2.7x faster than torch, but torch's std::exp is slower than the log2e-based exponential used here, so this is not really an accurate comparison, as @ezyang mentioned.
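The log2e remark refers to the identity exp(x) = 2^(x * log2(e)), which lets a kernel use a base-2 exponential (PTX exposes one as `ex2.approx`) instead of a natural exponential. A minimal Python sketch of the two formulations follows; the function names are illustrative and this is not the PR's kernel.

```python
import math

LOG2E = math.log2(math.e)  # log2(e), roughly 1.4427

def softmax_exp(xs):
    """Numerically stable softmax using the natural exponential."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def softmax_exp2(xs):
    """Same softmax, using 2**(x * log2e) in place of exp(x)."""
    m = max(xs)
    es = [2.0 ** ((x - m) * LOG2E) for x in xs]
    s = sum(es)
    return [e / s for e in es]
```

Both functions agree up to rounding in double precision. On a GPU the base-2 path is usually an approximate instruction, so a benchmark against an exact `exp` compares different accuracy levels as well as different speeds.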

Blackwell softmax + FlashNorm kernel

add ampere flash_norm + bench entry

patrick-toulme · 2026-05-03T16:03:21Z

            kernel_name = _extract_entry_name(ptx_source)

        err, module = driver.cuModuleLoadData(ptx_source.encode())
-        if err == driver.CUresult.CUDA_ERROR_UNSUPPORTED_PTX_VERSION:


Why are the changes to this file needed?

For the Blackwell softmax, I kept hitting compilation errors. Apparently torch's bundled CUDA driver JIT is built for sm_50–sm_90, so cuModuleLoadData always gave me CUDA_ERROR_INVALID_PTX.

Errors like:

Found GPU0 NVIDIA B200 which is of compute capability (CC) 10.0.
The following list shows the CCs this version of PyTorch was built
for and the hardware CCs it supports:
5.0 which supports hardware CC >=5.0,<6.0
6.0 which supports hardware CC >=6.0,<7.0
7.0 which supports hardware CC >=7.0,<8.0
7.5 which supports hardware CC >=7.5,<8.0
8.0 which supports hardware CC >=8.0,<9.0
8.6 which supports hardware CC >=8.6,<9.0
9.0 which supports hardware CC >=9.0,<10.0
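To make the mismatch in that message concrete, here is a small hypothetical helper that parses the listed ranges and checks whether a given compute capability falls inside any of them. The names `parse_supported` and `is_supported` are invented for illustration and are not part of PyTorch or pyptx.

```python
import re

def parse_supported(lines):
    """Extract (low, high) CC bounds from '... CC >=A.B,<C.D' lines."""
    pattern = re.compile(r">=(\d+)\.(\d+),<(\d+)\.(\d+)")
    ranges = []
    for line in lines:
        match = pattern.search(line)
        if match:
            a, b, c, d = map(int, match.groups())
            ranges.append(((a, b), (c, d)))
    return ranges

def is_supported(cc, ranges):
    """True if cc (major, minor) lies in any half-open range [low, high)."""
    return any(low <= cc < high for low, high in ranges)
```

Fed the seven list lines above, `is_supported((10, 0), ranges)` is false while `is_supported((9, 0), ranges)` is true, which is the condition the message reports for the B200.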

patrick-toulme

Confused about why jax_support.py needs changes

Lazarus-931 and others added 3 commits May 2, 2026 10:20

bw softmax + flash_norm

960601d

Merge pull request #1 from Lazarus-931/softmax

7e3fe4a

Blackwell softmax + FlashNorm kernel

Update generated docs [skip ci]

547bf20

Lazarus-931 mentioned this pull request May 2, 2026

Feature Request: Support weightless RMSNorm (for FlashNorm weight folding trick) #2

Open

Lazarus-931 and others added 3 commits May 2, 2026 13:36

add ampere flash_norm + bench entry

d0862c8

Merge pull request #2 from Lazarus-931/softmax

7818b5b

add ampere flash_norm + bench entry

Update generated docs [skip ci]

424a9f9

patrick-toulme reviewed May 3, 2026

patrick-toulme requested changes May 3, 2026

Lazarus-931 wants to merge 6 commits into patrick-toulme:main from Lazarus-931:main

2 participants