add softmax kernel 9: online softmax with 4x unrolled loads by RahulPatnaik · Pull Request #841 · karpathy/llm.c

RahulPatnaik · 2026-03-14T12:27:18Z

Add softmax forward kernel 9

Adds a new kernel variant to dev/cuda/softmax_forward.cu that builds on kernel 8 (online softmax with warp-level reductions).

What changed

Kernel 9 unrolls the coarsening loops by a factor of 4, issuing multiple global memory loads together to increase memory-level parallelism. The online softmax algorithm (fused max + sum in a single pass) and warp shuffle reductions are kept from kernel 8. A precomputed reciprocal replaces per-element division in the output pass.
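The combination described above can be sketched on the host in plain C++. This is a minimal model under stated assumptions, not the kernel from this PR: the names `MaxSum`, `fold`, and `softmax_row_online_unrolled` are hypothetical, a single sequential loop stands in for the per-thread coarsening loop, and inputs are assumed finite. It shows the three ideas together: four loads issued before any arithmetic, a fused running max and sum, and one reciprocal reused in the output pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Running state of online softmax: the max seen so far and the sum of
// exp(x - max) over the elements folded in so far.
struct MaxSum {
    float max;
    float sum;
};

// Fold one value into the running state, rescaling the old sum if the max grows.
// With the initial state {-inf, 0}, exp(-inf) == 0, so no special case is needed
// as long as v is finite (assumed here).
inline MaxSum fold(MaxSum s, float v) {
    const float m = std::max(s.max, v);
    return {m, s.sum * std::exp(s.max - m) + std::exp(v - m)};
}

// Hypothetical host-side model of one row (illustrative only).
std::vector<float> softmax_row_online_unrolled(const std::vector<float>& x) {
    const int C = static_cast<int>(x.size());
    assert(C > 0);
    constexpr int UNROLL = 4;
    MaxSum s{-INFINITY, 0.0f};

    for (int i = 0; i < C; i += UNROLL) {
        float reg[UNROLL];
        // Issue all four loads first. The clamped index keeps them in bounds
        // without a branch around the load itself.
        for (int u = 0; u < UNROLL; ++u) {
            reg[u] = x[std::min(C - 1, i + u)];
        }
        // Then do the arithmetic. Clamped duplicates past the end are skipped so
        // the sum does not count the last element more than once (an assumption
        // about how the tail is handled).
        for (int u = 0; u < UNROLL; ++u) {
            if (i + u < C) s = fold(s, reg[u]);
        }
    }

    // One division per row; the output pass only multiplies.
    const float inv_sum = 1.0f / s.sum;
    std::vector<float> out(C);
    for (int i = 0; i < C; ++i) {
        out[i] = std::exp(x[i] - s.max) * inv_sum;
    }
    return out;
}
```

Separating the load loop from the fold loop is what lets a compiler or GPU keep several loads in flight at once; interleaving load and arithmetic would serialize each load behind the previous element's `exp`.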

Results (RTX 4060 Laptop GPU)

Kernel	Time	DRAM Throughput
8	22.82 ms	84.45%
9	20.76 ms	92.82%

~9% wall-clock improvement. The gain comes from better memory bus utilization through overlapped loads, not from reduced memory traffic (both kernels make the same number of global memory passes).

Notes

Correctness validated against CPU reference at all block sizes (32, 64, 128, 256, 512, 1024)
Uses the min(C-1, ...) out-of-bounds clamping trick from kernel 7 to keep loads branchless
No shared memory required (warps are independent, one warp per row)
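The last two notes can be modeled on the host as well. The sketch below is an assumption-laden illustration, not code from the PR: `Partial`, `merge_partials`, and `warp_row_reduce` are hypothetical names, an array of 32 entries stands in for the lanes of a warp, and the XOR butterfly stands in for whatever shuffle variant the kernel actually uses. It shows why one warp per row needs no shared memory: each lane keeps a private (max, sum) pair, and the pairs are combined lane to lane.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

constexpr int WARP = 32;

// Per-lane partial result: max of the lane's elements and sum of exp(x - max).
struct Partial {
    float max;
    float sum;
};

// Combine two partials by rescaling both sums to the larger max.
inline Partial merge_partials(Partial a, Partial b) {
    const float m = std::max(a.max, b.max);
    if (m == -INFINITY) return {m, 0.0f};  // both lanes empty: avoid inf - inf
    return {m, a.sum * std::exp(a.max - m) + b.sum * std::exp(b.max - m)};
}

// Hypothetical host-side model of one warp reducing one row of finite values.
Partial warp_row_reduce(const std::vector<float>& x) {
    const int C = static_cast<int>(x.size());
    assert(C > 0);
    std::array<Partial, WARP> lane;
    lane.fill(Partial{-INFINITY, 0.0f});

    // Lane l handles elements l, l + 32, l + 64, ...
    for (int base = 0; base < C; base += WARP) {
        for (int l = 0; l < WARP; ++l) {
            const int i = base + l;
            // Clamped index: the load never goes out of bounds and needs no branch.
            const float v = x[std::min(C - 1, i)];
            // Out-of-range lanes discard the duplicate so the sum stays correct.
            if (i < C) lane[l] = merge_partials(lane[l], Partial{v, 1.0f});
        }
    }

    // Butterfly exchange: after log2(32) = 5 steps every lane holds the row result.
    for (int offset = WARP / 2; offset > 0; offset /= 2) {
        const std::array<Partial, WARP> prev = lane;  // snapshot models simultaneous exchange
        for (int l = 0; l < WARP; ++l) {
            lane[l] = merge_partials(prev[l], prev[l ^ offset]);
        }
    }
    return lane[0];
}
```

Clamping alone is safe for a pure max reduction, since re-reading the last element cannot change the max. Once the sum is fused into the same pass, the duplicate has to be masked out, which is why the model keeps the `i < C` check on the accumulation but not on the load.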

add softmax kernel 9: online softmax with 4x unrolled loads

93f6cc9

RahulPatnaik wants to merge 1 commit into karpathy:master from RahulPatnaik:explore-kernel-profiling
