add softmax kernel 9: online softmax with 4x unrolled loads #841

Open
RahulPatnaik wants to merge 1 commit into karpathy:master from RahulPatnaik:explore-kernel-profiling
Add softmax forward kernel 9

Adds a new kernel variant to dev/cuda/softmax_forward.cu that builds on kernel 8 (online softmax with warp-level reductions).

What changed

Kernel 9 unrolls the coarsening loops by a factor of 4, issuing multiple global memory loads together to increase memory-level parallelism. The online softmax algorithm (fused max + sum in a single pass) and warp shuffle reductions are kept from kernel 8. A precomputed reciprocal replaces per-element division in the output pass.
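The combination described above can be sketched on the CPU for a single row. This is a minimal illustration, not the code from the PR: the function name `softmax_row_online_unrolled`, the constant `UNROLL`, and the serial (one-thread) loop structure are assumptions made for the example, and finite inputs are assumed. It shows the three ideas together: a batch of 4 loads issued before any of them is consumed, the fused running max and sum update, and a single reciprocal reused in the output pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical unroll factor, matching the "4x" in the PR title.
constexpr int UNROLL = 4;

// CPU sketch of one softmax row: online (max, sum) update with 4 loads per step.
// Assumes C > 0 and finite input values.
void softmax_row_online_unrolled(const float* x, float* y, int C) {
    assert(C > 0);
    float maxval = -INFINITY;
    float sumval = 0.0f;
    for (int i = 0; i < C; i += UNROLL) {
        float reg[UNROLL];
        // Issue all loads first. The index is clamped to C - 1 so the
        // tail iteration never reads out of bounds and needs no branch.
        for (int u = 0; u < UNROLL; u++) {
            reg[u] = x[std::min(C - 1, i + u)];
        }
        // Consume the loaded values. The guard skips clamped duplicates
        // so they are not counted twice in the running sum.
        for (int u = 0; u < UNROLL; u++) {
            if (i + u < C) {
                float new_max = std::fmax(maxval, reg[u]);
                // Rescale the old sum to the new max, then add the new term.
                sumval = sumval * std::exp(maxval - new_max) + std::exp(reg[u] - new_max);
                maxval = new_max;
            }
        }
    }
    // One division per row instead of one per element.
    const float inv_sum = 1.0f / sumval;
    for (int i = 0; i < C; i++) {
        y[i] = std::exp(x[i] - maxval) * inv_sum;
    }
}
```

On a GPU the separation between the load loop and the consume loop is what lets several memory requests be in flight at once; on the CPU the sketch only demonstrates that the arithmetic is unchanged by the unrolling.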

Results (RTX 4060 Laptop GPU)

Kernel   Time       DRAM Throughput
8        22.82 ms   84.45%
9        20.76 ms   92.82%

~9% wall-clock improvement. The gain comes from better memory bus utilization through overlapped loads, not from reduced memory traffic (both kernels make the same number of global memory passes).

Notes

  • Correctness validated against CPU reference at all block sizes (32, 64, 128, 256, 512, 1024)
  • Uses the min(C-1, ...) out-of-bounds clamping trick from kernel 7 to keep loads branchless
  • No shared memory required (warps are independent, one warp per row)
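To illustrate why one warp per row needs no shared memory, the following CPU sketch simulates 32 lanes that each accumulate a partial (max, sum) state over a strided slice of the row and then merge those states pairwise in a butterfly pattern. The names `MaxSum`, `merge`, and `warp_row_state` are hypothetical, and the array-based exchange stands in for what a warp shuffle intrinsic such as `__shfl_xor_sync` would do in registers on a GPU; it is a sketch of the reduction idea, not the PR's kernel.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

constexpr int WARP = 32;  // lanes per warp

// Partial online-softmax state: running max and sum of exp(x - max).
struct MaxSum {
    float m;
    float s;
};

// Merge two partial states by rescaling both sums to the larger max.
inline MaxSum merge(MaxSum a, MaxSum b) {
    float m = std::fmax(a.m, b.m);
    if (m == -INFINITY) {
        return {m, 0.0f};  // both sides are empty; avoid exp(inf - inf)
    }
    return {m, a.s * std::exp(a.m - m) + b.s * std::exp(b.m - m)};
}

// Simulate one warp reducing one row of length C to a single (max, sum).
MaxSum warp_row_state(const float* x, int C) {
    MaxSum lane[WARP];
    // Each lane walks the row with a stride of WARP elements.
    for (int l = 0; l < WARP; l++) {
        lane[l] = {-INFINITY, 0.0f};
        for (int i = l; i < C; i += WARP) {
            lane[l] = merge(lane[l], {x[i], 1.0f});
        }
    }
    // Butterfly exchange: after log2(WARP) rounds every lane holds the
    // full-row state, so no shared memory staging is needed.
    for (int offset = WARP / 2; offset > 0; offset /= 2) {
        MaxSum next[WARP];
        for (int l = 0; l < WARP; l++) {
            next[l] = merge(lane[l], lane[l ^ offset]);
        }
        std::copy(next, next + WARP, lane);
    }
    return lane[0];
}
```

Because the merge is associative, the order in which lanes are combined does not change the result beyond floating-point rounding, which is what makes the shuffle-based reduction a drop-in replacement for a shared-memory one.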
